1 Introduction

Credit risk evaluation is a crucial issue for the banking industry because even a one percent improvement in the early detection of bad credit accounts may avoid huge losses (Lee and Chen 2005; He et al. 2010). Credit scoring is the most successful method for helping financial institutions decide whether to grant or refuse a loan (Chi and Hsu 2012), and it can be formally defined as a statistical method for categorizing applicants into either a “good credit” group, which is likely to repay the financial obligation, or a “bad credit” group, which has a high probability of defaulting on the loan (Chen and Huang 2003). In this method, the information about loan applicants (such as age, education, marriage, occupation and income) is usually modeled by a set of features, and a credit score is a model-based estimate of the probability that a borrower will show some undesirable behavior in the future (Lessmann et al. 2015). The advantages of credit scoring are that it reduces the cost and time of credit evaluation decisions, improves cash flow and ensures proper credit collections (Lee and Chen 2005; Huang et al. 2007).

The popular methods used in building credit scoring models include:

  • Linear discriminant analysis (LDA): Jo and Han (1997), He et al. (2010) and Kim et al. (2012);

  • Logistic regression (LR): Hand and Henley (1997), Bahnsen and Aouada (2014), Fernandes and Artes (2016);

  • Artificial neural networks (ANN): Nanni and Lumini (2009), Hajek (2011), West (2011), Blanco et al. (2013) and Zhao et al. (2015);

  • Support vector machines (SVM): Martens et al. (2007), Bellotti and Crook (2009), Rezac (2011), Farquad and Bose (2012), Niklis et al. (2014) and Harris (2015);

  • Decision tree (DT): Olson et al. (2012) and Xia et al. (2017);

  • Bagging: Breiman (1996), Wang et al. (2012) and Xiao et al. (2016);

  • Random forest (RF): Yeh et al. (2012) and Lessmann et al. (2015);

  • Evolutionary computation techniques: Huang et al. (2006) and Aliehyaei and Khan (2014).

LDA and LR are two statistical models commonly used in credit scoring. LDA, which categorizes applicants by comparing each applicant's score with a cutoff, is a comprehensible and robust method in the credit scoring domain (Lee and Chen 2005). LR is the industry standard, and it is useful to examine how a new classifier compares with this approach (Lessmann et al. 2015). Both methods rest on restrictive assumptions: (i) a linear relationship between the dependent and independent variables, and (ii) independent variables that are multivariate and normally distributed (Blanco et al. 2013). Consequently, LDA and LR are designed for linear data sets and lack sufficient accuracy, especially for nonlinear classification problems (Thomas 2000).

Compared with LDA and LR, ANN and SVM are reported to fit nonlinear data sets well, but they are also criticized for the long time they consume in training data and building models (Lee and Chen 2002). Even though the accuracy of a credit scoring model is an important criterion, the comprehensibility and transportability of the model are also significant (Harris 2015). According to banking supervision regulations, financial institutions in some countries are obliged to present a comprehensible justification when a credit application is denied (Tomczak and Ziȩba 2015). Unfortunately, ANN and SVM lack comprehensibility and are considered black boxes (Martens et al. 2007). DT provides an alternative to ANN and SVM in handling credit scoring tasks, particularly in situations where the scoring models ought to be interpretable. However, the rules of a DT become too complicated to be comprehensible if the number of attributes is immense.

In addition to the above individual models, ensemble methods such as bagging and RF, and evolutionary computation techniques, for example genetic programming (GP) and ant colony optimization (ACO), have recently been used to deal with credit scoring problems. Based on the experimental results in Huang et al. (2006), two-stage genetic programming (2SGP) can provide better accuracy in credit scoring than LR, ANN and DT. Aliehyaei and Khan (2014) presented a hybrid credit scoring approach (GP-ACO) combining GP and ACO, and compared the performance of GP-ACO with GP and ACO using two real-world data sets. Recently, Lessmann et al. (2015) compared 41 different classification algorithms, and the results suggested that several classifiers outperform the industry standard LR.

The major contributions and significance of this paper are summarized as follows:

  1. We present a credit scoring model with higher accuracy based on LDA. To improve the accuracy of our model, two objectives are taken into consideration. The first objective is to minimize the distance from each misclassified sample to the cutoff, and the second is to maximize the distance from each correctly classified sample to the cutoff.

  2. According to the score of each credit applicant, our credit scoring model separates “bad credit” customers from “good credit” customers by a cutoff. Compared with ANN, SVM and bagging, our credit scoring approach is more comprehensible and easier to implement. Compared with other comprehensible credit scoring models, for example DT and CART, the decision rule of our model is simpler, especially for data sets with many features.

  3. In order to solve this problem efficiently, a novel multi-objective particle swarm optimization for credit scoring (MOPSO-CS) is proposed in this work. In MOPSO-CS, a procedure is designed to find the most suitable cutoff for discriminating credit applicants. We provide two versions of MOPSO-CS: MOPSO-CS(ACC), which is good at handling balanced data sets, and MOPSO-CS(SEN), which specializes in dealing with imbalanced data sets.

  4. Former literature usually employed imbalanced data sets to benchmark credit scoring models. In this paper, the experiments employ both imbalanced and balanced data sets to test the proposed MOPSO-CS and its counterparts. The results of these experiments demonstrate that MOPSO-CS outperforms the counterparts in terms of sensitivity while maintaining an acceptable accuracy rate.

The remainder of this paper is organized as follows. In Sect. 2, the related background on linear discriminant analysis and particle swarm optimization is reviewed. Section 3 presents a comprehensive credit scoring model with a higher accuracy. Section 4 provides a novel multi-objective particle swarm optimization for credit scoring. The experimental results and comparisons are reported in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Background information

2.1 Linear discriminant analysis

Linear discriminant analysis (LDA) classifies credit applicants based on their discriminant scores, which are calculated by a discriminant function (Akkoc 2012).

$$\begin{aligned} \hbox {Score}_i=A_iX=a_{i1}x_1+a_{i2}x_2+\cdots +a_{iR}x_R \end{aligned}$$

where \(\hbox {Score}_i\) is the credit score of sample i, R is the number of the attributes, \(X=(x_1,\dots ,x_R)^\mathrm{T}\) is the weight set, and \(A_i=(a_{i1},\ldots ,a_{iR})\) is the value set of sample i.

For two-group classification, if \(\hbox {Score}_i\ge b\), sample i belongs to class “Good” (negative); and if \(\hbox {Score}_i< b\), sample i belongs to class “Bad” (positive), where b is a cutoff (boundary), \(i=(1,\ldots ,N)\), and N is the sample size. The aim of the credit scoring method based on LDA is to determine the best coefficients of the variables, denoted by \(X=(x_1,\ldots ,x_R)^\mathrm{T}\), and the value b (a scalar) that separate the two classes. In credit scoring practice, “Bad” denotes the group of bankrupt customers and “Good” the group of normal customers.
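As a concrete illustration, the scoring and classification rule above can be sketched in a few lines. This is a minimal sketch, not the paper's implementation; the function names, the example weight set X and the cutoff b are illustrative assumptions.

```python
# Hypothetical sketch of LDA-style credit scoring; names and values are assumptions.

def score(A_i, X):
    """Credit score of one sample: Score_i = A_i . X."""
    return sum(a * x for a, x in zip(A_i, X))

def classify(A_i, X, b):
    """Two-group rule: score >= b -> "Good" (negative), else "Bad" (positive)."""
    return "Good" if score(A_i, X) >= b else "Bad"

# Example: two applicants described by R = 3 attributes.
X = [0.5, -0.2, 1.0]   # weight set (the coefficients to be learned)
b = 1.0                # cutoff
label_1 = classify([2.0, 1.0, 0.3], X, b)  # score = 1.0 - 0.2 + 0.3 = 1.1 -> "Good"
label_2 = classify([1.0, 2.0, 0.2], X, b)  # score = 0.5 - 0.4 + 0.2 = 0.3 -> "Bad"
```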

There are several measures of classification performance commonly used in the credit industry (Bhattacharyya et al. 2011).

  • \(\hbox {Accuracy} = (\hbox {TP}+\hbox {TN})/(\hbox {TP}+\hbox {FP}+\hbox {TN}+\hbox {FN})\),

  • \(\hbox {Sensitivity} = \hbox {TP}/(\hbox {TP}+\hbox {FN})\),

  • \(\hbox {Specificity} = \hbox {TN}/(\hbox {FP}+\hbox {TN})\),

  • \(\hbox {Precision} = \hbox {TP}/(\hbox {TP}+\hbox {FP})\),

  • \(\hbox {Recall} = \hbox {Sensitivity} = \hbox {TP}/(\hbox {TP}+\hbox {FN})\),

  • \(F1\hbox {-measure} = (2 \cdot \hbox {Precision} \cdot \hbox {Recall})/(\hbox {Precision} + \hbox {Recall})\),

where TP \(=\) true positives, FN \(=\) false negatives, TN \(=\) true negatives and FP \(=\) false positives.
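The six measures above follow directly from the four confusion-matrix counts; a minimal sketch (the function name and example counts are assumptions for illustration):

```python
def metrics(tp, fn, tn, fp):
    """Compute the six listed measures from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)            # = recall
    specificity = tn / (fp + tn)
    precision   = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1

# Example: 40 true positives, 10 false negatives, 80 true negatives, 20 false positives.
acc, sen, spe, pre, f1 = metrics(tp=40, fn=10, tn=80, fp=20)
```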

2.2 Particle swarm optimization

Particle swarm optimization (PSO) is an evolutionary computation technique proposed by Kennedy and Eberhart (1995). Similar to other evolutionary computation algorithms such as genetic algorithms (GA), GP and ACO, PSO is population based and is considered an efficient search and optimization technique (Omran et al. 2006; Das et al. 2008). A particle swarm consists of several particles that fly through the search space. The position of particle j, denoted as vector \(X_j=(x_{j1},\ldots ,x_{jR})\), represents a solution of the optimization problem.

At each iteration t, the velocity of particle j, denoted as vector \(V_j=(v_{j1},\ldots ,v_{jR})\), is influenced by the best position visited by the particle itself (\(\hbox {Pbest}_j\)) and the position of the best particle in the swarm (Gbest) using the evolution equation:

$$\begin{aligned} v_{jr}(t+1)= & {} wv_{jr}(t)+c_{1}r_{1}[\hbox {Pbest}_{jr}-x_{jr}(t)]\\&+c_{2}r_{2}[\hbox {Gbest}_{r}-x_{jr}(t)]. \end{aligned}$$

Then the position of particle j is updated as follows:

$$\begin{aligned} x_{jr}(t+1)=x_{jr}(t)+v_{jr}(t+1). \end{aligned}$$

where \(r=(1,\ldots ,R)\) is the dimension index of particle velocity or position, w is an inertia weight, \(c_{1},c_{2}\) are the acceleration constants and \(r_{1},r_{2}\) are two random real numbers drawn from U(0, 1).
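One velocity-and-position update per the two equations above can be sketched as follows; the function name and the parameter values (w, c1, c2) are illustrative assumptions, not values used in the paper.

```python
import random

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """Update one particle's velocity and position across all R dimensions."""
    new_v, new_x = [], []
    for r in range(len(x)):
        r1, r2 = random.random(), random.random()   # r1, r2 ~ U(0, 1)
        vr = (w * v[r]
              + c1 * r1 * (pbest[r] - x[r])
              + c2 * r2 * (gbest[r] - x[r]))
        new_v.append(vr)
        new_x.append(x[r] + vr)                     # x(t+1) = x(t) + v(t+1)
    return new_x, new_v

# One update of a two-dimensional particle.
x, v = pso_step([0.0, 0.0], [0.1, -0.1], pbest=[1.0, 1.0], gbest=[2.0, 2.0])
```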

In order to solve multi-objective optimization problems, many researchers have extended PSO to multi-objective PSO (MOPSO). Popular MOPSO variants include (i) Pareto-based approaches (Dehuri and Cho 2009) and (ii) coevolutionary approaches (Omkar et al. 2008; Goh et al. 2010). Compared with other multi-objective evolutionary algorithms, MOPSO has demonstrated higher convergence speed and efficiency (Goh et al. 2010).

3 A comprehensive credit scoring model with a higher accuracy

3.1 Improving the accuracy of the credit scoring models using LDA

For credit scoring models using LDA, constructing a linear data set is a useful technique for improving the accuracy. As shown in Fig. 1, there are four samples (\(\hbox {sample}_1,\hbox {sample}_2,\hbox {sample}_3\) and \(\hbox {sample}_4\)) in the data set. Let \(J=(j_1,\ldots ,j_R)^\mathrm{T},K=(k_1,\ldots ,k_R)^\mathrm{T}\) be two weight sets, and let \(\hbox {Score}_{ij}, \hbox {Score}_{ik}\) be the scores of sample i based on the weight sets J and K, respectively.

  • \(\hbox {Score}_{ij}=A_iJ=a_{i1}j_1+a_{i2}j_2+\cdots +a_{iR}j_R\),

  • \(\hbox {Score}_{ik}=A_iK=a_{i1}k_1+a_{i2}k_2+\cdots +a_{iR}k_R\).

As shown in Fig. 1a, b, according to weight set J, the credit scoring model builds a linear data set, and in terms of weight set K, the credit scoring model constructs a nonlinear data set. If \(\hbox {cutoff}=b\), the accuracy of weight set J is the same as the accuracy of weight set K. But as shown in Fig. 2, if we change the cutoff from b to d, the accuracy on the linear data set would be improved from 75 to 100%. However, for the nonlinear data set shown in Fig. 1b, the accuracy cannot be improved.

Fig. 1
figure 1

Linear data set and nonlinear data set. a Linear data set, b nonlinear data set

Fig. 2
figure 2

Improving the accuracy of credit scoring model. a Accuracy \(=\) 75%, b accuracy \(=\) 100%

To accurately distinguish linear data sets from nonlinear data sets, we define two variables \(\alpha _i,\beta _i\) for the criteria and constraints as follows; a graphical representation of these parameters is shown in Fig. 3.

  • \(\alpha _i\): the distance from the score of sample i to the cutoff when sample i is misclassified;

  • \(\beta _i\): the distance from the score of sample i to the cutoff when sample i is classified correctly.

Fig. 3
figure 3

Graphical representation of \(\alpha _i,\beta _i\). a Linear data set, b nonlinear data set

Furthermore, as shown in Fig. 3, the sum of \(\alpha _i\) in a linear data set is less than that in a nonlinear data set, and the sum of \(\beta _i\) in a linear data set is more than that in a nonlinear data set.

3.2 A comprehensive multi-objective credit scoring model

In this paper, the multi-objective credit scoring problem based on LDA can be described as how to determine the best coefficients \(X=(x_1,\ldots ,x_R)^\mathrm{T}\) to achieve the following objectives:

  1. Minimize the sum of distances between the score of each misclassified sample and the cutoff:

    $$\begin{aligned} \min \hbox {SMD}= \sum _{i=1}^{N}\alpha _i. \end{aligned}$$
    (1)

    where SMD is the sum of \(\alpha _i\).

According to the definition of \(\alpha _i\), it is calculated as

    $$\begin{aligned} \alpha _i=d(i)\sqrt{(A_iX-b)^2}. \end{aligned}$$

where \(A_i, b\) are given and X is unrestricted; d(i) is a 0-1 variable: \(d(i)=1\) if \(\hbox {Sample}_i\) is misclassified and \(d(i)=0\) if \(\hbox {Sample}_i\) is classified correctly.

    • \(d(i)=0, if: A_iX< b, Sample_i \in Bad ~or~ A_iX\ge b, Sample_i \in Good.\)

    • \(d(i)=1, if: A_iX\ge b, Sample_i \in Bad ~or~ A_iX< b, Sample_i \in Good.\)

  2. Maximize the sum of distances between the score of each correctly classified sample and the cutoff:

    $$\begin{aligned}&\max \hbox {SCD}= \sum _{i=1}^{N}\beta _i. \end{aligned}$$
    (2)

    where SCD is the sum of \(\beta _i\).

According to the definition of \(\beta _i\), it is calculated as

    $$\begin{aligned} \beta _i=c(i)\sqrt{(A_iX-b)^2}. \end{aligned}$$

where c(i) is a 0-1 variable: \(c(i)=0\) if \(\hbox {Sample}_i\) is misclassified and \(c(i)=1\) if \(\hbox {Sample}_i\) is classified correctly.

    • \(c(i)=0, if: A_iX\ge b, Sample_i \in Bad ~or~ A_iX< b, Sample_i \in Good.\)

    • \(c(i)=1, if: A_iX< b, Sample_i \in Bad ~or~ A_iX\ge b, Sample_i \in Good.\)

These two objectives ensure that our model can obtain higher-accuracy solutions than other credit scoring models based on LDA. Meanwhile, similar to other LDA-based credit scoring models, our model is comprehensible and easy to implement. From the weight set X obtained by our model, it is easy for a credit manager to explain to credit applicants how their credit scores are derived.
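The two objectives can be evaluated in a single pass over the samples, since \(\alpha _i\) and \(\beta _i\) are the same distance \(|A_iX-b|\) routed to SMD or SCD depending on whether the sample is misclassified. A minimal sketch, with function and variable names as illustrative assumptions and labels following the paper's convention ("Good" when the score is at least b):

```python
def smd_scd(A, labels, X, b):
    """Return (SMD, SCD) for weight set X and cutoff b over samples A."""
    smd = scd = 0.0
    for A_i, label in zip(A, labels):
        score = sum(a * x for a, x in zip(A_i, X))
        dist = abs(score - b)                        # sqrt((A_i X - b)^2)
        correct = (score >= b) == (label == "Good")
        if correct:
            scd += dist                              # beta_i: classified correctly
        else:
            smd += dist                              # alpha_i: misclassified
    return smd, scd

# Two one-attribute samples, both correctly classified by X = [1.0], b = 2.0.
smd, scd = smd_scd([[1.0], [3.0]], ["Bad", "Good"], X=[1.0], b=2.0)
```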

On the other hand, in traditional data classification problems, accuracy is the main objective to be optimized. With regard to credit card customer classification, sensitivity is more important than other metrics. In this paper, we define a variable MCN to measure the quality of the classification and to locate the trade-off between specificity and sensitivity through two coefficients \(w_G,w_B\).

$$\begin{aligned} \hbox {MCN}= \sum _{i=1}^{N}[w_Gd_G(i)+w_Bd_B(i)]. \end{aligned}$$
(3)

where \(d_G(i),d_B(i)\) are two 0-1 variables: \(d_G(i)=0\) if \(\hbox {Sample}_i\) is a bad credit customer or a good credit customer that is classified correctly, and \(d_G(i)=1\) if \(\hbox {Sample}_i\) is a good credit customer that is classified as bad; \(d_B(i)=0\) if \(\hbox {Sample}_i\) is a good credit customer or a bad credit customer that is classified correctly, and \(d_B(i)=1\) if \(\hbox {Sample}_i\) is a bad credit customer that is classified as good.

  • \(d_G(i)=0, if: Sample_i \in Bad ~or~ A_iX\ge b, Sample_i \in Good.\)

  • \(d_G(i)=1, if: A_iX< b, Sample_i \in Good.\)

  • \(d_B(i)=0, if: Sample_i \in Good ~or~ A_iX< b, Sample_i \in Bad.\)

  • \(d_B(i)=1, if: A_iX\ge b, Sample_i \in Bad.\)

\(w_G\) is the weight of the “Good” samples, and \(w_B\) is the weight of the “Bad” samples.

$$\begin{aligned} w_G+w_B=1,~~w_G,w_B\in (0,1). \end{aligned}$$

Furthermore, a bigger weight \(w_G\) results in higher specificity; similarly, a bigger weight \(w_B\) results in higher sensitivity. Obviously, if \(w_G=w_B=0.5\), MCN is equal to half the number of misclassified samples. \(\hbox {MCN}=0\) if all samples are classified correctly, and we define a solution that classifies all samples correctly as an optimal solution.
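The MCN measure of Eq. (3) amounts to charging \(w_G\) for each good customer classified as bad and \(w_B\) for each bad customer classified as good. A minimal sketch (names and example values are illustrative assumptions):

```python
def mcn(scores, labels, b, w_G=0.5, w_B=0.5):
    """Misclassification cost of Eq. (3) for given scores, labels and cutoff b."""
    total = 0.0
    for s, label in zip(scores, labels):
        if label == "Good" and s < b:    # d_G(i) = 1: good classified as bad
            total += w_G
        elif label == "Bad" and s >= b:  # d_B(i) = 1: bad classified as good
            total += w_B
    return total

# With w_G = w_B = 0.5, MCN is half the number of misclassified samples:
# here two of three samples are misclassified, so MCN = 1.0.
value = mcn([1.0, 3.0, 2.5], ["Good", "Bad", "Good"], b=2.0)
```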

Definition 1 (Optimal solution) A solution \(X_j=(x_{j1},\ldots ,x_{jR})\) that classifies all samples correctly:

$$\begin{aligned}&F=\{X_j|\hbox {MCN}(X_j)=0, X_j\in \Omega \}. \end{aligned}$$
(4)

where F is the set of optimal solutions, \(\Omega \) is the set of all solutions, and \(\hbox {MCN}(X_j)\) is the MCN of solution \(X_j\). According to Objective (1), the SMD of an optimal solution is equal to 0.

Usually, it is not possible to obtain an optimal solution if the data set is linearly inseparable. In that case, the objective of the classification problem is to find the feasible solution whose MCN is minimal.

Definition 2 (Feasible solution) A solution with the minimum misclassification coefficient:

$$\begin{aligned}&S=\{X_j|\hbox {MCN}(X_j)=\hbox {MCN}_{\min }, X_j\in \Gamma \}. \end{aligned}$$
(5)

where S is the set of feasible solutions, \(\hbox {MCN}_{\min }=\min \{\hbox {MCN}(X_l)|X_l\in \Gamma \}\), and \(\Gamma \) is the set of all solutions obtained by the algorithm. Obviously, all optimal solutions are feasible solutions.

4 Multi-objective particle swarm optimization for credit scoring

Research on evolutionary computation has found that GA is usually applied to discrete combinatorial optimization, while PSO is suitable for optimization problems with continuous variables. Because each solution \(X_j\) is continuous in the credit scoring problem, multi-objective particle swarm optimization is used to deal with this problem. Multiple subswarms with a coevolutionary approach can improve the efficiency and effectiveness of MOPSO (Goh et al. 2010); in this paper, a coevolutionary MOPSO is adapted to the credit scoring problem, and we propose a new multi-objective particle swarm optimization: MOPSO-CS. MOPSO-CS employs two subswarms (\(\hbox {Subswarm}_1\) and \(\hbox {Subswarm}_2\)) with the same population size J to probe the search space, and information is exchanged between them.

The subswarms are evaluated in an iterative manner: \(\hbox {Subswarm}_1\) evaluates SMD, and \(\hbox {Subswarm}_2\) evaluates SCD. However, for data classification, the particle rank in MOPSO-CS differs from the Pareto rank usually used in traditional MOPSO. In this work, the rank of particle j, rank(j), is given by:

$$\begin{aligned} \hbox {rank}(j)=1+n_j. \end{aligned}$$
(6)

where \(n_j\) is the number of misclassified samples according to the solution of particle j.

4.1 Best position selection and updating process

Let the position of the particle in \(\hbox {Subswarm}_1\) whose value of SMD is minimum be the personal best position (\(\hbox {Pbest}_1\)) of \(\hbox {Subswarm}_1\), and let the position of the particle in \(\hbox {Subswarm}_2\) whose value of SCD is maximum be the personal best position (\(\hbox {Pbest}_2\)) of \(\hbox {Subswarm}_2\).

In order to find an optimal solution as quickly as possible, we select two feasible solutions from set S: the position of the feasible solution with maximum SCD serves as the global best position (\(\hbox {Gbest}_1\)) of \(\hbox {Subswarm}_1\), and the position of the feasible solution with minimum SMD serves as the global best position (\(\hbox {Gbest}_2\)) of \(\hbox {Subswarm}_2\). Through this method (shown in Fig. 4), each subswarm can share information from feasible solutions, so that MOPSO-CS is able to guide the particles toward optimal solutions as quickly as possible.

Fig. 4
figure 4

Best position selection and updating process

MOPSO-CS is similar to coevolutionary MOPSO: at each iteration t, particle j in the kth subswarm updates its current position \(x_{kjr}(t)\) and velocity \(v_{kjr}(t)\) in each dimension r using the personal best position \(\hbox {Pbest}_{kjr}\) and the global best position \(\hbox {Gbest}_{kr}\) via the evolution equations shown in Eqs. (7) and (8):

$$\begin{aligned} v_{kjr}(t+1)= & {} wv_{kjr}(t)+c_{1}r_{1}[\hbox {Pbest}_{kjr}-x_{kjr}(t)]\nonumber \\&+c_{2}r_{2}[\hbox {Gbest}_{kr}-x_{kjr}(t)]. \end{aligned}$$
(7)
$$\begin{aligned} x_{kjr}(t+1)= & {} x_{kjr}(t)+v_{kjr}(t+1). \end{aligned}$$
(8)

Because many particles may have the same particle rank, SMD and SCD are also taken into consideration when two particles are compared. The detailed ranking criterion for each particle is as follows:

  • Particle a is superior to particle b\(\Longleftrightarrow \)

  • \(\hbox {rank}(a) < \hbox {rank}(b)\)

  • Or \(\hbox {rank}(a) = \hbox {rank}(b)\) and \(\hbox {SMD}(a) < \hbox {SMD}(b)\)

  • Or \(\hbox {rank}(a) = \hbox {rank}(b)\) and \(\hbox {SMD}(a) = \hbox {SMD}(b)\) and \(\hbox {SCD}(a) > \hbox {SCD}(b)\).

where \(\hbox {SMD}(a),\hbox {SCD}(a)\) are the SMD and SCD of particle a.
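The three-level comparison above is a lexicographic test on (rank, SMD, SCD). A minimal sketch, assuming each particle is represented by such a tuple (the tuple layout and function name are assumptions for illustration):

```python
def superior(a, b):
    """True when particle a = (rank, smd, scd) is superior to particle b."""
    rank_a, smd_a, scd_a = a
    rank_b, smd_b, scd_b = b
    if rank_a != rank_b:
        return rank_a < rank_b          # fewer misclassified samples wins
    if smd_a != smd_b:
        return smd_a < smd_b            # then smaller SMD wins
    return scd_a > scd_b                # finally larger SCD wins

# Equal rank and SMD: the particle with the larger SCD is superior.
ok = superior((3, 1.2, 9.0), (3, 1.2, 7.5))
```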

4.2 The optimization process

For MOPSO-CS, the scope of the solution is an R-dimensional search space, where R is the number of attributes (variables). The position of particle j in \(\hbox {Subswarm}_k\) is represented by \(x_{kj}=(x_{kj1},\ldots ,x_{kjR})\), which corresponds to a solution of the problem. The rth dimension of the position, \(x_{kjr}\) (\(k=1,2;j=1,\ldots ,J;r=1,\ldots ,R\)), denotes the rth coefficient applied to each sample. The coding of particle velocity is similar to that of particle position and is denoted by \(v_{kj}=(v_{kj1},\ldots ,v_{kjR})\).

Firstly, the algorithm randomly generates two subswarms. Then the positions and velocities of all particles in these subswarms are initialized. Each element \(x_{kjr}\) of particle position \(x_{kj}\) is randomly initialized within \([-100, 100]\). Each element \(v_{kjr}\) of particle velocity \(v_{kj}\) is randomly initialized within \([-10,10]\).

Secondly, the algorithm evaluates these subswarms. Using the initialized positions, it calculates the SMD and SCD of each particle via Objectives (1) and (2), and then evaluates the particle rank of each particle using Eq. (6). Moreover, the algorithm initializes the global best position of each subswarm and the personal best position of each particle.

In this paper, MOPSO-CS tries to find optimal solutions to this problem and terminates when the first optimal solution is found or the maximum iteration number (T) is reached. Before the algorithm finds an optimal solution, all feasible solutions are used as the best found solutions. At each iteration, MOPSO-CS updates the best found solutions in terms of MCN and reserves the solutions with the minimum MCN found by the optimization process.

Fig. 5
figure 5

Algorithmic flow of MOPSO-CS

Thirdly, the position and velocity of each particle are updated according to Eqs. (7) and (8). Finally, we select the top 2J particles with the minimum particle rank from the old and new populations and use them as the population for the next generation. The optimization process is formally described in Fig. 5.
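The initialize-evaluate-update-select cycle described above can be sketched structurally as follows. This is a toy sketch under loud assumptions: the quadratic `rank` stand-in replaces the real rank of Eq. (6), both attraction terms pull toward `gbest` (the personal bests and the two separate SMD/SCD subswarm evaluations are omitted for brevity), and all names and parameter values are illustrative.

```python
import random

J, R, T = 10, 3, 50                      # particles per subswarm, dimensions, iterations
W, C1, C2 = 0.7, 1.5, 1.5                # inertia and acceleration constants (assumed)

def rank(x):
    """Toy stand-in for the particle rank (smaller is better)."""
    return sum(xi * xi for xi in x)

def new_particle():
    """Position in [-100, 100]^R and velocity in [-10, 10]^R, as in the text."""
    return ([random.uniform(-100, 100) for _ in range(R)],
            [random.uniform(-10, 10) for _ in range(R)])

swarm = [new_particle() for _ in range(2 * J)]   # two subswarms of size J
gbest = min((x for x, _ in swarm), key=rank)

for t in range(T):
    offspring = []
    for x, v in swarm:
        nv = [W * v[r] + C1 * random.random() * (gbest[r] - x[r])
                       + C2 * random.random() * (gbest[r] - x[r]) for r in range(R)]
        nx = [x[r] + nv[r] for r in range(R)]
        offspring.append((nx, nv))
    # keep the top 2J particles from the old and new populations
    swarm = sorted(swarm + offspring, key=lambda p: rank(p[0]))[:2 * J]
    gbest = min(swarm[0][0], gbest, key=rank)
```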

4.3 Finding the optimal cutoff for linear discriminant

In traditional credit scoring methods using LDA, the cutoff is fixed and assigned by experience. Because finding the optimal cutoff is significant for classification accuracy, many methods have been proposed to find a better cutoff to improve the precision, even though doing so is time-consuming. In this paper, a misclassification-number-based approach for finding the optimal cutoff is proposed.

In MOPSO-CS, each particle j represents a solution \(X_{j}=(x_{j1},\ldots ,x_{jR})^\mathrm{T}\). We can then calculate the credit score of each sample i in terms of the solution \(X_{j}\); let \(\hbox {Score}_{ij}=\sum _{r=1}^{R}a_{ir}x_{jr}\) be the credit score of sample i based on particle (solution) j. When the scores of all samples in the training set have been calculated, MOPSO-CS can find the optimal cutoff with the minimum number of misclassifications as follows.

Let the number of records in the training set be NTRAIN; then there are up to NTRAIN different scores in the training set. Each score can be used as a candidate for the cutoff. By assigning each score to the cutoff one by one, MOPSO-CS tests the result of each assignment and selects the optimal b with the minimum misclassification number as the cutoff. The detailed approach for finding the optimal cutoff is shown in Fig. 6.

Fig. 6
figure 6

Finding the optimal cutoff with minimum misclassification number. a\(\hbox {Sample}_3\) is a bad credit customer. b\(\hbox {Sample}_4\) is a good credit customer

If sample i is a bad credit customer, for example sample 3 in Fig. 6a, let \(b=\hbox {Score}_{ij}+\xi \); and if sample i is a good credit customer, such as sample 4 in Fig. 6b, let \(b=\hbox {Score}_{ij}-\xi \), where \(\xi \) is a small positive real number, such as 0.0001. Then the misclassification number (\(n_j\)) of solution \(X_j\) can be calculated for \(b=\hbox {Score}_{ij}\pm \xi \). By the same method, each candidate cutoff \(b=\hbox {Score}_{ij}\pm \xi , ~i\in (1,\dots ,NTRAIN)\) can be tested.

For the example in Fig. 6a,

  • if \(b=\hbox {Score}_{1j}+\xi \), \(n_j=1\)

  • if \(b=\hbox {Score}_{2j}-\xi \), \(n_j=1\)

  • if \(b=\hbox {Score}_{3j}+\xi \), \(n_j=1\)

  • if \(b=\hbox {Score}_{4j}-\xi \), \(n_j=1\)

  • if \(b=\hbox {Score}_{5j}-\xi \), \(n_j=2\)

Finally, MOPSO-CS selects the optimal cutoff with the minimum misclassification number from all candidates b. In the example shown in Fig. 6a, we can let \(b=\hbox {Score}_{1j}+\xi \), and the precision of this classification is 80%. The pseudocode of the cutoff-finding process is shown in Algorithm 1.

figure a
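The candidate-scan idea of Algorithm 1 can be sketched as follows: each sample's score, shifted by \(\xi \), is tried as a cutoff, and the candidate with the fewest misclassifications is kept. This is a minimal sketch, not the paper's pseudocode; the function name and the toy scores/labels are illustrative assumptions.

```python
def find_optimal_cutoff(scores, labels, xi=0.0001):
    """Try b = score +/- xi for every sample; return (best b, misclassifications)."""
    best_b, best_n = None, float("inf")
    for s, label in zip(scores, labels):
        # shift just above a bad sample's score, just below a good sample's
        b = s + xi if label == "Bad" else s - xi
        # count samples whose predicted class (score >= b -> "Good") is wrong
        n = sum(1 for sc, lb in zip(scores, labels)
                if (sc >= b) != (lb == "Good"))
        if n < best_n:
            best_b, best_n = b, n
    return best_b, best_n

# Toy example: five samples whose classes overlap, so one error is unavoidable.
scores = [1.0, 2.0, 1.5, 3.0, 2.5]
labels = ["Bad", "Good", "Bad", "Good", "Bad"]
b, n = find_optimal_cutoff(scores, labels)
```

Each of the NTRAIN candidates is checked against all NTRAIN samples, which matches the \(O({NTRAIN}^2)\) complexity noted below.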

Obviously, the complexity of this process is \(O({ NTRAIN}^2)\). Since the process of finding the optimal cutoff is complicated and time-consuming, we present two versions of MOPSO-CS: MOPSO-CS(ACC), which omits the optimization of the cutoff, and MOPSO-CS(SEN), which includes it. Through computational experiments, we found that the computation time of MOPSO-CS(SEN) is longer than that of MOPSO-CS(ACC), but its convergence speed is faster.

5 Computational experiments

5.1 The selection and preprocessing of credit data set

In the computational experiments, we use various data sets representing the credit behaviors of people from different countries, including two real-world data sets from the UK and two benchmark data sets, i.e., German and Taiwan. All these data sets classify people as Normal or Bankrupt credit risks based on a set of attributes (variables). In order to investigate the effectiveness of the proposed algorithm on imbalanced and balanced data sets, the real-world credit data from the UK are separated into two data sets. The first is a UK bankruptcy data set (He et al. 2010), which collects 323 bankrupt and 902 normal customers with 14 variables. The second is a balanced data set derived from the first, which consists of 323 normal customers and 323 bankrupt customers. The other two data sets are German (700 normal and 300 bankrupt customers with 24 features) and Taiwan (3000 normal and 3000 bankrupt customers with 23 features), both from the UCI Machine Learning databases (Xia et al. 2017).

Previous research (Koutanaei et al. 2015) has demonstrated that selecting suitable features can improve accuracy in credit scoring problems, so we preprocessed the data before classification. Firstly, we create two versions of each data set: one is the raw, non-standardized data and the other is standardized. Secondly, we apply five different feature selection methods. As shown in Table 1, ten experiments are performed for each data set. The data in the experiments with odd indexes are not standardized, and those in the experiments with even indexes are standardized.

Table 1 Selection and preprocessing of credit data set

5.2 Experimental evaluation

5.2.1 Implementation of the classifiers

In this section, our proposed method, MOPSO-CS, is compared with nine classic classification methods: NaiveBayes, logistic regression (LR), support vector machine (SVM), artificial neural network (ANN), decision tree (DT), CART, bagging-DT, bagging-ANN and random forest (RF). Among these classifiers, MOPSO-CS, DT and CART are comprehensible. MOPSO-CS is implemented in Visual C++ 6.0, and the counterparts are implemented in WEKA 3.8. All experiments were performed on a PC with a 2.5 GHz Core i5 and 4 GB RAM running the Windows 10 operating system. Table 2 shows the software employed to run the four data sets.

Table 2 Employed classifiers
Table 3 Parameters of MOPSO-CS
Fig. 7
figure 7

Accuracy of MOPSO-CS of each data set. a MOPSO-CS(ACC). b MOPSO-CS(SEN)

Fig. 8
figure 8

Specificity of MOPSO-CS of each data set. a MOPSO-CS(ACC), b MOPSO-CS(SEN)

Fig. 9
figure 9

Sensitivity of MOPSO-CS of each data set. a MOPSO-CS(ACC), b MOPSO-CS(SEN)

Because real-world data sets are usually linearly inseparable, it is hard for MOPSO-CS to obtain an optimal solution. We therefore use the maximum iteration number as the termination criterion of MOPSO-CS. The maximum iteration number (T) is decided on a stabilization principle verified by numerous tests. In our experiments, the cutoff is set to \(-1.1\) in MOPSO-CS(ACC), and it is variable in MOPSO-CS(SEN) as shown in Sect. 4.3. The parameters used in MOPSO-CS are presented in Table 3.

5.2.2 The convergence of MOPSO-CS

We conducted tenfold cross-validation to evaluate the performance of these classifiers, and the tenfold cross-validation process was repeated 10 times. In order to investigate the convergence of the proposed algorithm, we record the accuracy, specificity and sensitivity of the feasible solutions obtained by MOPSO-CS at each iteration. Figures 7, 8 and 9 show the accuracy, specificity and sensitivity at each iteration in the first tenfold cross-validation process of the experiment that contains the highest accuracy.

As Fig. 7 shows, the accuracy obtained by MOPSO-CS rises with an increasing number of generations, which indicates that the proposed multi-objective credit scoring method is effective for improving the accuracy of data classification. Compared with MOPSO-CS(SEN), MOPSO-CS(ACC) performs better in terms of accuracy.

The results in Fig. 8 indicate that specificity fluctuates over the iterations of MOPSO-CS. Moreover, because \(w_G\) in MOPSO-CS(ACC) is bigger than \(w_G\) in MOPSO-CS(SEN), the specificity of MOPSO-CS(ACC) exceeds that of MOPSO-CS(SEN).

Figure 9 indicates that sensitivity increases over the iterations of MOPSO-CS, and the sensitivity of MOPSO-CS(SEN) is better than that of MOPSO-CS(ACC). Moreover, the results shown in Figs. 7, 8 and 9 reveal that MOPSO-CS(SEN) converges to steady solutions within 100 iterations, while MOPSO-CS(ACC) converges more slowly.

5.2.3 Results and discussion

In this section, we compare the performance of MOPSO-CS with the nine counterparts. For each data set, we perform ten experiments and define the experiment that contains the highest accuracy as the standard experiment. In Tables 4, 5, 6 and 7, the last column gives the index of the standard experiment of each classifier.

Table 4 shows the results of tenfold cross-validation on the UK1 data set (902 normal customers, 323 bankrupt customers). As shown in Table 4, the sensitivity and F1-measure of MOPSO-CS(SEN) exceed those of all counterparts, and its accuracy is more than 73%. Although accuracy is not as important as sensitivity for the credit scoring problem, we also try to optimize the precision of classification with MOPSO-CS. The accuracy of MOPSO-CS(ACC) is superior to that of NaiveBayes, LR, ANN, CART, bagging-DT, bagging-ANN and RF, and only below that of SVM and DT.

Table 4 Results on the UK1 data set over performance measures

Table 5 shows the results of tenfold cross-validation on the UK2 data set (323 normal customers, 323 bankrupt customers). Because the numbers of bankrupt and normal customers are equal in this balanced data set, accuracy is harder to improve here than on imbalanced data sets. MOPSO-CS(ACC) achieves the highest accuracy of all classifiers and exhibits the second best F1-measure after NaiveBayes. Moreover, MOPSO-CS(SEN) obtains the second best sensitivity, only slightly below NaiveBayes. DT's precision is the best, but its F1-measure is the poorest.

Table 5 Results on the UK2 data set over performance measures

Table 6 shows the results of tenfold cross-validation on the German data set. The sensitivity and F1-measure of MOPSO-CS(SEN) exceed those of all other counterparts. RF achieves the highest accuracy and best precision. The accuracy of MOPSO-CS(ACC) exceeds those of NaiveBayes and DT. Moreover, the accuracy of MOPSO-CS(SEN) is above 75%.

Table 6 Results on the German data set over performance measures

In the case of the Taiwan data set, MOPSO-CS(ACC) performs better than all classifiers in terms of accuracy, specificity and precision (see Table 7). Meanwhile, MOPSO-CS(SEN) achieves the highest sensitivity, although its accuracy is lower than that of most counterparts. In this experiment, the accuracy of MOPSO-CS(SEN) is still acceptable because it is better than the industry standard LR.

Table 7 Results on the Taiwan data set over performance measures

Another conclusion emerges from the results of our experiments: data standardization affects classification accuracy differently on different data sets. For the UK1 data set, data standardization does not improve accuracy, because most of the standard-experiment indexes are odd, NaiveBayes being the only exception, as shown in the last column of Table 4. For the UK2 data set, data standardization is effective for improving accuracy, because many of the indexes are even, as shown in the last column of Table 5. For the German and Taiwan data sets, the number of odd indexes is very close to the number of even indexes, as shown in Tables 6 and 7, so it is hard to say whether standardization helps improve accuracy.

5.2.4 Comparing with other algorithms focusing on credit scoring

In this subsection, MOPSO-CS is compared with two competitive credit scoring approaches proposed in the literature. For the UK1 data set, we compare MOPSO-CS with multiple criteria and multiple constraint-level programming (MC2) presented by He et al. (2010). Like MOPSO-CS, MC2 is based on a linear discriminant model. According to the results shown in Table 8, MOPSO-CS is able to improve the accuracy of classifiers based on a linear discriminant model. The best accuracy obtained by MC2 (M11) is 65.14%, and MOPSO-CS(ACC) improves it by 9.65%. Although the best sensitivity of MC2 (M2) is 85.37%, the accuracy of M2 is unacceptable. The sensitivity of MOPSO-CS(SEN) is 31.64%, while its accuracy is above 70%.

Table 8 Results for MOPSO-CS versus MC2 on the UK1 data set

For the German and Taiwan data sets, we compare MOPSO-CS with XGBoost as presented by Xia et al. (2017). From the results shown in Table 9, MOPSO-CS outperforms XGBoost on sensitivity. The best sensitivity of XGBoost (XGBoost-GS) is 50.21%, and MOPSO-CS(SEN) improves it by 7.55%. Meanwhile, the accuracy of MOPSO-CS(SEN) is only 1.38% lower than that of XGBoost-GS. Compared with XGBoost, MOPSO-CS is the best at predicting "Bad" customers with a satisfactory accuracy rate on the German data set.

Table 9 Results for MOPSO-CS versus XGBoost on the German data set

In the case of the Taiwan data set, as shown in Table 10, MOPSO-CS(ACC) performs better than XGBoost on accuracy and specificity. Meanwhile, the accuracy and sensitivity of MOPSO-CS(SEN) are lower than those of XGBoost. This provides some evidence that MOPSO-CS(SEN) is not good at handling balanced data sets.

Table 10 Results for MOPSO-CS versus XGBoost on the Taiwan data set

5.2.5 Statistical test and computation time

The nonparametric Friedman test is employed to examine whether the selected eleven classifiers are statistically different. The Friedman test statistic is calculated as follows:

$$\begin{aligned}&\chi _{F}^{2}=\frac{12D}{K(K+1)}\left[ \sum _{j=1}^{K}\left( \frac{1}{D}\sum _{i=1}^{D}r_j^{i}\right) ^2-\frac{K(K+1)^2}{4}\right] \end{aligned}$$
(9)

where D and K denote the numbers of experiments and classifiers, respectively (\(D=10\), \(K=11\)), and \(r_j^{i}\) is the rank of classifier j in experiment i. In this paper, the ranks are based on the accuracy each algorithm achieves: the algorithm with the highest accuracy receives rank 1, the algorithm with the second highest receives rank 2, and so on.
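The statistic can be computed directly from a table of accuracies. The sketch below uses the mean-rank form of the Friedman statistic and assigns averaged ranks to tied accuracies, a common convention that the paper does not spell out:

```python
def friedman_statistic(acc):
    """Friedman chi-square from an accuracy table acc[i][j]
    (D experiments x K classifiers); ties receive averaged ranks."""
    D, K = len(acc), len(acc[0])
    rank_sums = [0.0] * K
    for row in acc:
        order = sorted(range(K), key=lambda c: -row[c])  # highest accuracy -> rank 1
        ranks = [0.0] * K
        i = 0
        while i < K:
            k = i
            while k + 1 < K and row[order[k + 1]] == row[order[i]]:
                k += 1                    # extend the group of tied values
            avg = (i + k) / 2 + 1         # averaged rank for the tie group
            for m in range(i, k + 1):
                ranks[order[m]] = avg
            i = k + 1
        for c in range(K):
            rank_sums[c] += ranks[c]
    mean_ranks = [s / D for s in rank_sums]
    return 12 * D / (K * (K + 1)) * (
        sum(r * r for r in mean_ranks) - K * (K + 1) ** 2 / 4)
```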

If \(\chi _{F}^{2}\) is larger than a critical value, the null hypothesis is rejected. From the results shown in the last line of Table 11, we verify that the classifiers are statistically different.

With regard to average rank, MOPSO-CS(ACC) achieves the best average rank on the UK2 data set, and the second best after bagging-ANN on the Taiwan data set. This shows that MOPSO-CS(ACC) performs well on balanced data sets. On the UK1 and German data sets, compared with the other two comprehensible classifiers, MOPSO-CS(ACC) is better than DT but below CART. Because MOPSO-CS(SEN) focuses on improving the sensitivity of data classification, its average rank is not ideal, ranking only above NaiveBayes.

Table 11 Average rank of each classifier on each data set

Table 12 shows the computation time of a tenfold cross-validation process for the classifiers. MOPSO-CS is an iterative algorithm; it consumes less computation time than ANN and bagging-ANN, but more than the other seven counterparts. Furthermore, like SVM, MOPSO-CS is not sensitive to the number of attributes in the data sets. The number of attributes in the German data set is approximately twice that of the UK1 data set, yet MOPSO-CS needs less computation time on the German data set than on the UK1 data set. Because of the process of finding the cutoff, the computation time of MOPSO-CS(SEN) is greater than that of MOPSO-CS(ACC).

Table 12 Computation time of the classifiers

Through the computational experiments, we observe that MOPSO-CS is more flexible than all eleven counterparts. By setting the weights \(w_G\) and \(w_B\), MOPSO-CS can trade off specificity against sensitivity in data classification. Compared with NaiveBayes, LR, SVM, ANN, DT, CART, bagging, RF, MC2 and XGBoost, MOPSO-CS(SEN) obtains superior sensitivity, which is important for the credit scoring problem. Moreover, MOPSO-CS(ACC) achieves better accuracy, especially on balanced data sets.

6 Conclusion and future work

Traditional credit scoring models are usually based on distance; in these models, there are two kinds of objectives: minimizing the internal distance and maximizing the external distance. However, a solution with minimum internal distance or maximum external distance cannot be proved to classify the samples correctly. In this paper, we define a metric based on the misclassification number to measure the quality of the classification, and then a novel multi-objective credit scoring model is proposed. By setting the weights in our model, we can control the trade-off between specificity and sensitivity in the credit scoring problem.

In the linear discriminant credit scoring model, finding the optimal cutoff is a critical issue, and it is difficult to solve with a linear programming approach (He et al. 2010). In our credit scoring approach, however, because the credit score of each sample can be calculated before the cutoff is optimized, the cutoff with minimum misclassification number is easy to find.
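The cutoff search can be sketched as a one-dimensional scan: once the scores are fixed, only the midpoints between consecutive sorted scores need to be tried. This is an illustration, not the paper's exact procedure; it assumes, for demonstration, that low scores indicate "bad" credit (label 1), and the weights `w_G`, `w_B` mirror the paper's notation:

```python
def best_cutoff(scores, labels, w_G=1.0, w_B=1.0):
    """Given fixed credit scores, pick the cutoff that minimizes the
    weighted misclassification count: samples scoring below the cutoff
    are classified "bad" (label 1), the rest "good" (label 0)."""
    pairs = sorted(zip(scores, labels))
    # candidate cutoffs: below all scores, between neighbours, above all
    candidates = [pairs[0][0] - 1.0]
    candidates += [(a[0] + b[0]) / 2 for a, b in zip(pairs, pairs[1:])]
    candidates += [pairs[-1][0] + 1.0]
    best = None
    for c in candidates:
        cost = sum(w_B for s, y in pairs if y == 1 and s >= c)   # missed bad
        cost += sum(w_G for s, y in pairs if y == 0 and s < c)   # rejected good
        if best is None or cost < best[0]:
            best = (cost, c)
    return best[1], best[0]
```

Because the scan is linear in the number of candidate cutoffs, this step is cheap compared with optimizing the score function itself.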

For the credit scoring problem, it is important that the model produce credit scores that are easy for credit applicants to comprehend. Compared with black-box technologies such as ANN and SVM, the credit score function in our approach is more comprehensible. Finally, our example and experimental studies based on benchmark and real-world data sets confirm that the proposed method outperforms the counterparts in terms of sensitivity while maintaining acceptable accuracy.

Future work can be carried out in several directions. Firstly, we would like to apply the proposed model to other application domains, such as the life sciences and social sciences. Secondly, our model is designed for two-group classification problems in this paper, and it could be extended to multi-group classification problems. Moreover, besides feature selection, feature extraction may lead to better predictive performance, and we would like to study and apply it.