
1 Introduction

A credit loan is an unsecured loan. In recent years, the credit market in China has expanded rapidly. On the one hand, the rapid development of China’s economy has shortened the capital turnover cycle. On the other hand, as national consumption capacity has improved, businesses increasingly need funds, so a large number of P2P Internet inclusive finance platforms have come into being. Because there is no complete credit evaluation system in China comparable to that of banks, P2P platforms have limited capacity to accommodate customers without collateral and can obtain better risk predictions only by establishing a corresponding credit risk assessment model. Many platforms are therefore exploring their own methods of credit risk assessment. Most use data mining to collect and understand customer information in order to better grasp its authenticity and validity, to evaluate the financial situation of customers more reasonably, and to predict the business conditions, repayment intention, and repayment ability of borrowers more accurately.

Establishing a good credit risk evaluation model is the biggest challenge to the development of P2P platforms and the credit market. If the model is too strict with customers, the platform will lose some high-quality customers and be put at a disadvantage in industry competition. Conversely, if the model is too lenient, the overdue rate of the platform will continue to rise, which makes it difficult for financial managers to fulfil their responsibilities and costs the platform its credibility. It is therefore important to establish a credit risk evaluation model that prevents bad debts, speeds up capital flow, and maintains the security and stability of capital. In the field of credit risk assessment, artificial neural networks, genetic programming, genetic algorithms, support vector machines, logistic regression, and some hybrid models have achieved good results in terms of performance and precision.

In the past few years, many algorithms and research methods have been tested on customer information data in the field of credit risk assessment. Khashman applied an artificial neural network to a German customer dataset and achieved an accuracy of 83.6% [1]. Bekhet and Eletter applied an RBF network to a Jordanian commercial bank dataset and obtained an accuracy of 86.5% on the test sets [2]. Wang et al. used an improved BP neural network and reported an accuracy of 86% [3]. The traditional artificial neural network has a fixed structure, whereas the flexible neural tree (FNT) has a flexible tree structure, which allows the FNT model to acquire better properties through learning.

In this paper, a new method based on the FNT model is proposed for the classification of customer information. Results from 10-fold cross validation show that our method achieves better performance than other state-of-the-art methods.

2 Data Collection and Variable Definition

Customer information data can be described along many dimensions. In this paper, we randomly selected 300 samples of overdue customers (positive samples) and 300 samples of non-overdue customers (negative samples), all drawn from 2,000 customers of Jinan Hengxin Micro-Investment Advisory Co., Ltd. between 2014 and 2016. We chose 13 dimensions to describe the customer information. The criteria for selecting dimensions were: (1) they do not contain the customer’s identity information; (2) they exclude information that is subjective from the point of view of an actual manual audit, such as the use of the loan, the business model, and profits, as well as other information that can only be verified by a third party and is therefore difficult to verify and tally.

Under these principles, the selected dimensions make maximal use of the customer-provided data that are objective and difficult to forge. Classification is thus based on verifiable actual data and excludes subjective descriptions. Table 1 lists the variables, values, and definitions of the 13 selected dimensions, and Table 2 shows example records from the dataset.

Table 1. Proposed variables for building dataset
Table 2. Examples

The 600 samples were encoded according to the definitions in Table 1. All data were then processed with min-max standardization (“Max_Min standardization”) before being input to the FNT model. The normalized samples are shown in Table 3, and the normalization rule is given in Eq. (1).

$$ P_{ij}^{'} = \frac{{P_{ij} - m_{i} }}{{M_{i} - m_{i} }} $$
(1)

where \( P_{ij}^{'} \) is the normalized customer data, \( P_{ij} \) is the original customer data, \( M_{i} \) is the maximum value of dimension i, and \( m_{i} \) is the minimum value of dimension i.
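As a minimal sketch of Eq. (1), the normalization can be written with NumPy as follows. The function name `min_max_normalize`, the row-per-customer layout, the handling of constant dimensions, and the toy values are assumptions made for illustration; they are not taken from the dataset used in this study.

```python
import numpy as np

def min_max_normalize(P):
    """Scale every dimension (column) of P to [0, 1] following Eq. (1).

    Assumed layout: one row per customer, one column per dimension i.
    """
    P = np.asarray(P, dtype=float)
    m = P.min(axis=0)  # m_i: minimum value of dimension i
    M = P.max(axis=0)  # M_i: maximum value of dimension i
    span = M - m
    # Assumption: a constant dimension (M_i == m_i) is mapped to 0
    # instead of dividing by zero; Eq. (1) does not cover this case.
    safe_span = np.where(span == 0, 1.0, span)
    return (P - m) / safe_span

# Hypothetical toy data: 3 customers x 2 dimensions.
toy = [[20, 1000], [30, 3000], [40, 2000]]
normalized = min_max_normalize(toy)
print(normalized)
```

Each column is scaled independently, so dimensions measured in very different units end up on the same [0, 1] range before they reach the model.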

Table 3. Normalized samples

3 Classification Method

3.1 Flexible Neural Tree

The flexible neural tree (FNT) is a special artificial neural network with a flexible tree structure. It was proposed by Chen et al. [4, 5], and it is relatively easy for this model to reach a near-optimal structure using optimization algorithms. The FNT model uses a tree-structural encoding method and a specific instruction set; it is generated from a function set F and a terminal instruction set T, described as follows.

$$ {\text{S}} = {\text{F}}\,\mathop \cup \nolimits \,{\text{T}} = \left\{ { +_{2} , +_{3} \cdots +_{N} } \right\}\,\mathop \cup \nolimits \,\left\{ {x_{1} \cdots x_{n} } \right\} $$
(2)

where \( +_{i} (i = 2,3, \cdots ,N) \) denotes a non-leaf node instruction taking i arguments, and \( x_{1} ,x_{2} , \cdots ,x_{n} \) are leaf node instructions taking no arguments.

Figure 1 shows how the output of a non-leaf node is calculated in the FNT model. The instruction \( +_{i} \) is also called a flexible neuron operator with i inputs. For a flexible neuron \( +_{n} \), the total excitation is given by

$$ net_{n} = \sum\nolimits_{j = 1}^{n} {w_{j} } x_{j} $$
(3)
Fig. 1. Non-leaf node of flexible neural tree with a terminal instruction set \( {\text{T}} = \{ x_{1} ,x_{2} , \cdots ,x_{n} \} \)

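In Eq. (3), \( x_{j} (j = 1,2, \cdots ,n) \) are the inputs to node \( +_{n} \) and \( w_{j} \) are the corresponding connection weights. The output of node \( +_{n} \) is then calculated by the flexible activation function with adjustable parameters \( a_{n} \) and \( b_{n} \):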

$$ out_{n} = f\left( {a_{n} ,b_{n} ,net_{n} } \right) = e^{{ - (\frac{{net_{n} - a_{n} }}{{b_{n} }})^{2} }} $$
(4)
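Eqs. (3) and (4) can be illustrated with a short sketch of a single flexible neuron. The function name `flexible_neuron` and the numeric values are hypothetical and chosen only so that the result is easy to check by hand.

```python
import math

def flexible_neuron(inputs, weights, a, b):
    """Output of one flexible neuron +_n."""
    # Eq. (3): total excitation as the weighted sum of the inputs.
    net = sum(w * x for w, x in zip(weights, inputs))
    # Eq. (4): Gaussian-shaped activation centred at a with width b.
    return math.exp(-((net - a) / b) ** 2)

# Hypothetical values: net = 0.5 * 0.2 + 0.5 * 0.8 = 0.5, which equals a,
# so the activation is at its peak.
out = flexible_neuron([0.2, 0.8], [0.5, 0.5], a=0.5, b=1.0)
print(out)
```

The output lies in (0, 1] and decreases as the excitation moves away from \( a_{n} \); \( b_{n} \) controls how quickly it decreases.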

A typical FNT model is illustrated in Fig. 2. Its overall output is computed recursively, from left to right, by a depth-first traversal.

Fig. 2. Typical representation of FNT with function instruction set \( \{ +_{2} , +_{3} , +_{4} , +_{5} , +_{6} \} \) and terminal set \( \{ x_{1} ,x_{2} ,x_{3} \} \), which has four layers.
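The recursive, depth-first computation of the overall output can be sketched as follows. The tree encoding (an integer index for a leaf, a dictionary with `a`, `b`, and weighted `children` for a non-leaf node) is an assumption made for this illustration and is not the encoding used by Chen et al.; the small tree below is hypothetical and is not the tree in Fig. 2.

```python
import math

def evaluate(node, x):
    """Depth-first, left-to-right evaluation of an FNT node.

    Assumed encoding: a leaf is an int index into the input vector x;
    a non-leaf node is a dict with parameters 'a', 'b' and a list
    'children' of (weight, child_node) pairs.
    """
    if isinstance(node, int):
        return x[node]  # leaf: return the selected input variable
    # Eq. (3): weighted sum of the recursively evaluated children.
    net = sum(w * evaluate(child, x) for w, child in node["children"])
    # Eq. (4): flexible activation of this node.
    return math.exp(-((net - node["a"]) / node["b"]) ** 2)

# Hypothetical tree: +_2(x_1, +_2(x_2, x_3)) with unit weights.
tree = {"a": 0.0, "b": 1.0, "children": [
    (1.0, 0),
    (1.0, {"a": 0.0, "b": 1.0, "children": [(1.0, 1), (1.0, 2)]}),
]}
y = evaluate(tree, [0.0, 0.0, 0.0])
print(y)
```

Because a leaf may appear at any depth and a node may have any number of children, the same function evaluates trees of different shapes without modification.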

General learning algorithm of FNT

  • Step 1. Initialize the parameters used in the particle swarm optimization (PSO) algorithm. Set the elitist program to NULL and its fitness value to the largest positive real number. Create the initial population.

  • Step 2. Optimize the structure using the PSO algorithm, with the root mean square error (RMSE) as the fitness function.

  • Step 3. If a better structure has been found, go to Step 4; otherwise go to Step 2.

  • Step 4. Optimize the parameters using the PSO algorithm.

  • Step 5. If the maximum number of local search steps is reached, or no better parameter vector is found for a significantly long time (100 steps), go to Step 6; otherwise go to Step 4.

  • Step 6. If a satisfactory solution is found, stop; otherwise go to Step 2.
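The parameter optimization in Steps 4 and 5 can be sketched with a standard global-best PSO that minimizes the RMSE of a single flexible neuron. This is an illustration under stated assumptions, not the implementation used in this study: the structure search of Steps 2 and 3 is omitted, and the inertia weight (0.7), acceleration coefficients (1.5), swarm size, initialization range, and toy data are all hypothetical choices. Only the 100-step stagnation limit comes from Step 5.

```python
import numpy as np

def neuron_output(params, X):
    """Single flexible neuron; params = [w_1..w_n, a, b] (Eqs. 3 and 4)."""
    n = X.shape[1]
    w, a = params[:n], params[n]
    b = abs(params[n + 1]) + 1e-9  # assumption: guard against b == 0
    return np.exp(-((X @ w - a) / b) ** 2)

def rmse(params, X, y):
    """Fitness function: root mean square error."""
    return float(np.sqrt(np.mean((neuron_output(params, X) - y) ** 2)))

def pso_minimize(fitness, dim, n_particles=20, max_iter=200, patience=100, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(0.1, 1.0, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    g = int(np.argmin(pbest_fit))
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    history, stale = [gbest_fit], 0
    for _ in range(max_iter):  # maximum number of local search steps
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        fit = np.array([fitness(p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        g = int(np.argmin(pbest_fit))
        if pbest_fit[g] < gbest_fit:
            gbest, gbest_fit, stale = pbest[g].copy(), pbest_fit[g], 0
        else:
            stale += 1
        history.append(gbest_fit)
        if stale >= patience:  # no better parameter vector for 100 steps
            break
    return gbest, gbest_fit, history

# Hypothetical toy problem: recover the parameters of a known neuron.
data_rng = np.random.default_rng(1)
X_toy = data_rng.random((50, 3))
y_toy = neuron_output(np.array([0.4, 0.3, 0.2, 0.5, 0.8]), X_toy)
best, best_fit, history = pso_minimize(lambda p: rmse(p, X_toy, y_toy), dim=5)
print(f"RMSE: {history[0]:.4f} -> {best_fit:.4f}")
```

In a full FNT learner the parameter vector would collect the weights and the \( a \), \( b \) values of every node in the current tree, and the fitness would be the RMSE of the whole tree on the training samples.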

3.2 Prediction Assessment

In statistical analysis, two methods can be used to check the effectiveness of a classifier in applications, namely independent dataset tests and 10-fold cross validation tests. In 10-fold cross validation, the full training set is split equally into 10 subsets. Each subset is used in turn as the test set to compute the overall accuracy (OA) of the model trained on the remaining nine subsets. In addition, sensitivity (Sens) and specificity (Spec) are used to evaluate the performance of the classifier.
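A minimal sketch of the fold construction and of the three indexes is given below. It uses the standard definitions OA = (TP + TN) / (TP + TN + FP + FN), Sens = TP / (TP + FN), and Spec = TN / (TN + FP), and assumes that label 1 marks an overdue (positive) customer and label 0 a non-overdue (negative) customer. The function names, the random seed, and the toy labels are hypothetical.

```python
import random

def ten_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k equal-sized folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def evaluate_metrics(y_true, y_pred):
    """Return (OA, Sens, Spec); assumes both classes occur in y_true."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    oa = (tp + tn) / len(y_true)   # overall accuracy
    sens = tp / (tp + fn)          # share of positives correctly detected
    spec = tn / (tn + fp)          # share of negatives correctly detected
    return oa, sens, spec

folds = ten_fold_indices(600)
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(600) if i not in held_out]
    # A model would be trained on train_idx and scored on test_idx here.

# Hypothetical labels and predictions for one small test set.
oa, sens, spec = evaluate_metrics([1, 1, 0, 0], [1, 0, 0, 0])
print(oa, sens, spec)
```

With 600 samples, each fold holds 60 test samples and leaves 540 samples for training.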

4 Discussion and Results

In this study, the FNT model was evaluated by 10-fold cross validation on a dataset of 600 samples, i.e. 540 training samples and 60 testing samples were used in each experiment. The results show that the average accuracy on the test sets is 88.32% (Table 4). In Table 4, “T” stands for “trial”, “D” for “data”, “OA” for “overall”, “A-acc” for “average accuracy rate”, and “acc” for “accuracy rate”; the values of “A-acc” and “acc” are percentages.

Table 4. The part of results of FNT model in 10-fold cross validation

We compared the average accuracy, sensitivity, and specificity of our model with those of other methods. The results are shown in Table 5. Our method has higher accuracy than the other methods, and its specificity is slightly better. One point should be noted: the sensitivity of the Improved BP Neural Network method is 91.6%, but this value was calculated from a single experiment with 14 positive samples and 6 negative samples, 20 samples in total. Because the proportion of positive samples is much higher, the sensitivity is also high. Moreover, the sensitivity is reported only there and nowhere else, so this value is included in Table 5 for reference only.

Table 5. The comparison of our method and other methods

5 Conclusion

In this study, we proposed a redesigned and redefined set of customer information feature dimensions together with an FNT model for credit risk assessment. Compared with other methods, the proposed method improves the evaluation indexes to varying degrees, which demonstrates the validity of the FNT model. In future work, we will continue to improve the algorithm and search for more effective classifiers in order to obtain better classification accuracy in this field.