
1 Introduction

Even though Logistic Regression (LR) is one of the most common algorithms used in the financial industry [4], several studies have shown that it is not the most accurate technique for credit risk classification. Two benchmarking studies co-authored by Baesens [2, 3] demonstrate that, within the category of individual classifiers, deep learners are more accurate than LR.

Despite the better accuracy of deep learners, financial institutions use LR for credit scoring because of its operational efficiency (simplicity) and the interpretability (transparency) of its predictions [11]. These two points, together with statistical accuracy, form part of the five key characteristics of a successful credit risk model defined by Baesens [4], shown in Table 1. Due to their complex nature, deep learners are considered Black Box Models (BBM), that is, complex models that are not straightforwardly interpretable by humans [25], which makes them unviable for financial institutions under international regulations [7]. However, the application of interpretability methods allows us to give transparency to DL models.

In this work, we propose an explainable deep learning model based on a 2D Convolutional Neural Network (CNN) for credit risk classification. The use of CNN for credit risk is not new. However, our approach uses DeepInsight [31], a methodology proposed by Sharma et al., to transform tabular data into a 2D representation used as input for the CNN. As shown in our results, the classification accuracy of DeepInsight combined with a CNN outperforms Decision Tree (DT), LR, and Random Forest (RF) for large datasets (more than 5,900 samples). Additionally, the use of SHapley Additive exPlanations (SHAP) to explain the model's predictions gives us a model that is both explainable and more accurate than LR for credit risk classification.

Table 1. Key characteristics of a successful credit risk model [4]

The rest of the paper is organized as follows. Section 2 presents the motivation for this work, while Sect. 3 discusses the state of the art in credit risk assessment. Section 4 introduces the proposed method, and Sect. 5 provides details about the implementation and the experiments performed to obtain the best model, as well as a comparison with previous models. Finally, Sect. 6 summarizes our work and discusses perspectives for future work.

2 Motivation

Why is credit risk so important? First, it is common knowledge that no economy, no matter how advanced, can develop in the absence of credit [5]. On the other hand, a relaxed credit policy can become the core of a global financial crisis such as that of 2007–2009.

The credit cycle begins with credit being easily accessible to customers, and they can borrow and spend more. In the same way, enterprises can borrow and make more significant investments. More consumption and investment create jobs and lead to income and profit growth. However, all economic expansion induced by credit ends when critical economic sectors become incapable of paying off their debts [17]. When the credit cycle is broken, there is a strong possibility of crisis (Fig. 1).

Fig. 1. Credit cycle [15]

Accurate credit risk tools allow financial institutions to make better decisions about granting credit. Reasonable administration of credit is an essential part of the growth of almost all economies. Economic growth is the most powerful instrument for reducing poverty and improving the quality of people's lives, and it can generate virtuous circles of prosperity and opportunity [12]. In conclusion, research on credit risk has a profound impact on the world and on people's lives.

3 Related Work

Durand [13] laid the foundations of statistical credit scoring about 80 years ago. Nowadays, thanks to the evolution of statistical classification techniques, computational power, and easy access to sizable and reliable data, financial institutions use the statistical approach for credit risk management [4]. Many classification models have been developed to address the credit scoring problem during the past few decades. Logistic Regression [6] and Random Forest [4] are the most widely used models for credit scoring. However, more sophisticated machine learning techniques such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN) are also widely applied. Furthermore, ensemble methods that combine the advantages of various single classifiers achieve good scores, such as HCES-Bag, which obtained the best score in the benchmark published by Lessmann and Baesens [3].

Several empirical studies have compared the performance of different classification models for credit scoring. For example, West [33] compared ANN against traditional machine learning techniques and showed that ANN performs better than LR. On the other hand, Cuicui et al. designed a Deep Belief Network (DBN) for credit classification and compared it against SVM, LR, and a Multilayer Perceptron on a credit default swaps dataset [23]; DBN yielded the best performance.

The Convolutional Neural Network (CNN) is a representative deep learning technique; it first appeared in the work of LeCun et al., designed to handle the variability of data in 2D shape [19]. The impressive achievements of CNNs in different areas, including but not limited to Natural Language Processing (NLP) and Computer Vision, have attracted the attention of industry and academia [21]. Moreover, in the last few years, attracted by the classification ability of CNNs, some studies have begun to apply them to credit risk management. Zhu, Yang, and Wang propose a model named "Relief-CNN" [37] that combines a CNN with the Relief algorithm; it outperforms LR and RF on a dataset from a Chinese consumer finance company. Dastile and Celik [10] propose another CNN model for credit scoring that performs better than traditional machine learning methods. However, both models convert the tabular data into a 2D representation for the CNN input by discretizing the data and generating a representation containing only ones and zeros, with possible data loss.

4 Proposed Approach

Credit risk classification is a data mining problem. With this in mind, we propose a process based on the CRoss-Industry Standard Process for Data Mining (CRISP-DM), a process model for data mining [9]. Our proposal modifies the last step of CRISP-DM, called implementation, replacing it with an Interpretability step in which local and global explanations are generated, as shown in Fig. 2.

Fig. 2. Proposed model phases based on CRISP-DM

Although all the steps of CRISP-DM are essential, we focus on the Data preparation step (especially the Format data task), the Modeling step, and an extra step called Interpretability, defined by us after the Evaluation step, where the explanations of the generated model are produced.

4.1 Data Preparation

Format data [9] is part of the data preparation phase, which refers to all activities needed to transform the initial raw data into the data used as input for machine learning algorithms. The data used by financial institutions are generally in tabular form [4] (data displayed in columns or tables). However, the proposed 2D CNN requires an image data representation.

Fig. 3. Illustration of the DeepInsight [31] methodology to transform a feature vector to image pixels.

DeepInsight [31] transforms the tabular data through a sequence of steps. First, it generates a feature vector by transposing the dataset. Second, it maps each feature into a 2D space using t-SNE. Third, for efficiency, DeepInsight finds the smallest rectangle that covers all points and frames it horizontally. Fourth, based on the final image dimensions defined by the user, DeepInsight frames and maps each feature to a pixel location. Finally, each instance is represented using the general feature image generated in the previous step, replacing the value of each feature pixel with the normalized value of the instance's feature, which can be seen as a greyscale image. Figure 3 shows the transformation process.
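The following Python sketch illustrates this idea under simplifying assumptions (axis-aligned min-max framing instead of DeepInsight's exact convex-hull rotation; function and variable names are ours, not DeepInsight's API):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

def tabular_to_images(X, image_size=32, random_state=0):
    """Illustrative DeepInsight-style transform: (n_samples, n_features) -> greyscale images."""
    # 1) Each feature (column) becomes a point in 2D: embed the transposed matrix with t-SNE.
    coords = TSNE(n_components=2, random_state=random_state,
                  perplexity=min(30, X.shape[1] - 1)).fit_transform(X.T)
    # 2) Framing: rescale the 2D feature coordinates onto the pixel grid.
    coords = MinMaxScaler().fit_transform(coords)
    pixels = np.clip((coords * (image_size - 1)).round().astype(int), 0, image_size - 1)
    # 3) Paint each sample's normalized feature values onto its feature's pixel.
    X_norm = MinMaxScaler().fit_transform(X)
    images = np.zeros((X.shape[0], image_size, image_size))
    for j, (px, py) in enumerate(pixels):
        # If several features collide on the same pixel, keep the maximum value.
        images[:, py, px] = np.maximum(images[:, py, px], X_norm[:, j])
    return images, pixels  # `pixels` keeps the feature-to-pixel mapping, reused later for SHAP
```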

4.2 Modeling

The Convolutional Neural Network is one of the most used deep learning architectures for image processing [36]. The basic structure of a CNN is shown in Fig. 4. There are two particular types of layers in a CNN: the convolutional layer and the pooling layer. The convolutional layer is the basic building block of a CNN [26]; it contains a set of learnable filters that slide over the image to extract features. The pooling layer reduces the spatial size of the representation and the number of parameters, providing more efficiency and controlling overfitting.

Convolutional neural networks differ from traditional neural networks by replacing general matrix multiplication with convolution, reducing the number of weights in the network and allowing an image to be fed in directly. Additionally, the convolutional layer has several key features, two of which are local perception and parameter sharing. Local perception refers to the high relevance of image parts that are close together compared to the low relevance of distant parts [35]. Parameter sharing, on the other hand, learns one set of parameters for the whole image instead of learning a different parameter set at each location [37]. These features help to improve the efficiency of the network.

Fig. 4. The basic structure of CNN.
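As a concrete illustration of this structure, the following is a minimal Keras sketch of a small 2D CNN for binary (good/bad credit) classification; the layer sizes are illustrative and do not reproduce the tuned architectures reported in Table 3:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(image_size=32):
    """Minimal convolution + pooling + fully connected stack in the spirit of Fig. 4."""
    model = models.Sequential([
        layers.Input(shape=(image_size, image_size, 1)),       # greyscale DeepInsight image
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),                            # spatial downsampling
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),                  # probability of default
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.AUC(name="auroc")])
    return model
```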

4.3 Interpretability

Miller defines interpretability as the degree to which a human can understand the cause of a decision [24]. The interpretability of a machine learning model is inversely related to its complexity. The CNN is considered a Black Box Model (BBM), that is, a complex model that is not straightforwardly interpretable by humans [27]. However, different methods exist to explain BBMs, such as SHapley Additive exPlanations (SHAP) [22], which is used in this paper to generate the local and global explanations.

SHAP, proposed by Lundberg and Lee, is a unified approach to interpreting model predictions. It explains individual predictions based on the calculation of Shapley values [22]. In addition, SHAP can give a global explanation of a model based on the average of absolute Shapley values per feature over a random subset of dataset samples.

Shapley Values (SV), proposed by Shapley, are a method based on coalitional game theory (or cooperative game theory) [30]. SV explains a prediction by assuming that each feature value of a sample is a "player" in a game where the prediction is the payout. In other words, the SV of a feature is its average marginal contribution across all possible coalitions.

A linear model is explainable because we can see directly how each feature affects the prediction:

$$ \hat{f}\left( x\right) =\beta _{0}+\sum _{j=1}^{p}\beta _{j}x_{j} $$

where x is the instance for which we want to calculate the contributions, each \(x_j\) is a feature value with \(j=1,\ldots ,p\), and \(\beta _j\) is the weight of feature j.

The contribution \(\phi _{j}\) of the j-th feature on the prediction \(\hat{f}\) is:

$$ \phi _{j}\left( \hat{f}\right) =\beta _{j}x_{j}-E\left( \beta _{j}X_{j}\right) =\beta _{j}x_{j}-\beta _{j}E\left( X_{j}\right) $$

The mean effect of feature j is \(E\left( \beta _{j}X_{j}\right) \), and the contribution of the j-th feature is the difference between the feature effect and the average effect. If we sum the contributions of all features, we get the following result:

$$\begin{aligned} \sum _{j=1}^{p}\phi _{j}\left( \hat{f}\right)&= \sum _{j=1}^{p}\left( \beta _{j}x_{j}-E\left( \beta _{j}X_{j}\right) \right) \\&=\left( \beta _{0}+\sum _{j=1}^{p}\beta _{j}x_{j}\right) -\left( \beta _{0}+\sum _{j=1}^{p}E\left( \beta _{j}X_{j}\right) \right) \\&=\hat{f}\left( x\right) -E\left( \hat{f}\left( X\right) \right) \end{aligned}$$

The result is the predicted value for the instance x minus the average predicted value. To do the same for models other than linear ones, we need a way to compute the contribution of each feature to a single prediction.

To get the Shapley value of a feature value, we need to calculate its contribution to the result, weighted and summed over all possible combinations [25]. The Shapley value is then defined via a value function val over subsets S of players:

$$\begin{aligned} \phi _{j}\left( val\right)&=\sum _{S\subseteq \{x_{1},...,x_{p}\}\setminus \{x_{j}\}}\frac{\left| S\right| !\left( p-\left| S\right| -1\right) !}{p!}\left( val\left( S\cup \{x_{j}\}\right) -val\left( S\right) \right) \end{aligned}$$
(1)

where S is a subset of model features, p is the number of features and x is the vector of feature values. The result is the contribution of feature j for all feature coalitions.

The exact computation of the Shapley value requires evaluating all coalitions of feature values with and without the j-th feature. This becomes problematic for more than a few features, because the number of possible coalitions grows exponentially with each added feature. Therefore, Štrumbelj et al. (2014) [32] propose an approximation using Monte-Carlo sampling:

$$ \hat{\phi _{j}}=\frac{1}{M}\sum _{m=1}^{M}\left( \hat{f}\left( x_{+j}^{\text {m}}\right) -\hat{f}\left( x_{-j}^{\text {m}}\right) \right) $$

where \( \hat{f}\left( x_{+j} ^m \right) \) is the prediction for the instance x in which a random number of feature values are replaced by the values of another randomly drawn instance z, except for the value of feature j, which is kept from x; \( \hat{f}\left( x_{-j} ^m \right) \) is identical except that the value of feature j is also taken from z. The procedure to approximate the Shapley value is as follows:

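A minimal Python sketch of this Monte-Carlo sampling procedure (our illustrative implementation, not the authors' original pseudocode; `f` stands for any prediction function):

```python
import numpy as np

def shapley_monte_carlo(f, x, X_background, j, M=1000, seed=0):
    """Monte-Carlo estimate of the Shapley value of feature j for instance x."""
    rng = np.random.default_rng(seed)
    p = x.shape[0]
    contributions = np.empty(M)
    for m in range(M):
        z = X_background[rng.integers(len(X_background))]   # random instance z
        mask = rng.integers(0, 2, size=p).astype(bool)       # random coalition of features
        x_plus = np.where(mask, x, z)                        # features outside the coalition come from z
        x_minus = x_plus.copy()
        x_plus[j] = x[j]                                     # feature j taken from x
        x_minus[j] = z[j]                                    # feature j taken from z
        contributions[m] = f(x_plus[None, :])[0] - f(x_minus[None, :])[0]
    return contributions.mean()
```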

SHAP uses Shapley values to explain BBMs. It proposes several kernel-based estimation approaches for Shapley values inspired by local surrogate models. KernelSHAP [22] is a model-agnostic approach based on LIME and Shapley values. TreeSHAP and DeepSHAP, on the other hand, are model-specific: the first is an efficient estimation approach for tree-based models and the second for deep learning models.
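For instance, the trained CNN on DeepInsight images could be explained with DeepSHAP roughly as follows (a sketch assuming the `build_cnn` and `tabular_to_images` helpers defined above; details may vary with the `shap` version):

```python
import numpy as np
import shap

# images, pixels = tabular_to_images(X)                      # DeepInsight-style sketch above
# model = build_cnn(images.shape[1]); model.fit(...)          # trained CNN

background = images[np.random.choice(len(images), 100, replace=False)][..., np.newaxis]
explainer = shap.DeepExplainer(model, background)             # DeepSHAP: model-specific for deep nets
shap_values = explainer.shap_values(images[:10][..., np.newaxis])
```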

SHAP Feature Importance (FI) is one of the global interpretations based on aggregations of Shapley values. SHAP FI ranks features by the mean absolute Shapley value per feature across the data [25]:

$$ I_{j}=\frac{1}{n}\sum _{i=1}^{n}\left| \phi _{j}^{(i)}\right| $$

Features are then ordered by decreasing importance. For example, Fig. 5 shows the SHAP FI for a trained CNN on the Lending Club dataset.

Fig. 5. SHAP global explanation of the eight most important features, measured as the mean absolute Shapley values, for the Lending Club dataset

After training the CNN models and assessing their performance, we use SHAP to generate local and global explanations of the model. SHAP can generate local explanations for tabular data and images, but global explanations are not as readily available for images. Moreover, explanations of images are given as SHAP values of pixels based on the predictions of the trained model, which is not easy for humans to understand; an example is shown in Fig. 6. Nevertheless, thanks to DeepInsight, we have the mapping between each pixel and each feature, which allows us to return the SHAP values to tabular form and thus generate more interpretable local and global explanations (Fig. 7).
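Under the same assumptions as the sketches above, the pixel-to-feature mapping returned by the transform can be used to bring SHAP values back to tabular form and to compute the global importance \(I_j\):

```python
import numpy as np

def shap_to_tabular(shap_images, pixels):
    """shap_images: (n_samples, H, W) pixel SHAP values; pixels: (n_features, 2) coordinates."""
    tabular = np.empty((shap_images.shape[0], len(pixels)))
    for j, (px, py) in enumerate(pixels):
        tabular[:, j] = shap_images[:, py, px]                # SHAP value of the pixel hosting feature j
    return tabular

# Depending on the shap version, shap_values may be a list with one array per model output.
shap_tabular = shap_to_tabular(np.squeeze(np.asarray(shap_values[0])), pixels)
feature_importance = np.abs(shap_tabular).mean(axis=0)        # I_j, used for the SHAP FI plot (Fig. 5)
```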

Fig. 6. (left) Example of SHAP local explanation for a multi-class ResNet50 on ImageNet [29]. (middle) Example of SHAP local explanation of a credit image generated from tabular data transformed using DeepInsight. (right) Example of a feature matrix generated using DeepInsight.

Fig. 7. (left) Example of SHAP local explanation for a DeepInsight credit image. (right) Example of SHAP local explanation for a DeepInsight credit image after conversion to tabular form.

Table 2. Datasets

5 Experimental Results

The datasets used in this work are four datasets provided by financial and academic institutions and widely used in credit scoring research. The datasets differ in almost all their characteristics, such as the number of samples and features. A summary of the characteristics of the datasets is shown in Table 2.

Since the four datasets may contain redundant features that increase computation and affect performance, we apply feature selection using ANOVA for numerical features and the Chi-Squared test for categorical features [8]. Additionally, when highly correlated features exist, SHAP generates redundant local and global explanations, with these features receiving the same SHAP values; therefore, eliminating highly correlated features is needed. A sketch of this selection step is shown below.
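A possible scikit-learn sketch of the selection step (column names and the number of features kept are illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, chi2

def select_features(df, target, numeric_cols, categorical_cols, k_num=10, k_cat=10):
    """Keep the k best numerical features by ANOVA F-test and the k best
    (non-negative, encoded) categorical features by the Chi-Squared test."""
    y = df[target]
    num_sel = SelectKBest(f_classif, k=min(k_num, len(numeric_cols))).fit(df[numeric_cols], y)
    cat_sel = SelectKBest(chi2, k=min(k_cat, len(categorical_cols))).fit(df[categorical_cols], y)
    kept_num = np.array(numeric_cols)[num_sel.get_support()]
    kept_cat = np.array(categorical_cols)[cat_sel.get_support()]
    return list(kept_num) + list(kept_cat)
```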

For each dataset, we use cross-validation with ten stratified folds. The training set was used to find the optimal parameters of the CNN model, and the metrics on the test set were used to assess its performance. Table 3 shows the optimal architecture of our CNN model for the Australian, German, HMEQ, and Lending Club datasets, respectively.
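The evaluation loop can be sketched as follows (using the illustrative helpers from Sect. 4; `images` and `y` denote the transformed images and labels, and the number of epochs is an example value):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accs, aucs = [], []
for train_idx, test_idx in skf.split(images, y):
    model = build_cnn(image_size=images.shape[1])
    model.fit(images[train_idx][..., np.newaxis], y[train_idx],
              epochs=50, batch_size=32, verbose=0)
    proba = model.predict(images[test_idx][..., np.newaxis]).ravel()
    accs.append(accuracy_score(y[test_idx], proba > 0.5))
    aucs.append(roc_auc_score(y[test_idx], proba))
print(f"Accuracy: {np.mean(accs):.3f}  AUROC: {np.mean(aucs):.3f}")
```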

Table 3. CNN Architecture: Parameters and architectures of the CNN model in (a) Australian dataset epochs = 500, (b) German dataset epochs = 50, (c) HMEQ dataset epochs = 1000, and (d) Lending club dataset epochs = 50. Legend: Conv (Convolutional Layer), PL (Pooling Layer), FC (Fully Connected Layer).

To compare the performance of our model, we use different studies that employ the same datasets. Additionally, we train base models with LR and RF for comparison. For each dataset, we calculate Accuracy and the Area Under the Receiver Operating Characteristic curve (AUROC). Accuracy is not the best metric for evaluating credit risk classification models, but many studies report only Accuracy; Table 4 therefore compares Accuracy across models and datasets. A better metric for credit risk classification is AUROC; Table 5 compares the AUROC for the studies that report it and for our results.

Table 4. Accuracy comparison against datasets and models
Table 5. AUROC comparison against datasets and models

6 Conclusions

In this paper, tabular datasets were converted into images using DeepInsight, and the images were used to train a 2D CNN. The performance of the trained CNN was compared with results from the literature and with base LR and RF models trained by us for reference. We found that the trained CNN outperformed both the literature results and our base models when the dataset size was greater than 5,900 samples, surpassing the Accuracy and AUROC of the second-best model by up to 0.106 and 0.046, respectively.

Additionally, thanks to the mapping generated by DeepInsight when the images are created, we can map the SHAP values computed on the trained models' predictions back to tabular form, allowing us to generate local and global explanations.