1 Introduction

As the volume of data generated each day grows, machine learning algorithms have become more powerful, and data analytics has become an integral part of industry. Various machine learning methods are now applied to solve serious business problems [6].

This paper addresses one of the greatest challenges currently facing the banking sector: non-repayment of loans, which has caused major losses. As a result, banks have started to invest heavily in loan risk models that help reduce the risk of lending to a customer. These models are built using machine learning and predictive modeling.

The machine learning techniques used in this paper to predict loan defaulters are logistic regression, the Rpart decision tree, random forest, and SVM. Logistic regression is a statistical model used to estimate the probability of a particular class or event, for example passed/failed, win/lose, alive/dead, or healthy/sick. It can be extended to several classes, for example to determine whether an image contains a cow, a bird, or another animal: each candidate class is assigned a probability between 0 and 1, and the probabilities sum to one.

The decision tree is one of the most widely used tools for classifying and predicting data. A decision tree is a tree-like structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label [6].

Random forest, also called random decision forest, is an ensemble learning method for classification, regression, and other tasks. It operates by constructing a multitude of decision trees at training time and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

The contribution of this paper is outlined below. The overall research process design for the proposed study is shown in Fig. 1. The objectives of the study are as follows:

  • To identify which factors lead a person to default on a payment

  • To conduct in-depth exploratory data analysis to get insights into the data available

  • To build a predictive model that accurately predicts whether the person would default or not

  • To improve the accuracy of the models implemented in the past work.

Fig. 1 Research process

2 Machine Learning

2.1 Logistic Regression

Logistic regression is a classification technique used to predict a factor (categorical) variable. It is a supervised learning algorithm, which means the target variable must be known beforehand for the training data.
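As a minimal illustration of how a logistic model turns a set of features into a class prediction, consider the sketch below. The coefficients, the two features, and the 0.5 cut-off are hypothetical values chosen only for illustration; they are not the fitted model from this study.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_default_probability(features, weights, bias):
    """Logistic model: P(default = 1 | x) = sigmoid(w . x + b)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical, hand-picked coefficients for two scaled features.
weights = [0.8, -0.5]
bias = -1.0

p = predict_default_probability([1.2, 0.3], weights, bias)
# Assumed decision rule: predict "Yes" when the probability reaches 0.5.
label = "Yes" if p >= 0.5 else "No"
```

In practice the weights would be estimated from the training data, and the cut-off could be tuned rather than fixed at 0.5.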

2.2 Decision Trees

The decision tree is a supervised machine learning algorithm that predicts a variable by identifying the most important variables and building a tree-based structure from them [1]. It can be used for both regression and classification problems. It is prone to overfitting, but its ease of implementation is a major reason for its use in industry.
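One common way a tree chooses a test at a node is to pick the split that minimizes class impurity. The sketch below assumes Gini impurity as the criterion and uses made-up toy data; the function names are illustrative and do not come from the study.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split `value <= threshold`."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data (made up): months of payment delay vs. default label.
delay = [0, 0, 1, 2, 3, 4]
default = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(delay, default)
```

A full tree repeats this search over every candidate variable and then recurses into each side of the chosen split.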

2.3 Random Forest

Random forest is a supervised machine learning technique that uses an ensemble method: it builds multiple decision trees and combines their predictions into a single model. It can also be used for both classification and regression problems. A random forest takes longer to train, since generating multiple decision trees takes time.
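The two ingredients of the ensemble, resampling the training data for each tree and aggregating the trees' outputs, can be sketched as follows. The votes and predictions are placeholders standing in for the outputs of individually trained trees, and the helper names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as one tree's training set."""
    return [rng.choice(rows) for _ in rows]

def forest_classify(tree_votes):
    """Classification: the forest outputs the mode of the trees' class votes."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Regression: the forest outputs the mean of the trees' predictions."""
    return mean(tree_outputs)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = bootstrap_sample(list(range(10)), rng)

# Placeholder outputs of five hypothetical trees for one customer.
vote = forest_classify(["Yes", "No", "Yes", "Yes", "No"])
avg = forest_regress([0.2, 0.4, 0.6])
```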

2.4 Support Vector Machine (SVM)

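Support vector machine is used to solve regression and classification problems. Each data item is plotted as a point in an n-dimensional space, where n is the number of features. The algorithm then finds a hyperplane that separates the groups, and the data points closest to it (the support vectors) determine its position.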
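For the linear case, classification reduces to checking on which side of a hyperplane a point falls. The sketch below assumes a linear SVM with a hand-picked, hypothetical hyperplane and also shows the hinge loss that is commonly minimized during training; it is not the kernel or configuration used in this study.

```python
def decision_value(x, w, b):
    """Signed score w . x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x, w, b):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(x, w, b) >= 0 else -1

def hinge_loss(x, y, w, b):
    """Hinge loss max(0, 1 - y * f(x)); zero for points beyond the margin on the correct side."""
    return max(0.0, 1.0 - y * decision_value(x, w, b))

# Hypothetical hyperplane x1 - x2 = 0 in a two-feature space.
w, b = [1.0, -1.0], 0.0

side = classify([2.0, 0.5], w, b)
loss_far = hinge_loss([2.0, 0.5], +1, w, b)    # correct side, outside the margin
loss_near = hinge_loss([0.5, 0.25], +1, w, b)  # correct side, inside the margin
```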

3 Literature Review

Past literature shows that many different machine learning techniques have been used, such as logistic regression, decision trees, random forest, and K-NN [2, 11, 12].

The most commonly used technique we observed was the decision tree, because of the ease with which it can be implemented. It is a technique in which the most significant variables are identified and a tree is built around them [3]. Random forest is another technique that was used quite often. It generates many decision trees and ensembles them to create a model with the best accuracy. The best accuracy was reported by Sayjdhah [6]. Nowadays, the banking sector makes efficient use of machine learning, applying several classification techniques to segment customers and predict trends [8, 10]. Banks want to keep all customer details in order to understand payment behavior, which has been added to the loan scoring literature to anticipate defaults [4]. Some researchers have used Bayesian networks, a graphical model that represents the probabilistic relationships among variables [7]. A few researchers have also proposed hybrid approaches, such as merging genetic algorithms with neural networks to detect financial fraud [5].

4 Methodology

4.1 Datasets

In this project, the dataset was generated from users' banking loan operations. It consists of 25 variables and 30,000 samples, and it has been used in several previous studies [9]. The dataset includes a binary variable for the default payment outcome, coded as Yes = 1 and No = 0 [6]. Table 1 gives a basic description of the dataset.

Table 1 Explanation of dataset

Default_Payment Next_Month → This is the target variable to be predicted. It indicates whether the person would default or not. It is a factor variable with two levels, “Yes” and “No”, as shown in Fig. 2.

Fig. 2 Snapshot of the dependent variable

4.2 Dataset Pre-Processing and Feature Selection

First, the data was cleaned so that the model could be built. All NA values were removed so that the model can run smoothly. The next step was feature selection, in which only the important variables are kept and the remaining variables are removed. In this paper, multicollinearity and correlation were considered to assess feature importance. Then the outliers were treated using the interquartile range. Outliers are values that do not follow the general pattern, and they pull the model away from correct predictions. Figures 3 and 4 show the variable bill_amt2 before and after outlier treatment, respectively. Feature scaling was also applied to bring the features into a common range so that the logistic model works fast and efficiently; Z-score normalization was used for this.
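The outlier treatment and scaling steps can be sketched as below. The multiplier k = 1.5 is the conventional choice and is an assumption here, since the cut-off used in the study is not stated; the bill amounts are made-up values, and the quartiles follow the default method of Python's statistics.quantiles.

```python
from statistics import mean, pstdev, quantiles

def remove_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumed, conventional choice."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def z_score(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up bill amounts with one extreme value.
bill_amt = [1200, 1500, 1800, 2100, 2400, 2700, 3000, 250000]

cleaned = remove_outliers_iqr(bill_amt)  # drops the extreme value
scaled = z_score(cleaned)                # Z-score normalization of the rest
```

Removing the outliers before scaling keeps a single extreme bill amount from dominating the mean and standard deviation used for normalization.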

Fig. 3 Before outlier treatment

4.3 Data Visualization

This section provides insights into the data that help in understanding the relationships between different variables; the details of the data are explored using different visualization techniques. The demographics of the credit card users can also be examined: Figure 5 shows the split of the dependent variable by gender.

Fig. 4 After outlier treatment

Fig. 5 Gender demographics

4.4 Data Partitioning

The data was split into two parts, training and test data, so that the model can first be trained on the training data and then evaluated on the test data. This helps detect overfitting and shows how well the model generalizes to new data points. The data was split in a 70:30 ratio so that there are enough data points to train the model well while leaving a sufficient number for error analysis on the test data.
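A 70:30 partition can be sketched as follows. The random shuffle and the fixed seed are assumptions made for repeatability, and the integer list is only a stand-in for the 30,000 samples; the study does not state how its split was drawn.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(30000))  # stand-in for the 30,000 samples
train, test = train_test_split(rows)
```

With 30,000 samples this yields 21,000 training rows and 9,000 test rows, and no row appears in both parts.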

5 Performance and Evaluation

As seen in Fig. 6, the model used in this paper gives better results than the model used in previous papers. Although the accuracy of logistic regression decreases, the gain in the area under the ROC curve shows that the model used in this paper is more stable and better equipped to handle new incoming data.
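The two metrics compared here, accuracy and area under the ROC curve, can be computed as in the sketch below. The labels and scores are made-up values, the 0.5 cut-off is an assumption, and the AUC uses the rank-based formulation (the fraction of positive/negative pairs in which the positive case receives the higher score).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off
acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen cut-off, whereas the AUC summarizes the ranking quality across all cut-offs, which is why a model can lose a little accuracy and still gain in AUC.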

Fig. 6 Model outcome. 1 Yashna Sayjdhah, 2 model used in this paper

6 Model Comparison and Discussion

Four models have been used in this paper. Of the four, random forest provides the best accuracy and a comparable area under the ROC curve, making it the most stable and best-equipped model for predicting defaulters.

The decision tree was also considered, but since its area under the ROC curve is far lower, we chose random forest as the best-suited model for this paper.

7 Conclusion

Based on the exploratory analysis and the predictive models, we conclude the following. Almost half of the population is married. Most of the people are graduates or university educated. Males and females have equal percentages. Bill amounts are skewed and need to be treated. Most of the people are aged between 20 and 60. In terms of ratio, more men than women tend to default. Married people and those in the "others" category tend to default more than single people. People with school or university education tend to default more. The techniques used in this paper are random forest, decision tree, and logistic regression. Random forest has the best accuracy, 81.75%, whereas the base model (Yashna Sayjhda 2018) had an accuracy of 81%. The main objective of this model is to detect defaulters who take a loan from the bank and refuse or fail to repay it within the time given by the bank. This paper examines different parameters to determine which customers are more likely to default.