1 Introduction

Credit default, which generally refers to the failure to pay interest or principal on a loan when due, is a primary source of risk in the lending business. Distinguishing credit applicants with high default risk from credit-worthy ones has been identified as a crucial issue for risk management in financial institutions. Over the last decades, researchers and practitioners have sought to develop credit models using modern machine learning techniques [1] because of their ability to model complex multivariate functions without strict assumptions on the input data.

Despite encouraging successes in recent studies, accurate prediction of credit default using machine learning techniques is by no means a trivial task. Many of the challenges stem from the fact that the data in the task, i.e., samples of credit applicants, is generally imbalanced and noisy. Defaults, which are usually the main focus of the credit default prediction task, affect only a small segment of credit customers in a real credit business [2]. This results in a heavily skewed class distribution of credit data. In addition, credit data is collected from the loan records of financial institutions; however, due to privacy issues, system malfunctions or even human error, historical loan records are often incomplete or erroneous, making the credit data noisy. Therefore, in order to facilitate responsible decision-making for credit granting, the problems of class imbalance and noise in credit data should be fully addressed.

Factorization machines (FM), proposed by Rendle [3] in the context of recommender systems, is a predictive model that maps a number of predictor variables to some target. The advantage FM offers over traditional classification approaches is that it provides a principled way to model second-order (and, in theory, arbitrary-order) variable interactions in linear complexity. FM has shown great promise in a number of prediction tasks, such as context-aware recommendation [4, 5] and click-through rate prediction [6,7,8]. However, the potential of FM in credit risk evaluation has been little investigated so far. We argue that FM is well suited to the credit default prediction task for at least the following reasons. First, combinations of predictor variables (e.g., family, age, and salary), which are usually much more discriminative than single variables, can be naturally modeled through variable interactions in FM. Second, FM embeds features into a low-rank latent space such that variable interactions can be estimated under high sparsity; thus FM can be viewed as a favorable formalism for tackling sparse credit data.

In this work, we explore the use of FM for credit default prediction, with an emphasis on the class-imbalanced and noisy nature of credit data. We incorporate a new non-convex loss function into the learning process of FM, giving rise to a novel Robust Factorization Machines (RobustFM) model that enhances FM for prediction under class imbalance and noise. The new non-convex loss function is essentially a smoothed asymmetric Ramp loss [9] with additional degrees of freedom to tolerate the noise and imbalanced class distribution of credit data. Unlike the convex loss functions used in traditional FM, the new loss function is upper bounded, which enhances the robustness of the learning procedure. Furthermore, asymmetric margins are introduced to push learning towards achieving a larger margin on the rare class (defaulters).

The rest of the paper is organized as follows. In Sect. 2, we present preliminaries of this work, including Credit Default Prediction and Factorization Machines. We then present the details of the proposed RobustFM in Sects. 3 and 4. Experimental results are shown in Sect. 5. Finally, we review related work and conclude the paper in Sects. 6 and 7, respectively.

2 Preliminary

2.1 Credit Default Prediction

From a machine learning point of view, the credit default prediction task is generally formalized as binary classification. Formally, each credit applicant is represented by a set of features (e.g., applicant’s age, monthly income, education, employment and loan purpose), denoted as \(\mathbf {x} \in \mathbb {R}^d\), where d is the number of features. Each credit applicant belongs to one of two classes with a label \(y\in \{+1,-1\}\). In this work, we use \(+1\) and \(-1\) to denote credit applicants with high default risk (hereafter, bad applicants) and low default risk (hereafter, good applicants), respectively.

We are given a training set \(D=D^{(+)}\cup D^{(-)} \subset \mathbb {R}^d \times \{+1,-1\}\), in which \(D^{(+)}\) and \(D^{(-)}\) denote the sets of historical bad and good credit applicants, respectively. In general, \(|D^{(+)}| < |D^{(-)}|\). The goal of credit default prediction is to learn a function \(f:\mathbb {R}^d \rightarrow \{+1,-1\}\) that is capable of classifying a new credit applicant into one of the two classes.

2.2 Factorization Machines

Factorization Machines (FM) takes as input a real-valued vector \(\mathbf {x} \in \mathbb {R}^d\), and estimates the target by modelling pairwise interactions of sparse features using low-rank latent factors. The model equation of FM is formulated as:

$$\begin{aligned} \hat{y}(\mathbf {x};\mathrm{\Theta }) = w_0 + \sum _{j=1}^d w_j x_j + \sum _{j=1}^d\sum _{j'=j+1}^d \langle \mathbf {v}_j,\mathbf {v}_{j'} \rangle x_j x_{j'} \end{aligned}$$
(1)

where the parameters \(\mathrm{\Theta }\) to be estimated are:

$$ w_0\in \mathbb {R}; \quad \mathbf {w}\in \mathbb {R}^d; \quad \mathbf {V}=(\mathbf {v}_1,\cdots ,\mathbf {v}_d)\in \mathbb {R}^{p\times d} $$

In Eq. 1, the first two terms on the right-hand side are the global bias \(w_0\) and a linear combination of the features with weights \(w_j \ (1\le j\le d)\), and the last term models pairwise feature interactions using a factorized weighting scheme \(\hat{w}_{jj'}=\langle \mathbf {v}_j,\mathbf {v}_{j'} \rangle =\sum _{k=1}^p{v_{jk}\cdot v_{j'k}}\), where \(\mathbf {v}_j \) is the factor vector of the j-th feature. Feature factors in FM are commonly said to be low-rank because \(p\ll d\).

In addition to the theoretical soundness of low-rank feature factorization, FM is also efficient in practice because of its linear prediction time complexity. Computing the pairwise feature interactions directly requires time complexity of \(O(pd^2)\); however, it has been shown that the pairwise feature interactions in FM can be computed in O(pd) using the equivalent formulation of Eq. 1 [3]:

$$\begin{aligned} \hat{y}(\mathbf {x};\mathrm{\Theta }) = w_0 + \sum _{j=1}^d w_j x_j + \frac{1}{2}\sum _{k=1}^p \left( \left( \sum _{j=1}^d v_{jk}x_{j}\right) ^2 - \sum _{j=1}^d v_{jk}^2x_{j}^2 \right) \end{aligned}$$
(2)
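The equivalence between Eq. 1 and Eq. 2 can be checked numerically. The following is a minimal Python sketch, not the authors' implementation; the function names and the list-of-lists layout of \(\mathbf {V}\) (one length-p factor vector per feature) are assumptions made for illustration.

```python
from typing import Sequence


def fm_predict_naive(x: Sequence[float], w0: float, w: Sequence[float],
                     V: Sequence[Sequence[float]]) -> float:
    """Eq. 1: explicit loop over all feature pairs, O(p * d^2)."""
    d = len(x)
    y_hat = w0 + sum(w[j] * x[j] for j in range(d))
    for j in range(d):
        for jp in range(j + 1, d):
            dot = sum(a * b for a, b in zip(V[j], V[jp]))  # <v_j, v_j'>
            y_hat += dot * x[j] * x[jp]
    return y_hat


def fm_predict_fast(x: Sequence[float], w0: float, w: Sequence[float],
                    V: Sequence[Sequence[float]]) -> float:
    """Eq. 2: the same value computed in O(p * d)."""
    d, p = len(x), len(V[0])
    y_hat = w0 + sum(w[j] * x[j] for j in range(d))
    for k in range(p):
        s = sum(V[j][k] * x[j] for j in range(d))            # sum_j v_jk x_j
        s_sq = sum((V[j][k] * x[j]) ** 2 for j in range(d))  # sum_j v_jk^2 x_j^2
        y_hat += 0.5 * (s * s - s_sq)
    return y_hat


# Toy values chosen only for illustration: d = 3 features, p = 2 factors.
x = [1.0, 0.0, 2.0]
w0, w = 0.5, [0.1, -0.2, 0.3]
V = [[1.0, 2.0], [0.5, -1.0], [-1.0, 1.5]]
print(fm_predict_naive(x, w0, w, V), fm_predict_fast(x, w0, w, V))
```

Both calls return the same value up to floating-point rounding. Only the second form scales linearly in d, and for sparse inputs its inner sums can be restricted to the non-zero entries of \(\mathbf {x}\).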

The model parameters \(\mathrm{\Theta }\) of FM can be estimated by minimizing the empirical risk over the training set D, together with a regularization term on the parameters:

$$\begin{aligned} \mathcal {O}_D(\mathrm{\Theta }) = \frac{1}{|D|}\sum _{(\mathbf {x},y)\in D}{\ell \left( y,\hat{y}(\mathbf {x};\mathrm{\Theta })\right) } + \sum _{\theta \in \mathrm{\Theta }}{\lambda _\theta \theta ^2} \end{aligned}$$
(3)

where \(\ell (y,\hat{y})\) is the loss function that measures the disagreement between the prediction \(\hat{y}\) and the actual label y. Where there is no risk of confusion, we use \(\hat{y}\) to represent the prediction \(\hat{y}(\mathbf {x};\mathrm{\Theta })\) in the rest of the paper.

For binary classification, the most widely adopted loss function in FM is Logistic loss:

$$\begin{aligned} \ell _\mathrm{logit}(y,\hat{y}) = \ln {(1+e^{-y\hat{y}})} \end{aligned}$$

Despite its effectiveness in various prediction tasks, FM still suffers when learning from imbalanced and noisy data. When one class vastly outnumbers the other, the learning objective in Eq. 3 can be dominated by instances from the majority class. As such, FM tends to be overwhelmed by the majority class and to ignore the minority, yet important, class, which consists of the bad applicants in the credit default prediction task. Furthermore, the Logistic loss, like other convex loss functions, assigns high penalties to misclassified samples far from the origin (i.e., with a large negative \(y\hat{y}\)), increasing the chance that outliers contribute considerably to the global loss. Thus the parameters estimated by optimizing Eq. 3 may be biased by outliers in noisy datasets, leading to a suboptimal predictive model that attempts to account for these outliers.

In this work, instead of the Logistic loss, we incorporate into FM a new smoothed asymmetric Ramp loss that allows for class-dependent and upper-bounded penalties for misclassified instances. This results in RobustFM, a new extension of FM that addresses class imbalance and noise simultaneously, improving the accuracy of credit default prediction under real-world scenarios.
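The unbounded growth of the Logistic loss on badly misclassified samples can be seen directly. The snippet below is an illustrative sketch; the function name and the sample scores are assumptions, not values from the experiments.

```python
import math


def logistic_loss(y: int, y_hat: float) -> float:
    """ln(1 + exp(-y * y_hat)), written in an overflow-safe form."""
    m = y * y_hat
    # For m >= 0 this is log1p(exp(-m)); for m < 0 it is -m + log1p(exp(m)).
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


# A bad applicant (y = +1) that the model scores increasingly wrongly.
for y_hat in (2.0, 0.0, -2.0, -20.0, -200.0):
    print(y_hat, logistic_loss(+1, y_hat))
```

The loss grows roughly linearly as \(y\hat{y}\) becomes more negative, so a single mislabeled record with an extreme score can outweigh many correctly handled ones in the empirical risk.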

3 Smooth Asymmetric Ramp Loss

Ramp loss, a non-convex loss function proposed by Collobert [9], is essentially a “truncated” version of Hinge loss used in support vector machines:

$$\begin{aligned} \ell _\mathrm{R}(y,\hat{y};\gamma ) = \left\{ \begin{aligned}&1-\gamma&\mathrm{if}&\ y\hat{y}<\gamma \\&1-y\hat{y}&\mathrm{if}&\ \gamma \le y\hat{y} \le 1\\&0&\mathrm{if}&\ y\hat{y}>1 \end{aligned} \right. \end{aligned}$$

Intuitively, Ramp loss is constructed by flattening Hinge loss when the so-called functional margin \(y\hat{y}\) is smaller than a predefined parameter \(\gamma < 0\). In other words, a fixed non-zero penalty \(1-\gamma \), rather than the linear penalty \(1-y\hat{y}\) of Hinge loss, is applied to samples that are mistakenly predicted far away from the origin (i.e., \(y\hat{y}<\gamma \)).

Studies have shown that Ramp loss is more robust to noisy labels than Hinge loss [10, 11]. However, Ramp loss applies a uniform penalty, either \(1-\gamma \) or \(1-y\hat{y}\), to all samples regardless of the class to which they belong. As with Logistic loss and Hinge loss, the empirical risk based on Ramp loss will be dominated by negative instances if the class distribution is highly imbalanced. Ramp loss thus still suffers from imbalanced class distributions.
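The truncation can be made concrete with a few lines of Python. This is a sketch for illustration only; the function names and the default \(\gamma =-1\) are assumptions.

```python
def hinge_loss(y: int, y_hat: float) -> float:
    """Standard Hinge loss: max(0, 1 - y * y_hat)."""
    return max(0.0, 1.0 - y * y_hat)


def ramp_loss(y: int, y_hat: float, gamma: float = -1.0) -> float:
    """Ramp loss: Hinge loss flattened at 1 - gamma once y * y_hat < gamma."""
    m = y * y_hat  # functional margin
    if m < gamma:
        return 1.0 - gamma
    if m <= 1.0:
        return 1.0 - m
    return 0.0


# An outlier with a very negative margin: Hinge keeps growing, Ramp is capped.
print(hinge_loss(+1, -50.0), ramp_loss(+1, -50.0))
```

Equivalently, the Ramp loss is the minimum of the Hinge loss and the constant \(1-\gamma \), which is why it is bounded from above.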

One way to address the class-imbalance problem is to apply class-dependent penalties to the training errors. We introduce new parameters in Ramp loss to control the degree of penalty for positive and negative classes, and construct an asymmetric Ramp (aRamp) loss:

$$\begin{aligned} \ell _\mathrm{aR}(y, \hat{y};\gamma ,\tau ^{(+)},\tau ^{(-)}) = \left\{ \begin{aligned}&\tau ^{(y)}-\gamma&\mathrm{if}&\ y\hat{y}<\gamma \\&\tau ^{(y)}-y\hat{y}&\mathrm{if}&\ \gamma \le y\hat{y} \le \tau ^{(y)}\\&0&\mathrm{if}&\ y\hat{y}>\tau ^{(y)} \end{aligned} \right. \quad (y\in \{+1,-1\}) \end{aligned}$$

There are three parameters in the asymmetric Ramp loss: \(\gamma \ (\gamma <0)\) is the truncation parameter that decides the point at which the loss function is flattened; \(\tau ^{(+)}\) and \(\tau ^{(-)}\) are the hinge parameters for false negative and false positive errors, respectively. In general, \(\tau ^{(+)}> \tau ^{(-)}\ge 1\), since a false negative error is considered more serious than a false positive error in imbalanced classification problems.

One should note that the asymmetric Ramp loss is not differentiable at the truncation point (\(y\hat{y}=\gamma \)) and the hinge points (\(y\hat{y}=\tau ^{(y)}\)), whereas smoothness is a desirable property for gradient-based optimization techniques, e.g., stochastic gradient descent and alternating coordinate descent, which have been widely used for training FM. Motivated by the smoothing mechanism adopted in designing the Huber loss [12], we use a smooth quadratic function to approximate the asymmetric Ramp loss around the non-smooth points. More specifically, we derive a smooth asymmetric Ramp (saRamp) loss as follows:
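A direct transcription of the aRamp loss shows how the class-dependent hinge parameter changes the penalty. The sketch below is illustrative; the default values \(\gamma =-1\), \(\tau ^{(+)}=2\) and \(\tau ^{(-)}=1\) are assumptions, not the settings used in the experiments.

```python
def aramp_loss(y: int, y_hat: float, gamma: float = -1.0,
               tau_pos: float = 2.0, tau_neg: float = 1.0) -> float:
    """Asymmetric Ramp loss with a class-dependent hinge point tau^(y)."""
    tau = tau_pos if y == 1 else tau_neg
    m = y * y_hat  # functional margin
    if m < gamma:
        return tau - gamma
    if m <= tau:
        return tau - m
    return 0.0


# Same score on the decision boundary, different penalty per class.
print(aramp_loss(+1, 0.0), aramp_loss(-1, 0.0))
```

With these values, a bad applicant sitting on the decision boundary is penalized twice as much as a good applicant in the same position, and the positive class must reach a margin of \(\tau ^{(+)}\) before its loss vanishes.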

$$\begin{aligned} \ell _{\text {saR}}(y, \hat{y};\gamma ,\tau ^{(+)},\tau ^{(-)}, \delta ) = \left\{ \begin{aligned}&\tau ^{(y)}-\gamma&\mathrm{if}&\ y\hat{y}<\gamma -\delta \\&\tau ^{(y)}-y\hat{y}-\frac{(\gamma +\delta -y\hat{y})^2}{4\delta }&\mathrm{if}&\ \gamma -\delta \le y\hat{y} \le \gamma +\delta \\&\tau ^{(y)}-y\hat{y}&\mathrm{if}&\ \gamma +\delta< y\hat{y} < \tau ^{(y)}-\delta \\&\frac{(\tau ^{(y)}+\delta -y\hat{y})^2}{4\delta }&\mathrm{if}&\ \tau ^{(y)}-\delta \le y\hat{y} \le \tau ^{(y)}+\delta \\&0&\mathrm{if}&\ y\hat{y}>\tau ^{(y)}+\delta \\ \end{aligned} \right. \end{aligned}$$

The saRamp loss is quadratic on small intervals around the truncation point, \([\gamma -\delta , \gamma +\delta ]\), and the hinge point, \([\tau ^{(y)}-\delta , \tau ^{(y)}+\delta ]\), and linear or constant elsewhere. Figure 1 illustrates the aRamp loss and the saRamp loss with different interval lengths \(\delta \). It is easy to verify that \(\lim _{\delta \rightarrow 0}\ell _{\text {saR}}(y, \hat{y};\gamma ,\tau ^{(+)},\tau ^{(-)}, \delta )=\ell _\mathrm{aR}(y, \hat{y};\gamma ,\tau ^{(+)},\tau ^{(-)})\). We omit the proof for brevity. In practice, we set \(\delta =0.1\). Where there is no ambiguity, we abbreviate the saRamp loss \(\ell _{\text {saR}}(y, \hat{y};\gamma ,\tau ^{(+)},\tau ^{(-)}, \delta )\) as \(\ell _{\text {saR}}(y,\hat{y})\) hereafter.

Fig. 1. Asymmetric Ramp loss and smooth asymmetric Ramp loss
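The limit statement can be checked numerically. Working through the two quadratic pieces shows that the saRamp and aRamp losses differ by at most \(\delta /4\), with the largest gap at \(y\hat{y}=\gamma \) and \(y\hat{y}=\tau ^{(y)}\). The Python sketch below evaluates that gap on a grid; the function names, default parameter values and grid are assumptions for illustration, and the code requires \(\gamma +\delta < \tau ^{(y)}-\delta \).

```python
def aramp_loss(y, y_hat, gamma=-1.0, tau_pos=2.0, tau_neg=1.0):
    """Asymmetric Ramp loss (non-smooth at gamma and tau^(y))."""
    tau = tau_pos if y == 1 else tau_neg
    m = y * y_hat
    if m < gamma:
        return tau - gamma
    if m <= tau:
        return tau - m
    return 0.0


def saramp_loss(y, y_hat, gamma=-1.0, tau_pos=2.0, tau_neg=1.0, delta=0.1):
    """Smooth asymmetric Ramp loss: quadratic within delta of each kink."""
    tau = tau_pos if y == 1 else tau_neg
    m = y * y_hat
    if m < gamma - delta:
        return tau - gamma
    if m <= gamma + delta:
        return tau - m - (gamma + delta - m) ** 2 / (4.0 * delta)
    if m < tau - delta:
        return tau - m
    if m <= tau + delta:
        return (tau + delta - m) ** 2 / (4.0 * delta)
    return 0.0


def max_gap(y, delta):
    """Largest |saRamp - aRamp| over a grid of scores in [-3, 4]."""
    grid = [i / 100.0 for i in range(-300, 401)]
    return max(abs(saramp_loss(y, s, delta=delta) - aramp_loss(y, s)) for s in grid)


for delta in (0.5, 0.1, 0.01):
    print(delta, max_gap(+1, delta), max_gap(-1, delta))
```

The printed gaps shrink in proportion to \(\delta \), which is consistent with the saRamp loss converging to the aRamp loss as \(\delta \rightarrow 0\).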

The derivative of the saRamp loss w.r.t. the functional margin can be easily derived as follows:

$$\begin{aligned} \frac{\partial \ell _{\text {saR}}(y,\hat{y})}{\partial (y\hat{y})} = \left\{ \begin{aligned}&0&\mathrm{if}&\ y\hat{y}<\gamma -\delta \\&\frac{\gamma +\delta -y\hat{y}}{2\delta }-1&\mathrm{if}&\ \gamma -\delta \le y\hat{y} \le \gamma +\delta \\&-1&\mathrm{if}&\ \gamma +\delta< y\hat{y} < \tau ^{(y)}-\delta \\&-\frac{\tau ^{(y)}+\delta -y\hat{y}}{2\delta }&\mathrm{if}&\ \tau ^{(y)}-\delta \le y\hat{y} \le \tau ^{(y)}+\delta \\&0&\mathrm{if}&\ y\hat{y}>\tau ^{(y)}+\delta \\ \end{aligned} \right. \end{aligned}$$
(4)
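Eq. 4 can be sanity-checked against central finite differences of the saRamp loss. The sketch below writes both as functions of the functional margin \(m=y\hat{y}\); the function names, the step size and the default parameters are assumptions for illustration.

```python
def saramp(m, tau, gamma=-1.0, delta=0.1):
    """saRamp loss as a function of the functional margin m = y * y_hat."""
    if m < gamma - delta:
        return tau - gamma
    if m <= gamma + delta:
        return tau - m - (gamma + delta - m) ** 2 / (4.0 * delta)
    if m < tau - delta:
        return tau - m
    if m <= tau + delta:
        return (tau + delta - m) ** 2 / (4.0 * delta)
    return 0.0


def saramp_grad(m, tau, gamma=-1.0, delta=0.1):
    """Eq. 4: derivative of the saRamp loss w.r.t. the functional margin."""
    if m < gamma - delta:
        return 0.0
    if m <= gamma + delta:
        return (gamma + delta - m) / (2.0 * delta) - 1.0
    if m < tau - delta:
        return -1.0
    if m <= tau + delta:
        return -(tau + delta - m) / (2.0 * delta)
    return 0.0


def max_fd_error(tau, h=1e-6):
    """Worst mismatch between Eq. 4 and a central difference on a grid."""
    worst = 0.0
    for i in range(-300, 401):
        m = i / 100.0
        fd = (saramp(m + h, tau) - saramp(m - h, tau)) / (2.0 * h)
        worst = max(worst, abs(fd - saramp_grad(m, tau)))
    return worst


print(max_fd_error(tau=2.0), max_fd_error(tau=1.0))
```

The derivative is continuous and bounded in \([-1, 0]\): it is zero both for well-classified samples and for samples beyond the truncation point, which is what limits the influence of outliers on the updates.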

4 Parameter Estimation

To solve the highly non-convex problem in Eq. 3 at large scale, iterative optimization methods are usually preferred because of their simplicity and their flexibility in the choice of loss function. In this work, we employ Stochastic Gradient Descent (SGD), one of the most popular optimization methods for factorization models, to estimate the parameters of RobustFM. Simply put, SGD updates parameters iteratively until convergence. In each iteration, an instance \((\mathbf {x},y)\) is randomly drawn from the training data D, and the update is performed in the direction of the negative gradient of the objective w.r.t. each parameter \(\theta \in \mathrm{\Theta }\):

$$\begin{aligned} \theta ^{(t)} = \theta ^{(t-1)} - \eta \cdot \left( \frac{\partial \mathcal {O}_{\{(\mathbf {x},y)\}}(\mathrm{\Theta }^{(t-1)})}{\partial \theta } \right) \end{aligned}$$
(5)

where \(\eta > 0\) is the learning rate of gradient descent.

Plugging the learning objective Eq. 3 into Eq. 5, we derive the parameter updating formula:

$$\begin{aligned} \theta ^{(t)} = \theta ^{(t-1)} - \eta \cdot \left( \frac{\partial \ell _{\text {saR}}(y,\hat{y}(\mathbf {x};\mathrm{\Theta }^{(t-1)}))}{\partial \theta } + 2\lambda _\theta \theta ^{(t-1)} \right) \end{aligned}$$

Applying the chain rule to Eq. 4 yields the derivative of the saRamp loss w.r.t. model parameters:

$$\begin{aligned} \begin{aligned} \frac{\partial \ell _{\text {saR}}(y,\hat{y})}{\partial \theta }&= \frac{\partial \ell _{\text {saR}}(y,\hat{y})}{\partial (y\hat{y})} \cdot \frac{\partial (y\hat{y})}{\partial \theta } \\&= \left\{ \begin{array}{ll} y\cdot \left( \frac{\gamma \,+\,\delta \,-\,y\hat{y}}{2\delta }-1\right) \cdot \frac{\partial \hat{y}}{\partial \theta } &{} \text {If } \gamma -\delta \le y\hat{y} \le \gamma +\delta \\ -y\cdot \frac{\partial \hat{y}}{\partial \theta } &{} \text {If } \gamma +\delta< y\hat{y} < \tau ^{(y)}-\delta \\ -y\cdot \frac{\tau ^{(y)}\,+\,\delta \,-\,y\hat{y}}{2\delta }\cdot \frac{\partial \hat{y}}{\partial \theta } &{} \text {If } \tau ^{(y)}-\delta \le y\hat{y} \le \tau ^{(y)}+\delta \\ 0 &{} \text {Otherwise} \\ \end{array} \right. \end{aligned} \end{aligned}$$

where \(\frac{\partial \hat{y}}{\partial \theta }\) denotes the partial derivative of the FM model equation w.r.t. each parameter. According to Eq. 2, it can be written as follows:

$$\begin{aligned} \begin{aligned} \frac{\partial \hat{y}}{\partial w_0} =&\, 1 \\ \frac{\partial \hat{y}}{\partial w_j} =&\, x_j \ (1\le j\le d) \\ \frac{\partial \hat{y}}{\partial v_{jk}} =&\, x_j \sum _{j'\ne j}{v_{j'k}x_{j'}} \ (1\le j\le d, 1\le k\le p) \end{aligned} \end{aligned}$$
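The partial derivatives above can be computed in O(pd) by reusing the per-factor sums \(\sum _{j'} v_{j'k}x_{j'}\) from Eq. 2, since \(\sum _{j'\ne j} v_{j'k}x_{j'}\) is that sum minus \(v_{jk}x_j\). The following Python sketch checks one entry against a finite difference; the function names and toy values are assumptions for illustration.

```python
def fm_predict(x, w0, w, V):
    """FM model equation in the O(p * d) form of Eq. 2."""
    d, p = len(x), len(V[0])
    y_hat = w0 + sum(w[j] * x[j] for j in range(d))
    for k in range(p):
        s = sum(V[j][k] * x[j] for j in range(d))
        s_sq = sum((V[j][k] * x[j]) ** 2 for j in range(d))
        y_hat += 0.5 * (s * s - s_sq)
    return y_hat


def fm_gradients(x, V):
    """Partial derivatives of y_hat w.r.t. w0, each w_j and each v_jk."""
    d, p = len(x), len(V[0])
    sums = [sum(V[j][k] * x[j] for j in range(d)) for k in range(p)]
    g_w0 = 1.0
    g_w = list(x)
    # x_j * sum_{j' != j} v_j'k x_j'  =  x_j * (sums[k] - v_jk * x_j)
    g_V = [[x[j] * (sums[k] - V[j][k] * x[j]) for k in range(p)] for j in range(d)]
    return g_w0, g_w, g_V


# Toy values chosen only for illustration: d = 3 features, p = 2 factors.
x = [1.0, 0.0, 2.0]
w0, w = 0.5, [0.1, -0.2, 0.3]
V = [[1.0, 2.0], [0.5, -1.0], [-1.0, 1.5]]
g_w0, g_w, g_V = fm_gradients(x, V)

# Central finite difference on v_{3,2} (index [2][1]) as a cross-check.
h = 1e-6
V_hi = [row[:] for row in V]
V_lo = [row[:] for row in V]
V_hi[2][1] += h
V_lo[2][1] -= h
fd = (fm_predict(x, w0, w, V_hi) - fm_predict(x, w0, w, V_lo)) / (2.0 * h)
print(g_V[2][1], fd)
```

Features with \(x_j=0\) have zero gradient for both \(w_j\) and \(\mathbf {v}_j\), so an SGD step only needs to touch the parameters of the non-zero features of the sampled instance.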

Given the above equations, the parameter estimation procedure for RobustFM is summarized in Algorithm 1. Note that, for each instance, the runtime complexity of Algorithm 1 remains the same as that of traditional FM, i.e., \(O(p\cdot N_0(\mathbf {x}))\), where \(N_0(\mathbf {x})\) denotes the number of non-zero features of the instance. Even so, the learning procedure of RobustFM is more computationally efficient than that of traditional FM, because Algorithm 1 only iterates over the instances with non-zero gradient (line 5), whereas all instances in the training data have to be handled in each iteration of the traditional FM learning procedure.

Algorithm 1. Parameter estimation procedure for RobustFM
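The update rules above can be assembled into a compact SGD loop. The Python sketch below is one possible reading of the procedure described in the text, not a reproduction of Algorithm 1: the function names, the initialization of \(\mathbf {V}\), the fixed number of epochs, the default hyper-parameter values, and the choice to skip the regularization step together with the loss step for zero-gradient instances are all assumptions.

```python
import random


def saramp_grad(m, tau, gamma, delta):
    """Eq. 4: derivative of the saRamp loss w.r.t. the functional margin m."""
    if m < gamma - delta or m > tau + delta:
        return 0.0
    if m <= gamma + delta:
        return (gamma + delta - m) / (2.0 * delta) - 1.0
    if m < tau - delta:
        return -1.0
    return -(tau + delta - m) / (2.0 * delta)


def forward(x, w0, w, V):
    """Eq. 2 over non-zero features; also returns the per-factor sums."""
    nz = [j for j, xj in enumerate(x) if xj != 0.0]
    p = len(V[0])
    sums = [sum(V[j][k] * x[j] for j in nz) for k in range(p)]
    sq = [sum((V[j][k] * x[j]) ** 2 for j in nz) for k in range(p)]
    y_hat = w0 + sum(w[j] * x[j] for j in nz)
    y_hat += 0.5 * sum(s * s - q for s, q in zip(sums, sq))
    return y_hat, sums, nz


def train_robust_fm(data, d, p, epochs=20, eta=0.05, lam=0.0,
                    gamma=-1.0, tau_pos=2.0, tau_neg=1.0, delta=0.1, seed=0):
    """SGD with the saRamp loss; data is a list of (x, y) with y in {+1, -1}."""
    rng = random.Random(seed)
    w0, w = 0.0, [0.0] * d
    V = [[rng.gauss(0.0, 0.01) for _ in range(p)] for _ in range(d)]
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            y_hat, sums, nz = forward(x, w0, w, V)
            tau = tau_pos if y == 1 else tau_neg
            g = y * saramp_grad(y * y_hat, tau, gamma, delta)  # d loss / d y_hat
            if g == 0.0:
                continue  # flat region of the loss: skip this instance
            w0 -= eta * (g + 2.0 * lam * w0)
            for j in nz:
                w[j] -= eta * (g * x[j] + 2.0 * lam * w[j])
                for k in range(p):
                    grad_v = x[j] * (sums[k] - V[j][k] * x[j])
                    V[j][k] -= eta * (g * grad_v + 2.0 * lam * V[j][k])
    return w0, w, V


# Hypothetical toy data, only to show the call signature.
toy = [([1.0, 0.0], +1), ([0.0, 1.0], -1), ([1.0, 1.0], -1)]
w0, w, V = train_robust_fm(toy, d=2, p=2, epochs=30)
print(w0, w)
```

Each update costs \(O(p\cdot N_0(\mathbf {x}))\) because only the non-zero features are visited, and instances whose margin lies in a flat region of the loss are skipped without any parameter update.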

5 Experiments

5.1 Experimental Settings

Datasets. Several real-world credit datasets, including four public datasets and one private dataset, are used for the empirical evaluation of the proposed RobustFM. A summary of the five datasets is given in Table 1.

Australian, German and Taiwan are public credit datasets available from the UCI Machine Learning Repository that have been widely used in the literature. SomeCredit is the dataset of the Kaggle competition Give Me Some Credit, which aims to predict the probability of future financial distress of loan borrowers. Besides these public datasets, we use a private dataset, denoted as SD-RCB, in this experiment. The dataset is sourced from a regional bank in China which provides micro-credit services to self-employed workers and farmer households. In total, more than 50 thousand historical credit records were collected from the credit scoring system of the bank. The attributes of each credit record include customer demographics, credit application information, historical repayment behavior, etc.

From Table 1, it can be seen that the number of defaulters is less than that of non-defaulters in all these datasets, and the default rate is even less than 10% in SomeCredit and SD-RCB, the two large-scale real-world credit datasets. This provides practical evidence of the class-imbalance problem in real-world credit default prediction tasks.

Table 1. Statistics of credit datasets

Evaluation Measures. In order to compare the performance among different approaches, we employ the following three types of measures in this experiment:

  • Accuracy (Acc). Accuracy evaluates the correctness of categorical predictions: \(Acc=\frac{1}{N}\sum _{i=1}^N{\mathbb {I}_{[y_i = \mathrm{sgn}(\hat{y}_i)]}}\)

  • Brier Score (BS). Most classifiers give probabilistic predictions \(\hat{y}=\hat{p}(\pm 1|\mathbf {x})\) rather than categorical predictions \(\hat{y}=\pm 1\), so Accuracy can only be computed indirectly: a categorical prediction has to be inferred by applying a manually tuned threshold to the raw predictions. Unlike Accuracy, Brier Score evaluates the correctness of the raw probabilistic predictions, with lower values being better: \(BS=\frac{1}{N}\sum _{i=1}^N{\left( 1-\hat{p}(y_i|\mathbf{x}_i)\right) ^2}\)

  • Precision (Pre), Recall (Rec) and F-measure (\(F_1\)). One important shortcoming of Acc and BS is that a classifier is evaluated without taking into account the differences between classes. However, in credit default prediction tasks, the correctness of predictions on defaulters is much more important than that on non-defaulters. We thus employ three further performance measures: Precision, Recall and F-measure. Essentially, Precision reflects the type-I error (non-defaulter classified as defaulter), Recall reflects the type-II error (defaulter classified as non-defaulter), and F-measure is the harmonic mean of Precision and Recall.
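The measures listed above can be computed together from the labels and the predicted default probabilities. The Python sketch below is illustrative: the function name, the 0.5 threshold and the toy inputs are assumptions, defaulters (\(y=+1\)) are treated as the positive class, and the Brier Score is computed in its standard binary form (squared gap between the predicted default probability and the 0/1 outcome).

```python
def evaluate(y_true, p_default, threshold=0.5):
    """y_true in {+1, -1} (+1 = defaulter); p_default = predicted P(y = +1 | x)."""
    n = len(y_true)
    y_pred = [1 if p >= threshold else -1 for p in p_default]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, q in pairs if t == 1 and q == 1)
    fp = sum(1 for t, q in pairs if t == -1 and q == 1)   # type-I errors
    fn = sum(1 for t, q in pairs if t == 1 and q == -1)   # type-II errors
    acc = sum(1 for t, q in pairs if t == q) / n
    brier = sum(((1.0 if t == 1 else 0.0) - p) ** 2
                for t, p in zip(y_true, p_default)) / n
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2.0 * pre * rec / (pre + rec) if pre + rec else 0.0
    return {"Acc": acc, "BS": brier, "Pre": pre, "Rec": rec, "F1": f1}


# Made-up labels and probabilities, only to exercise the function.
metrics = evaluate([1, 1, -1, -1, -1], [0.9, 0.4, 0.6, 0.2, 0.1])
print(metrics)
```

Returning zero for Precision, Recall or \(F_1\) when a denominator is empty is a convention chosen here to keep the sketch total; other conventions are equally possible.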

Baselines. To verify the advantage of the proposed RobustFM, we compare it with several state-of-the-art credit default prediction models. First, the most widely used classification techniques in the task of credit default prediction [1], including logistic regression (LR), neural networks (NN) and support vector machines (SVM), are selected as baselines. Second, the traditional FM, as well as its extensions with traditional Hinge loss and Ramp loss, are applied to the task of credit default prediction and selected as baselines.

5.2 Performance Evaluation

In this experiment, five-fold cross validation is performed on each dataset and the average performance on the five folds is reported. Table 2 presents the experimental results and comparisons on each dataset.
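A five-fold protocol of this kind can be sketched as follows. The text only states that five-fold cross validation is used and that fold results are averaged; the shuffling, the seed, the function names and the user-supplied `fit_and_score` callable are assumptions for illustration.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) for k disjoint folds over n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # near-equal fold sizes
    for i in range(k):
        test = folds[i]
        train = [j for f in range(k) if f != i for j in folds[f]]
        yield train, test


def cross_validate(n, fit_and_score, k=5, seed=0):
    """Average a per-fold score; fit_and_score(train, test) is user-supplied."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores)


# Placeholder scorer that just reports the test-fold size.
print(cross_validate(10, lambda train, test: float(len(test))))
```

With heavily imbalanced data, one may prefer to split defaulters and non-defaulters separately so that each fold keeps a similar default rate; whether the experiments did so is not stated.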

Table 2. Prediction performance of each approach

It can be seen from Table 2 that the proposed RobustFM, compared to the baseline methods, achieves the highest \(F_1\) score on all five datasets. We perform a statistical significance test to check whether the improvements are significant. More specifically, a paired t-test is applied to the predicted results obtained by RobustFM and its nearest counterpart. The results indicate that the improvements of RobustFM are significant with \(p\text {-value} \le 0.05\) on the datasets German and Taiwan and \(p\text {-value} \le 0.01\) on the datasets SomeCredit and SD-RCB, marked with single and double asterisks in Table 2, respectively. Notably, the imbalance ratio of SomeCredit and SD-RCB is much higher than that of the other datasets. The comparison results support the effectiveness of RobustFM in dealing with imbalanced data, especially data with a high imbalance ratio and large size.

From Table 2, we make the following further observations:
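A paired t-test of the kind mentioned above can be sketched in a few lines. The text does not specify which paired quantities were compared, so the example below assumes per-fold \(F_1\) scores of two models; the numbers are made up for illustration and are not results from Table 2, and the function name is hypothetical.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired samples.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample std. dev. (n - 1)
    return statistics.mean(diffs) / std_err, n - 1


# Made-up per-fold F1 scores for two hypothetical models.
f1_model_a = [0.62, 0.60, 0.65, 0.61, 0.63]
f1_model_b = [0.58, 0.59, 0.60, 0.57, 0.60]
t_stat, dof = paired_t_statistic(f1_model_a, f1_model_b)

# Two-sided critical value of Student's t at the 0.05 level for 4 degrees of freedom.
T_CRIT_05_DOF4 = 2.776
print(t_stat, dof, abs(t_stat) > T_CRIT_05_DOF4)
```

With only five folds the test has four degrees of freedom, so the difference between two models has to be fairly consistent across folds to be declared significant.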

  i. Besides \(F_1\), RobustFM achieves the highest Recall on four of the five datasets. This is in fact favored in real-world credit default prediction tasks, in which missing a true defaulter is typically perceived as a more severe error than misclassifying a non-defaulter as a defaulter.

  ii. Achieving the highest \(F_1\) does not necessarily result in the best Accuracy and BS. As a matter of fact, RobustFM only achieves the highest Accuracy on the dataset Australian, which is a rather balanced dataset of small size. However, it is well recognized that Accuracy is not suitable for evaluating performance in class-imbalanced settings.

  iii. Among all the baselines, FM-based methods (FM, FM\(_{\text {Hinge}}\), and FM\(_{\text {Ramp}}\)) perform better than most traditional methods (LR, NN, and SVM) in terms of almost all performance measures. This result coincides with the findings of previous studies on FM from a variety of tasks such as click-through rate prediction and context-aware recommendation, and further supports our intuition, described earlier, about the advantages of FM when applied to credit prediction tasks.

5.3 Hyper-parameter Study

Compared with traditional FM, RobustFM has two additional hyper-parameters: \(\gamma \) and \(\tau ^{(+)}\). Experiments are performed to study how the prediction performance of RobustFM is affected by these parameters. More specifically, we fix one of the parameters, vary the other, and record the prediction performance in terms of Precision, Recall and \(F_1\) (see Fig. 2(a) and (b)). Due to space limitations, only the results on the dataset SomeCredit are reported; the results on the other datasets are similar.

Fig. 2. The effects of hyper-parameters

To choose the truncation parameter \(\gamma \), one may typically start from a number slightly smaller than 0 and then decrease \(\gamma \) to tune the level of learning insensitivity of RobustFM. From Fig. 2(a), it can be seen that the prediction performance of RobustFM varies only slightly when \(\gamma \in (2,\,4)\) and the optimal \(F_1\) score is achieved around \(\gamma =3.3\). As \(\gamma \) becomes smaller and smaller, the robustness of RobustFM is reduced, causing performance to degrade as depicted in Fig. 2(a). Similarly, to choose the margin parameter \(\tau ^{(+)}\), one may typically start from a number slightly larger than 1.0 and increase \(\tau ^{(+)}\). As \(\tau ^{(+)}\) increases, the classification hyperplane moves towards the majority class. This process is illustrated in Fig. 2(b), in which the Recall increases steadily as \(\tau ^{(+)}\) increases from 2.0 to 3.0. From Fig. 2(b), it can also be seen that the prediction performance of RobustFM varies only slightly when \(\tau ^{(+)} \in (2,\,5)\) and the optimal \(F_1\) score is achieved around \(\tau ^{(+)}=2.9\). Overall, these experiments indicate that the proposed RobustFM performs quite steadily over a wide range of values of the truncation parameter and the margin parameter.

6 Related Work

Credit default prediction has long been a central concern of financial risk management research. More recently, machine learning techniques, rather than simple statistical methods, have been widely applied in the literature. Extensive studies have demonstrated that machine learning techniques outperform classical statistical methods on various credit risk evaluation tasks. To date, almost all of the popular machine learning algorithms, e.g., support vector machines [13, 14], decision trees [15] and neural networks [16, 17], have been employed to construct credit risk models. Recent studies show that ensemble methods, which integrate the predictions of several individual classifiers, are a promising approach for credit risk modeling. A number of ensemble strategies have been proposed to construct more powerful credit risk models [18,19,20].

The class-imbalance problem in credit data has drawn attention in the literature. Several experimental studies have shown that most machine learning algorithms (e.g., decision trees, neural networks, etc.) perform significantly worse on imbalanced credit datasets [21, 22]. Recently, a few studies have tried to tackle the class-imbalance problem in credit data by developing specific feature selection and ensemble strategies [23].

7 Conclusion and Future Work

In this paper, we propose a novel approach, RobustFM, for the credit default prediction task. Compared with existing machine learning based credit risk models, the main advantage of RobustFM is that it addresses the issues of class imbalance and noise in credit data simultaneously. We demonstrate the effectiveness of RobustFM on the credit default prediction task via experimental evaluations on several real credit application datasets. The results suggest that the proposed RobustFM is a worthwhile choice for the credit default prediction task.

Several issues could be considered for future work. For example, there are additional hyper-parameters in RobustFM that need to be tuned to yield good predictions. Future work could apply automated machine learning techniques to determine suitable hyper-parameters automatically.