
1 Introduction

With the development of financial and capital markets, credit operations have become more voluminous and complex, implying the need for advances in mechanisms and models for risk measurement and management.

Given the increasing importance and sophistication of credit transactions and the consequent vulnerability of the financial system to systemic crises, international and local regulatory bodies are developing guidelines and establishing rules concerning exposure to credit risk by financial institutions.

For example, the Basel Committee on Banking Supervision (BCBS) has published several guidelines to be adopted by banks worldwide, including mechanisms for credit risk management (Schooner and Taylor 2010).

More specifically, the BCBS establishes as a relevant pillar the need for equity capital to cope with the degree of exposure to different types of risk (BIS 2006), including market and credit risk. Based on the BCBS guidelines, central banks in many countries require regulatory capital from financial institutions in order to absorb financial losses due to defaults by borrowers and the degradation of the credit quality of banks’ assets.

In particular, retail credit risk plays a relevant role for financial institutions (Burns 2002), since risk in the retail business could be seen as homogeneous due to diversification, which may result in significant savings in regulatory capital. In addition, banks whose proprietary models for estimating default probability comply with regulatory requirements may also be allowed to adopt internal mechanisms to calculate regulatory capital requirements, reducing the capital charge.

This study aims to analyze effective decision models for the credit risk analysis of retail portfolios. Using machine learning algorithms, this chapter assesses computationally intensive algorithms to classify an individual as a good or bad borrower.

The algorithms discussed in this study could also be adopted to analyze the credit risk of wholesale portfolios. However, computational learning mechanisms are most useful for retail portfolios, which provide more data and are more commonly subject to automated processes for credit applications involving small loan amounts.

Considering the various types of machine learning algorithms, this research studies the applicability of two ensemble methods, bagging and boosting, in credit risk analysis. Ensemble methods are computational mechanisms based on machine learning meant to improve traditional classification models. For instance, according to Freund and Schapire (1999), boosting, a traditional ensemble method, combined with simple discrimination techniques (hit rates slightly higher than 55 %), could reach up to 99 % correct classifications.

The adoption of ensemble methods in credit has been analyzed, for instance, by Lai et al. (2006), Alfaro et al. (2008), and Hsieh and Hung (2010). Their results have verified the efficacy of machine learning methods in real-life problems.

This chapter analyzes one example that shows how different classification techniques can be adopted by comparing the hit ratio of traditional and ensemble methods on a set of credit applications.

2 Theoretical Background

According to Johnson and Wichern (2007), discrimination and classification correspond to multivariate techniques that seek to separate distinct sets of objects or observations and that allow the allocation of new objects or observations into predefined groups.

Although the concepts of discrimination and classification are similar, Johnson and Wichern (2007) establish that discrimination is associated with describing, in an exploratory approach, the characteristics that differentiate observations from known distinct populations. Classification, on the other hand, is more related to allocating observations into classes, with a less exploratory perspective. According to Klecka (1980), classification is an activity in which discriminant variables or discriminant functions are used to predict the group to which a given observation most likely belongs.

Therefore, the usefulness of discrimination and classification in credit analysis is evident. They provide not only an understanding of the characteristics that discriminate, for example, good from bad borrowers, but also models that allocate potential borrowers into groups. In this case, a priori assumptions on relationships between specific characteristics of borrowers and default risk are unnecessary.

The seminal study by Fisher (1936), associated with the identification of discriminant functions for species of flowers, has given rise to relevant works on credit risk. For example, Durand (1941) focused on the analysis of automobile credit loans, and Altman (1968) applied discriminant analysis to predict business failures.

The discussion in this study focuses on general techniques that might improve credit analysis and that do not require distinguishing discrimination from classification. However, from a practical point of view, the ultimate goal of automated credit scoring models, more particularly the analysis of application scoring, is associated with classification, since the decision to grant the loan depends on the group to which a potential borrower is assigned.

2.1 Traditional Discrimination Techniques

From the discrimination point of view, credit analysis aims to study possible relationships between variables \({\mathbf{X}}\) representing \(n\) characteristics \(X_{1} ,X_{2} , \cdots , X_{n}\) of borrowers and a variable \(Y\) representing their credit quality.

In loan application studies, credit quality is commonly defined by a variable having a numerical scale score or a rating score, with ordinal variables; or by good/bad credit indicators, with nominal or categorical variables.

The main objective of this research is to develop a system for automated decision processes. Therefore, this study focuses on problems in which \(Y\) is a dichotomous variable, with (i) good borrower and (ii) bad borrower categories or groups. Of the several multivariate statistic-oriented classification techniques currently available (Klecka 1980), this study briefly discusses discriminant analysis, logistic regression, and the recursive partitioning algorithm.

2.1.1 Discriminant analysis

Discriminant analysis aims to determine the relationship between a categorical variable and a set of interval scale variables (Jobson 1992). By developing a multivariate linear function, discriminant analysis identifies the variables that segregate or distinguish groups of observations through scores (Klecka 1980).

According to credit-related studies, discriminant analysis generates one or more functions in order to better classify potential borrowers. From the mathematical point of view, the analysis of two groups (e.g., performing and non-performing loans) might require a discriminant function expressed as:

$$Y = a_{0} + a_{1} x_{1} + a_{2} x_{2} + a_{3} x_{3} + \cdots + a_{n} x_{n}$$

where

  • \(Y\) is the dependent variable, i.e., the score obtained by an observation;

  • \(a_{0} , \ldots , a_{n}\) are coefficients that indicate the influence of each independent variable on the classification of an observation; and

  • \(x_{1} ,\; \ldots \;, x_{n}\) are independent variable values associated with a given observation.

Thus, based on the coefficient values and the associated independent variables, a discriminant function determines a score that indicates the group to which an observation most likely belongs. In the credit analysis of retail loan applications, the variables encompass registration data and other information or characteristics of a potential borrower. Individuals with higher scores tend to have better ratings, indicating better credit quality and lower default probability.

The main assumptions of discriminant analysis include the following: (i) A discriminant variable cannot be a linear combination of other independent variables; (ii) the variance–covariance matrices of each group must be equal; and (iii) the independent variables have a multivariate normal distribution (Klecka 1980).

It is worth noting that discriminant analysis is one of the most common borrower classification techniques in application scoring models, after the studies of Altman (1968) verified its efficacy.
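As a brief illustration, the following R sketch fits a two-group discriminant function of this form on simulated applicant data; the variable names, the simulated labels, and the use of the MASS package are assumptions made only for this example.

```r
# Minimal sketch of a two-group discriminant analysis for application scoring.
# The simulated characteristics x1, x2 and the good/bad ("G"/"B") labels are
# illustrative assumptions; MASS::lda() estimates the discriminant function.
library(MASS)

set.seed(1)
n  <- 500
x1 <- rnorm(n)                                  # e.g., a standardized income proxy
x2 <- rnorm(n)                                  # e.g., a standardized indebtedness proxy
bad <- rbinom(n, 1, plogis(-1 - 2 * x1 + x2))   # 1 = bad borrower, 0 = good borrower
d  <- data.frame(x1, x2, quality = factor(ifelse(bad == 1, "B", "G")))

fit_lda <- lda(quality ~ x1 + x2, data = d)
fit_lda$scaling                                  # coefficients of the discriminant score

# Allocate a new applicant to the group with the highest posterior probability
predict(fit_lda, data.frame(x1 = 0.5, x2 = -0.2))$class
```

In this sketch, the entries of fit_lda$scaling play the role of the coefficients \(a_{1} , \ldots , a_{n}\) of the discriminant function above.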

2.1.2 Logistic regression

Many social phenomena are discrete or qualitative, in contrast to situations that require the continuous measurement of quantitative data (Pampel 2000). Credit quality classification focusing on good or bad borrowers is typically qualitative and represents a binary phenomenon.

In a dichotomous model, logistic regression is an alternative to discriminant analysis for the classification of potential borrowers. In logistic regression, the dependent variable \(Y\) is defined as a binary variable with values 0 or 1, and the independent variables \({\mathbf{X}}\) are associated with the characteristics or events of each group.

Without loss of generality, group 0 could be defined as good borrowers, and group 1 as non-payers or bad borrowers. A logistic function gives the default probability of a given individual:

$$P_{i} [Y = 1 | {\mathbf{X}} = {\mathbf{x}}_{{\mathbf{i}}} ] = \frac{1}{{1 + e^{ - Z} }}$$

where

  • \(P_{i}\) is the probability of individual \(i\) belonging to the default group;

  • \(Z = b_{0} + b_{1} x_{1} + b_{2} x_{2} + \cdots + b_{n} x_{n}\) is a score whose coefficients can be estimated from a sample, for instance.

Considering the use of logistic regression for credit analysis, \(P_{i}\) is the probability of a counterpart \(i\) being a bad borrower. It depends on several independent variables \({\mathbf{X}}\) related to relevant characteristics that may affect credit quality.

When the assumptions of discriminant analysis and logistic regression are observed, both methods give comparable results. However, when the normality assumptions of the variables or variance–covariance matrix equality between groups are not observed, results might differ considerably. Logistic regression, given its less restrictive assumptions, is a technique widely used by the market for credit analysis.
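As a counterpart to the previous sketch, the following R code, again on simulated data with assumed variable names, estimates the coefficients \(b_{0} , \ldots , b_{n}\) of the logistic score \(Z\) and the resulting default probabilities.

```r
# Minimal sketch of logistic regression for simulated application data;
# variable names and the simulated default indicator are illustrative assumptions.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 - 2 * x1 + x2))     # 1 = bad borrower (default), 0 = good

# Z = b0 + b1*x1 + b2*x2 and P[Y = 1 | X = x] = 1 / (1 + exp(-Z))
fit_logit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
coef(fit_logit)                                   # estimated b0, b1, b2

# Estimated default probability for a new applicant
predict(fit_logit, newdata = data.frame(x1 = 0.5, x2 = -0.2), type = "response")
```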

2.1.3 Recursive partitioning algorithm

A less traditional technique for discrimination between groups, the recursive partitioning algorithm involves non-parametric, classification tree-based modeling (Thomas et al. 2002).

According to Feldesman (2002), classification trees have several advantages compared to parametric models: (i) they do not require data transformations, such as the logit function in logistic regression analysis; (ii) missing observations do not require special treatment; and (iii) a successful classification does not depend on normality assumptions of the variables or on equal variance–covariance matrices between groups, as required in discriminant analysis.

The foundations of the recursive partitioning algorithm lie in the subdivision of a set of observations into two parts, like branches of a tree, so that subsequent subgroups are increasingly homogeneous (Thomas et al. 2002). The subdivision is based on reference values of the variables that explain the differences among the groups. Observations with values higher than the reference value are allocated to one group, while observations with lower values are classified into another group.

Thus, for each relevant variable, the algorithm sets a reference value that defines the subgroups. For example, if the discriminant variable \(X\) is continuous, the algorithm generates a cutoff value \(k\). As a result, the two groups comprise observations with \(X < k\) and \(X \ge k\), respectively. The definition of the cutoff value \(k\) is a relevant step in the classification tree model.

When the discriminant variable \(X\) is categorical, the algorithm checks all the possible splits into two categories and defines a measurement to classify the groups (Thomas et al. 2002). By repeating this procedure for several relevant variables, one can build a set of simple rules based on values higher or lower than a reference value for each discriminant variable. Observations can then be classified into a final group according to this set of rules.

Classification trees allow an intuitive and easy representation of the elements that explain each group (Breiman et al. 1984). Credit analysis studies that adopt the classification tree model are not as common as the parametric model-based ones, but are found, for example, in Coffman (1986).

For discrimination among groups, discriminant analysis and logistic regression are parametric statistical techniques; the possible relationships between the borrower’s characteristics and credit quality can be analyzed by means of the coefficients of the independent variables in the model. In the case of partition algorithms or decision trees, which are mainly non-parametric techniques, the cutoff values associated with the explanatory variables identify the good and the bad borrowers. However, depending on how complex the recursive partitioning model is, assessing the influence of each variable in explaining credit quality might be difficult.
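As a brief illustration, the R sketch below fits a classification tree with the rpart package on simulated data (variable names are assumptions); printing the fitted tree exposes the cutoff values \(k\) chosen for each discriminant variable.

```r
# Minimal sketch of a recursive partitioning (classification tree) model using
# the rpart package; the simulated data and variable names are illustrative.
library(rpart)

set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
quality <- factor(ifelse(rbinom(n, 1, plogis(-1 - 2 * x1 + x2)) == 1, "B", "G"))
d  <- data.frame(x1, x2, quality)

# Each split selects a discriminant variable and a cutoff k (X < k vs. X >= k)
fit_tree <- rpart(quality ~ x1 + x2, data = d, method = "class")
print(fit_tree)                                   # the set of splitting rules

# Classify a new applicant by following the rules down to a terminal node
predict(fit_tree, data.frame(x1 = 0.5, x2 = -0.2), type = "class")
```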

2.2 Classification Techniques

Considering the distinction suggested by Johnson and Wichern (2007), one could argue that discrimination has the merit of allowing, under a more exploratory aspect, the evaluation of specific characteristics that may explain the inclusion of an observation within a particular group.

However, in some situations, explaining reasons for a variable to influence credit quality is less relevant than the actual rating itself. For example, under a practical perspective, if a given financial institution needs to analyze a large number of credit applications, it might need to develop an automated mechanism for quick and accurate classification rather than a discrimination pattern to explain how variables influence a possible default.

Regarding classification applicability and guidance, machine learning is an artificial intelligence field that aims to develop algorithms for computer programs or systems to learn from experience or data (Langley 1995).

Machine learning techniques, such as neural network algorithms and decision trees, are an alternative to traditional statistical methods, which often rely on mechanisms with extremely restrictive assumptions, such as normality, linearity, and independence of explanatory variables (Kuzey et al. 2014).

It is worth mentioning that recursive partitioning-based algorithms (e.g., decision trees, within certain limits, especially with a small number of variables and a simple model) could also create discrimination mechanisms. Chien et al. (2006), for example, establish a classification tree model based on discriminant functions. In contrast, traditional neural networks are typical observation-based classification techniques, as their underlying model is encapsulated in a black box (Ugalde et al. 2013).

From a paradigm more focused on pattern recognition for classification, the machine learning approach is, according to the computer science literature, a set of algorithms specifically designed to address computationally intensive problems by exploring the extremely large databases of banks (Khandani et al. 2010).

From the credit analysis perspective, therefore, the machine learning methods are increasingly useful, given the computers’ increasing processing power that, in turn, speeds up pattern recognition of good and bad payers. It is worth noting that loan databases of financial institutions could surpass ten million transactions, each one involving several variables, including borrower registration and transaction-related data.

This study focuses on machine learning techniques known as ensemble methods. According to Opitz and Maclin (1999), an ensemble consists of a set of individually trained functions whose predictions are combined to classify new observations. That is, the basic idea of the ensemble construction approach is to make predictions from an overall mechanism by integrating multiple models, which generates more accurate and reliable estimates (Rokach 2009).

According to Bühlmann and Yu (2003), the origin of ensemble methods can be traced to Tukey (1977), who combined two linear regression models, the first fitted to the original data and the second fitted to the errors of the first. Thus, applying a technique several times to the data and to the errors is an early example of an ensemble method. Considering the development of statistical theory and increasingly powerful computational machines, model combinations might be deployed in more complex applications.

Several authors, such as Breiman (1996), Bauer and Kohavi (1999), and Maclin and Opitz (1997), have pointed out substantial improvements in classification using ensemble methods. Considering these performance gains, ensemble methods, or ensemble learning methods, constitute one of the most widely accepted streams of research in supervised learning (Mokeddem and Belbachir 2009).

Hsieh and Hung (2010) mention that the ensemble methodology has been used in many areas of knowledge. For example, Tan et al. (2003) apply ensemble methods in bioinformatics to multi-class protein classification problems. In geography and sociology, Bruzzone et al. (2004) detect land cover by combining image classification functions. Maimon and Rokach (2004) use ensemble decision tree techniques for mining manufacturing data.

The number of finance studies that adopt ensemble methods has also increased. For example, Leigh et al. (2002) make predictions on New York Stock Exchange values through technical analysis pattern recognition, neural networks, and genetic algorithms. Lai et al. (2007) study the value at risk of crude oil positions through ensemble methods that adopt wavelet analysis and artificial neural networks.

Regarding ensemble methods for credit applications, Lai et al. (2006) adopted reliability-based neural network ensembles, Alfaro et al. (2008) adopted neural networks in bankruptcy analysis, and Hsieh and Hung (2010) assessed credit scores by combining neural networks, Bayesian networks, and support vector machines.

This study analyzes two traditional ensemble-based algorithms: bagging and boosting. According to Dietterich (2000), the two most popular ensemble techniques are bagging, or bootstrap aggregation, developed by Breiman (1996), and boosting, first proposed by Freund and Schapire (1998). The best-known boosting algorithms belong to the AdaBoost family. Boosting is also known as arcing (adaptive resampling and combining), due to Breiman’s work (1998), which brought new ways of understanding and using boosting algorithms.

Within the context of ensemble methods, bagging and boosting are two general mechanisms aimed at enhancing the performance of a particular learning algorithm, called the basic algorithm (Freund and Schapire 1998). These methods reduce estimation error variances (Tumer and Ghosh 2001) but do not necessarily increase bias (Rokach 2005), providing gains both from the statistical theory perspective and from the real-world applicability perspective. Bartlett and Shawe-Taylor (1999) reported that such methods may even reduce bias.

According to Freund and Schapire (1998), bagging and boosting algorithms are similar in the sense that they incorporate modified versions of the basic algorithm subject to disturbances in the sample. Both methods are based on resampling techniques that obtain different training datasets for each of the model classifiers (Opitz and Maclin 1999). In the case of classification problems, the set of training data allows establishing matching or classification rules derived from a majority vote, for example.

The algorithms also show significant differences. The main one is that, in bagging, disturbances are introduced randomly and independently, whereas in boosting the disturbances are serial and deterministic, since each new rule depends heavily on all previously generated rules (Freund and Schapire 1998).

Next, this work introduces the fundamentals of the bagging and boosting methods for credit scoring. Similar ensemble method applications have been assessed by other authors, e.g., Paleologo et al. (2010), who apply bagging to credit scoring, and Xie et al. (2009), who analyze boosting combined with logistic regression.

2.2.1 Bagging

Bagging is a technique developed to reduce variance and has attracted attention due to its simple implementation and its reliance on the popular bootstrap method.

The following description of the bagging algorithm is based on Breiman (1996).

  1. Consider initially a classification model based on pairs \(\left( {X_{i} ,Y_{i} } \right)\), \(i = 1, \ldots ,n\), representing the observations, where \(X_{i} \in {\mathbb{R}}^{d}\) indicates the \(d\) independent variables that explain the classification into a given group.

  2. The target function is \(P\left[ {Y = j | X = x} \right]\) \(\left( {j = 0,1, \ldots ,J - 1} \right)\) in the case of a classification problem with \(J\) groups, \(Y_{i} \in \left\{ {0,1, \ldots ,J - 1} \right\}\). The classification function estimator is \(\hat{g}\left( \cdot \right) = h_{n} \left( {\left( {X_{1} ,Y_{1} } \right),\left( {X_{2} ,Y_{2} } \right), \ldots ,\left( {X_{n} ,Y_{n} } \right)} \right)\left( \cdot \right):{\mathbb{R}}^{d} \to {\mathbb{R}}\), where \(h_{n}\) is a model used to classify the observations into the groups.

  3. The classification function can be, for instance, a traditional discrimination technique, e.g., discriminant analysis, logistic regression, or a recursive partitioning model.

  4. Build a random bootstrap sample \(\left( {X_{1}^{*} ,Y_{1}^{*} } \right), \ldots ,\left( {X_{n}^{*} ,Y_{n}^{*} } \right)\) from the original sample \(\left( {X_{1} ,Y_{1} } \right), \ldots ,\left( {X_{n} ,Y_{n} } \right)\).

  5. Calculate the bootstrap estimator \(\hat{g}^{*} \left( \cdot \right)\) using the plug-in principle, i.e., \(\hat{g}^{*} = h_{n} \left( {\left( {X_{1}^{*} ,Y_{1}^{*} } \right), \ldots ,\left( {X_{n}^{*} ,Y_{n}^{*} } \right)} \right)\left( \cdot \right)\).

  6. Repeat steps 4 and 5 \(M\) times. Frequently, \(M\) is chosen to be 50 or 100, yielding \(M\) estimators \(\hat{g}^{*k} \left( \cdot \right)\) \(\left( {k = 1, \ldots ,M} \right)\).

  7. The bagged estimator is given by \(\hat{g}_{\text{Bag}} \left( \cdot \right) = M^{ - 1} \mathop \sum \nolimits_{k = 1}^{M} \hat{g}^{*k} \left( \cdot \right)\), which approximates \(E^{*} \left[ {\hat{g}^{*} \left( \cdot \right)} \right]\).

    In application scoring problems, each bootstrapped sample implies new coefficient estimates when the bagging procedure is coupled with discriminant analysis or logistic regression, or new cutoff values in a decision tree when bagging is coupled with the recursive partitioning algorithm. Since the \(M\) bootstrapped samples generate \(M\) different classifications, one common mechanism to classify a new individual is a majority vote over the classifications derived from the \(\hat{g}^{*k} \left( \cdot \right)\) functions, as sketched below.
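A minimal R sketch of this bagging procedure, assuming a classification tree (rpart) as the base classification function and simulated data, is shown below; it builds \(M\) bootstrap estimators and classifies a new observation by majority vote.

```r
# Minimal sketch of bagging with a classification tree as base learner.
# Data, variable names, and M = 50 replications are illustrative assumptions.
library(rpart)

set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
quality <- factor(ifelse(rbinom(n, 1, plogis(-1 - 2 * x1 + x2)) == 1, "B", "G"))
d  <- data.frame(x1, x2, quality)

M       <- 50                                     # number of bootstrap replications
new_obs <- data.frame(x1 = 0.5, x2 = -0.2)        # observation to be classified

votes <- replicate(M, {
  boot_idx <- sample(seq_len(n), size = n, replace = TRUE)   # bootstrap sample (step 4)
  g_star   <- rpart(quality ~ x1 + x2, data = d[boot_idx, ], method = "class")  # step 5
  as.character(predict(g_star, new_obs, type = "class"))     # k-th classification
})

# Bagged classification: majority vote among the M bootstrap classifiers
names(which.max(table(votes)))
```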

2.2.2 Boosting

Boosting is an ensemble technique that aggregates a series of simple methods, known as weak classifiers due to their low individual performance in classifying objects, generating a combination that leads to a classification rule with better performance (Freund and Schapire 1998).

In contrast with bagging, boosting relies on classifiers and subsamples that are obtained sequentially. In every step, training data are rebalanced to give more weight to incorrectly classified observations (Skurichina and Duin 2002). Therefore, the algorithm rapidly focuses on observations that are more difficult to analyze or classify.

The description of the AdaBoost algorithm here is based on the Freund and Schapire (1999) study. Consider \(Y = \left\{ { - 1, + 1} \right\}\) as the possible values of the classification problem. In a credit application context, for instance, a negative value may represent a bad borrower, and a positive value may represent a good borrower.

Boosting implies a repeated execution of a weak learning mechanism, e.g., discriminant analysis, logistic regression, or a decision tree approach, using subsamples of the original set. Differently from bagging, which generates uniform random samples with replacement, choosing new subsamples in boosting depends on a probability distribution that changes at each step, reflecting the mistakes and successes of the weak classification functions.

A boosting algorithm can be described as in Freund and Schapire (1999).

  1. Define the weights \(D_{t} \left( i \right)\) of the training sample. Initially, the weights, i.e., the probabilities of choosing any observation, are equal. Thus, given \(\left( {x_{1} ,y_{1} } \right), \ldots ,\left( {x_{m} ,y_{m} } \right)\), with \(x_{i} \in X\) and \(y_{i} \in Y = \left\{ { - 1, + 1} \right\}\), set \(D_{1} \left( i \right) = \frac{1}{m}\).

  2. Establish a weak hypothesis or function \(h_{t}\) that allows a simple classification of a given element as \(- 1\) or \(+ 1\), i.e., \(h_{t} :X \to \left\{ { - 1, + 1} \right\}\). This function can be, for instance, a traditional statistical technique such as the recursive partitioning algorithm.

  3. The classification function has an error \(\varepsilon_{t} = \Pr_{{i\sim D_{t} }} \left[ {h_{t} \left( {x_{i} } \right) \ne y_{i} } \right] = \mathop \sum \nolimits_{{i:h_{t} \left( {x_{i} } \right) \ne y_{i} }} D_{t} \left( i \right)\), i.e., the error is the sum of the probabilities of the observations for which the weak function leads to wrong classifications relative to the true values in the sample. It is important to emphasize that the error is measured under the distribution \(D_{t}\) with which the weak function was trained.

  4. Once the weak hypothesis \(h_{t}\) has been established, boosting defines a parameter \(\alpha_{t}\) that measures the relative importance of \(h_{t}\). The higher the error \(\varepsilon_{t}\), the lower \(\alpha_{t}\) and the less important \(h_{t}\) is in the classification problem. In boosting, the relative importance of each weak classification function is given by \(\alpha_{t} = \frac{1}{2}\ln \left( {\frac{{1 - \varepsilon_{t} }}{{\varepsilon_{t} }}} \right)\).

  5. The distribution \(D_{t}\) is updated by increasing the weight of the observations that are wrongly classified by \(h_{t}\), and by decreasing the weight of the observations that are correctly classified, following the equation

    \(D_{t + 1} \left( i \right) = \frac{{D_{t} \left( i \right)}}{{Z_{t} }} \times \left\{ {\begin{array}{*{20}l} {e^{{ - \alpha_{t} }} } & {{\text{if}}\;h_{t} \left( {x_{i} } \right) = y_{i} } \\ {e^{{\alpha_{t} }} } & {{\text{if}}\;h_{t} \left( {x_{i} } \right) \ne y_{i} } \\ \end{array} } \right.\), where \(Z_{t}\) is a normalization factor, so that \(D_{t + 1}\) is a probability distribution. Therefore, at each successive boosting step, the observations that are not correctly classified become more likely to be selected in the new subsample.

  6. After \(T\) iterations, the final classification model \(H\) is defined by the weak function of each step weighted by \(\alpha_{t}\), i.e., \(H\left( x \right) = {\text{sign}}\left( {\mathop \sum \nolimits_{t = 1}^{T} \alpha_{t} h_{t} \left( x \right)} \right)\).

    In a retail loan application, an individual is considered a good borrower if \(H\left( x \right)\) has a positive sign and a bad borrower if \(H\left( x \right)\) is negative, as sketched below.
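A minimal R sketch of this AdaBoost scheme is given below, assuming a depth-one classification tree (a decision stump fitted with rpart) as the weak hypothesis \(h_{t}\) and simulated data with labels coded as \(-1\) and \(+1\); names and the number of iterations are illustrative assumptions.

```r
# Minimal sketch of AdaBoost with a decision stump (rpart, maxdepth = 1) as the
# weak hypothesis; data, names, and T = 20 iterations are illustrative assumptions.
library(rpart)

set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- ifelse(rbinom(n, 1, plogis(-1 - 2 * x1 + x2)) == 1, -1, +1)   # -1 = bad, +1 = good
d  <- data.frame(x1, x2, y = factor(y))

T_steps <- 20
w       <- rep(1 / n, n)                # D_1(i): uniform initial weights
alphas  <- numeric(T_steps)
stumps  <- vector("list", T_steps)

for (t in seq_len(T_steps)) {
  # Weak hypothesis h_t fitted under the current distribution D_t (case weights)
  stumps[[t]] <- rpart(y ~ x1 + x2, data = d, weights = w, method = "class",
                       control = rpart.control(maxdepth = 1))
  pred <- as.numeric(as.character(predict(stumps[[t]], d, type = "class")))
  eps  <- sum(w[pred != y])             # weighted error epsilon_t (assumed in (0, 1))
  alphas[t] <- 0.5 * log((1 - eps) / eps)
  # Update D_{t+1}: up-weight misclassified observations, then normalize
  w <- w * exp(-alphas[t] * y * pred)
  w <- w / sum(w)
}

# Final classifier H(x) = sign(sum_t alpha_t * h_t(x))
H <- function(newdata) {
  scores <- sapply(seq_len(T_steps), function(t)
    alphas[t] * as.numeric(as.character(predict(stumps[[t]], newdata, type = "class"))))
  sign(rowSums(matrix(scores, nrow = nrow(newdata))))
}
H(data.frame(x1 = c(0.5, -1.5), x2 = c(-0.2, 0.3)))    # +1 = good, -1 = bad
```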

3 Results

In order to show how these ensemble methods of machine learning work, a credit transaction database in the UCI Machine Learning Repository of the Center for Machine Learning and Intelligent Systems at the University of California at Irvine (Bache and Lichman 2013) was used. This database, also used by Quinlan (1987) and Quinlan (1992), encompasses credit card applications in Australia and consists of 690 observations of 15 variables.

Given the confidentiality of the information, the database provides only the values and the scale of the variables. Neither the individuals nor the meaning of the observations or variables is identified. The credit quality-related variable has two categories: good borrower (G) and bad borrower (B). This limited information ensures data confidentiality but does not affect the analysis, considering that the research objective is associated with the classification of observations using various quantitative techniques.

After analysis of the database, missing data were eliminated, resulting in 653 valid observations in the final sample. In order to run the analysis, we focused on 7 variables to classify individuals: 6 continuous variables and 1 nominal variable comprising two categories. We aim to study how the machine learning mechanisms behave in classification problems with a limited amount of information.

The final sample observations were divided randomly into two subsamples (training and validation sets) with virtually the same number of elements. A script written in R was used, taking into account the characteristics of each technique, and confusion matrices were generated for both the training and the testing subsamples.

The classification results were obtained through (i) discriminant analysis, (ii) logistic regression, (iii) the recursive partitioning algorithm, (iv) bagging, and (v) boosting, for different numbers of iterations (\(N\)). The ensemble methods analyzed were coupled with the recursive partitioning algorithm.
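For reference, a hedged R sketch of this workflow is given below. It assumes that the credit application data have already been read into a data frame named credit, with a two-level factor quality ("G"/"B") and the selected explanatory variables, and it uses the adabag package for the ensemble methods coupled with the recursive partitioning algorithm; object names and tuning values (e.g., mfinal = 50) are illustrative assumptions rather than the exact specification of this study.

```r
# Hedged sketch of the comparison workflow. The data frame `credit` (a two-level
# factor `quality` plus the selected explanatory variables) is assumed to have
# been loaded beforehand; the adabag package is assumed for bagging and boosting
# coupled with rpart classification trees.
library(MASS)     # lda()
library(rpart)    # recursive partitioning
library(adabag)   # bagging() and boosting()

set.seed(1)
credit <- na.omit(credit)                                   # drop missing data
idx    <- sample(seq_len(nrow(credit)), floor(nrow(credit) / 2))
train  <- credit[idx, ]                                     # training subsample
test   <- credit[-idx, ]                                    # validation subsample

fit_lda   <- lda(quality ~ ., data = train)
fit_logit <- glm(quality ~ ., data = train, family = binomial)   # models P(2nd level)
fit_tree  <- rpart(quality ~ ., data = train, method = "class")
fit_bag   <- bagging(quality ~ ., data = train, mfinal = 50)
fit_boost <- boosting(quality ~ ., data = train, mfinal = 50)

# Confusion matrices (observed vs. predicted) on the testing subsample
lv <- levels(train$quality)
table(test$quality, predict(fit_lda, test)$class)
table(test$quality,
      ifelse(predict(fit_logit, test, type = "response") > 0.5, lv[2], lv[1]))
table(test$quality, predict(fit_tree, test, type = "class"))
table(test$quality, predict(fit_bag, newdata = test)$class)
table(test$quality, predict(fit_boost, newdata = test)$class)
```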

Tables 1 and 2 show the classification results, in absolute and in percentage terms, for the training and validation samples, respectively. Table 3 shows the overall classification results, with hit and error ratios.

Table 1 Classification results—absolute numbers
Table 2 Classification results—percentage
Table 3 Overall classification results

This study’s dataset implies some relevant results. Discriminant analysis and logistic regression results were identical, in accordance with Press and Wilson’s (1978) argument that, for most studies, the two methods are unlikely to lead to significantly different results.

Interestingly, for good borrowers, discriminant analysis and logistic regression show better classification results (25 %) in the testing subsample, vis-à-vis the training subsample (21 %). Therefore, for the good borrower group, the traditional parametric models are more consistent with the validation sample when compared to the calibration sample.

However, for the bad borrower group, accuracy levels decrease for all techniques. Recursive partitioning algorithm, bagging, and boosting mechanisms show a lower hit ratio for the good borrower group as well.

An overall analysis shows that all techniques, with the exception of discriminant analysis and logistic regression, are subject to performance loss when the classification rule using the training subsample is applied to the testing subsample.

In the training dataset, classification results from the recursive partitioning algorithm, bagging, and boosting are quite superior to the discriminant analysis and logistic regression outcomes. Whereas the traditional parametric models lead to an overall 74 % hit ratio, the non-parametric methods reach at least 83 % correct classifications. This accuracy increase, resulting from an automated computational procedure, may strongly affect banks, since loan application analysis using only computational resources could be significantly improved.

Regarding boosting, the higher the number of allowed iterations, the better the classification results for the training, i.e., the calibration dataset. Results show an accuracy rate of 93 %, which is much higher than the traditional statistical technique accuracy rate, 74 %.

However, it is important to highlight that the performance of the models did not vary significantly in the testing sample for any technique. The hit ratio is quite insensitive to the method or to the number of iterations in the ensemble models. Moreover, forecasting results are comparable to those of simpler techniques.

Even worse, in the testing sample, ensemble methods showed poor performance, especially when bad borrowers were predicted as good borrowers. This misclassification can lead to significant credit losses, since the automated decision would suggest the approval of a loan to a borrower who would default.

These results suggest that, in the case of the Australian credit card database, although ensemble methods could be seen as an improved model of an existing dataset, their contribution to predict credit quality in an out-of-sample analysis is not clear.

4 Final Comments

This chapter aimed to discuss decision models for retail credit risk. In particular, potential uses of two ensemble methods, bagging and boosting, in application scoring were assessed. Based on supervised machine learning algorithms, these ensemble methods could support decision models for automated responses to loan applications.

Using a dataset of credit card applications, decision models that rely on computational algorithms, such as ensemble methods, were compared to traditional discriminant analysis and logistic regression in order to assess whether they could enhance the accuracy of borrower classification.

Results show that, specifically for the training subsample, bagging and especially boosting significantly improve the classification hit ratio. However, for the testing subsample, ensemble techniques coupled with the recursive partitioning algorithm convey only marginally better classifications. The error rate for classifying bad borrowers as good ones revealed significant problems in the ensemble methods used in this study. Thus, although these machine learning techniques are likely to be more accurate in the training dataset, their impact on the analysis of new loan applications is not robust.

Even though the computational techniques studied here did not significantly improve the hit ratio, it is important to highlight that even a minimal increase in the rate of correct classifications might result in relevant savings for a financial institution with millions of transactions in its retail portfolio.

Therefore, automated decision models, especially for large banks, could result in economic value and a simpler analysis of credit applications. This study assessed bagging and boosting, two of the most common ensemble methods. Several other machine learning mechanisms, such as neural networks, support vector machines, and Bayesian networks, might also be adopted to analyze credit risk.

Due to the complex default process and the financial market dynamics, managers and decision makers could take advantage of innovations in both computational performance and quantitative methods, eventually developing automated decision models that could contribute to the credit analysis process.