Abstract
Despite the advances in machine learning (ML) methods which have been extensively applied in credit scoring with positive results, there are still very important unresolved issues, pertaining not only to academia but to practitioners and the industry as well, such as model drift as an inevitable consequence of population drift and the strict regulatory obligations for transparency and interpretability of the automated profiling methods. We present a novel adaptive behavioral credit scoring scheme which uses online training for each incoming inquiry (a borrower) by identifying a specific region of competence to train a local model. We compare different classification algorithms, i.e., logistic regression with state-of-the-art ML methods (random forests and gradient boosting trees) that have shown promising results in the literature. Our data sample has been derived from a proprietary credit bureau database and spans a period of 11 years with a quarterly sampling frequency, consisting of 3,520,000 record-months observations. Rigorous performance measures used in credit scoring literature and practice (such as AUROC and the H-Measure) indicate that our approach deals effectively with population drift and that local models outperform their corresponding global ones in all cases. Furthermore, when using simple local classifiers such as logistic regression, we can achieve comparable results with the global ML ones which are considered “black box” methods.
1 Introduction
Information asymmetry has far-reaching and well-studied consequences in the operation of financial markets, such as the impact on financial inclusion, financial intermediation and financial risk; see [1,2,3,4,5]. Thus, credit bureaus have emerged as the means to diminish information asymmetry and support the efficiency of credit institutions in their decision-making processes, and in tasks such as credit limit management, debt collection, cross-selling, risk-based pricing, prevention of fraud, etc. [6,7,8]. Credit scoring, as a principal tool of credit bureaus to identify good prospective borrowers, began as early as 1941 [9]. However, the automated and widespread application of credit scoring did not take place until the 1980s, when computing power to perform sophisticated statistical calculations became affordable. One definition of credit scoring is “the use of statistical models to transform relevant data into numerical measures that guide credit decisions” [10]. According to Thomas et al. [11], credit scoring has been vital in the “…phenomenal growth in the consumer credit over the last five decades. Without (credit scoring techniques, as) an accurate and automatically operated risk assessment tool, lenders of consumer credit could not have expanded their loan (effectively).”
However, credit scoring modeling and methodologies face theoretical issues as well as practical ones (as operated in practice by all credit bureaus):
-
As with all predictive models, credit scoring suffers from population (or concept) drift, i.e., changes in the socio-economic environment cause the underlying distribution of the modeled population to change over time [12,13,14,15,16]. To tackle this problem in practical terms, credit bureaus implement continuous monitoring cycles and periodic re-calibration or re-development of their models [10, 17, 18]. The calibration of credit scoring models, or the lack thereof, has been mentioned in the literature as one reason (among others) for the subprime mortgage crisis of 2008 [19]. Specifically, FICO scores have been shown to have become a worse predictor of default between 2003 and 2006 [20, 21]. During that period, despite the rapid and severe deterioration of subprime portfolio quality, the corresponding scores remained fairly stable [22].
-
Development of credit scoring models requires historical data of at least 1–2 years. Even without counting the monetary cost of such operations, adding the time needed to implement and put into production a new generation of models sometimes results in a gap of three or more years between the data that reflect current population dynamics and the data used to build the models. This lag between the data available at model development time and the time a model is actually put into production has become more pronounced as data are generated at an ever-increasing pace, an acceleration that puts equal pressure on operations.
-
Moreover, as credit scoring models depend on pre-defined sets of predictor (input) variables, these predictors may lose their relevance over time and, when the model weights are updated, end up with weights at or close to zero. Predictors left out of a model are called omitted variables, and it has been shown that the omission of variables related to local economic conditions seriously biases and weakens scoring models [23].
-
Credit bureaus do not use a single scoring model (sometimes referred to as a “scorecard”) for a specific purpose (such as estimation of the probability of default), but rather split the population into various segments using either demographic or risk-based criteria. This happens for various reasons, such as data availability (e.g., new accounts versus existing customers), policy issues (e.g., different credit policies for mortgages), inherently different risk groups, etc., in order to (a) capture significant interactions between variables within a sub-population that are not statistically important within the entire population, or whose relevance as predictors changes between groups [24], and (b) capture non-linear relationships (especially on untransformed data) and increase the performance of generalized linear models [24], which even today remain the “gold standard” in the credit scoring industry (although to a far lesser extent than in past decades). Despite the lack of academic consensus about the effects of segmentation on scorecard performance [25], segmentation is a de facto approach throughout the credit scoring industry for another reason: robustness.
In this work, we investigate the use of local classification models for dynamic adaptation in consumer credit risk assessment, aiming to handle population drift and avoid the time-consuming endeavor of continuous monitoring and re-calibration/re-development procedures. The proposed adaptive scheme searches the feature space around each candidate borrower (“query instance”) to construct a “micro-segment” or local region of competence, using the k nearest neighbors (kNN) algorithm. Thus, a region of competence is exploited as a localized training set to feed a classification model for the specified individual. Such a specialized local model serves as an instrument to achieve the desired adaptation of the classification process. We compare various classifiers (logistic regression as well as ML methods such as random forests and gradient boosting trees). All the explored algorithms are trained on features extracted from a proprietary credit bureau database and evaluated in an out-of-sample/out-of-time validation setting in terms of performance measures including AUC and the H-Measure [26]. Specifically, we explore three hypotheses:
H1: Do local methods outperform their corresponding global ones?
H2: Do results using ML methods differ significantly from logistic regression in the global as well as in the local setup?
H3: Does the choice of kNN-based local neighborhoods affect model performance over choosing randomly selected regions?
The results demonstrate the competitiveness of the proposed approach as opposed to the established methods. Thus, our contributions can be summarized as follows:
-
Our analysis uses a real-world, pooled cross-sectional data set spanning a period of 11 years, including an economic recession, and containing 3,520,000 record-months observations and 125 variables. Availability of adequate, real-world credit-related data is extremely scarce in the literature. In a very extensive benchmark study [27], 28 papers were surveyed in terms of the data sets used; the mean number of records/variables over all datasets was 6167/24, whereas the biggest dataset used in the study had 150,000 observations and 12 independent variables. It has also been noted in the literature that small datasets may introduce unwanted artifacts, and that models built upon them do not scale up when put into practice [28, 29].
-
Using local classification methods there is no need for continuous calibration of the models; adaptation to concept drift is part of the dynamic and automated model building process.
-
Predictive models are always trained on the latest available data. The predictors used in the models are not fixed but they are always picked up to fit the changing conditions, thus bypassing the problem of omitted variables.
-
For each query, a specialized micro-segment or region of competence is created dynamically, thus reaping the benefits of segmentation.
-
Last but not least, the proliferation of ML/artificial intelligence methods for predictive modelling has created a paradigm shift for credit scoring as well [30,31,32,33,34,35,36,37,38]. The issue of performance improvement is but one side of the discussion, the other being related to issues such as transparency, bias and fairness [39,40,41,42,43,44], which in the context of credit scoring have received special attention [45,46,47] due to statutory and regulatory constraints (cf. GDPR, EU AI Act: COM/2021/206 final). In our work, we focus on the performance aspect and compare statistical classification models with widely promoted ML methods.
The rest of this paper is organized as follows. In Sect. 2, we present the theoretical background; Sect. 3 provides a formulation of the problem; Sect. 4 describes the experimental setup and all its parameters; Sect. 5 provides the empirical results; and Sect. 6 concludes with discussion of these results and possible directions of future work.
2 Background and Related Theoretical Work
2.1 Local Classification
Usually, the classification process is a two-phase approach that separates the processing of training and test instances:
-
Training phase: a model is constructed from the training instances.
-
Testing phase: the model is used to assign a label to an unlabeled test instance.
In global or eager learning, the first phase creates pre-compiled abstractions or models for learning tasks, which describe the relationship between the input variables and the output over the whole input domain [48]. In instance-based learning (also called lazy or local learning), the specific test instance (also called query), which needs to be classified, is used to create a model that is local to that instance. Thus, the classifier does not fit the whole dataset but performs the prediction of the output for a specific query [49,50,51,52].
The most obvious local model is a k-nearest neighbor classifier (kNN). However, there are other possible methods of lazy learning, such as locally-weighted regression, decision trees, rule-based methods, and SVM classifiers [53,54,55]. Instance-based learning is related to but not quite the same as case-based reasoning [56,57,58,59], in which previous examples may be used in order to make predictions about specific test instances. Such systems can modify cases or use parts of cases in order to make predictions. Instance-based methods can be viewed as a particular kind of case-based approach, which uses specific kinds of algorithms for instance-based classification.
Inherent to local learning methods is the problem of prototype or instance selection, which can be defined as the search for the minimal set \(S\), in the same vector space as the original set of instances \(T\), subject to \({\text{accuracy}}(S)\ge {\text{accuracy}}(T)\), where the constraint means that the accuracy of any classifier trained with \(S\) must be at least as good as that of the same classifier trained with \(T\) [60,61,62]. Instance selection methods can be distinguished based on properties such as the direction of search for defining \(S\) (e.g., incremental search, where the search begins with an empty \(S\)) and wrapper versus filter methods, where the selection criterion is either based on the accuracy obtained by a classifier such as kNN, or does not rely on a classifier to determine the instances to be retained [60].
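As a concrete illustration of incremental, wrapper-style instance selection, the following minimal Python sketch implements a condensed nearest neighbor (CNN-like) pass: it starts from an empty \(S\) and adds each instance that the current \(S\), used as a 1-NN reference set, fails to classify correctly. This is an illustrative example only, not the selection procedure used in the paper.

```python
import numpy as np

def condensed_selection(X, y, seed=0):
    """Incremental wrapper-style instance selection (CNN-like sketch):
    start with an empty S and add every instance that the current S,
    used as a 1-NN reference set, misclassifies."""
    order = np.random.default_rng(seed).permutation(len(X))
    keep = [order[0]]                      # seed S with a single instance
    for i in order[1:]:
        # 1-NN prediction of x_i using the instances kept so far
        d = np.linalg.norm(X[keep] - X[i], axis=1)
        if y[keep[np.argmin(d)]] != y[i]:  # misclassified -> add to S
            keep.append(i)
    return np.sort(np.array(keep))

# toy example: two well-separated classes need only a few prototypes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
S = condensed_selection(X, y)
print(len(S), "of", len(X), "instances retained")
```

With well-separated classes, only a handful of boundary instances survive, while accuracy of a 1-NN classifier on the reduced set remains intact.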
However, we shall distinguish instance selection from instance sampling (de Haro-Garcia et al. [63]), where the purpose is to formulate a suitable sampling methodology for constructing the training and test datasets from the entire available population. In particular, instance sampling deals with issues such as sample size and sample distribution (balancing) [64,65,66], and has been shown to be of major importance for credit scoring due to the inherent imbalance of credit scoring data [67].
There are three primary components in all local classifiers [48, 49]:
-
1.
Similarity or distance function: This computes the similarities between the training instances, or between the test instance and the training instances. This is used to identify a locality around the test instance.
-
2.
Classification function: This yields a classification for a particular test instance with the use of the locality identified with the use of the distance function. In the earliest descriptions of instance-based learning, a nearest neighbor classifier was assumed, though this was later expanded to the use of any kind of locally optimized model.
-
3.
Concept description updater: This typically tracks the classification performance and makes decisions on the choice of instances to include in the concept description.
Specific mention should also be made of local weighted regression [53, 68,69,70], where the core idea lies in local fitting by smoothing: the dependent variable is smoothed as a function of the independent variables in a moving fashion, analogous to a moving average. In a similar manner, kernel regression uses a kernel as a weighting function to estimate the parameters of the regression, e.g., the Nadaraya-Watson estimator [71, 72].
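For concreteness, the Nadaraya-Watson estimator computes a locally weighted average of the response, with weights given by a kernel centered at the query point. A minimal Python sketch (with a Gaussian kernel and an arbitrary bandwidth, both illustrative choices):

```python
import numpy as np

def nadaraya_watson(x0, X, y, h=0.5):
    """Nadaraya-Watson kernel regression estimate at query point x0:
    a locally weighted average of y with Gaussian kernel weights."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)   # kernel weights around x0
    return np.sum(w * y) / np.sum(w)

# smooth a noisy sine curve at a single query point
rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 200)
y = np.sin(X) + rng.normal(0, 0.1, X.size)
est = nadaraya_watson(np.pi / 2, X, y, h=0.3)  # should be close to sin(pi/2) = 1
print(round(est, 2))
```

The bandwidth \(h\) controls the size of the local neighborhood, which plays a role analogous to the region-of-competence size discussed later.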
Local classification methods have not been studied extensively in the specific context of credit scoring. Simple models such as basic kNNs expectedly do not yield satisfying results [27] and thus have not drawn much interest from the academic community, nor from practitioners for that matter. Some efforts using advanced and/or hybrid methodologies, such as self-organizing maps for clustering [73], combining kNN with LDA and decision trees [74], clustered support vector machines [75], fuzzy-rough instance selection [76], and instance-based credit assessment using kernel weights [77], have shown somewhat promising results, albeit bearing in mind the issues arising from the datasets used (size, relevance, real-world applicability).
2.2 Local Regions of Competence
Ensemble methods also known as Multiple Classifier Systems (MCS) combine several base classifiers through a conceptual three-phase process [78,79,80,81]:
-
1.
Pool generation, where a diverse pool of classifiers is generated,
-
2.
Selection, where one or a subset of these classifiers is selected, and
-
3.
Integration, where a final prediction is made based on fusing the results of the selected classifiers.
The selection phase can be static or dynamic. Static selection consists of selecting base models once and using the resulting ensemble to predict all test samples, whereas in dynamic selection specific classifiers are selected for each test instance through evaluation of their competence in the neighborhood or otherwise on a local region of the feature space where the test instance is located. Thus, the neighbors of the test instance define a local region which is used to evaluate the competence of each base classifier of the ensemble.
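The three-phase process with dynamic selection can be sketched as follows. This is an illustrative Python example (using scikit-learn, local-accuracy-based selection, and arbitrary pool members and parameters), not the ensemble configuration of any cited work:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=900, n_features=8, random_state=0)
# three-way split: pool training / dynamic-selection (competence) set / test set
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_dsel, X_te, y_dsel, y_te = train_test_split(X_rest, y_rest, test_size=0.4,
                                              random_state=0)

# phase 1 -- pool generation: a small, diverse pool of base classifiers
pool = [LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
        DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr),
        KNeighborsClassifier(5).fit(X_tr, y_tr)]

# phase 2 -- dynamic selection: for each query, choose the classifier with
# the best accuracy on the query's kNN-defined local region of competence
nn = NearestNeighbors(n_neighbors=15).fit(X_dsel)

def ds_predict(x):
    _, idx = nn.kneighbors(x.reshape(1, -1))
    rX, ry = X_dsel[idx[0]], y_dsel[idx[0]]
    best = max(pool, key=lambda clf: clf.score(rX, ry))
    # phase 3 -- integration is trivial here: a single selected classifier
    return best.predict(x.reshape(1, -1))[0]

preds = np.array([ds_predict(x) for x in X_te])
print("DS accuracy:", round(float((preds == y_te).mean()), 3))
```

Selecting a subset of classifiers and fusing their outputs (e.g., by majority vote) would turn phase 3 into a genuine integration step.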
The definition of the local region has been shown to be of importance to the final performance of dynamic selection methods [82, 83, 103] and there are papers pointing out that this performance can be improved by better defining these regions and selecting relevant instances [83,84,85,86]. One of the most common methodologies for defining local regions is kNNs (including its variations such as extended kNNs, especially for imbalanced data, which are of particular importance to credit scoring). Methods such as clustering [87, 88] can also be found in the literature.
Dynamic selection techniques in the context of credit scoring have received some attention in the literature [89,90,91,92,93,94]. In a recent paper, Melo Junior et al. [95] proposed a modification of the kNN algorithm, called reduced minority kNN (RMkNN), which aims to balance the set of neighbors used to measure the competence of the base classifiers. The main idea is to reduce the distance of the minority samples from the predicted instance. As mentioned, imbalance of the class distribution is an important factor when considering sampling for credit scoring [64, 67, 85, 86, 93, 96, 97]. This issue becomes even more important when dynamic selection techniques are applied.
A related approach is the Mixture of Experts, which is composed of many separate neural networks, each of which learns to handle a subset of the complete set of training cases [98,99,100,101]. This method is established based on a divide-and-conquer principle [102], where the feature space is partitioned stochastically into several subspaces through a special employed error function and “experts” become specialized on each subspace. However, only multilayer perceptron neural networks are used as the base classifier [78, 103]. Mixture of Experts has not been extensively applied in the context of credit scoring and there are only a few studies on the subject [104, 105].
3 Problem Formulation and Parameters
Assuming a classification training set \(\{({\mathbf{x}}_{1}, {y}_{1}), \dots ,({\mathbf{x}}_{n},{y}_{n})\}\), \(\mathbf{x}\in {\mathbb{R}}^{d}\), \(y\in \{0, 1\}\), let \(M\) be a global model trained on all \({\left\{\left({\mathbf{x}}_{i}, {y}_{i}\right)\right\}}_{i=1}^{n}\). The local region of competence for a given test instance \(\mathbf{x}\) (its k-nearest neighbors) is denoted by \({N}_{x}=\{{\mathbf{x}}_{1},{\mathbf{x}}_{2},\dots , {\mathbf{x}}_{k}\}\), and the learning set for the local classifier \({M}_{x}\) is \({\left\{\left({\mathbf{x}}_{i}, {y}_{i}\right)\right\}}_{{\mathbf{x}}_{i}\in {N}_{x}}\).
Specifically, for the credit scoring binary classification problem, \(\{{\mathbf{x}}_{i}\}\), \(i=1, \dots , n\), is considered the feature or variable space, denoting the characteristics of each borrower \(i\), and \({y}_{i}\) is the corresponding objective or target variable denoting the class label (non-default or default, sometimes also referred to as “Good” or “Bad”). Each feature vector \({\mathbf{x}}_{i}\) is observed at a point in time \({T}_{0}\), called the observation point, whereas the corresponding response \({y}_{i}\) is recorded at a subsequent performance point \({T}_{1}={T}_{0}+\tau\), where \(\tau \ge 1\) is usually defined in months. The collected input data span an observation time window (or observation window) covering the period \([{T}_{0}-{\tau }^{\mathrm{^{\prime}}}, {T}_{0}]\) (\({\tau }^{\mathrm{^{\prime}}}\ge 1\) denoting months), whereas the outcome window refers to the period \(({T}_{0}, {T}_{1}]\), where the class label of \({y}_{i}\) is defined. In the context of behavioral credit scoring, the feature space contains variables related to the financial performance and behavior of borrowers, such as credit amounts, delinquency status, etc.
The credit scoring literature has not provided definitive answers on how to optimally define these parameters (default definition, observation window, outcome window). Recommendations in the literature vary the length of the observation and outcome windows from 6 to 24 months [8, 11, 106].
Regarding the definition of default, Anderson [10] noted that financial institutions choose between: (a) a current status definition, which classifies an account as good or bad based on its status at the end of the outcome window, and (b) a worst status approach, which uses the worst status over a time-period during the outcome window. Regulatory requirements are also of paramount importance and must be taken into consideration; for example, a 90 days past due worst status approach is commonly used in practice in behavioral scorecards and complies with regulatory requirements such as the Basel Capital Accords and the new definition of default by the European Banking Authority (EBA). Kennedy et al. [107] presented a comparative study of various values for these parameters. Their results indicated that behavioral credit scoring models using:
-
default definitions based on a worst status approach outperformed those with current status.
-
a 12-month observation window outperformed the ones with 6- and 18-month windows in combination with shorter (12 months or less) outcome windows.
-
a 6-month outcome window and a current status definition of default outperformed longer outcome windows; for the worst status approach, the degradation occurs when the outcome window extends beyond 12 months.
Finally, it should also be noted that credit scoring data sets are highly imbalanced, since the objective of all financial institutions is a low-default portfolio. There are quite a few studies and approaches in the literature analyzing the impact of class imbalance on classification in general [108,109,110,111,112,113,114], as well as in the context of credit scoring [64, 67, 84, 93, 96, 115].
4 Experimental Setup and Methodology
4.1 Data and Variables
Our data set (pooled cross-sectional data) has been derived from a proprietary credit bureau database in Greece and spans a period of 11 years (2009q1 to 2019q4), resulting in total 44 snapshots (11 years by 4 quarters). At each snapshot, a random sample of 80,000 borrowers was retrieved with all their credit lines, including paid off and defaulted, resulting in 3,520,000 record-months observations.
In total, 125 proprietary credit bureau behavioral variables were calculated at the borrower level which fall within the following dimensions:
-
Type of credit (consumer loans, mortgages, revolving credit such as overdrafts, credit cards, restructuring loans, etc.).
-
Delinquencies (months in arrears, delinquent amount, etc.).
-
Amounts (Outstanding balance, disbursement amount, credit limit, etc.).
-
Time (months since approval, time from delinquencies, etc.).
-
Inquiries made to the credit bureau database.
-
Derogatory events, such as write-offs or events from public sources such as courts.
Besides “elementary” variables such as the ones described above, other derivative/combinatory variables along various dimensions were calculated, such as various ratios (e.g., the ratio of delinquent balance over current balance for the last \(X\) months for a specific type of credit line) and utilizations together with the rate of their increase or decrease over a specific time-window (e.g., consecutive increases over the last \(X\) months), giving the total of 125 variables.
4.2 Scoring Parameters
Our scoring parameters are defined as follows:
-
Observation window: A time window of 12 months prior to each observation point \({T}_{0}\). Our initial observation point is 2009q1, followed by every subsequent quarter thereafter up to 2018q4.
-
Scorable population: At each observation point \({T}_{0}\), the following cases are excluded from the analysis: (a) borrowers already having a delinquency of 90 days past due (dpd) or more at \({T}_{0}\), (b) cases lacking sufficient historical data, i.e., less than 6 months of credit history, and (c) credit cards with an inactive balance within the observation window. The remaining observations constitute the scorable population for the specific \({T}_{0}\). The last \({T}_{0}\) is taken at 2018q4.
-
Outcome window: a 12-month window after the observation point. For each observation point \({T}_{0}\), the period \(({T}_{0}, {T}_{1}]\) with \({T}_{1}={T}_{0}+12\) months is used as the outcome window. Thus, the last \({T}_{1}\) is taken at 2019q4.
-
Default definition: The labeling of the scorable population at \({T}_{0}\) either as GOOD = 0 (majority class) or BAD = 1 (minority or “default” class), depending on the information available during the outcome window, takes place using a worst status approach: the maximum (worst) delinquency over all accounts, or a new derogatory event, is measured over the specific outcome window. Thus, the corresponding classes are defined as: (a) \(y=1\) for cases whose worst delinquency is \(\ge\) 90 dpd or for which a derogatory event occurs during the outcome period; (b) \(y=0\) for all other cases.
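The worst-status labeling rule above can be sketched in a few lines. The following Python/pandas example is purely illustrative (column names and the toy monthly records are assumptions, not the paper's actual data schema):

```python
import pandas as pd

# illustrative monthly performance records inside the outcome window
perf = pd.DataFrame({
    "borrower_id": [1, 1, 1, 2, 2, 2],
    "dpd":         [0, 30, 95, 0, 10, 30],  # days past due in each month
    "derogatory":  [0, 0, 0, 0, 0, 0],      # new derogatory event flag
})

# worst status over the outcome window, per borrower
worst = perf.groupby("borrower_id").agg(
    worst_dpd=("dpd", "max"), any_derog=("derogatory", "max"))

# BAD (y = 1) if worst delinquency reached 90+ dpd or a derogatory event occurred
worst["y"] = ((worst["worst_dpd"] >= 90) | (worst["any_derog"] == 1)).astype(int)
print(worst["y"].to_dict())  # borrower 1 defaults (95 dpd), borrower 2 does not
```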
4.3 Methodology
Our approach is based on training local and global classifiers on the same sample and comparing their performance. Local classifiers are trained for each instance \(\mathbf{x}\) of the test data set of each snapshot, using the feature space defined by its neighborhood or region of competence within the training data set. A local model \({M}_{x}\) is then used to predict the probability and the class label of the specific instance for which it was trained. Correspondingly, global classification models are trained on the entire training set and then used to predict the class probabilities of each instance on the test data set. To better simulate a real-world scenario, we retrain global classifiers every 2 years. The classifiers used in both the global and the local scheme are logistic regression, random forests (RF), and extreme gradient boosting machines (XGB). The choice of these specific ML models was based on recent credit scoring literature, where they appear to be on par with or to outperform other machine learning and deep learning methods [32]. Specifically, Gunnarsson et al. [33] found that XGBoost and RF outperformed deep belief networks (DBN), Hamori et al. [34] found XGB to be superior to deep neural networks (DNN) and RF, Marceau et al. [35] found that XGB performed better than DNN, and Addo et al. [30] concluded that both XGB and RF outperform DNN.
For implementation we used Microsoft R Open v3.5.1 and the corresponding R libraries: speedglm 0.3–2, randomForest 4.6–14 and xgboost 0.71.2. In all cases, default parameter values were used and no hyper-parameter optimization was performed other than internally used by the methods.
During the training phase, the input data have been pre-processed using an expert-based process flow to:
-
handle missing values, by excluding variables with greater than 70% missing values and filling the remaining blanks with a constant (since the variables are missing at random (MAR), in this work we use − 1 as the constant value),
-
retain only the useful variables, by removing those with zero variance or near zero variance,
-
isolate non-correlated variables, using an exclusion threshold of 0.7, and
-
select the most discriminative among the remaining variables using the Information Value (IV) criterion. The exclusion thresholds were selected to match a practitioner’s rule mentioned in the literature [18], where a variable is removed if it has an IV lower than 0.3 or greater than 2.5.
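The Information Value of a variable is computed from the distributions of goods and bads across its bins, \(\mathrm{IV} = \sum_i (\%\text{good}_i - \%\text{bad}_i)\,\ln(\%\text{good}_i/\%\text{bad}_i)\). A minimal Python sketch (equal-frequency binning and a small smoothing constant are illustrative choices, not the paper's exact procedure):

```python
import numpy as np
import pandas as pd

def information_value(x, y, bins=5):
    """Information Value of a numeric variable x against a binary target y
    (y = 1 for the BAD/default class), using equal-frequency binning."""
    df = pd.DataFrame({"bin": pd.qcut(x, bins, duplicates="drop"), "y": y})
    grp = df.groupby("bin", observed=True)["y"].agg(["sum", "count"])
    bad = grp["sum"] / grp["sum"].sum()                      # bad distribution
    good = (grp["count"] - grp["sum"]) / (grp["count"] - grp["sum"]).sum()
    woe = np.log((good + 1e-6) / (bad + 1e-6))               # weight of evidence
    return float(((good - bad) * woe).sum())

# an informative variable should score a far higher IV than pure noise
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 2000)
informative = y * 2.0 + rng.normal(0, 1, 2000)   # shifts with the target
noise = rng.normal(0, 1, 2000)
iv_inf, iv_noise = information_value(informative, y), information_value(noise, y)
print(round(iv_inf, 3), ">", round(iv_noise, 3))
```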
Finally, as noted in Sect. 3, credit scoring data are inherently imbalanced. In our case, the imbalance is also observed in the regions of competence, which are used to build the local classification models. This inevitably leads in some cases to non-convergence errors when local logistic regression is used as the classification algorithm and the local region of competence contains too few minority class (default) cases for the algorithm to converge. In our experiments, this non-convergence error occurred on average in 1.9% of all executions. To address the non-convergence issue, in this work we use a simple heuristic rule: whenever the logistic regression algorithm fails to predict a class label for a test instance, the algorithm assigns the majority class of the test instance’s region of competence.
4.4 Local Classification
As detailed below, for each snapshot, the k nearest neighbors (k-NN) algorithm is used to define the local region of competence \({N}_{x}\) for each test instance \(\mathbf{x}\). A local model \({M}_{x}\) is trained on this specific region \({N}_{x}\), which serves as an instrument to achieve the desired adaptation for the classification process. Figure 1 shows the overall flow for the proposed scheme:
The setup procedure is as follows: for each snapshot, the scorable population is defined as a random set (of 80,000 instances) sampled without replacement from the total population, and the resulting data set is separated through a 50–50 split into training and test sets, forming the training and test sub-spaces of the original feature space. The local region of competence for each test instance is defined using the Euclidean distance metric. Such a region of competence serves as a borrower-specific localized training set that is used to build a local classification model for that borrower.
Regarding the size of the \(k\) parameter required by the nearest neighbors algorithm, it is worth noting a common rule of thumb that recommends 1500 to 2000 examples per class, dating from the very beginning of credit scoring model development [116] and mentioned in many works thereafter [18, 24, 117]. Although the subject has not been extensively researched, recent academic studies suggest that larger samples can improve the performance of linear models [67, 117], but there seems to be a plateau after 6000 goods/bads and almost no further benefit above 10,000. As a result, aiming to evaluate both claims, in this work we select a \(k\) parameter that ranges from 2000 to 6000 examples (\(k\in\){2000, 4000, 6000}). The resulting region of competence is used to train a local classification model, \({M}_{x}\), which is specialized for the corresponding test instance/borrower. In this study, local classification models are built using the classification algorithms considered in the analysis (i.e., logistic regression, random forests, gradient boosting trees). Figure 2 depicts the training phase for the proposed scheme (pre-processing refers to the flow described in Sect. 4.3).
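The local classification step can be sketched as follows. This is an illustrative Python/scikit-learn example, not the paper's R implementation; the data set, the reduced \(k\), and the single-class fallback are assumptions made for a self-contained, fast-running sketch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

# imbalanced toy data standing in for a snapshot's scorable population
X, y = make_classification(n_samples=4000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

k = 500  # region-of-competence size (the paper uses k in {2000, 4000, 6000})
nn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(X_tr)

def local_predict(x):
    """Train a borrower-specific local model M_x on the k nearest training
    instances (the region of competence N_x) and score the query x."""
    _, idx = nn.kneighbors(x.reshape(1, -1))
    Xn, yn = X_tr[idx[0]], y_tr[idx[0]]
    if len(np.unique(yn)) < 2:   # degenerate single-class region:
        return float(yn[0])      # fall back to the region's (majority) class
    m = LogisticRegression(max_iter=1000).fit(Xn, yn)
    return float(m.predict_proba(x.reshape(1, -1))[0, 1])

pd_scores = np.array([local_predict(x) for x in X_te[:100]])
print("mean predicted PD over 100 test instances:", round(pd_scores.mean(), 3))
```

Note that one model is fit per query instance, which is the computational price of the adaptive scheme.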
To assess performance, each local classification model \({M}_{{x}_{i}}\), built for test instance \({\mathbf{x}}_{i}\) on its specific region of competence \({N}_{{x}_{i}}\), \(i=1,\dots ,|TS|\) (where \(|TS|\) is the number of data points in the test set), is used to predict the probability of default (PD) for the considered test instance/candidate borrower and to assign a GOOD or BAD class label. The predicted labels are then compared to the actual labels available for the test instances.
4.5 Global Classification
As a baseline to benchmark our proposed local classifiers, we implement and evaluate a standard credit scoring classification scheme commonly used both by the scientific community and practitioners alike. In the global classification approach, the adaptation to population drift is achieved by retraining the models using new data from the contextual snapshot. Figure 3 shows the overall flow for the global scheme.
It should be noted that, in order to have a real-world and realistic comparison of model performance, we re-train our global models every two years (as retraining is applied in practice to all commercial credit scoring models). The performance of global models over all snapshots would degrade significantly if training took place only once, on the initial snapshot data (indicatively: mean AUC = 0.8213 with standard deviation = 0.04 for global LR models when training took place only at the first snapshot, 2009q1, versus mean AUC = 0.8746 with standard deviation = 0.014 when the global LR is re-trained every 2 years).
4.6 Performance Measures and Comparison of Classifiers
There is keen interest in the research community regarding the appropriateness of the established performance measures used to evaluate classification models, especially those used in credit scoring applications, also considering the inherent imbalance of credit scoring datasets [118,119,120]. Specifically, the credit scoring setup gives rise to methodological problems such as the accuracy paradox [121] and the different misclassification costs between type I and type II errors [26]. As a result, the prevailing approach avoids accuracy as a scorecard performance metric, adopting instead measures such as the area under the ROC curve (AUC), the GINI index, the Kolmogorov–Smirnov distance, or the F-measure. However, in the literature there has been skepticism over their appropriateness, especially that of the widely used AUC measure [122]. A coherent alternative, namely the H-measure [26, 122, 123], has been proposed, which handles different misclassification costs and is indicated to be a better-suited performance metric for the credit scoring context [120]. Thus, in this work, we use both AUC and the H-measure (using the default parameter values for the calculation of the H-measure, as defined in the corresponding R package).
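For illustration, the ranking-based measures mentioned above (AUC, GINI, Kolmogorov–Smirnov) can be computed as follows on synthetic scores. The H-measure itself is computed in the paper with the dedicated R package, so it is omitted from this sketch:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

# Illustrative data: y holds true GOOD (0) / BAD (1) labels and
# p the predicted probabilities of default from some scorecard.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
p = np.clip(0.3 * y + rng.normal(0.35, 0.2, size=1000), 0, 1)

auc = roc_auc_score(y, p)       # area under the ROC curve
gini = 2 * auc - 1              # GINI index, a linear rescaling of AUC
# Kolmogorov-Smirnov distance between the score distributions
# of the BAD and GOOD populations:
ks = ks_2samp(p[y == 1], p[y == 0]).statistic
```

Note that GINI carries the same information as AUC, which is one reason the H-measure, with its explicit misclassification-cost distribution, is preferred as a complementary metric.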
Comparisons among several classification algorithms over several datasets arise in machine learning whenever a newly proposed algorithm is compared with the existing state of the art. From a statistical point of view, the correct way to deal with multiple hypothesis testing is to first compare all the classification algorithms together by means of an omnibus test, to decide whether they all have the same performance. Then, if the null hypothesis is rejected, the classification algorithms can be compared in pairs using post-hoc tests. In these kinds of comparisons, common parametric statistical tests such as ANOVA are generally not adequate as the omnibus test. The arguments are similar to those against the use of the t-test: the scores are not commensurable among different application domains and the assumptions of parametric tests (normality and homoscedasticity in the case of ANOVA) are hardly ever fulfilled [124,125,126]. In this paper we use the non-parametric Friedman's aligned rank test and the Nemenyi post-hoc test. Non-parametric tests are selected because the underlying data distribution is not known. Since multiple classifiers are compared, the Nemenyi test is used for the pairwise comparisons among scheme and algorithm combinations, as proposed by Demsar [124], whereas Friedman's aligned rank test is utilized to correct the p values for multiple testing.
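The omnibus step can be sketched with SciPy's classical Friedman test on hypothetical per-snapshot scores; the aligned-ranks variant and the Nemenyi post-hoc test need additional tooling (e.g., the scikit-posthocs package) and are not shown here:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical AUC scores of three classifiers over five snapshots
# (rows: snapshots/datasets, columns: classifiers); values illustrative only.
scores = np.array([
    [0.871, 0.902, 0.899],
    [0.868, 0.905, 0.901],
    [0.874, 0.898, 0.903],
    [0.869, 0.907, 0.897],
    [0.872, 0.904, 0.900],
])

# Omnibus test: null hypothesis is that all classifiers perform the same.
stat, p_value = friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2])
reject_null = p_value < 0.05  # if True, proceed to pairwise post-hoc tests
```

Only when the omnibus null is rejected, as here (the first classifier ranks last on every row), do the pairwise post-hoc comparisons become meaningful.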
5 Empirical Results
For tackling the hypothesis regarding the superiority of local models over their global counterparts, we started by examining whether the size of the local region impacts the classification performance. Figure 4 summarizes the performance of local LR models for various k’s whereas Table 2 in the Appendix provides the detailed results over all snapshots.
As evidenced, the choice of \(k\) does not have a significant impact on the performance of logistic regression. Specifically, we observe that when using the H-measure, the performance results decrease slightly and non-significantly as \(k\) increases (mean = 0.6360, 0.6298, 0.6270 for \(k\) = 2000, 4000, 6000, correspondingly), whereas the opposite holds when using AUC as the performance measure (mean = 0.9256, 0.9259, 0.9265 for the corresponding \(k\)'s). Thus, for the rest of our process we choose \(k\) = 2000 for local models, since model performance is not significantly affected, whereas computational performance and memory requirements are considerably improved with lower \(k\)'s.
Comparing visually the results of the local classifiers with their corresponding global ones, we get a mixed picture (see Tables 3 and 4 in the Appendix for detailed results): whereas local LR models outperform their global counterparts, for XGB and RF the differences between global and local classifiers do not appear to be significant (Fig. 5).
To test for statistical differences between all classifiers (i.e., the case of multiple methods on multiple data sets as noted in [124]), we use Friedman's aligned rank test [125] to assess all the pairwise differences between algorithms and then correct the p values for multiple testing (Fig. 6 visualizes the results in matrix format). We observe that in both measures (AUC and H-Measure) LR-G differs significantly from all other classifiers. Going into more detail, in the AUC-based matrix two “clusters” of classifiers emerge for which the null hypothesis of equal performance cannot be rejected: (a) XGB-G, RF-G, RF-L_2k and (b) LR-L_2k and XGB-L_2k. For the H-measure-based p value matrix, the analogous “clusters” observed are as follows: (a) RF-L_2k, RF-G and (b) XGB-G, XGB-L_2k, LR-L_2k. Thus, there seems to be an “interlacing” between the performance of all ML models (both local and global) and LR-L_2k which cannot be statistically rejected and which strengthens the evidence that local models are at least on par with their global counterparts. In particular, LR-L clearly outperforms LR-G with statistical significance.
As a next step, we use the Nemenyi post hoc test, which is designed to check the statistical significance of the differences in the average rank of a set of predictive models. In the resulting critical distance (CD) graph (Fig. 7), the horizontal axis represents the average rank position of the respective model. The null hypothesis is that the average ranks of each pair of predictive models do not differ with statistical significance of 0.05. Horizontal lines connect the models for which we cannot reject the hypothesis that their average ranks are equal; any pair of models not connected by a horizontal line can be seen as having average ranks that differ with statistical significance. On top of the graph a horizontal bar shows the required difference between average ranks (known as the critical distance or difference) for a pair of models to be considered significantly different.
Thus, it is further evidenced that local LR consistently and statistically significantly outperforms global LR, although the same conclusion does not seem to hold for RF and XGB, despite the minor difference in favor of the local methods when comparing average performance. This becomes more apparent upon examining the average AUC and H-Measure over all snapshots (Fig. 8).
It is also noteworthy that although RF outranks XGB (in all cases; differences not statistically significant), the performance of Local LR does not differ statistically from the ML algorithms, contrasting the case of global LR which is vastly outranked and outperformed. The gain, when comparing these classifiers to the “baseline” global LR, is within the range of 6–8% (Table 1), which is well within the empirical range observed in other studies [32] when comparing ML algorithms to the basic logistic regression in credit scoring.
Finally, to examine whether the choice of a specific local region based on kNNs versus random sub-sampling plays a role in the performance, we trained a series of models LR-L_2k_rnd where for each test instance \(\mathbf{x}\) its local region \({N}_{x}\) is a set of randomly selected training cases, instead of employing the kNN scheme. Detailed results are provided in Table 5 (Appendix), whereas Fig. 9 highlights the fact that selecting local regions through kNNs does make a difference, yielding a performance gain with respect to a random choice of regions. It should be noted here that the performance of LR-L_2k_rnd appears similar to that of the global LR-G. This is no surprise, since the attributes of a random sample are, by construction, more similar to the overall population from which the sample is drawn than to a sub-region with specific characteristics.
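The contrast between the two region-selection strategies can be sketched as follows, on synthetic data rather than the bureau sample; the point is the mechanics of the comparison, not a reproduction of the reported gain:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Illustrative data and sizes; not the paper's dataset or chosen k.
X, y = make_classification(n_samples=5000, n_features=10, random_state=1)
x_query, k = X[0:1], 500
rng = np.random.default_rng(1)

# Region 1: the k nearest neighbors of the query instance.
knn_idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(
    x_query, return_distance=False)[0]
# Region 2: a random subsample of the same size.
rnd_idx = rng.choice(len(X), size=k, replace=False)

# A random region mirrors the overall population, so the model trained on
# it behaves like a global one; the kNN region is query-specific.
m_knn = LogisticRegression(max_iter=1000).fit(X[knn_idx], y[knn_idx])
m_rnd = LogisticRegression(max_iter=1000).fit(X[rnd_idx], y[rnd_idx])
pd_knn = m_knn.predict_proba(x_query)[0, 1]
pd_rnd = m_rnd.predict_proba(x_query)[0, 1]
```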
6 Conclusions and Future Work
The development of reliable models for credit scoring remains a challenge for researchers and practitioners. Technological advances in ML/AI provide new capabilities in this field, enabling the exploitation of large amounts of data. However, as conditions in the economic and business environment are in constant change, credit scoring models require regular updating. Motivated by this finding, this paper presented an adaptive behavioral credit scoring scheme which uses online training to provide estimates of the probability of default on an instance-specific basis.
Returning to our research hypotheses, we can draw the following conclusions:
H1: With respect to the potential gain of local methods vis-a-vis their global counterparts our results indicate clearly that local logistic regression outperforms and outranks the baseline global logistic regression. This does not seem to hold for the ML methods we used (RF and XGB) where the differences between local and global models are not statistically significant.
H2: Concerning the superiority of ML methods over baseline LR-G our results fall within a range of performance improvement of 2–8% observed in various credit scoring applications of ML/AI found in literature [30,31,32,33,34, 127]. However, it is quite important to observe that the performance of Local LR is on par with RF and XGB.
H3: Finally, our analysis clearly indicates that the performance of a local model is affected by the selection of a region of competence based on similar characteristics with the queried test instance. A random selection of points from the feature space provides inferior results compared to the kNN approach adopted in this study.
Taking into consideration the volume of real-world data used and the extensive out-of-sample validation performed, thus safeguarding against overfitting, our work clearly indicates that local LR methods can provide real-time adaptation, thereby offering a solution to the problem of population drift and the need for continuous re-calibration (which holds for LR and ML models alike), while yielding results comparable with complex state-of-the-art ML algorithms. Additionally, LR per se is not a “black box” model, which is extremely beneficial for regulatory purposes. However, dealing with the complexities of model risk management and governance [128,129,130] in the case of real-time, adaptive local models may pose equal or even greater challenges for their practical application.
Another issue that warrants further examination is why the tested ML methods do not benefit from the same local regions as LR does. One possible explanation lies in the intrinsic way RF and XGB work, exploiting combinations of predictors within the feature space and thus already capturing the specific dynamics of a sub-region. This needs to be further examined.
Further work can also be performed towards the direction of:
-
exploring advanced balancing techniques such as SMOTE [131] or RUSBoost [132] for local sampling, considering the highly imbalanced nature of credit datasets [64, 93], where balancing may affect not only performance in terms of misclassification errors but also non-convergence errors when using local LR,
-
usage of penalized methods such as LASSO or Ridge [113, 133],
-
usage of different distance metrics (e.g., Manhattan or Mahalanobis) or even different algorithms for choosing local regions instead of the basic kNNs, such as Reduced Minority kNNs [95].
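As an illustration of the last direction, alternative metrics can be plugged into the neighbor search directly; a minimal sketch on synthetic data, assuming scikit-learn's NearestNeighbors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic, illustrative feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
x_query = X[0:1]

# Manhattan (L1) distance:
nn_l1 = NearestNeighbors(n_neighbors=10, metric="manhattan").fit(X)

# Mahalanobis distance requires the inverse covariance matrix of the data,
# which makes the neighborhood shape account for feature correlations:
VI = np.linalg.inv(np.cov(X, rowvar=False))
nn_mah = NearestNeighbors(n_neighbors=10, metric="mahalanobis",
                          metric_params={"VI": VI}).fit(X)

idx_l1 = nn_l1.kneighbors(x_query, return_distance=False)[0]
idx_mah = nn_mah.kneighbors(x_query, return_distance=False)[0]
```

Since correlated behavioral attributes are common in credit data, a correlation-aware metric such as Mahalanobis could yield regions of competence different from the Euclidean ones used here.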
Data Availability
Data subject to third party restrictions.
Code Availability
Not applicable.
Notes
In total we executed 120 runs for local LR models (one run over all 40 snapshots for each k, where k ∈ {2000, 4000, 6000} is the size of the kNN region).
References
Barci G, Andreeva G, Bouyon S (2019) “Data sharing in credit markets: does comprehensiveness matter?”, European Credit Research Institute, Research Report no. 23, available at: https://bit.ly/3xfiW3v
Besanko D, Thakor AV (1987) Competitive equilibrium in the credit market under asymmetric information. Journal of Economic Theory 42(1):167–182
Jappelli T, Pagano M (1993) Information sharing in credit markets. J Financ 48(5):1693–1718
Morscher C, Horsch A, Stephan J (2017) Credit information sharing and its link to financial inclusion and financial intermediation. Financial Markets, Institutions and Risks 1(3):22–33
Stiglitz JE, Weiss A (1981) Credit rationing in markets with imperfect information. Am Econ Rev 71(3):393–410
Breeden J, Thomas L, McDonald J III (2007) Stress testing retail loan portfolios with dual-time dynamics. Journal of Risk Model Validation 2(2):1–19
Hand DJ, Henley WE (1997) Statistical classification methods in consumer credit scoring: a review. J R Stat Soc A Stat Soc 160(3):523–541
Thomas LC, Malik M (2010) Comparison of credit risk models for portfolios of retail loans based on behavioral scores. In: Rausch D, Scheule H (eds) Model Risk in Financial Crises. Risk Books, pp 209–232
Durand D (1941) Credit-rating formulae. In Risk Elements in Consumer Installment Financing 83–91. NBER
Anderson R (2007) The credit scoring toolkit: theory and practice for retail credit risk management and decision automation. Oxford University Press
Thomas LC, Edelman DB, Crook JN (2002) Credit scoring & its applications (monographs on mathematical modeling and computation) (1st edition). Soc Ind Appl Math
Adams NM, Tasoulis DK, Anagnostopoulos C, Hand DJ (2010) Temporally-adaptive linear classification for handling population drift in credit scoring. In: Lechevallier Y, Saporta G (eds) COMPSTAT2010, Proceedings of the 19th International Conference on Computational Statistics, 167–176
Gama J, Medas P, Castillo G, Rodrigues P (2004) Learning with drift detection. In Advances in Artificial Intelligence–SBIA 2004 286–295. Springer
Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):44:1–44:37
Klinkenberg R (2004) Learning drifting concepts: example selection vs. example weighting. Intell Data Anal 8(3):281–300
Žliobaitė I, Pechenizkiy M, Gama J (2016) An overview of concept drift applications. In N. Japkowicz & J. Stefanowski (Eds.), Big Data Analysis: New Algorithms for a New Society (Vol. 16, pp. 91–114). Springer International Publishing
Jung KM, Thomas LC, So MC (2015) When to rebuild or when to adjust scorecards. Journal of the Operational Research Society 66(10):1656–1668
Siddiqi N (2005) Credit risk scorecards: developing and implementing intelligent credit scoring. Wiley, New York
Rona-Tas A, Hiss S (2008) Consumer and corporate credit ratings and the subprime crisis in the US with some lessons for Germany. SCHUFA, Wiesbaden
Ashcraft AB, Schuermann T (2008) Understanding the securitization of subprime mortgage credit. Foundations and Trends® in Finance 2(3):191–309
Demyanyk Y, Van Hemert O (2011) Understanding the subprime mortgage crisis. Review of Financial Studies 24(6):1848–1880
Breeden J (2014) Reinventing retail lending analytics—2nd impression. Risk Books
Avery RB, Bostic RW, Calem PS, Canner GB (2000) Credit scoring: statistical issues and evidence from credit bureau files. Real Estate Economics 28(3):523–547
Anderson R (2022) Credit intelligence and modelling: many paths through the forest. Oxford University Press
Bijak K, Thomas LC (2012) Does segmentation always improve model performance in credit scoring? Expert Syst Appl 39(3):2433–2442
Hand DJ (2009) Measuring classifier performance: a coherent alternative to the area under the ROC curve. Mach Learn 77(1):103–123
Lessmann S, Thomas LC, Seow H-V, Baesens B (2013) Benchmarking state-of-the-art classification algorithms for credit scoring: a ten-year update. Credit Scoring and Credit Control XIII
Jamain A, Hand DJ (2009) Where are the large and difficult datasets? Adv Data Anal Classif 3(1):25–38
Perlich C, Provost F, Simonoff JS (2003) Tree induction vs. logistic regression: a learning-curve analysis. J Mach Learn Res 4:211–255
Addo P, Guegan D, Hassani B (2018) Credit risk analysis using machine and deep learning models. Risks 6(2):38
Albanesi S, Vamossy DF (2019) Predicting consumer default: a deep learning approach (Working Paper No. 26165; Working Paper Series). Nat Bur Econom Res
Alonso A, Carbó JM (2020) Machine learning in credit risk: measuring the dilemma between prediction and supervisory cost. Banco de España Working Paper No. 2032, available at: https://ssrn.com/abstract=3724374
Gunnarsson BR, Broucke S, Baesens B, Óskarsdóttir M, Lemahieu W (2021) Deep learning for credit scoring: do or don’t? Eur J Oper Res 295(1):292–305
Hamori S, Kawai M, Kume T, Murakami Y, Watanabe C (2018) Ensemble learning or deep learning? Application to default risk analysis. Journal of Risk and Financial Management 11(1):12
Marceau L, Qiu L, Vandewiele N, Charton E (2019) A comparison of deep learning performances with others machine learning algorithms on credit scoring unbalanced data. ArXiv:1907.12363
Petropoulos A, Siakoulis V, Stavroulakis E, Klamargias A (2019) A robust machine learning approach for credit risk analysis of large loan level datasets using deep learning and extreme gradient boosting. IFC Bulletins chapters, in: Bank for International Settlements (ed.), The use of big data analytics and artificial intelligence in central banking, volume 50, Bank for International Settlements
Sirignano J, Cont R (2018) Universal features of price formation in financial markets: perspectives from deep learning. Quantitative Finance 19(9):1449–1459
Sirignano J, Sadhwani A, Giesecke K (2016) Deep learning for mortgage risk. Available at SSRN 2799443. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2799443
Bussmann N, Giudici P, Marinelli D, Papenbrock J (2020) Explainable AI in fintech risk management. Frontiers in Artificial Intelligence 3:26
Guidotti R, Monreale A, Ruggieri S, Turini F, Giannotti F, Pedreschi D (2018) A survey of methods for explaining black box models. ACM Computing Surveys (CSUR) 51(5):1–42
Hardt M, Price E, Srebro N (2016) Equality of opportunity in supervised learning. Adv Neural Inf Process Syst 29
Suresh H, Guttag JV (2019) A framework for understanding unintended consequences of machine learning. ArXiv Preprint https://arxiv.org/abs/1901.10002
Gilpin LH, Bau D, Yuan BZ, Bajwa A, Specter M, Kagal L (2018) Explaining explanations: an overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA) 80–89. IEEE
Zafar MB, Valera I, Rodriguez MG, Gummadi KP (2019) Fairness constraints: mechanisms for fair classification. J Mach Learn Res 20(75):1–42
Aggarwal N (2021) The norms of algorithmic credit scoring. The Cambridge Law Journal 80(1):42–73
Hurlin C, Pérignon C, Saurin S (2021) The fairness of credit scoring models (SSRN Scholarly Paper ID 3785882). Soc Sci Res Net
Kozodoi N, Jacob J, Lessmann S (2022) Fairness in credit scoring: assessment, implementation and profit implications. Eur J Oper Res 297(3):1083–1094
Aggarwal C (2014) Instance-based learning: a survey. In: Aggarwal C (ed) Data classification: algorithms and applications. CRC Press
Aha DW, Kibler D, Albert MK (1991) Instance-based learning algorithms. Mach Learn 6(1):37–66
Bontempi G, Bersini H, Birattari M (2001) The local paradigm for modeling and control: From neuro-fuzzy to lazy learning. Fuzzy Sets Syst 121(1):59–72
Bontempi G, Birattari M, Bersini H (2002) Lazy learning: a logical method for supervised learning. In: Jain LC, Kacprzyk J (eds) New learning Paradigms in Soft Computing. Springer, Heidelberg, pp 97–136
Bottou L, Vapnik V (1992) Local learning algorithms. Neural Comput 4(6):888–900
Atkeson CG, Moore AW, Schaal S (1997) Locally weighted learning. Artif Intell Rev 11(1–5):11–73
Domeniconi C, Peng J, Gunopulos D (2002) Locally adaptive metric nearest-neighbor classification. IEEE Trans Pattern Anal Mach Intell 24(9):1281–1285
Zhang H, Berg AC, Maire M, Malik J (2006) SVM-KNN: discriminative nearest neighbor classification for visual category recognition. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2:2126–2136
Aamodt A, Plaza E (1994) Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Commun 7(1):39–59
Jo H, Han I, Lee H (1997) Bankruptcy prediction using case-based reasoning, neural networks, and discriminant analysis. Expert Syst Appl 13(2):97–108
Vukovic S, Delibasic B, Uzelac A, Suknovic M (2012) A case-based reasoning model that uses preference theory functions for credit scoring. Expert Syst Appl 39(9):8389–8395
Xu R, Nettleton D, Nordman DJ (2016) Case-specific random forests. J Comput Graph Stat 25(1):49–65
Garcia S, Derrac J, Cano JR, Herrera F (2012) Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Trans Pattern Anal Mach Intell 34(3):417–435
Leyva E, González A, Pérez R (2015) Three new instance selection methods based on local sets: a comparative study with several approaches from a bi-objective perspective. Pattern Recogn 48(4):1523–1537
Olvera-López JA, Carrasco-Ochoa JA, Martínez-Trinidad JF, Kittler J (2010) A review of instance selection methods. Artif Intell Rev 34(2):133–143
de Haro-García A, Cerruela-García G, García-Pedrajas N (2019) Instance selection based on boosting for instance-based learners. Pattern Recogn 96:106959
Bischl B, Kühn T, Szepannek G (2016) On class imbalance correction for classification algorithms in credit scoring. In: Lübbecke M, Koster A, Letmathe P, Madlener R, Peis B, Walther G (eds) Operations Research Proceedings 2014. Springer, Cham, pp 37–43
Kuncheva LI, Arnaiz-González Á, Díez-Pastor J-F, Gunn IAD (2019) Instance selection improves geometric mean accuracy: a study on imbalanced data classification. Progress in Artificial Intelligence 8(2):215–228
More A (2016) Survey of resampling techniques for improving classification performance in unbalanced datasets. https://arxiv.org/abs/1608.06048
Crone SF, Finlay S (2012) Instance sampling in credit scoring: an empirical study of sample size and balancing. Int J Forecast 28(1):224–238
Cleveland WS, Devlin SJ, Grosse E (1988) Regression by local fitting: methods, properties, and computational algorithms. J Econom 37(1):87–114
Loader C (1999) Local regression and likelihood. Springer Science & Business Media
Schaal S, Atkeson CG (1998) Constructive incremental learning from only local information. Neural Comput 10(8):2047–2084
Nadaraya EA (1964) On estimating regression. Theory of Probability & Its Applications 9(1):141–142
Watson GS (1964) Smooth regression analysis. Sankhyā: Ind J Stat Ser A 359–372
Schwarz A, Arminger G (2005) Credit scoring using global and local statistical models. In: Weihs C, Gaul W (eds) Classification—The Ubiquitous Challenge. Springer, Berlin Heidelberg, pp 442–449
Li F-C (2009) The hybrid credit scoring strategies based on KNN classifier. Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009:330–334
Harris T (2015) Credit scoring using the clustered support vector machine. Expert Syst Appl 42(2):741–750
Liu Z, Pan S (2018) Fuzzy-rough instance selection combined with effective classifiers in credit scoring. Neural Process Lett 47(1):193–202
Guo Y, Zhou W, Luo C, Liu C, Xiong H (2016) Instance-based credit risk assessment for investment decisions in P2P lending. Eur J Oper Res 249(2):417–426
Britto AS, Sabourin R, Oliveira LES (2014) Dynamic selection of classifiers—a comprehensive review. Pattern Recogn 47(11):3665–3680
Dietterich TG (2000) Ensemble methods in machine learning. In: Multiple Classifier Systems. MCS 2000. Lect Notes Comput Sci 1857:1–15. Springer, Berlin, Heidelberg
Kuncheva LI (2004) Classifier ensembles for changing environments. In F. Roli J, Kittler, T Windeatt (eds) Multiple Classifier Systems (Vol. 3077, pp. 1–15). Springer Berlin Heidelberg
Kuncheva LI (2008) Classifier ensembles for detecting concept change in streaming data: Overview and perspectives. Proceedings of the 2nd Workshop SUEMA, 2008 5–10
Cruz RM. O, Cavalcanti GDC, Ren TI (2011) A method for dynamic ensemble selection based on a filter and an adaptive distance to improve the quality of the regions of competence. The 2011 International Joint Conference on Neural Networks 1126–1133
Cruz RMO, Zakane HH, Sabourin R, Cavalcanti GDC (2017) Dynamic ensemble selection vs K-NN: why and when dynamic selection obtains higher classification performance? 2017 Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA), 1–6
García V, Marqués AI, Sánchez JS (2012) Improving risk predictions by preprocessing imbalanced credit data. In T. Huang, Z. Zeng, C. Li, & C. S. Leung (eds) Neural Information Processing 7664:68–75. Springer Berlin Heidelberg
García V, Marqués AI, Sánchez JS (2019) Exploring the synergetic effects of sample types on the performance of ensembles for credit risk and corporate bankruptcy prediction. Information Fusion 47:88–101
García V, Sánchez JS, Ochoa-Ortiz A, López-Najera A (2019) Instance selection for the nearest neighbor classifier: connecting the performance to the underlying data structure. In: Morales A, Fierrez J, Sánchez JS, Ribeiro B (eds) Pattern Recognition and Image Analysis. Springer International Publishing, pp 249–256
Kuncheva LI (2000) Clustering-and-selection model for classifier combination. KES’2000. Fourth International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies. Proceedings (Cat. No.00TH8516), 1:185–188
Soares RGF, Santana A, Canuto AMP, de Souto MCP (2006) Using accuracy and diversity to select classifiers to build ensembles. Proc Int Jt Conf Neural Netw 1310–1316
Abellán J, Castellano JG (2017) A comparative study on base classifiers in ensemble methods for credit scoring. Expert Syst Appl 73:1–10
Ala’raj M, Abbod MF (2016) Classifiers consensus system approach for credit scoring. Knowl-Based Syst 104:89–105
Ala’raj M, Abbod MF (2016) A new hybrid ensemble credit scoring model based on classifiers consensus system approach. Expert Syst Appl 64:36–55
Feng X, Xiao Z, Zhong B, Qiu J, Dong Y (2018) Dynamic ensemble classification for credit scoring using soft probability. Appl Soft Comput 65:139–151
He H, Zhang W, Zhang S (2018) A novel ensemble method for credit scoring: adaption of different imbalance ratios. Expert Syst Appl 98:105–117
Lessmann S, Baesens B, Seow H-V, Thomas LC (2015) Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. Eur J Oper Res 247(1):124–136
Melo Junior L, Nardini FM, Renso C, Trani R, Macedo JA (2020) A novel approach to define the local region of dynamic selection techniques in imbalanced credit scoring problems. Expert Syst Appl 152:113351
Marqués AI, García V, Sánchez JS (2012) On the suitability of resampling techniques for the class imbalance problem in credit scoring. J Oper Res Soc 64(7):1060–1070
Zhang H, Liu Q (2019) Online learning method for drift and imbalance problem in client credit assessment. Symmetry 11(7):890
Lasota T, Londzin B, Telec Z, Trawiński B (2014) Comparison of ensemble approaches: mixture of experts and AdaBoost for a regression problem. In N. T. Nguyen B, Attachoo B, Trawiński K, Somboonviwat (eds), Intelligent Information and Database Systems (Vol. 8398, pp. 100–109). Springer International Publishing
Masoudnia S, Ebrahimpour R (2014) Mixture of experts: a literature survey. Artif Intell Rev 42(2):275–293
Xu L, Amari S (2009) Combining classifiers and learning mixture-of-experts. In: Dopico JRD, Dorado J, Pazos A (eds) Encyclopedia of artificial intelligence. IGI Global, Hershey, PA, pp 318–326
Titsias MK, Likas A (2002) Mixture of experts classification using a hierarchical mixture model. Neural Comput 14(9):2221–2244
Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE (1991) Adaptive mixtures of local experts. Neural Comput 3(1):79–87
Cruz RMO, Sabourin R, Cavalcanti GDC (2018) Dynamic classifier selection: Recent advances and perspectives. Information Fusion 41:195–216
Liang T, Zeng G, Zhong Q, Chi J, Feng J, Ao X, Tang J (2021) Credit risk and limits forecasting in e-commerce consumer lending service via multi-view-aware mixture-of-experts Nets. Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 229–237
West D (2000) Neural network credit scoring models. Comput Oper Res 27(11–12):1131–1152
Mays E (2005) Handbook of credit scoring. Publishers Group Uk
Kennedy K, Mac Namee B, Delany SJ, O’Sullivan M, Watson N (2013) A window of opportunity: assessing behavioural scoring. Expert Syst Appl 40(4):1372–1380
Branco P, Torgo L, Ribeiro RP (2016) A survey of predictive modeling on imbalanced domains. ACM Computing Surveys (CSUR) 49(2):1–50
Ganganwar V (2012) An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering 2(4):42–47
Kaur H, Pannu HS, Malhi AK (2019) A systematic review on imbalanced data challenges in machine learning: applications and solutions. ACM Comput Surv 52(4):1–36
Rahman MM, Davis DN (2013) Addressing the class imbalance problem in medical datasets. Int J Mach Learn Comput 224–228
Sun Y, Wong AKC, Kamel MS (2009) Classification of imbalanced data: a review. Int J Pattern Recognit Artif Intell 23(04):687–719
Wang Q, Luo Z, Huang J, Feng Y, Liu Z (2017) A novel ensemble method for imbalanced data learning: bagging of extrapolation-SMOTE SVM. Comput Intell Neurosci 2017:1–11
Wang S, Minku LL, Yao X (2018) A systematic study of online class imbalance learning with concept drift. IEEE Transactions on Neural Networks and Learning Systems 29(10):4802–4821
Brown I, Mues C (2012) An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Syst Appl 39(3):3446–3453
Lewis EM (1992) An Introduction to Credit Scoring (2nd ed edition). Fair, Isaac and Co
Finlay S (2010) Credit scoring, response modelling and insurance rating. Palgrave Macmillan UK
Japkowicz N, Shah M (2011) Evaluating learning algorithms: a classification perspective. Cambridge University Press
Luque A, Carrasco A, Martín A, de las Heras A (2019) The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recogn 91:216–231
Parker C (2011) An analysis of performance measures for binary classifiers. 2011 IEEE 11th International Conference on Data Mining, 517–526
Valverde-Albacete FJ, Peláez-Moreno C (2014) 100% classification accuracy considered harmful: the normalized information transfer factor explains the accuracy paradox. PLoS ONE 9(1):e84217
Hand DJ, Anagnostopoulos C (2013) When is the area under the receiver operating characteristic curve an appropriate measure of classifier performance? Pattern Recogn Lett 34(5):492–495
Hand DJ, Anagnostopoulos C (2021) Notes on the H-measure of classifier performance. Adv Data Anal Classif 1–16
Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf Sci 180(10):2044–2064
Garcıa S, Herrera F (2008) An extension on “statistical comparisons of classifiers over multiple data sets” for all Pairwise Comparisons. J Mach Learn Res 9:18
Kvamme H, Sellereite N, Aas K, Sjursen S (2018) Predicting mortgage default using convolutional neural networks. Expert Syst Appl 102:207–217
Guégan D, Hassani B (2018) Regulatory learning: how to supervise machine learning models? An application to credit scoring. The Journal of Finance and Data Science 4(3):157–171
Kiritz N, Sarfati P (2018) Supervisory guidance on model risk management (SR 11–7) versus enterprise-wide model risk management for deposit-taking institutions (E-23): a detailed comparative analysis. Available at SSRN 3332484
Morini M (2011) Understanding and managing model risk: a practical guide for quants, traders and validators. John Wiley & Sons
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2010) RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern Part A Syst Hum 40(1):185–197
Wang H, Xu Q, Zhou L (2015) Large unbalanced credit scoring using lasso-logistic regression ensemble. PLoS ONE 10(2):e0117844
Funding
Open access funding provided by HEAL-Link Greece.
Author information
Contributions
DN implemented the models and the computational framework, analyzed the results, and prepared the manuscript. MD contributed to the design of the experimental analysis and the writing of the manuscript. All authors provided critical feedback and helped shape the research, analysis, and the manuscript.
Ethics declarations
Ethics Approval
Not applicable.
Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
Conflicts of Interest
The authors declare no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Nikolaidis, D., Doumpos, M. Credit Scoring with Drift Adaptation Using Local Regions of Competence. Oper. Res. Forum 3, 67 (2022). https://doi.org/10.1007/s43069-022-00177-1