1 Introduction

Information asymmetry has far-reaching and well-studied consequences for the operation of financial markets, such as its impact on financial inclusion, financial intermediation and financial risk; see [1,2,3,4,5]. Credit bureaus have thus emerged as a means to diminish information asymmetry and to support the efficiency of credit institutions in their decision-making processes and in tasks such as credit limit management, debt collection, cross-selling, risk-based pricing, fraud prevention, etc. [6,7,8]. Credit scoring, as a principal tool of credit bureaus to identify good prospective borrowers, began as early as 1941 [9]. However, the automated and widespread application of credit scoring did not take place until the 1980s, when computing power to perform sophisticated statistical calculations became affordable. One definition of credit scoring is “the use of statistical models to transform relevant data into numerical measures that guide credit decisions” [10]. According to Thomas et al. [11], credit scoring has been vital in the “…phenomenal growth in the consumer credit over the last five decades. Without (credit scoring techniques, as) an accurate and automatically operated risk assessment tool, lenders of consumer credit could not have expanded their loan (effectively).”

However, credit scoring models and methodologies face theoretical as well as practical issues (as encountered in the day-to-day operation of all credit bureaus):

  • As with all predictive models, credit scoring suffers from population (or concept) drift, i.e., changes in the socio-economic environment cause the underlying distribution of the modeled population to change over time [12,13,14,15,16]. To tackle this problem in practical terms, credit bureaus implement continuous monitoring cycles and periodic re-calibration or re-development of their models [10, 17, 18]. The calibration of credit scoring models, or the lack thereof, has been mentioned in the literature as one reason (among others) for the subprime mortgage crisis of 2008 [19]. Specifically, FICO scores have been shown to have become a worse predictor of default between 2003 and 2006 [20, 21]. During that period, despite the rapid and severe deterioration of subprime portfolio quality, the corresponding scores remained fairly stable [22].

  • Development of credit scoring models requires historical data of at least 1–2 years. Setting aside the monetary cost of such operations, adding the time needed to implement and put into production a new generation of models sometimes results in a gap of three or more years between the data that reflect current population dynamics and the data used to build the models. This lag between the data available at model development time and the time the models are actually put into production has become more pronounced as data are generated at an ever-increasing pace, and this acceleration places equally pressing demands on operations.

  • Moreover, as credit scoring models depend on pre-defined sets of predictor (input) variables, when their weights are updated from time to time some predictors may lose their relevance and end up with a weight of zero or close to zero. These predictors are referred to as omitted variables, and it has been shown that the omission of variables related to local economic conditions seriously biases and weakens scoring models [23].

  • Credit bureaus do not use a single scoring model (sometimes referred to as a “scorecard”) for a specific purpose (such as estimation of the probability of default), but rather split the population into various segments using either demographic or risk-based criteria. This happens for various reasons, such as data availability (e.g., new accounts versus existing customers), policy issues (e.g., different credit policies for mortgages), inherently different risk groups, etc., in order to (a) capture significant interactions between variables within a sub-population that are not statistically significant in the entire population, or cases where the relevance of predictors changes between groups [24], and (b) capture non-linear relationships (especially on untransformed data) and increase the performance of generalized linear models [24], which are even today the “gold standard” in the credit scoring industry (although to a far lesser extent than in past decades). Despite the lack of academic consensus on the effects of segmentation on scorecard performance [25], segmentation is a de facto approach throughout the credit scoring industry for another reason: robustness.

In this work, we investigate the use of local classification models for dynamic adaptation in consumer credit risk assessment, aiming to handle population drift and avoid the time-consuming endeavor of continuous monitoring and re-calibration/re-development procedures. The proposed adaptive scheme searches the feature space around each candidate borrower (“query instance”) to construct a “micro-segment” or local region of competence, using the k-nearest neighbors (kNN) algorithm. This region of competence is then exploited as a localized training set to feed a classification model for the specific individual. Such a specialized local model serves as the instrument to achieve the desired adaptation of the classification process. We compare various classifiers (logistic regression as well as ML methods such as random forests and gradient boosting trees). All the explored algorithms are trained on features extracted from a proprietary credit bureau database and evaluated in an out-of-sample/out-of-time validation setting in terms of performance measures including the AUC and the H-measure [26]. Specifically, we explore three hypotheses:

H1: Do local methods outperform their corresponding global ones?

H2: Do results using ML methods differ significantly from logistic regression in the global as well as in the local setup?

H3: Does the choice of kNN-based local neighborhoods affect model performance over choosing randomly selected regions?

The results demonstrate the competitiveness of the proposed approach compared with the established methods. Thus, our contributions can be summarized as follows:

  • Our analysis uses a real-world, pooled cross-sectional data set spanning a period of 11 years, including an economic recession, and containing 3,520,000 record-month observations and 125 variables. Adequate, real-world credit-related data are extremely scarce in the literature. In a very extensive benchmark study [27], 28 papers were surveyed in terms of the data sets used; the mean number of records/variables over all datasets was 6167/24, whereas the biggest dataset used in the study had 150,000 observations and 12 independent variables. It has also been noted in the literature that small datasets may introduce unwanted artifacts, and that models built upon them do not scale up when put into practice [28, 29].

  • Using local classification methods, there is no need for continuous calibration of the models; adaptation to concept drift is part of the dynamic and automated model building process.

  • Predictive models are always trained on the latest available data. The predictors used in the models are not fixed but are selected anew to fit the changing conditions, thus bypassing the problem of omitted variables.

  • For each query, a specialized micro-segment or region of competence is created dynamically, thus reaping the benefits of segmentation.

  • Last but not least, the proliferation of ML/artificial intelligence methods for predictive modelling has created a paradigm shift for credit scoring as well [30,31,32,33,34,35,36,37,38]. The issue of performance improvement is but one side of the discussion, the other being related to issues such as transparency, bias and fairness [39,40,41,42,43,44], which in the context of credit scoring have received special attention [45,46,47] due to statutory and regulatory constraints (cf. GDPR, EU AI Act: COM/2021/206 final). In our work, we focus on the performance aspect and compare statistical classification models against widely promoted ML methods.

The rest of this paper is organized as follows. In Sect. 2, we present the theoretical background; Sect. 3 provides a formulation of the problem; Sect. 4 describes the experimental setup and all its parameters; Sect. 5 provides the empirical results; and Sect. 6 concludes with discussion of these results and possible directions of future work.

2 Background and Related Theoretical Work

2.1 Local Classification

Usually, the classification process is a two-phase approach, separated into the processing of training and test instances:

  • Training phase: a model is constructed from the training instances.

  • Testing phase: the model is used to assign a label to an unlabeled test instance.

In global or eager learning, the first phase creates pre-compiled abstractions or models for learning tasks, which describe the relationship between the input variables and the output over the whole input domain [48]. In instance-based learning (also called lazy or local learning), the specific test instance (also called query), which needs to be classified, is used to create a model that is local to that instance. Thus, the classifier does not fit the whole dataset but performs the prediction of the output for a specific query [49,50,51,52].

The most obvious local model is a k-nearest neighbor classifier (kNN). However, there are other possible methods of lazy learning, such as locally-weighted regression, decision trees, rule-based methods, and SVM classifiers [53,54,55]. Instance-based learning is related to but not quite the same as case-based reasoning [56,57,58,59], in which previous examples may be used in order to make predictions about specific test instances. Such systems can modify cases or use parts of cases in order to make predictions. Instance-based methods can be viewed as a particular kind of case-based approach, which uses specific kinds of algorithms for instance-based classification.
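As a minimal, self-contained illustration of such a lazy learner (not drawn from this paper's setup; it uses R's built-in iris data purely for demonstration), a kNN classifier defers all modeling to prediction time:

```r
# Minimal sketch of the simplest local (lazy) classifier: no global model is
# fitted; each test point is labeled from its own neighborhood at prediction time.
library(class)

set.seed(1)
idx   <- sample(nrow(iris), 100)          # 100 training instances
train <- iris[idx, 1:4]                   # numeric features
test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]                # training labels

# Each test instance is classified by a majority vote of its k = 5 nearest
# training instances under the Euclidean distance.
pred <- knn(train = train, test = test, cl = cl, k = 5)
table(predicted = pred, actual = iris$Species[-idx])
```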

Inherent to local learning methods is the problem of prototype or instance selection, which can be defined as the search for the minimal set \(S\) in the same vector space as the original set of instances \(T\), subject to \({\text{accuracy}}(S)\ge {\text{accuracy}}(T)\), where the constraint means that the accuracy of any classifier trained with \(S\) must be at least as good as that of the same classifier trained with \(T\) [60,61,62]. Instance selection methods can be distinguished based on properties such as the direction of the search for defining \(S\) (e.g., incremental search, where the search begins with an empty \(S\)) and wrapper versus filter methods, i.e., whether the selection criterion is based on the accuracy obtained by a classifier such as kNN or does not rely on a classifier to determine the instances to be selected [60].

However, we shall distinguish instance selection from instance sampling (de Haro-Garcia et al. [63]), where the purpose is to formulate a suitable sampling methodology for constructing the training and test datasets from the entire available population. In particular, instance sampling deals with issues such as sample size and sample distribution (balancing; [64,65,66]) and has been shown to be of major importance for credit scoring due to the inherent imbalance in credit scoring data [67].

There are three primary components in all local classifiers [48, 49]:

  1. Similarity or distance function: This computes the similarities between the training instances, or between the test instance and the training instances. It is used to identify a locality around the test instance.
  2. Classification function: This yields a classification for a particular test instance using the locality identified by the distance function. In the earliest descriptions of instance-based learning, a nearest neighbor classifier was assumed, though this was later expanded to the use of any kind of locally optimized model.
  3. Concept description updater: This typically tracks the classification performance and makes decisions on the choice of instances to include in the concept description.

A specific mention shall also be made of locally weighted regression [53, 68,69,70], where the core idea lies in local fitting by smoothing: the dependent variable is smoothed as a function of the independent variables in a moving fashion analogous to a moving average. In a similar manner, kernel regression uses a kernel as a weighting function to estimate the parameters of the regression, i.e., the Nadaraya-Watson estimator [71, 72].
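For reference, in its standard textbook form the Nadaraya-Watson estimator smooths the response at a query point \(x\) as a kernel-weighted average of the observed responses, \(\widehat{m}(x)={\sum }_{i=1}^{n}{K}_{h}(x-{x}_{i}){y}_{i}/{\sum }_{i=1}^{n}{K}_{h}(x-{x}_{i})\), where \({K}_{h}(u)=K(u/h)/h\) is a kernel function (e.g., Gaussian) scaled by a bandwidth \(h\) that controls the size of the local neighborhood.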

Local classification methods have not been studied extensively in the specific context of credit scoring. Simple models such as basic kNN expectedly do not yield satisfying results [27] and thus have not drawn much interest from the academic community, nor from practitioners for that matter. Some efforts using advanced and/or hybrid methodologies, such as self-organizing maps for clustering [73], combining kNN with LDA and decision trees [74], clustered support vector machines [75], fuzzy-rough instance selection [76], and instance-based credit assessment using kernel weights [77], have shown somewhat promising results, albeit bearing in mind the issues arising from the datasets used (size, relevance, real-world applicability).

2.2 Local Regions of Competence

Ensemble methods, also known as Multiple Classifier Systems (MCS), combine several base classifiers through a conceptual three-phase process [78,79,80,81]:

  1. Pool generation, where a diverse pool of classifiers is generated,
  2. Selection, where one or a subset of these classifiers is selected, and
  3. Integration, where a final prediction is made based on fusing the results of the selected classifiers.

The selection phase can be static or dynamic. Static selection consists of selecting the base models once and using the resulting ensemble to predict all test samples, whereas in dynamic selection specific classifiers are selected for each test instance through an evaluation of their competence in the neighborhood, or more generally in a local region of the feature space where the test instance is located. Thus, the neighbors of the test instance define a local region which is used to evaluate the competence of each base classifier of the ensemble.
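To make the dynamic selection step concrete, the following is a rough, hedged R sketch (not the scheme adopted in this paper, which trains a single local model per query): the competence of each base classifier is scored on the kNN-defined region of a test instance and the locally most accurate one is selected; `pool`, `train_x`, `train_y` and `query` are hypothetical objects.

```r
# Hedged sketch of dynamic classifier selection by local accuracy.
# Assumptions: `pool` is a list of fitted models whose predict() returns class
# labels; `train_x` (data.frame of numeric features) and `train_y` (labels)
# form the validation data; `query` is a one-row data.frame.
library(FNN)

dcs_local_accuracy <- function(pool, train_x, train_y, query, k = 7) {
  # 1. Region of competence: indices of the k nearest validation instances.
  nn  <- get.knnx(data = train_x, query = query, k = k)
  roc <- nn$nn.index[1, ]

  # 2. Competence: accuracy of each base classifier on that region.
  competence <- sapply(pool, function(m) {
    mean(predict(m, newdata = train_x[roc, , drop = FALSE]) == train_y[roc])
  })

  # 3. Selection: the locally most competent classifier labels the query.
  predict(pool[[which.max(competence)]], newdata = query)
}
```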

The definition of the local region has been shown to be of importance to the final performance of dynamic selection methods [82, 83, 103] and there are papers pointing out that this performance can be improved by better defining these regions and selecting relevant instances [83,84,85,86]. One of the most common methodologies for defining local regions is kNNs (including its variations such as extended kNNs, especially for imbalanced data, which are of particular importance to credit scoring). Methods such as clustering [87, 88] can also be found in the literature.

Dynamic selection techniques in the context of credit scoring have received some attention in the literature [89,90,91,92,93,94]. In a recent paper, Melo Junior et al. [95] proposed a modification of the kNN algorithm, called reduced minority kNN (RMkNN), which aims to balance the set of neighbors used to measure the competence of the base classifiers. The main idea is to reduce the distance of the minority samples from the predicted instance. As mentioned, the imbalance of the class distribution is an important factor when considering sampling for credit scoring [64, 67, 85, 86, 93, 96, 97]. This issue becomes even more important when dynamic selection techniques are applied.

A related approach is the Mixture of Experts, which is composed of many separate neural networks, each of which learns to handle a subset of the complete set of training cases [98,99,100,101]. This method is based on a divide-and-conquer principle [102], where the feature space is partitioned stochastically into several subspaces through a specially designed error function, and “experts” become specialized in each subspace. However, only multilayer perceptron neural networks are used as base classifiers [78, 103]. Mixture of Experts has not been extensively applied in the context of credit scoring and there are only a few studies on the subject [104, 105].

3 Problem Formulation and Parameters

Assume a classification training set \(\{({\mathbf{x}}_{1}, {y}_{1}), \dots ,({\mathbf{x}}_{n},{y}_{n})\}\), \(\mathbf{x}\in {\mathbb{R}}^{d}\), \(y\in \{0, 1\}\), and let \(M\) be a global model trained on all \({\left\{\left({\mathbf{x}}_{i}, {y}_{i}\right)\right\}}_{i=1}^{n}\). The local region of competence for a given test instance \(\mathbf{x}\) (taken as its k nearest neighbors) is denoted by \({N}_{x}=\{{\mathbf{x}}_{1},{\mathbf{x}}_{2},\dots , {\mathbf{x}}_{k}\}\), and the learning set for the local classifier \({M}_{x}\) is \({\left\{\left({\mathbf{x}}_{i}, {y}_{i}\right)\right\}}_{{\mathbf{x}}_{i}\in {N}_{x}}\).

Specifically, for the credit scoring binary classification problem, \(\{{\mathbf{x}}_{i}\}\), \(i=1, \dots , n\), is considered the feature or variable space, denoting the characteristics of each borrower \(i\), and \({y}_{i}\) is the corresponding objective or target variable denoting the class label (non-default or default, sometimes also referred to as “Good” or “Bad”). Each feature vector \({\mathbf{x}}_{i}\) is observed at a point in time \({T}_{0}\), called the observation point, whereas the corresponding response \({y}_{i}\) is recorded at a subsequent performance point \({T}_{1}={T}_{0}+\tau\), where \(\tau \ge 1\) is usually defined in months. The collected input data span an observation time window (or observation window) covering the period \([{T}_{0}-{\tau }^{\mathrm{^{\prime}}}, {T}_{0}]\) (with \({\tau }^{\mathrm{^{\prime}}}\ge 1\) denoting months), whereas the outcome window refers to the period \(({T}_{0}, {T}_{1}]\) in which the class label \({y}_{i}\) is defined. In the context of behavioral credit scoring, the feature space contains variables related to the financial performance and behavior of borrowers, such as credit amounts, delinquency status, etc.

The credit scoring literature has not provided definitive answers on how to optimally define these parameters (default definition, observation window, outcome window). The recommendations in the literature vary the length of the observation and outcome windows from 6 to 24 months [8, 11, 106].

Regarding the definition of default, Anderson [10] noted that financial institutions choose between: (a) a current status definition, which classifies an account as good or bad based on its status at the end of the outcome window, and (b) a worst status approach, which uses the worst status reached at any point during the outcome window. Regulatory requirements are also of paramount importance and must be taken into consideration; for example, a 90 days past due worst status approach is commonly used in practice in behavioral scorecards and complies with regulatory requirements such as the Basel Capital Accords and the new definition of default by the European Banking Authority (EBA). Kennedy et al. [107] presented a comparative study of various values for these parameters. Their results indicated that behavioral credit scoring models using:

  • default definitions based on a worst status approach outperformed those with current status.

  • a 12-month observation window outperformed the ones with 6- and 18-month windows in combination with shorter (12 months or less) outcome windows.

  • a 6-month outcome window with a current status definition of default outperformed longer outcome windows; for the worst status approach, degradation occurs when the outcome window extends beyond 12 months.

Finally, it should also be noted that credit scoring data sets are highly imbalanced, since the objective of all financial institutions is a low-default portfolio. There are quite a few studies and approaches in the literature analyzing the impact of class imbalance on classification in general [108,109,110,111,112,113,114], as well as in the context of credit scoring [64, 67, 84, 93, 96, 115].

4 Experimental Setup and Methodology

4.1 Data and Variables

Our data set (pooled cross-sectional data) has been derived from a proprietary credit bureau database in Greece and spans a period of 11 years (2009q1 to 2019q4), resulting in a total of 44 snapshots (11 years by 4 quarters). At each snapshot, a random sample of 80,000 borrowers was retrieved with all their credit lines, including paid-off and defaulted ones, resulting in 3,520,000 record-month observations.

In total, 125 proprietary credit bureau behavioral variables were calculated at the borrower level, falling within the following dimensions:

  • Type of credit (consumer loans, mortgages, revolving credit such as overdrafts, credit cards, restructuring loans, etc.).

  • Delinquencies (months in arrears, delinquent amount, etc.).

  • Amounts (Outstanding balance, disbursement amount, credit limit, etc.).

  • Time (months since approval, time from delinquencies, etc.).

  • Inquiries made to the credit bureau database.

  • Derogatory events, such as write-offs or events from public sources such as courts.

Besides “elementary” variables such as the ones described above, other derivative/combinatory variables along various dimensions were calculated, such as various ratios (e.g., the ratio of delinquent balance over current balance during the last \(X\) months for a specific type of credit line), utilizations and the rate of their increase or decrease over a specific time window (e.g., consecutive increases over the last \(X\) months), giving a total of 125 variables.

4.2 Scoring Parameters

Our scoring parameters are defined as follows:

  • Observation window: Time windows of 12 months prior to each observation point \({T}_{0}\). Our initial observation point was 2009q1, followed by every subsequent quarter up to 2018q4.

  • Scorable population: At each observation point \({T}_{0}\), the following cases are excluded from the analysis: (a) borrowers already having a delinquency of 90 days past due (dpd) or more at \({T}_{0}\), (b) cases lacking sufficient historical data, i.e., less than 6 months of credit history, and (c) credit cards with an inactive balance within the observation window. The remaining observations constitute the scorable population for the specific \({T}_{0}\). The last \({T}_{0}\) is taken at 2018q4.

  • Outcome window: a 12-month window after the observation point. For each observation point \({T}_{0}\), the period \(({T}_{0}, {T}_{1}]\) with \({T}_{1}={T}_{0}+12\) months is used as the outcome window. Thus, the last \({T}_{1}\) is taken at 2019q4.

  • Default definition: The labeling of the scorable population at \({T}_{0}\) either as GOOD = 0 (majority class) or BAD = 1 (minority or “default” class), depending on the information available during the outcome window, takes place using a worst status approach, i.e., the maximum (worst) delinquency over all accounts, or the occurrence of a new derogatory event, is measured for the specific outcome window. Thus, the corresponding classes are defined as: (a) \(y=1\) for cases with worst delinquency \(\ge\) 90 dpd or a derogatory event occurring during the outcome period, and (b) \(y=0\) for all other cases (a minimal sketch of this labeling rule is given right after this list).
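As referenced above, a minimal, hedged sketch of this labeling rule follows; the column names `worst_dpd_outcome` and `derog_event_outcome` are hypothetical placeholders for the corresponding bureau variables measured over the outcome window.

```r
# Hedged sketch of the worst-status default definition over the outcome window.
# `snap` is assumed to be a data.frame of scorable borrowers at T0 containing:
#   worst_dpd_outcome   - worst delinquency (days past due) over all accounts in (T0, T1]
#   derog_event_outcome - TRUE if a new derogatory event occurred in (T0, T1]
label_default <- function(snap) {
  snap$y <- ifelse(snap$worst_dpd_outcome >= 90 | snap$derog_event_outcome,
                   1L,   # BAD  (minority / "default" class)
                   0L)   # GOOD (majority class)
  snap
}
```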

4.3 Methodology

Our approach is based on training local and global classifiers on the same sample and comparing their performance. Local classifiers are trained for each instance \(\mathbf{x}\) of the test data set of each snapshot, using the feature space defined by its neighborhood or region of competence within the training data set. A local model \({M}_{x}\) is then used to predict the probability and the class label of the specific instance for which it was trained. Correspondingly, global classification models are trained on the entire training set and then used to predict the class probabilities of each instance in the test data set. To better simulate a real-world scenario, we retrain the global classifiers every 2 years. The classifiers used both in the global and in the local scheme are logistic regression, random forests (RF), and extreme gradient boosting machines (XGB). The choice of these specific ML models was based on recent credit scoring literature, where they seem to be on par with or to outperform other machine learning and deep learning methods [32]. Specifically, Gunnarsson et al. [33] found that XGBoost and RF outperformed deep belief networks (DBN), Hamori et al. [34] found XGB to be superior to deep neural networks (DNN) and RF, Marceau et al. [35] found that XGB performed better than DNN, and Addo et al. [30] concluded that both XGB and RF outperform DNN.

For the implementation we used Microsoft R Open v3.5.1 and the corresponding R libraries: speedglm 0.3–2, randomForest 4.6–14 and xgboost 0.71.2. In all cases, default parameter values were used and no hyper-parameter optimization was performed other than that performed internally by the methods.

During the training phase, the input data have been pre-processed using an expert-based process flow to:

  • handle missing values, by excluding variables with more than 70% missing values and filling the remaining blanks with a constant (since the variables are missing at random (MAR), in this work we use −1 as the constant value),

  • retain only the useful variables, by removing those with zero variance or near zero variance,

  • isolate non-correlated variables using a correlation exclusion threshold of 0.7, and

  • select the most discriminative among the remaining variables using the Information Value (IV) criterion. The exclusion thresholds were selected to match a practitioner’s rule mentioned in the literature [18], whereby a variable is removed if its IV is lower than 0.3 or greater than 2.5 (a minimal sketch of this pre-processing flow is given after this list).
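As referenced above, a minimal sketch of this pre-processing flow is shown below; it is a hedged illustration (not the proprietary expert-based implementation), relying on the caret package for the near-zero-variance and correlation filters, while `information_value()` is a hypothetical helper standing in for the IV computation.

```r
# Hedged sketch of the expert-based pre-processing flow.
# `X` is assumed to be a data.frame of numeric candidate predictors, `y` the 0/1 target.
library(caret)

preprocess_features <- function(X, y) {
  # 1. Missing values: drop variables with > 70% missing, fill the rest with -1 (MAR).
  X <- X[, colMeans(is.na(X)) <= 0.70, drop = FALSE]
  X[is.na(X)] <- -1

  # 2. Remove zero and near-zero variance variables.
  nzv <- nearZeroVar(X)
  if (length(nzv) > 0) X <- X[, -nzv, drop = FALSE]

  # 3. Keep non-correlated variables (exclusion threshold 0.7).
  high_cor <- findCorrelation(cor(X), cutoff = 0.70)
  if (length(high_cor) > 0) X <- X[, -high_cor, drop = FALSE]

  # 4. Keep variables whose Information Value falls in the accepted band.
  iv <- sapply(X, function(v) information_value(v, y))   # hypothetical helper
  X[, iv >= 0.3 & iv <= 2.5, drop = FALSE]
}
```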

Finally, as noted in Sect. 3, credit scoring data are inherently imbalanced. In our case, the imbalance is also observed in the regions of competence, which are used to build the local classification models. This inevitably leads, in some cases, to non-convergence errors when local logistic regression is used as the classification algorithm and the local region of competence contains too few minority class (default) cases for the algorithm to converge. In our experiments we found this non-convergence error to occur on average in 1.9% of all executions. To address the non-convergence issue, in this work we use a simple heuristic rule: whenever the logistic regression algorithm fails to predict a class label for a test instance, the algorithm assigns the majority class of the test instance’s region of competence (sketched below).
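The heuristic can be sketched as follows (a hedged illustration rather than the exact production code; `region` denotes the local training set with a 0/1 column `y`, `x_query` the test instance, and the 0.5 cutoff is an assumption for illustration):

```r
# Hedged sketch of the fallback rule for local logistic regression.
# If glm() errors or warns (e.g., does not converge) on a region of competence
# with too few defaults, fall back to the region's majority class.
predict_local_lr <- function(region, x_query) {
  out <- tryCatch({
    M_x <- glm(y ~ ., data = region, family = binomial())
    pd  <- unname(predict(M_x, newdata = x_query, type = "response"))
    list(pd = pd, label = as.integer(pd >= 0.5))
  }, error = function(e) NULL, warning = function(w) NULL)

  if (is.null(out)) {
    maj <- as.integer(mean(region$y) >= 0.5)   # majority class of the region
    out <- list(pd = NA_real_, label = maj)
  }
  out
}
```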

4.4 Local Classification

As detailed below, for each snapshot, the k-nearest neighbors (kNN) algorithm is used to define the local region of competence \({N}_{x}\) for each test instance \(\mathbf{x}\). A local model \({M}_{x}\) is trained on this specific region \({N}_{x}\), which serves as the instrument to achieve the desired adaptation of the classification process. Figure 1 shows the overall flow of the proposed scheme:

Fig. 1. High-level flow for the proposed local classification scheme (|S| denotes the cardinality of a set S)

The setup procedure is as follows: for each snapshot, the scorable population is defined as a random set (of 80,000 instances) sampled without replacement from the total population, and the resulting data set is separated through a 50–50 split into training and test sets, forming the training and test sub-spaces of the original feature space. The local region of competence for each test instance is then determined using the Euclidean distance as the distance metric. Such a region of competence serves as a borrower-specific localized training set that will be used to build a local classification model for that borrower.

Regarding the size of the \(k\) parameter required by the nearest neighbors algorithm, it is worth noting a common rule of thumb of selecting 1500 to 2000 examples per class, dating from the very beginning of credit scoring model development [116] and mentioned in many works thereafter [18, 24, 117]. Although the subject has not been extensively researched, recent academic studies point in the direction that larger samples can improve the performance of linear models [67, 117], but there seems to be a plateau after 6000 goods/bads and almost no further benefit above 10,000. As a result, aiming to evaluate both claims, in this work we selected a \(k\) parameter that ranges from 2000 to 6000 examples (\(k\in \{2000, 4000, 6000\}\)). The resulting region of competence is used to train a local classification model, \({M}_{x}\), which is specialized for the corresponding test instance/borrower. In this study, local classification models are built using the classification algorithms considered in the analysis (i.e., logistic regression, random forests, gradient boosting trees). Figure 2 depicts the training phase of the proposed scheme (pre-processing refers to the flow described in Sect. 4.3).

Fig. 2. Training phase for the proposed local classification scheme (|S| denotes the cardinality of a set S)

Each local classification model \({M}_{{x}_{i}}\), built for test instance \({\mathbf{x}}_{i}\) on its specific region of competence \({N}_{{x}_{i}}\), \(i \in \{1,\dots ,|TS\_L|\}\) (where \(|TS\_L|\) denotes the number of data points in the test set of snapshot \(L\)), is used to predict the probability of default (PD) for the considered test instance/candidate borrower and to assign a GOOD or BAD class label. The predictions are then compared to the actual labels available for the test instances in order to assess performance.
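Putting the pieces together, the per-snapshot flow of Figs. 1 and 2 can be sketched roughly as follows (a simplified, hedged illustration with logistic regression as the local learner; `snapshot` is assumed to be an already pre-processed data.frame with numeric features and a 0/1 column `y`):

```r
# Hedged sketch of the per-snapshot local classification flow (LR as local learner).
library(FNN)

run_local_lr_snapshot <- function(snapshot, k = 2000, seed = 1) {
  set.seed(seed)
  # 50-50 split into training and test sub-spaces.
  idx   <- sample(nrow(snapshot), floor(nrow(snapshot) / 2))
  train <- snapshot[idx, ]
  test  <- snapshot[-idx, ]
  feat  <- setdiff(names(snapshot), "y")

  # Regions of competence: the k nearest training instances (Euclidean distance)
  # for every test instance, computed in a single call.
  nn <- get.knnx(data = train[, feat], query = test[, feat], k = k)

  # One local model M_x per test instance, trained only on its region N_x.
  pd <- vapply(seq_len(nrow(test)), function(i) {
    N_x <- train[nn$nn.index[i, ], ]
    M_x <- glm(y ~ ., data = N_x, family = binomial())
    unname(predict(M_x, newdata = test[i, ], type = "response"))
  }, numeric(1))

  data.frame(y = test$y, pd = pd)
}
```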

4.5 Global Classification

As a baseline to benchmark our proposed local classifiers, we implement and evaluate a standard credit scoring classification scheme commonly used by the scientific community and practitioners alike. In the global classification approach, adaptation to population drift is achieved by retraining the models using new data from the corresponding snapshot. Figure 3 shows the overall flow of the global scheme.

Fig. 3. Global classification scheme (|S| denotes the cardinality of a set S)

It should be noted that, in order to have a realistic, real-world comparison of model performance, we re-train our global models every two years (as retraining is applied in practice to all commercial credit scoring models). The performance of the global models over all snapshots would degrade significantly if they were trained only once on the initial snapshot data (indicatively: mean AUC = 0.8213 with standard deviation = 0.04 for global LR models when training took place only at the first snapshot, 2009q1, versus mean AUC = 0.8746 with standard deviation = 0.014 when the global LR is re-trained every 2 years).

4.6 Performance Measures and Comparison of Classifiers

There is keen interest in the scientific research community regarding the appropriateness of the established performance measures used to evaluate classification models, especially those used in credit scoring applications, also considering the inherent imbalance of credit scoring datasets [118,119,120]. Specifically, the credit scoring setup gives rise to methodological problems such as the accuracy paradox [121] and the different misclassification costs between type I and type II errors [26]. As a result, the most common approach avoids accuracy as a scorecard performance metric, adopting instead measures such as the area under the ROC curve (AUC), the GINI index, the Kolmogorov–Smirnov distance or the F-measure. However, in the literature there has been skepticism over their appropriateness, especially of the widely used AUC measure [122]. A coherent alternative, namely the H-measure [26, 122, 123], has been proposed in the literature, which handles different misclassification costs and is indicated to be a better suited performance metric for the credit scoring context [120]. Thus, in this work, we use both the AUC and the H-measure (using the default values for the parameters in the calculation of the H-measure, as defined in the corresponding R package).
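For completeness, both measures can be computed along the following lines (a hedged sketch using the hmeasure R package with its default parameters, as in our setup; `y_true` and `scores` are hypothetical vectors of actual 0/1 labels and predicted PDs):

```r
# Hedged sketch: computing the AUC and the H-measure for one snapshot's test set.
library(hmeasure)

evaluate_scores <- function(y_true, scores) {
  res <- HMeasure(true.class = y_true, scores = scores)  # default parameters
  c(AUC = res$metrics$AUC, H = res$metrics$H)
}
```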

Comparisons among several classification algorithms on several datasets arise in machine learning when a newly proposed algorithm is compared with the existing state of the art. From a statistical point of view, the correct way to deal with multiple hypothesis testing is to first compare all the classification algorithms together by means of an omnibus test, to decide whether all the algorithms have the same performance. Then, if the null hypothesis is rejected, we can compare the classification algorithms in pairs using post hoc tests. In these kinds of comparisons, common parametric statistical tests such as ANOVA are generally not adequate as the omnibus test. The arguments are similar to those against the use of the t-test: the scores are not commensurable among different application domains and the assumptions of the parametric tests (normality and homoscedasticity in the case of ANOVA) are hardly fulfilled [124,125,126]. In this paper we use the non-parametric Friedman’s aligned ranks test and the Nemenyi post hoc test. Non-parametric tests were selected because the underlying data distribution is not known. Since multiple classifiers are compared, the Nemenyi test is selected for the pairwise comparisons among scheme and algorithm combinations, as proposed by Demsar [124]. Furthermore, Friedman’s aligned rank test is utilized to correct the p values for multiple testing.
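Schematically, the omnibus step operates on a snapshots × classifiers performance matrix; the sketch below uses base R's (unaligned) Friedman test for brevity, with `perf` a hypothetical matrix, whereas the aligned-ranks variant and the Nemenyi post hoc test were taken from their corresponding R implementations:

```r
# Hedged sketch of the omnibus comparison: rows = snapshots (blocks),
# columns = classifier/scheme combinations, cells = AUC or H-measure values.
# perf <- matrix(..., nrow = 44, dimnames = list(NULL, classifier_names))
friedman.test(perf)
# If the null hypothesis of equal performance is rejected, pairwise post hoc
# comparisons (Nemenyi) with p-value correction follow, as described above.
```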

5 Empirical Results

To tackle the hypothesis regarding the superiority of local models over their global counterparts, we started by examining whether the size of the local region impacts classification performance. Figure 4 summarizes the performance of local LR models for various values of \(k\), whereas Table 2 in the Appendix provides the detailed results over all snapshots.

Fig. 4. Average performance for local LR on different k = {2000, 4000, 6000}

As evidenced, the choice of \(k\) does not have a significant impact on the performance of logistic regression. Specifically, we observe that when using the H-measure, the performance results are slightly and non-significantly decreasing as \(k\) increases (mean = 0.6360, 0.6298, 0.6270 for \(k\) = 2000, 4000, 6000, correspondingly), whereas the opposite holds when using the AUC as the performance measure (mean = 0.9256, 0.9259, 0.9265 for the corresponding \(k\)’s). Thus, for the rest of our process we choose \(k\) = 2000 for the local models, since model performance is not significantly affected, whereas computational performance and memory requirements are considerably improved with lower \(k\)’s.

Comparing visually the results of the local classifiers with their corresponding global ones, we get a mixed picture (see Tables 3 and 4 in the Appendix for detailed results): whereas local LR models outperform their global counterparts, for XGB and RF the differences between global and local classifiers do not appear to be significant (Fig. 5).

Fig. 5. Pairwise visual comparison between local/global classifiers (different y-axis scales, * = training snapshot for global classifiers; LR = logistic regression, RF = random forest, XGB = gradient boosting; solid blue line denotes the local classifier, red line with markers the global classifier)

To test for statistical differences between all classifiers (i.e., the case of multiple methods on multiple data sets, as noted in [124]), we use Friedman’s aligned rank test [125] to assess all the pairwise differences between algorithms and then correct the p values for multiple testing (Fig. 6 visualizes the results in matrix format). We observe that in both measures (AUC and H-measure) LR-G differs significantly from all other classifiers. Going into more detail, in the AUC-based matrix two “clusters” of classifiers emerge for which the null hypothesis of equal performance cannot be rejected: (a) XGB-G, RF-G, RF-L_2k and (b) LR-L_2k and XGB-L_2k. For the H-measure-based p value matrix, the analogous “clusters” are: (a) RF-L_2k, RF-G and (b) XGB-G, XGB-L_2k, LR-L_2k. Thus, there seems to be an “interlacing” between the performance of all ML models (both local and global) and LR-L_2k which cannot be statistically rejected, strengthening the evidence that local models are at least on par with their global counterparts. Especially for LR-L, it is clearly evidenced that it outperforms LR-G with statistical significance.

Fig. 6. p values of the pairwise differences from Friedman’s aligned rank test (LR = logistic regression, RF = random forest, XGB = gradient boosting, L = local classifier, G = global classifier, 2k = 2000 for kNN)

As a next step, we use the Nemenyi post hoc test, which is designed to check the statistical significance of the differences in the average ranks of a set of predictive models. In the resulting critical distance (CD) graph (Fig. 7), the horizontal axis represents the average rank position of the respective model. The null hypothesis is that the average ranks of each pair of predictive models do not differ with statistical significance of 0.05. Horizontal lines connect the models for which we cannot exclude the hypothesis that their average ranks are equal. Any pair of models whose lines are not connected with a horizontal line can be seen as having average ranks that differ with statistical significance. On top of the graph a horizontal line is shown with the required difference between the average ranks (known as the critical distance or critical difference) for a pair of models to be considered significantly different.

Fig. 7. Critical distances between local and global classifiers (LR = logistic regression, RF = random forest, XGB = gradient boosting, L = local classifier, G = global classifier, 2k = 2000 for kNN)

Thus, it is further evidenced that local LR consistently and statistically significantly outperforms global LR, although the same conclusion does not seem to hold for RF and XGB, despite the minor difference in favor of the local methods when comparing average performance. This becomes more apparent upon examining the average AUC and H-measure over all snapshots (Fig. 8).

It is also noteworthy that, although RF outranks XGB (in all cases; differences not statistically significant), the performance of local LR does not differ statistically from the ML algorithms, in contrast to global LR, which is vastly outranked and outperformed. The gain, when comparing these classifiers to the “baseline” global LR, is within the range of 6–8% (Table 1), which is well within the empirical range observed in other studies [32] comparing ML algorithms to basic logistic regression in credit scoring.

Table 1 Gain in AUC/H-measure with respect to LR-G (LR = logistic regression, RF = random forest, XGB = gradient boosting, L = local classifier, G = global classifier, 2k = 2000 for kNN)

Finally, to examine whether the choice of a specific local region based on kNN versus random sub-sampling plays a role in performance, we trained a series of models, LR-L_2k_rnd, where for each test instance \(\mathbf{x}\) its local region \({N}_{x}\) is a set of randomly selected training cases, instead of being defined through the kNN scheme. Detailed results are provided in Table 5 (Appendix), whereas Fig. 9 highlights the fact that selecting local regions through kNN does make a difference, yielding a performance gain with respect to a random choice of regions. It should be noted here that the performance of LR-L_2k_rnd appears similar to that of the global LR-G. This is no surprise, since the attributes of a random sample are, by construction, more similar to those of the overall population from which the sample is drawn than to those of a sub-region with specific characteristics.
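The random-region baseline differs from the kNN scheme only in how the local training set is drawn; schematically (a hedged sketch reusing the hypothetical objects of Sect. 4.4):

```r
# Hedged sketch: region of competence drawn uniformly at random instead of via kNN;
# everything else in the local flow (local glm, prediction) stays the same.
random_region <- function(train, k = 2000) {
  train[sample(nrow(train), k), ]
}
```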

Fig. 8. Average performance over 44 snapshots (different y-axis scales) (LR = logistic regression, RF = random forest, XGB = gradient boosting, L = local classifier, G = global classifier, 2k = 2000 for kNN)

Fig. 9. kNN vs random regions (different y-axis scales) (LR = logistic regression, G = global classifier, 2k = 2000 for kNN, * = training snapshot for global LR)

6 Conclusions and Future Work

The development of reliable models for credit scoring remains a challenge for researchers and practitioners. Technological advances in ML/AI provide new capabilities in this field, enabling the exploitation of large amounts of data. However, as conditions in the economic and business environment are in constant change, credit scoring models require regular updating. Motivated by this observation, this paper presented an adaptive behavioral credit scoring scheme which uses online training to provide estimates of the probability of default on an instance-specific basis.

Going back to our research hypotheses we can draw our conclusions:

H1: With respect to the potential gain of local methods vis-à-vis their global counterparts, our results clearly indicate that local logistic regression outperforms and outranks the baseline global logistic regression. This does not seem to hold for the ML methods we used (RF and XGB), where the differences between local and global models are not statistically significant.

H2: Concerning the superiority of ML methods over the baseline LR-G, our results fall within the range of performance improvement of 2–8% observed in various credit scoring applications of ML/AI found in the literature [30,31,32,33,34, 127]. However, it is quite important to observe that the performance of local LR is on par with RF and XGB.

H3: Finally, our analysis clearly indicates that the performance of a local model is affected by the selection of a region of competence with characteristics similar to those of the queried test instance. A random selection of points from the feature space provides inferior results compared to the kNN approach adopted in this study.

Taking into consideration the volume of the real-world data used and the extensive out-of-sample validation performed, thus safeguarding against overfitting, our work clearly indicates that using local LR methods can provide real-time adaptation, thereby offering a solution to the problem of population drift and the need for continuous re-calibration (which holds for LR and ML models alike), while yielding results comparable to complex state-of-the-art ML algorithms. Additionally, LR per se is not a “black box” model, which is extremely beneficial for regulatory purposes. However, dealing with the complexities of model risk management and governance [128,129,130] in the case of real-time, adaptive local models may pose equal or even greater challenges for their practical application.

Another issue that warrants further examination is the reason why the tested ML methods do not benefit from the same local regions as LR does. One possible answer points to the intrinsic way RF and XGB work, exploiting combinations of predictors within the feature space and thus already capturing the specific dynamics of a sub-region. This needs to be examined further.

Further work can also be performed towards the direction of:

  • exploring advanced balancing techniques such as SMOTE [131] or RUSBoost [132] for local sampling, considering the highly imbalanced nature of credit datasets [64, 93], where balancing may affect not only performance in terms of misclassification errors but also the non-convergence errors encountered when using local LR,

  • usage of penalized methods such as LASSO or Ridge [113, 133],

  • usage of different distance metrics (e.g., Manhattan or Mahalanobis) or even different algorithms for choosing local regions instead of basic kNN, such as reduced minority kNN [95].