1 Introduction

Bankruptcy risk assessment is crucial for firms, and advanced statistical models aid in predicting business failures. Credit risk management has evolved with financial innovations, necessitating the development of internal rating models aligned with regulatory guidelines (King et al., 2024; Hasnain et al., 2022). Credit risk assessment and default prediction are interconnected, with bankruptcy being a legal process sought by debtors unable to repay their debts (Alhammadi et al., 2024). Machine learning techniques and statistical modeling enhance default prediction accuracy (Rahmani et al., 2024). Business failures affect the wider economy, prompting the need for effective default prediction models (Alvi et al., 2024). Loan portfolio models and the Internal Ratings-Based approach assist commercial banks in managing loan portfolios. The Basel III Accord and post-financial-crisis challenges have led to the development of new models that consider financial and macroeconomic indicators (Altman et al., 2004; Beaver et al., 2005).

Default prediction models incorporating financial and non-financial information play a crucial role in assessing credit risk for banks and financial institutions (Balasubramanian et al., 2019). Improving classification performance in credit scoring is essential for profitability (Malhotra et al., 2020). The failure of companies such as WorldCom and Lehman Brothers had a significant negative impact on the economy (Kim et al., 2022). Various models have been developed to predict business failures, but contextual and methodological limitations persist (Jones, 2023). Financial distress prediction models based on machine learning offer early indications of potential failure (Kato & Nakamura, 2024). Machine learning methods can accurately predict asset prices and credit ratings (understood here as credit scores). Developing quantitative models based on past financial data is a cost-effective solution for default prediction (Baek, 2023).

Financial distress is a state of financial difficulties faced by companies due to various factors. It involves different phases of distress, ranging from mild to severe (Florez-Lopez & Ramon-Jeronimo, 2015). The objective is to overcome the difficulties and avoid bankruptcy by analyzing the root cause and implementing a recovery plan. Monitoring and adjusting the plan as needed is crucial (Abellán & Castellano, 2017). Managing financial distress requires careful attention to the company’s financial condition, performance, and actions taken at different stages (Outecheva, 2007). Corporate failure involves financial distress leading to bankruptcy, which can be preceded by warning signs (Altman, 1968). Liquidity disorder is a common cause, and insolvency occurs when the company cannot meet its obligations. Companies may recover through debt restructuring. The severity of distress determines appropriate actions (Jarrow, 2005).

Financial distress has wide-ranging effects beyond the distressed firm itself, impacting investors, competitors, and other entities. The consequences can be categorized into four areas: costs borne by the defaulted firm, costs borne by investors, compensation from other entities, and costs incurred by creditors and stakeholders (Branch, 2002). Bankruptcy costs can be measured as the actual cost to the company, the impact on customers and stakeholders, the compensation required from other companies, and the overall economic impact. Financial distress can disrupt supply chains, cause job losses, and harm the economy, as seen in cases like Enron and WorldCom. Internal factors such as inefficiencies and limited resources, particularly in small businesses, contribute to financial distress (Graham & Harvey, 2001). It is crucial for companies to identify the causes and take proactive measures to avoid financial ruin (Weitzel & Johnson, 1989). Factors contributing to financial distress include insufficient funding, exchange rate fluctuations, unstable management, high interest rates, declining demand, lack of regulations, competitive innovations, and poor financial practices (Jahur & Quadir, 2012).

The primary objective of this research is to bridge a significant gap in the existing literature by providing a detailed analysis of credit risk assessment for non-financial listed companies in Pakistan, a topic that, to the best of our knowledge, has seldom been explored with such depth in this region. Leveraging advanced machine learning algorithms, the study selects efficient features (financial ratios) that closely mirror reality, supporting the predictive accuracy of default models. It tests these ratios through scenario and sensitivity analysis (the default definitions proposed in the methodology), evaluates the performance of various machine learning classifiers for accuracy and predictive power, and develops a bespoke scorecard for each company, which we regard as the most novel contribution in the context of Pakistan, namely deriving probability of default (PD) estimates from machine learning (black-box) algorithms, as supported by Sigrist et al. (2023). This innovation allows the creation of tailored rating scorecards using the proposed financial ratios and their assigned weights (a distinguishing feature of this article is the identification of these weights together with back-testing), further validated through comprehensive stress testing for model calibration and recalibration, as suggested by Bequé et al. (2017). A cornerstone of this research is the establishment of a tested and efficient internal ratings-based (IRB) system, a practical contribution that enables any stakeholder to use our model to obtain real-time PDs and thus make better-informed investment decisions. To the best of our knowledge, no such detailed and holistic research effort has been undertaken in Pakistan before, underlining the uniqueness and potential impact of this study in advancing practical risk assessment models and providing valuable tools for investors and industries to manage credit risk effectively.

The problem addressed in this research is the lack of accurate credit default prediction models for non-financial listed companies in Pakistan. The existing deficiency in early warning signs in the non-financial sector leads to insolvency and financial distress, negatively impacting the economy and contributing to higher unemployment rates (Aggarwal et al., 2020). To fill this gap, the study aims to develop, validate, calibrate and recalibrate credit default prediction models using machine learning algorithms (Sun et al., 2021). These models will enable stakeholders, including decision-makers, management, lenders, and creditors, to anticipate and prevent business failures effectively (Chopra & Bhilare, 2018). By providing insights to regulatory bodies, commercial banks, and financial institutions, the research aims to enhance credit risk assessment and management processes, ultimately leading to improved financial stability and reduced risk in the financial sector (Farooq et al., 2018).

This research introduces four core novelties in predicting financial defaults in Pakistan. First, it pinpoints the most predictive features for defaults. Second, it identifies the algorithm with the highest accuracy among default prediction models. Third, it underscores the profound impact of the assumptions used in labeling firms as defaulting. Lastly, it constructs scorecards from the best features extracted by the ML algorithms and aligns them with the Basel III guidelines for model calibration. Using a simulation based on Basel III guidelines, the research aids in assessing the risk of non-financial listed firms in Pakistan, subsequently guiding informed loan decisions. The proposed model is instrumental for finance managers in assessing customer financial health, supporting profitable investments, and bolstering job security. By ensuring timely default predictions, stakeholders can preempt business failures. Overall, this study serves as a guide for crafting reliable default prediction models, benefiting a wide spectrum of financial institutions and regulatory bodies while emphasizing the unique contributions of the research.

This study faces limitations in its scope, notably in not testing other advanced machine learning algorithms like CatBoost, XGBoost, and LightGBM due to time and resource constraints. Furthermore, it does not incorporate macroeconomic, corporate governance, and ESG indicators, nor additional financial ratios that could potentially increase the model’s predictive power, primarily because of data accessibility issues. The research also omits analysis of listed financial institutions, unlisted companies, and SMEs in Pakistan, attributed to challenges in obtaining relevant data.

The remainder of the study is organized as follows. Section 2 explores the literature gap related to AI-based prediction models and justifies the research objectives. Section 3 explains the research methodology, including the tools and techniques used to achieve the objectives. Section 4 presents the results obtained from applying the methodology, Section 5 discusses them, and Section 6 concludes the article.

2 Literature Review

The ex-ante theory of financial distress involves predicting and analyzing the factors and decisions preceding a company’s financial difficulties. It emphasizes assessing risks and returns before distress occurs, focusing on early indicators and preemptive actions (Farooq et al., 2018).

Beaver (1966) introduced discriminant theory, which quickly gained popularity for predicting failure. Altman (1968) made significant contributions by introducing the Z-score model and testing it as an effective measure of corporate solvency. He also highlighted the usefulness of the logit and probit models of the 1980s and 1990s, which relaxed the assumptions of multivariate discriminant analysis (MDA). Ohlson (1980) proposed an alternative approach to predicting financial failure that relaxed the assumptions of MDA; his model offered valuable insights and has influenced subsequent research in the field.

The inclusion of market variables (Aggarwal et al., 2018), financial ratios, and qualitative variables in predicting financial failures has been a common practice due to their sensitivity and relevance in intelligent models (Agarwal & Taffler, 2008; Altman, 1968; Bauer & Agarwal, 2014; Gilbert et al., 1990; Ravi Kumar & Ravi, 2007; Kaur & Aggarwal et al., 2023). Moreover, the incorporation of corporate governance indicators has proven to significantly impact firm performance (Lee & Yeh, 2004; Liang et al., 2016; Lin, Liang, & Chu, 2010).

Several prominent studies have contributed to the literature on predicting financial failures. Nam et al. (2008) employed a range of techniques, including discriminant analysis, logit analysis, probit analysis, regression trees, k-nearest neighbor, rough sets, and classification theory. These models have been widely recognized and utilized in the field.

Several studies have been conducted to predict credit risk default and financial distress in various contexts. Agrawal and Maheshwari (2019) focused on logistic regression-based multiple discriminant analysis in Indian firms and identified the industry index as a significant variable for predicting defaults. Obradović et al. (2018) found that models incorporating both financial and non-financial variables achieved higher prediction accuracies compared to models using only financial variables.

In terms of methodology, different approaches have been explored. Bai et al. (2019) utilized rough-set theory and fuzzy C-means clustering to determine creditworthiness in the Chinese agricultural industry, emphasizing skill-based traits and education as important variables. On the other hand, Khoja et al. (2019) and Aggarwal et al. (2019) emphasized the significance of considering local macroeconomic indicators and avoiding single-focused evaluations to reduce insolvency risk in the UK, US, GCC, BRICS, and G7 countries (Kaur, 2024).

Regarding the analysis of corporate governance and financial distress, Fernando et al. (2020) used the panel logit model to investigate the relationship in the USA and Sri Lanka, achieving a high average accuracy for insolvency prediction. In contrast, Maina (2020) focused on commercial banks in Kenya and highlighted the unique role of corporate governance and supervision frameworks in determining financial distress.

Recent research in credit risk prediction and bankruptcy analysis has employed various models and techniques to enhance accuracy and effectiveness. Machine learning-based models, such as ensemble models, clustering and consensus stages, and random forest algorithms, have been shown to be prevalent and effective in predicting default and credit risk. Studies by Grishunin et al. (2022) and Tang et al. (2019) have highlighted the success of machine learning approaches using financial and macroeconomic indicators.

In addition to machine learning, other models have also demonstrated strong predictive capabilities. Research by Lin et al. (2019) emphasized the effectiveness of wrapper-based feature selection methods, while Zhou et al. (2015) and Alaka et al. (2018) employed big data analytics to improve default risk prediction in construction firms. Furthermore, Khemakhem and Boujelbene (2018) highlighted the importance of financial ratios and non-financial data for accurate default prediction. These studies collectively show the importance of incorporating different variables, methodologies, and models to gain comprehensive insights into credit risk and bankruptcy prediction.

In the field of credit risk and default prediction, various studies have been conducted to explore the effectiveness of different statistical modeling techniques. Abdou et al., (2019) investigated the use of logistic regression, discriminant analysis, probabilistic neural network, and multi-layer feedforward neural network in the Indian banking sector. Their research revealed that sophisticated credit risk scoring models like probabilistic neural networks could help mitigate 14% of credit risk defaults. Additionally, they highlighted the importance of demographic-based variables in determining default timing.

In addition to these key contributions, other studies have explored nonparametric models and techniques such as rough set theory (Yeh et al., 2010), decision tree (Chen, 2013), support vector machines, and k-nearest neighbor (West, 2000). Hybrid methods, such as advanced ensemble techniques, have also been employed to enhance the performance of existing models (Choi et al., 2018).

In Pakistan, the absence of early warning signs in non-financial firms has led to insolvency, financial distress, and an increase in unemployment (Hasnain et al., 2022). Various models, including machine learning classifiers like Neural Networks, Decision Trees, and Support Vector Machines, have been proposed to predict credit default, with potential for improved accuracy (Grishunin et al., 2022). However, determining the most accurate model remains a challenge (Halim, et al., 2022).

This research aims to identify early signs of financial problems using an appropriate machine learning model to prevent business failures (Ranawat & Chakraborty, 2024). Financial ratios are instrumental in predicting defaults (Huang et al., 2022), and the study focuses on their use in the “Through the Cycle Methodology” of BASEL-III.

Different research studies propose machine learning and deep learning models for predicting financial distress or default (Gao & Balyan, 2022). A hybrid machine learning approach with the k-nearest neighbor model and the bagging ensemble method shows promise in credit rating and risk assessment (Lu, 2022; Lim et al., 2024).

Despite the significant advancements in predicting financial failures and credit default, there are still some literature gaps that this research aims to address. The existing studies have primarily focused on the application of machine learning models, such as Neural Networks, Decision Trees, and Support Vector Machines, in predicting credit default (Dube et al., 2021). However, there is a need to determine the most accurate model among these options (Tsai et al., 2021). This research aims to fill this gap by identifying an appropriate machine learning model that can effectively identify early signs of financial problems and prevent business failures (Zhu et al., 2023).

Furthermore, while financial ratios have been widely recognized as instrumental in predicting defaults (Alonso & Carbo, 2021), their use in conjunction with the “Through the Cycle Methodology” of BASEL-III needs further exploration. The literature has not extensively delved into the specific application of financial ratios within this methodology. This research will address this gap by focusing on the use of financial ratios within the “Through the Cycle Methodology” of BASEL-III to predict financial problems and potential business failures.

Additionally, while machine learning and deep learning models have been proposed for predicting financial distress or default (Gao & Balyan, 2022), there is a need to explore hybrid approaches that combine different models to enhance accuracy. The research will investigate the effectiveness of a hybrid machine learning approach, specifically incorporating the k-nearest neighbor model and the bagging ensemble method, for credit rating and risk assessment (Lu, 2022).

This research aims to contribute to the existing literature by filling the gaps in determining the most accurate machine learning model for predicting financial problems, exploring the application of financial ratios within the “Through the Cycle Methodology” of BASEL-III, and investigating the effectiveness of a hybrid machine learning approach for credit rating and risk assessment. By addressing these gaps, this study seeks to enhance the understanding and prediction of financial distress and potential business failures in non-financial firms in Pakistan.

3 Methodology

The research focuses on credit default prediction and aims to develop accurate prediction models using advanced statistical and machine learning techniques. It adopts a realist ontology, a positivist epistemological approach, and a value-neutral axiology. The methodology emphasizes objective data analysis, reliable prediction model creation, and adherence to established research methodology guidelines. The study follows a deductive approach, collects financial data from reliable sources, and aims to generalize its findings for stakeholders. The methodology ensures researcher independence, impartiality, and the production of authentic and valuable results for credit default prediction.

The research utilizes a comprehensive framework of financial ratios as independent variables to predict credit defaults. These ratios are categorized into capitalization ratios, cash flow ratios, coverage ratios, leverage ratios, liquidity ratios, profitability ratios, and firm size indicators. The study acknowledges the importance of these ratios in assessing an organization’s financial health and risk of default. By including a wide range of financial ratios, the research aims to establish accurate default prediction models applicable across different scenarios, aligning with previous studies in the field.

3.1 Data, Data Preprocessing and Experimental Setup

The study utilized a comprehensive dataset obtained from the State Bank of Pakistan, the annual reports of each firm, and a financial data service provider in Pakistan. It focused on 396 non-financial firms listed on the Pakistan Stock Exchange, covering various sectors, over a 24-year period from 2000 to 2023. The dataset consisted of 71 financial ratios grouped into seven categories: liquidity, profitability, capitalization, coverage, leverage, size, and cash flows. A binary variable served as the dependent variable, where 0 represents non-default and 1 represents default, and three default assumptions were applied. Details of the independent and dependent variables are available at the link below. https://docs.google.com/spreadsheets/d/1JuNII1q85rBd7SLSf1pfg4RE8L0O0SDX/edit?usp=sharing&ouid=104272322739994774143&rtpof=true&sd=true

Data preprocessing involved eliminating firms with insufficient history or incomplete data, as well as handling outliers using the winsorizing method (Barnett & Lewis, 1994). The resulting clean dataset comprised 396 firms and 71 financial ratios. Multivariate and univariate analyses, including correlation analysis and panel analysis, were performed to assess the variables' effect on default prediction.
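To make this preprocessing step concrete, the sketch below shows one common way to winsorize the ratio columns in Python with pandas; the 1st/99th percentile cut-offs and the column names are illustrative assumptions, since the exact limits used in the study are not reported.

```python
import pandas as pd

def winsorize_ratios(df: pd.DataFrame, lower: float = 0.01, upper: float = 0.99) -> pd.DataFrame:
    """Clip every numeric ratio column to its lower/upper percentile (winsorizing)."""
    out = df.copy()
    for col in out.select_dtypes("number").columns:
        lo, hi = out[col].quantile([lower, upper])
        out[col] = out[col].clip(lo, hi)
    return out

# Illustrative use on a tiny hypothetical ratio table
sample = pd.DataFrame({"current_ratio": [0.4, 1.1, 1.3, 1.2, 25.0],
                       "debt_to_assets": [0.2, 0.5, 0.6, 0.4, 3.5]})
print(winsorize_ratios(sample))
```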

The research employed the Weight of Evidence or Information Value technique to identify the most predictive features. Various machine learning algorithms, such as Logistic Regression, K-Nearest Neighbor, Decision Tree, Random Forest, Support Vector Machine, Naive Bayes, and Artificial Neural Networks, were compared for the default prediction model. The Scikit-learn library was utilized for algorithm testing and evaluation, employing accuracy measures, logistic coefficients, and cross-validation algorithms.
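As an illustration of this comparison step, the sketch below fits the seven classifier families named above with scikit-learn on synthetic data and reports test accuracy; the hyperparameters, the 70/30 split, and the synthetic data are assumptions rather than the study's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the firm-year ratio matrix X and the 0/1 default flag y
X, y = make_classification(n_samples=2000, n_features=12, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

classifiers = {
    "LR":  LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "DTC": DecisionTreeClassifier(random_state=0),
    "RFC": RandomForestClassifier(random_state=0),
    "SVM": SVC(random_state=0),
    "GNB": GaussianNB(),
    "ANN": MLPClassifier(max_iter=2000, random_state=0),
}

for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)                                   # train on the 70% split
    print(f"{name}: test accuracy = {clf.score(X_te, y_te):.3f}")
```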

3.2 Feature Selection

The author developed a two-step feature selection process. The first step involved multivariate analysis, where correlation metrics were used to identify and minimize the impact of multicollinearity among the features. This step aimed to ensure that the selected features were not highly correlated with each other (Garg and Tai, 2013).

The second step utilized univariate analysis based on the weight of evidence (WoE) and information value (IV). These parameters, suggested by Wod (1985), were employed to assess the predictive power of individual features. WoE and IV scores were calculated, and features with higher scores were considered more influential for default prediction.
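A minimal sketch of the WoE/IV computation is given below, assuming the ratio has first been binned (deciles here) against the 0/1 default flag; the data, bin counts, and the 0.5 smoothing term are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def woe_iv(binned: pd.Series, default_flag: pd.Series):
    """Per-bin Weight of Evidence and the feature's total Information Value."""
    tab = pd.crosstab(binned, default_flag) + 0.5       # small shift avoids log(0) in empty cells
    dist_good = tab[0] / tab[0].sum()                   # share of non-defaults per bin
    dist_bad = tab[1] / tab[1].sum()                    # share of defaults per bin
    woe = np.log(dist_good / dist_bad)
    iv = float(((dist_good - dist_bad) * woe).sum())
    return woe, iv

# Illustrative use: decile-bin a hypothetical liquidity ratio
rng = np.random.default_rng(0)
ratio = pd.Series(rng.normal(1.2, 0.4, size=1000))
default_flag = pd.Series((ratio + rng.normal(0, 0.5, size=1000) < 0.9).astype(int))
woe, iv = woe_iv(pd.qcut(ratio, q=10, labels=False), default_flag)
print("Information Value:", round(iv, 3))
```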

By incorporating these steps, the author aimed to select features that comply with the BASEL-III guideline, which recommends the use of Information Value and Weight of Evidence in financial analysis. Figure 1 illustrates the complete research process.

Fig. 1 Research process (Author's illustration)

3.3 Intelligent Models

3.3.1 Logistic Regression

Logistic regression (LR) is a non-linear model used for binary classification, where the dependent variable is binary (either ‘0’ or ‘1’). LR fits an S-shaped curve to separate observations into categories such as good or bad, default or non-default. It is commonly used in default prediction models (Myung, 2003).

LR allows the calculation of probabilities of defaults (PDs) for individual observations in the dataset based on the coefficients derived from the LR model. This distinguishes LR from other classification models in machine learning.

In this research, LR was also used as a filtering tool for feature selection: significant features were retained, while those that were insignificant or behaved inconsistently in the LR analysis were eliminated.
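The sketch below, on synthetic data, shows how the per-observation PDs and the coefficients used for filtering can be read off a fitted logistic regression; it is purely illustrative and does not use the study's final ratios.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85], random_state=0)
lr = LogisticRegression(max_iter=1000).fit(X, y)

pd_hat = lr.predict_proba(X)[:, 1]               # probability of default for every observation
print("coefficients:", lr.coef_.round(3))        # signs and magnitudes guide feature filtering
print("first five PDs:", pd_hat[:5].round(3))
```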

3.3.2 K-Nearest Neighbor

The k-nearest neighbor (KNN) model applied here follows Henley (1996). It classifies an input feature vector X according to the Euclidean (or Mahalanobis) distance between X and the training observations \(\{X_{k}\}_{k=1}^{m}\). KNN is a supervised machine-learning model and therefore requires labeled input data; a new observation is assigned the class that is most common among its k nearest neighbors.

3.3.3 Decision Trees

A decision tree (DT) classification algorithm was proposed by Rutkowski et al. (2012). DT is also a form of supervised learning that requires certain input parameters or information to reach a result. Root and internal nodes split the data to discriminate between the default and non-default classes. The simplest example is given in the figures below.

3.3.4 Random Forests

The random forest (RF) algorithm was introduced by Breiman (2001) and is an ensemble of decision trees. A number of decision trees are constructed from bootstrap samples drawn from the observations, and every tree is built on a subset of K randomly selected features. Each tree assigns a class to a new feature vector, and the RF then assigns the final class through a voting system over the trees' outputs, yielding a consolidated classification. The figure below illustrates the RF workflow.
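The voting idea can be sketched as follows on synthetic data: each fitted tree casts a class vote for a new observation and the forest reports the aggregated class; the number of trees and other settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=12, weights=[0.85], random_state=0)
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0).fit(X, y)

votes = np.array([tree.predict(X[:1])[0] for tree in rf.estimators_])   # one class vote per tree
print("share of trees voting 'default':", round(votes.mean(), 2))
print("forest prediction:", rf.predict(X[:1])[0])                       # aggregated over all trees
```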

3.3.5 Support Vector Machines

Cortes et al. (1995) proposed the support vector machine (SVM), which is built on a decision boundary, or hyperplane, that discriminates between classes in a high-dimensional feature space. The linear SVM maximizes the margin between the default and non-default classes, where the weight-vector norm is \(\Vert a\Vert = \sqrt{\sum_{i=1}^{m} a_{i}^{2}}\); classes are separated according to the SVM equation given below.

3.3.6 Naïve Bayes

Naïve Bayes is a well-known machine learning algorithm utilized for classification tasks. It relies on Bayes’ theorem and assumes that the features within a dataset are independent of each other, given the class label. This assumption simplifies the probability calculations and enhances the algorithm’s computational efficiency (Dempster et al., 1977).

3.3.7 Artificial Neural Networks

Mitchell (1997) explained that the artificial neural network (ANN) was developed on the basis of the biological neural network: it mimics the nexus of brain neurons and their ability to process data and make it meaningful (Thomas, 2000; Garg & Aggarwal, 2021; Tsai et al., 2014). An ANN comprises three layers: (1) the input layer, (2) the hidden layer, and (3) the output layer (Tsai & Wu, 2008), as shown in the figure appended below.
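A minimal MLP sketch with one hidden layer is shown below; the layer sizes, activation, and synthetic data are assumptions, since the study does not report the exact architecture it used.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=12, weights=[0.85], random_state=0)

# Input layer = 12 ratio features, one hidden layer of 10 neurons, output layer = default / non-default
ann = MLPClassifier(hidden_layer_sizes=(10,), activation="relu",
                    max_iter=2000, random_state=0).fit(X, y)
print("training accuracy:", round(ann.score(X, y), 3))
```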

3.4 Model Validation

The developed model will be rigorously validated using a range of evaluation metrics, including Precision, Recall and F1-Score. These metrics will assess the accuracy and reliability of the established model(s). To ensure robustness, cross-validation techniques such as K-Fold Cross Validation and Stratified Fold Cross Validation will be employed.
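For example, stratified K-fold validation with accuracy, precision, recall, and F1 can be run as in the sketch below; five folds, a random forest, and the synthetic data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=1000, n_features=12, weights=[0.85], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)          # class-balanced folds
scores = cross_validate(RandomForestClassifier(random_state=0), X, y, cv=cv,
                        scoring=["accuracy", "precision", "recall", "f1"])
for metric in ("accuracy", "precision", "recall", "f1"):
    print(metric, round(scores[f"test_{metric}"].mean(), 3))
```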

Once the model(s) have undergone comprehensive evaluation, rating scorecards will be constructed. Since ML algorithms do not provide coefficients like logistic regression, an alternative approach will be used: feature rankings based on permutation importance, as studied by Boughaci and Alkhawaldeh (2020) and Luo (2022). Weights will then be assigned to each feature, as given in Table 1, reflecting their importance in the rating scorecards. This method guarantees that the selected features are appropriately considered in the credit risk assessment process.

Table 1 Assigning weights for scorecard creation
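The sketch below shows one way to obtain permutation-importance rankings and turn them into normalized weights; the normalization rule and the synthetic data are assumptions and need not match the exact weights reported in Table 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85], random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0).importances_mean
imp = imp.clip(min=0)                        # drop features that only add noise
weights = imp / imp.sum()                    # normalize so the scorecard weights sum to 1
for i, w in enumerate(weights):
    print(f"feature_{i}: weight = {w:.3f}")
```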

To calculate the probability of default (PD) values for each observation, the study applied weights to the variables in each classifier. These weights were multiplied by the corresponding ratios to obtain PD values. Logistic regression (LR) used its coefficients as weights, while an exponential distribution was utilized for both LR and the other six classifiers.

For LR, the exponential LR scores were derived by multiplying the ratios by the coefficients, and these scores were used to calculate odds scores (PDs). The same treatment was applied to the other classifiers by multiplying the financial ratios by their assigned weights to obtain PDs. This approach ensures a fair contribution of each variable to the PD calculation, leading to a more accurate assessment of credit risk.
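A minimal numeric sketch of this step is given below: a weighted ratio score is mapped to a PD through the exponential odds transform PD = e^score / (1 + e^score). The ratio values and weights are illustrative, and the transform is our reading of the described procedure rather than a verbatim reproduction of the study's formula.

```python
import numpy as np

ratios = np.array([0.42, 1.30, -0.15])    # illustrative values of three selected ratios
weights = np.array([0.50, 0.30, 0.20])    # illustrative scorecard weights (sum to 1)

score = float(ratios @ weights)                       # weighted-average default score
pd_value = np.exp(score) / (1.0 + np.exp(score))      # odds-based probability of default
print(f"score = {score:.3f}, PD = {pd_value:.3f}")
```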

The author will establish nine credit rating buckets based on the guidelines of BASEL Accord III and assign the obtained ratings to these buckets accordingly. In the final phase, the obtained ratings will be calibrated. Following the recommendations of BASEL Accord III, calibration and recalibration tests will be conducted. Calibration tests will include metrics such as area under curve (AUC), receiver operating characteristic (ROC), GINI Coefficient, KS-Stats, and Brier Score. Recalibration tests will involve the binomial distribution—traffic lights approach and the population stability index (PSI). By employing these tests, the author aims to ensure the accuracy and reliability of the obtained ratings and to align them with the standards set by BASEL Accord III.
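The calibration metrics named above can be computed as in the sketch below on illustrative predicted PDs; the Gini coefficient is derived from the ROC area as Gini = 2·AUC − 1, and the data and classifier are assumptions.

```python
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=12, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
pd_hat = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, pd_hat)                                # discriminatory power (ROC area)
gini = 2 * auc - 1                                               # Gini coefficient from the AUC
ks = ks_2samp(pd_hat[y_te == 1], pd_hat[y_te == 0]).statistic    # KS statistic
brier = brier_score_loss(y_te, pd_hat)                           # predictive power (lower is better)
print(f"AUC={auc:.3f}  Gini={gini:.3f}  KS={ks:.3f}  Brier={brier:.3f}")
```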

4 Results and Analysis

In our methodology, we initiate the testing process by identifying the best features through both multivariate and univariate analyses. Initially, we employ correlation heatmaps to address multicollinearity within our dataset, resulting in the formation of three distinct panels. Consequently, this approach leads to the creation of nine models, as each panel is further examined under three different default assumption criteria. Following this, we refine our feature selection using univariate analysis tools, as recommended by Abdullah et al. (2023); Wod (1985). After completing the univariate analysis, we proceed to narrow down our features with the aid of logistic regression, employing it as a filtering tool in its conventional form—without dividing the dataset into training and test splits—as advised by Zhou et al., (2015). Upon finalizing the feature filtering process to pinpoint the most effective and relevant features, we incorporate these selected features into our machine learning algorithms for further analysis. The outcomes of this feature selection process are accessible via the link provided below. https://docs.google.com/document/d/1764wBC_2PL7I_jdsLFUJp4Adnnk8iIFm/edit?usp=sharing&ouid=104272322739994774143&rtpof=true&sd=true

Table 2 above reports the accuracy achieved on the train–test split across all nine models. The random forest classifier (RFC) achieved the highest accuracy rate of approximately 87.96%, with minimal false positives (FP) and false negatives (FN) compared to the other classifiers. The artificial neural network (ANN) ranked second, with an accuracy rate of around 86.95%, followed by support vector machines (SVM) at 86.21%. The confusion matrix confirms the improved accuracy across all classifiers, supporting the selection of the best model.

Table 2 Results of machine learning algorithms

Notably, Model 6 (out of nine) demonstrated a significant increase in accuracy compared to Model 5 and outperformed all other models. Model 6 was developed with 12 features, employing a relaxed default approach with a 60% erosion of equity. Table 3 below presents the relative rankings and accuracy for each classifier, accompanied by their margins of error.

Table 3 Relative rankings of ML algorithms

Table 4 illustrates the accuracy scores of all ML algorithms for Model 6. The accuracy of Model 6 was confirmed through validation, utilizing the classification report and complementary testing. The random forest classifier (RFC) achieved the highest accuracy of approximately 88%, with precision, recall, and F1-score values of 88%, 88%, and 87%, respectively. The artificial neural network (ANN) followed closely, with an accuracy of around 87% and precision, recall, and F1-score values of 87%, 87%, and 86%, respectively. The support vector machine (SVM) ranked third, achieving an accuracy rate of approximately 86%, along with a precision of 86%, recall of 86%, and F1-score of 85%.

Table 4 Classification report—ML models

Additionally, it is worth noting that the decision tree classifier (DTC) and K-Nearest neighbors (KNN) showed similar performance. Both classifiers attained an accuracy rate of around 84%, precision of 84%, and F1-Score of 84%. However, DTC had a higher recall of 88% compared to KNN’s 83% recall.

These results further validate the accuracy of model 6, indicating its strong performance across multiple evaluation metrics and supporting the superiority of RFC, ANN, and SVM in predicting default rates.

Table 5 presents the results of model validation and cross-validation. All classifiers were thoroughly validated, with the top-ranked classifiers (RFC, ANN, and SVM) maintaining their positions. Cross-validation results confirmed the excellent performance of RFC, with fold accuracies of around 86%, indicating its insensitivity to data shuffling. SVM also demonstrated good performance, with fold accuracies of around 84%, suggesting its effectiveness in handling data shuffling issues.

Table 5 Model validation and cross validation

Model 6, constructed with 12 features and a relaxed default approach (60% erosion of equity), exhibited a significant increase in accuracy compared to Model 5 and outperformed all other models. Complementary testing validated the accuracy of Model 6, with RFC achieving the highest accuracy of 88%, followed by ANN at 87% and SVM at 86%. KNN and DTC showed similar accuracies of around 84%. The rankings of the top classifiers remained consistent, and all classifiers were significantly justified in the validation process.

Random forest classifier (RFC), support vector machines (SVM), and artificial neural networks (ANN) emerged as the most effective algorithms for default prediction modeling (DPM) in this study. RFC achieved the highest accuracy, followed by ANN and SVM, consistent with previous research, and these algorithms are recommended for similar DPM tasks. Model 6 with RFC as the classifier showed the highest accuracy and validation scores among all nine models, making it the best scenario for DPM. The relaxed default approach in Model 6, with a 60% erosion of equity, outperformed the other scenarios, and ANN performed slightly better in Model 6 than in Model 5. The study suggests further exploration of different default conditions to identify key thresholds based on equity erosion. Overall, Model 6 with RFC as the classifier is the best scenario for predicting a firm's default/non-default status.

Table 6 presents the features selected through the full filtering process and machine learning testing, that is, the features of Model 6. The model includes financial ratios from various streams, such as capitalization, cash flow, coverage, leverage, liquidity, profitability, and size. The analysis of these ratios revealed important findings. Non-current liabilities to total assets showed a positive relationship with credit default, indicating a higher default probability for companies relying heavily on non-current liabilities. Operating cash flows had a negative association with default events, suggesting that higher operating cash flows reduce the likelihood of default.

Table 6 Selected features

Liquidity ratios, specifically LIQ10, LIQ13, and LIQ3, showed negative and significant associations with default events, indicating that higher values for these ratios indicate better financial health. The least significant variable was LIQ13, but it still contributed to the calibration of default probabilities. LR3 (Log (Total LT Debt + Total ST Debt)) had a positive relationship with defaults, while PR1 (Net Sales / Total Assets) and PR14 (Operating revenue/total assets) had negative relationships with defaults, highlighting the importance of asset turnover and cost control.

Firm size (S3) was negatively related to defaults, indicating that larger firms are less risky. Finally, CR4 (EBIT / Total Liabilities) and CR6 (FFO/Total Debt) exhibited negative relationships with defaults, emphasizing the importance of adequate EBIT and cash flows to cover liabilities and debts. The identified features in Model 6 are justifiable for default prediction models. Financial ratios play a crucial role in default prediction, as emphasized by previous research. The capitalization ratio, cash flow, liquidity, profitability, coverage ratios, debt, and firm size are all important factors in default prediction.

Table 7 illustrates the generated probability of default ranges for all seven classifiers; these ranges are used for scorecard development. In the study, Model 6 was determined to be the best model for default prediction. However, since ML classifiers such as KNN, GNB, SVM, DTC, RFC, and ANN do not provide model coefficients, permutation importance was used to rank the features in terms of their importance. Weights were assigned to each feature based on their ranks, with higher ranks indicating higher weights. Only LR provided model coefficients, which were used to calculate odds scores for each variable. For the ML classifiers, a weighted average number (Ratio * Weight) was used as the default score for each variable. An exponential distribution was then applied to standardize the distribution across all classifiers.

Table 7 Probability of default (PDs) ranges

To assign credit ratings, nine percentile-based bins were used to group the default scores into rating bands, following the guidelines of the State Bank of Pakistan. The aim was to have fewer defaults in higher credit ratings and a higher chance of default in lower credit ratings. PD ranges were developed for all seven classifiers and used to assign credit ratings to each data point (entity for every year). These assigned ratings were then calibrated according to the guidelines of Basel Accord III and Bequé et al. (2017) for model calibration and monitoring.
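As an illustration of the binning step, the sketch below cuts a vector of PDs into nine percentile-based rating grades; the simulated PDs and the labels 1–9 are illustrative and do not reproduce the SBP band boundaries.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pds = pd.Series(rng.beta(2, 8, size=1000), name="PD")      # illustrative PD values
ratings = pd.qcut(pds, q=9, labels=range(1, 10))           # grade 1 = lowest PDs, grade 9 = highest

bands = pd.concat([pds, ratings.rename("grade")], axis=1)
print(bands.groupby("grade", observed=True)["PD"].agg(["min", "max"]))   # PD range per rating band
```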

The study involved developing credit rating bands based on default scores obtained from Model 6 and ML classifiers. These bands were calibrated and monitored according to the Basel Accord-III guidelines. Binning of rating bands are placed in link given as below. https://docs.google.com/document/d/1764wBC_2PL7I_jdsLFUJp4Adnnk8iIFm/edit?usp=sharing&ouid=104272322739994774143&rtpof=true&sd=true

Table 8 presents the results of stress testing (calibration) of the generated credit ratings, i.e., the credit scorecards. In the previous section, the author focused on evaluating the discriminatory power of the RFC model, which was determined to be the best classifier based on accuracy, validation, and cross-validation. The probability of default (PD) was derived using feature rankings, and credit ratings were assigned based on the SBP rating slabs. The percentage of defaults was calculated for each rating grade, and a cumulative accuracy profile (CAP) graph was generated.

Table 8 Random forest classifier (calibration results)

The table above presents the results of the actual, perfect, and random models. The CAP graph in the bottom right corner indicates the discriminatory power of the RFC model. The actual model performs well, with a concave curve lying between the perfect and random models, indicating improved discrimination. This aligns with the Basel-III guideline, which suggests that the CAP graph should exhibit such characteristics. The accuracy ratio (AR) and Gini values for the RFC model are both around 0.50, indicating solid discriminatory power; these findings are supported by Abdou, Tsafack, Ntim, and Baker (2016). This is further supported by the Kolmogorov–Smirnov (KS) test, which reveals that the majority of defaults are concentrated in the second slab, accounting for 39.32% of defaults.

The KS test highlights the discriminatory power of the RFC model, particularly in the first two slabs (worst grade), while the third slab also shows acceptable performance but requires further refinement. The RFC model demonstrates the strongest discriminatory power in predicting defaults, as evidenced by the KS statistics. Additionally, it achieves the lowest Brier score among all classifiers, including LR, indicating that it is the superior classifier for default prediction modeling (DPM).

4.1 Results of 2023

4.1.1 Random Forest Classifier

Table 9 presents the results of recalibration based on the population stability index (PSI). Based on the analysis, it is evident that the model's performance in 2023 has shown significant improvement, as indicated by the robust PD distribution pattern, with only a slight variation between grades 3 and 4. However, this discrepancy has been accounted for and normalized through the log of test and training, resulting in a stable PSI score. Furthermore, the observed pattern of more defaults in the last rating grade and fewer defaults in the best rating grades aligns with the expected behavior.

Table 9 Population stability index

The PSI value for the model is approximately 0.07, which is considered sufficient according to the guideline. Therefore, it can be concluded that no further adjustments are necessary for the model. To validate these findings, a binomial distribution analysis was also conducted, and the results support the research conclusions.
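For reference, the PSI can be computed over the nine rating grades as in the sketch below; the grade shares are illustrative and are not the study's actual distributions.

```python
import numpy as np

# Share of observations per rating grade in the reference (training) and 2023 samples (illustrative)
expected = np.array([0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05, 0.03, 0.02])
actual   = np.array([0.08, 0.14, 0.18, 0.22, 0.16, 0.11, 0.06, 0.03, 0.02])

psi = float(np.sum((actual - expected) * np.log(actual / expected)))
print(f"PSI = {psi:.4f}")   # values below roughly 0.10 are usually read as a stable population
```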

The performance of the 2023 model is deemed satisfactory, and based on both the PD distribution pattern and the binomial distribution analysis, recalibration is not required.

Table 10 presents the results of the binomial distribution (traffic lights) test, another recalibration test prescribed by Basel Accord III. The results demonstrate that the 2023 model performs well, with a consistent PD distribution pattern, albeit with a minor variation between rating grades 3 and 4. However, this difference has been normalized through the log of the test and training data, resulting in a stable PSI score of approximately 0.07. The lower rating grades exhibit higher default events, as expected, while the higher rating grades have fewer defaults. The binomial distribution analysis also supports the conclusion that no recalibration or changes are required.

Table 10 Binomial Distribution—Traffic Lights
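A hedged sketch of a one-grade binomial check is shown below; the obligor count, observed defaults, assigned PD, and traffic-light thresholds are illustrative assumptions rather than the calibration actually reported in Table 10.

```python
from scipy.stats import binom

n_obligors, observed_defaults, assigned_pd = 120, 9, 0.05   # illustrative numbers for one rating grade

# Probability of observing at least this many defaults if the assigned PD were correct
p_value = binom.sf(observed_defaults - 1, n_obligors, assigned_pd)
colour = "green" if p_value > 0.05 else ("yellow" if p_value > 0.01 else "red")
print(f"p-value = {p_value:.4f} -> {colour} light")
```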

When comparing all the models, the 2022 data reveals a wide range of numbers across the different classifiers. Some models report the highest default events in the 1st interval of the KS stats, while others report them in the 2nd or 3rd interval. In 2022, the RFC model demonstrated strong performance in terms of AR and GINI but lacked predictive power. Conversely, ANN outperformed all models with excellent predictive power and adequate discriminatory power. LR showed healthy discriminatory indicators but limited predictive power, while KNN exhibited reasonable numbers for both discriminatory and predictive power. GNB and DTC produced results that were more closely aligned.

The 2023 model’s performance is deemed satisfactory, and there is no need for further recalibration or changes. The following comprehensive comparison table summarizes the performance of all the models.

Table 11 provides a comprehensive comparison of the recalibration models. Most default events were concentrated in the second interval of the KS stats, with GNB being the only model showing the highest defaults in the third slab, which is still acceptable. To draw a conclusion, it is necessary to consider each model's discriminatory and predictive values.

Table 11 Summary of recalibration – 2023

The LR model exhibited a diverse distribution and rating shuffles, resulting in decreased predictive power (highest BR Score: 0.19769). Similar observations were made for the ANN model, which had a distorted PD distribution impacting the BR Score. The remaining models showed a similar shuffle between rating grade 3 and rating grade 4. LR had strong discriminatory power with AR and GINI values around 0.57371 but slightly weak predictive power. KNN had weak discriminatory power with AR and GINI values around 0.39174 but good predictive power with a BR Score around 0.15899.

GNB, SVM, and DTC performed closely to each other, exhibiting good discriminatory power with AR and GINI values of 0.48697, 0.42886, and 0.47915, respectively. They also showed adequate predictive power with BR Scores of 0.16082, 0.15691, and 0.15968, respectively. ANN demonstrated good but not adequate performance in recalibration, with AR and GINI values around 0.41697, slightly lower than all classifiers except KNN, but good predictive power with BR scores around 0.15896.

Lastly, RFC remained the best model in recalibration and model monitoring, with AR and GINI values being the second-highest around 0.52735 (lower than LR but with better predictive power) and BR Scores around 0.15882. In conclusion, RFC outperformed all other classifiers.

5 Discussion

Default prediction is a significant concern globally, as demonstrated by high-profile bankruptcies such as WorldCom, Enron, and Lehman Brothers. Default prediction models have been developed for both mature and emerging markets, but they have faced criticism for methodological and contextual issues, particularly in defining default. Forecasting default events before they occur is crucial in corporate finance to protect stakeholders. This research addresses the gaps in default prediction modeling and serves as a benchmark, providing a comprehensive guide that incorporates recommended machine learning algorithms, validation techniques, scorecard presentation, calibration, recalibration, model stability, and monitoring. The research’s novel contribution aligns with Basel Accord III, IFRS guidelines, and ESMA Standards, making it valuable to financial institutions, rating agencies, regulatory bodies, and other stakeholders. It offers real-time probability of defaults (PDs) for various machine learning algorithms, enabling stakeholders to make informed decisions on credit lending to entities in Pakistan. The research also provides the option for stakeholders to generate their own PDs using the provided coefficients or feature scores. Overall, this research is a comprehensive resource for anyone involved in default prediction models (DPM).

6 Conclusion

The research focused on developing a default prediction model using machine learning techniques. Data preprocessing involved removing variables with incomplete history and missing values. Feature selection was performed through multivariate and univariate analysis, resulting in 13 efficient features for predicting default. Seven machine learning classifiers (LR, KNN, GNB, SVM, DTC, RFC, and ANN) were evaluated, with RFC performing the best in terms of accuracy, validation, and cross-validation. The findings aligned with previous research. The third objective aimed to compare the results of all classifiers, highlighting RFC as the highest-performing model with an accuracy and validation rate of 88%. The study aimed to align with BASEL III guidelines and used the population stability index (PSI) for model stability assessment. The RFC model demonstrated superior calibration, stability, and monitoring. Probability of default (PD) scores generated by RFC were the strongest among the classifiers. Overall, the research provided valuable insights into default prediction models in corporate finance.

The first objective was achieved by identifying 13 effective features that can accurately predict default. The research findings align with the studies conducted by (Christopoulos et al., 2019; Javaid & Javid, 2018; Inam et al., 2019; Karas & Reznakova, 2020; Zhu et al., 2019; Sariev & Germano, 2020; Muñoz-Izquierdo et al., 2019; Ragab & Saleh, 2021; Petropoulos et al., 2020; Ogachi et al., 2020; Bhattacharya & Sharma, 2019; Shrivastava et al., 2020). Results of the best features were reported in the Table 6 (Selected Features) of Sect. 4 “Results and Analysis”.

6.1 Practical Implication

The research has significant implications for financial analysts, credit analysts, academic researchers, financial institutions, and credit rating companies in Pakistan. It recommends incorporating real-time probability of default (PD) ranges as a tool for assessing financial risk. The identified key features for default prediction can enhance credit risk evaluation. Academic researchers can cite this comprehensive study for future work. The statistical framework can be used in parallel with judgmental models in commercial banks’ internal rating based (IRB) systems. The research fills a gap by providing valuable statistical insights for credit rating agencies in Pakistan. Overall, it improves financial analysis, credit assessment, and understanding of default prediction models in the corporate finance domain.

6.2 Limitation of Research

The study has limitations in terms of not including all advanced machine learning and deep learning algorithms due to resource constraints. It also lacks the incorporation of macroeconomic factors, corporate governance indicators, and ESG indicators due to data accessibility issues. The study focused on BASEL-III guidelines but did not use the IFRS calibration methodology. Additional financial ratios were not included to maintain comprehensive coverage. The study’s sample was limited to certain types of firms in Pakistan, which may affect generalizability.