1 Introduction

Financial stability is a sine qua non for sustained and rapid economic progress. Among the various indicators of financial stability, banks’ asset or loan quality, credit risk and efficiency in allocating resources to productive sectors assume critical importance. Credit risk can arise from credit default, concentration of exposure to an industry or individual, and a sovereign’s unwillingness or inability to meet its obligations. Credit risk evaluation is a continuous activity that starts with underwriting a potential loan and continues through payment collection after the loan is on-boarded. With growing uncertainty in the economy and in the political and legal will to deal with defaulters, default prediction is necessary to ensure that the right practices of credit risk evaluation are followed. Banks try to mitigate the associated risks through insurance, covenants, diversification, risk-based pricing, etc. Following the global financial crisis of 2007–2008 and the regulatory concerns of the Basel norms, credit risk evaluation has become a major focus of the banking and financial industry. According to the IMF, the world economy is at risk of another financial crisis [1] due to ineffective credit risk evaluation. Typically, credit risk is measured based on capacity to repay, capital, the loan’s conditions, credit history, and associated collateral. Jarrow and Turnbull [2] proposed one of the first reduced-form models for ascertaining credit risk. Banks and financial services firms rely on credit risk departments and engage agencies such as Standard and Poor’s, Moody’s, Fitch and others to perform credit risk evaluation for a fee. There are also credit scoring companies such as Experian, Equifax, CIBIL and others that focus on a customer’s past credit history on financial contracts.
We focus on credit risk evaluation associated with credit scoring of new loans, and with non-performing asset (NPA) prediction and fraud detection for existing loans, as these risks concern the principal, the interest, or both, on loans that banks have given to good or bad customers. We do not focus on concentration and sovereign risks in the current study, as they are not necessarily related to customer behaviour.

An NPA is a loan (including a leased loan) that ceases to generate income for the bank. Because global economies are changing and interdependent, an economic sneeze in one country can lead to an economic meltdown in another. While Cyprus [3] leads all countries with approximately 46% NPA, India ranks first among large economies in NPA concerns and fifth highest in the world [4]. The NPA accumulated by Indian lenders (approximately 114 billion USD in 2018) is higher than that of banks in the major economies, which include China, Japan, the US and the UK. The countries with higher NPA ratios (to total loans) than India’s are part of the distressed PIIGS group: Portugal, Italy, Ireland, Greece, and Spain. India’s gross NPAs stood at 18.77% at the end of March 2019 for public sector banks and 14.33% for all scheduled commercial banks, lower than in the previous year owing to recent initiatives including the setting up of the Central Repository of Information on Large Credits, the Asset Quality Review, and the Insolvency and Bankruptcy Code. The Reserve Bank of India (RBI, India’s central bank) continues to caution on gross NPAs and asset quality (Footnote 1). The process of income recognition is objective and based on the record of recovery rather than on any subjective considerations. Likewise, the asset classification of banks is based on objective criteria to ensure that the norms are applied uniformly and consistently. Provisioning, too, is based on the classification of assets, the period for which the asset has remained non-performing, the availability of security and the realizable value thereof. A report by the RBI has identified 17 reasons for NPA (Footnote 2). Some of the rules for a loan or an advance to be considered an NPA are:

  • “Interest and/ or installment of principal remain overdue for a period of more than 90 days in respect of a term loan”,

  • “The account remains ’out of order’ (outstanding balance remains continuously in excess of the sanctioned limit/drawing power) for a period of more than 90 days”,

  • “An overdraft or cash credit bill remains overdue for a period of more than 90 days in the case of bills purchased and discounted”,

  • “Interest and/or installment of principal remains overdue for two harvest seasons but for a period not exceeding two half years in the case of an advance granted for agricultural purposes”, and

  • “With respect to accounts, any amount to be received remains overdue for a period of more than 90 days.”
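The quoted rules amount to simple threshold checks. A minimal sketch of such a rule-based check is given here for illustration only: the `Account` fields and the account-type labels are hypothetical names, not taken from the RBI rules, and the agricultural rule is simplified to a count of harvest seasons (the "two half years" cap is not modelled).

```python
from dataclasses import dataclass

NPA_OVERDUE_DAYS = 90  # threshold quoted in the rules above


@dataclass
class Account:
    # All field names and the `kind` labels are hypothetical, for illustration.
    kind: str                         # "term_loan", "overdraft", "bill", "agri" or "other"
    days_overdue: int = 0             # days interest/principal/bill has been overdue
    days_out_of_order: int = 0        # days balance exceeded the sanctioned limit
    harvest_seasons_overdue: int = 0  # used only for agricultural advances


def is_npa(acc: Account) -> bool:
    """Apply the 90-day style rules to a single account."""
    if acc.kind == "agri":
        # Simplification: overdue for two harvest seasons.
        return acc.harvest_seasons_overdue >= 2
    if acc.kind == "overdraft":
        # Overdraft/cash credit: 'out of order' for more than 90 days.
        return acc.days_out_of_order > NPA_OVERDUE_DAYS
    # Term loans, bills and other accounts: overdue for more than 90 days.
    return acc.days_overdue > NPA_OVERDUE_DAYS


print(is_npa(Account("term_loan", days_overdue=120)))
```

A check of this kind only looks at the current overdue status of one account, which is why it cannot capture the behavioural patterns discussed next.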

While rule-based mechanisms are fairly successful, they do not identify spending patterns, cash inflows and outflows, repayment patterns, new advances or loans, deposits and other collateral changes, seasonal, economic and political changes, ownership changes, or other social behavior of the customer that could predict a potential NPA. There have also been cases of window dressing of potential NPAs with overdrafts and short-term advances to circumvent NPA classification. Identifying these patterns guides banks toward proactive intervention and appropriate steps, including loan restructuring. The importance of pattern-based recognition of anomalies and consumer behavior has been studied by industry [5, 6] and academia [7,8,9,10].

To reduce potential NPAs, it is crucial that credit scores are evaluated more diligently. One of the key decisions of banking and financial institutions is whether or not to grant a loan to a customer. Banks granting loans and advances are expected to set realistic repayment schedules based on borrowers’ cash flows. The subprime mortgage crisis (Footnote 3) of 2007–2010 was due to ineffective credit scoring while granting loans. A smart bank considers numerous factors before giving a loan to a customer. For example, a coal company may not be given loans for large expansion because of global concerns over reducing carbon emissions. Credit scoring is an important analytical technique in credit risk evaluation, based on customer history and environmental factors. It becomes a binary or multi-class classification problem of distinguishing low credit risk customers from high credit risk ones. Some likely scenarios that illustrate the need to identify good or bad loans are:

  • “Lakshmi is an urban low-income factory worker with dependency on monsoon. She is in her mid thirties and stays in suburbs of a metro city with husband and two kids. Her family stays in a small rented house of 300 sqft. from last three years with basic amenities like TV, gas stove, etc. She has total work experience of 9 years and is working in the current organization for the past 18 months. Her monthly salary is 18000/- and she and her family has health coverage from company insurance. She requires personal loan for her kids’ higher studies.”

  • “Prasad is newly married, working as a software professional. Requires consumer loan to buy household furniture and new style motor cycle. His company has dependency on US visas that is going through lot of rejection.”

  • “Roadside grocery seller has seen onion dumping in large volume at prices less than the cost of production. The farmer has taken farm loan and likes to get rid of the stock before onions get rotten.”

Customers and banks can also fall prey to fraudulent transactions. Examples of fraud include credit card fraud, insurance fraud, accounting fraud, etc. Fraud attempts have increased drastically in recent years with increased digitization, making fraud detection more important than ever for reducing credit risk. As identified by Steve Albrecht [11], perceived pressure, opportunity and rationalization are the primary reasons for committing fraud. Financial frauds involve complex transactions and the involvement of ‘white collar criminals’. Cyber, social engineering, mortgage application, merchant, rogue trading, financial statement cook-up, currency and other frauds are some of the common fraud typologies. Recent reports of a central bank state that frauds in banks have increased by 72% (Footnote 4). Along with external frauds, frauds by internal employees are unfortunately also increasing, at a rate of more than 20% [12]. Some of the common frauds committed by customers and internal employees are:

  • “Customers create fake liens to obtain loans or obtain more than the prescribed amount.”

  • “Employees collude with customers to provide loans at lower interest rates and unverified collaterals in return of bribe or favor.”

  • “Customers use the loan amount for meeting their desires or other trivial activities rather using it for the purpose it was lent.”

This paper does not focus on frauds based on social engineering, compromise of accounts, cyber attacks and others in which customers do not conspire. Nor does the study focus on fraudulent money laundering for terrorism and other illegal activities, though these continue to be an area of concern for regulators, the financial investigation units of countries and the UN.

With the unprecedented growth of both banking and payment transactions in digital form, small and large banks, micro finance institutions, self help groups and other financial institutions are becoming repositories of large volumes of varied data piling up at great velocity. The cost of compliance, regulatory reforms and risk management need careful attention in order to simplify business functions. The increased availability of data can lead to more informed decisions, provided the data is analyzed quickly and meaningfully. Existing data warehouses, MIS and other reports are becoming less important with the emergence of pattern-based analytics on structured and unstructured data. Currently, banks, with the help of credit rating agencies, have started to use intelligence built up over time in the form of rules, statistics or pattern recognition techniques to perform credit risk evaluation. Even regulators such as the European Central Bank [13] have suggested features based on structured and unstructured (natural language) data for early warning on credit risk. Data analytics involves complex processing that goes beyond statistics into the fields of computer science (via machine learning, subsuming the new wave of artificial intelligence) and operations research. Dr. Jim Gray of Microsoft refers to Artificial Intelligence (AI) as the fourth paradigm [14] of science, preceded in the evolution of science by the theoretical, experimental and computational paradigms. With AI, hidden patterns are recognized and appropriate alerts are raised in a useful and usable way. According to IBM’s 2010 Global Chief Executive Officer Study (Footnote 5), 89% of banking and financial markets CEOs state that their top priority is to better understand, predict and give customers what they want by leveraging analytics.
Some of the broad use cases that analytics [15, 16] is expected to help with are: acquiring customers, serving existing customers and making them profitable, targeted marketing, market basket analysis (cross-sell/up-sell), churn prediction including feedback processing, customer sentiment analysis, market risk modeling (subsuming foreign exchange rate, interest rate and liquidity risks), Automated Teller Machine cash replenishment modeling, productivity/profitability-based ranking of banks, portfolio optimization, application screening and channel optimization. More specifically, analytics can be used for (i) credit scoring, (ii) NPA prediction, (iii) fraud detection (transactional and non-transactional) and other use cases. In recent times, Fintechs have given new vigor to innovation in banking and financial services through automation and customer experience. Fintechs evaluate the creditworthiness of loan applicants and improve the interface between customers and their service providers [17]. Fintechs also provide lending platforms for unsecured loans [18], evaluating creditworthiness in a few seconds by leveraging various machine learning (ML) techniques that consume structured and unstructured data. However, the ML techniques and the information on hyperparameters are not available to the public or the research community. While there has been extensive research in industry and academia on credit scoring, NPA identification and fraud detection with rule-based, statistical and pattern-based approaches, we did not find any consolidated literature that discusses datasets, challenges and research gaps. In the remainder of this paper, we focus on literature gathered from academic and industry publications on credit scoring, NPA prediction and fraud detection. We perform a comprehensive systematic literature review so that the various approaches and techniques can be studied. We then analyze the papers shortlisted after applying inclusion and exclusion criteria and synthesize our findings.

2 Approach for literature review

There is extensive literature on various approaches to credit scoring, NPA prediction and fraud detection. The growing concerns about credit risk evaluation and the increasing benefits of automating it motivated us to conduct this study. We use the review process suggested by Kitchenham [19] to conduct the systematic literature review (SLR). An overview of the process is shown in Fig. 1. The research questions addressed by this study, along with the rationale (Ra.) for including each of them, are as follows:

  1. RQ1. Why and what are the AI techniques being used for credit risk evaluation?

    Ra. AI includes natural language processing (NLP), machine learning (ML), information retrieval and extraction, expert systems, fuzzy logic and other approaches, as shown in Fig. 2. While our primary interest lies in ML techniques, a broader research question helps us analyze the advantages and disadvantages of ML and the gaps in other techniques.

  2. RQ2. How are the ML techniques being used for credit risk evaluation?

    Ra. This research question can give us insight into the various ML approaches for credit risk evaluation, including probabilistic, neural network, optimization and ensemble based approaches. It is also expected to help us understand whether there are any commonalities (models, feature extraction, datasets, etc.) across credit scoring, NPA prediction and fraud detection. The answer to this question can tell us which scoring techniques (binary or on a scale) are used for credit risk evaluation. The public datasets identified can be used by interested researchers in the area to improve their algorithms.

  3. RQ3. What are the challenges/limitations in this focus area?

    Ra. The identified challenges can help us identify threats to validity in applying various ML models and scoring techniques to credit risk evaluation, as well as regional (legal and compliance) and product-specific issues. The usage of loss factor and hyperparameters will help us understand the implementation of ML techniques.

  4. RQ4. What are the research trends in credit risk evaluation?

    Ra. This can provide insights and potential guidance to researchers interested in credit risk evaluation.

  5. RQ5. Which are the universities that are working in this area?

    Ra. Researchers working in this area might want to get in touch with the universities and researchers working in the same area. This can enable researchers and industry to collaborate on datasets and algorithms to improve credit risk evaluation.

Fig. 1 Systematic literature review process [19]

Fig. 2 Artificial intelligence landscape [20]

A systematic review protocol is a documented plan describing all the details of how a review will be conducted. We used a living document that was continuously updated during the review process. The protocol served as a reference document for the reviewers and was evaluated by fellow researchers within our institute, who provided feedback on the design of the study.

2.1 Search strategy and study selection

We used the Springer, ScienceDirect, IEEE Xplore and ACM Digital Library databases to gather the relevant literature with a search query (search string). These databases were chosen because they cover most of the important journals and conferences. We did not search repositories such as Wiley, Taylor and Francis, IGI and others because of limited access and the cost involved. The first step towards finding relevant studies is to find relevant keywords. As we are interested only in bank-related credit risks, our first keyword is “banking”. The keywords that follow are “credit risk”, “credit score”, “default”, “NPA”, “Non Performing Asset”, “Non Performing Loan” and “Fraud Detection”. We identified the related keywords (synonyms) using Investopedia (Footnote 6). We then have a set of keywords related to the techniques or models for credit risk evaluation: “AI”, “artificial intelligence”, “ML”, “Machine Learning”, “classification”, “Supervised”, “unsupervised”, “Deep Learning”, “Neural Network”, “Radial Basis Function Networks”, “SVM”, “Decision Tree”, “Discriminant Analysis”, “Naive Bayes”, “Nearest Neighbor”, “Random Forest”, “Hidden Markov”, “Markov Chain”, “Regression”, “Fuzzy Logic” and “Expert System”. The following query was formed from these keywords to identify relevant studies in the databases:

(“banking”) AND (“credit score” OR “credit risk” OR “default” OR “NPA” OR “non performing asset” OR “non performing loan” OR “fraud detection”) AND (“AI” OR “artificial intelligence” OR “ML” OR “machine learning” OR “classification” OR “supervised” OR “unsupervised” OR “deep learning” OR “neural network” OR “radial basis function networks” OR “SVM” OR “support vector machine” OR “decision tree” OR “discriminant analysis” OR “naive bayes” OR “nearest neighbor” OR “random forest” OR “hidden markov” OR “markov chain” OR “regression” OR “fuzzy logic” OR “expert system”)

Let \(\varDelta\) = (“banking”) AND (“credit score” OR “credit risk” OR “default” OR “NPA” OR “non performing asset” OR “non performing loan” OR “fraud detection”)
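The query above is a conjunction of three OR-groups, so it can be assembled mechanically from keyword lists. The following is a minimal sketch of that construction; the helper name `any_of` is our own, and straight double quotes are used instead of the typographic quotes shown in the text, on the assumption that the databases accept them.

```python
def any_of(terms):
    """Render a keyword list as one parenthesised OR-group of quoted phrases."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


DOMAIN = ["banking"]
RISK = [
    "credit score", "credit risk", "default", "NPA",
    "non performing asset", "non performing loan", "fraud detection",
]
TECHNIQUES = [
    "AI", "artificial intelligence", "ML", "machine learning",
    "classification", "supervised", "unsupervised", "deep learning",
    "neural network", "radial basis function networks", "SVM",
    "support vector machine", "decision tree", "discriminant analysis",
    "naive bayes", "nearest neighbor", "random forest", "hidden markov",
    "markov chain", "regression", "fuzzy logic", "expert system",
]

# delta corresponds to the bank/risk part of the query defined in the text.
delta = f"{any_of(DOMAIN)} AND {any_of(RISK)}"
full_query = f"{delta} AND {any_of(TECHNIQUES)}"

print(full_query)
```

Keeping the three keyword groups as separate lists makes it straightforward to reuse the bank/risk part when the technique keywords have to be divided across several queries.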

As each database has its own query format, we had to modify or break down the query. For the ACM Digital Library, the query was broken into two parts because of the limit on query string size. The following queries were submitted and their results compiled:

\(\varDelta\) AND (“AI” OR “artificial intelligence” OR “ML” OR “machine learning” OR “classification” OR “supervised” OR “unsupervised” OR “deep learning” OR “neural network” OR “radial basis function networks” OR “SVM” OR “support vector machine” OR “decision tree” OR “discriminant analysis”)

\(\varDelta\) AND (“naive bayes” OR “nearest neighbor” OR “random forest” OR “hidden markov” OR “markov chain” OR “regression” OR “fuzzy logic” OR “expert system”)

We used the original query for IEEE Xplore, as the entire query string could be accommodated. As the query size limit for ScienceDirect was low, the following queries were used separately and their results compiled:

\(\varDelta\) AND (“AI” OR “artificial intelligence” OR “ML” OR “machine learning” OR “classification” OR “supervised”)

\(\varDelta\) AND (“unsupervised” OR “deep learning” OR “neural network” OR “radial basis function networks” OR “SVM”)

\(\varDelta\) AND (“support vector machine” OR “decision tree” OR “discriminant analysis” OR “naive bayes”)

\(\varDelta\) AND (“nearest neighbor” OR “random forest” OR “hidden markov” OR “markov chain” OR “regression”)

\(\varDelta\) AND (“fuzzy logic” OR “expert system”)

For the Springer library, we divided the search query into two parts because of the limit set by the database. The following queries were used and their results compiled:

\(\varDelta\) AND (“AI” OR “artificial intelligence” OR “ML” OR “machine learning” OR “classification” OR “supervised” OR “unsupervised” OR “deep learning” OR “neural network” OR “radial basis function networks” OR “SVM” OR “support vector machine” OR “decision tree” OR “discriminant analysis”)

\(\varDelta\) AND (“naive bayes” OR “nearest neighbor” OR “random forest” OR “hidden markov” OR “markov chain” OR “regression” OR “fuzzy logic” OR “expert system”)
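Splitting the technique keywords to respect a database's query-size limit can be sketched as a greedy packing step. The sketch below is an illustration, not the procedure actually used for the queries above: the function names and the `max_chars` limit are hypothetical, the actual limits of each database are not stated here, and the partitions listed above need not coincide with what this greedy rule produces.

```python
def render(prefix, terms):
    """Render one sub-query: prefix AND ("t1" OR "t2" ...)."""
    return prefix + " AND (" + " OR ".join(f'"{t}"' for t in terms) + ")"


def split_query(prefix, terms, max_chars):
    """Greedily pack terms into sub-queries of at most max_chars characters.

    A single term that is too long on its own is still emitted alone.
    """
    queries, current = [], []
    for term in terms:
        candidate = current + [term]
        if current and len(render(prefix, candidate)) > max_chars:
            # Adding this term would exceed the limit: close the current query.
            queries.append(render(prefix, current))
            current = [term]
        else:
            current = candidate
    if current:
        queries.append(render(prefix, current))
    return queries


# "DELTA" stands in for the bank/risk part of the query; the limit is made up.
techniques = ["fuzzy logic", "expert system", "regression", "markov chain"]
for q in split_query("DELTA", techniques, max_chars=50):
    print(q)
```

Because every keyword lands in exactly one sub-query, the union of the sub-query results covers the same studies as the original query, at the cost of duplicates that have to be removed when the results are compiled.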

We also did a manual search to find the seminal or important studies in this field. This was to ensure that all important studies are included in our search.

2.2 Inclusion, exclusion criteria and quality assessment

Here we describe the inclusion, exclusion and quality assessment criteria applied to the search results to identify the relevant papers. The literature is included if:

  • It discusses new techniques, or improvements to existing techniques, of credit risk evaluation, including credit scoring, NPA prediction and fraud detection. In addition to covering many different techniques, this helped filter out repetition, especially studies that discuss the same techniques.

  • It is published between 1993 and March 2019. We chose studies published after the AI boom of the 1980s and the AI winter that followed until 1993 [21]. This ensured that the chosen studies fall in the era of modern artificial intelligence, in which many studies have been published on how artificial intelligence can be used to solve real-world problems.

  • It is identified through the forward-snowballing approach, which we also followed to find relevant studies. As a consequence, some additional papers were chosen as a part of our study.

The literature is excluded from the selection process if:

  • It is a poster, short paper, doctoral symposium paper, thesis or dissertation, or grey literature. To maintain the scope of this study, we confined ourselves to full research papers.

  • It is not written in English. Almost all major contributions in this area have their texts available in English.

  • The full text is not available or accessible.

  • It refers to AI techniques but deals with rules and statistics such as time series, regression, correlation and other methods.

  • It is related to fraud detection but does not involve conspiracy by fraudulent customers.

  • It is not published in a Computer Science conference or journal. Conference proceedings not related to “Artificial Intelligence” were also removed during quality assessment.

To ensure that only quality papers are included in the review, we added the following steps to our process:

  • A study published between 1993 and 2014 was excluded if it lacked even a single citation (not counting self-citations). The minimum citation count was set to one to ensure a bare minimum standard for the literature being studied. Studies from the last five years were exempt and were included even without a single citation, because research papers in this area usually take some time to get noticed. We set the mark at five years based on our experience with other SLRs.

  • Papers that are not peer-reviewed research were not included.

  • Duplicate studies, i.e. studies found in other parts of the search process or published in other sources, were excluded.

  • Papers that lack research objectives, experimental rigor or validation were excluded for lack of research quality.

  • The authors also did a manual search to find seminal or influential studies in this field. This ensured that the automated search did not miss any important relevant studies. These studies were chosen according to their number of citations.

  • Authors formed an internal review team to perform quality assessment on search criteria and search results. The authors met after every step in the SLR process to analyze the issues on hand. Emails and spreadsheets were used for recording the findings and observations.
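The publication-window and citation rules described above can be written as a small filter. The sketch below is illustrative only: the `Study` record and its field names are hypothetical, the March 2019 cut-off is approximated at year granularity, and the judgement-based criteria (research objectives, rigor, validation) are not modelled.

```python
from dataclasses import dataclass

FIRST_YEAR, LAST_YEAR = 1993, 2019  # inclusion window (month not modelled)
CITATION_CUTOFF_YEAR = 2014         # studies up to this year need >= 1 citation


@dataclass
class Study:
    # Hypothetical record; field names are ours, not from the review protocol.
    year: int
    citations_excl_self: int
    peer_reviewed: bool = True
    full_paper: bool = True


def passes_quality_gate(s: Study) -> bool:
    """Return True if the study survives the mechanical criteria."""
    if not (FIRST_YEAR <= s.year <= LAST_YEAR):
        return False
    if not (s.peer_reviewed and s.full_paper):
        return False
    # Older studies must have at least one citation that is not a self-citation;
    # studies from the last five years are exempt.
    if s.year <= CITATION_CUTOFF_YEAR and s.citations_excl_self < 1:
        return False
    return True


candidates = [Study(2010, 0), Study(2010, 3), Study(2017, 0)]
print([passes_quality_gate(s) for s in candidates])
```

Encoding the mechanical criteria this way leaves the reviewers' time for the criteria that need reading the paper.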

3 Results

The first step of the search process is to apply the search query, along with the basic inclusion and exclusion criteria, to the search databases. Step two is to screen the resulting studies on the basis of their title, keywords, abstract and conclusion. Step three is to apply the remaining inclusion and exclusion criteria along with the quality assessment. This involves critically going through the studies to see whether they contribute to the existing methods of credit risk evaluation. Studies are also excluded if they were published between 1993 and 2014 and do not have a single citation. Step three is repeated a fixed number of times to make sure that all the studies are relevant to our systematic review.

Fig. 3 Results of the search process

Fig. 4 Count of papers under different digital libraries categorized by risk evaluation technique

Fig. 5 Year wise trend of number of papers for different risk evaluation techniques

A total of 136 papers were shortlisted for review in our study. Fig. 3 shows how the papers were shortlisted during the search process. The numbers in the figure are the numbers of studies remaining after each step. The initial studies and the shortlisted studies can be accessed from [22]. After applying the search query with the basic inclusion and exclusion criteria to the digital libraries and adding the manual search results, we obtained 1032 studies. As can be seen in Fig. 3, a large number of studies came from the Springer library, whereas only 16 came from the ACM Digital Library. This is because Springer has a large collection of scientific journals from which our search results are drawn; there were also duplicates among its conference papers and book chapters. These studies were screened based on their title, abstract, keywords and conclusion, leaving us with 149 studies. The remaining inclusion and exclusion criteria, such as the exclusion of fraud detection papers that did not involve conspiracy by customers, were then applied to these 149 studies. After repeating this step for 2 iterations with the co-authors as reviewers, we were left with 138 studies. A final shortlisting of these 138 studies left us with 136 studies. The figure gives the numbers remaining after each step of the inclusion and exclusion criteria.
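The counts reported above (1032, 149, 138 and 136 studies) can be turned into step-wise retention rates with a few lines of arithmetic. The stage labels in this sketch are our own paraphrases of the steps.

```python
funnel = [
    ("search query + basic criteria + manual search", 1032),
    ("screening on title, abstract, keywords, conclusion", 149),
    ("remaining criteria, two review iterations", 138),
    ("final shortlisting", 136),
]


def retention(funnel):
    """Percentage of studies kept at each step, relative to the previous step."""
    rows = []
    for (_, before), (step, after) in zip(funnel, funnel[1:]):
        rows.append((step, after, round(100 * after / before, 1)))
    return rows


for step, kept, pct in retention(funnel):
    print(f"{step}: {kept} kept ({pct}% of previous step)")

overall = round(100 * funnel[-1][1] / funnel[0][1], 1)
print(f"overall: {overall}% of the initial studies")
```

The first screening step removes by far the most studies (about 14.4% are kept), while the later steps retain 92.6% and 98.6% of their inputs, so roughly 13.2% of the initial studies reach the final review.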

Figure 4 gives the number of papers for the different risk evaluation techniques, grouped by digital library. The number of papers on default prediction is lower than for the other risk evaluation techniques. Credit scoring is the most widely used credit risk evaluation technique. The year-wise trend of the number of papers for the different credit risk evaluation techniques can be seen in Fig. 5. The number of studies on credit scoring peaked in 2009, primarily because of the economic downturn of 2008. The number of studies on fraud detection has increased in recent years, primarily because of increased digitization in banks. Studies on NPA prediction have risen since 2016, indicating a need for more thorough credit scoring and echoing the observations of the IMF [1] and banking regulators.

Fig. 6 Count of papers on AI techniques for credit risk evaluation classified as ML, survey and others

Table 1 shows the survey papers published earlier and how they differ from our paper. We noticed that only one of them followed the systematic literature review approach. The comprehensive and systematic nature of this study makes it unique.

Table 1 Summary of non-ML AI approaches for credit risk evaluation

3.1 Answer to RQ1: AI techniques

Artificial Intelligence can broadly be divided into the categories shown in Fig. 2. The categories relevant to us are Knowledge Representation, Planning and Deductive Reasoning, and Problem Solving. Planning involves Machine Learning, which is categorized into supervised, unsupervised and reinforcement learning. However, not all models under these categories are popular for credit risk evaluation. Among non-ML AI techniques, ontology-based and fuzzy logic based systems have been proposed by researchers for credit risk evaluation. The studies shortlisted after the search process were divided into three categories, namely credit scoring, NPA prediction and fraud detection. We further classified them into studies that use ML techniques, survey/analysis studies, and studies that use non-ML AI techniques. The results can be seen in Fig. 6. The AI techniques (other than ML techniques) used for credit scoring and fraud detection are shown in Table 2.

Fuzzy logic based models for credit scoring are popular. Marikkannu and Shanmugapriya [23] proposed a fuzzy set based, domain-driven approach for customer credit data classification. Linear combinational sets of attributes for classification are built using domain expertise. Romanyuk [24] proposed a decision support system concept for loan granting. It is based on a loan price function, which is a continuous function of the borrower’s credit score. Wei [25] also proposed a credit risk assessment model based on fuzzy theory. Hoffman et al. [26] proposed two evolutionary fuzzy rule learners. The first uses an evolution strategy to generate approximate fuzzy rules, in which every rule has its own membership function definitions. The second is a genetic algorithm that extracts descriptive fuzzy rules. In this method, all fuzzy rules share a common, linguistically interpretable definition of membership functions, in disjunctive normal form. Other AI methods, such as ontology-based, echo state network based, decision table, MobiScore, bstacking, expert system, grey relational analysis, adaptive reference system and domain adaptation approaches, have also been explored by researchers for credit scoring. Kotsiantis et al. [27] proposed an ontology-based system that predicts credit risk using intelligent reasoning and searching mechanisms. The proposed ontology was designed and implemented to represent statements that are financial in nature. Using an ontology allowed the domain to be modeled in a way that is shareable, efficient and reusable. Pedro et al. [29] proposed MobiScore, an approach in which mobile phone usage data is used to build a model of the user’s financial risk. This model could prove a good alternative when the applicant’s financial history is not available. Xia et al. [30] proposed a novel heterogeneous ensemble credit model that combines the bagging algorithm and the stacking technique. Mahmoud et al. [31] proposed an expert system for assessing and supporting credit decisions in the banking sector. The main goal of expert systems is to make expertise available to technicians and decision makers. Lin et al. [32] proposed a grey relational analysis (GRA) approach for credit risk assessment in the banking sector. Huang and Chen [33] proposed a data mining strategy based on domain adaptation for tasks that require credit risk assessment. In this method, the algorithm is trained on a source domain with numerous samples and then applied to a target domain with relatively few samples. It does not require the two domains to have the same distribution.
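To make the notion of a fuzzy rule with membership functions concrete, the following is a generic sketch of a single fuzzy rule for credit risk. It is not the method of any of the studies cited above: the input variables, the breakpoints and the function names are all assumptions chosen for illustration, and the fuzzy AND is taken as the minimum, which is one common choice.

```python
def ramp_up(x, lo, hi):
    """Membership that rises linearly from 0 at `lo` to 1 at `hi`."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def ramp_down(x, lo, hi):
    """Membership that falls linearly from 1 at `lo` to 0 at `hi`."""
    return 1.0 - ramp_up(x, lo, hi)


def high_risk_degree(debt_to_income, years_employed):
    """Degree to which one rule fires:
    IF debt is high AND tenure is short THEN risk is high.
    """
    debt_is_high = ramp_up(debt_to_income, 0.2, 0.6)       # hypothetical breakpoints
    tenure_is_short = ramp_down(years_employed, 1.0, 5.0)  # hypothetical breakpoints
    return min(debt_is_high, tenure_is_short)              # min as fuzzy AND


print(high_risk_degree(debt_to_income=0.4, years_employed=3.0))
```

Unlike a crisp threshold, the rule returns a degree between 0 and 1, so borderline applicants are graded rather than forced into a good or bad class.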

Quite a few non-ML artificial intelligence techniques are used to detect fraud in transactions and loans. Gadi et al. [34] applied an Artificial Immune System to credit card fraud detection. They also compared the results with those of other techniques such as Naive Bayes, Bayesian Networks, Neural Networks and Decision Trees. For parameter optimization, they used a Genetic Algorithm (GA). Duman and Ozcelik [35] used a GA and scatter search for detecting credit card fraud. Van Vlasselaer et al. [36] proposed APATE, an approach to detect fraudulent credit card transactions in online stores. The approach combines intrinsic and extrinsic features of the transactions; combining both kinds of features leads to the best performing models. The key observations from the research on non-ML AI techniques for credit risk evaluation are:

  1. The results from various approaches show that their accuracy is comparable with Decision Trees, Support Vector Machine (SVM), Neural Networks, etc. However, there is no study which compares these proposed approaches to see which among them is better.

  2. Fuzzy theory based systems for credit risk evaluation have potential to be used.

  3. There are three studies which propose a tool for credit granting institutions [27, 31, 37] to help them in loan granting decisions. These tools are a good option for banks to adopt, provided the administrator of the tool has the domain and modeling knowledge to change the classification as external conditions change.

A summary of the non-ML AI approaches for credit risk evaluation is given in Table 3. Over the past two decades, interest in machine learning has grown among researchers as the computing capability to process large volumes of data has become available. The results in Fig. 6 show that ML techniques are increasingly being explored by researchers for credit risk evaluation, covering credit scoring, default prediction and fraud detection. In answer to RQ2, we discuss ML techniques for credit risk evaluation.

Table 2 Non-ML AI techniques for credit risk evaluation
Table 3 Summary of non-ML AI approaches for credit risk evaluation
Table 4 ML techniques for credit risk evaluation
Fig. 7 ML techniques for credit risk evaluation

3.2 Answer to RQ2: ML techniques

The ML techniques used for credit scoring, credit risk evaluation, NPA prediction and fraud detection are tabulated in Table 4, and their distribution can be seen in Fig. 7. One major finding from these studies is that ML techniques outperform traditional statistical and optimization models [143]. Malhotra and Malhotra [143] suggested that neural networks perform better than traditional statistical and optimization techniques. Huang and Day [144], however, showed that support vector machine models had the best accuracy rates among the 17 classification models they investigated, outperforming earlier classification models in the credit scoring context. This is supported by Khemakhem and Boujelbene [145], who studied credit risk evaluation for Tunisian banks and compared traditional models with modern ANNs and SVMs. They concluded that the RBF kernel SVM was the best method in terms of sensitivity, specificity and accuracy, with the lowest error rates. In contrast, Nwulu and Nnamdi [146] carried out a comparative analysis of SVM and ANN for credit scoring and concluded that ANNs perform slightly better than SVMs. Thus, the growing interest of researchers in developing ML techniques is justified, as these models are more accurate. In the following subsections, we explain in detail how credit risk is computed and which datasets researchers use for their machine learning models.

Credit risk evaluation is done through the development of classification models that distinguish between creditworthy and non-creditworthy clients [46]. A common approach is to apply a classification technique to historical customer data in order to find a relation between the characteristics of the customer and failure of the loan. An important component of modern techniques for credit risk evaluation is therefore an accurate classifier that discriminates between good customers and bad ones. Because of its importance and the better accuracy figures reported, there is increasing research interest in credit risk assessment through machine learning techniques. Many statistical models and optimization techniques, such as linear discriminant analysis [147], logit analysis [148], probit analysis [149], linear programming [150], integer programming [151] and k-nearest neighbor (KNN) [152], are widely applied to credit risk evaluation and modeling tasks. Although these techniques can be applied to credit risk assessment, they leave room for improvement. Recent studies have revealed that artificial intelligence (AI) techniques, such as SVM and neural networks, perform better than traditional statistical models and optimization techniques for credit risk evaluation, owing to the flexibility of tuning their weights and their ability to classify even when the data are not easily separable. We describe the research related to each credit risk type separately here, and at the end of each type we draw key observations.

3.2.1 Credit scoring

Credit scoring using machine learning is generally done using some kind of classifier which differentiates between creditworthy and non-creditworthy customers using the previous data of the customers. An important step in the classification process is to choose an accurate classifier for classification of good customers and bad customers. The ML techniques used by researchers for credit scoring can be seen in Table 4. The different techniques that can be seen are neural network techniques and its variants, SVM and its variants, Naive Bayes, Markov Chain, HMM, Bayesian Networks, Decision Tree, Bayesian Ensemble, HLVQ-C, Hybrid models and Ensemble models.

Neural networks have become increasingly popular among researchers in recent years. Li et al. [44] proposed a model based on the Back-Propagation (BP) algorithm to distinguish “good credit” groups from “bad credit” groups. Li and Wu [43] and Zhu et al. [45] proposed credit risk assessment models based on BP Neural Networks to identify potential defaulters.

Hu and Tang [47] proposed an artificial neural network (ANN) based credit risk assessment which measures the credit score of the applicant. This model has many characteristics such as self-adaptation, self-learning and parallel processing. The most suitable candidates for this model are the domestic commercial banks which have incomplete data and delayed data. Dima and Vasilache [46] proposed an ANN model for corporate credit risk evaluation to classify good creditors from bad ones. The paper uses probit regression and ANN model and the classification is based on the number of delay days.

Derelioǧlu et al. [53] proposed a cascaded MultiLayer Perceptron (MLP) and Neural Rule Extraction (NRE) system for classifying customers as either creditworthy or uncreditworthy. The rule extraction stage reveals the conditions under which a customer was judged to be good in the final decision. Zhang et al. [58] proposed a credit risk evaluation approach using a flexible neural tree (FNT) model for classification of loan applicants. Zhaoji et al. [59] proposed a wavelet network model based on Particle Swarm Optimization (PSO) for classification of loan applicants. Fan and Yang [60] proposed a denoising-autoencoder-based neural network model for credit risk analysis. The authors observed that traditional ANNs learn not only from the training data but also from the noise in it, and proposed this model to overcome that drawback. Lai et al. [51] built a neural network metalearning model for credit scoring. Marin-de-la-Barcena et al. [56] applied artificial metaplasticity, which is modeled on a biological property of real neurons, to an MLP and thereby proposed a new machine learning method for credit scoring. Tomczak and Zieba [57] proposed a scoring model based on Classification Restricted Boltzmann Machines (ClassRBM). The model first trains a ClassRBM on the data and then generates a scoring table. The geometric mean of sensitivity and specificity is used to account for imbalanced data. Baesens et al. [55] analysed three real-life credit datasets using neural network rule extraction techniques. The rules were extracted using three rule extraction techniques, and decision tables were used to visualize the scores. It was concluded that neural rule extraction techniques have the potential to be used for credit risk analysis.
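The geometric mean of sensitivity and specificity mentioned above can be illustrated with a short sketch. It assumes that the "bad" (defaulting) class is treated as the positive class and that the confusion-matrix counts are already available; the counts used in the example are hypothetical and do not come from any cited study.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity.

    tp, fn: bad customers predicted bad / predicted good
    tn, fp: good customers predicted good / predicted bad
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def accuracy(tp, fn, tn, fp):
    """Plain accuracy, shown only for comparison."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical portfolio: 900 good and 100 bad customers, and a
# degenerate classifier that labels every customer as good.
counts = dict(tp=0, fn=100, tn=900, fp=0)
acc = accuracy(**counts)
gm = g_mean(**counts)
```

In this hypothetical case the degenerate classifier reaches an accuracy of 0.9 while its G-mean is 0, which is why the G-mean is a more informative criterion than accuracy when the classes are imbalanced.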

Researchers are also moving towards hybrid systems that include neural networks. Huang et al. [98] proposed classification of loan applicants of state-owned commercial banks using fuzzy neural networks. Huang and Tian [97] proposed a classification model of applicants for commercial banks based on the Fuzzy Probabilistic Neural Network (FPNN) model, which combines the Probabilistic Neural Network (PNN) with the relative membership degree from fuzzy mathematics. Oreski et al. [92] proposed a hybrid system with a Genetic Algorithm (GA) and ANNs for credit scoring of applicants. In this model, feature selection is done using the GA and classification using ANNs. The proposed hybrid system was found to be competitive with other models for credit scoring tasks. Taremian and Naeini [93] proposed a Hybrid Intelligent Decision Support System (HIDSS) for credit risk evaluation that classifies applicants as creditworthy or not, based on neural networks and GAs. An MLP neural network was used for this purpose, with a co-evolutionary process used to train its weights. Weidong et al. [95] proposed a hybrid model based on a Back Propagation (BP) Neural Network and Logistic Regression. The primary advantage of this model is that it gives better accuracy than logistic regression alone and is more robust than a BP neural network alone. Djemaiel et al. [96] proposed a hybrid neural network model that combines a Radial Basis Function (RBF) neural network with an Elman neural network, set in a big data context. The proposed model proved to be efficient when used to classify customers as “good” or “bad” based on their credit scores, and can therefore be a good choice of classification technique for credit scoring. Fu and Liu [89] proposed the GA-RBF Neural Network Model, in which an RBF neural network is combined with a GA. The genetic algorithm is used to optimize the weights, the positions of the centers and the spreads of the centers of the RBF neural network.

SVM is a widely researched classification technique for credit scoring for several reasons. Only a small number of data points, the support vectors, are needed to determine the optimal hyperplane. SVMs provide excellent generalization ability and are relatively easy to train. Unlike neural networks, SVMs do not suffer from local optima. SVMs also scale relatively well to high-dimensional data and allow a trade-off between classifier error and complexity. Many researchers have used SVM and its variants to perform credit scoring. Farquad et al. [62] proposed a PCA-SVM model which performs PCA for dimensionality reduction on the dataset and uses SVM for classification. The PCA-SVM model performed well: it had better accuracy than SVM alone and also outperformed the PCA-Logistic Regression model. Harris [63] introduced the clustered Support Vector Machine (C-SVM) for credit scoring. This model was proposed in response to SVM being computationally expensive for high dimensions. C-SVM tries to address this challenge and provides the credit score of a customer in relatively little time even if the dataset is nonlinear and large. Huang [64] integrated Kernel Graph Embedding (KGE), a graph based technique used for dimensionality reduction, with SVM for credit scoring. This SVM-KGE classifier was shown to be better than traditional SVMs and other multi-class SVMs. Li [65] proposed a model based on a fuzzy integral support vector machine in which the importance of the output of each sub-SVM is taken into account. This method was shown to perform better than SVM applied alone. Feng et al. [66] and Yang et al. [68] proposed an SVM classification model for commercial banks that uses PCA for dimensionality reduction, similar to the PCA-SVM model proposed by Farquad et al. Lv and Peng [67] proposed a model which combines rough sets and SVM to evaluate credit risk in commercial banks. An indexing system was established in this model, and the number of indexes was reduced using rough sets. A comparison showed that the rough set-SVM method is more precise and efficient than a back propagation (BP) Neural Network. Wei et al. [74] proposed classification of credit applicants using SVM with a mixture of kernels. The model uses the 1-norm and a convex combination of basic kernels. Computational cost is greatly reduced because the quadratic programming problem is reduced to a single linear programming problem. Wei et al. [69] also proposed a least squares support vector machine with mixture kernel (LS-SVM-MK), applying the mixture of kernels to LS-SVM rather than to plain SVM. Problems of the traditional LS-SVM model for credit risk evaluation, such as the loss of robustness and sparseness, were addressed using the mixture of kernels. It was found that LS-SVM-MK can improve the generalization ability of LS-SVM and can obtain a smaller number of features. Sun and Yang [73] proposed a multi-layer SVM classifier to evaluate the credit risk for commercial banks. The accuracy of this method is shown to be higher than that of a BP neural network. Lai et al. [75] proposed the use of the least squares support vector machine (LSSVM) technique to design a credit risk assessment system for classifying “good” customers and “bad” customers. Only a set of linear equations needs to be solved, rather than the traditional quadratic programming problem, which reduces the computational complexity. Gestel et al. [79] proposed a Least Squares SVM classifier for credit scoring that outperforms traditional SVM classifiers. Later, Gestel et al. [77] proposed a Least Squares Support Vector Machine (LS-SVM) classifier within the Bayesian evidence framework, which automatically inferred and analyzed the creditworthiness of potential corporate clients. This method of classification was shown to be better than traditional Linear Discriminant Analysis (LDA) and Logistic Regression models. Zhu et al. [70], Ma and Liu [71], Li et al. [154] and Li et al. [76] also proposed SVM models for distinguishing good creditors from bad ones.

Ruiz et al. [90] and Gestel et al. [102] proposed hybrid models which use logistic regression and SVM to perform credit scoring. For loan classification, Ruiz et al. modeled the credit score based on non-traditional data obtained from smartphones. Gestel et al. emphasize good readability of the model and show that, starting from a basic model and gradually increasing the complexity of the SVM model, the readability and performance of the model improve. Huang et al. [91] proposed a data mining approach using SVM for credit scoring. The proposed hybrid strategy integrates GA and SVM to perform model parameter optimization and feature selection simultaneously. Zhou and Bai [99] proposed an SVM classifier using a genetic optimization algorithm which is hybridized with rough set theory. Applying rough set theory yields a reduced information table, which is used both to train the SVM and to craft the classification rules. Hao et al. [101] proposed a Fuzzy SVM (FSVM) for credit scoring. FSVM assigns a fuzzy membership to each data point, which helps to improve the generalization ability of traditional SVMs. Jiang and Yuan [100] used Particle Swarm Optimization (PSO) to search for the SVM parameters; after the search, the SVM model is used for credit scoring. Martens et al. [78] proposed rule extraction techniques for SVMs and introduced two others, Trepan and G-REX, which are taken from the AI domain. The proposed technique does not lose much accuracy and also provides comprehensibility, or readability, compared with other models. The accuracy of this model is even comparable to C4.5 and logit.

Apart from neural network and SVM based approaches, several other classification techniques have been proposed for credit scoring. Though not a popular classification model for credit scoring, the Naive Bayes approach has also been proposed by some. Vedala and Kumar [80] proposed a Naive Bayes classification for credit scoring, primarily on e-lending platforms. The paper uses social networks to extend its database. Okesola et al. [81] also proposed a Naive Bayes classification model for credit scoring, in which the input variables are demographic and material indicators. A modern approach for credit scoring is the decision tree method [155]. Szwabe and Misiorek [87] proposed a decision tree model for making credit decisions. In this paper, several approaches for classification of loan applications that produce a single decision tree as their final result are evaluated. Xia et al. [88] proposed a boosted decision tree approach for credit scoring, with a Bayesian technique used for hyperparameter optimization. Wei et al. [85] proposed a model for credit risk evaluation using a decision tree algorithm. Lang and Sun [86] studied the problem of class imbalance in credit risk early warning by applying a decision tree algorithm. Empirical results showed that the decision tree algorithm is strongly sensitive to imbalanced data when it is used for early warning of credit risk. The Hidden Markov Model (HMM) has also been explored for credit scoring. Benyacoub et al. [82] proposed an HMM for credit scoring combined with the Baum-Welch procedure for iterative re-estimation of the parameters from a sequence of observations. Petropoulos et al. [83] used Student’s-t hidden Markov models (SHMMs) for a corporate credit scoring system. The ability to capture correlations and high robustness to outliers are additional advantages of SHMMs, which are shown to have competitive performance compared with other models. Timofeev and Timofeeva [61] proposed an estimation of loan portfolio risk based on a Markov chain model. A discrete-time model is used, and the system state is recorded at identical time intervals of one month.
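A discrete-time Markov chain with a monthly step, as described above, can be sketched in a few lines. The three states ("current", "late", "default") and the transition probabilities below are hypothetical placeholders chosen for illustration; they are not the states or estimates used in the cited study.

```python
def step(dist, P):
    """Advance a state distribution by one month: dist_next = dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def propagate(dist, P, months):
    """Apply the monthly transition matrix P for the given number of months."""
    for _ in range(months):
        dist = step(dist, P)
    return dist


# Hypothetical monthly transition matrix; each row sums to 1.
# State order: current, late, default (default is absorbing).
P = [
    [0.90, 0.10, 0.00],
    [0.50, 0.30, 0.20],
    [0.00, 0.00, 1.00],
]

start = [1.0, 0.0, 0.0]            # every loan starts as "current"
after_two_months = propagate(start, P, 2)
expected_default_share = after_two_months[2]
```

Under these assumed probabilities, the last component of the propagated distribution gives the share of the portfolio expected to be in default after the chosen number of months, which is the kind of quantity a portfolio risk estimate would build on.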

Another method used by researchers for credit risk evaluation is ensemble learning. It is similar to hybrid systems; the difference is that in ensemble learning the decision is taken by pooling multiple classifiers, whereas in hybrid classification various techniques are applied to the data and the resulting parameters and pre-processed data are passed on to a single classifier that performs the classification. There are many examples of researchers using ensemble learning for credit risk evaluation, and ensemble techniques are widely used because they are reported to outperform individual classifiers. Chen et al. [108] proposed an ensemble model that combines logistic regression analysis (LRA), MLP-NN and clustering, following a Bayesian approach for the ensemble. It was found that this method outperforms single classifiers. Hsieh et al. [107] proposed an ensemble classifier which incorporates various data mining techniques. Class-wise classification is introduced as a preprocessing step, and Bayesian network, SVM and neural network classifiers are used to augment the ensemble classifier. Ziȩba and Świa̧tek [105] proposed an ensemble classifier based on the switching class labels technique, which addresses two data mining problems: an asymmetric cost matrix and an imbalanced dataset. Zhen and Wenjuan [104] proposed an SVM ensemble method based on the fuzzy integral for credit risk evaluation. Different weights are given to the separate component SVMs and their outputs are aggregated to give the result. The accuracy of the model was found to be satisfactory.
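The idea of pooling multiple classifiers can be sketched with simple majority voting. This is only one of several possible pooling schemes (the studies above also use Bayesian and fuzzy-integral aggregation), and the three rule-of-thumb base classifiers and their thresholds are hypothetical stand-ins for trained models.

```python
from collections import Counter


def majority_vote(classifiers, applicant):
    """Return the label predicted by most of the base classifiers."""
    votes = [clf(applicant) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical base classifiers; in practice these would be trained
# models such as logistic regression, an MLP and an SVM.
def by_income(a):
    return "good" if a["income"] >= 40 else "bad"


def by_debt_ratio(a):
    return "good" if a["debt"] / a["income"] < 0.8 else "bad"


def by_history(a):
    return "good" if a["late_payments"] == 0 else "bad"


ensemble = [by_income, by_debt_ratio, by_history]

applicant = {"income": 30, "debt": 20, "late_payments": 0}
decision = majority_vote(ensemble, applicant)
```

An odd number of base classifiers is used here so that a two-class vote cannot tie. A hybrid system, by contrast, would chain such components so that only the last one issues the decision.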

Krishna and Ravi [169] proposed feature subset selection method by incorporating Adaptive Differential Evolution as a wrapper and tested it on three datasets for both credit scoring and fraud detection. The proposed method proved to be better than the previous ones.

Table 5 Studies of ML techniques in credit scoring

We also found three studies on credit scoring using deep learning approaches, which suggest that deep learning can be useful in evaluating credit scores. A summary of the techniques used in the studies for credit scoring can be found in Table 5. Here, \(\eta\) is the learning rate of the neural network, and C and \(\sigma\) are the parameters of a nonlinear support vector machine (SVM) with a Gaussian radial basis function kernel. The key observations drawn from the research on credit scoring are:

  1. Neural networks, most notably feed forward neural networks, are the most widely studied models for credit scoring. Their primary advantage is their excellent generalizability. However, the interpretability of these models is an issue, as they are black box models, which makes it difficult for the person in charge of granting loans to understand the process followed by the model.

  2. SVM has been used in many studies for classification; however, it becomes computationally expensive when large datasets are used. This problem has been addressed by some studies [63, 90].

  3. Hybrid and ensemble models are becoming popular, as the proposed models overcome the shortcomings of individual classifiers and provide better accuracy rates.

  4. There are survey studies that compare the results of different classifiers [144, 156]. A comparison of individual and ensemble classifiers is done by Singh [157].

3.2.2 NPA prediction

Another type of credit risk evaluation is NPA or default prediction. It is performed to predict which loans are likely to default so that appropriate measures can be taken. The ML techniques used for default prediction are different types of neural networks, SVM and hybrid models. Zhang [111] proposed an early warning default risk model based on rough sets and the BP Neural Network algorithm. First, a default index is created for personal loans, and rough sets are then applied to it to streamline the indexes. A BP Neural Network is then trained on the data samples to determine the default risk. Makrygianni and Markopoulos [110] proposed default prediction using a feedforward ANN which considers economic and personal information of the loan applicant. The proposed model was found to give satisfactory accuracy. Ribiero et al. [115] proposed enhanced default risk models using SVM+. Generalization is improved further with SVM+, as it takes into account not only the training data but also additional information. SVM+ outperformed the baseline SVM on a French company dataset. Feki et al. [114] proposed methods for discriminating between banks according to their rate of Non Performing Loans (NPLs), using different approaches of multiclass SVM and Gaussian Bayes models. Strategies for variable selection are also proposed. Ni et al. [117] proposed an extension of factorization machines called RobustFM, which is intended to address the class imbalance and noisiness problems in default prediction. In terms of F-measure, RobustFM outperforms traditional state-of-the-art classifiers. Chen et al. [113] proposed a loan default prediction model that uses a hybrid undersampling method named DSUS, in which a stochastic sensitivity measure and an RBF Neural Network are combined with the k-means clustering method. Data from a P2P company in China was used to validate the performance of the method. Su and Zhang [120] proposed an early-warning model that optimizes the weights and thresholds of a BP neural network using GA, based on the nonlinear combinatorial forecasting principle. The accuracy and the simulation error are reported to improve with the GABP method compared with traditional methods. Miglionico and Parillo [112] proposed an early warning indicator system using an ANN. The implementation used a custom developed sfloat24 Math library; the ANN consisted of 3 layers and was implemented on a low cost FPGA device. Fault tolerance and good accuracy are the characteristics of ANNs in loan risk evaluation. Yao et al. [121] proposed an indicator system to evaluate the credit risks of commercial banks based on a fuzzy neural network. The results were good, and the model was found to serve better than black box neural network models. Oguz and Gurgen [116] explored the Hidden Markov Model (HMM) for probability of default (PD) modeling and classification. With PD modeling, credit customers are assigned default or bankruptcy probabilities instead of being classified as creditworthy or uncreditworthy. The HMM method is shown to be robust and powerful for default prediction tasks. Table 6 summarizes the studies on default prediction.
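Since the F-measure is used above as a comparison criterion for default prediction, a minimal sketch of its computation from confusion-matrix counts may be helpful. It assumes that defaulted loans form the positive class; the `beta` parameter and the counts in the example are illustrative assumptions and are not results from the cited studies.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure: weighted harmonic mean of precision and recall.

    tp: defaults correctly flagged
    fp: non-defaults wrongly flagged
    fn: defaults that were missed
    beta > 1 weights recall more heavily than precision.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical outcome: 50 defaults caught, 50 false alarms, 50 missed.
f1 = f_measure(tp=50, fp=50, fn=50)
```

Because true negatives do not enter the formula, the F-measure is not inflated by the large number of correctly classified non-defaulting loans, which makes it a common choice for imbalanced default data.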

Table 6 Studies of ML techniques in default prediction

The key observations drawn from the research on NPA or default prediction are:

  1. SVM and neural networks are mainly used for default prediction.

  2. Recently, hybrid models have gained a lot of popularity as they outperform individual classifiers. The performance of SVM can be enhanced by combining it with methods such as rough set theory and fuzzy theory.

  3. The lack of public datasets for default prediction and governmental regulations that are primarily rule based seem to have curtailed research on NPA prediction. However, recent guidelines from regulators on early warning look encouraging for the use of ML techniques in NPA prediction.

3.2.3 Fraud detection

Fraud in financial transactions can endanger the reputation of banks among customers as well as cause heavy losses. As noted by Abakarim [132], banks and financial institutions are investing in perfecting machine learning algorithms and big data analytics to identify fraud and to build fraud detection systems that are accurate and competitive. There can be many types of fraud in the banking sector. However, as stated earlier, we focus only on credit card fraud, banking transaction fraud and loan application fraud committed by customers. Fraud detection is a binary classification problem in which the loan is categorized as either ‘fraudulent’ or ‘non-fraudulent’. The idea is to apply a well suited classifier to the problem; the classifier must also be trained on a suitable dataset. The major approaches for coping with credit card fraud in banking are either statistical or based on artificial intelligence. The ML techniques applied by researchers in these studies are NN, HMM, SVM and its variants, and decision trees.

Mubarek and Adalı [122] proposed an MLP neural network technique for fraud detection. The proposed MLP ANN was shown to yield better average performance than Naive Bayes and Decision Tree models. Patil and Dharwadkar [123] worked on customer retention and fraud detection and proposed a supervised ANN for classification, which showed competitive results and better accuracy than similar models. Ghobadi and Rohani [124] proposed a credit card fraud detection model using Artificial Neural Networks. The model also includes a Meta Cost procedure, added to deal with the class imbalance in the data. Zhan and Yin [126] proposed a fraud detection method for loan applications based on a Neural Network and a Knowledge Graph. The borrower’s phone network is used to extract features, which is a time consuming process when done using other methods. Kazemi and Zarrabi [127] proposed deep neural networks for fraud detection in credit card transactions. A deep autoencoder is used to extract features from the information provided by credit card transactions. Deep learning has proved beneficial in several fields, and this model was shown to do well for credit card transaction fraud detection. Zamini and Montazer [128] proposed an unsupervised fraud detection method using autoencoder based clustering. The autoencoder consists of 3 layers and k-means is used for clustering. The model proved to be better than the other models it was compared with. Liu et al. [129] proposed an Ant Colony based approach for fraud detection in business. The model performs better than traditional ANNs, as the local optima problem is solved in the ant colony optimization based approach. Charleonnan [130] proposed a credit card fraud detection technique using the RUS and MRN algorithms, named RUSMRN. Classification of unbalanced data is done using boosting and data sampling, and the data was collected from a Taiwanese bank. Bouchti et al. [131] used deep reinforcement learning (DRL) for fraud detection in banks. The paper covers various interesting facts about DRL and shows competitive performance for the DRL method. Although the paper is rather technical, it presents the research community with a new approach to fraud detection. Karlos et al. [133] predicted fraudulent financial statements (FFS) using active learning. Supervised learning methodology has been used for this purpose, and the active learning strategy seemed to perform better than supervised models. Jiang et al. [134] proposed an approach for credit card fraud detection using a feedback mechanism and an aggregation strategy. Rahmawati et al. [135] proposed fraud detection in the business processes of bank credit applications using a Hidden Markov Model (HMM). The accuracy of the method was found to be competitive, at 94%. Khan et al. [136] proposed a credit card fraud detection system using a Hidden Markov Model (HMM). The system can scale to large databases, that is, to large volumes of credit card transactions. Kotsiantis et al. [138] predicted fraudulent financial statements (FFS) using decision trees. Published financial data was used for detecting fraudulent financial statements, and the performance of machine learning techniques on this data was evaluated. The decision tree, whose input vector contained only financial ratios, was shown to achieve the best performance among all the classifiers considered. Ravishankar et al. [158] analysed the detection of financial statement fraud using data mining techniques. The dataset was taken from 202 Chinese companies, and the comparison was done with and without feature selection. Without feature selection, the Probabilistic Neural Network (PNN) outperformed all other techniques; with feature selection, PNN and Genetic Programming (GP) outperformed the others.
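Random undersampling (RUS), which is mentioned above as a way of handling unbalanced fraud data, can be sketched as follows. This is a generic illustration of the sampling step only; it is not the RUSMRN algorithm, it omits the boosting stage, and the function name, the fixed seed and the toy data are assumptions made for the example.

```python
import random


def random_undersample(rows, labels, majority_label, seed=0):
    """Keep all minority rows and an equally sized random subset of the
    majority rows, preserving the original row order."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y != majority_label]
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    n_keep = min(len(minority_idx), len(majority_idx))
    kept_majority = rng.sample(majority_idx, n_keep)
    idx = sorted(minority_idx + kept_majority)
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Toy data: 95 legitimate transactions (label 0) and 5 frauds (label 1).
rows = list(range(100))
labels = [0] * 95 + [1] * 5
balanced_rows, balanced_labels = random_undersample(rows, labels, majority_label=0)
```

The balanced sample would then be passed to a classifier. Undersampling discards information from the majority class, which is one reason it is often paired with boosting or repeated over several random subsets.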

Hybrid methods have been adopted in fraud detection techniques. Mareeswari and Gunasekaran [140] proposed prevention of credit card fraud using hybrid Support Vector Machines (HSVM). Communal and spike detection are used as hybrid techniques. Scalability is efficient in this method upon updating the evaluation of data. Montini et al. [141] proposed a hybrid sampling model for bank fraud diagnosis. The MLP model is used for training the bank transaction data. Kamaruddin and Vadlamani [142] employed a one-class classification approach in big data paradigm for detecting credit card fraud. It was an implementation of a hybrid architecture of PSO and Auto-Associative Neural Network for one-class classification. Big data analytics is used in this method and this method is also known as PSOAANN. Table 7 summarizes major studies for fraud detection. The key observations drawn from the research on Fraud detection are:

Table 7 Studies of ML techniques in fraud detection
  1. Neural network based classifiers are the most popular among researchers for fraud detection, with 43% of the selected studies being based on neural networks.

  2. ANN based models perform better than linear models [132] for classifying loans as fraudulent or not.

  3. SVM has proved to be better than back propagation neural networks [137] for classifying loans as fraudulent or not.

  4. To the best of our knowledge, there is no significant survey or analysis study on fraud detection using machine learning.

  5. There is extensive research on frauds based on social engineering, cyber attacks and software vulnerabilities; these are beyond the scope of our current study as they are not initiated by the customer.

Table 8 Public datasets for credit risk evaluation

3.2.4 What are the public datasets available?

As several studies noted a lack of public datasets, we analyzed the datasets available for various credit risk evaluation techniques, shown in Table 8. Some studies did not make their dataset public or simulated their own dataset, making it difficult to compare credit risk algorithms. The German, Australian and Japanese Credit datasets are the most used datasets for credit scoring. The datasets are described one by one below.

The German Credit dataset has 1000 instances, of which 700 are non-delinquent (good) customers and the remaining 300 are delinquent (bad) customers. The dataset contains 20 attributes (see Table 9). Interestingly, Microsoft Azure Studio also uses the German Credit dataset to demonstrate credit risk evaluation. From the research studies and examples on the various patterns that can lead to NPA and fraud, we opine that the dataset does not represent real-world scenarios.

The Japanese Credit dataset has 125 instances which represent creditworthy and un-creditworthy clients.

The Australian Credit Approval dataset concerns credit card applications. It has 690 instances with 14 classification features. All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. The dataset consists of a good mix of nominal and continuous attributes. There are also a few missing values.

The Lending Club dataset file is a matrix of about 890 thousand observations and 75 variables for loans issued between 2007 and 2015. The details of all the features are given in Footnote 7. This is probably the biggest dataset available for loans.

The Credit Card Fraud dataset contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. The feature names have been changed to meaningless values, and the dataset has 31 columns.
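To make the degree of imbalance concrete, the following minimal sketch works only from the two counts quoted above. The "balanced" weighting formula (total count divided by the number of classes times the class count) is one common heuristic and is an assumption here, not something prescribed by the studies reviewed; the variable names are illustrative.

```python
# Counts quoted in the text for the Credit Card Fraud dataset.
n_total = 284_807
n_fraud = 492
n_legit = n_total - n_fraud

# Share of the positive (fraud) class among all transactions.
fraud_rate = n_fraud / n_total
print(f"fraud share: {fraud_rate:.4%}")

# Accuracy of a trivial model that labels every transaction as legitimate.
# It is very high, which is why accuracy alone is misleading here.
baseline_accuracy = n_legit / n_total
print(f"always-legit accuracy: {baseline_accuracy:.4%}")

# Assumed heuristic: weight each class by n_total / (n_classes * n_class),
# so that both classes contribute equally to a weighted loss.
n_classes = 2
class_weights = {
    "legit": n_total / (n_classes * n_legit),
    "fraud": n_total / (n_classes * n_fraud),
}
print(class_weights)
```

The trivial baseline shows why studies on this dataset tend to report measures such as recall, precision or AUC rather than plain accuracy.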

Table 9 Attributes of German credit dataset

3.3 Answer to RQ3: What are the challenges or limitations in this focus area?

As seen in the previous subsections, there are numerous AI techniques for credit risk evaluation, which includes credit scoring, NPA/default prediction and fraud detection. There are statistical, rule based and ML based techniques to evaluate credit risk. The main advantage of computer aided credit risk evaluation is that human work is minimized, since the system learns from a pre-collected database to make accurate and reliable predictions. However, this research area, like any other, comes with challenges.

One problem that is difficult to deal with in credit risk evaluation is that the training dataset and the testing dataset can come from different domains. The training dataset could be from a different geographic area or from a different bank than the testing dataset. As rules and regulations can differ across areas and banks, the datasets will vary significantly and so will the relations between their features. This change in domain causes inaccurate classification of samples, and hence there is a need to address the problem of changing domains. Huang and Chen [33] have attempted to tackle this problem, but it needs further exploration.

Another limitation of using machine learning models in credit risk evaluation is the influence of external factors or parameters. For example, a farmer taking an agricultural loan may not be able to pay the loan interest because of factors such as poor rainfall. Such unknown factors (in this case, weather) hinder the ability of machine learning models to make accurate predictions. These factors require access to information that lies outside the banking environment and is not part of the customer profile. Also, some of the data can be in the form of images and unstructured text that must be extracted before it can be used for training the models. Another example is macroeconomic factors such as a country's GDP and inflation.

A common challenge that researchers face in credit risk evaluation is the pre-processing of data. Noisy data or data that contains outliers can heavily affect the performance of a model, and so can redundant and irrelevant features [103]. Researchers use a feature selection or data filtering step to overcome this problem. Fan and Yang [60] tried to overcome the problem of noise using a denoising autoencoder, as discussed before.
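As an illustration of a data filtering step, the sketch below removes outliers with the interquartile range (IQR) rule. This rule is a generic technique chosen here for illustration and is not the method of any particular study cited above; the function name `iqr_filter`, the multiplier `k = 1.5` and the sample values are assumptions.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Keep only the values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical applicant incomes (in thousands) with one extreme entry.
incomes = [32, 35, 36, 38, 40, 41, 43, 45, 400]
cleaned = iqr_filter(incomes)
print(cleaned)
```

Whether an extreme value is noise or a genuine signal (for example, a fraudulent transaction) depends on the task, so such a filter would normally be applied to input features with care rather than blindly.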

One of the prime challenges researchers face in evaluating credit risk arises when datasets get large, because nonlinear classification approaches then become more and more computationally expensive. In credit risk evaluation, the sample data usually contains many irrelevant variables, which need to be removed. These variables make computation more expensive and force redundant computation. For the SVM classifier, the dimension of the matrix in the quadratic programming problem is directly proportional to the number of training points [159]. As the number of training points increases, the matrix grows, and the quadratic programming problem becomes more and more difficult. The SVM-GA model of Huang et al. [91], which takes a long time to train, supports the claim that SVMs require long training times. Researchers are therefore searching for patterns in the datasets that would help bring down the time complexity [63]. SVMs are also black box models, and hence improving the comprehensibility of these models is an area that needs further exploration. Parameter selection is a critical step in successfully building an SVM model for credit risk evaluation. Grid search, rough sets, trial and error, and genetic algorithm based techniques are becoming increasingly popular for parameter selection. GA is a parameter optimization technique, while rough set is an indexing technique. Grid search is another technique for parameter selection; however, it is known to increase the computational cost of SVMs. SVMs can become more robust if parameter selection is explored properly and these techniques are applied to it.
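The scaling argument above can be made tangible with some back-of-the-envelope arithmetic. The sketch below assumes a dense n-by-n kernel matrix stored as 8-byte floats, which is a simplifying assumption (practical solvers often avoid materializing the full matrix), and an illustrative grid of `C` and `gamma` values with 5-fold cross-validation; none of these numbers come from the studies reviewed.

```python
from itertools import product


def kernel_matrix_gib(n_points, bytes_per_entry=8):
    """Memory (in GiB) of a dense n x n kernel matrix of 8-byte floats."""
    return n_points ** 2 * bytes_per_entry / 2 ** 30


# The matrix dimension grows linearly with n, so its storage grows as n^2.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} training points -> {kernel_matrix_gib(n):10.3f} GiB")

# Cost of an exhaustive grid search: one model fit per parameter
# combination and per cross-validation fold.
C_grid = [0.1, 1, 10, 100]          # illustrative values
gamma_grid = [0.001, 0.01, 0.1, 1]  # illustrative values
n_folds = 5
n_fits = len(list(product(C_grid, gamma_grid))) * n_folds
print(f"grid search requires {n_fits} SVM fits")
```

Under these assumptions, doubling the number of training points quadruples the storage, and every additional grid value multiplies the number of fits, which is consistent with the interest in clustered or otherwise reduced training sets.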

Studies have suggested that neural networks outperform many statistical techniques such as discriminant analysis, logistic regression and optimization techniques. However, they are not stable: a model can be applied only to specific samples, and when the sample changes, the model's accuracy changes greatly. A large number of parameters, such as the training method, learning rate and network topology, have to be tuned before neural networks can be successfully deployed. Another major drawback of neural networks in credit risk evaluation is that they cannot explain themselves. While they can achieve high predictive accuracy, the reasoning behind their decisions is not readily available [160]. Neural networks have further drawbacks, such as getting trapped in local optima and overfitting. Also, since neural networks are non-linear in nature, computation can take a very long time on large datasets. Finding the optimal neural network model remains a challenging issue.

Another major concern in the field of credit risk evaluation is data shortage. It is difficult to say that a given method performs better than another under all situations. In realistic situations, competitive pressure and privacy concerns limit the amount of credit risk data a researcher can collect. This makes it difficult for statistical methods and machine learning algorithms to obtain consistently good results for credit scoring. To cope with the challenges of data shortage and poor performance, oversampling and other approaches need to be introduced. Thus, further research is required in the area of data availability and data collection for credit risk evaluation.
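A minimal sketch of the simplest form of oversampling, random duplication of minority-class instances, is given below. It is meant only to illustrate the idea; the function name `random_oversample`, the toy data and the fixed seed are assumptions, and the studies reviewed may use more elaborate schemes.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data: five non-defaulters (0) and one defaulter (1).
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.0]]
labels = [0, 0, 0, 0, 0, 1]
bal_rows, bal_labels = random_oversample(rows, labels)
print(Counter(bal_labels))
```

Oversampling should be applied to the training split only; duplicating instances before splitting lets copies of the same row appear in both training and test data and inflates the measured performance.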

While one may still be convinced to use machine learning models for credit risk evaluation, it is good to keep in mind that machine learning models, like any other models, are not 100% accurate. Relying on them for making decisions therefore carries a risk. It is up to the user to decide to what extent to involve them in the decision making process of credit risk evaluation.

3.4 Answer to RQ4: What are the research trends in credit risk evaluation?

Since structural changes have occurred in the global financial market and the overall risk level has increased, it has become imperative to study credit risk evaluation. Over the last 20 years, much progress has been made in this area. Credit scoring models have traditionally been constructed with two fundamental and popular statistical tools: Linear Discriminant Analysis (LDA) and logistic regression (LR). Over time, new methods have arrived, such as neural networks, SVMs, k-NNs and decision trees, along with the many other methods described in the previous sections. Hybrid models and ensemble models, however, are becoming increasingly popular. Neural networks and SVMs have limitations, which are being tackled by the current generation of researchers.

The prime research being carried out in the field of credit risk evaluation uses classification algorithms that are non-linear in nature, such as neural networks and SVM. The research works related to neural networks and SVMs can be found in the previous subsections. SVM has received a lot of attention in the machine learning community because of its excellent generalization ability. A few studies have performed credit scoring using Naive Bayes classification [80, 81]. For all three types of credit risk evaluation, researchers have also proposed many hybrid models that combine parts of two or more algorithms. Ensemble models for credit scoring are also becoming popular, and the proposed ensemble models outperform single classifiers [162]. The HMM, which has made remarkable achievements in speech recognition, engineering and many other fields, is also applied in credit scoring and fraud detection; Benyacoub et al. [82] proposed an HMM based model for credit scoring. Decision trees are another widely used classification technique for credit scoring. Nevertheless, neural networks and SVM are still the most popular machine learning models for credit scoring, default prediction and fraud detection.
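The basic idea behind many ensemble models can be sketched with majority voting, in which several classifiers label each applicant and the most frequent label wins. This is only one simple combination rule, used here for illustration; the cited ensemble studies may combine classifiers differently, and the three prediction lists below are hypothetical rather than outputs of real models.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists by taking the most common label.

    `predictions` holds one list of labels per classifier; all lists must
    cover the same applicants in the same order.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Hypothetical outputs of three classifiers for four loan applicants.
clf_a = ["good", "bad", "good", "bad"]
clf_b = ["good", "good", "good", "bad"]
clf_c = ["bad", "bad", "good", "good"]

ensemble = majority_vote([clf_a, clf_b, clf_c])
print(ensemble)
```

Using an odd number of classifiers avoids ties in the two-class case. Each classifier here disagrees with the vote on at most one applicant, which hints at how combining models can smooth out individual errors.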

3.5 Answer to RQ5: Universities working in the area of credit risk evaluation

Details on the authors of the papers included in this study can be found in [163]. We observed that a considerable number of the studies are from Chinese universities (see Fig. 8). The notable researchers in the field of credit risk evaluation, according to the number of studies published, are shown in Table 10.

Fig. 8 Number of studies per country

Table 10 Notable authors for credit risk evaluation according to the count of published studies

3.6 Important Results

Important information about some of the studies included in the SLR can be viewed in Table 11. The comments give an insight into how the authors tackled some of the challenges they faced.

Table 11 Additional important results for different approaches in evaluating credit risk

4 Conclusion

As per the protocols of our SLR, we extracted 1032 research papers, of which 136 studies were shortlisted for review. As we analyzed the papers, we found that there are multiple challenges in the field of credit risk evaluation. Each model comes with its own risks and challenges and cannot be relied upon completely for evaluation. According to the famous “no free lunch theorem” [164], a single complex classifier is not a solution to credit scoring or even to fraud detection. This is because of the problem of changing domains, as discussed previously. Different banks from different geographic locations, or even from the same location, will have different rules and regulations, and thus the datasets will vary significantly. Hence, if we train a model on a dataset from one domain and test it on a dataset from another domain, we will lose accuracy. Researchers are exploring this problem by applying ensemble techniques [88]. Ensemble techniques have proved to perform better than single classifiers [165,166,167]. However, poor interpretability or readability of the results is a major drawback of ensemble learning. Therefore, improving the interpretability of ensemble models is another important research area which needs further exploration.

This study included only four digital databases for study selection, so it is possible that we missed some good studies on the topic. However, we are hopeful that we have covered most of the major studies, as our search process also used a snowballing approach along with manual search to identify good studies. Another limitation of the study is that we did not validate or compare the findings or observations stated in some of the studies.

5 Discussion and future work

Applying feature selection methods is an important step in addressing the curse of dimensionality. Among feature selection approaches, there has been an increase in the use of GAs and rough sets [91, 99, 111, 119]. These algorithms are hybridized with classifiers such as SVM to increase the accuracy of the model. Hybridized models are thus becoming popular as more and more researchers build them, and their use has opened up a new area for exploration.

Another area which can be improved upon is the pre-processing of datasets. Datasets are made up of varying features or attributes, and some of these features can be redundant or recurring. This can lead to unnecessary computation and low accuracy. Thus, data pre-processing is an important step in improving the performance of a model. Piramuthu [168] discussed a few means of improving classifier performance through data pre-processing. However, there is room for improvement with more instances representing real-world scenarios. Data for NPA prediction that factors in external, customer and bank features would help banks implement early warning systems more effectively. Having datasets for various fraud topologies will enhance the usage of ML techniques with minimal false positives.

Deep learning is another area of machine learning which uses artificial neural networks (ANNs). We found out that deep learning can be useful in credit scoring and fraud detection. Further exploration is required in this area.

SVMs seem to be a better choice for solving the classification problem. SVM based approaches overcome the hurdles of overfitting and local optima found in ANN-based models. However, there are several challenges in applying SVM, as discussed previously. More concrete research is needed to increase the accuracy of SVM classifiers. Deeper data processing and more suitable kernel functions will help in increasing the accuracy. As historical datasets grow, there is a need to find computationally inexpensive models that can deal with the dimensionality curse of the SVM. Harris [63] proposed a clustered SVM to address the problem. However, this work can be further improved in terms of area under the curve (AUC) and mean model training time.

The studies in our review collection state a need to develop more concrete tools which can address the problem of changing domains of datasets and also provide the flexibility to add any type of model for evaluating credit risk. A possible direction for future work is to combine rule-based, statistical and machine learning models into a single tool that would help evaluate credit risk as per the requirements of the financial body. As most bank staff are not technology savvy, building interfaces that do not require technical understanding but provide parallel processing, self-adaptation, self-learning, robustness and flexibility to assessors will enhance the adoption of ML techniques.