1 Introduction

Data mining methodologies have gained significant adoption in business settings, in particular in the financial services sector [1]. However, little is known about which data mining methodologies are applied and how. Some studies have surveyed data mining techniques and applications across domains, yet they focused on data mining process artefacts and outcomes (e.g. [2]) rather than on end-to-end process methodology. Other studies have surveyed data mining methodologies in the hospitality [3], accounting [4], education [5], and manufacturing [6] industries, but no comprehensive study has been conducted on financial companies. In particular, studies in the banking domain have so far been narrow in scope: they either addressed only specific data mining techniques, typically in connection with a concrete business problem or product domain (e.g. credit cards [7]), or tackled a technique in combination with the required software toolset [8]. Data mining process methodology was not addressed in this research.

Given this gap, we investigate the application of data mining methodologies in the banking domain (Footnote 1). This is achieved by tackling the following research questions: for what purposes are data mining methodologies used in the banking domain? (RQ1), how are they applied (“as-is” vs adapted)? (RQ2), and what are the goals of adaptations? (RQ3).

The research questions are addressed by means of a systematic literature review (SLR). As part of the SLR, existing studies have been categorized by deriving a taxonomy, and examined in depth by analyzing typical application scenarios of data mining methodologies. The paper provides two distinct contributions: (1) it identifies and classifies the application scenarios of data mining methodologies and the business problems addressed in banking industry settings, and (2) it examines adaptations of data mining methodologies, documenting the associated reasons, goals and benefits. In doing so, the paper identifies gaps in ‘de-facto’ standard data mining methodologies that manifest themselves when applied in banking. Further, it provides evidence and insights on which to build further research on the application of data mining frameworks in the banking domain.

The work is structured as follows. Section 2 provides the background while Sect. 3 presents the research design. The findings are presented in Sect. 4 while Sect. 5 concludes.

2 Background

This section provides a brief overview of the data mining concept, existing data mining methodologies, and their evolution.

Data mining methodologies can be defined as a set of rules, processes, and algorithms designed to generate actionable insights, extract patterns, and identify relationships from large data sets [9]. As such, data mining commonly involves extracting, processing, and modeling data by means of dedicated methods and techniques.

Data mining methods are commonly represented as a high-level process [10, 11] that defines a set of activities and tasks together with their required inputs and outputs, accompanied by guidelines on how to perform the steps [10]. The foundations for structured data mining were first proposed by [12,13,14] with the introduction of Knowledge Discovery in Databases (KDD). This approach consists of nine steps. The first step concerns learning the application domain, i.e. understanding the domain and identifying the goals of data mining. The second step focuses on creating the dataset, while the third deals with data cleaning and processing. The fourth step, data reduction and projection, concerns finding useful features to represent the data. In the fifth step, the target outcome is defined, while in the sixth step the methods and models to use on the dataset are selected with the objectives in mind. In the seventh step, the data is mined. In the eighth step, the results are interpreted, and in the ninth and final step they are used as a basis for decisions.
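As an illustration only, the nine KDD steps described above can be written down as an ordered enumeration. This is a minimal sketch: the identifiers `KDDStep` and `next_step` are hypothetical labels chosen here and are not part of KDD itself or of any library.

```python
from enum import IntEnum
from typing import Optional


class KDDStep(IntEnum):
    """The nine KDD steps in the order given in the text (names are illustrative)."""
    LEARN_APPLICATION_DOMAIN = 1
    CREATE_DATASET = 2
    CLEAN_AND_PROCESS = 3
    REDUCE_AND_PROJECT = 4
    DEFINE_TARGET_OUTCOME = 5
    SELECT_METHODS_AND_MODELS = 6
    MINE_DATA = 7
    INTERPRET_RESULTS = 8
    USE_RESULTS_FOR_DECISIONS = 9


def next_step(step: KDDStep) -> Optional[KDDStep]:
    """Return the step that follows `step`, or None after the final step."""
    if step is KDDStep.USE_RESULTS_FOR_DECISIONS:
        return None
    return KDDStep(step + 1)
```

Modelling the steps as an ordered type makes the strictly sequential reading of KDD explicit, which is the property the later comparison with CRISP-DM relies on.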

The KDD approach gained traction in industrial and academic settings [11, 15], and it was also used as a basis for refinements aiming to address specific gaps. However, such refinements received limited attention [11, 15], with the exception of SEMMA (Sample, Explore, Modify, Model and Assess). The latter has been widely adopted due to its incorporation into the SAS data mining tool [16].

An industry-driven methodology called Cross-Industry Standard Process for Data Mining (CRISP-DM) was introduced in 2000 as an alternative to KDD [11]. CRISP-DM is considered the ‘de-facto’ standard data mining methodology and is commonly used as a reference framework against which other methodologies are benchmarked [10]. CRISP-DM builds upon KDD but consists of six phases that are executed in iterations [11]. This iterative execution is its most distinguishing feature when compared to KDD. Much like KDD, CRISP-DM aims to provide practitioners with guidelines for performing data mining on large data sets and is designed to be domain-agnostic [10]. As such, it is widely used by industry and research communities [11].

CRISP-DM has six phases with a total of 24 tasks and outputs. The first phase is to understand the business domain and the project objectives, and to convert business requirements into a data mining problem definition. In the second phase, the objective is to gain an initial understanding of the data. The third phase focuses on data preparation, while in the fourth phase various modelling techniques are selected and applied. In the fifth phase, the models are evaluated to ensure that they can achieve the objectives. In the final (sixth) phase, the models are deployed and the results are organized, presented, and distributed. Like KDD, CRISP-DM has been used as a basis for new data mining approaches, which largely addressed deployment and use of insights [17] or project management and organizational factors [18]. CRISP-DM has also been adapted to specific domains such as Industrial Engineering [19] and Software Engineering [20].
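The iterative character of CRISP-DM can be sketched as a loop over the six phases. The sketch below rests on a simplifying assumption that is not taken from the text: a failed evaluation sends the project back to the first phase, and deployment happens only once evaluation succeeds. The function `run_crisp_dm` and the callback `objectives_met` are hypothetical names introduced for illustration.

```python
from typing import Callable, List

PHASES = [
    "business understanding",
    "data understanding",
    "data preparation",
    "modelling",
    "evaluation",
    "deployment",
]


def run_crisp_dm(objectives_met: Callable[[int], bool], max_iterations: int = 5) -> List[str]:
    """Return the sequence of phases visited.

    `objectives_met(i)` is a hypothetical callback reporting whether the
    models built in iteration `i` pass the evaluation phase.
    """
    visited: List[str] = []
    for iteration in range(1, max_iterations + 1):
        # Phases one to five are repeated in every iteration.
        visited.extend(PHASES[:5])
        if objectives_met(iteration):
            # Deployment is reached only after a successful evaluation.
            visited.append(PHASES[5])
            break
    return visited
```

For instance, `run_crisp_dm(lambda i: i >= 2)` passes through the first five phases twice before reaching deployment, whereas a strictly linear process would visit each phase once.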

3 Research Design

The main research objective of this paper is to study how data mining methodologies are applied in the banking domain. We apply the systematic literature review (SLR) method because it ensures a trustworthy, rigorous, and auditable methodology, supports the synthesis of existing evidence and the identification of research gaps, and provides a framework for appropriately positioning new research activities [21]. Our SLR followed the guidelines proposed by [21].

To formulate the research questions, we started from the traditional set of “W” questions, specifically “Why?”, “What?” and “How?”. The “Why?” question led us to RQ1 (for what purposes are data mining methodologies used in the banking domain?). We then raised the “What?” question (“What data mining methodologies are used in the banking domain?”), but discarded it after a preliminary analysis: we found that all major data mining methodologies (e.g. CRISP-DM, SEMMA) are used in this domain, so there is little insight to be gained from analyzing this question further. Next, we raised the “How?” question, which led us to RQ2 (are data mining methodologies in the banking domain used “as-is” or are they adapted?). An initial exploration of this question led us to the preliminary conclusion that data mining methodologies are indeed sometimes adapted, which in turn led us to pose a third research question: with what goals are data mining methodologies adapted for the banking domain (RQ3)?

Following the guidelines for conducting an SLR [21], we derived and validated search terms and strings, identified types of literature, selected electronic databases, and defined the screening procedures.

The search strings were derived from the research questions and included the terms “data mining” and “data analytics”, as these are often used interchangeably. The terms “methodology”, “framework” and “banking” were added, resulting in the search string (“data mining methodology”) OR (“data mining framework”) OR (“data analytics methodology”) OR (“data analytics framework”) AND (“banking”). Validation of the search string according to [22] led to the addition of a second search string, (“CRISP-DM”) OR (“SEMMA”) OR (“ASUM”) AND (“banking”), in order to capture case study papers. The search strings were applied to the Scopus, Web of Science, and Google Scholar databases. Multidisciplinary indexed and non-indexed electronic databases were selected to ensure wide coverage of data sources and to include studies from both the academic (peer-reviewed) and practitioner (“grey” literature) communities. Specifically, our “grey” literature search covered industry reports, white papers, technical reports, and research works not indexed by Scopus or Web of Science.
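The two search strings can be assembled programmatically, which makes the intended Boolean grouping explicit. The sketch below assumes that the OR-ed alternatives are meant to be grouped before being combined with “banking” (the strings as printed do not show this grouping), and the helper `build_query` is a hypothetical name, not a feature of any of the databases mentioned.

```python
from typing import Iterable


def build_query(alternatives: Iterable[str], required: str) -> str:
    """Join quoted alternatives with OR, then AND the whole group with a required term."""
    group = " OR ".join(f'("{term}")' for term in alternatives)
    return f'({group}) AND ("{required}")'


# Cross product of the two subject terms and the two methodology terms.
methodology_terms = [
    f"{subject} {kind}"
    for subject in ("data mining", "data analytics")
    for kind in ("methodology", "framework")
]

primary_query = build_query(methodology_terms, "banking")
case_study_query = build_query(["CRISP-DM", "SEMMA", "ASUM"], "banking")
```

Without the outer parentheses, engines that give AND precedence over OR would restrict only the last alternative to banking, so stating the grouping explicitly is the safer reading.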

Based on SLR best practices [21, 23], we designed a multi-step screening procedure (relevancy and quality) with an associated set of screening criteria (exclusion and inclusion criteria) and a scoring system. The exclusion criteria served to eliminate studies in languages other than English, duplicate texts, publications shorter than 6 pages, and publications not accessible through university subscriptions. Papers that passed all exclusion criteria were retained and assessed against the relevance criteria. A paper was considered relevant if it (1) was about a data mining approach within the banking domain, and (2) introduced or described a data mining methodology or framework, or a modification of existing approaches. Finally, quality screening was conducted on the full texts. For this purpose we developed scoring metrics as proposed in [22]. A paper was given a score of 3 if all steps of the data mining process were clearly presented and explained and if, in addition, it presented a proposal on the usage, application, or deployment of the solution in the organization’s business process(es) and IT/IS system, and/or discussed a prototype or full solution implementation. If the descriptions of some process steps were missing, but without impacting the holistic view and understanding of the work performed, the paper was given a score of 2. Only papers scoring 2 or 3 were included in the final corpus of primary studies.
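The three screening stages can be summarized as a chain of predicates. This is a minimal sketch rather than the procedure actually used: all field and function names are hypothetical, and the quality score is treated as an input assigned by a reviewer during full-text reading, since the text defines only the conditions for scores of 2 and 3.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # All field names are hypothetical; they mirror the criteria in the text.
    in_english: bool
    is_duplicate: bool
    pages: int
    accessible: bool
    about_banking_data_mining: bool
    describes_methodology: bool
    quality_score: int  # assigned by a reviewer during full-text reading


def passes_exclusion(p: Paper) -> bool:
    """Exclusion stage: language, duplicates, length (at least 6 pages), access."""
    return p.in_english and not p.is_duplicate and p.pages >= 6 and p.accessible


def is_relevant(p: Paper) -> bool:
    """Relevance stage: both relevance criteria must hold."""
    return p.about_banking_data_mining and p.describes_methodology


def include_in_corpus(p: Paper) -> bool:
    """A paper enters the corpus only if it clears all three stages."""
    return passes_exclusion(p) and is_relevant(p) and p.quality_score >= 2
```

Keeping each stage as its own function mirrors the order of the screening, so the number of papers surviving each stage can be counted separately.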

The initial number of studies retrieved amounted to 693, of which 167 were academic and 526 were “grey” literature. After screening based on the exclusion criteria, 509 studies remained and were subject to relevance screening. Of these, 141 papers were identified as relevant and moved into the quality assessment phase, where 41 peer-reviewed papers and 61 studies from “grey” literature received a score of 2 or higher. The SLR thus yielded a primary corpus of 102 relevant studies. Figure 1 below shows the number of studies published per year, broken down into “peer-reviewed” and “grey” literature, starting from 1997.
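The counts reported above can be cross-checked with a few lines of arithmetic. The sketch uses only the figures stated in the text; the variable names and the derived retention ratios are added here for illustration.

```python
retrieved = {"peer-reviewed": 167, "grey": 526}
after_exclusion = 509
relevant = 141
included = {"peer-reviewed": 41, "grey": 61}

total_retrieved = sum(retrieved.values())  # 693
total_included = sum(included.values())    # 102

funnel = [
    ("retrieved", total_retrieved),
    ("after exclusion criteria", after_exclusion),
    ("relevant", relevant),
    ("included after quality scoring", total_included),
]

# Share of the previous stage that survives each screening step.
retention = {
    stage: count / previous
    for (_, previous), (stage, count) in zip(funnel, funnel[1:])
}
```

The totals agree with the reported 693 and 102 studies. The ratios show that relevance screening is the most selective step, retaining about 28% of the studies it receives, compared with roughly 73% for the exclusion criteria and 72% for quality scoring.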

Fig. 1. SLR derived texts corpus - data mining methodologies peer-reviewed research and “grey” literature for period 1997–2019 (no. of publications).

Temporal analysis of the corpus resulted in two observations. Firstly, research on the application of data mining methodologies within the banking domain began in earnest more than a decade ago, in 2007. Publications prior to 2007 were infrequent and irregular, with gaps of 3–4 years between them. Secondly, research on data mining methodologies within the banking domain has grown since 2007, an observation supported by the constructed 3-year and 10-year mean trendlines. In particular, the number of publications has roughly tripled over the last decade, hitting an all-time high in 2018 with 22 texts released.

4 Findings and Discussion

In this section, we present the results of the publication analysis, address the research questions, and discuss threats to validity.

RQ1 - For What Purposes are Data Mining Methodologies Used in the Banking Domain? In-depth analysis of the text corpus revealed that data mining methodologies are predominantly employed in the banking domain for two main purposes: customer-oriented and risk-oriented (see Fig. 2a below).

We identified 47 customer-oriented studies which address various aspects of customer behavior modelling. A typical example is profiling according to usage patterns of different digital channels: the authors of [24] (Footnote 2) profiled Internet bank users, while [25] focuses on patterns of electronic transactions based on demographic and behavioural features. In the field of Customer Relationship Management (CRM), the most common business problems analyzed relate to identifying and predicting customers who are likely to churn [26], customer loyalty and retention [27], customer segmentation [28], and customer value identification [29]. Further, smarter customer targeting in sales campaigns [30] and decision support for improved targeting and customer prioritization [31] are also popular business problems. A few studies consider efficiency aspects of a bank’s infrastructure, such as Automated Teller Machines (ATMs) and branch networks (e.g. [32]).

The second most commonly analyzed area is Risk Management, predominantly credit risk. We identified 34 studies that focus on modelling tasks for supporting a variety of risk management processes, including credit risk scoring and default prediction [33], prediction of financial distress [34], and credit decisions for private and corporate customers (especially small and medium enterprises, as in [35]). Further, identification and prevention of fraudulent behavior [36] and AML (anti-money laundering) risks [37] are addressed as well. Finally, other risk management topics, such as market risk, as well as asset management [38], trading strategies [39], and overall economic analysis and predictions [53], are also covered.

Fig. 2. Applications of data mining methodologies in banking: (a) breakdown by purposes; (b) breakdown by adaptation paradigms

RQ2 - How are Data Mining Methodologies Applied (“as-is” vs Adapted)? The second research question addresses the extent to which data mining methodologies are used “as-is” versus adapted. Our review identified two distinct paradigms for how data mining methodologies are applied. The first is “as-is”, where the data mining methodologies are applied as stipulated. The second is with “adaptations”, i.e. methodologies are modified by introducing various changes to the standard process model when applied. Furthermore, our review led us to identify three distinct adaptation scenarios, namely “Modification”, “Extension”, and “Integration”:

Scenario “Modification” - introduces specialized sub-tasks and deliverables in order to address specific use cases or business problems. Modifications typically concentrate on granular adjustments to the methodology at the level of sub-phases, tasks or deliverables within the existing CRISP-DM or KDD stages.

Scenario “Extension” - primarily proposes significant extensions to CRISP-DM, resulting in fully scaled and integrated data mining solutions, in data mining frameworks that serve as a component or tool of automated information systems, or in methodologies adapted to specialized environments. Adaptations of this kind elicit and explicitly present various artefacts in the form of system and model architectures, process views, workflows, and implementation aspects. The key benefits achieved are the deployment, implementation and leveraging of data mining solutions as integral components of information systems and business processes. In addition, the data mining process methodology is substantially changed and extended in all key phases to accommodate new Big Data technologies, tools and environments [47, 53].

Scenario “Integration” - primarily concentrates on combining CRISP-DM with methodologies originating from other domains (e.g. Business Information Management, Business Process Management, BI [58]), on adjusting it to specific organizational aspects [62], or on introducing discrimination-awareness with respect to customers [56]. Adaptations in the form of integration typically introduce various types of ontologies and ontology-based tools, business processes, business information, and BI-driven framework elements. The key benefits are improvements in the deployment phase, better usage of data and discovered knowledge, and higher effectiveness and efficiency of business processes. The key gap addressed is the lack of CRISP-DM integration with other organizational and domain frameworks.

We also noted that the number of publications discussing “as-is” implementations has grown strongly, while adaptations are also gaining ground (as shown in Fig. 2b). Further, research is evenly distributed among the “Modification”, “Extension” and “Integration” paradigms. We hypothesize that existing reference methodologies do not accommodate the increasing complexity of data mining projects and IS/IT infrastructure, or the specific requirements of the banking domain, and therefore need to be adapted.

RQ3 - What are the Goals of Adaptations? We address the third research question by analyzing each of the adaptation scenarios in depth.

Modification. This adaptation scenario was identified in 12 publications, which overwhelmingly consist of specific case studies. The major point differentiating them from “as-is” case studies is the presence of specific adjustments to standard data mining process methodologies. Yet the proposed modifications and their purposes do not go beyond the traditional CRISP-DM phases. They are granular and specific, and are made at the level of tasks, sub-tasks, and deliverables. This is in clear contrast to “extensions”, where one of the key proposals is the addition of new phases, such as a new IS/IT systems implementation and integration phase. Also, with modifications, authors describe potential business applications and deployment scenarios at a conceptual level, but typically do not report or present real implementations in IS/IT systems and business processes. Further, in the context of the banking domain, this research subcategory can be classified with respect to the business problems addressed (presented in Fig. 3; see Footnote 3).

Fig. 3. Data mining methodologies in banking - ‘Modification’ scenario example texts mapping to business problems

Extension. The “Extension” scenario was identified in 10 publications, and we noted that it was applied for two major purposes:

  1. To implement a fully scaled, integrated data mining solution and a regular, repeatable knowledge discovery process, addressing model and algorithm deployment and implementation design (including architecture, workflows and the corresponding IS integration). A complementary goal is to address changes to business processes in order to incorporate data mining into organizational activities.

  2. To implement complex, specifically designed systems and integrated business applications with a data mining model or solution as a component or tool. Typically, this adaptation is also oriented towards Big Data specifics, and is complemented by proposed artefacts such as Big Data architectures, system models, workflows, and data flows.

The first purpose focuses on the implementation of specific data mining models and associated frameworks and processes. For example, apart from a classification model and evaluation framework, [47] proposes a knowledge-rich financial risk management process, while [48] introduces a framework for machine-learning audits. [49] presented a data mining-based solution for AML implemented as a tool with a corresponding IS architecture and investigative process. [50] focused on a combined data mining concept introducing multiple data sources, methods and features, all incorporated in a real-time prototyped solution. [51] focused on actionable data mining by presenting a post-processing data mining framework which enables automated generation of actions. In a similar vein, [52] presented a large-scale data mining framework extended to incorporate social media data, including adaptations to parallel processing. The major benefit achieved by these adaptations, apart from resolving a business problem or research gap, is the usefulness of the results produced for the decision-making process.

In contrast, the second purpose concentrates on the design of complex, multi-component information systems and architectures. For instance, [53] constructed a framework that considers socio-economic data, its processing methods, and a new data life-cycle model, and presented an architecture for Big Data systems to integrate, process and analyze data for forecasting purposes. [54] proposed refinements of a reference data mining methodology to address Big Data analytics, application prototyping and its evaluation, project management and results communication. Finally, [55] proposed a cross-border market monitoring and surveillance system with 3 subsystem components and associated system and data flows. In this research, the authors discuss and present useful architectures, algorithms and tool sets in addition to methods and techniques, which alone are not sufficient to create deployable systems and tools. The key benefit provided is a broad context that enables practical implementations of complex, integrated data mining solutions. The specific list of studies mapped to each of the given purposes, along with key artefacts, is presented in Fig. 4 below.

Fig. 4. Data mining methodologies in banking - ‘Extension’ and ‘Integration’ scenarios adaptation goals, their artefacts and example texts mapping

Integration. Integration of data mining methodologies was found in 14 publications. Our analysis shows that these adaptations are at the highest abstraction level and are typically executed with the goals to (1) introduce discrimination-awareness in data mining, (2) integrate or combine with other organizational frameworks, and (3) integrate or combine with other well-known frameworks, process methodologies and concepts. An example list of studies with artefacts is presented in Fig. 4 (Footnote 4) and discussed further below.

Discrimination-aware data mining (DADM), as proposed by [56], includes tool support for a “correct” decision process. The major benefits are increased correctness and usefulness of results in the decision-making process, monitoring, avoidance of discrimination, and transparency.

The author of [57] combined data mining methodology with organizational context to instill and improve data-driven decision-making. Further, [58] integrated data mining with business process frameworks and models (also proposed by [59]). [60] integrated data mining with BIM (Business Information Modelling), while [61] merged data mining with BI. All of these works aim to improve the usage of data, the effectiveness of business processes, and the deployment of data mining solutions. They are complemented by a number of publications [62, 63] that specifically tackle the actionability of data mining results, aiming to reduce the likelihood of a data mining project producing high-quality knowledge with limited or no business benefit. The authors propose a shift to a domain-driven data mining paradigm by integrating new key components such as domain intelligence, human-machine cooperation, in-depth mining, actionability enhancement, and an iterative refinement process. Emphasis on data mining business requirements, model sharing and reuse from a business user perspective is also addressed by introducing an ontology-based data mining model management approach [64]. The same problems are addressed from an organizational point of view by [65], which focused on a Big Data Analytics governance framework. Finally, a number of innovative research papers focused on integrating data mining with technical concepts and frameworks from other domains, for example relational (symbolic) data mining methods [66] and game theory [67].

To summarize, from the “extension” and “integration” research we have identified three important factors specific to the banking domain that require adjustments of existing data mining process frameworks and models. Firstly, potential discrimination in the context of credit decision-making requires financial services companies to adapt data mining to achieve transparency. Secondly, large volumes of accumulated data and the associated complex IS/IT architectures require the data mining process to be adapted so that it addresses complex deployment patterns for data mining models and supports their implementation as components of complex systems and business applications. Thirdly, the actionability of data mining results and the adaptation of analytics outcomes to the needs of business end users are of utmost importance for realizing business value. We hypothesize that in the banking domain, as the leading adopter of data mining solutions with significant investments, failures to realize the full business value of data mining projects are more explicit and observable and need to be addressed.

This study has inherent threats to validity and limitations associated with the selected research method (SLR). The validity threats include incompleteness of search results (internal validity; Footnote 5) and general publication bias (external validity; Footnote 6). We have mitigated the threat to internal validity by strictly adhering to the inclusion criteria and performing extensive validation procedures. With respect to external validity, we conducted trial searches to ensure the validity of the search strings and proper identification of potential papers. Our initial harvest reached almost 700 texts originating from indexed peer-reviewed research and “grey” literature, thus mitigating the external validity risk. Further, the key limitation of the SLR method for this study is that internal practices of the banking industry are not frequently disclosed in academic literature. We mitigated this limitation by including “grey” literature, where reporting on existing industry practices by professionals is common.

5 Conclusion

In this study we have examined the usage of data mining methodologies in the banking domain. By means of a Systematic Literature Review we have identified 102 relevant studies from peer-reviewed and “grey” literature, which have been evaluated in depth to address three research questions: for what purposes are data mining methodologies used in the banking domain? (RQ1), how are they applied (“as-is” vs adapted)? (RQ2), and what are the goals of adaptations? (RQ3).

Tackling RQ1 (For what purposes?), we have found that data mining methodologies have been applied regularly since 2007 and that the number of related publications has roughly tripled over the last decade. Further, data mining in the financial services domain is primarily used for two main purposes: to address Customer Relationship Management and Risk Management related business problems.

Answering RQ2 (How?), we have identified that over the last decade data mining methodologies have been primarily applied “as-is”, without modifications. Yet we have also discovered an emerging and persistent trend of using data mining methodologies in banking with adaptations. Further, we have distinguished three adaptation scenarios, ranging from granular modifications at the level of tasks, sub-tasks and deliverables to merging standard data mining methodologies with other frameworks.

Addressing RQ3 (What are the goals of adaptations?), we have examined the adaptation objectives and the banking domain specific factors behind such adaptations, and as a result have identified three such aspects. Firstly, discrimination-awareness and transparent decision-making (human-centric aspect) require data mining process adaptation. Secondly, the actionability of data mining results (business-centric aspect) plays a central role in the banking domain. Thirdly, standard data mining methodologies lack the deployment and implementation aspects (technology-centric aspect) required to scale and transform data mining models into software products and components integrated into Big Data architectures. Therefore, adaptations are used to integrate data mining models and solutions into the complex IT/IS systems and business processes of the banking industry. This study highlights these needs and establishes the ground for future work on refinements of existing data mining methodologies for the banking domain that would address the three above-mentioned concerns.