1 Introduction

Mobile payment transactions are carried out using mobile phone technologies that allow users to deposit, withdraw, spend, transfer and send money. There are nearly three hundred mobile payment services worldwide, which are particularly popular in Sub-Saharan Africa and Asia. In 2020, mobile payment transactions totaled $767 billion, conducted by approximately 1.2 billion registered users according to Statista. In addition, mobile payments reportedly have enormous potential during the COVID-19 pandemic, as they can greatly increase the promptness and efficiency of money transfers while minimising the necessity of face-to-face contact with bank and government staff (Blumenstock, 2020).

Recent mobile payment case studies (Iman, 2018; Jocevski et al., 2020; Verkijika, 2020) suggest that mobile payment systems have been challenged by several types of factors that have emerged in the context of advances in financial technology. Commercial and technical factors have been identified as particularly important to their future growth. As regards the first group of factors, the need to increase cost efficiency is particularly emphasised because most mobile payment transactions in developing countries are low value but high volume (Franque et al., 2020). Technical factors include, in particular, security concerns, as the legal frameworks and enforcement mechanisms are often inadequate in developing countries (Akanfe et al., 2020; David-West et al., 2022; Pal et al., 2020). To deploy a mobile payment system, it is therefore necessary to minimise fraud in order to increase customer trust and security, as reported in existing mobile payment acceptance models (Chin et al., 2022; Jia et al., 2022; Kar, 2021; Pal et al., 2021).

The growing use of mobile payments has increased the opportunities for criminals to commit mobile phone fraud by circumventing the security measures of mobile payment services. There is consequently a lot of pressure to investigate potential security threats that may be exploited, with the ultimate aim of preventing fraud on a mobile payment service and developing countermeasures against attacks (Chen et al., 2021; Lopez-Rojas et al., 2016; Rieke et al., 2013). Early detection of fraudulent transactions is a key task in this effort. Recent developments in mobile payment services have therefore heightened the need for automated detection systems that enable immediate detection and prevention of fraudulent transactions.

The main challenges currently facing researchers involved in detecting fraud in mobile payment transactions include: (1) extreme class imbalance (only a small proportion of customers have fraudulent intentions); (2) changing patterns of fraud over time (fraudsters are always looking for new ways to bypass systems and commit crimes); and (3) inadequate selection of performance metrics. The consequence of the first challenge is a poor user experience for legitimate customers, as the detection of fraudsters usually also implies rejecting some legitimate mobile payment transactions. The second challenge usually leads to a decrease in the performance and efficiency of the detection model. Therefore, machine learning models must be constantly updated, otherwise they will not meet their objectives. Regarding the last challenge, in some cases the providers of mobile payment systems should prefer a higher false positive rate in exchange for a lower false negative rate and vice versa. But how to choose the right ratio between these two errors remains a challenging area in the field of fraud detection in mobile payment transactions.

A relatively high detection accuracy was reported in earlier research by using both traditional supervised learning methods (Choi & Lee, 2017, 2018) and deep learning-based methods (Mubalaike & Adali, 2018; Xenopoulos, 2017). However, a major problem with this kind of application is the extreme class imbalance of transactions, with a considerable dominance of legitimate transactions in the data. This in turn leads to a poor classification performance on the minority class of fraudulent transactions. To address this issue, two approaches have been utilized. The first approach relies on under-sampling methods used to generate a balanced dataset (Pambudi et al., 2019). The main limitation of this approach is the loss of potentially important information stored in discarded legitimate transactions, which can reduce detection accuracy. Alternatively, an attempt has been made to isolate fraudulent transactions in an unsupervised fashion (Buschjäger et al., 2021), inspired by outlier detection methods. Nevertheless, a comprehensive evaluation of machine learning methods is not yet available in the literature. Moreover, little is known about how the two approaches can be integrated to improve the detection performance. To overcome the above problems, here we propose to enhance the performance of eXtreme Gradient boosting (XGBoost), a state-of-the-art machine learning method, by including a data sampling component addressing the issue of extreme class imbalance of mobile payment transactions.

In many financial applications it is necessary to filter out unusual observations to ensure the reliability of the system and prevent attempts to maliciously use it. This is particularly useful for detecting financial fraud attempts, as their behaviour patterns differ significantly from normal financial transactions (Bernard et al., 2021). Outlier detection methods are capable of processing all available data in real time to uncover patterns that evade traditional supervised learning methods. By doing so, organised crime groups can be identified with higher accuracy and fewer false positives. Outlier detection methods have indeed proved effective for credit card fraud detection (Carcillo et al., 2021), online banking fraud detection (Carminati et al., 2015), and health insurance fraud detection (Yamanishi et al., 2004). Overall, however, there has been limited use of these methods to detect financial fraud, although some review studies suggest that they deserve more attention because the detection performance of supervised algorithms is negatively affected by the inherently heavily imbalanced class distribution of financial fraud data (Ngai et al., 2011). The scarce use of outlier detection methods can be attributed to the difficulty of detecting fraudulent behaviour (e.g., abnormal frequency of transactions or spending behaviour) when overlapping with legitimate behaviour in datasets contaminated with outliers and noise. Moreover, several other challenges have been identified that make it difficult to detect outliers in the financial domain. First, efficient general purpose outlier detection methods are lacking because an outlier detection method in one fraud domain may not be appropriate for other scenarios, as legitimate and fraudulent behaviour is different from domain to domain (Ahmed et al., 2016). Second, unsupervised learning is preferred as sufficient labelled data for building models are rarely available. Third, legitimate behaviour may change over time, and fraudsters try to make their activities look legitimate. To take advantage of both supervised machine learning and outlier detection methods, for the first time, we propose a semi-supervised ensemble fraud detection model combining unsupervised outlier detection and supervised XGBoost methods that exploit all transactions contained in a large, highly imbalanced mobile payment transaction dataset.

Finally, financial implications of fraud detection methods in mobile payment transactions have also been neglected in earlier research. Therefore, our third contribution is to propose a novel performance measure of cost savings that takes into account the financial implications of false positive and false negative rates of fraud detection systems. Using the PaySim dataset, our findings provide evidence for the effectiveness of both XGBoost leveraged by an under-sampling class-balancing procedure and extreme gradient boosting outlier detection (XGBOD), thus providing important tools to support operation and management of mobile payment services.

In summary, the contributions of this study are threefold:

  1. Developing a novel fraud detection framework for mobile payment systems by integrating the XGBoost method with class-balancing adjustments and unsupervised outlier detection methods, making it suitable for detecting fraud in a typical class-imbalanced mobile payment scenario.

  2. Proposing a novel cost savings measure to evaluate the performance of mobile payment fraud detection systems. Unlike the traditional performance measures, the proposed measure considers both the cost savings from the correct detection of fraudulent transactions and the decrease in the margin for the transactions incorrectly identified as fraudulent.

  3. Demonstrating, on the benchmark PaySim dataset of more than 6 million mobile payment transactions, that the proposed fraud detection framework not only outperforms state-of-the-art fraud detection methods in terms of detection accuracy but also generates substantial financial savings for the providers of mobile payment systems.

The remainder of this paper is organized as follows. Section 2 reviews the related work on fraud detection in mobile payment transactions with respect to data sources, methods used and performance achieved in earlier studies. Section 3 outlines the proposed fraud detection framework. Section 4 provides the results of the evaluation on the PaySim dataset, robustness check, and financial implications. Section 5 concludes by providing possible directions for future research.

2 Fraud Detection in Mobile Payment Systems – Literature Review

A considerable amount of literature has been published on financial fraud detection, see West and Bhattacharya (2016) for a review and Hajek and Henriques (2017) for a comprehensive evaluation of financial fraud detection methods. Risk factors of financial fraud were investigated, indicating that pressure / incentive to commit fraud is the most important risk factor (Huang et al., 2017). Related studies can be broadly categorized according to the financial fraud type as follows (Onwubiko, 2020): (1) account takeover fraud, (2) payment fraud, and (3) application fraud. Onwubiko (2020) also identified four main fraud channels, namely physical, web, telephony, and mobile. Frauds in mobile payment transactions have increasingly been recognized as a major concern in finance due to recent developments in mobile payment services (Chen & Sivakumar, 2021). Therefore, security requirements must be met to address security issues related to mobile payment transactions, such as mobile malware and SMS-based attacks (Kang, 2018). Heterogeneous software and hardware mobile platforms make the security problems more challenging (Li & Clark, 2013).

Regarding the data used in previous studies and summarized in Table 1, the lack of real-world datasets has been identified as a major problem in the application domain. Therefore, most earlier research tended to generate simulated synthetic data based on features captured from real-world fraud and legitimate transactions. To do so, Rieke et al. (2013) extracted payment laundering patterns from real-world events. However, the number of instances was insufficient for efficient fraud detection, as indicated by relatively low false negative (legitimate) rates in early studies (Coppolino et al., 2015; Rieke et al., 2013). Considerable progress has been made by introducing the PaySim financial simulator (Lopez-Rojas et al., 2016, 2018) that resembles normal mobile transactions and injects fraudulent behaviour to produce a larger number of financial frauds. Agent-based simulations and statistical analysis confirmed that the simulated data are as prudent as the original aggregated anonymized real data, thus representing an optimal control environment for fraud detection in mobile payment transactions. By leveraging the PaySim data, Lopez-Rojas and Barneaud (2019) demonstrated their advantages over the relatively small real-world dataset. In addition, the simulated data retained the transactions and causal dynamics of the original data. It should, however, be noted that by preserving the statistical properties of the real-world data, the high class imbalance in favour of legitimate transactions is also maintained in the simulated dataset.

Traditional machine learning methods with supervised or unsupervised learning are not effective in handling extreme class imbalance in the data. Although a relatively high overall accuracy was reported in several studies, these methods performed well only in terms of majority (legitimate) class accuracy (Choi & Lee, 2017, 2018; Du et al., 2018; Zhou et al., 2018). This holds also for more recent deep learning models, such as deep belief networks (Xenopoulos, 2017) and restricted Boltzmann machines (Mubalaike & Adali, 2018). To overcome this major limitation, class imbalance was first approached by using under-sampling methods and then machine learning methods were trained on the balanced dataset (Pambudi et al., 2019). Similarly, Xenopoulos (2017) used under-sampling to produce balanced bootstraps for ensemble learning, and Misra et al. (2020) and Schlör et al. (2021) applied it to generate balanced training data for deep learning-based detection models. The main drawback of the under-sampling approach is that potentially useful instances are often excluded from the training data, which can significantly degrade the detection accuracy. Alternatively, isolation-based approaches were used to approximate the data distribution and build a generative model using mixture components. This outlier detection method was successfully applied to fraud detection by Buschjäger et al. (2021).

Table 1 Summary of data and methods used in previous studies

However, a comprehensive evaluation of state-of-the-art machine learning-based approaches exploiting under-sampling methods for handling the class imbalance problem is lacking in the literature. Hybrid semi-supervised methods taking advantage of supervised learning and unsupervised outlier detection methods have also been overlooked. Finally, only standard performance measures have been used to evaluate fraud detection performance in mobile payment systems, thus neglecting the financial implications of fraud detection.

3 Fraud Detection Framework

The proposed framework for fraud detection in mobile payment systems is presented in Fig. 1. The proposed fraud detection models aim to take advantage of XGBoost while overcoming the problem of extremely imbalanced classes in mobile payment transaction data. We will demonstrate that this approach is not only more accurate than supervised machine learning and outlier detection methods used in existing studies but also more profitable in terms of the proposed cost savings measure.

Fig. 1 Fraud detection framework

3.1 Proposed Fraud Detection Models

This section outlines two fraud detection models proposed in this study. First, the eXtreme Gradient boosting (XGBoost) method, augmented with random under-sampling, is introduced to leverage both the supervised learning capability and robustness of XGBoost, a state-of-the-art machine learning method, and the data sampling component to overcome the class imbalance problem inherent in mobile payment transaction data. The second model exploits the extreme gradient boosting outlier detection (XGBOD) method, a semi-supervised algorithm that improves the performance of the XGBoost method on highly imbalanced mobile payment transaction data by introducing outlier scores obtained from multiple unsupervised outlier detection methods.

3.1.1 Extreme Gradient Boosting with Random Under-Sampling

The proposed RUS+XGBoost integrates the random under-sampling (RUS) method with XGBoost, as depicted in Fig. 2. The RUS component is first used to generate balanced training samples, and XGBoost then generates additive models to produce the final prediction on whether the mobile payment transaction is fraudulent or not.

Fig. 2 Flowchart of RUS-XGBoost for fraud detection

Under-Sampling for Handling Class Imbalance Problem

The extremely high imbalance between legitimate and fraud classes makes detecting financial fraud a challenge (Du et al., 2018). Considering the importance of class imbalance in financial fraud detection, numerous methods have been used to improve the classification performance of supervised learning methods. In the related literature (Pambudi et al., 2019), data-level solutions have been particularly successful because they allow the imbalance problem to be addressed before training machine learning methods. In addition, data-level methods integrated into classifier ensembles appear to be particularly effective (Galar et al., 2012). Among the data-level methods, over-sampling methods create artificial instances in the minority class to balance the training data. However, this can lead to problems of overfitting and overgeneralization as instances of the majority class are ignored. Moreover, given the gradual increase in data on financial fraud, under-sampling methods should be a better choice than their over-sampling counterparts.

The RUS method used in this study enables controlling for the number of samples selected from the original data. RUS is a non-heuristic method that randomly selects a data subset from the majority class, which is computationally effective and enables sampling heterogeneous data (Haixiang et al., 2017).
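As a rough illustration of the RUS step, the following sketch keeps every fraud instance and draws a random subset of legitimate instances. The function name `random_under_sample`, the `ratio` parameter (majority rows kept per minority row) and the toy data are assumptions made for illustration only; they are not taken from the study.

```python
import numpy as np

def random_under_sample(X, y, ratio=1.0, seed=0):
    """Randomly drop majority-class rows (label 0), keep all minority rows (label 1).

    `ratio` is a hypothetical parameter: majority rows kept per minority row.
    """
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    # Never ask for more majority rows than exist.
    n_keep = min(len(majority_idx), int(round(ratio * len(minority_idx))))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    idx = np.concatenate([minority_idx, kept_majority])
    rng.shuffle(idx)  # avoid an ordered block of fraud rows
    return X[idx], y[idx]

# Toy data: 1000 transactions, 10 of them fraudulent (1% fraud rate).
X = np.arange(2000, dtype=float).reshape(1000, 2)
y = np.zeros(1000, dtype=int)
y[:10] = 1

X_bal, y_bal = random_under_sample(X, y, ratio=1.0)
print(X_bal.shape, int(y_bal.sum()))
```

With `ratio=1.0` the result is fully balanced; larger values keep more legitimate transactions at the price of a less balanced training set.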

Extreme Gradient Boosting

XGBoost is a computationally efficient and scalable implementation of gradient boosted decision trees that build additive models in a stepwise fashion. The overall error is minimized incrementally by introducing additive models based on the errors obtained in the previous steps. This results in an ensemble of base learners with better prediction ability than the individual classifiers. This is achieved by gradually improving the accuracy, low tree depth and equal contribution of the base learners to the final combined model. To further improve robustness to noise and overfitting, gradient boosting was augmented with a random sampling scheme (stochastic gradient boosting). XGBoost is an enhanced implementation with a more regularized model to control overfitting. The objective function of XGBoost to be minimized is given as follows (Chen & Guestrin, 2016):

$$\begin{aligned} \text {obj}^{(t)} = \sum \limits _{i=1}^{n} \left( y_{i} - \left( \hat{y}_{i}^{(t-1)} + f_{t}(x_{i}) \right) \right) ^{2} + \sum \limits _{t=1}^{T}\Omega (f_{t}), \end{aligned}$$
(1)

where \(y_{i}\) is the target value of the i-th instance, \(\hat{y}_{i}^{(t)}\) is its predicted value at the t-th iteration, \(f_{t}(x_{i})\) is the additive decision tree model greedily added to improve the model performance, and \(\Omega (f_{t})\) is a regularization term penalizing the model complexity. The goal of this regularization procedure is to compress the weights for many features to zero to perform feature selection, which is advantageous when dealing with high-dimensional data. Therefore, XGBoost is currently one of the best performing classifiers across domains and has been successfully applied to insurance fraud detection (Dhieb et al., 2019).
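The stagewise logic behind Eq. (1) can be illustrated with a deliberately simplified sketch. It is not XGBoost itself: it uses one-feature decision stumps and, as an assumption, stands in for \(\Omega\) with an L2 penalty `lam` on the two leaf values. For squared error this gives the shrunken leaf value "sum of residuals / (count + lam)". All names below are hypothetical.

```python
import numpy as np

def fit_stump(x, residual, lam=1.0):
    """Pick the single split that minimises squared error plus an L2 leaf penalty.

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        # Closed-form minimiser of sum((r - w)^2) + lam * w^2 within each leaf.
        w_left = residual[left].sum() / (left.sum() + lam)
        w_right = residual[~left].sum() / ((~left).sum() + lam)
        pred = np.where(left, w_left, w_right)
        objective = np.sum((residual - pred) ** 2) + lam * (w_left**2 + w_right**2)
        if best is None or objective < best[0]:
            best = (objective, threshold, w_left, w_right)
    _, threshold, w_left, w_right = best
    return lambda z: np.where(z <= threshold, w_left, w_right)

def boost(x, y, n_rounds=20, lam=1.0):
    """Add one stump f_t per round, each fitted to the current residuals."""
    y_hat = np.zeros_like(y, dtype=float)
    trees, losses = [], []
    for _ in range(n_rounds):
        f_t = fit_stump(x, y - y_hat, lam)   # f_t targets y - y_hat^(t-1)
        y_hat = y_hat + f_t(x)               # y_hat^(t) = y_hat^(t-1) + f_t(x)
        trees.append(f_t)
        losses.append(float(np.sum((y - y_hat) ** 2)))
    return trees, losses

# Toy one-dimensional problem with a step-shaped target.
x = np.linspace(0.0, 1.0, 50)
y = (x > 0.6).astype(float)
trees, losses = boost(x, y)
print(losses[0], losses[-1])
```

In this sketch the squared-error term cannot increase from one round to the next, because the penalised optimum of each stump is never worse than adding a zero-valued tree.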

3.1.2 Extreme Gradient Boosting Outlier Detection Model

The XGBOD method (Zhao & Hryniewicki, 2018) is a semi-supervised ensemble algorithm integrating multiple unsupervised outlier detection algorithms and an XGBoost classifier, as illustrated in Fig. 3. First, unsupervised methods are used to obtain data representations in terms of transformed outlier scores (TOS). Second, a feature selection method is used to reduce the TOS feature space so that only relevant TOS are retained. Then, the outlier score matrix is combined with the original features to produce a combined feature space. An improved feature space is thus generated, and the XGBoost classifier is used in this feature space to produce the final outlier scores for each mobile payment transaction. The advantage of this approach is its good predictive ability, which is due to its robustness to overfitting and data imbalance.

Fig. 3 Flowchart of XGBOD for fraud detection

In the proposed XGBOD-based fraud detection model, a variety of unsupervised outlier detection methods (presented in Section 3.2.2) are used to produce the TOS features. To maintain the balance between their diversity and accuracy, the balance selection algorithm (Zhao & Hryniewicki, 2018) is used to perform TOS selection. This algorithm applies a discounted accuracy function \(\Psi (TOS_{i})\) to pick the subset of p most relevant TOS. The function is defined as follows:

$$\begin{aligned} \Psi (TOS_{i}) = \frac{AUC_{i}}{\sum _{i,j=1}^{k}\mid \rho (TOS_{i},TOS_{j})\mid }, \end{aligned}$$
(2)

where \(AUC_{i}\) is the AUC performance of the i-th outlier detection method, and \(\rho (TOS_{i},TOS_{j})\) denotes the Pearson correlation coefficient between a pair of TOS.
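A minimal sketch of the discounted accuracy in Eq. (2) follows. It rests on one reading of the summation, namely that for a fixed \(TOS_{i}\) the denominator sums the absolute Pearson correlations with all k TOS columns (including itself); this reading, the function names and the toy scores are assumptions, and the original balance selection algorithm may restrict the sum differently.

```python
import numpy as np

def discounted_accuracy(tos, auc):
    """Psi_i = AUC_i / sum_j |rho(TOS_i, TOS_j)| for an (n_samples, k) score matrix."""
    corr = np.corrcoef(tos, rowvar=False)          # k x k Pearson correlations
    return np.asarray(auc, dtype=float) / np.abs(corr).sum(axis=1)

def select_tos(tos, auc, p):
    """Return the column indices of the p TOS with the highest discounted accuracy."""
    psi = discounted_accuracy(tos, auc)
    return np.argsort(psi)[::-1][:p]

# Toy scores: columns 0 and 1 are perfectly correlated, column 2 is nearly independent.
a = np.arange(10, dtype=float)
b = np.array([1.0, -1.0] * 5)
tos = np.column_stack([a, 2.0 * a + 1.0, b])
auc = [0.8, 0.8, 0.8]

psi = discounted_accuracy(tos, auc)
selected = select_tos(tos, auc, p=1)
print(psi, selected)
```

With equal AUC values, the redundant pair is discounted more heavily, so the weakly correlated column is preferred. This is the diversity effect the selection step is meant to achieve.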

3.2 Machine Learning Methods for Comparative Evaluation

In this section, we present the machine learning methods used for comparative evaluation in detecting fraud in mobile payment transactions. The methods can be broadly divided into (1) machine learning methods with supervised learning that address the class imbalance problem typical for financial fraud detection data, and (2) outlier detection methods.

3.2.1 Supervised Learning Methods for Imbalanced Data

k-nearest Neighbour Classifier

The k-nearest neighbour (k-NN) method is an instance-based non-parametric classifier that uses training instances for comparison purposes. An instance is classified considering its k most-similar instances (typically in terms of Euclidean distance) using a majority vote. This simple approach proved to be accurate in a comparative analysis of machine learning methods for highly imbalanced credit card fraud detection (Awoyemi et al., 2017). In financial fraud detection, it is assumed that fraud instances are far from the samples of the legitimate class. Therefore, k-NN can be effectively used even in unsupervised outlier detection mode (Ramaswamy et al., 2000).

Support Vector Machine

The support vector machine (SVM) is a particularly effective classifier for financial fraud detection due to its capacity to deal with high-dimensional data (Du et al., 2018; Pambudi et al., 2019; Seera et al., 2021). The SVM algorithm aims to find the optimal separating hyperplane that maximizes the margin between instances from different classes. The decision boundary is represented by a subset of the data known as support vectors. Finding the parameters of the hyperplane is an optimization problem that takes into consideration both minimizing the training error and maximizing the margin. To handle nonlinear relationships in the data, kernel functions (e.g., linear, polynomial or radial basis functions) are employed to map the classification problem from the original feature space to a new feature space of higher dimension where linear separation is possible.

Random Forest

Random forest (RF) integrates multiple decision tree predictors trained independently on different data samples. This allows a number of trees to be generated, ensuring that the generalization error converges to a certain limit. Another major advantage of RF is its non-differentiable decision boundary. In addition, random feature selection is used to split the nodes in each tree, making the RF classifier more robust to noise. The application of RF in financial fraud detection is particularly effective when the class distribution is imbalanced because its hierarchical structure enables learning patterns from both classes (Nami & Shajari, 2018). These advantages explain the good performance of RF on financial fraud detection tasks (Zhou et al., 2018).

3.2.2 Outlier Detection Methods

Outlier detection is typically conducted using unsupervised machine learning methods. The methods presented in this section are trained to represent the legitimate data using clusters of similar data observations. Then, an unseen instance is assigned a score that is compared to a threshold representing the decision boundary separating legitimate instances from outliers.

The evaluation conducted in this study contains four types of outlier detection methods, namely (1) proximity-based methods, (2) linear model-based methods, (3) ensembling methods, and (4) neural network-based methods.

Proximity-Based Methods

To detect outliers, proximity-based methods investigate the neighbourhood of each data instance. For example, the local outlier factor (LOF) method (Breunig et al., 2000) uses the Euclidean distance between the data instance and its closest neighbour to obtain an outlier score. In the k-NN method (KNN) (Ramaswamy et al., 2000), a partition-based algorithm is first used to identify candidate partitions containing outliers, and then the distances of instances from these partitions are calculated to detect outliers. An important advantage of proximity-based methods is their independence of the data distribution. In other words, no a priori knowledge about the data distribution is required. However, these methods usually do not scale well for high-dimensional data. To reduce the sensitivity of LOF to the curse of dimensionality, the cluster-based local outlier factor (CBLOF) method (He et al., 2003) replaces closest neighbours with closest clusters, and the angle-based outlier detection (ABOD) method (Kriegel et al., 2008) replaces distances with the angular radius and variance of each data vector. The histogram-based outlier detection (HBOS) method assumes independence of features to score instances in linear time and is thus computationally more efficient compared to nearest-neighbour-based methods. However, HBOS fails in detecting local outliers because the density estimation produced by histograms does not allow modelling local outliers.
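The idea of scoring an instance by how far away its neighbours are can be sketched as follows. This is a brute-force simplification that uses the distance to the k-th nearest neighbour as the outlier score and omits the partition-based pruning mentioned above; it needs memory quadratic in the number of instances, so it is only meant for small illustrative data. The function name and toy points are assumptions.

```python
import numpy as np

def kth_nn_distance(X, k=3):
    """Outlier score = Euclidean distance to the k-th nearest neighbour (brute force)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))   # full pairwise distance matrix
    dist_sorted = np.sort(dist, axis=1)
    return dist_sorted[:, k]                  # column 0 is the zero self-distance

# Toy data: a 3 x 3 grid of "legitimate" points plus one isolated point.
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
X = np.vstack([grid, [[50.0, 50.0]]])

scores = kth_nn_distance(X, k=3)
print(int(np.argmax(scores)))
```

The isolated point receives by far the largest score, while points inside the grid all have a k-th neighbour within a distance of about one or two units.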

Linear Model-Based Methods

Linear model-based methods rely on the construction of a decision boundary separating instances in the legitimate class from the rest of the input data space. The one-class SVM (OCSVM) method (Schölkopf et al., 2000) constructs a separating hyperplane in high-dimensional space by minimizing the structural risk to capture regions of data belonging to the legitimate class. To prevent overfitting, this method allows a certain percentage of data instances (regularization parameter) to fall outside the separation boundary. The minimum covariance determinant (MCD) method (Hardin & Rocke, 2004) combines a multivariate location and scale estimator with a robust clustering algorithm so that the determinant of the covariance matrix is minimized for each cluster. This method is first trained to fit a minimum covariance determinant model and then the outlier score is calculated using the Mahalanobis distance. However, problems can arise when clusters overlap significantly, leading to poor convergence of the algorithm.

Ensembling Methods

Isolation Forest (Liu et al., 2008) aims to separate outliers from the rest of the data samples. To calculate an isolation score for the data instances, random forest is employed. The method assumes that outliers are susceptible to isolation and, therefore, can be isolated closer to the root of the tree. Specifically, the average path length from the root of the trees can be used to obtain the isolation score. Isolation trees are thus able to build sub-models on different data samples while maintaining low computational complexity and the ability to scale to handle large volumes of data and high-dimensional problems. Similarly, the lightweight on-line detector of anomalies (LODA) comprises a collection of weak learners represented by one-dimensional histograms approximating probabilities of random data projections. The use of sparse projections makes LODA robust to both the large number of samples and missing data, allowing the detection of anomalous samples in real time (Pevny, 2016).

Neural Network-Based Methods

Neural network-based methods utilize feature learning to reduce dimensionality. An autoencoder is an unsupervised neural network capable of nonlinear dimensionality reduction and reproducing input data vectors. Sakurada and Yairi (2014) showed that an autoencoder (AE) can be successfully applied to outlier detection. To detect outliers in financial fraud, AEs can be trained to learn legitimate behaviour and compute a reconstruction error representing the outlier score (Sakurada & Yairi, 2014). To achieve robustness in learning disentangled representations, the variational autoencoder (VAE) was proposed, which utilizes both the joint data distribution and their latent generative factors (Burgess et al., 2018). VAE represents a probabilistic graphical model whose posterior distribution is estimated using a neural network. The outlier score of VAE is calculated as the reconstruction probability. Recently, generative adversarial networks (GANs) have been deployed for unsupervised outlier detection. Specifically, multi-objective generative adversarial active learning (MO-GAAL) uses GANs to sample informative potential outliers following a mini-max game between a discriminator and a generator (Liu et al., 2019). Thus, GANs assist the discriminative algorithm in finding a boundary that can effectively separate fraudulent outliers from legitimate normal data. This has been exploited in several studies on financial fraud (Sethia et al., 2018; Delecourt & Guo, 2019).

3.3 Performance Evaluation

In many related studies (Du et al., 2018; Misra et al., 2020; Mubalaike & Adali, 2018), the ratio of correctly classified transactions to the total number of transactions (i.e., accuracy) has been used as the evaluation measure. However, in the scenario of class-imbalanced data, this measure fails to reflect the model performance on the minority (fraud) class.

As noted in previous research (Lopez-Rojas & Barneaud, 2019), an inherent problem in detecting financial fraud that needs to be addressed is the unknown distribution and impact of all fraudulent transactions. In the absence of an adequate measure of fraud detection performance, existing fraud detection approaches rely on traditional measures of classification performance. The most desirable performance measure is the ability to correctly identify fraudulent transactions (true positive rate). In addition, minimizing false positive and false negative transaction rates (see confusion matrix in Table 2) is also a desirable quality of fraud detection systems, especially in a changing fraudulent environment. Here, we use these standard classification measures to evaluate the performance of fraud detection models. The true positive rate (Recall) is defined as the number of transactions correctly identified as fraudulent as a percentage of all fraudulent transactions as follows:

$$\begin{aligned} Recall = \frac{TP}{TP+FN}, \end{aligned}$$
(3)

where TP and FN are the numbers of true positive and false negative transactions. The false positive rate (FPR) is the number of transactions incorrectly identified as fraudulent as a percentage of all legitimate transactions:

$$\begin{aligned} FPR = \frac{FP}{FP+TN}, \end{aligned}$$
(4)

where FP and TN are the numbers of false positive and true negative transactions. The false negative rate (FNR) is the number of transactions incorrectly identified as legitimate as a percentage of all fraudulent transactions:

$$\begin{aligned} FNR = \frac{FN}{TP+FN} = 1 - Recall. \end{aligned}$$
(5)

Table 2 Confusion matrix for fraud detection

In reality, financial institutions try to reduce the risk of fraud while trying to comply with regulations, but Recall is difficult to estimate in the real world because FN is unknown (hidden fraud). Therefore, financial institutions can only calculate Precision (i.e., the number of transactions correctly identified as fraudulent as a percentage of all transactions predicted to be fraudulent) (Lopez-Rojas & Barneaud, 2019):

$$\begin{aligned} Precision = \frac{TP}{TP+FP}. \end{aligned}$$
(6)

Previous studies have also considered the F1 measure (Pambudi et al., 2019; Schlör et al., 2021), defined as the harmonic mean of Precision and Recall:

$$\begin{aligned} F1 = 2 \times \frac{Precision \times Recall}{Precision+Recall}. \end{aligned}$$
(7)
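As a minimal sketch, the measures in Eqs. 3-7 can be computed directly from the four confusion matrix counts. The function name `classification_metrics` and the example counts are hypothetical and serve illustration only; they are not taken from the experiments.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute Recall, FPR, FNR, Precision and F1 (Eqs. 3-7) from counts."""
    # Guard against empty denominators, e.g. when no fraud is present.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "recall": recall,
        "fpr": fpr,
        "fnr": fnr,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts: 100 fraudulent and 9,900 legitimate transactions.
metrics = classification_metrics(tp=80, fp=20, tn=9880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that Recall and FNR require FN, which is observable only when all fraud is labelled, as is the case in simulated data.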

The area under the receiver operating characteristic curve (AUC) has also been used as a more appropriate measure for fraud detection in mobile payment transactions due to its robustness to imbalanced data (Buschjäger et al., 2021; Mendelson & Lerner, 2020). AUC can be defined as the probability that a fraud detection model ranks a randomly selected fraudulent transaction higher than a randomly selected legitimate transaction, as follows:

$$\begin{aligned} AUC = {\int _{0}^{1}}Recall(T) \times \frac{d}{dT}FPR(T)dT, \end{aligned}$$
(8)

where T is the cut-off point.
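The probabilistic interpretation of AUC given above can be illustrated with a short sketch that counts, over all pairs of one fraudulent and one legitimate transaction, how often the fraudulent one receives the higher score (ties count as one half). The function name `auc_by_ranking` and the toy scores are assumptions made for illustration; the quadratic pairwise loop is meant for small examples only and is not how the experiments were run.

```python
def auc_by_ranking(scores, labels):
    """Estimate AUC as the share of (fraud, legitimate) pairs ranked correctly."""
    fraud = [s for s, y in zip(scores, labels) if y == 1]
    legit = [s for s, y in zip(scores, labels) if y == 0]
    if not fraud or not legit:
        raise ValueError("AUC needs at least one transaction of each class")
    wins = 0.0
    for f in fraud:
        for g in legit:
            if f > g:
                wins += 1.0   # fraudulent transaction ranked higher
            elif f == g:
                wins += 0.5   # a tie counts as half a correct ranking
    return wins / (len(fraud) * len(legit))


# Hypothetical fraud scores for six transactions (1 = fraudulent).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = auc_by_ranking(scores, labels)
print(f"AUC = {auc:.3f}")
```

Because only the ordering of the scores matters, this estimate does not depend on a particular cut-off point T.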

3.4 Cost Savings Measure

In addition to the traditional performance measures above, we propose a cost savings measure to account for the financial implications of fraud detection models. The proposed measure was inspired by profit-based loan default prediction systems, which consider potential returns and losses (Papouskova & Hajek, 2019; Ye et al., 2018). On the one hand, correct detection of a fraudulent transaction leads to the following cost savings:

$$\begin{aligned} CS_{TP} = \sum \limits _{i=1}^{n}(TP_{i}\times A_{i} \times 3.36) - (TP_{B} \times A_{F} \times 3.36 ) , \end{aligned}$$
(9)

where \(CS_{TP}\) denotes the cost savings from TP transactions, \(TP_{i}\) is the i-th transaction of TP, \(A_{i}\) is the amount of the i-th transaction, \(TP_{B}\) is the number of TP transactions detected by the reference fraud detection system, and \(A_{F}\) is the average amount of fraudulent transactions. We also took into account that fraud now costs financial institutions $3.36 for every dollar lost to fraud and that the current average percentage of successful fraud attempts is 48%, so that the reference system is assumed to detect the remaining 52% of fraud attempts (i.e., \(TP_{B}\)=0.52).Footnote 1

On the other hand, mobile transactions generate a revenue margin of 3.5% on average (Bansal et al., 2019). Therefore, we also considered the cost of FP transactions, estimated as the decrease in the margin for these transactions:

$$\begin{aligned} Cost_{FP} = (TN \times A_{L} \times 0.035 ) - \sum \limits _{j=1}^{m}(FP_{j}\times AT_{j} \times 0.035), \end{aligned}$$
(10)

where \(Cost_{FP}\) is the cost of FP transactions, \(FP_{j}\) is the j-th FP transaction, \(A_{L}\) is the average amount of legitimate transactions, and \(AT_{j}\) is the amount of the j-th FP transaction. The total cost savings \(CS_{total}\) is then calculated as:

$$\begin{aligned} CS_{total} = CS_{TP} - Cost_{FP}. \end{aligned}$$
(11)

Note that the proposed measure is expressed in financial terms and is instance-dependent (with respect to the amount of each transaction), allowing for a direct interpretation by financial institutions.
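The following sketch transcribes Eqs. 9-11 literally into code to show how the measure depends on individual transaction amounts. Two points are assumptions of this sketch rather than statements from the text: \(TP_{B}\) is taken as 52% of the number of fraudulent transactions in the evaluated data, and all amounts and counts in the example are invented. The function name `cost_savings` is likewise hypothetical.

```python
FRAUD_COST_MULTIPLIER = 3.36  # cost per dollar lost to fraud (from the text)
REVENUE_MARGIN = 0.035        # average revenue margin of 3.5% (from the text)


def cost_savings(tp_amounts, fp_amounts, n_tn, tp_baseline,
                 avg_fraud_amount, avg_legit_amount):
    """Return (CS_TP, Cost_FP, CS_total) following Eqs. 9-11 as written."""
    # Eq. 9: savings of detected fraud minus savings of the reference system.
    cs_tp = (FRAUD_COST_MULTIPLIER * sum(tp_amounts)
             - tp_baseline * avg_fraud_amount * FRAUD_COST_MULTIPLIER)
    # Eq. 10: margin term for TN transactions minus margin of FP transactions.
    cost_fp = (n_tn * avg_legit_amount * REVENUE_MARGIN
               - REVENUE_MARGIN * sum(fp_amounts))
    # Eq. 11: total cost savings.
    return cs_tp, cost_fp, cs_tp - cost_fp


# Invented example: 4 fraudulent transactions, 3 of them detected.
n_fraud = 4
tp_baseline = 0.52 * n_fraud  # ASSUMPTION: reference system detects 52% of fraud
cs_tp, cost_fp, cs_total = cost_savings(
    tp_amounts=[1000.0, 2000.0, 3000.0],
    fp_amounts=[100.0, 300.0],
    n_tn=10,
    tp_baseline=tp_baseline,
    avg_fraud_amount=1500.0,
    avg_legit_amount=200.0,
)
print(f"CS_TP={cs_tp:.2f}, Cost_FP={cost_fp:.2f}, CS_total={cs_total:.2f}")
```

Since each detected transaction enters with its own amount, two models with identical Recall can yield different cost savings.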

4 Experimental Results and Analysis

4.1 Data

Consistent with most previous studies (Buschjäger et al., 2021; Du et al., 2018; Xenopoulos, 2017), we used the PaySim datasetFootnote 2 in this study. The main objective of the simulations performed by Lopez-Rojas and his research team (Lopez-Rojas et al., 2016, 2018; Lopez-Rojas & Barneaud, 2019) was to replicate typical fraud scenarios that have similar statistical characteristics to the original mobile payment transaction data. To this end, different types of fraudulent transactions were injected, including cash-in (increasing account balance), cash-out (withdrawing cash), payment (paying for goods or services), transfer (to another user) and debit (sending money to a bank account). PaySim simulated 743 time steps, representing thirty days of real-time data. To introduce fraudulent behaviour into the system, 1,000 fraudsters were included with a 3% probability of committing fraud at any time step. The dataset contains a total of 6,362,620 mobile transactions, of which 8,213 are fraudulent. Table 3 provides descriptive statistics of the dataset, and Fig. 4 shows the numbers and amounts of transactions in time steps.
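To make the degree of class imbalance concrete, the transaction counts above imply the following fraud rate, together with the accuracy that a trivial model labelling every transaction as legitimate would reach (a back-of-the-envelope calculation, not a result reported in the tables):

$$\begin{aligned} \frac{8{,}213}{6{,}362{,}620} \approx 0.129\%, \qquad \frac{6{,}362{,}620-8{,}213}{6{,}362{,}620} \approx 99.87\%. \end{aligned}$$

In other words, roughly one in 775 transactions is fraudulent, so a high overall accuracy alone says little about the ability to detect fraud.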

We opted for this dataset for several reasons (Lopez-Rojas & Barneaud, 2019). First, real-time historical data do not include enough fraudulent transactions. Therefore, some previous studies have considered all abnormal transactions to be fraudulent (Choi & Lee, 2017). Second, privacy protections prevent companies from making datasets public. Third, fraudulent behaviour is adaptive, making it difficult to create sufficiently diverse real-world fraud data. In addition, a similar approach based on typical real attack scenarios was taken in studies related to online banking fraud detection (Carminati et al., 2015).

Table 3 Attributes in the PaySim dataset
Fig. 4 Visualization of amounts and counts of transactions in the PaySim dataset

4.2 Experimental Setup

For data partitioning, we randomly split the data into training and testing sets with a 3:1 ratio (75% training data, 25% testing data). To ensure reliable performance evaluation, we repeated this process five times. Since the performance of the fraud detection methods strongly relies on their hyperparameters, we selected the hyperparameter values using 5-fold cross-validation on the training data (for the list of hyperparameters and their values, see Appendix Table 9). Then, we performed fraud detection in mobile payment transactions using the above supervised learning and outlier detection methods. For the experiments, we used the following implementations: (1) supervised learning methods in the Python library Scikit-Learn 0.23.0, (2) the RUS algorithm available in the library Imbalanced-Learn 0.6.2, and (3) the outlier detection methods available in the library PyOD (Zhao et al., 2019). The performance of the methods was evaluated using the measures defined above.
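The partitioning and under-sampling steps can be sketched with the Python standard library alone. This is an illustrative stand-in, not the Scikit-Learn or Imbalanced-Learn code used in the experiments. It assumes that under-sampling is applied only to the training portion and that legitimate transactions are reduced to a 1:1 ratio with fraudulent ones. The helper names and the toy labels are hypothetical.

```python
import random


def train_test_split_indices(n, test_fraction=0.25, seed=0):
    """Randomly split indices 0..n-1 into training and testing parts."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def random_under_sample(indices, labels, seed=0):
    """Keep all fraudulent rows and an equally sized random legitimate subset."""
    rng = random.Random(seed)
    fraud = [i for i in indices if labels[i] == 1]
    legit = [i for i in indices if labels[i] == 0]
    kept_legit = rng.sample(legit, min(len(legit), len(fraud)))
    balanced = fraud + kept_legit
    rng.shuffle(balanced)
    return balanced


# Toy labels: 20 fraudulent among 1,000 transactions (invented numbers).
labels = [1] * 20 + [0] * 980

# Five repetitions of the 75%/25% split, each with its own seed.
for run in range(5):
    train_idx, test_idx = train_test_split_indices(len(labels), 0.25, seed=run)
    balanced_train = random_under_sample(train_idx, labels, seed=run)
    n_fraud_train = sum(labels[i] for i in train_idx)
    print(f"run {run}: train={len(train_idx)}, test={len(test_idx)}, "
          f"fraud in train={n_fraud_train}, balanced train={len(balanced_train)}")
```

The testing part keeps its original class distribution in this sketch, so the reported measures would reflect the imbalance encountered in deployment.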

4.3 Empirical Results

We performed empirical experiments using the PaySim dataset. The results are presented in four parts. First, we investigate the performance of supervised learning methods and the effect of random under-sampling on their effectiveness. Second, the performance of outlier detection methods is evaluated. Third, the financial consequences of the fraud detection models are evaluated. Finally, the robustness of the models is tested using a bank payment dataset.

4.3.1 Supervised Learning Methods

In the first set of experiments, we compared the performance of four supervised learning methods (XGBoost, k-NN, SVM, and RF), without using RUS, to obtain baseline performance. Table 4 shows the testing results in terms of overall accuracy (Acc), AUC, F1, Precision and Recall. The values of the performance measures were obtained as the average of five experiments. For each performance measure, the number in bold represents the best value among the tested methods. The non-parametric Wilcoxon test was performed on the performance measure values obtained in the five experiments to statistically compare the best performing method with the remaining methods. Results that do not differ significantly from the best result at the 5% level with respect to AUC and F1 are marked with an asterisk.

In terms of accuracy, all the supervised learning methods used performed well. However, as noted above, the extreme class imbalance suggests that this evaluation measure is not as relevant in this case. As for the AUC measure, XGBoost was superior to the other methods, indicating a well-balanced performance for both legitimate and fraud classes. The good balance between Precision and Recall caused XGBoost to achieve the best results also in terms of F1 measure. By contrast, SVM and k-NN performed well only with respect to Precision and Recall, respectively, making them unsuitable methods for fraud detection in mobile payment transactions. Overall, these results indicate that only XGBoost without class-balancing adjustment is suitable for detecting fraud in such a class-imbalanced scenario.

Then, we investigated the effect of the RUS under-sampling procedure on the performance of the supervised learning methods. On the one hand, Table 4 shows that RUS greatly improved the values of AUC for SVM, RF and XGBoost. On the other hand, there was a considerable deterioration in F1, which can be attributed to the lower Precision achieved at the cost of higher Recall. In other words, RUS caused almost all fraudulent transactions to be detected, but this was accompanied by a substantial increase in the number of FP transactions. This resulted in a bias towards the minority class while reducing the accuracy for the majority class. It is worth noting that we also experimented with other heuristic-based under-sampling methods, such as edited nearest neighbour and Tomek links, to address the class imbalance problem but without improvement in detection performance. Finally, it should be noted that the execution time (training time + testing time) was substantially reduced by using RUS. For example, RUS+XGBoost was computationally most efficient with 2.38 seconds compared to 207.02 seconds required for XGBoost without using RUS.
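The drop in Precision under extreme class imbalance follows from simple arithmetic: even a small false positive rate is applied to a very large number of legitimate transactions. The sketch below expresses Precision in terms of Recall, FPR and the share of fraudulent transactions. The Recall and FPR values are hypothetical and are not taken from Table 4; only the PaySim fraud share is derived from the dataset description.

```python
def precision_from_rates(recall, fpr, prevalence):
    """Precision implied by Recall, FPR and the share of fraudulent transactions."""
    tp_share = recall * prevalence         # true positives per transaction
    fp_share = fpr * (1.0 - prevalence)    # false positives per transaction
    return tp_share / (tp_share + fp_share)


# Share of fraudulent transactions in PaySim (8,213 of 6,362,620).
paysim_prevalence = 8213 / 6362620

# The same hypothetical detector on balanced data and on PaySim-like data.
balanced = precision_from_rates(recall=0.99, fpr=0.01, prevalence=0.5)
imbalanced = precision_from_rates(recall=0.99, fpr=0.01,
                                  prevalence=paysim_prevalence)
print(f"Precision on balanced data:    {balanced:.3f}")
print(f"Precision on PaySim-like data: {imbalanced:.3f}")
```

A model that looks nearly perfect on a balanced training sample can therefore still produce mostly false alarms on the original test distribution, which is consistent with the deterioration in F1 described above.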

Table 4 Fraud detection performance of supervised learning methods

4.3.2 Outlier Detection Methods

In the second set of experiments, the performance of XGBOD was compared with that of other outlier detection methods. Table 5 shows that XGBOD significantly outperformed the remaining methods in terms of AUC and F1. In addition, XGBOD was also dominant with respect to both Precision and Recall, indicating excellent performance on both classes.

Table 5 Fraud detection performance of outlier detection methods

These results can be explained by the semi-supervised learning approach used in the XGBOD method. Unlike other outlier detection methods, XGBOD leverages the labels assigned to mobile transactions. In addition, the transactions contained in the majority class of legitimate transactions are fully utilized by the multiple unsupervised outlier detection methods that produce outlier scores in XGBOD. The XGBoost algorithm applied in the improved XGBOD feature space exhibits good robustness to overfitting and data imbalance, and outperforms the supervised learning methods reported in Table 4 in terms of AUC and F1. However, a drawback of XGBOD is its longer execution time, 4,256.25 seconds on average.
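The idea of an improved feature space can be illustrated with a simplified sketch of the augmentation step only: outlier scores from unsupervised detectors are appended to the original attributes, and a supervised learner such as XGBoost would then be trained on the augmented rows together with the labels. The two scorers used here (mean distance to the k nearest neighbours and the largest absolute z-score) are simple stand-ins chosen for this example, not the detectors used in the experiments or in PyOD, and the toy data are invented.

```python
import math
import statistics


def knn_distance_scores(rows, k=2):
    """Outlier score: mean Euclidean distance to the k nearest other rows."""
    scores = []
    for i, a in enumerate(rows):
        dists = sorted(math.dist(a, b) for j, b in enumerate(rows) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def max_zscore_scores(rows):
    """Outlier score: largest absolute z-score over all attributes of a row."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [max(abs(v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]


def augment_with_outlier_scores(rows):
    """Append both unsupervised outlier scores to every original row."""
    knn = knn_distance_scores(rows)
    zscores = max_zscore_scores(rows)
    return [list(row) + [k, z] for row, k, z in zip(rows, knn, zscores)]


# Toy transactions with two attributes; the last row is an obvious outlier.
rows = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.0, 1.1), (8.0, 9.0)]
augmented = augment_with_outlier_scores(rows)
for row in augmented:
    print([round(v, 3) for v in row])
```

The unsupervised scores are computed without labels, so they can draw on all legitimate transactions, while the labels enter only in the subsequent supervised training step.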

4.4 Financial Impact of Fraud Detection

To investigate the financial consequences of the evaluated fraud detection systems, we used the performance measures defined in Eqs. 9-11. Table 6 shows the average financial performance of all methods in terms of cost savings from TP transactions, cost of FP transactions and total cost savings. To calculate these results, we used the average amounts of fraudulent and legitimate transactions in the data, i.e., \(A_{F}\) = 1,468,000 and \(A_{L}\) = 178,200.

Table 6 Financial impact of fraud detection methods

In general, supervised learning methods outperformed outlier detection methods in terms of overall cost savings, which can be attributed to the high Recall values of supervised learning methods. Note that cost savings from TP transactions were considered to have a stronger financial impact on total cost savings compared to FP transactions. In contrast, XGBOD delivered the lowest costs associated with FP transactions, which is related to its high Precision performance. Surprisingly, SVM and unsupervised outlier detection methods used in previous studies (Buschjäger et al., 2021; Du et al., 2018) did not perform well in terms of financial impact and provided negative overall cost savings due to their low Recall values.

4.5 Comparison with State-of-the-art Methods

To further show the effectiveness of the proposed fraud detection model, the obtained AUC was compared with that of previous studies that examined the same dataset (Table 7). The best AUC performance thus far reported was achieved using SVM with LogDet regularization (Du et al., 2018). Our results for SVM in Table 4 confirm that equipping SVM with LogDet regularization improves the AUC performance. Indeed, the traditional SVM method is reportedly sensitive to outliers and noisy data (Shajalal et al., 2021). Table 7 also shows that deep neural networks performed well in previous studies (Schlör et al., 2021; Xenopoulos, 2017). However, their performance is limited by the relatively low number of fraudulent transactions in the dataset. By contrast, the worst performance was reported for the Isolation Forest method (Buschjäger et al., 2021). Note that the results for Isolation Forest obtained here (Table 5) are consistent with those from Buschjäger et al. (2021). The results in Table 7 suggest that the proposed XGBoost-based models perform better than those used in previous studies in terms of AUC, which can be attributed to their good scalability and efficient processing of sparse data.

Table 7 Comparison of fraud detection performance of the proposed XGBoost-based models with the results of previous studies

4.6 Robustness Check on Bank Payment Datasets

To confirm the obtained performance evaluation, we checked the robustness of the considered fraud detection methods using a bank payment dataset. The BankSim datasetFootnote 3 (Lopez-Rojas & Axelsson, 2014) was generated using a multi-agent simulation based on a sample of transactional data from a Spanish bank. The dataset was validated using statistical techniques and social network analysis of customer-merchant relationships, thus approximating key features of real bank payment frauds. Each transaction was characterized by payment amount (in EUR), customer and merchant zip codes, customer gender and age, and merchant category (e.g., fashion, technology, transport, and travel). A total of 594,643 transaction records were included, of which 7,200 were fraudulent. The simulation was run for 180 steps, representing approximately six months. Thieves were injected to steal or clone an average of three credit cards at each step and conduct approximately two fraudulent transactions per day. The result of the simulation is depicted in Fig. 5.

Fig. 5 Visualization of amounts and counts of transactions in the BankSim dataset

The BankSim dataset provides a benchmark for detecting fraud in bank payment transactions, as several recent studies have shown (Cui et al., 2021; Vaughan, 2020). As a robustness check, we trained the evaluated models on the BankSim dataset using the same experimental setup as for the PaySim dataset. Note that the sampling process and data collection system differed between the two datasets, which allowed us to verify the robustness of the tested fraud detection models. The results in Table 8 suggest that the under-sampling procedure is not as effective for smaller financial fraud datasets, improving the performance of supervised learning methods only in terms of AUC. In contrast, the performance of unsupervised outlier detection methods substantially improved compared to the large PaySim dataset, suggesting their poor scalability. Overall, XGBoost and XGBOD performed well in terms of both AUC and F1 measures, indicating their good robustness to data size.

Table 8 Fraud detection performance on the BankSim dataset

4.7 Discussion

Prior studies (Buschjäger et al., 2021; Pambudi et al., 2019) have noted the importance of addressing the problem of extreme class imbalance in mobile payment transactions. Therefore, our first set of experiments was designed to investigate the effect of under-sampling the majority class of legitimate transactions on the performance of supervised learning methods. Consistent with Pambudi et al. (2019), we observed that the detection performance improved for most of the machine learning methods, especially for the proposed RUS+XGBoost fraud detection model. In contrast to earlier findings (Buschjäger et al., 2021), the second set of experiments did not provide any evidence for the effectiveness of unsupervised outlier detection methods. When outlier detection was conducted in a semi-supervised manner, however, the proposed XGBOD detection model was found to be superior even to the supervised learning methods. Finally, the financial consequences of the fraud detection models were examined to provide guidance on how to choose appropriate performance metrics for fraud detection in mobile payment transactions. This experiment addressed the need for an adequate measure of fraud detection performance as raised in recent research (Lopez-Rojas & Barneaud, 2019). We found that RUS+XGBoost performed best in terms of cost savings from correctly detecting fraudulent transactions, while XGBOD minimized the cost of false positive transactions.

Based on the experimental results of this study, we propose the following suggestions for mobile payment systems.

Firstly, the providers of mobile payment systems should pay more attention to recent developments in machine learning research. Specifically, XGBoost enhanced with class-balancing or outlier detection methods should be applied to effectively handle the extreme class imbalance problem in the data and accurately detect fraud in mobile payment transactions. RUS+XGBoost is particularly recommended for its low execution time, indicating its capability for real-time fraud detection.

Secondly, cost savings and transaction costs should be considered when implementing fraud detection systems in mobile payment systems. For fraud detection models in mobile payment systems, these evaluation metrics are critical due to the high costs associated with mobile payment fraud. The proposed cost savings measure can be used for this purpose as it offers providers appropriate guidance for selecting cost-effective fraud detection systems.

The importance of accurate and cost-effective fraud detection systems has dramatically increased during the COVID-19 pandemic because many emerging and developing countries used mobile money transfer to provide COVID-19 aid (Blumenstock, 2020). Indeed, mobile payment systems provide a fast and scalable solution while complying with social distancing measures, which encouraged government-to-person mobile payments. To enable sustainable solutions for mobile money transfer, fraud detection technologies represent a critical component of the frameworks for sustainable government-to-person mobile money transfers proposed in response to COVID-19 (Davidovic et al., 2020).

Finally, our results suggest that unsupervised outlier detection methods are not appropriate for fraud detection in mobile payment transactions. The current study was unable to evaluate the use of the fraud detection system in a real environment because the number of labelled instances is insufficient in existing real-world data. Instead, we experimented with a controlled environment with fraudulent behaviour injected into the data to obtain a well-performing fraud detection system. However, we believe that the accuracy of the proposed fraud detection system would not deteriorate in real-world applications as the data used in this study are based on real-world anonymized data. To further improve the detection accuracy and to assist the providers of mobile payment systems with the development of fraud detection systems, large labelled real-world datasets should be collected and made available to enable effective training of state-of-the-art supervised learning methods.

5 Conclusion

In this paper, we have proposed an XGBoost-based fraud detection framework while considering the financial impact of fraud detection. The findings from this study make several noteworthy contributions to the current literature. First, the XGBoost model was combined with under-sampling to effectively address the problem of extreme class imbalance and avoid overfitting. Second, to fully exploit the large amount of underlying data, unsupervised outlier detection methods were integrated into the XGBoost-based model. The comparison of the XGBoost-based fraud detection performance with various state-of-the-art machine learning methods confirmed that we have found a cutting-edge solution for fraud detection in mobile payment systems. Our findings also suggest a role for the proposed model in promoting cost savings of fraud detection systems. Taken together, our results strongly argue against a major role of single machine learning methods and unsupervised outlier detection methods in fraud detection of mobile payment transactions, implying that ensemble XGBoost-based methods are preferable.

In future research, ensemble methods combined with alternative under-sampling and unsupervised outlier detection methods should be further investigated, including automatic optimization of outlier detection ensembles (Reunanen et al., 2020) and the XGBoost method enhanced with weighted and focal losses (Wang et al., 2020). Unfortunately, it was not possible to investigate our model’s robustness against different mobile payment transaction data distributions due to privacy concerns and other limitations of existing datasets. Therefore, further data would be needed to evaluate model robustness, including testing the feasibility of transfer learning across multiple datasets. The proposed fraud detection models should also be applied to related fraud detection problems, such as credit card and loan fraud, which also exhibit class imbalance and for which large real-world datasets are available (West & Bhattacharya, 2016). Other possible application fields of the proposed model include credit scoring (default prediction) (Mahbobi et al., 2021), direct marketing (Wong et al., 2020), and customer churn prediction (Wong et al., 2020). An issue that was not addressed in this study was the interpretability of the fraud detection models. Therefore, further research might explore the trade-off between detection accuracy and interpretability (Hajek, 2019). Finally, the current investigation was limited by the use of the cost savings measure only to evaluate the trained model, and not in the objective function of the fraud detection model. Future research should therefore examine the performance of fraud detection models using the cost savings measure as the objective function. This could lead to our model delivering even greater cost savings to the end user.