1 Introduction

Credit card fraud jeopardizes customers' trust in e-commerce transactions. This has led in recent years to major advances in the design of fraud detection systems (FDS), which aim to detect fraudulent transactions within a short time and with high precision. Thanks to those improvements, fraud losses are expected to decrease for the first time in decades [1]. Nevertheless, FDS need continual improvement to keep pace with fraudsters.

FDS process streams of credit card transactions using both expert-based and data-driven detection models. Training a data-driven model is challenging because of unbalancedness [2] (frauds represent less than 1\(\%\) of all transactions), concept drift (due to seasonal effects and evolving fraudster strategies), large size (millions of transactions per day) and the streaming nature of the data. Because of the first three challenges, models actually used in production are often trained with batch learning [3], leading to suboptimal solutions in sequential settings. Batch learning indeed requires frequent retraining steps to cope with sequential data. However, batch retraining with all available data is typically unfeasible because of technical, temporal and data storage limitations. For this reason, sliding window mechanisms [4] are used as an intermediate solution, focusing the retraining on a fraction of recent samples.

Once the retraining rate coincides with the arrival rate, we are in an incremental learning setting. Online learning deals with sequential incoming data by returning a prediction after each observed instance. Once the true class is made known (e.g., by an oracle), the model is updated in order to minimize a given regret function. Well-known online algorithms are online gradient descent [5] and online Newton step [6].
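For illustration, here is a minimal sketch of this protocol with an online gradient step on the logistic loss (a toy example of ours, not the exact algorithm of [5]):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_gradient_descent(stream, dim, lr=0.01):
    """Online protocol sketch: predict, observe the label, update.

    `stream` yields (x, y) pairs, with x a feature vector and y in {0, 1}.
    """
    w = np.zeros(dim)
    for x, y in stream:
        p = sigmoid(w @ x)     # prediction returned before the label is known
        grad = (p - y) * x     # gradient of the logistic loss on this instance
        w -= lr * grad         # single-instance update (online gradient step)
    return w
```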

The efficiency of a learning approach in a sequential setting depends on the stationarity (or not) of the distribution underlying the observations. In the stationary case, the more samples are considered, the more accurate the estimation of the underlying distribution. Also, stationary settings favor batch learning approaches for their convergence properties (to local or global optima). At the same time, those properties may be detrimental if we are confronted with nonstationarity or concept drift. It is well known that concept drift makes fraud detection a difficult task, but the real challenge remains finding the appropriate trade-off between adaptation and learning, or between forgetting and memory. In this paper, we propose to tackle concept drift using both the sliding window and the incremental strategy.

In [7], we introduced the use of an ensemble of classifiers trained on different windows of transactions to deal with concept drift. Ensemble methods are based on the idea that the more we combine decorrelated models, the more we reduce variance and ensure stability. The notion of diversity therefore plays a major role within an ensemble [8]. Moreover, a diversity-based ensemble allows different concepts to be safeguarded, ensuring greater adaptability in case of recurring concepts and reducing the impact of historical models that have become harmful and would otherwise carry too much weight.

Another interesting recent approach to incremental learning has been proposed in [9]: DTEL (diversity- and transfer-based ensemble learning) combines sliding windows [4], transfer learning [10] and ensemble methods [11, 12].

The DTEL procedure is the following: Each time a new chunk of data is available, a new decision tree classifier is trained from this chunk. Then the historical trees (built from previous data chunks) are transferred to fit the current chunk. (Two transfer learning mechanisms are allowed: resetting the class of the leaves and growing subtrees according to the current chunk.) The set of historical trees is constrained to a fixed size: the removed tree is the one whose removal leads to the largest diversity in the ensemble (Yule's Q-statistic is used). Finally, this ensemble votes, for each sample, with each tree being weighted according to its prediction error on the current chunk.
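To fix ideas, here is a runnable approximation of one DTEL step in Python/scikit-learn (a sketch of ours, not the authors' code; the leaf-reset/subtree-growing transfer step of [9] is omitted, and a fresh tree is simply trained on each chunk):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def yule_q(pred_a, pred_b, y):
    """Yule's Q-statistic between two members' correctness patterns."""
    a, b = pred_a == y, pred_b == y
    n11, n00 = np.sum(a & b), np.sum(~a & ~b)
    n10, n01 = np.sum(a & ~b), np.sum(~a & b)
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10 + 1e-9)

def dtel_step(ensemble, X, y, max_size=5):
    """One (simplified) DTEL step on a new chunk (X, y)."""
    # 1. Train a new tree on the incoming chunk (transfer step omitted here).
    ensemble = list(ensemble) + [DecisionTreeClassifier(max_depth=8).fit(X, y)]
    # 2. If the ensemble exceeds its fixed size, drop the member most
    #    similar to the rest (highest mean Q), maximizing residual diversity.
    if len(ensemble) > max_size:
        preds = [t.predict(X) for t in ensemble]
        mean_q = [np.mean([yule_q(preds[i], preds[j], y)
                           for j in range(len(ensemble)) if j != i])
                  for i in range(len(ensemble))]
        ensemble.pop(int(np.argmax(mean_q)))
    # 3. Weight each tree by its (inverse) error on the current chunk.
    weights = [1.0 / (np.mean(t.predict(X) != y) + 1e-9) for t in ensemble]
    return ensemble, weights
```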

The rationale of transfer learning is to fill the gap between two supervised learning tasks by reusing what was learned from the former (called source) to better address the latter (referred to as target) [10]. In this sense, transfer learning may be useful to deal with concept drift if we assume that different concepts are somewhat related [10].

This paper aims to assess the interest of adopting incremental strategies to detect fraudulent credit card transactions in a real-world FDS setting. We aim to answer two main questions:

  1. Is incremental learning competitive with conventional batch and retraining approaches?

  2. Is a DTEL-like approach, combining incremental, ensemble and transfer learning, robust with respect to concept drift?

In order to provide an answer, we designed and implemented 16 different approaches to deal with sequential streams of transactions. The comparative assessment is based on a five-month dataset (more than 50 million e-commerce transactions) provided by our industrial partner. For the sake of comparison, all the approaches rely on a dense neural network (NN), whose architecture had been defined by previous research. NNs are often used in fraud detection [13, 14] and play a major role in continuous lifelong learning [15]. There has also been a lot of recent work on fraud detection in the neural network community, including work on convolutional neural networks [16], autoencoders [17, 18], long short-term memory and gated recurrent units (GRU) [19,20,21], resampling [19] and ensembles [21] of neural networks.

As far as the DTEL-like approach is concerned, we extend DTEL [9] in three directions: (i) we move from decision trees to (GPU-trained) neural networks, which makes it possible to test the approach in a more realistic and complex setting (the original paper mostly includes shorter and synthetic streams); (ii) we assess several diversity criteria; and (iii) we adopt a transfer learning strategy designed specifically for fraud detection.

Note that our approach to dealing with concept drift can be defined as passive according to the active/passive distinction made in [22]. Active adaptation, which requires a specific test to detect a distribution change [23], is suitable when a small number of major changes occur over time. This is not the case in fraud detection, where several drifts occur on a daily basis.

The rest of this paper is structured as follows: Sect. 2 introduces background and notation. Sections 3 and 4 analyze the two questions addressed in the paper, respectively. Section 5 proposes an additional assessment on a public dataset. Finally, Sect. 6 concludes the paper with some perspectives for future work.

2 Background and notation

Let us consider an FDS ingesting a stream of n credit card transactions \(Tr_1,Tr_2,...,Tr_n\). Stream processing systems (e.g., Spark [24]) typically group them into batches (or data chunks) according to specific criteria, such as a maximum batch size or a time interval (e.g., daily transactions).

Once the FDS raises an alert, the transaction is checked by investigators before deciding on the relevant action (e.g., contacting the customer, blocking the card) [7]. Given the limited number of investigators, it is crucial for an FDS to return the most accurate ranking (e.g., in terms of a conditional probability score) within the budget k of alerts that can be investigated (typically 100). Such accuracy is measured in terms of precision within the first k alerts, denoted here Pr@k. A measure of the accuracy of the entire ranking is the area under the precision–recall curve (AUPRC), known to be better suited for unbalanced classification than the area under the ROC curve [25, 26].
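For concreteness, both metrics can be computed as follows (a minimal numpy/scikit-learn sketch; the score and label arrays are assumed inputs):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def precision_at_k(y_true, scores, k=100):
    """Precision within the k highest-ranked alerts (Pr@k).

    y_true: binary fraud labels (numpy array); scores: fraud scores.
    """
    top_k = np.argsort(scores)[::-1][:k]   # indices of the k top-ranked alerts
    return np.mean(y_true[top_k])

def auprc(y_true, scores):
    """AUPRC, via average precision (the usual step-wise approximation
    of the area under the precision-recall curve)."""
    return average_precision_score(y_true, scores)
```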

Fig. 1

Real-life FDS scenario with three sets of data. It takes several days to inspect all transactions, mainly because it is sometimes the card holder who reports undetected frauds. Hence, in practice, the fraud tags of the gap set are unknown. This scenario is repeated each day, with the sliding window approach, as the parameter \(\tau \) is incremented

A peculiarity of real FDS is the delayed feedback [7]: transaction labels become available only several days later, once customers have reported unauthorized transactions. The delay in obtaining accurate labels, and the interaction between alerts and supervised information, have to be carefully taken into consideration when learning in a concept-drifting environment. From a batch learning perspective, this delay is represented by the “gap set” (also simply called the gap) in Fig. 1: the part of the data that cannot yet be considered annotated. In our experiments, in accordance with field experts, the gap is seven days long.

Notice that in Fig. 1, \(\tau \) refers to the testing day (and therefore positions the whole sliding window) and that models are built on the 15 days that precede the gap set. By changing \(\tau \), we obtain different testing days. Strategies to incorporate the gap set data can be found in [27, 28]; in our case, gap set data are simply discarded.
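This window bookkeeping is easy to make explicit; here is a sketch with day-indexed batches (the day numbers in the example are illustrative, not the dataset's actual split):

```python
def training_days(tau, gap=7, train_len=15):
    """Days mature enough to train on when testing day `tau`.

    The gap days [tau - gap, tau - 1] are discarded (labels still unknown);
    training uses the `train_len` days immediately before the gap.
    """
    start = tau - gap - train_len
    return list(range(start, tau - gap))

# Example: testing day 83 trains on days 61..75 and skips gap days 76..82.
assert training_days(83) == list(range(61, 76))
```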

2.1 Data and code

The database consists of about 50M e-commerce transactions that occurred during 153 days (61 for model initialization, 7 gap days, 15 validation days used to set the hyperparameters, and 70 test days). The fraud ratio is 0.201%, and each transaction is described by 23 features. All data are standardized. Validation days are used to tune the hyperparameters introduced in the methodological sections and detailed in Table 3. Though the data cannot be made available for confidentiality reasons, the data distribution is similar to that of the Kaggle dataset [29], an older, two-day-long, anonymized extract from the same database. Section 5 replicates the main findings of this paper on this widely used dataset.

Note that several choices in our approach stem from the fact that, in the data provided by the industrial partner, time-related information is featurized into time-based aggregates [28]. This reduces the potential of techniques like time-based CNNs [16] or LSTMs, recently adopted to detect frauds in sequences of transactions (e.g., at the card holder level) [30].

All experiments were carried out on a server with 10 cores, 256 GB RAM and an Asus GTX 1080 Ti. Keras [31] was used for the NN implementation and training. The code is not made publicly available, as some of its content would disclose industrial secrets of our industrial partner.

3 Batch, retraining or incremental?

This section addresses the first methodological question of the paper, concerning the impact of the learning strategy (batch, retraining or incremental) on detection accuracy. To this end, we designed four algorithms to deal with the sequential setting:

  1. Batch (Algorithm 1): the classifier is built once, never updated and applied to the testing set.

  2. Retrain(F) (Algorithm 2): the classifier is retrained every F days on the two latest months of data and used to predict the incoming one. It is then discarded and a new one is built with the same procedure.

  3. Incremental(F) (Algorithm 3): the classifier is updated (but not discarded) every F days, using the two latest months of data, and used to predict the incoming one.

  4. IncrementalEA(F): an unsupervised approach based on an anomaly detection autoencoder [32].

Each batch \(B_i\) contains the transactions of the \(i\)th day. In the pseudo-codes, \(batch_{cl}\) and \(online_{cl}\) refer to the batch and online neural network (NN) implementations of the classifier.
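The distinction between the retraining and incremental families boils down to whether the weights are re-initialized before each update. The following sketch makes this concrete with synthetic day batches and a linear scorer standing in for the NN (an illustration of ours, not Worldline's production code):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic day-indexed batches standing in for B_1, ..., B_n
# (random data, for illustration only).
rng = np.random.default_rng(0)
days = [(rng.normal(size=(500, 23)), rng.integers(0, 2, size=500))
        for _ in range(10)]

def retrain_step(window):
    """Retrain(F): a fresh classifier refit from scratch on the recent window."""
    X = np.vstack([x for x, _ in window])
    y = np.hstack([y for _, y in window])
    return SGDClassifier(loss="log_loss").fit(X, y)

def incremental_step(model, window):
    """Incremental(F): the same model is kept and warm-updated via partial_fit."""
    for X, y in window:
        model.partial_fit(X, y, classes=[0, 1])
    return model

model = SGDClassifier(loss="log_loss")
for i in range(3, len(days)):                       # slide over the stream
    fresh = retrain_step(days[i - 3:i])             # discard-and-refit variant
    model = incremental_step(model, days[i - 1:i])  # update-in-place variant
```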

To avoid biases related to the classifier structure, all NNs used in this paper share the same topology, namely the one currently used in production by our industrial partner Worldline. It is a dense neural network, selected after extensive parameter tuning addressing the unbalancedness issue. In an internal production setting, this NN was shown to outperform other state-of-the-art approaches for unbalanced data, like focal loss [33], class-balanced loss [34] and classical resampling strategies [35]. For the sake of confidentiality, no more details may be disclosed. For a wider discussion of unbalancedness in the specific case of fraud detection, please refer to [7].

The approaches in Table 1 were assessed in a first experimental session, whose results are reported in Table 2.

The first conclusion of those experiments is that batch approaches (Batch, Retrain(30) and Retrain(1)) are outperformed by their incremental counterparts (Fig. 2). Batch and Retrain(30) are significantly less accurate than all incremental approaches. Retrain(1) is statistically equivalent to Incremental(1) but its training time is 180 times longer (see Table 2).

Our second conclusion is that the update frequency of the incremental approaches (Incremental(30), Incremental(7), Incremental(1)) can be tuned to boost performance, but the effect is limited. Overall, adopting an incremental approach increases the average AUPRC from around 6.7% to 10.3%.

As far as IncrementalEA(1) is concerned, the results are not encouraging. This is in line with results of our previous research [7, 36] showing the inadequacy of purely unsupervised methods in fraud detection.

Figure 2 summarizes the previous results in the form of a Friedman/Nemenyi (F/N) test [37]. All Friedman null hypotheses were rejected with \(\alpha = 0.05\). For the Nemenyi post hoc test, a method is considered significantly better than another if its mean rank exceeds the other's by more than the critical difference CD (the higher the rank, the better). Results for additional metrics are given in Appendix A.
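For reference, the test can be reproduced with scipy as follows (the scores are hypothetical placeholders; the CD formula and the \(q_{0.05}\) constant for four methods follow Demšar's presentation of the Nemenyi test in [37]):

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# scores[m, d]: AUPRC of method m on test day d (hypothetical placeholder data)
scores = np.random.default_rng(1).uniform(size=(4, 70))

stat, p = friedmanchisquare(*scores)  # Friedman null: all methods rank equally

k, N = scores.shape                   # k methods compared over N days
ranks = rankdata(-scores, axis=0)     # rank 1 = best method on each day
mean_ranks = ranks.mean(axis=1)

# Nemenyi critical difference; q_0.05 ~= 2.569 for k = 4 (studentized range).
cd = 2.569 * np.sqrt(k * (k + 1) / (6.0 * N))
# Two methods differ significantly if |mean_rank_i - mean_rank_j| > cd.
```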

(Algorithms 1, 2 and 3: pseudocode figures for the Batch, Retrain and Incremental procedures)
Table 1 This table summarizes the considered approaches for the first part of this paper. More details about the strategies and hyperparameters can be found in Sect. 3
Table 2 This table summarizes the results for the first part of this paper. Update time is an indicative execution time to update the model after each new data batch arrives. Each result is averaged over five runs. Notice that since the NN training is performed on GPU and data manipulation on CPU, the training part is not necessarily the most time-consuming part of the update. * Model is only updated each month or week. ** Model is never updated
Fig. 2

Friedman–Nemenyi test based on the card-based AUPRC (70 days with one score per day). The plot compares batch approaches versus incremental approaches. Notice that the a priori fraud ratio based on cards is 0.201%. Each of the 70 results is averaged over five runs. The averages over the 5x70 individual results are reported in Table 2

4 DTEL-like improvement of incremental learning

This section implements a DTEL-like enhancement of the incremental strategy used in the previous section. Some changes with respect to the original DTEL are made:

  1. Given the importance of ranking in the AUPRC and Pr@100 assessment, we compute diversity with Spearman's rank correlation and other correlation measures (e.g., negative correlation) instead of Yule's Q-statistic.

  2. We change the transfer learning strategy according to our recent work on transfer learning for fraud detection [38] (Algorithm 4). Our transfer process is a nonlinear monotonic transformation (footnote 1) of the cumulative distribution of the new data batch (the source domain) into the cumulative distribution of the data used to initialize the incremental model (the target domain). It can easily be achieved by modifying the percentile normalization; a generic sketch is given after this list. (To make the paper self-contained, a short description is provided in Appendix B.)

  3. The ensemble relies on the simple average of the probability scores of 10 independently trained NNs.
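As an illustration of point 2, here is a generic per-feature quantile mapping in numpy (a simplified stand-in for the modified percentile normalization of [38], which we do not reproduce exactly):

```python
import numpy as np

def quantile_map(source_col, target_col):
    """Monotonically map one feature so that the empirical CDF of the new
    batch (source) is sent onto the empirical CDF of the initialization
    data (target)."""
    u = np.searchsorted(np.sort(source_col), source_col, side="right") \
        / len(source_col)                # empirical CDF value of each point
    return np.quantile(target_col, u)    # matching quantile in the target
```

Applied column by column, this preserves the ranking within the batch while aligning its marginal distributions with those seen at initialization.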

(Algorithm 4: pseudocode figure for the transfer learning step)
Table 3 This table summarizes the approaches considered in this part of the paper. More details about the strategies and hyperparameters can be found in the main text
Table 4 This table summarizes the results for the second part of this paper. Update time is an indicative execution time to update the model after each new data batch arrives. Each result is averaged over five runs. Notice that since the NN training is performed on GPU and data manipulation on CPU, the training part is not necessarily the most time-consuming part of the update

We implemented several variants of the incremental algorithm (Algorithm 3; see Table 3 and footnote 2) by combining different ensemble, diversity and transfer learning aspects:

  • E: Ensemble version of Algorithm 3.

  • \(\texttt {ED}_{R}\): variant of E where each NN of the ensemble is independently updated with probability p (a hyperparameter calibrated by means of the validation set). After a few days, each NN has therefore been trained on a different subset of days.

  • \(\texttt {ED}_{R}{} \texttt {T}\): variant of \(\texttt {ED}_{R}\) including the transfer learning step discussed above.

  • \(\texttt {ED}_{Sp}\): variant of E where, in the ranking phase, Spearman's correlations are computed among all the NN predictions, and the most correlated NN is re-initialized at the end of the for loop (see the sketch below). A variant consisting in re-initializing \(n_r\) NN(s) per day was also considered; \(n_r\) must be tuned.

  • \(\texttt {ED}_{Sp}{} \texttt {T}\): variant of \(\texttt {ED}_{Sp}\) enhanced with transfer learning.

  • \(\texttt {ED}_{cos}\): variant of \(\texttt {ED}_{Sp}\), using cosine similarity instead of Spearman's correlation.

  • \(\texttt {ED}_{ELM}\): variant of E where extreme learning [40] is used as a source of diversity. Each NN is extended with \(n_l\) (to be tuned) layer(s) between the input and the first dense trainable layer. The added layers are randomly initialized, untrainable, and share the same size as the largest layer of Worldline's industrial NN. A linear activation function was selected by trial and error.

  • \(\texttt {ED}_{NC}\): variant of E with a negative correlation (NC) loss function containing a term promoting diversity in the ensemble. NC learning [41] introduces a penalty term into the loss function of each NN in the ensemble so that all the NNs can be trained simultaneously and interactively on the same data. The error function \(\mathcal {E}_i\) for NN i in NC learning is defined by (a numpy sketch of this loss follows the list):

    $$\begin{aligned} \mathcal {E}_i = \frac{1}{N} \displaystyle \sum _{n=1}^N \frac{1}{2} (F_i(n) - t(n))^2 + \frac{\lambda }{N} \displaystyle \sum _{n=1}^N p_i(n) \end{aligned}$$
    (1)

    where \(F_i(n)\) is the output of NN i for sample n, t(n) is the target value for sample n and N is the number of training samples. The first term is the empirical risk of NN i. The second term, with \(p_i(n) = (F_i(n)-F(n)) \sum _{j\ne i} (F_j(n)-F(n))\), is a correlation penalty, where F(n) is the output of the ensemble of NNs.

  • \(\texttt {ED}_{NC}{} \texttt {T}\): variant of \(\texttt {ED}_{NC}\) with a transfer learning step.
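A compact numpy rendering of Eq. (1), useful to check the penalty term (the function name and toy dimensions are ours):

```python
import numpy as np

def nc_losses(F, t, lam=0.5):
    """Per-member NC losses of Eq. (1).

    F: (M, N) outputs of the M ensemble members on N training samples;
    t: (N,) targets; lam: the diversity trade-off lambda.
    """
    F_bar = F.mean(axis=0)                     # ensemble output F(n)
    emp = 0.5 * np.mean((F - t) ** 2, axis=1)  # empirical risk of each member
    dev = F - F_bar
    # p_i(n) = (F_i(n) - F(n)) * sum_{j != i} (F_j(n) - F(n))
    p = dev * (dev.sum(axis=0) - dev)
    return emp + lam * np.mean(p, axis=1)
```

Note that since the ensemble output is the simple average, \(\sum _j (F_j(n)-F(n)) = 0\) and hence \(p_i(n) = -(F_i(n)-F(n))^2\): the penalty rewards members that deviate from the ensemble mean.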

Note that transfer learning can be added to any of these approaches. To avoid reporting and analyzing every approach twice, we chose to include the transfer learning variants only for a subset of the approaches in this study.
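Similarly, the diversity-based reset of \(\texttt {ED}_{Sp}\) amounts to the following selection (a sketch assuming an ensemble of more than two members):

```python
import numpy as np
from scipy.stats import spearmanr

def most_correlated_member(preds):
    """Index of the member whose ranking is, on average, the most Spearman-
    correlated with the others (the candidate for re-initialization).

    preds: (M, N) array, one row of fraud scores per ensemble member, M > 2.
    """
    rho, _ = spearmanr(preds, axis=1)   # (M, M) pairwise rank correlations
    np.fill_diagonal(rho, 0.0)
    return int(np.argmax(rho.mean(axis=1)))
```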

Table 4 summarizes the 70 evaluations (one per test day) of each metric for each approach. We report the mean and standard deviation of the AUPRC and the Pr@100. To limit the variability induced by NN training, each result is the mean over five different initializations. Most of the approaches are already quite stable, as they rely on ensembles of classifiers. Notice that the variability remains high in practice: this is due to the number of frauds to detect, which can differ greatly from day to day, and not to the initialization of the NNs. A calibrated version of the AUPRC, which removes the effect of the varying proportion of frauds to detect, is presented in Appendix A.

Table 4 also reports the mean execution time for updating the models. Notice that most models are updated daily, but some are updated less frequently. Also, since the NN training is performed on GPU and data manipulation on CPU, the training part is not necessarily the most time-consuming part of the update.

Figure 3 summarizes the results in the form of a Friedman/Nemenyi (F/N) test [37]. From this figure, it appears that:

  • The ensemble approach E significantly outperforms Incremental(1), yet at the price of a longer training time.

  • Adding diversity is sometimes beneficial (\(\texttt {ED}_R\), \(\texttt {ED}_{NC}\)) and sometimes not (\(\texttt {ED}_{Sp}\), \(\texttt {ED}_{ELM}\), \(\texttt {ED}_{cos}\)). In the beneficial cases, the gain is limited but the training time is not really impacted.

  • Considering transfer learning does not make a big difference: compare \(\texttt {ED}_{Sp}\) vs. \(\texttt {ED}_{Sp}{} \texttt {T}\), \(\texttt {ED}_R\) vs. \(\texttt {ED}_R\texttt {T}\) and \(\texttt {ED}_{NC}\) vs. \(\texttt {ED}_{NC}{} \texttt {T}\). The mean ranks are very close, and the F/N test concludes to a draw in all three cases (this was also the case for unreported pairs). Our conclusion is to use either \(\texttt {ED}_R\) or \(\texttt {ED}_{NC}\) with no transfer, unless the transfer is justified by a lack of source data (as in [39]).

This second set of experiments allows us to increase the performance from around 10.3% average AUPRC (at the end of the first set of experiments) to 11.6% average AUPRC (see Fig. 4 for the results as a function of time).

Fig. 3

Friedman–Nemenyi test based on the card-based AUPRC (70 days with one score per day). The plot compares various ensembles and baseline approaches. Notice that the a priori fraud ratio based on cards is 0.201%. Each of the 70 results is averaged over five runs. The averages over the 5x70 individual results are reported in Table 4

Fig. 4

The two sets of experiments show an increase in performance from around 6.7% average AUPRC to 11.6% average AUPRC

5 Assessment on public data

Given the confidential nature of our dataset, it is interesting to assess the quality of our approach on a public dataset. In this section, we make use of the Kaggle “Credit Card Fraud Detection” dataset [29], made available some years ago by our research group to support reproducibility efforts in the fraud detection community [32]. Note that this is one of the few publicly available fraud detection datasets and one of the most accessed datasets on the Kaggle platform (more than seven million views so far). This dataset is a two-day-long, anonymized extract from the same process underlying the private dataset. Because of the limited time window, we have to adapt the time frames: the first day is used for training, and each data chunk refers to one hour of transactions. Also, since no card holder information is available, we can only report accuracy in terms of the area under the ROC curve.
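With the public creditcard.csv file, this adaptation looks as follows (a sketch with a simple linear scorer standing in for the NN of Sect. 3):

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("creditcard.csv")             # public Kaggle dataset [29]
df["hour"] = (df["Time"] // 3600).astype(int)  # Time is given in seconds
features = [c for c in df.columns if c.startswith("V")] + ["Amount"]

train = df[df["hour"] < 24]                    # first day used for training
clf = SGDClassifier(loss="log_loss").fit(train[features], train["Class"])

# One ROC AUC per hourly chunk of the remaining data (the second day).
aucs = [roc_auc_score(chunk["Class"], clf.decision_function(chunk[features]))
        for _, chunk in df[df["hour"] >= 24].groupby("hour")
        if chunk["Class"].nunique() == 2]      # skip hours without frauds
```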

Figure 5 reports the outcome of the Friedman/Nemenyi test. For simplicity, and to ensure sufficient statistical power for the test, we limit ourselves to reporting the most important methods and baselines from the previous sections. In spite of the smaller time frame, the trend of Figs. 2 and 4 is confirmed.

Fig. 5

Friedman–Nemenyi test based on the ROC AUC (24 hours with one score per hour). The plot compares the main approaches on the public dataset. Each of the 24 results is averaged over ten runs; the average over the 10x24 individual results is reported beside the method name

6 Conclusion

Incremental learning is one of the oldest learning strategies in AI. However, little evidence exists about the current use of this strategy in fraud detection production environments. Nowadays, the interest in adopting incremental approaches is rising also for nonanalytical reasons, e.g., the wish to avoid storing sensitive data and being exposed to infringements of the European General Data Protection Regulation (GDPR).

This paper shows that incremental learning is a competitive alternative to conventional batch learning settings for fraud detection in real-world transaction streams. Though the case study is limited to five months of data on e-commerce transactions, we believe that other fintech domains can benefit from our findings.

The paper discusses, implements and assesses 16 approaches (on private and public data) and answers two questions: (i) Is incremental learning viable for fraud detection? and (ii) Are there variants of incremental learning (notably based on ensemble learning, diversity and transfer learning) which may further improve its accuracy?

The answer to the first question is affirmative. The answer to the second is more nuanced, though the adoption of proper diversity measures in ensembles (notably negative correlation) appears to improve accuracy.

Future work will focus on extending the set of considered approaches (alternative transfer learning and diversity measures). We also intend to study the risk of catastrophic forgetting when adopting techniques to counter concept drift.