1 Introduction

The idea of using a card to pay for goods or services was described as early as 1887 by Edward Bellamy in his utopian novel Looking Backward. In 1934, American Airlines and the Air Transport Association introduced the Air Travel Card, a new way of booking tickets that let travellers “purchase now and pay later” [16]. Since then, the “purchase now and pay later” model has gained popularity in many markets, where such transactions are made with what are now known as credit cards. In 2018, at least 72% of adults were estimated to hold at least one credit card, and well over 1.12 billion credit cards were in use in the US alone [46]. Credit cards have revolutionized how people shop by letting ordinary consumers pay with a single swipe. However, they are not immune to mishaps and other difficulties, especially fraud [28, 46]. In general, credit card fraud is carried out in the following ways:

  • Theft of the card: Not every credit card fraud relies on technological malpractice; someone simply stealing a card and using it for their own benefit remains one of the primary forms of credit card fraud today.

  • Phishing or skimmed card details: In this category, fraudsters make scam calls to random people and try to obtain personal bank details, which they then use for illegal transfers of funds, in other words, to steal money online from account holders.

  • Cyber-attacks and data breaches: In this category, credit card or bank information is stolen and then used to commit fraud. In 2016, large-scale data breaches exposed the darker side of technology in the world of cybercrime, which enables fraudsters to commit sophisticated crimes [46].

As the volume of e-shopping, web-based banking, and online transactions increases, fraudsters exploit every weakness in a transaction system to make fraudulent transactions. Credit card fraud costs shoppers and companies a huge amount of money every year. Fraudsters constantly look for new methods and strategies to commit unlawful activities. Fraud detection systems have therefore become essential for banks and related financial institutions to limit such losses. Fraudulent transactions can be detected either by a classification approach or by detecting transactions that are outliers with respect to normal transactions [44]. In general, credit card fraud detection relies on the analysis of recorded transactions to determine whether a transaction is fraudulent or legitimate [53]. The major challenges in fraud detection are the identification of customers’ transaction patterns and behaviours, and the manual and timely investigation of transactions [21], which leads to the need for automated systems. In the past, data mining techniques have been applied extensively in this domain [19, 48, 60]. Apart from data mining, several other techniques have also been used, such as genetic algorithms [13], statistical analysis [14], network-based approaches [12], sequence classification [5, 37], and Hidden Markov Model-based approaches [40]. In the last few years, machine learning [47, 62, 63, 68, 77] and deep learning [25, 26, 29, 42] have been applied extensively to credit card fraud detection.

All these fraud detection techniques can be categorized as supervised learning and rely on datasets of past transactions. In credit card fraud detection, most transactions are legitimate and only a few are fraudulent. This leads to an imbalanced distribution of data, known as the class imbalance problem [2]. The imbalance may cause a prediction model to be biased towards legitimate transactions. In other words, if we build a classification model to predict fraudulent transactions, there is a high chance that fraudulent transactions are predicted as legitimate because of the massive imbalance between the two classes. Various approaches are available to deal with this issue. The existing approaches can be categorized as data-level, algorithm-level, cost-sensitive, and prediction-level. The data-level approach mainly consists of oversampling and undersampling, where an imbalanced dataset is balanced by bringing the majority and minority class samples into equal proportion [34, 63]. In the algorithm-level approach, the learning algorithm is modified to give greater weight to the minority class samples [73]. The cost-sensitive approach uses a parameter, known as cost, to balance the misclassification error for majority and minority class samples [3]. Apart from these approaches, ensemble learning is also applied to imbalanced datasets, where multiple learning algorithms are employed and their results are aggregated to make the final predictions [42]. This can be categorized as a prediction-level approach, in which imbalanced class distribution is handled by assigning different weights to the results of the classification models. In recent years, various state-of-the-art classification models for imbalanced data have been proposed that utilize one or more of the above-mentioned approaches [6]. In this paper, we apply data-level techniques for pre-processing the data and combine multiple machine learning models through ensemble learning.

In general, machine learning algorithms are used for a variety of decision, prediction, and problem-solving tasks. Likewise, predicting or detecting whether a fraudulent transaction has taken place, or identifying fraudulent transactions in a dataset that also contains valid ones, requires machine learning-based classification. Such a detection system helps segregate transactions by type, fraudulent or valid [33, 67]. However, different machine learning algorithms often give very different results on the same dataset, so it is sometimes advisable to use ensemble learning. Ensemble learning means using two or more machine learning algorithms and combining their results with some aggregation approach to obtain better results. One prominent form of ensemble learning for classification is the voting ensemble. In this approach, the predictions of multiple models are combined by voting, and the majority prediction is taken as the final predicted value of the combined model [71, 76].

In this paper, we propose an ensemble learning-based approach for credit card fraud detection. The proposed model is composed of multiple machine learning-based classification algorithms: Random Forest, Logistic Regression and KNN. Although a large number of machine learning algorithms could be used in this application, we selected these three to keep the proposed model simple and efficient, as they performed well across various parameters. As described earlier, ensemble learning can be used to handle the class imbalance problem by assigning weights to the results of different classifiers. In credit card fraud detection, the main focus is the detection of fraudulent transactions, i.e. positive samples. However, because few samples belong to the positive class, there is a high chance that a positive sample goes undetected. To overcome this problem, a weighted voting approach is used to aggregate the ensemble's models. The approach assigns different weights during the aggregation of the results in order to reduce the misclassification of positive samples. Both soft and hard weighted voting schemes are used to make the final predictions. Soft voting combines the class probabilities predicted by each model and selects the class with the highest total probability, whereas hard voting selects the class with the highest number of votes. The proposed approach also employs data-level class balancing techniques in the pre-processing stage, namely random oversampling (ROS) and random undersampling (RUS). A block diagram of the proposed credit card fraud detection framework is shown in Fig. 1.

Fig. 1 A block diagram of the proposed credit card fraud detection framework

2 Related work

With the advancement of technology and cashless transactions, commercial fraud and deception have also increased, especially in banking systems [14]. Credit card fraud is one of the most challenging types of fraud that banking systems face today. Manual detection of such fraud is tedious, and there is a high chance that fraudulent transactions go unnoticed [63]. To build automated credit card fraud detection systems, researchers have proposed many state-of-the-art approaches, including fraud detection based on statistical learning and data mining [31, 53, 60, 72]. In recent years, machine learning and deep learning-based approaches for fraud detection have been applied extensively [32].

Machine learning and deep learning play important roles in data processing across several fields [9, 27, 35, 36, 55], and both are being applied to credit card fraud detection. Awoyemi et al. [10] employed multiple machine learning models, including Naïve Bayes (NB), K-Nearest Neighbour (KNN) and Logistic Regression (LR), for credit card fraud detection on a highly skewed dataset. Their analysis showed that KNN performed best. In [38], an approach based on transaction behaviours was proposed in which multiple machine learning algorithms were evaluated; however, the results show that only the Random Tree and J48 models yielded satisfactory accuracy. The authors of [24] used various machine learning algorithms along with PCA for feature reduction and SMOTE for balancing the class distribution of the dataset. Several other approaches based on machine learning algorithms have also been proposed for credit card fraud detection [4, 39, 51, 63, 68]. Along with machine learning, deep learning has also been employed by several authors for credit card fraud detection [8, 47, 50]. In [37], LSTM-based transaction sequence classifiers were proposed. Alghofaili et al. [5] also proposed an LSTM-based deep learning model for the detection of financial fraud. The authors of [61] proposed a neural network architecture that is trained for a large number of iterations. In [26], a combination of a deep learning-based autoencoder and classifiers was proposed for credit card fraud detection. In [69], the authors discussed machine learning methods and the challenges in the process of credit card fraud detection.

The most difficult issue in this field today is class imbalance, i.e. the skewed distribution of classes in a dataset. The class imbalance problem usually degrades the efficiency of machine learning and deep learning-based classifiers. In credit card datasets, far fewer positive samples are available than legitimate transactions, which leads to a high rate of misclassification of positive samples. The authors of [32] have suggested various approaches for dealing with class imbalance, such as data-level methods, algorithm-level methods, and ensemble approaches. Data-level techniques are also called external strategies, as they manipulate the training data externally [63]. This is done by re-sampling the data to balance the distribution of instances in the majority and minority classes, and it is used to find fraudulent transactions and to improve overall accuracy [64]. Hybrid methods are another type of approach for dealing with the class imbalance problem [7]. Accordingly, there is consistent interest in devising new methods that can enhance classification accuracy. One idea is to use a group of classifiers rather than individual ones, in other words, to employ ensemble learning to deal with class imbalance.

The most popular strategies for building an ensemble of multiple classifiers were introduced with bagging [78] and boosting [49], in which several classifiers are used to produce a single result with improved accuracy. However, the classic form only uses a majority vote to aggregate the results of the individual classifiers. Ensemble learning is an important way to improve the performance of an ML-based model. The basic idea of an ensemble is to combine various classification models to enhance overall performance. The authors of [7] used an ensemble of Random Forest and Neural Network models. According to their observations, Random Forest classifies normal transactions correctly but misclassifies fraudulent transactions, whereas Neural Networks classify fraudulent transactions correctly but misclassify some normal transactions. Hence, an ensemble of the two offers a better solution. A significant challenge in applying ML and data mining is how to achieve the desired classification accuracy on data that is highly skewed in nature.

In recent years, various ensemble learning models for credit card fraud detection have been proposed. Sohony et al. [64] proposed an ensemble of random forest and neural network, which improved the detection of both fraudulent and normal instances. The authors of [57] employed multiple learning algorithms along with AdaBoost and majority voting. Their approach used a real-world credit card dataset, and the results show that majority voting performed well. In [56], a weighted voting ensemble approach was proposed in which the data is balanced using an undersampling technique, followed by feature selection using random forest. The selected features are then used by an ensemble classifier consisting of multiple machine learning and deep learning models. Xie et al. [74] proposed a heterogeneous ensemble model that gives equal focus to the classification of the positive and negative classes of credit card transactions. It also employs the KNN and K-Means algorithms for balancing the class distribution, followed by classification using a voting ensemble. In [25], an ensemble approach consisting of LSTM and AdaBoost was proposed in combination with SMOTE-ENN to enhance credit card fraud detection performance. Other authors have also applied different ensemble learning approaches to credit card fraud detection [11, 22, 42, 58]. A summary of various existing techniques for credit card fraud detection is presented in Table 1.

Table 1 Summary of state-of-the-art approaches for credit card fraud detection

In the state-of-the-art literature discussed above, different types of approaches based on data mining, machine learning, and deep learning have been used and have reported satisfactory performance. However, much of the existing literature does not focus on reducing the misclassification of fraudulent transactions, or tackles the problem mainly with data-level class balancing algorithms. Some authors have applied different ensemble learning strategies such as boosting, bagging, stacking, and voting. In the case of voting ensembles, most existing approaches rely on a single type of voting strategy. Although they have reported good performance, there is still considerable scope for improving the robustness of fraud detection. In this paper, we propose an ensemble approach comprising three simple base classifiers. The approach uses multiple types of voting strategies, namely hard and soft voting as well as weighted and unweighted voting, in order to reduce the misclassification of fraudulent credit card transactions.

3 Dataset

Credit card fraud detection using machine learning is a process in which a data science team analyses data and develops a model that gives the best results in detecting and preventing fraudulent transactions. This is accomplished by bringing together all significant features of card users' transactions. The dataset used in this work is taken from Kaggle (CCDataset 2018) and is publicly available. It includes credit card transactions made in September 2013 by European cardholders. The dataset consists of 31 features, and each transaction is labelled as one of two types: fraud (1) or legitimate (0). The publisher of the dataset applied PCA to the available features to obtain an equivalent feature set, excluding the time and amount attributes. This transformation of the features avoids revealing confidential information. The dataset contains 492 fraud cases among 284,807 transactions; the rest are valid.
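To make the degree of imbalance concrete, the following minimal sketch works through the arithmetic using only the two counts reported above. The variable names are illustrative, and the "all-legitimate" classifier is a hypothetical baseline introduced only to show why accuracy alone can be misleading on such data.

```python
# Class counts as reported in the text for the Kaggle dataset.
total_transactions = 284_807
fraud_transactions = 492
legit_transactions = total_transactions - fraud_transactions

# Share of fraudulent transactions in the whole dataset.
fraud_share = fraud_transactions / total_transactions

# Accuracy of a hypothetical classifier that labels every transaction
# as legitimate: it detects no fraud at all, yet scores very highly.
trivial_accuracy = legit_transactions / total_transactions

print(f"legitimate transactions: {legit_transactions}")
print(f"fraud share: {fraud_share:.4%}")
print(f"all-legitimate baseline accuracy: {trivial_accuracy:.4%}")
```

Under these counts, fewer than two transactions in a thousand are fraudulent, so a model can reach an accuracy above 99.8% without identifying a single fraud case.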

4 Methodology

In general, when developing a machine learning-based system, it is crucial to pre-process the data, which includes data balancing and feature selection, as this influences the overall performance of an algorithm. This step is equally important for an ensemble learning model. In ensemble learning, the main concern is the selection of suitable classifiers. In particular, choosing an appropriate set of classifiers and various subsets of the dataset can further enhance an algorithm’s performance [56]. In this work, since the data is highly imbalanced, we balance it using random undersampling (RUS) and random oversampling (ROS). The balanced dataset is then used to evaluate a weighted ensemble machine learning algorithm. The ensemble learning model consists of multiple machine learning models, as described below.

4.1 Machine learning

Machine learning plays a fundamental part in detecting fraud in the daily transactions made with credit cards. To predict such transactions, most banking systems use different machine learning methodologies. To enhance a detection model, past data is used and new features can also be utilized [23, 39]. In this paper, we have used the following machine learning-based techniques to distinguish fraudulent from legitimate credit card transactions:

  • K-Nearest Neighbours: KNN is an important classifier that belongs to the supervised learning category of machine learning. In this technique, a new sample is classified by the majority vote of its closest neighbours. The distance to the neighbours is estimated by a distance function such as the Euclidean, Manhattan, or Minkowski distance. The K value refers to the number of neighbours considered. If the K value is extremely low, the outcomes are less stable. On the other hand, increasing the K value may increase the error, but yields more stable outcomes. Accordingly, in the current work, the K value is chosen by trial and error so that no overfitting occurs [20].

  • Random Forest Classification: Random Forest is a bagging-type ensemble classifier that employs multiple decision trees, each trained on a different subset of the given dataset. It aggregates the outputs of the individual trees, by averaging or voting, to improve the results. Rather than depending on a single decision tree, the random forest takes the prediction from each of its trees and predicts the final result based on the majority of votes.

  • Logistic Regression: It is an important statistical technique for the analysis and classification of binary and proportional response data sets [41]. Logistic regression predicts the outcome of a categorical dependent variable. It is a parametric approach in which a hyperplane is identified by optimizing the coefficients associated with the features. The hyperplane estimated by the algorithm partitions the samples into two classes. The outcome must be a categorical or discrete value, such as Yes or No, 0 or 1, True or False, and so forth.
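The neighbour-voting idea behind KNN can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `minkowski` and `knn_predict` and the toy two-dimensional points are hypothetical and are not taken from the experiments in this paper.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train_X, train_y, query, k=3, p=2):
    """Label `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)),
                   key=lambda i: minkowski(train_X[i], query, p))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy, made-up points: label 0 = legitimate, label 1 = fraudulent.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_y = [0, 0, 0, 1, 1]

print(knn_predict(train_X, train_y, (0.95, 1.0), k=3))  # two of three neighbours are class 1
print(knn_predict(train_X, train_y, (0.05, 0.1), k=3))  # all three neighbours are class 0
```

With an odd K and two classes, as here, the vote cannot tie; other settings would need an explicit tie-breaking rule.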

4.2 Ensemble learning

The errors and predictions of any machine learning model are adversely affected by bias, variance, and noise. Ensemble models are used to counter these drawbacks. An ensemble model is a process in which different ML models, such as classifiers or experts, are deliberately generated and combined to solve a particular computational intelligence problem. Ensemble learning is mainly used to improve classification, prediction, function approximation, etc. [52]. The two most widely used ensemble techniques are bagging and boosting. Another popular way of aggregating the results of multiple learning models is the voting ensemble.

A voting ensemble is an ensemble ML model that combines the predictions from several models. It can be used to improve model performance, ideally achieving better performance than any single model in the ensemble. A voting ensemble can be used for classification and regression. In the case of regression, it computes the mean of the predictions from the models. In the case of classification, the predictions for each label are summed and the label with the majority vote is predicted [17]. The two types of voting are hard voting and soft voting.

  • Hard Voting: In the hard voting scheme for ensemble learning-based classification, the final predictions of all the classifiers are considered, and the class that receives the most votes from the models is predicted. For example, suppose three base classifiers are used to build an ensemble model and their results are aggregated using hard voting. If, for a given sample, two base classifiers predict the positive class and one predicts the negative class, the final prediction of the ensemble model for that sample is the positive class.

  • Soft Voting: Soft voting computes the mean of the predicted probabilities for each class label and predicts the class label with the highest mean probability. Here, the probability for each class is calculated by averaging the probabilities produced by all the base classifiers of the ensemble model.
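The two schemes can be contrasted with a short sketch in plain Python. The function names `hard_vote` and `soft_vote` and the probability values are illustrative assumptions; the numbers were chosen to show that the two schemes can disagree when one model is far more confident than the others.

```python
from collections import Counter

def hard_vote(labels):
    """Return the most frequent predicted label (the mode)."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average per-class probabilities over models; return (label, averages)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    averages = [sum(p[c] for p in probabilities) / n_models
                for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: averages[c])
    return label, averages

# Made-up outputs of three models as [p(class 0), p(class 1)].
probabilities = [[0.45, 0.55], [0.45, 0.55], [0.90, 0.10]]

# Each model's own label is its most probable class: here 1, 1 and 0.
labels = [max(range(2), key=lambda c: p[c]) for p in probabilities]

hard_label = hard_vote(labels)
soft_label, averages = soft_vote(probabilities)
print("hard voting:", hard_label)
print("soft voting:", soft_label, [round(a, 3) for a in averages])
```

In this toy case hard voting predicts class 1 because two of the three models lean slightly towards it, while soft voting predicts class 0 because the third model's high confidence dominates the averaged probabilities.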

4.3 Class imbalance techniques

The dataset plays an important role in the training of a machine learning algorithm. A huge amount of data is produced every minute, and filtering it into well-balanced, ready-to-use datasets free of discrepancies is a massive task in its own right. Often, the main classes differ significantly in the number of samples they contain. Credit card transaction datasets are very likely to be imbalanced. For instance, the dataset used in this work has a total of 284,807 transactions, of which only 492 are fraudulent, leaving 284,315 valid transactions [18]. Such a large imbalance biases the outcome towards valid transactions and increases the chance of misclassifying the positive class. To avoid this, we applied oversampling and undersampling to balance the two classes of the dataset.

The authors in [45] have summarized the related work in the field of imbalanced class problems. An imbalanced classification problem is a classification problem in which the distribution of samples across the known classes is uneven or skewed. The distribution can vary from a slight bias to a severe imbalance, where there is one sample in the minority class for hundreds, thousands, or even millions of samples in the majority class or classes. Imbalanced classification poses a challenge for predictive modelling, as most classification algorithms were designed around the assumption of an equal number of samples for each class. This results in models with poor predictive performance, specifically for the minority class. This is a problem because the minority class is typically the more important one, so the problem is more sensitive to classification errors for the minority class than for the majority class. The imbalance across the classes may be caused by the way the samples were collected or sampled from the problem domain [17].

Class imbalance issues appear in many domains, including fraud detection, spam filtering, and disease screening. The two types of techniques used extensively in various fields are oversampling and undersampling. In oversampling, the minority class samples are duplicated until their number is in a balanced ratio to the number of majority class samples. Oversampling can be carried out using ROS, the synthetic minority oversampling technique (SMOTE), adaptive synthetic sampling (ADASYN), etc. In undersampling, the number of majority class samples is reduced until it is in a balanced ratio to the number of minority class samples. Undersampling carries the risk of losing important data in the process of reducing the amount of data. It is carried out using RUS, Tomek links, cluster centroids undersampling, etc. In this paper, ROS and RUS are used for balancing the class distribution. The class imbalance issue in the dataset and the strategies for dealing with it are shown in Fig. 2.

Fig. 2 A block diagram of class imbalance handling approach
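Random oversampling and random undersampling as described above can be sketched without any external library. The functions below are a minimal illustration, assuming a binary problem in which label 1 is the smaller class and that both classes are brought to equal size; the function names, the fixed seed, and the toy data are hypothetical and do not reproduce the implementation used in the experiments.

```python
import random

def random_oversample(X, y, minority=1, seed=0):
    """ROS: duplicate random minority samples until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    n_extra = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = majority_idx + minority_idx + extra
    return [X[i] for i in keep], [y[i] for i in keep]

def random_undersample(X, y, minority=1, seed=0):
    """RUS: keep a random majority subset of the same size as the minority class."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    keep = rng.sample(majority_idx, len(minority_idx)) + minority_idx
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 8 legitimate (0) and 2 fraudulent (1) samples with made-up features.
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_ros, y_ros = random_oversample(X, y)
X_rus, y_rus = random_undersample(X, y)
print(len(y_ros), sum(y_ros))  # 16 samples, 8 of them fraudulent
print(len(y_rus), sum(y_rus))  # 4 samples, 2 of them fraudulent
```

The sketch also makes the trade-off visible: ROS keeps every original sample but repeats minority samples, whereas RUS discards six of the eight majority samples in this toy case.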

4.4 Proposed work

Every year, enormous financial losses are caused by fraudulent credit card transactions. Moreover, as technology advances, criminals find new ways to commit fraud. There is therefore a need to tackle fraud detection in credit card transactions. In this paper, we propose a voting ensemble learning approach for credit card fraud detection. The proposed approach is evaluated on a highly imbalanced credit card transactions dataset. The proposed ensemble model is built from Random Forest, Logistic Regression and KNN classifiers, and the results of these models are aggregated using a weighted voting scheme to obtain better results. As described earlier, the dataset available for credit card fraud suffers from a class imbalance problem, which is handled using undersampling and oversampling techniques. The proposed approach works as follows:

  • First, data pre-processing is performed. In this phase, we address the class imbalance within the dataset using the RUS and ROS approaches. The balanced data is then used to train and test the proposed ensemble classification model.

  • Next, we train multiple machine learning models to form the ensemble. The results of these models are aggregated using a voting scheme. The voting scheme applies the weights in two ways: hard voting and soft voting.

  • To counter drawbacks such as noise and variance, multiple weighted voting ensembles with different combinations of weights are used. The weighted voting ensemble model enhances the overall performance of credit card fraud detection by combining the classification results of all the classifiers and choosing the class with the highest vote based on the weights assigned to each classifier.

When different machine learning algorithms are combined in a voting ensemble model, one of them may perform better than the others. To account for this, we can assign a different weight to each classifier. Let M1, M2, and M3 be three base classification models with assigned weights W1, W2, and W3, respectively. In this paper, we represent the weight assignment as < W1, W2, W3 >. A weight assignment of < 1, 1, 1 > means that the vote of each model counts equally, whereas combinations such as < 2, 1, 1 >, < 1, 2, 1 > and < 1, 1, 2 > mean that different weights are assigned to the classifiers. The weighted majority across the three classification models is taken as the final predicted value of the ensemble model. A block diagram of the proposed weighted voting ensemble model is shown in Fig. 3.

Fig. 3 Proposed voting ensemble model architecture

Let us assume the valid samples of a credit card transaction dataset are labelled as 0, and fraudulent transactions are labelled as 1. The proposed hard and soft voting schemes can be described as follows:

  • Hard voting: Hard voting is simple voting in which the class predicted by the majority of the models is taken as the final classification. The predicted class y is thus the mode of the predictions of the models m_1, ..., m_j.

    $$y=\operatorname{mode}\{m_1(x), m_2(x), m_3(x), \dots , m_j(x)\}$$

    Suppose the three classification models give the following predictions for a transaction:

    Model 1 -> Class 1, Model 2 -> Class 1, and Model 3 -> Class 0.

    Now, if we want to predict class label y, we will take the mode of all predictions.

    Therefore, y = mode {1, 1, 0} = 1.

    This means that the predicted result for the transaction is fraudulent.

  • Soft voting: On the other hand, in soft voting, the probability of occurrence of both classes is calculated, and then a final classification is done.

    For example, the probability of occurrence of class 0 and class 1 for a transaction is given as {p(0), p(1)} and the results of three models are given as follows:

    • Model 1 -> {0.1, 0.9}

    • Model 2 -> {0.2, 0.8}

    • Model 3 -> {0.6, 0.4}

    Now, overall p(0) = (0.1 + 0.2 + 0.6)/3 = 0.3

    Overall p(1) = (0.9 + 0.8 + 0.4)/3 = 0.7

    Here the overall probability of class 1 is higher, so the transaction is classified as class 1, i.e. fraudulent.
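The worked examples above use equal weights. As a hedged sketch of how the weight assignments < W1, W2, W3 > could enter the two schemes, the code below assumes the conventional definitions: in weighted hard voting each model's vote counts W times, and in weighted soft voting the class probabilities are averaged with the weights. The function names and the tie-breaking rule (the smaller class label wins) are assumptions made for illustration, not details stated in this paper.

```python
def weighted_hard_vote(labels, weights):
    """Sum the weights of the models voting for each label; the heaviest label wins.
    Ties are broken in favour of the smaller label (an assumption)."""
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0) + weight
    return max(sorted(totals), key=totals.get)

def weighted_soft_vote(probabilities, weights):
    """Weighted average of per-class probabilities; return (label, averages)."""
    total_weight = sum(weights)
    n_classes = len(probabilities[0])
    averages = [
        sum(w * p[c] for p, w in zip(probabilities, weights)) / total_weight
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: averages[c])
    return label, averages

# Model outputs taken from the worked examples in the text.
labels = [1, 1, 0]
probabilities = [[0.1, 0.9], [0.2, 0.8], [0.6, 0.4]]

for weights in ([1, 1, 1], [2, 1, 1], [1, 2, 1], [1, 1, 2]):
    hard = weighted_hard_vote(labels, weights)
    soft, averages = weighted_soft_vote(probabilities, weights)
    print(weights, "hard:", hard, "soft:", soft, [round(a, 3) for a in averages])
```

With weights < 1, 1, 1 > the sketch reproduces the values 0.3 and 0.7 computed above. With < 1, 1, 2 > the weighted hard vote is tied at two against two, so its outcome depends entirely on the assumed tie-breaking rule, while the weighted soft vote still favours class 1.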

5 Experiment and result analysis

5.1 Experimental setup

The implementation and execution of the proposed work were carried out on an Apple MacBook Pro with a 2.3 GHz dual-core Intel Core i5 processor. The Python code was executed in Jupyter Notebook from the Anaconda distribution. The versions of Python and its libraries used in this work are Python (3.7.11), NumPy (1.21.2), Pandas (1.3.5), SciPy (1.7.3), Matplotlib (3.5.0), Seaborn (0.11.2), Imblearn (0.0), and Sklearn (0.24.1). The results of the proposed work are recorded using accuracy and confusion matrices. Accuracy records the overall performance of the proposed work, whereas the confusion matrices show the true and false classifications of positive and negative samples.
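For reference, the two evaluation measures can be computed as in the following minimal sketch, written in plain Python rather than with the libraries listed above. The function names and the label vectors are made up for illustration and are not results from this work.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for binary labels, with `positive` as the fraud class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # fraud correctly detected
        elif truth == positive:
            fn += 1  # fraud missed
        elif pred == positive:
            fp += 1  # valid transaction flagged as fraud
        else:
            tn += 1  # valid transaction correctly passed
    return tn, fp, fn, tp

def accuracy(tn, fp, fn, tp):
    """Fraction of all samples that were classified correctly."""
    return (tn + tp) / (tn + fp + fn + tp)

# Made-up labels for illustration only: 0 = valid, 1 = fraudulent.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print("confusion matrix [[tn, fp], [fn, tp]]:", [[tn, fp], [fn, tp]])
print("accuracy:", accuracy(tn, fp, fn, tp))
```

In this toy case the accuracy is 0.8 even though one of the three fraudulent samples is missed, which is why the confusion matrix is reported alongside accuracy.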

5.2 Result analysis

The dataset used in this work consists of 284,807 transactions, of which 284,315 are valid and only 492 are fraudulent. To deal with this imbalance, we employed the ROS and RUS techniques. The machine learning algorithms used in the voting ensemble model are random forest, logistic regression and KNN. The weights for the three algorithms, < w1, w2, w3 >, are given in the same order: w1 is the weight for random forest, w2 the weight for logistic regression, and w3 the weight for KNN. Both soft and hard voting schemes were applied in order to obtain better results and to understand which configuration works best. The tabulated results show the accuracies obtained using the weighted voting ensemble on the balanced dataset.

After oversampling and undersampling, the dataset is used for training and testing the selected base classifiers. The hard and soft weighted voting ensembles are then evaluated. The best accuracy is obtained with the assigned weights < 2, 1, 1 > for both soft and hard voting, where the class distribution is balanced by ROS. The training and testing accuracies for soft and hard voting are recorded as 100% and 99.99%, respectively. The worst performance in terms of accuracy is obtained with the hard voting strategy and assigned weights < 1, 1, 2 >, where the class imbalance problem is handled by RUS. The training and testing accuracies achieved in this case are 87.2% and 79.05%, respectively. A detailed summary of the performance of the different weight configurations in terms of training and testing accuracy is presented in Table 2.

Table 2 Training and testing accuracies of hard and soft weighted ensemble implementation

To measure the misclassification of fraudulent and valid transaction samples, we have plotted the confusion matrices for all combinations of the weighted hard and soft voting ensemble models evaluated in this paper. Figures 4, 5, 6, 7, 8, 9, 10 and 11 show the confusion matrices for the training and testing sets where the class imbalance problem is handled using the ROS approach. In some cases of ROS, the proposed approach has produced zero false predictions. For instance, with assigned weights < 2, 1, 1 > and soft voting, zero false predictions are achieved for the training set and for the fraud samples during testing, as shown in Fig. 4. Similarly, as shown in Fig. 5, with the hard voting strategy zero false predictions are achieved for the testing set and for the fraud class during training. With the assigned weights < 1, 2, 1 > and soft voting, the proposed model also produced zero false predictions for fraudulent samples in both the training and testing phases, as shown in Fig. 6. However, the misclassification of negative samples has increased compared with the previously mentioned cases. With assigned weights < 1, 2, 1 > and hard voting, the proposed model has not performed well, as shown in Fig. 7. Apart from this scenario, the other assigned weights < 1, 1, 2 > and < 1, 1, 1 > have performed well in both the soft and hard voting schemes. As shown in Figs. 8, 9, 10 and 11, zero false predictions are recorded for fraudulent transactions.

Fig. 4 Confusion matrices for training and testing sets with assigned weights < 2, 1, 1 > and soft voting in combination with ROS

Fig. 5 Confusion matrices for training and testing sets with assigned weights < 2, 1, 1 > and hard voting in combination with ROS

Fig. 6 Confusion matrices for training and testing sets with assigned weights < 1, 2, 1 > and soft voting in combination with ROS

Fig. 7 Confusion matrices for training and testing sets with assigned weights < 1, 2, 1 > and hard voting in combination with ROS

Fig. 8 Confusion matrices for training and testing sets with assigned weights < 1, 1, 2 > and soft voting in combination with ROS

Fig. 9 Confusion matrices for training and testing sets with assigned weights < 1, 1, 2 > and hard voting in combination with ROS

Fig. 10 Confusion matrices for training and testing sets with assigned weights < 1, 1, 1 > and soft voting in combination with ROS

Fig. 11 Confusion matrices for training and testing sets with assigned weights < 1, 1, 1 > and hard voting in combination with ROS

Figures 12, 13, 14, 15, 16, 17, 18 and 19 present the confusion matrices for the training and testing sets where the class imbalance problem is handled using the RUS approach, followed by multiple combinations of weighted soft and hard voting schemes. The target in all these experiments was again to attain zero false predictions. The results are clearly and consistently better for ROS than for RUS. This is understandable: reducing the majority class to the size of the minority class does not always work well, because there is always a chance of discarding important data. However, for some of the assigned weights, satisfactory results are achieved. In the case of assigned weights < 2, 1, 1 > and soft voting, very few misclassifications of valid transactions are recorded compared with misclassifications of fraudulent transactions, as shown in Figs. 12 and 13. For the other combinations of weights and voting schemes, few negative samples are misclassified, but the number of misclassified fraudulent transactions is very high. The worst result is recorded for the weights < 1, 1, 2 > and the hard voting scheme, as shown in Fig. 17.
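
The difference in how much data the two balancing strategies keep can be sketched with the standard library alone. The functions below are a hypothetical illustration of random oversampling and random undersampling, not the Imblearn routines used in this work; the toy class sizes and the seed parameter are assumptions.

```python
import random


def _group_by_class(X, y):
    """Collect the rows of X under their class label."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        X_out.extend(rng.sample(rows, target))  # all other rows are discarded
        y_out.extend([label] * target)
    return X_out, y_out


# Toy data (hypothetical): 20 valid rows (label 0) and 3 fraud rows (label 1).
X = list(range(23))
y = [0] * 20 + [1] * 3

X_ros, y_ros = random_oversample(X, y)
X_rus, y_rus = random_undersample(X, y)
print(len(X_ros), y_ros.count(0), y_ros.count(1))
print(len(X_rus), y_rus.count(0), y_rus.count(1))
```

With 492 fraudulent samples, undersampling in this way would retain only 2 × 492 = 984 transactions, whereas oversampling keeps every valid transaction and repeats fraudulent ones.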

Fig. 12 Confusion matrices for training and testing sets with assigned weights < 2, 1, 1 > and soft voting in combination with RUS

Fig. 13 Confusion matrices for training and testing sets with assigned weights < 2, 1, 1 > and hard voting in combination with RUS

Fig. 14 Confusion matrices for training and testing sets with assigned weights < 1, 2, 1 > and soft voting in combination with RUS

Fig. 15 Confusion matrices for training and testing sets with assigned weights < 1, 2, 1 > and hard voting in combination with RUS

Fig. 16 Confusion matrices for training and testing sets with assigned weights < 1, 1, 2 > and soft voting in combination with RUS

Fig. 17 Confusion matrices for training and testing sets with assigned weights < 1, 1, 2 > and hard voting in combination with RUS

Fig. 18 Confusion matrices for training and testing sets with assigned weights < 1, 1, 1 > and soft voting in combination with RUS

Fig. 19 Confusion matrices for training and testing sets with assigned weights < 1, 1, 1 > and hard voting in combination with RUS

6 Limitations and future work

The results of the proposed work show that the voted ensemble learning model performs well for credit card fraud detection. For some of the assigned weights, the model produced zero false classifications for both the training and testing sets. However, the proposed model has limitations in terms of the dataset, the approach for handling the class imbalance problem, the selection of base classifiers used to build the ensemble, and the limited combinations of weights assigned in the voting schemes. In terms of data, the main obstacle to building an efficient credit card fraud detection model is the limited availability of data. Most of the existing datasets available in the public domain have limited numbers of features and samples. In this paper, the proposed model has been evaluated using only one dataset [18], which may result in poor generalization to unseen credit card transactions. For better generalization, the model needs to be evaluated using other existing datasets [66, 70, 72] as well as a real credit card dataset [53].

To handle the class imbalance problem, ROS and RUS have been applied in this work. The performance of the proposed model could be enhanced by combining it with different oversampling methods, undersampling methods, combinations of oversampling and undersampling, and other similar methods [63]. Apart from data-level algorithms, cost-sensitive approaches [3, 65, 75] can also be applied in combination with the proposed ensemble learning.

In developing an ensemble learning model, the selection of base classifier(s) plays an important role. Finding a suitable combination is a challenging task, as a combination may perform well for a particular problem and yet fail to generalize to other problems. In this paper, a limited number of base classifiers are employed, which can be considered an advantage in terms of architectural simplicity and faster training of the model. However, this is also a limitation of the proposed work. A bigger or different combination of base classifiers may result in better performance. In future work, other classifiers such as Support Vector Machine, Decision Tree, Naive Bayes, Neural Networks, LSTM, Generative Adversarial Networks, etc. can be combined in an ensemble to enhance the credit card fraud detection task. Ensemble learning enhances classification accuracy at the cost of interpretability [15]. To deal with this issue, Explainable Artificial Intelligence (XAI) based credit card fraud detection models can be developed [43, 54].

In this paper, soft and hard voting schemes have been used to aggregate the results of the base classifiers. The final prediction uses one of several combinations of weights assigned to the base classifiers. The weights used in this work are limited to a few integer values, i.e. 1 and 2. More combinations of integers, as well as fractional weights, can also be used to enhance the performance of the proposed model. Other aggregation strategies [59] apart from voting may also be employed and evaluated to check the performance of the proposed model.
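
One possible way to generate a larger set of candidate weights is sketched below. It is only an illustration of the idea, not part of the proposed work: the function name and the chosen weight values are hypothetical. The sketch skips weight vectors that are proportional to one already listed, since scaling all weights by the same factor does not change a weighted vote.

```python
from fractions import Fraction
from itertools import product


def candidate_weights(values, n_classifiers=3):
    """List weight vectors over `values`, skipping proportional duplicates."""
    seen = set()
    result = []
    for combo in product(values, repeat=n_classifiers):
        total = sum(Fraction(v) for v in combo)
        # Normalise so that, e.g., (1, 1, 1) and (2, 2, 2) share one key.
        key = tuple(Fraction(v) / total for v in combo)
        if key not in seen:
            seen.add(key)
            result.append(combo)
    return result


# The integer values used in this paper.
for w in candidate_weights((1, 2)):
    print(w)

# A wider grid that also contains a fractional weight (hypothetical choice).
wider = candidate_weights((Fraction(1, 2), 1, 2))
print(len(wider))
```

The grid over the values 1 and 2 contains the four weight vectors evaluated in this paper together with three further combinations in which two classifiers receive the higher weight.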

7 Real-world applications

Fraud related to credit cards is increasing with the development of e-commerce, digital payment, and related technology. It costs customers and financial organizations billions of dollars. Manual and traditional fraud detection systems may not be efficient enough, as fraudsters are also developing new ways and techniques to commit illegal transactions. To meet this challenge, banking organizations require an advanced automated fraud detection system. Although, as discussed in the limitations section, the proposed model needs improvement in terms of generalization to catch unseen real-world transactions, it can be helpful in developing automated credit card fraud detection systems. Since it employs machine learning, the model can also improve with experience. Apart from credit card fraud detection, the proposed model can be used for developing applications to tackle other monetary frauds in banking and financial systems. It can also be applied in other fields where numerically labelled datasets are available, such as insurance, the stock market, medical and health care, cybersecurity, etc.

8 Conclusion

This study contributes to the existing literature on credit card fraud detection by proposing a voted ensemble learning approach. The proposed approach is an ensemble of three base classifiers, namely Random Forest, Logistic Regression and KNN. The results of these models are aggregated using weighted hard and soft voting techniques. To obtain the final prediction of whether a credit card transaction is fraudulent or legitimate, an aggregated value is calculated using various combinations of weights assigned to the predictions of the three base classifiers. The proposed ensemble model is evaluated using a highly imbalanced dataset. The class imbalance problem has been handled during pre-processing using two approaches, random oversampling and random undersampling. The main focus of this study is to reduce the false prediction of fraudulent transactions. The proposed model has performed well in terms of accuracy and has also minimized false predictions. The highest training and testing accuracies achieved by the proposed model are 100% and 99.99%, respectively, obtained when ROS is applied and the highest weight is assigned to the results of Random Forest. Most of the other combinations of assigned weights for both soft and hard voting schemes have also produced satisfactory results, except in the case where the highest weight is assigned to the results of the KNN classifier and the class distribution is balanced using RUS. The proposed ensemble learning approach can be further enhanced by using other combinations of classification algorithms and different combinations of weights.