Abstract
As the usage of credit cards has become more common in health care applications of everyday life, banks have found it very difficult to detect credit card fraud (CCF) systematically. The fraudulent activities should be identified and detected using new techniques. As a result, machine learning (ML) can help detect CCF in transactions while reducing the strain on financial institutions. This research aims to improve cybersecurity measures by detecting fraudulent transactions in datasets. The new classifier strategies cluster and classifier-based decision tree (CCDT), cluster and classifier-based logistic regression (CCLR), and cluster and classifier-based random forest (CCRF) are modeled in this research. The proposed strategies are applied to detect fraudulent health care activities. This research performed the preprocessing through the feature extraction, sampling, and transformation stages, and the proposed classifiers are simulated, and the results are analyzed. The significant results expected range of the proposed classifiers over the other methods are accuracy—(99.95%, 99.97%), precision—(99.96%, 99.98%), sensitivity—(99.9%, 100%), specificity—(99.8%, 100%). The parameters \(\mu \), location, the binary variable, cluster size, and decision tree sampling observations affect the classifiers' performance. CCRF and CCLR obtain the expected significant results than other existing methods.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
The use of credit cards has increased as the world moves toward digitization and money transactions become paperless. Credit card utilization has started to increase among all customers due to the necessary and urgent requirements. When making an online purchase, many consumers prefer to use credit cards (Tran and Dang 2021). Credit cards assist us in making purchases even if we do not have the necessary cash. Unfortunately, it appears that fraudsters are keeping track of these aspects and are even succeeding in exploiting them in this evolving environment. Today, fraudsters can be creative, intelligent, and fast, so fraud activities involving credit cards have also been on the rise, resulting in significant losses for individuals and financial institutions (Li et al. 2021). Credit card fraud (CCF) occurs when someone uses another person's credit card or account details to make illegal purchases or use the fund. Most online fraud transactions were made remotely only using credit card data. In most cases, the credit cardholder is unaware that their card information has been stolen and used by someone else.
Since online transactions increase every month, there is a significant increase in fraudulent operations. CCF is one of the most problematic, so we must design new strategies to detect it. Many fraud detection methods and soft computing strategies are analyzed to minimize the effects of CCF. These methods and strategies are trained on the earlier transactions to predict the newer ones. The ML strategies work well when the distribution of dataset classes is balanced. Several methods like ensemble, data, and algorithmic level strategies are developed to solve when the datasets are not balanced. The reinforcement learning strategy classifies the imbalance distribution, the problem is formulated using linear decision-making, and Q-learning is applied.
The contribution of this research is to improve cybersecurity by detecting fraudulent transactions in large-scale datasets using the new classifier strategies such as cluster and classifier-based decision tree (CCDT), cluster and classifier-based logistic regression (CCLR), and cluster and classifier-based random forest (CCRF). The proposed classifiers are applied in the detection of fraudulent health care activities. This research performed the preprocessing through the feature extraction, sampling, and transformation stages, and the proposed classifiers are simulated, and the results are analyzed.
This research article is structured as follows. The critiques of the literature survey are discussed in Sect. 2. The proposed classifiers are explained in Sect. 3. The simulation outcomes are analyzed and compared with the other methods in Sect. 4. The conclusions with the future scope are drawn in Sect. 5.
2 Literature survey and critiques
CCF significantly affects the financial industry and daily life. Fraud can weaken the public's trust in the institution (Fatima et al. 2021). As a result, we must analyze and distinguish between fraudulent and non-fraudulent transactions. Different strategies are developed in the literature that follows the pattern of all transactions and identifies the fraudulent ones to solve this problem. Techniques such as normalization-based clustering are developed to minimize the clustering attributes. The unsupervised methods are designed to detect fraud. The Bayesian-based sensitive method is developed with cost optimization measures. The computing methods such as artificial intelligence (AI), genetic algorithms (GA), data mining, sequence alignment, and genetic programming are also developed to minimize the risks (Hoang et al. 2020; Marappan and Sethumadhavan 2018, 2020; Belmonte et al. 2020).
The datasets are balanced using synthetic and sampling methods, and ML, RF, KNN, and DT LR are applied to training. Some additional classifiers are introduced using boosting and neural networks (NN). The most critical issues are only when the data are not balanced. CCF results in unexpected loss for companies and customers; hence, optimal methods are expected to prevent and detect CCFs. The reliable expectations are obtained using kRNNs and Naive Bayes (NB) methods. The regression is applied with ensemble classifiers, nearest neighbors, and sampling methods. The transactions of CCF databases are identified using neuroadaptive, Markov, and stochastic methods. Anomaly detection is also applied for detecting CCFs. The divide and conquer strategy is applied with the entropy measure and hyperparameters to convert the problem into a balanced one. The classifiers' performance is improved using overlapping, and R-value feature selection approaches (Bhaskaran and Marappan 2021; Dang et al. 2021).
The probabilistic RF with autoencoder method utilized the low-dimensional features extraction and applied it for imbalanced datasets (Lin and Jiang 2021). Some categorical attributes with multiple domains as high-cardinality attributes are there in credit card transactions. The domain reduction method is proposed using FFNN to reduce the size of attributes (Carneiro et al. 2022). Sequential fraud detection is achieved using SVM and isolation forest methods (Sharma et al. 2021). The neural network (NN) is applied with the hybrid resampling technique to detect the public datasets' fraud (Esenogho et al. 2022). The hybrid method is developed to identify fraud using XGBoost (Dalal et al. 2022). This method applied different classifiers using ML, but the resources are not centralized, and the constraints are unique. The NN model is developed using LSTM with a linear data model and attention strategies (Benchaji et al. 2021). The evolutionary optimization with support vector data description is developed in the parameters optimization to obtain good accuracy (Mniai and Jebari 2022). The method does not consider the selection features and the integration framework. The fraud losses and FNRs are reduced using the DL algorithms (Alarfaj et al. 2022). ML- and AI-based heuristics and local search strategies are applied to detect CCFs (Jain et al. 2022; Trivedi et al. 2020). These models' accuracy, recall, and precision measures are further improved using new strategies.
The following are the major drawbacks of the existing models:
-
More significant differences between the negative and positive classes count.
-
The evolution of fraud characteristics through data shift.
-
The oversight of linear resources in between the adjacent transactions.
Thus, the design strategies are required to fulfill the following criteria:
-
CCF activities identification and risk reduction in financial sectors.
-
Improve the performance of unbalanced classifiers.
-
Extraction of the credit card's low- and high-dimensional features to produce a better probabilistic classification.
-
Optimal selection of good categorical attributes in the domain reduction.
-
Design of classifiers to detect true negative (TN) and true positive (TP) values.
-
Operate the model on imbalanced datasets to improve accuracy.
3 Proposed model
This section focuses on the proposed classifiers' novelty, architectural components, and algorithms. The notations used in the proposed model are defined in Table 1 (Belmonte et al. 2020; Marappan and Sethumadhavan 2020; Bhaskaran and Marappan 2021; Dang et al. 2021). The architectural components are designed to interact with each other to achieve the expected novelty.
3.1 Novelty of the proposed model
The proposed model is developed using the following novelty and main contributions:
-
Hybrid classifier and clustering strategy: This strategy applies the classification for the classifier induction using stochastic centroid clusters to preprocess the data further to obtain better measures.
-
Hybrid method in CCDT, CCRF and CCLR: The probabilistic hybrid distribution-based hierarchical and density-based clusters are applied in the model for better validation with measures.
-
Classifier-based sampling strategy: This strategy is applied to classify non-fraud and fraud labels. Probabilistic sampling is applied with descriptive and element classifiers to improve the measures of classification outcomes.
-
Preprocessing strategy: Applying the initial process using feature extraction, multivariate probabilistic sampling, and probabilistic transformation to all the classifiers.
3.2 Architectural components and classifiers
The architecture of the proposed model involves some components—preprocessing through feature extraction and stochastic transformations, splitting and training the datasets in the ratio of 75% training and 25% testing, probabilistic clustered classification, and measures evaluation. The components are required to develop a new model to identify the CCFs using new strategies that play a role in fraud detection since they are frequently used to extract hidden information from the large-scale dataset. The architecture includes examining and preprocessing data sets and applying ML to analyze credit card spending patterns and identify fraudulent transactions. The proposed classifiers target improving cybersecurity by detecting fraudulent transactions in the dataset using new classifier strategies such as CCDT, CCRF, and CCLR. The preprocessing is performed through feature extraction, multivariate probabilistic sampling, and probabilistic transformation stages. The preprocessing operations are applied in all three classifiers at the initial stage. The overall flow diagram of the proposed model is sketched in Fig. 1.
The architectural components of the proposed model involve the following:
-
Preprocessing the datasets.
-
Selection of the model.
-
Split the dataset.
-
Training the model.
-
Update the cluster-based classifiers.
-
Detecting the frauds.
-
Analyzing the model.
-
Evaluate the accuracy.
The preprocessing of the dataset involves the following operations:
-
Import the dataset.
-
Search and remove the null values.
-
Apply the feature extraction, multivariate sampling, and probabilistic transformation stages.
-
Update the dataset.
The classifiers are created using the following operations:
-
Extracting the test set from the historical data.
-
Apply feature extraction.
-
Train the test dataset.
-
Model the training.
-
Examine the model predictions.
-
Apply to stream.
-
Deploy the model.
-
Predict the model.
The CCLR algorithm for binary classification is defined in Algorithm 1. This algorithm operates on the preprocessed dataset using the supervised strategy. This algorithm returns the probability of a binary variable. The standard logistic curve is shown in Fig. 2, and the LR is sketched in Fig. 3. The algorithm determines the expected clusters and predictors. The probability of a response, \(p\left(x\right)\) is calculated for all clusters and predictors.
Algorithm 1: CCLR |
---|
1: Apply the preprocessing operation through feature extraction, multivariate sampling, and probabilistic transformation stages. |
2: Define the number of clusters and predictors. |
3: Determine the probability of a response for a given variable using |
\(p\left(x\right)=1/(1+\mathrm{exp}(x-\mu )/a)\) (1) |
4: Update \(p\left(x\right).\) |
\(p\left(x\right)=1/(1+\mathrm{exp}(-x{\beta }_{1}-{\beta }_{0})\) (2) |
5: Determine \(p\left(x\right)\) for all clusters and predictors. |
The CCRF algorithm for classification is defined in Algorithm 2. The DT-based RF is an ensemble-based method that includes many DTs, as sketched in Fig. 4. Several outcomes are obtained for every DT in the forest. This algorithm constructs several trees, and the equivalent classes are built as a DT using the posterior probability\(, p\left(a/x\right)\). All outcomes are merged at the end to obtain stable and accurate predictions.
Algorithm 2: CCRF |
---|
1: Apply the preprocessing operation through feature extraction, multivariate sampling, and probabilistic transformation stages. |
2: Define the number of clusters and predictors. |
3: Randomly extract the samples from the training subsets. |
4: Train the individual tree. |
5: Construct the decision tree based on the feature set. |
6: Determine the posterior probability using |
\(p\left(a/x\right)=p\left(a\right).p(x/a)/p(x)\)(3) |
7: Determine the final class for all clusters and predictors. |
8: Obtain stable and accurate predictions. |
The CCDT algorithm for problem classification is defined in Algorithm 3. The structure of the DT elements is depicted in Fig. 5. The CCDT is constructed using the predictors and clusters. The decision and association rules are applied to optimize the constructed DT. Finally, the classification and knowledge inference rules are optimized.
Algorithm 3: CCDT |
---|
1: Apply the preprocessing operation through feature extraction, multivariate sampling, and probabilistic transformation stages. |
2: Define the number of clusters and predictors. |
3: Construct the cluster-based DT. |
4: Apply the decision and association rules. |
4: Optimize the constructed DT. |
5: Optimize the classification and knowledge inference rules. |
4 Simulation and analysis
This section focuses on the datasets, results and analysis, and comparison with other methods of the proposed classifiers.
4.1 Datasets
This project applied the dataset of CCF detection from Kaggle.com, which contains two-day credit card transaction details of people from Europe. The dataset contains 31 attributes, including amount, class, and time. The features of this dataset are as follows: (labels, class—0 & 1), (columns, 31), (missing values, none), (rows, 284,807), (features, 30), (type, object). The significant attributes of the datasets are based on the principal components, numeric variables, amount, time, and class. Due to the card payment and industry data security standards, the original data of credit card users must be masked before being published due to confidentiality. The proposed model is implemented using Python. The simulation parameters are defined in Table 2.
4.2 Results and analysis
The proposed model is simulated on the benchmark dataset, and the target attribute is analyzed and sketched in Fig. 6. This diagram consists of the number of genuine and fraudulent transactions in the dataset plotted using the class attribute. From the plot, we can understand that the dataset's fraudulent transaction is much fewer than genuine ones. The performance metrics are evaluated using the measures—true negative (TN), true positive (TP), false negative (FN), and false positive (FP). The proposed strategies are evaluated using the following metrics.
The proposed model is simulated, and the experimental results are analyzed for the metrics defined from (4) to (7). The parameters \(\mu \), location, binary variable, cluster size, and sampling observations of the decision tree affect the performance of the classifiers. The typical expected range for \(\mu \) (0.1, 0.5), x (0.5, 0.8), cluster size (5, 15), and sampling observations (100, 500). The most used measures to evaluate CCF detection are accuracy, TN rate (TNR), TP rate (TPR), and Matthews correlation coefficient (MCC) (Lin and Jiang 2021; Carneiro et al. 2022).
4.3 CCDT, CCRF, CCLR matrix analysis
The histogram of the fraud class for the imbalanced dataset is shown in Fig. 6. This diagram depicts the classes on the X-axis and the frequency on the Y-axis. The CCDT confusing matrix is sketched in Fig. 7. The CCRF confusion matrix is sketched in Fig. 8. The outcome of CCLR is sketched in Fig. 9.
4.4 Comparison of results
The accuracy, precision, sensitivity, and specificity comparison of the proposed model with other methods are shown in Figs. 10, 11, 12 and 13, respectively. The following inferences are obtained from the experimental results and comparison with other methods (Dang et al. 2021; Alfaiz and Fati 2022; Malik et al. 2022):
-
The accuracy values of proposed strategies to detect CCF are incredibly high.
-
TP values are much smaller compared to TN values.
-
Proposed methods are expected to detect more positive samples than negative samples.
-
A reliable degree of performance measures is obtained compared to other methods.
-
CCRF and CCLR provide good results over other methods.
-
The significant results expected range of the proposed classifiers over the other methods are accuracy—(99.95%, 99.97%), precision—(99.96%, 99.98%), sensitivity—(99.9%, 100%), specificity—(99.8%, 100%). These results are competitive to the state-of-the-art-of methods (Belmonte et al. 2020; Dang et al. 2021; Alfaiz and Fati 2022; Malik et al. 2022).
-
The optimal measure of MCC is > 0.85 for the threshold (0.15, 0.75). The expected MCC is 0.85 compared to AE-PRF (Lin and Jiang 2021). When TPR becomes higher, more fraudulent transactions are identified. The expected MCC to achieve a better TPR is (0.5, 0.6) compared to the probabilistic classification (Lin and Jiang 2021). The AUC ranges from (0.96, 0.98) for the different cluster sizes in (5, 25), and better AUC is obtained over AE-PRF (Lin and Jiang 2021).
5 Conclusions and future work
CCF is undoubtedly a form of criminal activity. To minimize the impact, in this research, various ML techniques are evaluated to determine fraud in a dataset and how ML can be utilized to improve CCF detection. This research compared CCDT, CCRF, and CCLR methods on credit card datasets and analyzed them. The accuracy values of proposed strategies to detect CCF are incredibly high. The reliable degree of performance measures is obtained compared to other methods. CCRF and CCLR provide good results over other methods. The significant results expected range of the proposed classifiers over the other methods are accuracy—(99.95%, 99.97%), precision—(99.96%, 99.98%), sensitivity—(99.9%, 100%), specificity—(99.8%, 100%). The parameters \(\mu \), location, the binary variable, cluster size, and decision tree sampling observations affect the classifiers' performance. CCRF and CCLR obtain the expected significant results than other existing methods. The typical expected range for the parameters in obtaining the better measures are \(\mu \) (0.1, 0.5), x (0.5, 0.8), cluster size (5, 15), and sampling observations of probabilistic classification (100, 500). When the cluster size exceeds 15, and for large samples, it is necessary to modify the parameters \(\mu \) and x. In the future, recent soft computing strategies will be applied to enhance the performance and to apply the methods on different large-scale datasets to get a more accurate prediction model to overcome CCF detection (Marappan and Sethumadhavan 2021; Alfaiz and Fati 2022; Malik et al. 2022).
Data availability
Not applicable and not available.
Change history
28 August 2024
This article has been retracted. Please see the Retraction Notice for more detail: https://doi.org/10.1007/s00500-024-10125-2
References
Alarfaj FK, Malik I, Khan HU, Almusallam N, Ramzan M, Ahmed M (2022) Credit card fraud detection using state-of-the-art machine learning and deep learning algorithms. IEEE Access 10:39700–39715. https://doi.org/10.1109/ACCESS.2022.3166891
Alfaiz NS, Fati SM (2022) Enhanced credit card fraud detection model using machine learning. Electronics 11:662. https://doi.org/10.3390/electronics11040662
Belmonte JL, Segura-Robles A, Moreno-Guerrero A-J, Parra-González ME (2020) Machine learning and big data in the impact literature. A bibliometric review with scientific mapping in web of science. Symmetry 12:495
Benchaji I, Douzi S, El Ouahidi B et al (2021) Enhanced credit card fraud detection based on attention mechanism and LSTM deep model. J Big Data 8:151. https://doi.org/10.1186/s40537-021-00541-8
Bhaskaran S, Marappan R (2021) Design and analysis of an efficient machine learning based hybrid recommendation system with enhanced density-based spatial clustering for digital e-learning applications. Complex Intell Syst. https://doi.org/10.1007/s40747-021-00509-4
Carneiro EM, Forster CHQ, Mialaret LFS, Dias LAV, da Cunha AM (2022) High-cardinality categorical attributes and credit card fraud detection. Mathematics 10:3808. https://doi.org/10.3390/math10203808
Dalal S, Seth B, Radulescu M, Secara C, Tolea C (2022) Predicting fraud in financial payment services through optimized hyper-parameter-tuned XGBoost model. Mathematics 10:4679. https://doi.org/10.3390/math10244679
Dang TK, Tran TC, Tuan LM, Tiep MV (2021) Machine learning based on resampling approaches and deep reinforcement learning for credit card fraud detection systems. Appl Sci 11:10004. https://doi.org/10.3390/app112110004
Esenogho E, Mienye ID, Swart TG, Aruleba K, Obaido G (2022) A neural network ensemble with feature engineering for improved credit card fraud detection. IEEE Access 10:16400–16407. https://doi.org/10.1109/ACCESS.2022.3148298
Fatima EB, Omar B, Abdelmajid EM, Rustam F, Mehmood A, Choi GS (2021) Minimizing the overlapping degree to improve class-imbalanced learning under sparse feature selection: application to fraud detection. IEEE Access 9:28101–28110
Hoang NL, Trang LH, Dang TK (2020) A comparative study of the some methods used in constructing coresets for clustering large datasets. SN Comput Sci 1:1–12
Jain D, Choudhary D, Anand A, Trivedi NK, Gautam V, Mohapatra SK (2022) Cybersecurity solutions using AI techniques. In: 2022 10th International conference on reliability, Infocom technologies and optimization (trends and future directions) (ICRITO), Noida, India, 2022, pp 1–8. https://doi.org/10.1109/ICRITO56286.2022.9965045
Li Z, Huang M, Liu G, Jiang C (2021) A hybrid method with dynamic weighted entropy for handling the problem of class imbalance with overlap in credit card fraud detection. Expert Syst Appl 175:114750
Lin T-H, Jiang J-R (2021) Credit card fraud detection with autoencoder and probabilistic random forest. Mathematics 9:2683. https://doi.org/10.3390/math9212683
Malik EF, Khaw KW, Belaton B, Wong WP, Chew X (2022) Credit card fraud detection using a new hybrid machine learning architecture. Mathematics 10:1480. https://doi.org/10.3390/math10091480
Marappan R, Sethumadhavan G (2018) Solution to graph coloring using genetic and tabu search procedures. Arab J Sci Eng 43:525–542. https://doi.org/10.1007/s13369-017-2686-9
Marappan R, Sethumadhavan G (2020) Complexity analysis and stochastic convergence of some well-known evolutionary operators for solving graph coloring problem. Mathematics 8:303. https://doi.org/10.3390/math8030303
Marappan R, Sethumadhavan G (2021) Solving graph coloring problem using divide and conquer-based turbulent particle swarm optimization. Arab J Sci Eng. https://doi.org/10.1007/s13369-021-06323-x
Mniai A, Jebari K (2022) Credit card fraud detection by improved SVDD. In: Proceedings of the world congress on engineering 2022. WCE 2022, July 6–8, 2022, London, UK
Sharma M, Sharma H, Bhutani P, Sharma I (2021) Credit card fraud detection using machine learning algorithms. In: Innovations in cyber physical systems. Springer Singapore
Tran TC, Dang TK (2021) Machine learning for prediction of imbalanced data: credit fraud detection. In: Proceedings of the 2021 15th international conference on ubiquitous information management and communication (IMCOM), Seoul, Korea, 4–6 January 2021, pp 1–7
Trivedi NK, Simaiya S, Lilhore UK, Sharma SK (2020) An efficient credit card fraud detection model based on machine learning methods. IJAST 29(05):3414–3424
Funding
The authors have not disclosed any funding.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors have no conflict of interest in publishing the paper.
Ethical approval
This article does not contain any studies with human participants or animals performed by any authors.
Additional information
This article has been retracted. Please see the retraction notice for more detail:https://doi.org/10.1007/s00500-024-10125-2
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jayanthi, E., Ramesh, T., Kharat, R.S. et al. RETRACTED ARTICLE: Cybersecurity enhancement to detect credit card frauds in health care using new machine learning strategies. Soft Comput 27, 7555–7565 (2023). https://doi.org/10.1007/s00500-023-07954-y
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00500-023-07954-y