Introduction

The exploration of new data is fertile ground for contributions to an application domain. New Medicare insurance claims data became publicly available in 2021. These data, which we use in our study, come from two related sources. The first source is the Medicare Physician & Other Practitioners—by Provider and Service (Part B) data [1]. The second source is the Medicare Part D Prescribers—by Provider and Drug (Part D) insurance claims data [2]. The datasets we compile from the Part B and Part D data are highly imbalanced Big Data. Our Part B dataset has approximately 68 million training instances with a minority-to-majority class ratio of approximately 0.0019, and our Part D dataset has approximately 173 million instances with a minority-to-majority class ratio of approximately 0.0039. To the best of our knowledge, this is the first study to cover the latest Part B and Part D data in a single study on Medicare fraud detection as a supervised Machine Learning task.

Medicare is the United States’ public health insurance program. It provides health insurance for millions of Americans aged 65 and over, as well as those with certain disabilities. Medicare insurance fraud detection is a worthwhile pursuit, because the facts indicate that a large amount of money could be recovered. Once recovered, that money could be spent on providing more extensive healthcare to Medicare beneficiaries. The Centers for Medicare and Medicaid Services (CMS) is the United States government agency responsible for Medicare. In 2019, the CMS estimated that it made approximately $100 billion in improper payments [3]. In the same year, the United States Department of Justice published a report stating that it recovered approximately $3 billion by prosecuting insurance fraud [4]. The CMS uses the term “improper payments” to cover payments made due to fraud and other errors. However, given the magnitude of the estimated improper payments, it is reasonable to assume that more fraud could be detected. Therefore, Medicare fraud detection is a fitting application domain for the field of Machine Learning, which is a suitable tool for processing the Big Data repository of Medicare insurance claims information.

The contributions we make in this research on the subject of classifying highly imbalanced Big Data for Medicare fraud detection concern three key concepts: maximum tree depth, Random Undersampling (RUS), and Area Under the Receiver Operating Characteristic Curve (AUC) [5]. These contributions are an expansion of work we published previously [6]. Maximum tree depth refers to the longest allowable path from the root node to a leaf node in a decision tree. It is an adjustable parameter in the decision-tree-based ensemble classifiers we use. The classifiers we use are: CatBoost [7], XGBoost [8], Random Forest [9], and Extremely Randomized Trees (ET) [10]. RUS is a technique to improve classification results when working with data that have a low minority-to-majority class ratio. Such data are known as imbalanced data. To perform RUS, one chooses a minority-to-majority class ratio, and then randomly discards instances of the majority class until the remaining data have the desired class ratio. AUC is a metric for measuring the performance of a classifier. It is calculated by varying the classification decision threshold from zero to one in small increments and plotting the true-positive rate against the false-positive rate that the classifier yields for each value of the decision threshold. Once the points are plotted, a curve is formed, and AUC is the area under this curve. A perfect classifier yields an AUC score of 1.0, and a classifier that assigns instances to classes randomly yields an AUC score of approximately 0.5. These definitions of RUS, maximum tree depth, and AUC are key to understanding the contributions in this work.
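To make these definitions concrete, below is a minimal sketch of RUS and of the AUC behavior described above, written with NumPy and scikit-learn. The synthetic data, the random_undersample helper, and the 1:3 ratio are illustrative assumptions, not excerpts from our experimental code.

```python
# A minimal sketch of RUS on synthetic imbalanced data (illustrative only).
import numpy as np
from sklearn.metrics import roc_auc_score

def random_undersample(X, y, ratio, rng):
    """Discard majority-class rows until minority:majority equals `ratio`."""
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    n_keep = int(len(minority_idx) / ratio)  # majority rows to keep
    keep = rng.choice(majority_idx, size=n_keep, replace=False)
    idx = np.concatenate([minority_idx, keep])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.01).astype(int)  # roughly 1% minority class

X_rus, y_rus = random_undersample(X, y, ratio=1 / 3, rng=rng)

# A classifier that scores instances at random yields an AUC near 0.5.
print(roc_auc_score(y, rng.random(len(y))))
```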

This study makes two contributions to the field of classifying imbalanced Big Data. The first contribution of our research is to show that maximum tree depth in decision-tree-based ensemble classifiers is a highly effective parameter to optimize for AUC scores in Medicare fraud detection. We show that optimizing this parameter yields AUC scores over 0.98 in some cases. We vary maximum tree depth in our experiments over a wide range of values to show its substantial effect on experimental outcomes. Moreover, each experiment uses a single, fixed value of maximum tree depth.

The second contribution we make is to show that one may apply RUS to the Part B and Part D data and build Machine Learning models that yield AUC scores similar to, or better than, the AUC scores of models built with the full datasets. This is an important finding, because it allows for faster execution of model training. Since the Part B and Part D data are highly imbalanced, when we apply RUS to induce larger class ratios in the model training data, we reduce the size of the training data. Training the popular, open-source Machine Learning algorithms we employ is faster with smaller training data. This finding is a boon to researchers working in the field, since it enables one to conduct more experiments with the Part B and Part D data. The sections that follow this introduction are: Related Work, Datasets, Algorithms, Methodology, Results, Statistical Analysis, and Conclusions.

Related Work

In this section, we provide background on research leading up to our study, and make the case for the novelty of our work. Ensemble decision-tree-based Machine Learning techniques, the effect of maximum tree depth, RUS, and Big Data are the essential subjects of our study. Therefore, related works pertain to these concepts. Many of the studies were not conducted with datasets on the scale of the datasets we use. Moreover, in our search for related work, we found similar studies, but none that show the impact of maximum tree depth and RUS on AUC scores to the extent that we do, none that use CatBoost for encoding categorical features, and none that contain results of experiments performed with the latest Part B and Part D data.

In their 2018 study on data sampling and imbalanced Big Data, Bauder et al. compare multiple sampling techniques for Medicare fraud detection [11]. They compile a dataset by combining the Part B, Part D, and one additional Medicare dataset known as the “Medicare Durable Medical Equipment, Devices & Supplies—by Referring Provider and Service” (DMEPOS) dataset [12]. Because their study was published in 2018, a smaller amount of Medicare claims data was available than what we experiment with here. Consequently, their combined dataset has fewer than 1 million instances. They employ versions of Random Forest, Logistic Regression [13], and Gradient Boosted Trees [14] for the Apache Spark environment [15]. In their findings, Bauder et al. report experimental outcomes for six data sampling techniques: RUS, Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE), two adaptations of borderline SMOTE [16], and Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN). Of the six techniques, Bauder et al. report that applying RUS to the training data fed to their classifiers yields the best performance. For these reasons, we use RUS as well. On the other hand, Bauder et al. do not show the impact of maximum tree depth on AUC scores for Medicare fraud detection. In addition, they do not employ CatBoost encoding as we do to handle categorical features.

“An ensemble random forest algorithm for insurance big data analysis” by Lin et al. is a related study [17], because it involves Random Forest, sampling techniques, and classifying imbalanced data. In their study, the authors propose a variation on the Random Forest algorithm for predicting the likelihood that a consumer will purchase life insurance. They compare the performance of models built from training data with and without SMOTE applied to address class imbalance. However, in their study, details on the class ratios induced by SMOTE are not apparent. Their study appears to be more concerned with the impact of their sampling technique on running time. We find that RUS has a negligible impact on running time, and therefore focus on its impact on classification results. We did not find details on the features of Lin et al.’s dataset, such as whether it has categorical features. One useful feature of our study is that we document the use of CatBoost encoding [7], a technique for handling categorical features that is practical for large datasets. The dataset Lin et al. work with has approximately 500,000 instances and 16 attributes. Our study involves two much larger datasets. Moreover, we do not aim to propose a new variation on Random Forest; we use a publicly available, open-source version of Random Forest. For these reasons, our study is set apart from the one conducted by Lin et al.

A second study that involves Random Forest and Big Data is by Del Río et al. [18]. The dataset they use has approximately 6 million instances and 41 attributes. Therefore, it is on a smaller scale than what we work with here. Their study also does not concern maximum tree depth. Rather than experiment with maximum tree depth, Del Río et al. use one maximum tree depth setting for all experiments. Therefore, it is not a factor that can be analyzed for effect as we do here. While Del Río et al. document that RUS is applied to their data, it is only applied to induce a 1:1 class ratio. Here, we document the application of RUS to induce five class ratios and show its effect on experimental outcomes. Therefore, the key differences between our study and Del Río et al.’s are our treatment of maximum tree depth and RUS level as experimental factors.

Herrera et al. [19] published a related work that explores the impact of maximum tree depth with a classifier in common with one that we use. This is another study where Random Forest is employed to classify Big Data. However, the dataset Herrera et al. conduct their experiments with contains 581,012 training instances, which is much smaller than the datasets we work with. Furthermore, we note that Herrera et al. use only a single dataset, whereas we present results covering two datasets. The focus of Herrera et al.’s study is a novel implementation of Random Forest for a high-performance computing environment. Hence, their interest in maximum tree depth is its impact on the running time of their implementation. Our interest in maximum tree depth is its impact on AUC scores for classifying Medicare insurance claims data.

In their 2017 study, Genuer et al. evaluate the performance of multiple Random Forest variants in classifying a dataset with approximately 120 million instances [20]. They compare the performance of five variants of Random Forest classifiers in terms of prediction error. Subsampling is an important term in their study, but it is not a technique for addressing class imbalance; Genuer et al. present subsampling as part of building Random Forest models. We cover RUS, which is a sampling technique for addressing class imbalance. Moreover, Genuer et al.’s study concerns subsampling in conjunction with variations on Random Forest, whereas we present results from experiments combining RUS with Random Forest and other classifiers as well.

In [21], Fauzan and Murfi perform experiments with XGBoost and an insurance company’s customer data to forecast whether a customer will file an insurance claim. The customer data comprise a dataset of approximately 1.5 million instances with 57 attributes. As part of hyperparameter tuning, Fauzan and Murfi vary maximum tree depth between four and five. We take a more in-depth look at maximum tree depth, and we look at a broader range of maximum tree depths. In our experiments, maximum tree depth takes values of 6, 16, 24, 32, and 48. Moreover, our experiments involve other learners (Random Forest, CatBoost, and ET) in addition to XGBoost. Similar to Genuer et al., Fauzan and Murfi document experiments with subsampling, but we do not find that they experiment with RUS as a technique for addressing class imbalance in Big Data.

In their research on the use of XGBoost to predict loan default, Li et al. employ hyperparameter tuning to optimize results [22]. While they document that they use a grid search method for hyperparameter tuning, they do not cover the impact of maximum tree depth on classification results. Furthermore, Li et al. use a dataset that has approximately 130,000 instances and 143 features. Therefore, their study does not probe the impact of maximum tree depth on the classification of Big Data in the manner that our study does. One similarity between our study and Li et al.’s is the use of RUS. Their dataset is imbalanced, with a 1:22 minority-to-majority class ratio, and they write that they undersample their data to induce a 1:4 class ratio. We experiment with a wider range of class ratios and provide an analysis of the effect of RUS on the classification of two larger datasets.

In a study with data on a scale more characteristic of Big Data, Wang et al. document the performance of a classifier built from a combination of Logistic Regression and XGBoost [23]. The dataset they use has approximately 50 million instances. The aim of their study is to show the performance of their proposed classifier in predicting user activity on e-commerce websites. While Wang et al.’s research involves XGBoost and Big Data, it does not explore the effectiveness of maximum tree depth or RUS in the classification task. Another difference between the studies is that, for easy repeatability, we use popular, open-source classifiers without modification. These facts differentiate our studies.

In 2019, Johnson and Khoshgoftaar published a study that involves data sampling and Medicare fraud detection with deep learning classifiers [24]. However, their study involves Medicare insurance fraud data that do not include the additional data that became publicly available in 2021, which we use here. Furthermore, Johnson and Khoshgoftaar use a different technique for handling the categorical attributes of the Medicare insurance claims data. They aggregate the data by healthcare provider, and replace some categorical data with descriptive statistics of the numeric data in the records that they aggregate over. Since the dataset they work with is aggregated, it is smaller than the datasets we work with here. In their study, Johnson and Khoshgoftaar experiment with a dataset that has approximately 6 million instances. Another key difference between the two studies is that we use decision-tree-based classifiers, whereas Johnson and Khoshgoftaar use deep learning classifiers.

In our review of related work, we find opportunities to make contributions. There are many studies that discuss different components of our study, but none that reveal the synergy they provide when taken together. We find studies that claim to involve Big Data, but the data used are not on the scale of the data we use. We find studies that mention maximum tree depth, but do not systematically investigate the impact of maximum tree depth on the classification of imbalanced Big Data. To the best of our knowledge, this study is novel, because it covers experiments performed on two new, highly imbalanced Big Data datasets, with a unique methodology to show the impact of RUS and maximum tree depth on AUC scores.

Datasets

We use data provided by the CMS. The CMS collected the Part B and Part D data from 2013 to 2019, and published them in 2021. To the best of our knowledge, these datasets cover the longest period of time used in any study on Machine Learning for Medicare fraud detection. Moreover, the attributes of the data are different from those used in previous studies. The Part B and Part D data are not immediately suitable for supervised Machine Learning. For example, the instances of the datasets must be labeled.

We use the same logic to label both the Part B and Part D datasets. The data for the labels are published by another United States government organization, the Office of Inspector General (OIG). The publication is known as the List of Excluded Individuals and Entities (LEIE) [25]. The LEIE is updated monthly. The individuals and entities appearing in the LEIE are excluded from government-backed healthcare programs, including Medicare. In the LEIE, the OIG provides categories of exclusions along with identifying information. For the application domain of Medicare fraud detection, we select the records of the LEIE that contain the same exclusion categories as documented by Herland et al. [26]. Table 1 is a copy of the table from Herland et al. that specifies which exclusion rule codes we use to identify the records of fraudulent providers.

Table 1 LEIE exclusion codes and rules

The identifying information of healthcare providers in the LEIE includes a National Provider Identifier (NPI). Records of the Part B and Part D data also contain an NPI. Therefore, we can match NPIs in the Part B and Part D data with the LEIE data, and label an instance of the Part B or Part D data as fraudulent when there is a match and the year value of the record falls within the period of time when fraudulent claims may have been submitted. We define that period of time as follows: fraudulent claims may have been submitted from any time before the exclusion period begins until the exclusion period ends. However, the year is the most specific time-related information we have in the Part B and Part D data, whereas the year and month are the most specific time-related information we have in the LEIE data. Therefore, we round the end of the exclusion period listed in the LEIE to the nearest year. For example, if the LEIE data contain a record stating that a provider’s exclusion period ends in March 2017, then we label the provider’s records as fraudulent if their year value is less than 2017. However, if the exclusion period ends in July 2017, we label the provider’s records as fraudulent if the year value is less than 2018. If the NPI of a provider in the Part B or Part D data does not exist in the LEIE data, or the year of the record is after the end of the period when fraudulent claims were possible, then we label the instance as not fraudulent.
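Below is a minimal sketch of this labeling rule, assuming a lookup table that maps an NPI to the year and month in which its exclusion period ends; the function name, data structure, and example NPIs are hypothetical.

```python
# A minimal sketch of the labeling rule (hypothetical names and NPIs).
def is_fraudulent(npi, claim_year, leie_end_dates):
    """Label a Part B/Part D record using the rounded exclusion end year."""
    if npi not in leie_end_dates:
        return False  # provider never appears in the LEIE
    end_year, end_month = leie_end_dates[npi]
    # Round the exclusion end to the nearest year: January-June rounds
    # down, July-December rounds up, matching the 2017 examples above.
    rounded_end = end_year if end_month <= 6 else end_year + 1
    return claim_year < rounded_end

leie_end_dates = {1234567890: (2017, 3), 9876543210: (2017, 7)}
assert is_fraudulent(1234567890, 2016, leie_end_dates)      # year < 2017
assert not is_fraudulent(1234567890, 2017, leie_end_dates)
assert is_fraudulent(9876543210, 2017, leie_end_dates)      # year < 2018
```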

Once a provider’s exclusion period is over, the provider is removed from the LEIE. Therefore, one must obtain the previous editions of the LEIE to completely label the Part B or Part D data. These previous editions are available in the Internet Archive (see footnote 1).

Table 2 Features of the Part B dataset

Although we use the same labeling process for both datasets, they have characteristics that make them unique. Records of the Part B dataset contain information on the treatments and procedures that healthcare providers administer to their patients over one year. One important acronym for understanding which treatment or procedure is specified in a record is HCPCS, which stands for Healthcare Common Procedure Coding System. Every record has an HCPCS code, which represents a specific treatment or procedure. We use the HCPCS code as a categorical feature. All the fields in the Part B data that we use as independent variables are defined in Table 2. We have copied the definitions of the features from the Part B Data Dictionary [27]. In addition, we augment the definitions in the data dictionary with the number of distinct values of the categorical features. The CMS provides the NPI, name, and address of the provider in the Part B data. We discard this information, since these fields are unique identifiers and may hinder a Machine Learning model’s ability to generalize. When the labeling process is complete, our Part B dataset contains 67,856,547 records, with a minority-to-majority class ratio of approximately 0.0019.

Table 3 Features of the Part D dataset

The Part D data are similar to the Part B data. However, instead of information on treatments and procedures, the Part D data have information on medications that healthcare providers prescribe for their patients. There is one record in the Part D data for every combination of provider, year, and medication. Medications are identified by brand name and generic name. We use these as categorical features. Similar to the Part B data, the Part D data also contain NPIs, names, and addresses of providers, which we discard to aid our models’ generalization. We list the fields in the Part D data that we use as features in Table 3. The definitions of the features are from the CMS Part D Data Dictionary [28]. Once we have completed labeling the Part D data, we have a dataset with 173,677,665 records. The minority-to-majority class ratio of the dataset is approximately 0.0039.

Algorithms

Our study concerns tree-based algorithms only. Moreover, the focus of our study is on the impact of maximum tree depth and RUS on the classifiers’ ability to perform Medicare fraud detection. We employ publicly available, open-source classifiers. This enables easily reproducible results, since the software we use in our experiments is readily available. As we stated in the introduction, the classifiers we use in this study are: CatBoost, XGBoost, ET, and Random Forest. All of these classifiers are implementations of ensemble techniques that leverage collections of decision trees.

Of the four classifiers, Random Forest is the simplest. It leverages a technique known as Bagging. Breiman coined the term Bagging in a study published in 1996 [29]. One may apply Bagging to both classification and regression tasks. Since our study involves classification, we focus on how Bagging may be applied for classification. Bagging relies on a sampling technique known as bootstrap sampling. A bootstrap sample is a sample taken with replacement. To apply Bagging, one trains an ensemble of classifiers of the same type on different bootstrap samples of the training data. After the classifiers are trained, one may use them for classification. Bagging specifies that we treat the classifiers’ outputs as votes for the classification outcome. The class assigned to an instance is the class that the majority of the members of the ensemble assign the instance to. Some informal reasoning about the probability of a correct result explains the appeal of Bagging. Suppose the probability of a correct classification by one of the learners is greater than one half. Then, as the size of the ensemble of learners increases, the probability that more than half of the learners agree on the correct class of an instance also increases, approaching one when the learners’ errors are independent.
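Below is a minimal sketch of Bagging with bootstrap samples and majority voting, using scikit-learn decision trees on synthetic data; the ensemble size and dataset are illustrative assumptions.

```python
# A minimal sketch of Bagging: bootstrap samples plus majority voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1_000, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):  # an ensemble of 25 trees (odd, so no tied votes)
    sample = rng.integers(0, len(X), size=len(X))  # bootstrap: with replacement
    trees.append(DecisionTreeClassifier().fit(X[sample], y[sample]))

# Each tree votes; the ensemble predicts the majority class.
votes = np.stack([tree.predict(X) for tree in trees])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
```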

In 2001, Breiman published the seminal work on Random Forest, a study where the Bagging technique is applied to ensembles of decision trees [9]. Moreover, Random Forest has an important feature that is separate from the Bagging technique, and applies to the decision trees that constitute a Random Forest ensemble. The feature has to do with the decision tree split-finding process that is executed when the decision trees are trained on the bootstrap samples. A split in a decision tree is the value used in a node of a decision tree to decide which edge to traverse out of the node. Split finding is the process of finding the best value against which to compare a feature of an instance, in order to select the correct edge to follow in the decision tree. Random Forest’s innovation to split finding is in the initial selection of the feature to use for comparison: Random Forest randomly samples a subset of the dataset’s features as candidates for the comparison. Recently, a Random Forest implementation for Graphics Processing Units (GPUs) became publicly available (see footnote 2). We utilize this implementation of Random Forest, since it has a much lower training time relative to other Random Forest implementations.

ET is a classifier that is closely related to Random Forest. Geurts et al. published the seminal work on ET in 2006 [10]. ET functions similarly to Random Forest, with one key difference: splits in decision trees are selected randomly. In the other three classifiers we use, splits in decision trees are selected systematically. The motivation for random selection of splits is that it can be faster than an optimization technique that calculates the best value for a split. Our results show that this technique of randomly selecting values to use for splits may yield results that are similar to, or better than, those of the other classifiers that use more sophisticated split-finding techniques.

In contrast to the Bagging technique that ET and Random Forest apply, CatBoost and XGBoost are Gradient Boosted Machines (GBMs). For an additional study comparing CatBoost and XGBoost’s performance in the classification of imbalanced Big Data, please see [30]. Friedman introduced the concept of GBMs in 2001 [31]. Like Bagging-based classifiers, GBMs are ensembles of learners. However, the ensembles are formed differently in the two techniques. GBM ensembles are formed iteratively. We start with a single learner, which yields a set of estimates \(\hat{\textbf{y}}\) of the target \(\textbf{y}\). Therefore, we may define a vector of residual values \(\textbf{e} = L\left( \textbf{y},\hat{\textbf{y}}\right)\), where L is a loss function. The next step in forming the ensemble is to add a learner that uses the training data to estimate the vector \(\textbf{e}\). We then combine the outputs of the first two learners to obtain a new vector of estimates that is closer to the vector of target values. This yields a new vector of residuals, to which we apply the same process to add a third learner to the ensemble, and so on, until we detect overfitting or reach a predetermined number of iterations. This iterative process of forming an ensemble of learners characterizes GBMs.
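Below is a minimal sketch of this iterative process under squared loss, where each new tree fits the current residuals; the tree depth, learning rate, and iteration count are illustrative assumptions.

```python
# A minimal sketch of a GBM: each tree fits the current residuals.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1_000, random_state=0)
y_hat = np.full_like(y, y.mean())  # the initial, constant learner
ensemble, lr = [], 0.1             # learning rate shrinks each step

for _ in range(50):                # or stop when overfitting is detected
    e = y - y_hat                  # residuals under squared loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, e)
    ensemble.append(tree)
    y_hat = y_hat + lr * tree.predict(X)  # move estimates toward y

# Final prediction: y.mean() plus the shrunken sum of tree outputs.
```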

Chronologically, XGBoost was introduced before CatBoost. Chen and Guestrin published the initial work on XGBoost in 2016 [8]. In their study, Chen and Guestrin describe XGBoost as having enhancements to GBMs that make it an attractive candidate for Machine Learning applications. They explain that the enhancements in XGBoost provide scalability. In the context of their paper, scalability means that XGBoost operates efficiently on large datasets. They go on to explain that XGBoost achieves scalability through five techniques: sparsity awareness, approximate tree learning, effective cache access, data compression, and sharding. Sparsity awareness refers to a capability built into XGBoost that enables it to detect when data have mostly a constant value interspersed with anomalies, and to more quickly determine values for splits in decision trees when this condition is detected. Approximate tree learning is Chen and Guestrin’s name for the implementation of their weighted quantile sketch algorithm, which is another technique XGBoost employs for efficiently determining decision tree splits. Effective cache access refers to a data storage and retrieval strategy, built into XGBoost, that optimizes CPU cache utilization. Data compression is another strategy for efficiently using system resources that XGBoost employs to achieve scalability: compressed data can be read from disk and decompressed in less time than the same data in uncompressed form can be read from disk. XGBoost leverages this fact to process datasets more efficiently. The final enhancement XGBoost adds to GBMs is data sharding. During model fitting, XGBoost generates intermediate data and stores the data in blocks, which may then be distributed across separate storage media, so that input/output operations on the data are executed in parallel. We can attest to Chen and Guestrin’s claims about the scalability of XGBoost, since we are able to use it to efficiently classify the Part B and Part D Big Data.

CatBoost is the fourth classifier we use in our study. Like XGBoost, CatBoost is introduced by the authors of its seminal work, Prokhorenkova et al., as a GBM implementation with enhancements. For an extensive survey of CatBoost’s applications, please see [32]. The enhancements they design for CatBoost are focused on preventing overfitting in two ways. The first way CatBoost guards against overfitting is with the Ordered Boosting technique. Ordered Boosting is a method for selecting samples for training and evaluating trees to prevent overfitting. In Ordered Boosting, a set of candidate trees is trained on distinct samples of the training data. CatBoost then selects the best candidate based on its ability to estimate the dependent variable from a sample of the training data that is not used to train any of the candidate trees. This is how Ordered Boosting protects against overfitting. The second precaution against overfitting that CatBoost offers is Ordered Target Statistics. Ordered Target Statistics is a technique for encoding categorical features that is conceptually similar to Ordered Boosting, due to its emphasis on sample selection. In the Ordered Target Statistics encoding technique, the encoded value of a categorical feature is computed from the target (dependent variable) values that it co-occurs with. However, it is not permitted to use the target value from the same instance the categorical feature appears in. This prevents the encoded value from directly depending on the target value of the instance. Prokhorenkova et al. define this type of dependence as “target leakage”. Target leakage is similar to the overfitting that can occur when a model memorizes output values associated with unique identifiers, and therefore fails to generalize.
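Below is a minimal sketch of ordered target statistics on a toy pandas column. Each row is encoded from the target values of rows that appear earlier in a random permutation, so a row never sees its own target; the function name and the simple additive-prior smoothing are illustrative assumptions, one common variant of the technique.

```python
# A minimal sketch of ordered target statistics (illustrative smoothing).
import numpy as np
import pandas as pd

def ordered_target_stats(cats, y, prior, seed=0):
    order = np.random.default_rng(seed).permutation(len(cats))
    sums, counts = {}, {}
    encoded = np.empty(len(cats))
    for i in order:
        c = cats.iloc[i]
        # Encode from the running statistics of previously seen rows only,
        # so the encoding never depends on the row's own target value.
        encoded[i] = (sums.get(c, 0.0) + prior) / (counts.get(c, 0) + 1)
        sums[c] = sums.get(c, 0.0) + y.iloc[i]
        counts[c] = counts.get(c, 0) + 1
    return encoded

df = pd.DataFrame({"hcpcs": ["A", "B", "A", "A", "B"],
                   "fraud": [0, 1, 0, 1, 0]})
df["hcpcs_enc"] = ordered_target_stats(df["hcpcs"], df["fraud"],
                                       prior=df["fraud"].mean())
```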

Methodology

We use the Machine Learning algorithms discussed in the previous section to perform the experiments on the Part B and Part D data described above. All four of the classifiers have implementations as libraries for the Python programming language [33]. All the libraries are available for download on the Internet, free of charge.

Machine Learning algorithms are stochastic in nature. Put another way, they tend to produce different results unless the initial conditions for learning are kept exactly the same. Robust performance under varying initial conditions indicates a task that is suitable for Machine Learning. Therefore, to ensure such variance in the initial conditions, we execute ten iterations of fivefold cross-validation with a recorded sequence of different seeds for the random number generators used in our experiments.

Every round of fivefold cross-validation yields one AUC score. Therefore, we collect a total of 50 AUC scores over the ten iterations of fivefold cross-validation. This provides an ample number of samples for calculating summary statistics of the results as well as performing statistical tests and analyses.

Just as we have libraries that provide CatBoost, XGBoost, Random Forest, and ET, we have a library, scikit-learn [34], that facilitates fivefold cross-validation and evaluation of the AUC scores. The use of these libraries also facilitates the repeatability of our results.
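Below is a minimal, self-contained sketch of the 10 × 5-fold design: ten recorded seeds, each driving a differently shuffled fivefold split, for a total of 50 AUC scores. The synthetic data and the choice of ET as the classifier are illustrative assumptions; the full fold-level pipeline is sketched later in this section.

```python
# A minimal sketch of ten iterations of fivefold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2_000, weights=[0.95], random_state=0)

aucs = []
for seed in range(10):  # the recorded sequence of seeds
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in cv.split(X, y):
        clf = ExtraTreesClassifier(random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        aucs.append(roc_auc_score(y[test_idx],
                                  clf.predict_proba(X[test_idx])[:, 1]))

assert len(aucs) == 50  # one AUC score per fold, per iteration
```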

Researchers have many options for encoding categorical features. For a domain-specific technique for encoding Medicare Big Data, see [35]. In this study, we use CatBoost encoding. Readers interested in a comprehensive investigation of encoding techniques should explore [36].

Encoding of categorical features and RUS are steps in our experiments. The order in which they are performed in conjunction with fivefold cross-validation is important. In the program for running our experiments, we first shuffle and split the data into training and testing subsets. We then fit a CatBoost encoder on the training data, to encode the categorical features. The CatBoost encoder is part of the category-encoders library. After fitting the CatBoost encoder to the training data, we encode the categorical features in both the training and the test datasets. Then, for experiments where RUS is required, we use the RandomUnderSampler object from the imblearn library to induce a specified class ratio. After encoding and applying RUS to the training data, we fit CatBoost, XGBoost, ET, or Random Forest to it. Finally, we evaluate the fitted classifier’s AUC score on the test data.
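Below is a minimal sketch of one such train/test round, in the order just described: split, fit the CatBoost encoder on the training data only, encode both subsets, apply RUS, fit a classifier, and score AUC. The synthetic dataset, column names, 1:9 ratio, and choice of XGBoost with a maximum tree depth of 24 are illustrative assumptions.

```python
# A minimal sketch of one round: encode, undersample, fit, score.
import numpy as np
import pandas as pd
import category_encoders as ce
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 5_000
X = pd.DataFrame({"hcpcs_code": rng.choice(list("ABCDE"), n),   # categorical
                  "avg_payment": rng.gamma(2.0, 50.0, n)})      # numeric
y = pd.Series((rng.random(n) < 0.05).astype(int))               # ~5% minority
seed = 0

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=seed)

encoder = ce.CatBoostEncoder(cols=["hcpcs_code"])
encoder.fit(X_tr, y_tr)                        # fit on the training data only
X_tr, X_te = encoder.transform(X_tr), encoder.transform(X_te)

rus = RandomUnderSampler(sampling_strategy=1 / 9, random_state=seed)  # 1:9
X_rus, y_rus = rus.fit_resample(X_tr, y_tr)

clf = XGBClassifier(max_depth=24).fit(X_rus, y_rus)
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```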

One important note for reproducibility is that we use GPU implementations of CatBoost, XGBoost, and Random Forest. In preliminary experiments, we found these versions of the classifiers to process the Part B and Part D data much more quickly than their equivalent CPU implementations.

We conduct experiments over a number of combinations of classifier, RUS level, and maximum tree depth. Different classifiers support different levels of maximum tree depth, so it is not possible to test all combinations. A maximum tree depth of 6 is the default value for CatBoost and XGBoost [37, 38]. Initially, we selected maximum tree depth values of 6, 16, 24, 32, and 48 to get a reasonable spread between the lowest default maximum tree depth of 6 and the largest maximum tree depth we estimated would be practical to use with our system, 48. Since we are working with large datasets, in the interest of time, we opted not to run experiments on further maximum tree depths for a classifier if a pattern of improving AUC scores was already established. Hence, for Random Forest, we executed experiments with maximum tree depths of 16, 24, 32, and 48. CatBoost only supports maximum tree depth values up to 16, so 6 and 16 are the only maximum tree depths we use with CatBoost. XGBoost supports all the levels of maximum tree depth we wish to use; however, we encountered runtime errors due to memory limitations for maximum tree depth values beyond 24. Therefore, our experimental results include maximum tree depth values of 6, 16, and 24 for XGBoost. ET has unlimited maximum tree depth by default. Therefore, it was only necessary to conduct experiments with ET at maximum tree depths of 16, 24, and the unlimited default setting to establish the pattern that ET’s AUC scores improve as maximum tree depth increases. Table 4 contains all the non-default values of hyperparameters that we use in our experiments.

Table 4 Changed hyperparameter settings

All of the classifiers are compatible with every level of RUS used in our experiments. We use RUS to induce class ratios of 1:1, 1:3, 1:9, 1:27, and 1:81. Therefore, we include combinations of classifier, maximum tree depth, and the five levels of RUS. The levels of RUS used are the same as those we used in [39]. In addition to the experiments where RUS is applied, we also conduct experiments where RUS is not applied, to determine whether results are better without RUS.
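Below is a minimal sketch of how these five levels translate into imblearn samplers; for binary data, a float sampling_strategy is the desired minority-to-majority ratio after undersampling.

```python
# A minimal sketch mapping the five RUS levels to imblearn samplers.
from imblearn.under_sampling import RandomUnderSampler

rus_levels = {"1:1": 1.0, "1:3": 1 / 3, "1:9": 1 / 9,
              "1:27": 1 / 27, "1:81": 1 / 81}
samplers = {name: RandomUnderSampler(sampling_strategy=ratio, random_state=0)
            for name, ratio in rus_levels.items()}
```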

The system we use to run our experiments is a high-performance computing environment. The environment consists of 16 nodes. Each node is equipped with a 16-core Intel Xeon CPU, 256 GB of RAM, and an Nvidia V100 GPU.

Results

Table 5 contains the results of the initial experiments that we conducted with the Part B and Part D data. Our key observation in the results is that the classifiers with higher maximum tree depths yield better performance. In our initial experiments, we used default maximum tree depths for all classifiers. CatBoost and XGBoost both have a default maximum tree depth of 6. We see that the AUC scores that CatBoost and XGBoost yield are the lowest. ET’s default setting for maximum tree depth is unlimited, and we see that it yields the highest AUC scores. Furthermore, the Random Forest implementation we use has a default maximum tree depth of 16, and the AUC scores it yields fall between those of CatBoost and XGBoost on the one hand, and ET on the other. These initial results establish a clear pattern that AUC scores increase with maximum tree depth. We sought to confirm the pattern with further experiments.

Due to space limitations, we abbreviate classifier names in the results and statistical analysis as follows: CatBoost is abbreviated as CB, Random Forest is abbreviated as RF-GPU, XGBoost is abbreviated as XGB, and Extremely Randomized Trees is abbreviated as ET. When a number follows the abbreviation, it indicates the maximum tree depth. For example, XGB-16 means XGBoost with a maximum tree depth of 16.

Table 5 Mean AUC scores (ten iterations of fivefold cross-validation)

Maximum tree depth is a challenging parameter to optimize, since increasing it by one allows the learner to potentially double the number of nodes in a decision tree, and therefore consume twice as much memory. We ran into the same limitations on maximum tree depth with both the Part B and Part D datasets. CatBoost builds balanced decision trees. A balanced decision tree is a decision tree with the additional requirement that every level is fully populated before the next level may be populated. Therefore, increasing the maximum tree depth by one for CatBoost guarantees a doubling of the memory consumed by the decision trees. CatBoost has strict limits on maximum tree depth values, and will halt execution when it detects that the user has specified a maximum tree depth larger than 16. Hence, we provide results for CatBoost with its default maximum tree depth of 6, and with a maximum tree depth of 16, in Table 6.
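To make the doubling concrete, consider the node count of a balanced binary tree of depth \(d\), counting the root as depth zero:

\[
\text{nodes}(d) = \sum_{k=0}^{d} 2^k = 2^{d+1} - 1, \qquad \frac{\text{nodes}(d+1)}{\text{nodes}(d)} \approx 2.
\]

For example, \(\text{nodes}(6) = 127\), while \(\text{nodes}(16) = 131{,}071\), roughly a thousandfold increase in the worst case.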

We were able to run experiments with XGBoost with maximum tree depth as high as 24. We encountered runtime errors when attempting to use XGBoost with larger values of maximum tree depth.

By default, ET has no limit on maximum tree depth. Unlike CatBoost, ET does not build balanced decision trees, so the unlimited maximum tree depth does not guarantee a huge memory consumption. Since we confirm the trend that ET’s performance in terms of AUC increases as maximum tree depth increases from 16 to 24, and we have AUC scores ET yields with its default maximum tree depth, there is sufficient evidence in Table 6 that ET conforms to our thesis that increasing maximum tree depth is a highly effective method for improving classifiers’ performance in terms of AUC for Medicare fraud detection.

Table 6 Mean AUC scores of classifiers with varying levels of maximum tree depth (ten iterations of fivefold cross-validation)

Due to the exponential relationship between maximum tree depth and memory consumption, we found it necessary to seek a mitigating technique. RUS is a natural candidate, since it reduces the size of the training data. Therefore, we took the best-performing models documented in Table 6, and conducted further experiments with RUS. The outcomes of those experiments are documented in Table 7. The mean AUC scores in Table 7 offer further confirmation that higher maximum tree depth, regardless of classifier, correlates with higher AUC scores. CatBoost, with the smallest maximum tree depth of 16, yields the lowest AUC scores, whereas the other learners with greater maximum tree depths yield mean AUC scores that are much higher. The second important finding apparent in Table 7 is that the level of RUS appears to have a small effect on AUC scores for some learners. This is appealing, because, for imbalanced datasets, applying RUS to induce higher class ratios reduces the size of the training dataset to a multiple of the size of the minority class. In the next section, we conduct statistical analyses to determine the level of RUS we can apply to obtain optimal performance in terms of AUC.

Table 7 Mean AUC scores of classifiers with varying maximum tree depth and undersampling (ten iterations of fivefold cross-validation)

Statistical Analysis

We perform ANOVA and Tukey HSD tests to get a better sense of the impact of maximum tree depth and RUS on AUC scores. The outcomes of the tests enable us to make informed conclusions on the effect of the factors. Moreover, they provide guidelines for future research. We perform a total of four analyses. The first two analyses are on experimental outcomes for experiments where RUS is not applied. We do one analysis for experiments with the Part B data, and one analysis for experiments with the Part D data. These analyses are on the same data used to populate Table 6. We provide two more analyses, for experiments involving RUS. Again, the analyses are separated into one for experiments involving the Part B Data, and one for experiments involving the Part D data. The analyses on the outcomes of experiments involving RUS are done on the same data used to populate Table 7.

Part B Experiments Without RUS: Analysis of Results in Terms of AUC

The first experiments we analyze are the experiments involving maximum tree depth and the Part B data. To get started, we perform an ANOVA test to determine whether maximum tree depth has a significant impact on performance in terms of AUC for classifying the Part B data. The Pr(>F) or p values for the F-statistics we calculate in the ANOVA tests are practically zero, which indicates that maximum tree depth has a statistically significant impact on performance in terms of AUC (Table 8).

Table 8 ANOVA for depth and classifier as factors of performance in terms of AUC
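Below is a minimal sketch of how such two-factor ANOVA and Tukey HSD tests can be run with statsmodels, assuming the per-fold AUC scores are collected in a long-format table with columns auc, depth, and classifier; the file name and column names are hypothetical.

```python
# A minimal sketch of the two-factor ANOVA and Tukey HSD tests.
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

results = pd.read_csv("part_b_auc_scores.csv")  # hypothetical results file

model = ols("auc ~ C(depth) + C(classifier)", data=results).fit()
print(anova_lm(model, typ=2))                   # Pr(>F) for each factor

# Group the levels of a factor that yield statistically equivalent AUC.
print(pairwise_tukeyhsd(results["auc"], results["depth"]))
```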

Since the ANOVA test shows that maximum tree depth has a significant impact on performance in terms of AUC, a Tukey HSD test will assign the levels of the maximum tree depth factor into groups that yield equivalent performance. The conclusion we draw from Table 9 is that the larger values of maximum tree depth are associated with the best AUC scores for classifying the part B data.

Table 9 HSD test groupings after ANOVA of AUC for the depth factor

Classifier is the second factor in the ANOVA test results in Table 8. Since the classifier also has a significant impact on performance in terms of AUC, we conduct an HSD test to rank classifiers in terms of the AUC scores they yield. We report the results of the HSD test in Table 10. We find that the most relevant implication of Table 10 is that the classifier that accommodates only the lowest maximum tree depth of all, CatBoost, is also ranked lowest in the HSD test.

Table 10 HSD test groupings after ANOVA of AUC for the classifier factor

Part D Experiments Without RUS: Analysis of Results in Terms of AUC

We move on to analyze the impact of maximum tree depth on the classification of the Part D data. We report the results of an ANOVA test to determine the significance of maximum tree depth and classifier in Table 11. The Pr(>F) values of the F-statistics of the ANOVA tests are practically zero. Thus, both classifier and maximum tree depth have a significant effect on the classification results.

Table 11 ANOVA for depth and classifier as factors of performance in terms of AUC

Since classifier and maximum tree depth also have a significant impact on AUC scores when classifying the Part D data, we conduct HSD tests to rank these factors. Table 12 contains the results of an HSD test to rank levels of maximum tree depth in accordance with their impacts on AUC scores. Interestingly, the groups that the HSD test determines for the Part D experiments are the same as the groups that the HSD test determines for the Part B experiments. Therefore, we draw the same conclusion that performance in terms of AUC is positively correlated with maximum tree depth.

Table 12 HSD test groupings after ANOVA of AUC for the depth factor

Since the p value of the F-statistic for the classifier factor in the ANOVA test results in Table 11 is practically 0, we conduct an HSD test to rank classifiers in terms of their impact on AUC scores. The results of the HSD test for the Part D data in Table 13 are also the same as the results of the HSD test for the Part B data in Table 10. Therefore, the result aligns with the previous findings: for classifying the Part D data, the classifier that yields the lowest AUC scores is also the one that allows the lowest possible maximum tree depth.

Table 13 HSD test groupings after ANOVA of AUC for the classifier factor

Part B RUS Experiments: Analysis of Results in Terms of AUC

As mentioned previously, increasing maximum tree depth has an exponential relationship with potential resource consumption. Therefore, a technique for mitigating the resource consumption is attractive. Since applying RUS shrinks the size of the training data for imbalanced datasets, it is a natural candidate for a way to combat increased resource consumption due to increased maximum tree depth. However, one should be cognizant of any trade-off of performance in terms of AUC for the decreased resource consumption that RUS affords. Therefore, we conduct a statistical analysis to determine whether RUS has a significant impact on performance in terms of AUC. We analyze results for the classifiers and maximum tree depth levels that yield the best performance in terms of AUC that we identify in the previous sections. The first analysis we conduct is an ANOVA test for the effect of RUS and classifier on Medicare fraud detection in the Part B data. The Pr(>F) values reported in Table 14 indicate that both classifier and level of RUS have a significant impact on AUC scores.

Table 14 ANOVA for RUS and classifier as factors of performance in terms of AUC

Since RUS has a significant effect on classification of the Part B data, an HSD test will tell us which levels of the RUS factor are associated with the highest AUC scores. In this case, we see the 1:9 and 1:3 class ratios listed in Table 15 yield the highest AUC scores. This is good news, since it means we can apply RUS to obtain smaller training data, and maintain strong performance in terms of AUC.

Table 15 HSD test groupings after ANOVA of AUC for the RUS factor

Next, we proceed with an HSD test to rank classifiers across all levels of RUS. In Table 16, we find that XGBoost with a maximum tree depth of 24 yields the best performance. Furthermore, the results in Table 16 indicate the classifier that permits the lowest maximum tree depth is associated with the lowest AUC scores as we vary levels of RUS.

Table 16 HSD test groupings after ANOVA of AUC for the classifier factor

Part D RUS Experiments: Analysis of Results in Terms of AUC

We perform a similar analysis of the effect of classifier and RUS on AUC scores for classifying the Part D data. Table 17 contains the results of an ANOVA test where classifier and RUS are treated as factors. The Pr(>F) values indicate that both factors have a significant impact on classifying the Part D data.

Table 17 ANOVA for RUS and classifier as factors of performance in terms of AUC

The HSD results in Table 18 indicate that RUS applied to induce a class ratio of 1:9 yields the best AUC scores to classify the Part D data. These results are similar to those for classifying the Part B data; only now the 1:3 class ratio has moved to the group that yields the second-best performance.

Table 18 HSD test groupings after ANOVA of AUC for the RUS factor

We conduct a second HSD test for the classifier factor’s effect on AUC scores. The outcome of the test is reported in Table 19. For the Part D data, the result for the group that yields the lowest AUC scores is the same as for the Part B data. However, the best-performing classifier is different. Here, we find that, across all levels of RUS, ET is associated with the best performance.

Table 19 HSD test groupings after ANOVA of AUC for the classifier factor

Conclusions

The first conclusion we would like to mention involves CatBoost. We were not able to increase the maximum tree depth of the decision trees in the CatBoost classifier beyond 16. This is a much lower maximum tree depth than the other classifiers permit. However, CatBoost plays another important role in our experiments: we use CatBoost encoding to encode the categorical features in the Part B and Part D datasets. Given the strength of the results, we feel that future work to ascertain the CatBoost encoder’s contribution to the results is warranted. This can be determined by further experiments that include different techniques for encoding categorical features.

Our experimental results and statistical analyses reveal some facts about Medicare fraud detection in the latest publicly available Part B and Part D data. These findings are a contribution, since they pertain to new data. The first important finding is the effect of maximum tree depth on AUC scores. Our results show that maximum tree depth is positively correlated with AUC scores. The second noteworthy result of our study is that one may employ learners with higher maximum tree depth values, and apply RUS to induce larger class ratios, to build models that outperform models trained on the full dataset in terms of AUC scores. This is good news for researchers experimenting with the Part B and Part D datasets. Fewer training instances, which translate to shorter model training times, can be used to obtain results similar to, or better than, results obtained with the full Part B or Part D datasets. For classifying the Part B data, we find that models trained on data with RUS applied to induce the 1:3 class ratio yield the best performance. Similarly, for the Part D data, we find that models trained on data with RUS applied to induce the 1:9 class ratio yield the best performance. Since both the Part B and Part D datasets are large and imbalanced, applying RUS reduces the size of the training data by two orders of magnitude. The reduction in the size of the training data mitigates the increased resource consumption that is a by-product of increasing maximum tree depth. In future work, we plan to extend our analysis to other metrics, such as Area Under the Precision–Recall Curve, and to other highly imbalanced Big Data.