1 Introduction

In today’s fiercely competitive business landscape, customers represent the most valuable asset of every enterprise (Seturi 2024). Companies must not only attract visitors but also engage them, convert them into customers, and retain them over the long term. Building a strong relationship with customers is imperative to prevent their departure (Ehsani and Hosseini 2023a). This involves offering products and services tailored to their desires and consistently striving to satisfy them (Ehsani and Hosseini 2023b). Customer churn prediction (CCP) models are designed to identify customers with the highest likelihood of attrition (Rao et al. 2024). By doing so, these models enable businesses to enhance the effectiveness of customer retention campaigns and reduce the costs associated with churn (Verbeke et al. 2012). This introduction gives a brief overview of the problem.

1.1 Overview

The rise of electronic commerce platforms has sparked significant shifts in customer behavior and decision-making (Pan and Zhou 2020; Wu et al. 2023). Suppliers have leveraged various electronic marketing spaces to showcase their products and services and reach their target customers (Varadarajan et al. 2022). As customers’ demands evolve over time and they gain access to numerous options in the electronic marketing environment, their purchasing power and ability to switch between alternatives have increased (Calvano and Polo 2021). Furthermore, they are inundated with vast amounts of information, enhancing their knowledge of products and services (Isa and Nayan 2020). Nearly all electronic commerce organizations employ diverse strategies to engage with visitors and convert them into customers (Anshari et al. 2019). They demonstrate significant perseverance in providing incentives and establishing strong connections to foster long-term relationships with their customers (Mkansi 2022). Offering various amenities to serve customers efficiently, they aim to influence their lifetime value (Calderón-Monge and Ramírez-Hurtado 2022). However, despite marketing programs incorporating various comprehensive sets of business applications to enhance efficiency and effectiveness in managing customer demands, customer churn from one market to another remains a persistent challenge for businesses.

1.2 Problem description

Customer churn is defined as the discontinuation of business between customers or subscribers and a firm or service (Verbeke et al. 2012). It occurs when customers depart from a firm due to dissatisfaction or competitive situations, such as the introduction of new products. Even a reduction in the frequency of purchases by customers can contribute to this issue for businesses (Çelik and Osmanoglu 2019). Customers who opt to stop purchasing can be categorized as intentional or unintentional churners. Intentional churners are customers who actively choose to end their relationship with the marketers, whereas unintentional churn occurs when the firm or service discontinues the customer’s access due to reasons like delayed payment, non-payment for products, or misuse of the firm’s data and information (Tsai and Chen 2010). A potential solution to address this challenge is predicting which customers are likely to leave.

1.3 Customer churn prediction

The analysis of customer churn rate involves researching the likelihood of customers moving from one business to another and ceasing to purchase its products or services (Ahn et al. 2019). Customer churn prediction (CCP) is a marketing strategy aimed at assisting marketers in adopting practical approaches to acquire new customers, retain existing ones, and assure customers about the quality of the firm’s products and services (Mehralian 2022). The primary objective of CCP is to identify the most valuable customers crucial for maintaining electronic markets (Bachmann et al. 2021). CCP endeavors to enhance satisfaction for both firms and customers by providing insights into customer demands and behaviors (Rane et al. 2023). Another goal of CCP is to establish, manage, and reinforce enduring connections with customers (Asthana 2018). The CCP process seeks to shape perceptions by recognizing customers and providing incentives to retain them (Mahajan and Gangwar 2017). Customer retention occurs when an existing organization meets customers’ demands. The revenue of companies typically relies on profits derived from customer purchasing decisions, making customers the most valuable asset of any business. CCP is of utmost importance in revenue estimation models across various industries such as electronic banking, telecommunication, insurance, and retailing (Çelik and Osmanoglu 2019). Every business aims to acquire new customers, upsell and cross-sell to them, and extend their retention period. Maintaining current customers is more cost-effective (Vijaya and Sivasankar 2018; Lalwani et al. 2022) because, in today’s saturated and intensely competitive markets, acquiring new customers can be over twenty times more expensive for enterprises than retaining existing ones. Additionally, the value proposition of organizations is directly proportional to the number of active customers engaging in transactions and making multiple purchases over time. The profitability of satisfied and loyal customers is higher and depends on parameters such as expenses, investment capacity, profitability, cash flow, and firm size (Çelik and Osmanoglu 2019).

1.4 Project planning

In the current research, CCP focuses on recognized customers, who represent the most profitable segment in transaction, telecommunication, and customer churn datasets. These datasets contain demographic, historical, and behavioral information of customers, enabling the evaluation of their potential value and prediction of their future desires. We employ churn prediction to identify less engaged customers using an ensemble of classifiers approach. The objective is to pinpoint the rate of customers transitioning from a service to competitors and encourage them to remain with our service. We introduce a novel meta-classifier composed of five more practical classifiers to achieve the highest prediction accuracy. The main contributions of this study are outlined below:

  • We analyzed transaction, telecommunication, and customer churn datasets individually, identifying their samples and explaining each feature.

  • We performed data preprocessing to clean, encode, scale, and transform the datasets in preparation for further analysis.

  • Considering the imbalance in the transaction and telecommunication datasets, we implemented the SMOTE-ENN resampling method to enhance the predictive performance of classifiers.

  • We utilized permutation feature selection and SelectKBest feature selection (with f_classif) to rank the most practical features in each dataset and optimize predictions.

  • To select and fine-tune the best parameters for each classifier, we employed the BayesSearchCV and GridSearchCV hyperparameter optimization methods.

  • We applied Decision Tree, Random Forest, XGBoost, AdaBoost, and Extra Trees as the five base supervised classifiers on each dataset separately, using the same cross-validation and evaluation setup.

  • Furthermore, we introduced a novel meta-classifier to evaluate the performance of customer churn predictions, which outperforms other classifiers in terms of accuracy and error rates.

  • Additionally, we executed ten runs per dataset and reported the mean and standard deviation of each evaluation metric to confirm the superiority of the proposed meta-classifier.

  • Finally, we compared the experimental results of the new meta-classifier with Stacking, a powerful ensemble-based classification method, to demonstrate the superior performance of our proposed approach.

2 Datasets description

We conducted our research using three distinct datasets. The “transaction data” pertains to information regarding transactions made by customers when they purchase products or services. This dataset includes customer characteristics, payment information, loan and task details, as well as a chronological sequence of events involving customers (Hand 2018). Telecommunication involves the electronic transmission of information across distances. This information can take the form of voice telephone calls, data, text, images, or video. Telecom providers gather data from various sources, such as call detail records, network logs, and customer history (Loo and Ngan 2012). The “customer churn dataset” encompasses demographic, historical, and subscription details of customers from a service provider company.

2.1 Transaction dataset

The transaction dataset comprises information on 10,000 unique customers of an English bank, encompassing 14 attributes related to the customers. These samples were collected from the bank’s database, sourced from www.Kaggle.com. The first column, “Row Number,” serves as a non-deterministic analytic function allocating a unique number to each customer in the dataset. The “Customer Id” column assigns a distinct identification value to prevent duplications. When customers register on the bank’s website, they choose a “Surname” for identification purposes during account creation and login. However, these three features solely serve for customer identification and do not influence their behaviors or decisions regarding leaving the bank. Moreover, they exhibit high correlations with each other according to the Correlation Matrix, prompting our decision to remove them. The “Credit Score” attribute is a numerical representation of a customer’s creditworthiness, based on factors such as credit history, number of open accounts, types of loans, debt levels, and payment patterns. Lenders utilize this score to assess the likelihood of loan repayment within the specified timeframe, assigning an integer value between 300 and 850 to each customer. Scores above 700 typically indicate promising borrowers who may receive lower interest rates (W. Liu, 2022). Customers with higher credit scores tend to remain loyal and continue utilizing the bank’s services. “Geography” denotes the country of origin of the participating customers, including France, Spain, and Germany. “Gender” indicates that the customer is male or female, while “Age” represents how old they are. Generally, younger customers exhibit higher retention rates compared to older ones. “Tenure” reflects the duration for which a customer remains subscribed to the bank, with younger clients typically displaying lower loyalty and higher churn rates. 
“Balance” encompasses all deposits and withdrawals in a customer’s account, reflecting the total amount of money. This variable is significant in churn prediction, as customers with lower balances are more likely to leave. “NumOfProducts” indicates the number of products purchased through the bank, while “Has Cr Card” specifies whether a customer possesses a credit card. Active customers, denoted by “Is Active Member,” engage in frequent transactions, indicating loyalty to the bank. “Estimated Salary” represents the approximate earnings of each customer. Similar to balance, customers with higher salaries tend to engage in more transactions. The transaction dataset focuses on binary classification, with “Exited” serving as the target variable to evaluate the accuracy of the algorithms employed. “Exited” acts as the class label, with 7,963 customers belonging to the positive class, indicating those who did not abandon the bank and its services. Conversely, the remaining 2,037 customers are classified in the negative class as they opted to leave the bank. This distribution results in an imbalanced binary classification dataset, which may pose challenges during model training and evaluation.

2.2 Telecommunication dataset

In the highly competitive telecommunication industry, the annual churn rate ranges from 15 to 25%. Customers have a plethora of service providers to choose from and often switch between them actively. The telecommunication dataset used in this research comprises 7043 unique customers and 21 attributes obtained from www.Kaggle.com. The “Customer ID” is a unique identifier for each customer. However, according to the correlation matrix, it has no impact on churn prediction and is solely used for customer identification. Consequently, we have decided to remove it. The dataset includes demographic information about customers: “gender” indicates whether the customer is male or female, “SeniorCitizen” indicates whether the customer is a senior citizen, “Partner” indicates whether the customer has a partner, and “Dependents” indicates whether the customer has dependents. The dataset also records the services each customer has signed up for: “PhoneService” shows whether the customer has a phone service, “MultipleLines” shows whether the customer has multiple lines, “InternetService” denotes the customer’s internet service provider, “OnlineSecurity” specifies whether the customer has online security, “OnlineBackup” specifies whether the customer has online backup, “DeviceProtection” specifies whether the customer has device protection, “TechSupport” declares whether the customer has tech support, “StreamingTV” indicates whether the customer has streaming TV, and “StreamingMovies” indicates whether the customer has streaming movies. The dataset further contains customer account information: “tenure” gives the number of months the customer has remained with the company, and “Contract” gives the contract term of the customer, such as Month-to-month, One year, or Two years. “PaperlessBilling” indicates whether the customer has opted for paperless billing, while “PaymentMethod” gives the customer’s payment method, such as Electronic check, Mailed check, Bank transfer, or Credit card. “MonthlyCharges” is the amount charged to the customer monthly, while “TotalCharges” is the total amount charged to the customer. “Churn” serves as the target variable, indicating whether the customer has left within the last month or not. In this dataset, 5174 customers belong to the positive class, indicating they have not abandoned the telecommunication services. The remaining 1869 customers have been classified in the negative class as they decided to leave. This imbalance in our binary classification dataset requires attention.

2.3 Customer churn dataset

The customer churn dataset comprises 100,000 unique customers and 9 attributes related to customer historical usage behavior, demographic information, and subscription details sourced from www.Kaggle.com. The attributes are as follows: “CustomerID” is a unique identifier for each customer, and “Name” is the customer’s name. However, based on the correlation matrix, these two features solely serve for customer identification and have no effect on churn prediction, leading us to remove them. “Age” represents the age of the customer, while “Gender” denotes the gender (Male or Female). “Location” specifies the customer’s base, with options including Houston, Los Angeles, Miami, Chicago, and New York. “Subscription_Length_Months” gives the number of months the customer has been subscribed. “Monthly_Bill” reflects the monthly bill amount for the customer, and “Total_Usage_GB” gives the total usage in gigabytes. “Churn” serves as the target variable, representing whether the customer has churned or not. In this dataset, 50,221 customers belong to the positive class, indicating they did not abandon the service, while the remaining 49,779 have been classified in the negative class as they decided to leave. The two classes are almost evenly distributed, so we treat this dataset as balanced.
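The class counts reported for the three datasets can be turned into churn rates and imbalance ratios with a few lines of arithmetic. The following is a minimal sketch that uses only the counts quoted in this section; the function and variable names are illustrative assumptions, not part of the study’s code.

```python
# Class counts as reported in Sect. 2: (retained customers, churned customers).
class_counts = {
    "transaction": (7963, 2037),
    "telecommunication": (5174, 1869),
    "customer churn": (50221, 49779),
}


def churn_rate(retained, churned):
    """Fraction of all customers who left."""
    return churned / (retained + churned)


def imbalance_ratio(retained, churned):
    """Size of the larger class divided by the size of the smaller class."""
    return max(retained, churned) / min(retained, churned)


# Map each dataset name to (churn rate, imbalance ratio).
summary = {
    name: (churn_rate(*counts), imbalance_ratio(*counts))
    for name, counts in class_counts.items()
}

for name, (rate, ratio) in summary.items():
    print(f"{name}: churn rate {rate:.2%}, imbalance ratio {ratio:.2f}")
```

A ratio close to 1 corresponds to the balanced customer churn dataset, whereas the clearly larger ratios of the other two datasets are what motivates resampling.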

3 Methodology

Predicting customer churn poses a significant challenge in the business domain. The ability to forecast churn rates has markedly improved with advancements in data mining tools and machine learning classifiers. From this study’s standpoint, customer churn prediction involves predicting whether customers belong to churner or non-churner categories based on their historical behavior data. The main steps of the methodology are described in the following subsections.

3.1 Data preprocessing and visualization

Data preprocessing approaches encompass transformation methods that redefine a target variable and approaches enabling the estimation of uplift by introducing additional predictor variables incorporated within a standard predictive framework (Devriendt et al. 2021). To preprocess the data, we initiated with data cleaning. This involved conducting Exploratory Data Analysis (EDA) to ensure that all features possessed the correct data type, handling missing values, eliminating duplicate records and outliers from the datasets, and checking for rare categories. Some categories may be prevalent in the dataset, while others may appear only in a few observations. Rare values in categorical variables tend to lead to overfitting, especially in tree-based methods. Furthermore, rare labels may be present in the training set but not in the test set, or vice versa, causing the classifier to struggle to accurately evaluate them. Moving to the feature engineering process, some features are continuous, as illustrated in Fig. 1:“Balance” and “EstimatedSalary” in the transaction dataset. Some features are numerical such as those shown in Fig. 2: “TotalCharges”, “MonthlyCharges”, and “tenure” in the telecommunication dataset. Next, we applied Feature Encoding techniques to convert categorical features into numerical ones, which is crucial for achieving high accuracy. Then, we scaled numeric features to ensure they had similar scales and were not biased during model training. Across all three datasets, we utilized “Label Encoder” for encoding and employed “Standard Scaler” for feature scaling and transformation. Subsequently, we assessed the correlation between independent and dependent features using the correlation matrix. Highly correlated features that could impact churn prediction quantitatively were removed, as mentioned earlier. 
Finally, recognizing that the transaction and telecommunication datasets were imbalanced, we addressed this issue by employing the SMOTE-ENN resampling method.
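To make the encoding and scaling steps concrete, the following library-free sketch shows what label encoding and standard scaling compute. It is an illustration under stated assumptions, not the study’s preprocessing code: the toy values are invented, and the helper names `label_encode` and `standard_scale` are hypothetical.

```python
from statistics import fmean, pstdev


def label_encode(values):
    """Map each distinct category to an integer, following sorted category order."""
    classes = sorted(set(values))
    index = {category: i for i, category in enumerate(classes)}
    return [index[v] for v in values], classes


def standard_scale(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:  # a constant column carries no information to scale
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical toy columns, invented for illustration only.
geography = ["France", "Spain", "Germany", "France"]
balance = [0.0, 1000.0, 2000.0, 3000.0]

encoded, classes = label_encode(geography)
scaled = standard_scale(balance)

print(classes, encoded)
print([round(v, 3) for v in scaled])
```

Note that label encoding imposes an arbitrary order on the categories, which tree-based classifiers generally tolerate better than distance-based ones.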

Fig. 1 Histogram plots for continuous features in transaction dataset

Fig. 2 Box plots for numerical features in telecommunication dataset

3.2 Solving class imbalance and class overlap

In binary classification problems, both class overlap and class imbalance pose significant challenges. These issues often exhibit the most substantial negative impact among potential factors, leading to poor performance across various classifiers (Vuttipittayamongkol et al. 2021). Class imbalance refers to an uneven distribution of classes within a dataset, where the minority class is notably smaller in size yet holds primary significance with a relatively high misclassification cost. Class overlap, in turn, occurs when there is ambiguity between two classes. These problems are typically addressed using a kernel function that maps the original data to a high-dimensional feature space, aiming to maximize the linear separability of the original data within this space. Overlapping data contain regions where the a priori probabilities of the classes are nearly equal, which makes it difficult or even impossible to distinguish between the two classes in those regions (Lee and Kim 2018). Many experts suggest that combining oversampling and undersampling methods with data cleansing approaches, such as SMOTE-Tomek and SMOTE-ENN, may be considered a feasible solution for these challenges (Santos et al. 2022). In our study, we chose to apply the SMOTE-ENN method because its undersampling step is more rigorous than that of SMOTE-Tomek. SMOTE (Synthetic Minority Oversampling Technique) achieves oversampling by creating artificial examples based on similarities in the feature space among existing minority samples (Verbeke et al. 2012), while ENN (Edited Nearest Neighbour) conducts undersampling by removing samples, from both the original and the synthetic data, that are misclassified by their nearest neighbours. This approach effectively eliminates majority class instances close to the border where misclassifications occur (Santos et al. 2022). In the current research, we utilized both oversampling and undersampling to address the challenges posed by overlapping and imbalanced datasets, namely the transaction and telecommunication data. Since the customer churn dataset is balanced, we did not need to apply these techniques. As demonstrated in Code 1, we imported SMOTE-ENN from the “imblearn.combine” module within the Jupyter notebook environment for the transaction dataset. We applied the same code to the telecommunication dataset.

Code 1 Solving imbalanced data using SMOTE-ENN in transaction dataset
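As an independent illustration of the two stages of SMOTE-ENN (not the authors’ listing and not the “imblearn” implementation), the sketch below interpolates new minority samples and then applies a majority-vote nearest-neighbour edit. It assumes a binary problem, Euclidean distance, and k = 3; the toy points and all function names are hypothetical, and library implementations may use a different neighbour count or a stricter agreement rule.

```python
import math
import random


def nearest(idx, X, k, candidates):
    """Indices of the k points among `candidates` closest to X[idx], excluding idx."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(X[idx], X[j]))
    return others[:k]


def smote(X, y, minority, k=3, rng=None):
    """Add interpolated minority samples until the two classes have equal size."""
    rng = rng or random.Random(0)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_new = (len(y) - len(min_idx)) - len(min_idx)  # majority size minus minority size
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.choice(min_idx)                    # a real minority sample
        j = rng.choice(nearest(i, X, k, min_idx))  # one of its minority neighbours
        gap = rng.random()                         # position on the segment i -> j
        X_new.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
        y_new.append(minority)
    return X_new, y_new


def enn(X, y, k=3):
    """Keep only samples whose label matches the majority of their k neighbours."""
    everyone = range(len(X))
    keep = []
    for i in everyone:
        votes = [y[j] for j in nearest(i, X, k, everyone)]
        if votes.count(y[i]) > k / 2:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: class 0 is the majority, class 1 the minority.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.0, 0.4), (0.4, 0.0),
     (3.0, 3.0),  # a class 0 point lying inside the class 1 region (overlap)
     (3.1, 2.9), (2.9, 3.1), (3.2, 3.2)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

X_bal, y_bal = smote(X, y, minority=1, rng=random.Random(42))
X_clean, y_clean = enn(X_bal, y_bal)

print("after SMOTE:", y_bal.count(0), "vs", y_bal.count(1))
print("after ENN:  ", y_clean.count(0), "vs", y_clean.count(1))
```

In this toy example the oversampling step balances the classes, and the editing step then discards the majority point that sits in the overlap region, which is the behaviour described above.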

3.3 Feature selection

Feature selection is a technique used in classification to identify and retain the most relevant features from a larger set of features in a dataset. It can significantly improve a classifier’s performance and helps avoid overfitting by reducing the number of features the classifier uses. Researchers have developed methods for feature selection aimed at reducing dimensionality, computational complexity, and overfitting. These techniques aim to extract the most relevant features from customers’ input vectors for prediction (Coussement 2014). One of the most commonly used feature selection methods is SelectKBest, a filter-based method. In filter-based feature selection, the selection process is independent of any specific machine learning algorithm. Instead, it relies on statistical measures to score and rank the features (Effrosynidis and Arampatzis 2021).

3.3.1 Permutation feature selection

Permutation Feature Importance (PFS) is a technique used in machine learning to assess the importance of different features in a predictive model. PFS is a model inspection method that measures the contribution of each feature to a fitted model’s statistical performance on a given dataset. This technique is particularly useful for non-linear or opaque estimators. It involves randomly shuffling the values of a single feature and observing the resulting degradation in the model’s score. By breaking the relationship between the feature and the target, PFS determines how much the model relies on that particular feature (Altmann et al. 2010). PFS measures the importance of a feature by calculating the increase in the model’s prediction error after permuting the feature. A feature is considered “important” if shuffling its values increases the model error, indicating that the model relied on the feature for its predictions. Conversely, a feature is deemed “unimportant” if shuffling its values leaves the model error unchanged, indicating that the model ignored the feature for its predictions (Fisher et al. 2019). The permutation importance algorithm is as follows (Fisher et al. 2019):

Input: Trained model \(m\), feature matrix \(X\), target vector \(y\), error measure \(L(y, m)\)

  1. Estimate the original model error \(e_{orig} = L(y, m(X))\) (e.g. mean squared error)

  2. For each feature \(j\) do:

  • Generate feature matrix \(X_{perm}\) by permuting feature \(j\) in the data \(X\). This breaks the association between feature \(j\) and true outcome \(y\).

  • Estimate error \(e_{perm} = L(y, m(X_{perm}))\) based on the predictions of the permuted data.

  • Calculate permutation feature importance as quotient \(FI_j = e_{perm} / e_{orig}\).

  3. Sort features by descending \(FI_j\)
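The algorithm above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the “trained model” is a hypothetical stand-in function that uses only the first feature, the data are synthetic, and each feature is shuffled once. Library implementations may instead report a difference in score averaged over several shuffles.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Error measure L: average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_ratio(model, X, y, loss, rng):
    """Return {feature index: e_perm / e_orig}, ordered by descending importance."""
    e_orig = loss(y, [model(row) for row in X])  # step 1: baseline error
    importances = {}
    for j in range(len(X[0])):                   # step 2: one feature at a time
        column = [row[j] for row in X]
        rng.shuffle(column)                      # break the link between feature j and y
        X_perm = [row[:j] + (v,) + row[j + 1:] for row, v in zip(X, column)]
        e_perm = loss(y, [model(row) for row in X_perm])
        importances[j] = e_perm / e_orig
    # step 3: sort features by descending importance
    return dict(sorted(importances.items(), key=lambda kv: kv[1], reverse=True))


# Synthetic data: y depends on feature 0 (plus small noise); feature 1 is irrelevant.
X = [(float(i), float((i * 7) % 5)) for i in range(20)]
y = [2 * i + (0.1 if i % 2 == 0 else -0.1) for i in range(20)]


def model(row):
    """Hypothetical stand-in for a trained model: uses feature 0, ignores feature 1."""
    return 2 * row[0]


importance = permutation_importance_ratio(
    model, X, y, mean_squared_error, random.Random(0)
)
print(importance)
```

Because the stand-in model never reads feature 1, shuffling it leaves the error unchanged and its ratio equals 1, while shuffling feature 0 inflates the error and pushes its ratio well above 1.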

There are two separate libraries for executing permutation feature importance in the Jupyter notebook environment: the first is “eli5.sklearn” and the second is “sklearn.inspection.” Since most of our experiments relied on the “sklearn” library, we preferred to use “sklearn.inspection” for the permutation feature importance, as displayed in Code 2, which illustrates PFS on the transaction dataset.

Code 2 Permutation feature importance in sklearn library in transaction dataset

3.3.2 SelectKBest feature selection

SelectKBest is a univariate method that retains only the specified number of highest-scoring features. It employs statistical tests such as the chi-squared test, ANOVA F-test, or mutual information score to rank the features based on their relationship with the output variable. Then, it selects the top K features with the highest scores for inclusion in the final feature subset. SelectKBest is user-friendly and efficiently reduces the feature set to a manageable number, which is particularly crucial when dealing with large datasets (Achal et al. 2023). SelectKBest has two parameters: K and the score function. K represents the number of top features to be selected, while the score function evaluates feature importance (Elmi et al. 2024). There are various types of score functions available. “f_regression” is used for regression and computes the F-value between each feature and the target. “mutual_info_regression” is used for regression and estimates the mutual information between each feature and a continuous target. “f_classif,” the default score function, is used for classification and computes the ANOVA F-value between each feature and the target. “mutual_info_classif” is used for classification and estimates the mutual information between each feature and a discrete target. “chi2” is employed for classification and computes the chi-squared statistic between each feature and the target. A related selector, “SelectPercentile,” keeps the highest-scoring X% of the features according to the score function (Karthikeyan et al. 2023). SelectKBest was imported from the “sklearn.feature_selection” module, and StratifiedKFold was employed for cross-validation (CV). In this study, since we are dealing with a binary classification problem, we performed SelectKBest with “f_classif” and set “k = 3”, as indicated in Code 3. For instance, in the customer churn dataset, “Age,” “Location,” and “Monthly_Bill” were identified as the three best-selected features.

Code 3 The SelectKBest feature selection
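To show what an ANOVA F-value based top-K selection computes, the following library-free sketch scores each feature with the one-way ANOVA F statistic and keeps the K best. It is an independent illustration rather than the authors’ listing: the column names, the toy values, and the helpers `anova_f` and `select_k_best` are hypothetical.

```python
from statistics import fmean


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across the classes in `labels`."""
    groups = {}
    for v, c in zip(values, labels):
        groups.setdefault(c, []).append(v)
    grand_mean = fmean(values)
    ss_between = 0.0  # variation of the class means around the grand mean
    ss_within = 0.0   # variation of the samples around their own class mean
    for g in groups.values():
        m = fmean(g)
        ss_between += len(g) * (m - grand_mean) ** 2
        ss_within += sum((v - m) ** 2 for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def select_k_best(columns, labels, k):
    """Return the names of the k highest-scoring columns (best first) and all scores."""
    scores = {name: anova_f(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k], scores


# Hypothetical toy data: three numeric features and a binary churn label.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "feature_a": [1, 2, 3, 7, 8, 9],   # class means far apart, small spread
    "feature_b": [1, 5, 9, 2, 6, 10],  # class means close, large spread
    "feature_c": [3, 1, 2, 4, 6, 5],   # in between
}

selected, scores = select_k_best(columns, labels, k=2)
print(selected)
print(scores)
```

A feature scores highly when its class means differ strongly relative to the spread inside each class, which is why the first toy column ranks ahead of the other two.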

3.4 Hyperparameter optimization

Hyperparameter optimization (HPO) involves identifying the optimal parameters and fine-tuning them within the context of machine learning models. This process typically entails iterating over a subset of tuning parameters related to regularization, kernels, or learning rates, in order to find the best combination that maximizes the model’s performance (Claesen et al. 2014). In this study, the hyperparameters of the classifiers were tuned with GridSearchCV and BayesSearchCV, which are described in the following subsections.

3.4.1 GridSearchCV

GridSearchCV, a powerful module from the “sklearn.model_selection” package, is used for hyperparameter optimization (HPO). It iterates through a subset of various parameters, combining them and fitting the model to the training dataset. Subsequently, it identifies the best values and combinations of parameters that yield the highest accuracy. This process involves setting up the hyperparameters in a discrete grid, testing all combinations of values within the grid, and evaluating performance using cross-validation (CV). The optimal mixture of hyperparameter values is determined by the point in the grid that maximizes the average value in cross-validation. GridSearchCV is renowned for selecting the most suitable execution parameters from a predefined set, which has been widely demonstrated to effectively discover and adjust hyperparameters, making it an excellent optimization tool for classifiers (Shuai et al. 2018). GridSearchCV was applied through the “sklearn.model_selection” module within the Jupyter environment. In this study, GridSearchCV was employed to optimize and fine-tune the hyperparameters of the classifiers, with StratifiedKFold chosen as the cross-validation strategy. Code 4 shows the procedure of using GridSearchCV to find and tune the hyperparameters of the AdaBoost classifier.

Code 4 The GridSearchCV hyperparameters optimization
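The mechanics of an exhaustive grid search with stratified cross-validation can be sketched without any library. The example below is an independent illustration, not the authors’ listing: it tunes a hypothetical k-nearest-neighbour classifier instead of AdaBoost to stay short, uses a simple round-robin stratification that need not match the fold assignment a library would produce, and runs on invented toy data.

```python
from itertools import product
from statistics import fmean


def stratified_folds(y, n_splits):
    """Assign sample indices to folds so that each fold keeps the class proportions."""
    folds = [[] for _ in range(n_splits)]
    for label in sorted(set(y)):
        members = [i for i, c in enumerate(y) if c == label]
        for pos, i in enumerate(members):
            folds[pos % n_splits].append(i)
    return folds


def knn_predict(train, x, k, p):
    """Majority label among the k training points nearest to x (Minkowski order p)."""
    def dist(a, b):
        return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1 / p)
    nearest = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    labels = [label for _, label in nearest]
    return max(sorted(set(labels)), key=labels.count)


def grid_search(X, y, grid, n_splits=3):
    """Evaluate every parameter combination; return the best one and all mean scores."""
    folds = stratified_folds(y, n_splits)
    results = {}
    for values in product(*grid.values()):  # every point of the discrete grid
        params = dict(zip(grid, values))
        fold_scores = []
        for test_idx in folds:              # each fold serves once as the test set
            train = [(X[i], y[i]) for i in range(len(X)) if i not in test_idx]
            hits = sum(knn_predict(train, X[i], **params) == y[i] for i in test_idx)
            fold_scores.append(hits / len(test_idx))
        results[values] = fmean(fold_scores)  # mean cross-validated accuracy
    best = max(results, key=results.get)
    return dict(zip(grid, best)), results


# Hypothetical toy data: nine class 0 points and three class 1 points.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.4), (0.5, 0.2), (0.3, 0.3), (0.6, 0.1),
     (0.4, 0.5), (0.2, 0.6), (0.7, 0.4), (1.0, 1.1), (1.2, 0.9), (1.1, 1.3)]
y = [0] * 9 + [1] * 3

grid = {"k": [1, 3], "p": [1, 2]}
folds = stratified_folds(y, 3)
best_params, results = grid_search(X, y, grid)
print(best_params, results)
```

The cost grows with the product of the grid sizes times the number of folds, which is the main reason a sequential strategy such as Bayesian optimization can reach good settings with fewer evaluations.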

3.4.2 BayesSearchCV

BayesSearchCV is a tool for Bayesian optimization over hyperparameters. Bayesian optimization is a sequential design strategy for the global optimization of black-box functions that does not assume any specific functional forms. It is usually employed to optimize expensive-to-evaluate functions (Victoria and Maragatham 2021). BayesSearchCV, a powerful module from the “Scikit-Optimize (skopt)” package, is used for Hyperparameter Optimization (HPO). One advantage of BayesSearchCV is its ability to reach the best results more quickly than GridSearchCV with the same parameters, although it may not be as precise in some cases. In our work, we employed both GridSearchCV and BayesSearchCV for hyperparameter tuning. The hyperparameters obtained from GridSearchCV remain constant and fixed in each run of the tuning program (Zhao et al. 2024). Code 5 illustrates the procedure of using BayesSearchCV with the Random Forest classifier.

Code 5 The BayesSearchCV hyperparameters optimization

4 Experimental setup

We chose Decision Tree, Random Forest, XGBoost, AdaBoost, and Extra Trees as the five robust and practical classifiers. To identify the most influential features in the datasets, we performed feature selection (FS) methods. We utilized the SelectKBest feature selection method with the default “f_classif” parameter, as well as Permutation feature selection. These two methods assign higher ranks to features based on their importance. We then fine-tuned their hyperparameters to optimize performance. We applied SelectKBest (SKB) and GridSearchCV (GS) together, and Permutation feature selection (PFS) and BayesSearchCV (BS) together for all classifiers separately in each of the three datasets. For ensemble classification, we proposed the “Oracle” meta-classifier, combining the classifiers to achieve the highest precision rate. Additionally, we compared the experimental results with the “Stacking” ensemble meta-classifier. Implementation-wise, we utilized Decision Tree from the “sklearn.tree” library, and Random Forest, AdaBoost, Extra Trees, and Stacking from the “sklearn.ensemble” library. XGBoost was implemented using the “XGBoost” library. Furthermore, we imported the Oracle classifier from the “DESlib” library, which is thoroughly documented and facilitates the execution of classification models to achieve high accuracy. Additionally, we executed ten runs per dataset (per random seed from a list of values [n1, n2, …, n10]) to report both the mean and standard deviation of each evaluation, confirming the superiority of the new proposed meta-classifier.

4.1 Decision tree

A Decision Tree is a tree-shaped structure that derives classification rules from a training dataset through inductive learning. It partitions a large dataset into smaller sets sequentially (Çelik and Osmanoglu 2019). In this structure, the leaves represent class labels, such as “churner” or “non-churner,” while branches illustrate the connections of variables leading to these class labels. Despite its inability to capture complex and nonlinear relationships among variables, a Decision Tree often achieves high accuracy in churn prediction problems (Asthana 2018; Vafeiadis et al. 2015). The process of improving a Decision Tree involves two main stages: building and pruning. During the building phase, the dataset is recursively partitioned until all samples in each partition share identical values. In the pruning phase, certain branches, typically those with the highest calculated error rate or containing noisy data, are removed to refine the tree structure (Johny and Mathai 2017).

4.2 Random forest

Random Forest comprises multiple Decision Trees and combines bagging with random subspace selection. Each tree is fitted to a bootstrap-resampled copy of the training dataset, and at each node only a random subset of variables is considered, from which the best split for that node is chosen. Drawing a fresh random subset of variables at every split de-correlates the trees. The final prediction is the majority vote of the individual trees. By bagging a large number of trees, Random Forest addresses some weaknesses of non-ensemble Decision Trees, such as overfitting and lack of robustness (Çelik and Osmanoglu 2019).

4.3 Extreme gradient boosting (XGBoost)

XGBoost is a supervised classifier designed for tabular or structured datasets. It employs Decision Trees within a gradient boosting framework engineered to shorten computation time, reduce memory usage, and improve accuracy (Çelik and Osmanoglu 2019). In gradient boosting, each new model is fitted to the residual errors of the models built so far, and the models are then combined into the final prediction classifier. XGBoost uses gradient descent to minimize the loss function (Lalwani et al. 2022).
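The residual-fitting idea can be shown with a few lines of plain gradient boosting. This sketch is deliberately not XGBoost itself: it uses squared loss on a toy regression problem and omits XGBoost's regularization and second-order terms. The data, learning rate, and tree depth are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())          # initial constant model
mse_history = [float(np.mean((y - prediction) ** 2))]

for _ in range(20):
    # For squared loss, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Gradient-descent step in function space: add a damped correction.
    prediction += learning_rate * tree.predict(X)
    mse_history.append(float(np.mean((y - prediction) ** 2)))
```

Each round shrinks the training error a little, which is the behaviour that the loss-minimization description above refers to.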

4.4 Adaptive boosting (AdaBoost)

AdaBoost is an adaptive supervised classifier that combines multiple weak or base learners, such as Decision Trees, into a strong learner. While individual learners may be weak, as long as each performs slightly better than random guessing, the final model is proven to converge to a strong learner. Interestingly, AdaBoost, although primarily employed to combine weak base learners, has also proved effective in combining strong base learners, yielding an even more accurate model. AdaBoost works by weighting the instances in the training dataset according to the accuracy of previous classifications. This adaptive approach allows AdaBoost to iteratively improve its performance by focusing more on misclassified instances in subsequent iterations (Wyner et al. 2017).
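One round of the instance reweighting can be worked through by hand. The sketch below follows the classic discrete AdaBoost update with labels in {-1, +1}; the five labels and predictions are hypothetical toy values.

```python
import math

# True labels and the predictions of one weak learner (toy values).
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]   # the second instance is misclassified

n = len(y_true)
weights = [1.0 / n] * n       # start from uniform instance weights

# Weighted error of the weak learner.
err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# Weight of this learner in the final vote.
alpha = 0.5 * math.log((1.0 - err) / err)

# Misclassified instances (t * p == -1) are up-weighted, the rest down-weighted.
updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
total = sum(updated)
updated = [w / total for w in updated]   # renormalize to sum to 1
```

With these numbers the misclassified instance ends up carrying half of the total weight, so the next weak learner is pushed to get it right.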

4.5 Extremely randomized trees (Extra Trees)

Extra Trees is a classifier built on Decision Trees. Like Random Forest, Extra Trees randomizes certain decisions and subsets of data to mitigate overfitting. It also builds multiple trees and uses random subsets of features for node splits, but it differs from Random Forest in two key aspects. First, it does not bootstrap observations: each tree is trained on the whole training sample rather than on a sample drawn with replacement. Second, nodes are split at random cut points rather than at the best cut points. In other words, Random Forest randomizes both the observations and the features seen by each tree, whereas Extra Trees randomizes only the features and the split points. This random splitting mechanism can yield results somewhat better than those obtained by Random Forest (Sharaff and Gupta 2019).
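The two differences are visible in the default settings of scikit-learn's implementations. The following sketch fits both ensembles on synthetic data (an assumption) and reads the relevant attributes back.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
et = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

# Difference 1: Random Forest bootstraps the rows by default, Extra Trees does not.
rf_bootstrap = rf.bootstrap
et_bootstrap = et.bootstrap

# Difference 2: the base trees search for the best split in Random Forest
# and draw random split points in Extra Trees.
rf_splitter = rf.estimators_[0].splitter
et_splitter = et.estimators_[0].splitter
```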

5 Ensemble selection meta-classifier

One of the key considerations in optimizing a multiple classifier system is the selection of a suitable group of classifiers, often referred to as an “Ensemble of Classifiers,” from a pool of available options. An essential aspect of this process is ensemble selection, which aims to choose the most effective classifiers from a diverse range to achieve optimal recognition rates (Jan and Verma 2020). The ensemble selection mechanism functions as a meta-classifier, providing a collection of classifier implementations to assess each one’s efficiency. This method selects the most appropriate classifiers for class label prediction. This module adopts a fixed and definite subset of the primary ensemble to reduce space complexity, operating in two phases. The first phase, “orders-based selection,” involves selecting a relevant objective function to determine the pertinent classifiers. The second phase, “optimization-based selection,” entails finding an appropriate search algorithm to apply this criterion effectively (Wang et al. 2021).

5.1 Stacking ensemble selection meta-classifier

Stacking, also known as stacked ensembles or stacked generalization, is a robust ensemble strategy that combines predictions from several base classification models (classifiers) to yield a final prediction with enhanced performance (Dey and Mathur 2023). It is an advanced meta-classifier that goes beyond the concepts of bagging and boosting. As illustrated in Fig. 3, stacking does not merely aggregate the outputs of the classifiers; it adds a higher-level model, referred to as the meta-model or aggregator, which is trained on the outputs of the base classifiers to formulate the final prediction. By training multiple classifiers on the same problem and learning how to combine their outputs, stacking creates a new model with improved performance (Alexandropoulos et al. 2019). Code 6 demonstrates the incorporation of Decision Tree, Random Forest, XGBoost, AdaBoost, and Extra Trees assembled in the Stacking meta-classifier.

Fig. 3 Stacking ensemble selection meta-classifier procedure

Code 6 The stacking meta-classifier execution
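A minimal sketch of a stacking setup of this kind, assuming scikit-learn's `StackingClassifier`, is given below. It is not the authors' listing: the data are synthetic, the hyperparameters are arbitrary, the logistic-regression meta-model is an assumption, and XGBoost is left out so that the example depends on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Base classifiers whose outputs feed the meta-model.
base_learners = [
    ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
    ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    # An xgboost.XGBClassifier could be appended here if that package is installed.
]

# The meta-model is trained on cross-validated predictions of the base learners.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),  # assumed aggregator
    cv=5,
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
```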

5.2 Oracle ensemble selection meta-classifier

The Oracle is a meta-classifier composition method designed to select the base classifiers capable of accurately predicting the class label for a given instance. It is an abstract method that represents an ideal classifier selection scheme and is often used to estimate the upper bound of classification precision. Because the Oracle selects classifiers that predict the correct class label, it serves as an exemplary ensemble selection framework. During a training phase preceding classification, the regions of competence of each classifier are specified. Figure 4 outlines the Oracle procedure, comprising three distinct stages: (1) Generation, (2) Selection, and (3) Aggregation. In Generation, a pool of \(M\) classifiers \(C=\{c_{1},\dots,c_{M}\}\) is trained to produce a collection of accurate, diverse, and informative classifiers. There are six main strategies for generating a diverse pool of classifiers:

  1. Different Initializations: If the training process depends on initialization, employing various initializations may yield different classifiers.

  2. Different Parameters: This strategy involves generating base experts with varying configurations of hyperparameters, resulting in different decision boundaries.

  3. Different Architectures: This approach includes training multiple Multi-Layer Perceptron (MLP) neural networks with differing numbers of hidden layers.

  4. Different Classifier Models: Here, various classification models are combined to diversify the pool of classifiers.

  5. Different Training Sets: Each base classifier is trained using a distinct distribution of the training set.

  6. Different Feature Sets: This methodology is applicable in scenarios where data can be represented in distinct feature spaces.

In this study, we adopted the second and sixth strategies, using hyperparameter configurations and feature selection to construct the meta-classifier. In the Selection phase, an ensemble of the most suitable classifiers (\(C' \subseteq C\)) is chosen. Here, \(C'\) is selected during the training phase, according to a selection criterion estimated on the validation dataset. The same ensemble \(C'\) is used to predict the label of all test samples in the generalization phase. In the Aggregation phase, the outputs of the selected base classifiers are merged according to a combination rule to predict class labels. The combination can be based on class labels, as in the Majority Voting scheme, or on the scores produced by each base classifier for each class in the classification problem. In the latter case, the scores from the base classifiers are interpreted as fuzzy class memberships or as the posterior probability that a test sample \(x_{j}\) belongs to a particular class (Cruz et al. 2018).
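The two families of combination rules mentioned for the Aggregation phase can give different answers for the same sample. The sketch below uses hypothetical scores from three selected classifiers to show label-based majority voting next to score averaging.

```python
from collections import Counter

# Hypothetical posterior scores [non-churner, churner] from three base classifiers
# for a single test sample x_j.
scores = [
    [0.45, 0.55],
    [0.48, 0.52],
    [0.90, 0.10],
]
classes = ["non-churner", "churner"]

# Label-based rule: each classifier casts one vote for its most probable class.
votes = [classes[row.index(max(row))] for row in scores]
majority_label = Counter(votes).most_common(1)[0][0]

# Score-based rule: average the posteriors, then pick the most probable class.
mean_scores = [sum(col) / len(scores) for col in zip(*scores)]
average_label = classes[mean_scores.index(max(mean_scores))]
```

Here two classifiers lean slightly toward "churner" while the third is confident in "non-churner", so majority voting and score averaging disagree; which rule suits a given problem is a design decision.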

Fig. 4 Oracle ensemble selection meta-classifier procedure

As shown in Eq. 1, the level of competence \(\delta_{i,j}\) of a base classifier \(c_{i}\) is simply computed as the local accuracy achieved by \(c_{i}\) for the region of competence \(\theta_{j}\). The classifier with the highest level of competence \(\delta_{i,j}\) is selected.

$$\delta_{i,j}=\sum_{k=1}^{K}P\left(w_{l}\mid x_{k}\in w_{l},\,c_{i}\right)$$
(1)

Here, \(\theta_{j}=\{x_{1},\dots,x_{K}\}\) is the region of competence of the test sample \(x_{j}\) defined on the validation data, \(K\) is the size of the region of competence, and \(w_{l}\) denotes the correct label of \(x_{k}\). The Oracle is classically defined in the literature as a strategy that correctly classifies each query instance \(x_{j}\) if any classifier \(c_{i}\) from the pool of classifiers \(C\) predicts the correct label for \(x_{j}\). According to Eq. 2, the level of competence \(\delta_{i,j}\) of a base classifier \(c_{i}\) equals 1 if it predicts the correct label for \(x_{j}\), and 0 otherwise.

$$\delta_{i,j}=\begin{cases}1, & \text{if } c_{i} \text{ correctly classifies } x_{j}\\ 0, & \text{otherwise}\end{cases}$$
(2)
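The two competence measures can be computed on a toy example. The sketch below makes several assumptions that are not stated in the text: a one-dimensional feature, a region of competence formed by the K nearest validation samples, and hard (0/1) classifier outputs, so that Eq. 1 reduces to a count of correctly labelled neighbours.

```python
# Hypothetical validation set with true labels and hard predictions of two classifiers.
val_x = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
val_y = [0, 0, 0, 1, 1, 1]
val_pred = {
    "c1": [0, 0, 1, 1, 1, 1],
    "c2": [0, 1, 1, 1, 1, 0],
}

def region_of_competence(x_query, k):
    """Indices of the k validation samples closest to x_query."""
    order = sorted(range(len(val_x)), key=lambda i: abs(val_x[i] - x_query))
    return order[:k]

def local_accuracy(name, x_query, k=3):
    """Eq. 1 with hard outputs: number of neighbours the classifier labels correctly."""
    region = region_of_competence(x_query, k)
    return sum(1 for i in region if val_pred[name][i] == val_y[i])

def oracle_competence(prediction, true_label):
    """Eq. 2: 1 if the classifier is right for this sample, 0 otherwise."""
    return 1 if prediction == true_label else 0

x_j = 0.25
delta_local = {name: local_accuracy(name, x_j) for name in val_pred}
selected = max(delta_local, key=delta_local.get)

# Oracle view of one query whose true label is 0, where c1 outputs 1 and c2 outputs 0.
delta_oracle = [oracle_competence(1, 0), oracle_competence(0, 0)]
oracle_correct = any(delta_oracle)
```

Note that Eq. 2 needs the true label of the query itself, whereas Eq. 1 only needs the labels of the validation neighbours.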

The Oracle represents the ideal classifier selection scheme. It always selects the classifier that predicted the correct label, for any given query sample. The Oracle meta-classifier is a relatively new approach that has received limited attention in prediction problems, and even less in churn prediction. Due to its static and stable nature, this meta-classifier tends to perform better when integrated with static supervised classifiers. In this study, we employed Decision Tree, Random Forest, XGBoost, AdaBoost, and Extra Trees, as they inherently possess static characteristics. Code 7 exhibits the execution of these five classifiers within the Oracle meta-classifier. Additionally, we enhanced the Oracle’s structure in two separate configurations: SelectKBest feature selection with GridSearchCV, and Permutation feature selection with BayesSearchCV.

Code 7 The Oracle meta-classifier execution
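How the Oracle score relates to the scores of the individual pool members can be sketched without any library. The example below is not the authors' listing and does not use DESlib; it applies Eq. 2 directly to a hypothetical matrix of hard predictions from a pool of three classifiers.

```python
# Rows: classifiers in the pool; columns: test samples (hypothetical hard predictions).
pool_predictions = [
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
]
y_true = [1, 1, 0, 1, 1, 0]

def accuracy(pred, truth):
    """Fraction of samples on which pred agrees with truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

single_accuracies = [accuracy(row, y_true) for row in pool_predictions]

# Eq. 2 per sample: the Oracle counts a sample as correct if any pool member is correct.
oracle_hits = [
    any(row[j] == y_true[j] for row in pool_predictions)
    for j in range(len(y_true))
]
oracle_accuracy = sum(oracle_hits) / len(y_true)
```

Because a sample is counted as correct whenever at least one pool member is right, the Oracle accuracy can never fall below that of the best individual classifier, which is why it serves as an upper bound for the pool.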

6 Evaluation and results

This study focuses on developing customer churn prediction by employing static supervised classifiers on transaction, telecommunication, and customer churn datasets. We implemented Decision Tree, Random Forest, XGBoost, AdaBoost, and Extra Trees in constructing Stacking and Oracle meta-classifiers. BayesSearchCV and GridSearchCV hyperparameter optimizations were applied to each classifier to determine the most suitable parameters. Additionally, we incorporated Permutation and SelectKBest filter-based feature selection techniques. This integration of Stacking and Oracle meta-classifiers with feature selection and hyperparameter optimization aims to enhance churn prediction accuracy. To assess the superiority of the Oracle meta-classifier, we conducted two different analyses. First, we integrated SelectKBest feature selection with GridSearchCV for all classifiers across the three datasets. Second, we employed Permutation feature selection with BayesSearchCV. The experimental results indicate that SelectKBest with GridSearchCV yields slightly better results than Permutation feature selection with BayesSearchCV. However, regardless of the feature selection and hyperparameter optimization methods used, the Oracle meta-classifier consistently achieves higher prediction accuracy and lower error rates. With SelectKBest and GridSearchCV, the accuracy of Oracle reached 91.80%, 90.94%, and 90.73% in the transaction, telecommunication, and customer churn datasets, respectively. This performance surpasses that of each individual classifier (Decision Tree, Random Forest, XGBoost, AdaBoost, Extra Trees), as well as the Stacking meta-classifier. To further validate the superiority of the Oracle meta-classifier, we executed ten runs per dataset to report both the mean and standard deviation of each evaluation, confirming the effectiveness of the newly proposed meta-classifier.
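The ten-run protocol can be sketched as a loop over seeds that collects one score per run. The seeds below are placeholders (the actual seed values are not given), the data are synthetic, and a single Decision Tree stands in for the full set of models.

```python
from statistics import mean, stdev

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
seeds = list(range(10))   # placeholder seeds, one per run

accuracies = []
for seed in seeds:
    # The seed controls both the train/test split and the classifier's randomness.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    clf = DecisionTreeClassifier(max_depth=4, random_state=seed).fit(X_tr, y_tr)
    accuracies.append(float(accuracy_score(y_te, clf.predict(X_te))))

mean_accuracy = mean(accuracies)
std_accuracy = stdev(accuracies)   # sample standard deviation over the ten runs
```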
Tables 1, 2, and 3 present the mean and standard deviation of experimental results for all classifiers, Stacking, and Oracle across the transaction, telecommunication, and customer churn datasets, respectively. Oracle also achieves superior mean Precision, Recall, F1 Score, ROC-AUC Score, and Cohen’s Kappa Score compared to the other classifiers and the Stacking meta-classifier, with lower standard deviations in each of these metrics. Taking the ROC-AUC Score as a standard criterion for performance appraisal, we grade classifiers by their scores. A score below 0.6 denotes a “weak discriminator” between two classes. Scores between 0.6 and 0.7 signify an “acceptable discriminator,” while scores ranging from 0.7 to 0.8 indicate an “excellent discriminator.” When the score falls between 0.8 and 0.9, the classifier is deemed a “perfect discriminator” (Mallett et al. 2014). Referring to Table 1, in the transaction dataset, the Random Forest classifier yields an ROC-AUC below 0.6, indicating weak discrimination between churner and non-churner classes. Decision Tree and Extra Trees fall within the range of acceptable discrimination. XGBoost, AdaBoost, and Stacking exhibit excellent discrimination. Notably, the Oracle meta-classifier outperforms the others with a higher ROC-AUC score. In Table 2, concerning the telecommunication dataset, Decision Tree achieves acceptable discrimination. Random Forest, XGBoost, and Extra Trees demonstrate excellent discrimination. AdaBoost and Stacking achieve perfect discrimination, yet Oracle surpasses them all. As shown in Table 3, on the customer churn dataset, all classifiers, including Stacking, exhibit weak performance, whereas Oracle excels as an excellent discriminator. Cohen’s Kappa Score is a statistical measure used to quantify the level of agreement between two raters (Vetter and Schober 2018).
The interpretation of Cohen’s Kappa Score results can be summarized as follows. In Tables 1, 2, and 3, values less than 0, such as those for Decision Tree, Random Forest, and AdaBoost in the customer churn dataset, suggest no agreement. Values between 0.01 and 0.20, such as those for XGBoost, Extra Trees, and Stacking in the customer churn dataset, indicate “none to slight” agreement. Values between 0.2 and 0.4, observed in Decision Tree and Random Forest across both the transaction and telecommunication datasets, are considered “fair” agreement. Values between 0.4 and 0.6, found in XGBoost, AdaBoost, Extra Trees, and Stacking across both the transaction and telecommunication datasets, indicate “moderate” agreement. Values between 0.6 and 0.8, such as those for Oracle in all three datasets, are considered “substantial” agreement.
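For reference, Cohen's kappa for a binary classifier can be computed directly from the four confusion-matrix counts and then mapped to the bands listed above. The counts below are hypothetical and are not taken from Tables 1 to 3; the label for values above 0.8 is an assumption, since the text does not name that band.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a binary confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals of predictions and labels.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)

def kappa_band(kappa):
    """Agreement bands following the interpretation given in the text."""
    if kappa <= 0:
        return "no agreement"
    if kappa <= 0.20:
        return "none to slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"   # assumed label; this band is not named in the text

# Hypothetical counts for illustration only.
kappa = cohen_kappa(tp=45, fp=5, fn=10, tn=40)
band = kappa_band(kappa)
```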

Table 1 Summary of the results for classifiers in transaction dataset
Table 2 Summary of the results for classifiers in telecommunication dataset
Table 3 Summary of the results for classifiers in customer churn dataset

We also used a confusion matrix and classification report to examine errors and interpret the outcomes of binary churn classification. The True Positive Rate (TPR) denotes the percentage of customers who remained with the service. In most cases, this proportion exceeds the True Negative Rate (TNR), which represents the percentage of customers who discontinued the service and switched to competitors. Positive Predictive Value (PPV) identifies the percentage of customers who have not churned but have shown signs of disinterest or dissatisfaction at a specific time. These customers may opt to leave the service in the future. Service providers should implement new strategies to minimize churn because acquiring new customers is considerably more expensive than retaining existing ones. Negative Predictive Value (NPV) indicates the percentage of satisfied customers likely to remain loyal to the service over a certain period. They actively engage in purchasing products, receiving services, and conducting transactions, potentially becoming repeat customers. Tables 4, 5, and 6 reveal that the Oracle meta-classifier outperforms other classifiers in terms of TPR, TNR, PPV, and NPV. False Positive Rate (FPR) represents the proportion of customers wrongly identified as churners. False Negative Rate (FNR) indicates the ratio of actual churners not correctly identified. False Discovery Rate (FDR) includes all clients except potential churners, while False Omission Rate (FOR) includes all clients except potential loyal and repeat customers. FPR, FNR, FDR, and FOR are error ratios, and Tables 4, 5, and 6 demonstrate that Oracle has lower error ratios than the others. Accuracy (ACC) measures the proportion of churners and non-churners that are correctly classified. Oracle exhibits the highest accuracy rate among all mentioned algorithms.
Additionally, the Matthews Correlation Coefficient (MCC) is a more reliable metric for assessing classifier competence in identifying observed and predicted binary classifications. Oracle achieves MCC rates of 73.29%, 54.20%, and 68.37% across transaction, telecommunication, and customer churn datasets, indicating superior performance compared to other classifiers by a significant margin.
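All of the rates discussed above follow from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions and hypothetical counts; which class is treated as positive (retained customers or churners) is a labelling choice that must be fixed before the numbers are read.

```python
import math

def confusion_metrics(tp, fp, fn, tn):
    """Standard rates derived from a binary confusion matrix."""
    return {
        "TPR": tp / (tp + fn),   # true positive rate (recall)
        "TNR": tn / (tn + fp),   # true negative rate (specificity)
        "PPV": tp / (tp + fp),   # positive predictive value (precision)
        "NPV": tn / (tn + fn),   # negative predictive value
        "FPR": fp / (fp + tn),   # 1 - TNR
        "FNR": fn / (fn + tp),   # 1 - TPR
        "FDR": fp / (fp + tp),   # 1 - PPV
        "FOR": fn / (fn + tn),   # 1 - NPV
        "ACC": (tp + tn) / (tp + fp + fn + tn),
        "MCC": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Hypothetical counts, not taken from Tables 4 to 6.
metrics = confusion_metrics(tp=45, fp=5, fn=10, tn=40)
```

The four error ratios are the complements of TPR, TNR, PPV, and NPV, so a classifier that leads on the first group necessarily leads on the second.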

Table 4 Summary of the predictions for classifiers in transaction dataset
Table 5 Summary of the predictions for classifiers in telecommunication dataset
Table 6 Summary of the predictions for classifiers in customer churn dataset

7 Contributions and further directions

In this study, we introduced the Oracle meta-classifier to predict customer churn intention using transaction, telecommunication, and customer churn datasets. Oracle is a static and robust meta-classifier known for its high accuracy, especially when coupled with hyperparameter optimization and feature selection. We implemented Oracle using the “DESlib” library, known for its high test coverage. For hyperparameter tuning, we used GridSearchCV, a stable tool that systematically explores the specified parameter combinations to determine the best-performing set and produces consistent outputs across runs. Its detailed and explicit outputs helped us select the best parameters for each classifier. To forecast customer churn intention, we employed Decision Tree, Random Forest, XGBoost, AdaBoost, and Extra Trees classifiers. We also used the SelectKBest feature selection method with the f_classif score function to enhance classifier proficiency. StratifiedKFold was used for cross-validation to mitigate overfitting. Combining various classifiers within a meta-classifier framework yields the highest prediction accuracy. To further test the superiority of Oracle, we conducted a series of experiments using Permutation feature selection in conjunction with Bayesian hyperparameter optimization via BayesSearchCV. The outcomes of these experiments also affirm the superiority of Oracle over the other classifiers. Additionally, we compared Oracle with the Stacking ensemble meta-classifier to demonstrate Oracle’s robustness. Results indicate that Oracle outperforms individual classifiers and even the Stacking ensemble in predicting customer churn accurately.
For example, as shown in Fig. 5, the superiority of the Oracle meta-classifier over the other selected classifiers is evident. The area under the ROC curve for Oracle is larger than that of the other classifiers, indicating that Oracle outperforms them in customer churn prediction. Oracle demonstrates superior accuracy and precision in predicting customer churn behavior compared to both individual classifiers and meta-classifiers like Stacking. Specialists have the option to employ more sophisticated gradient boosting methods on Decision Trees, such as LightGBM and CatBoost, to enhance the efficiency and effectiveness of churn prediction. Additionally, they can use models with built-in cross-validation, such as LassoCV or RidgeClassifierCV. Furthermore, experts can leverage the Oracle meta-classifier not only for feature selection but also for addressing various classification challenges, including confidence estimation and identifying missing features. The versatility of the Oracle algorithm extends beyond churn prediction; it can be applied in decision-making contexts such as conversion rate prediction and business intelligence. Given its precision and adaptability, Oracle holds potential for practical applications in diverse prediction domains, including e-marketing, e-retailing, the stock market, and even healthcare.

Fig. 5 The ROC curve of the selected classifiers using the transaction dataset

8 Managerial implication

Given the rapidly evolving nature and expanding scope of the internet services industry, there has been a significant increase in the discernible shifts in customers’ expectations, desires, and decision-making processes. This behavior is primarily driven by the widespread accessibility of diverse products, services, and information. Within this dynamic environment, the bargaining power of customers and their ability to switch between services with just a few clicks has continued to grow. Consequently, customers can swiftly transition from one service provider to another without encountering significant entry barriers. Simultaneously, the proliferation of service providers and the range of services they offer have increased over time. This escalation poses a threat from new market entrants, particularly as dissatisfied customers readily shift to competitors if their service expectations are not met. Consequently, the internet services industry has become highly competitive, marked by disruptions, threats, and intense competition fueled by innovative products, new entrants, and existing rivals. As the usage of internet services among customers continues to rise, datasets related to transactions, telecommunications, and customer churn hold valuable insights. These datasets comprise pertinent information about customers’ demographic profiles and payment histories, facilitating the detection of customer behavior and consumption patterns. Given the importance of these datasets in predicting customer churn rates, service providers must meticulously observe and analyze customer interactions to mitigate churn, boost revenue, and maintain competitiveness. Analyzing and scrutinizing transactional, telecommunication, and customer churn datasets offer a more effective approach to predicting churn rates compared to other datasets. This is because these datasets reflect direct interactions with the service, akin to examining a shopping basket. 
In the realm of service-oriented businesses, customers represent the most valuable asset for every enterprise. Consequently, service providers must cultivate strong relationships with their customers by delivering tailored services based on their preferences to ensure satisfaction and retention. Predicting customer churn rates enables firms to devise practical strategies aimed at retaining customers. For instance, True Positive Rate (TPR) identifies the percentage of customers who have not churned, suggesting the importance of maintaining their loyalty by offering superior and expedited services. True Negative Rate (TNR) highlights the rate of customers who have churned, indicating the necessity of reaching out to them to understand their reasons for leaving and resolving any issues to encourage their return. Positive Predictive Value (PPV) denotes the proportion of customers who have not churned but are dissatisfied with the services. Offering incentives such as free services or discounts may prevent their departure. Negative Predictive Value (NPV) represents the percentage of satisfied and loyal customers, who can be incentivized to remain through rewards or bonuses for continued service usage. Additionally, involving them in feedback sessions and encouraging them to share suggestions for service improvement can enhance customer satisfaction and loyalty.

9 Discussion and conclusion

Electronic service platforms offer a wide array of purchase options, giving customers the ability to easily switch to competitors if unsatisfied. This increases customer bargaining power, pressuring firms to prioritize customer satisfaction and loyalty. Many businesses are shifting from product-centric to customer-centric approaches due to intense competition. Customer retention strategies, supported by customer churn prediction (CCP), are crucial in this environment. Retaining existing customers is prioritized because acquiring new ones is more costly. Predicting customer behavior and buying patterns offers competitive advantages, such as identifying bestselling products and understanding customer churn. Businesses use classifiers to recognize customer needs and consumption patterns, aiming to satisfy and retain current customers while attracting new ones. This study focused on forecasting customer churn intentions through the implementation of an innovative and robust meta-classifier. To accomplish this, we analyzed three distinct datasets: transaction records, telecommunications data, and customer churn information. Using Decision Tree, Random Forest, XGBoost, AdaBoost, and Extra Trees as the primary supervised classifiers across these datasets, we conducted separate cross-validation and evaluation procedures. Additionally, we employed SelectKBest and Permutation feature selection to identify the most effective features for achieving optimal accuracy. Hyperparameter optimization was carried out using GridSearchCV and BayesSearchCV. Subsequently, we integrated the refined classifiers into a new meta-classifier tailored to each dataset. The experimental findings highlight the superior accuracy of our proposed meta-classifier compared to traditional classifiers and even stacking ensemble methods.
These predictive insights provide valuable assistance to businesses in identifying potential churners and implementing proactive strategies to retain customers, thereby bolstering customer retention rates and ensuring long-term business viability.