1 Introduction

Recent advancements in our ability to collect, integrate, store, and analyze large amounts of data have led to the emergence of new challenges for machine learning methods. Traditional algorithms were designed to discover knowledge from static datasets. In contrast, contemporary data sources generate information characterized by both volume and velocity. Such a scenario is known as a data stream (Gama, 2010; Bahri et al., 2021; Read and Žliobaitė, 2023), and traditional methods lack the speed, adaptability, and robustness to succeed in it.

One of the biggest challenges, when compared to learning from static data, lies in the need to adapt to the evolving nature of data, where concepts are non-stationary and may change over time. This phenomenon is called concept drift (Krawczyk et al., 2017; Khamassi et al., 2018) and leads to degradation of the classifier, as knowledge learned on previous concepts may no longer be useful for recent instances. Recovering from concept drift requires either explicit drift detectors or implicit adaptation mechanisms.

Another vital challenge in data stream mining lies in the need for algorithms to display robustness to class imbalance (Krawczyk, 2016; Fernández et al., 2018a). Despite almost three decades of research, handling skewed class distributions is still a crucial domain of machine learning. It becomes even more challenging in the streaming scenario, where imbalance occurs simultaneously with concept drift: not only do the definitions of classes change, but the imbalance ratio becomes dynamic and class roles may switch. Solutions that assume fixed data properties cannot be applied here, as streams may oscillate between varying degrees of imbalance and periods of balance among classes. Furthermore, imbalanced streams can exhibit other underlying difficulties, such as small sample size, borderline and rare instances, overlap among classes, or noisy labels (Santos et al., 2022). Imbalanced data streams are usually handled via class resampling (Korycki & Krawczyk, 2020; Bernardo et al., 2020b; Bernardo & Della Valle, 2021a), algorithm adaptation mechanisms (Loezer et al., 2020; Lu et al., 2020), or ensembles (Zyblewski et al., 2021; Cano & Krawczyk, 2022). The problem is motivated by a plethora of real-world applications where data is both streaming and skewed, such as Twitter streams (Shah & Dunn, 2022), fraud detection (Bourdonnaye & Daniel, 2022), abuse and hate speech detection (Marwa et al., 2021), the Internet of Things (Sudharsan et al., 2021), or intelligent manufacturing (Lee, 2018). While there are several works on how to handle imbalanced data streams, there are no agreed-upon standards, benchmarks, or good practices that are necessary for fully reproducible, transparent, and impactful research.

Research goal. To create a standardized, exhaustive, and informative experimental framework for binary and multi-class imbalanced data streams, and conduct an extensive comparison of state-of-the-art classifiers.

Motivation. While there are many algorithms for drifting and imbalanced data streams in the literature, there is a lack of standardized procedures and benchmarks for evaluating these algorithms holistically. Existing studies are often limited to a selection of algorithms and data difficulties, typically consider only binary class data, and do not provide insights into what aspects of imbalanced data streams must be considered and translated into meaningful benchmark problems. There is a need for a unified and holistic evaluation framework for imbalanced data streams that could be used as a template for researchers to evaluate their newly proposed algorithms against the relevant methods in the literature. Additionally, an in-depth experimental comparison of state-of-the-art methods would allow us to gain valuable insights into which classifiers and learning mechanisms work under different conditions. Therefore, we propose an evaluation framework and perform a large-scale empirical study to obtain insights into the performance of the methods under an extensive and varied set of data difficulties.

Overview and contributions. This paper proposes a complete and holistic framework for benchmarking and evaluating classifiers for imbalanced data streams. We summarize existing works and organize them according to established taxonomies dedicated to skewed and streaming problems. We distill the most crucial and insightful problems that appear in this domain and use them to design a set of benchmark problems that capture distinctive learning difficulties and challenges. We compile these benchmarks into a framework embedding various metrics, statistical tests, and visualization tools. Finally, we showcase our framework by comparing 24 state-of-the-art algorithms, which allows us to choose the best-performing ones, discover in which specific areas they excel, and formulate recommendations for end-users. The main contributions of the paper are summarized as follows:

  • Taxonomy of algorithms for imbalanced data streams. We organize the methods in the state of the art according to established taxonomies that summarize recent progress in learning from imbalanced data streams and provide a survey of the most important contributions.

  • Holistic and reproducible evaluation framework. We propose a complete and holistic framework for evaluating classifiers for two-class and multi-class imbalanced data streams that standardizes metrics, statistical tests, and visualization tools to be used for transparent and reproducible research.

  • Diverse benchmark problems. We formulate a set of benchmark problems to be used within our framework. We capture the most vital and challenging problems present in imbalanced data streams, such as a dynamic imbalance ratio, instance-level difficulties (borderline, rare, and subconcepts), or the number of classes. Furthermore, we include real-world and semi-synthetic imbalanced problems, leading to a total of 515 data stream benchmarks.

  • Comparison among state-of-the-art classifiers. We conduct an extensive, comprehensive, and reproducible comparative study among 24 state-of-the-art stream mining algorithms based on the proposed framework and 515 benchmark problems.

  • Recommendations and open challenges. Based on the results of the exhaustive experimental study, we formulate recommendations for end-users that will allow them to understand the strengths and weaknesses of the best-performing classifiers. Furthermore, we formulate open challenges in learning from imbalanced data streams that should be addressed by researchers in the years to come.

Comparison with the most related experimental works. In recent years, several survey papers and large experimental studies touching on the joint areas of class imbalance and data streams were published. Therefore, it is important to understand the key differences between them and this work, as well as how our survey provides new insights into this topic that were not touched upon in previous works. Wang et al. (2018) proposed an overview of several existing techniques, both drift detectors and adaptive classifiers, and experimentally compared their predictive accuracy. While being the first dedicated study in this area, it was limited by not evaluating the computational complexity of the compared algorithms, using a very small selection of datasets (7 benchmarks), and investigating only limited properties of imbalanced data streams (not touching upon instance-level characteristics or multi-class problems). Brzeziński et al. (2021) proposed a follow-up study that focused on data-level properties of imbalanced streams, such as instance difficulties (borderline and rare instances) and the presence of subconcepts. However, the study covered a limited number of algorithms (5 classifiers) and focused only on two-class problems. Bernardo et al. (2021) proposed an experimental comparison of methods for imbalanced data streams, extending the Brzeziński et al. (2021) benchmarks with different levels of imbalance ratio and three drift speeds. However, their study analyzed a limited number of algorithms (11 classifiers) and only three real-world datasets. Cano and Krawczyk (2022) presented a large comparison of 30 algorithms focusing on ensemble approaches, but 21 of them were general-purpose ensembles rather than classifiers specific to imbalanced data. These four works address only binary class imbalanced data streams. This paper extends the benchmark evaluation from all previous studies, proposes new benchmark scenarios, extends the number of real-world datasets, and evaluates both two-class and multi-class imbalanced data streams. We also extend the comparison to 24 classifiers, 19 of them specifically designed for imbalanced data streams. Table 1 summarizes the main differences in the experimental evaluations of these works. This allows us to conclude that while these works are an important first step, there is a need for a unified, comprehensive, and holistic study of learning from imbalanced data streams that could be used as a template for researchers to evaluate their newly proposed algorithms.

Table 1 Comparison of the number of algorithms and benchmarks evaluated in most related works

This paper is organized as follows. Section 2 provides a background on data streams. Section 3 discusses the main challenges of imbalanced data. Section 4 presents the specific difficulties of imbalanced streams. Section 5 describes the approaches for tackling imbalanced streams with ensembles. Section 6 introduces the experimental setup and methodology. Section 7 presents and analyzes the results of our study. Section 8 summarizes the lessons learned. Section 9 formulates recommendations to end-users for selecting the best algorithms for imbalanced data streams. Section 10 discusses the open challenges and future directions. Finally, Sect. 11 covers the conclusions.

2 Data streams

In this section, we present the preliminaries of data stream characteristics, learning approaches, and concept drift properties.

2.1 Data stream characteristics

The main characteristics of data streams can be summarized as follows (Gama, 2010; Krempl et al., 2014; Bahri et al., 2021):

  • Volume. Streams are potentially unbounded collections of data that constantly flood the system; thus they cannot be stored in full and must be processed incrementally. The volume also imposes limitations on computational resources, which are orders of magnitude smaller than the actual size of the data would call for.

  • Velocity. Streaming data sources are in constant motion. New data is generated continuously and often in rapid bursts, leading to high-speed data streams. This forces learning systems to work in real time: new instances must be analyzed and incorporated as they arrive to model the current state of the stream.

  • Non-stationarity. Data streams are subject to change over time, which is known as concept drift. This phenomenon may affect feature distributions and class boundaries, but can also lead to changes in class proportions or the emergence of new classes (and disappearance of old ones).

  • Veracity. Data arriving from the stream can be uncertain and affected by various problems, such as noise, injection of adversarial patterns, or missing values. Having access to a fully labeled stream is often impossible due to cost and time requirements, leading to the need for learning from weakly labeled instances.

We can define a stream S as a sequence \(\langle s_1, s_2, s_3, \ldots , s_{\infty } \rangle\). We consider a supervised scenario \(s_i = (X, y)\), where \(X = [x_1, x_2, \ldots , x_{f}]\) with f as the dimensionality of the feature space, and y as the target variable, which may or may not be available on arrival. Within a single concept, each instance in the stream is assumed to be drawn independently from a stationary probability distribution. Figure 1 illustrates the workflow to learn from data streams and approaches to tackle the related challenges (Gama, 2012; Nguyen et al., 2015; Ditzler et al., 2015; Wares et al., 2019).

Fig. 1 Streaming learning taxonomy
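To make these definitions concrete, the following minimal sketch (illustrative only, not the framework used in this paper) models a drifting stream as a Python generator, pairs it with a deliberately simple incremental learner, and runs the test-then-train (prequential) evaluation loop used later in Sect. 6.3. All names are hypothetical.

```python
import numpy as np

def drifting_stream(n=10_000, drift_at=5_000, seed=0):
    """Yield (x, y) pairs; the class-conditional means shift at `drift_at`."""
    rng = np.random.default_rng(seed)
    for t in range(n):
        y = int(rng.random() < 0.5)
        mean = (1.0 if y == 1 else -1.0) + (2.0 if t >= drift_at else 0.0)
        yield rng.normal(loc=mean, scale=1.0, size=2), y

class NearestCentroid:
    """Toy incremental learner: per-class running means, no forgetting."""
    def __init__(self, n_features=2, n_classes=2):
        self.sums = np.zeros((n_classes, n_features))
        self.counts = np.zeros(n_classes)
    def predict(self, x):
        seen = self.counts > 0
        if not seen.any():
            return 0                              # no data yet: arbitrary default
        d = np.full(len(self.counts), np.inf)
        cent = self.sums[seen] / self.counts[seen][:, None]
        d[seen] = ((cent - x) ** 2).sum(axis=1)
        return int(np.argmin(d))
    def learn(self, x, y):
        self.sums[y] += x
        self.counts[y] += 1

def prequential(stream, model):
    """Test-then-train: predict on each instance first, then learn from it."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)
        total += 1
        model.learn(x, y)
    return correct / total

acc = prequential(drifting_stream(), NearestCentroid())
# accuracy degrades after the drift because this learner never forgets
```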

2.2 Learning model

Due to both the volume and velocity of data streams, algorithms need to be capable of incrementally processing the continuously arriving information. Instances from the data stream are provided either online or in the form of data chunks (portions, blocks).

  • Online. Algorithms process instances one by one. The main advantage of this approach is a low response time and adaptivity to changes in the stream. The main drawback lies in the limited view of the current state of the stream, as a single instance can be a poor representation of a larger concept or susceptible to noise.

  • Chunk. Instances are processed in windows called data chunks or blocks. Chunk-based approaches offer a better estimation of the current concept due to a larger training sample size. The main drawback is the delayed response to changes in some settings, because the construction, evaluation, or updating of classifiers is done only when all instances from a new block are available. Additionally, in the case of rapid changes, chunks may consist of instances coming from multiple concepts, further harming the adaptation capabilities.

  • Hybrid. Hybrid approaches combine the previous methodologies to address their shortcomings. One of the most popular approaches is to use online learning while maintaining chunks of data to extract statistics and useful knowledge about the stream for additional periodic classifier updates, as sketched below.
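The following sketch illustrates the hybrid scheme under simplified assumptions: the model learns online from every instance, while a buffer of the most recent chunk feeds a periodic maintenance step. The `on_chunk_complete` hook is a hypothetical placeholder, not a specific published method.

```python
from collections import Counter, deque

def on_chunk_complete(model, chunk, freqs):
    """Hypothetical hook: rebalance, prune, or retrain using chunk statistics."""
    pass

def hybrid_learning(stream, model, chunk_size=500):
    chunk = deque(maxlen=chunk_size)
    for t, (x, y) in enumerate(stream, start=1):
        model.learn(x, y)                    # online part: immediate update
        chunk.append((x, y))
        if t % chunk_size == 0:              # chunk part: periodic maintenance
            freqs = Counter(label for _, label in chunk)
            on_chunk_complete(model, chunk, freqs)
```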

2.3 Concept drift

Data streams are subject to a phenomenon called concept drift (Krawczyk et al., 2017; Lu et al., 2018). Each instance arrives at a time t and is generated by a probabilistic distribution \(\Phi ^{t} (X,y)\), where X corresponds to the feature vector and y to the class label. If the same probability distribution generates all instances in the stream, the data is stationary, i.e., originating from the same concept. On the other hand, if two instances arriving at times t and \(t + C\) are generated by distributions \(\Phi ^{t} (X,y)\) and \(\Phi ^{t+C} (X,y)\) such that \(\Phi ^{t} \ne \Phi ^{t + C}\), then a concept drift has occurred. When analyzing and understanding concept drift, the following factors are considered:

  • Influence on the decision boundaries. Here we distinguish: (i) virtual; and (ii) real types of drift. Virtual drift can be defined as a change in the unconditional probability distribution P(x), meaning it does not affect the learned decision boundaries. Such drift, while not having a deteriorating influence on learning models, must still be detected, as it may trigger false alarms and force unnecessary, yet costly, adaptation. Real concept drift affects the decision boundaries, making them invalid for the current concept. Detecting it and adapting to the new distribution is crucial for maintaining predictive performance.

  • Speed of change. Here we can distinguish three types of concept drift (Webb et al., 2016): (i) incremental; (ii) gradual; and (iii) sudden. Incremental drift generates a sequence of intermediate states between the old and new concepts. This requires detecting the stabilization moment when the new concept becomes fully formed and relevant. Gradual drift oscillates between instances coming from both the old and new concepts, with the new concept becoming more and more frequent over time. Sudden drift instantaneously switches between the old and new concept, leading to an instant degradation of the underlying learning algorithm.

  • Recurrence. Changes in the stream can be either unique or recurring. In the latter case, a previously seen concept may reemerge over time, allowing us to recycle previously learned knowledge. This calls for maintaining a repository of models that can be utilized for faster adaptation to previously seen changes. With more relaxed assumptions, one can extend recurrence to the appearance of concepts similar to the ones seen in the past. Here, the past knowledge can be used as an initialization point for drift recovery.

There are two strategies to tackle concept drift: explicit and implicit (Lu et al., 2018; Han et al., 2022):

  • Explicit. Here drift adaptation is managed by an external tool, called a drift detector (Barros & Santos, 2018). Detectors continuously monitor stream properties (e.g. statistics) or classifier performance (e.g. error rates). They raise a warning signal when there are signs of upcoming drift, and an alarm signal when the concept drift has taken place. When drift is detected, the classifier is replaced with a new one trained on recent instances. The pitfalls of drift detectors are the need for labeled instances (semi-supervised and unsupervised detectors also exist but are less accurate) and false alarms that replace competent classifiers; a minimal sketch of such a detector follows this list.

  • Implicit. Here drift adaptation is managed by learning mechanisms embedded in the classifier, assuming that it can adjust itself to new instances from the latest concept and gradually forget outdated information (Ditzler et al., 2015; da Costa et al., 2018). This requires establishing proper learning and forgetting rates, using adaptive sliding windows, or continual hyperparameter tuning.
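As an illustration of the explicit strategy, the sketch below implements an error-rate detector in the spirit of the classic DDM method from the detector family surveyed by Barros and Santos (2018). The thresholds and warm-up constant follow the usual DDM heuristics and are not taken from this paper.

```python
import math

class ErrorRateDetector:
    """DDM-style detector: track the online error rate p and its std s;
    warn at p+s >= p_min + 2*s_min, raise an alarm at p_min + 3*s_min."""
    def __init__(self, warmup=30):
        self.n = 0
        self.errors = 0
        self.warmup = warmup
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, correct: bool) -> str:
        self.n += 1
        self.errors += int(not correct)
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if self.n >= self.warmup and p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s        # new best operating point
        if p + s >= self.p_min + 3 * self.s_min:
            return "alarm"      # drift: replace the classifier, reset detector
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"    # start buffering recent instances
        return "stable"
```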

2.4 Access to labels

Obtaining the ground truth (e.g. class labels) in a data stream setting entails significant time and cost requirements. As instances arrive continuously and in large volumes, domain experts may not be able to label a significant portion of the data or may not be able to provide labels fast enough. In the case of applications where labels can be obtained at no cost (e.g. weather prediction), a significant delay between instance and label arrival must still be considered. Data streams can be divided into three groups concerning ground truth availability:

  • Fully-labeled. For every instance x in the stream the label y is known and can be used for training. This scenario assumes no need for explicit label query and is the most common one for evaluating stream learning algorithms. However, the assumption of a fully labeled stream may not be feasible for many real-world applications.

  • Partially labeled. Only a subset of instances in the stream is labeled on arrival. The ratio between labeled and unlabeled instances can change over time. This scenario requires either active learning for selecting the most valuable instances for labeling (Žliobaitė et al., 2013) or semi-supervised mechanisms for extending the knowledge from labeled instances onto unlabeled ones (Bhowmick & Narvekar, 2022; Gomes et al., 2022). A minimal sketch of the active learning option follows this list.

  • Unlabeled. Every instance arrives without a label, and one cannot obtain it upon request, or it arrives with a significant delay. This forces approximation mechanisms that can either generate pseudo-labels, look for evolving structures in the data, or use delayed labels to approximate future concepts.
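The sketch below illustrates uncertainty-based active learning for partially labeled streams: a label is queried only when the model is uncertain and the spent budget allows it. The interface (`predict_proba`, the oracle callback) is an assumption made for illustration, not a specific published algorithm.

```python
def active_learning_step(model, x, y_oracle, state, budget=0.1, margin=0.1):
    """Process one instance; query its label only under uncertainty + budget."""
    state["seen"] += 1
    confidence = max(model.predict_proba(x))     # assumed model interface
    if confidence < 0.5 + margin and state["queried"] / state["seen"] < budget:
        state["queried"] += 1
        model.learn(x, y_oracle())               # pay the cost of a true label
    return model.predict(x)

state = {"seen": 0, "queried": 0}                # e.g., a 10% labeling budget
```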

In this work, only fully labeled streams were used, but some of the evaluated algorithms possess mechanisms to deal with partially labeled or unlabeled streams.

3 Imbalanced data

In this section, we briefly discuss the main challenges present when learning from imbalanced data. Almost three decades of developments in this field have allowed us to gain deeper insights into what inhibits the performance of classifier training procedures under skewed distributions (Fernández et al., 2018a).

  • Imbalance ratio. The most obvious and well-studied property of imbalanced datasets is their imbalance ratio, i.e., the disproportion between the majority and minority classes. It is commonly assumed that the higher the imbalance ratio, the more difficulty it poses to a classifier. This is justified by the fact that most classifier training procedures are driven by 0-1 loss functions that assume uniform importance of every instance. Therefore, the more predominant the majority class is, the more the classifier becomes biased towards it. However, many recent studies have pointed out that the imbalance ratio is not the sole source of learning difficulties (He & Ma, 2013). As long as classes are well-separated and sufficiently represented in the training set, even a very high imbalance ratio will not significantly impair the classifier. Therefore, we must look into instance-level properties to find other sources of classifier bias.

  • Small sample size. The imbalance ratio is often accompanied by the fact that the minority class appears infrequently, and collecting a sufficient number of instances may be costly, time-consuming, or simply impossible. This leads to the issue of small sample size, where the minority class does not have a large enough training sample to allow classifiers to correctly capture its characteristics (Wasikowski & Chen, 2010). This, combined with a high imbalance ratio, can significantly affect the training procedure, leading to poor generalization capabilities and classification bias. Furthermore, a small sample size cannot guarantee that the training set is representative of the actual distribution, a problem known as data shift (Rabanser et al., 2019).

  • Class overlapping. Another challenge in imbalanced learning comes from the topology of classes, as minority and majority classes often overlap significantly. Class overlap poses a difficulty for standard machine learning problems (Galar et al., 2014), while the presence of skewed distributions makes it even more challenging (Vuttipittayamongkol et al., 2021). Overlapping regions can be seen as uncertainty regions for classifiers. In such a case, the majority class will dominate the training procedure, leading to a decision boundary that ignores the minority class in the overlapping area. This problem becomes even more difficult when dealing with multiple classes overlapping with each other.

  • Instance-level difficulties. The problem of class overlapping points to the importance of analyzing the properties of minority class instances and their individual difficulties. Minority classes often form small disjuncts, creating subconcepts that further reduce the minority class sample size in a given area (García et al., 2015). When looking at the individual properties of each instance, one can analyze its neighborhood to determine how challenging it will be for the classifier. A popular taxonomy divides minority instances into safe, borderline, rare, and outliers based on how homogeneous the class labels of their nearest neighbors are (Napierala & Stefanowski, 2016). This information can be utilized to either obtain more effective resampling approaches or guide the classifier training procedure.
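The taxonomy of Napierala and Stefanowski (2016) can be computed directly from the class homogeneity of the 5 nearest neighbors; a minimal sketch follows (the 4/2/1/0 same-class thresholds are the commonly used ones, applied here as an illustration).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def minority_difficulty(X, y, minority_label, k=5):
    """Label each minority instance as safe / borderline / rare / outlier."""
    X, y = np.asarray(X), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    # +1 neighbor because the first returned neighbor is the point itself
    _, idx = nn.kneighbors(X[y == minority_label])
    out = []
    for row in idx:
        same = int(np.sum(y[row[1:]] == minority_label))
        out.append("safe" if same >= 4 else
                   "borderline" if same >= 2 else
                   "rare" if same == 1 else "outlier")
    return out
```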

4 Imbalanced data streams

Class imbalance is one of the most vital problems in contemporary machine learning (Fernández et al., 2018a; Wang et al., 2019). It deals with a disproportion among the number of instances in each class, where some of the classes are significantly underrepresented. As most classifiers are driven by 0-1 loss, they become biased towards the easier-to-model majority classes. The underrepresented minority classes are usually the more important ones; thus one needs to alter either the dataset or the learning procedure to create balanced decision boundaries that do not favor any of the classes.

Class imbalance is a common problem in the data stream mining domain (Wu et al., 2014; Aminian et al., 2019). Here streams can have a fixed imbalance ratio, or it may evolve over time (Komorniczak et al., 2021). Furthermore, class imbalance combined with concept drift poses novel and unique challenges (Brzeziński & Stefanowski, 2017; Sun et al., 2021). Class roles may switch (the majority becomes the minority and vice versa), the set of classes may change (new classes appearing or old ones disappearing), or instance-level difficulties may emerge (evolving class overlapping or clusters/sub-concepts) (Krawczyk, 2016). Changes in the imbalance ratio can be independent of or connected with concept drift, where class definitions (\(P(y\mid x)\)) change over time (Wang & Minku, 2020). Hence, monitoring each class for changes in its properties is not enough, as one also needs to track the per-class frequencies of newly arriving instances.

In most real-life scenarios, streams are not predefined as balanced or imbalanced, and they may become imbalanced only temporarily (Wang et al., 2018). Users' interests over time (where new topics emerge and old ones lose relevance) (Wang et al., 2014), social media analysis (Liu et al., 2020), and medical data streams (Al-Shammari et al., 2019) are examples of such cases. Therefore, a robust data stream mining algorithm should display high predictive performance regardless of the underlying class distributions (Fernández et al., 2018a). Most algorithms dedicated to imbalanced data streams do not perform as well on balanced problems as their canonical counterparts (Cano & Krawczyk, 2020). On the other hand, these canonical algorithms display low robustness to high imbalance ratios. Only a few algorithms can handle both scenarios with satisfactory performance (Cano & Krawczyk, 2020, 2022).

There are two main approaches dedicated to handling imbalanced data:

  • Data-level approaches. These methods focus on the alteration of the underlying dataset to make it balanced (e.g. by oversampling or undersampling), thus being classifier-agnostic approaches. They focus on resampling or learning more robust representations.

  • Algorithm-level approaches. These methods focus on modifying the training approach to make classifiers robust to skewed distributions. They are dedicated to specific learning models, being often more specialized, but less flexible than their data-level counterparts. Algorithm-level modifications focus on identifying mechanisms that suffer from class imbalance, cost-sensitive learning, or one-class classification.

Fig. 2 Taxonomy of approaches for imbalanced data streams

Figure 2 presents a taxonomy (He & Garcia, 2009; Branco et al., 2016; Krawczyk, 2016; Fernández et al., 2018a) of approaches for tackling the class imbalance problem. The specific details are discussed in the following subsections.

4.1 Data-level approaches

While resampling techniques are very popular for static imbalanced problems (Fernández et al., 2018a; Aminian et al., 2021), they cannot be directly used in the streaming scenario. Concept drift may render resampled data obsolete or even harmful to the current state of the stream (e.g. when classes switch roles and resampling starts to further empower the new majority). This calls for dedicated strategies that keep track of which classes should be resampled at a given moment, as well as for mechanisms capable of dealing with drift by forgetting outdated artificial instances (Fernández et al., 2018a).

Resampling algorithms can be categorized as either blind or informed (utilizing information about minority class properties to at least some degree). While blind approaches can be effectively combined with ensembles due to their low computational cost, they do not perform well on their own. Therefore, most resampling methods for data streams are informed and based on the very popular SMOTE (Synthetic Minority Over-sampling Technique) algorithm (Fernández et al., 2018b). These versions keep track of changes in the stream by employing either adaptive windows (Korycki & Krawczyk, 2020) or data sketches (Bernardo & Della Valle, 2021a; Bernardo & Della Valle, 2021b). This allows them to generate relevant artificial instances for the current concept and display good reactivity to sudden changes in the stream. It is important to note that the streaming version of SMOTE presented in (Korycki & Krawczyk, 2020) can work with any number of classes, as well as under extremely limited access to class labels. Incremental Oversampling for Data Streams (IOSDS) (Anupama & Jena, 2019) focuses on replicating instances that are not identified as noisy or overlapping. Clustering of data chunks can be used to identify the most relevant instances to resample (Czarnowski, 2021). Undersampling via Selection-Based Resampling (SRE) (Ren et al., 2019) iteratively removes safe instances from the majority class without introducing a reverse bias towards the minority class. Some works present the usefulness of combining oversampling and undersampling to obtain a more diverse representation of the minority class (Bobowska et al., 2019). When handling multi-class imbalanced data streams, resampling can be either conducted using information about all classes (Korycki & Krawczyk, 2020; Sadeghi & Viktor, 2021) or by applying binarization schemes and pairwise resampling (Mohammed et al., 2020a). Active learning techniques such as dynamic budgets (Aguiar & Cano, 2023) and Racing Algorithms (Nguyen et al., 2018) are also combined with resampling techniques to overcome class imbalance (Mohammed et al., 2020b). Disadvantages of data-level methods lie in their high memory usage (when oversampling) or the possible under-representation of older concepts that are still relevant (when undersampling).
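To ground the discussion, the sketch below shows a generic window-based SMOTE for streams: a sliding window of recent minority instances bounds memory and forgets outdated concepts, and synthetic instances are interpolated towards nearby minority neighbors. This is an illustrative simplification, not the exact algorithms cited above.

```python
import numpy as np
from collections import deque

class WindowedSMOTE:
    def __init__(self, window=200, k=5, seed=0):
        self.window = deque(maxlen=window)   # bounded memory, forgets old concepts
        self.k = k
        self.rng = np.random.default_rng(seed)

    def add_minority(self, x):
        self.window.append(np.asarray(x, dtype=float))

    def sample(self):
        """Synthesize one minority instance, SMOTE-style."""
        if len(self.window) < 2:
            return None                      # not enough minority instances yet
        W = np.stack(self.window)
        base = W[self.rng.integers(len(W))]
        order = np.argsort(np.linalg.norm(W - base, axis=1))
        neighbors = order[1:self.k + 1]      # k nearest neighbors, excluding base
        neigh = W[self.rng.choice(neighbors)]
        gap = self.rng.random()
        return base + gap * (neigh - base)   # random point on the segment
```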

A study by Korycki and Krawczyk (2021b) discusses an alternative data-level approach to resampling. They propose to create dynamic and low-dimensional embeddings that use information about the class imbalance ratio and separability to find highly discriminative projections. A well-defined low-dimensional embedding may offer better class separability and thus make resampling obsolete, especially when dealing with high-dimensional and difficult imbalanced data streams.

4.2 Algorithm-level approaches

Among training modifications, the most popular is the combination of Hoeffding Decision Trees with the Hellinger splitting criterion to make them skew-insensitive (Lyon et al., 2014). Ksieniewicz (2021) proposed a method to modify the predictions of a base classifier on-the-fly, adjusting prior probabilities based on the frequency of each class. A new loss function was proposed to make neural networks able to handle imbalanced streams in an online setting (Ghazikhani et al., 2014). A combination of online active learning, siamese networks, and multi-queue memory was introduced by Malialis et al. (2022). Various modifications of the popular Nearest Neighbors classifier have been adapted to tackle imbalanced data streams by using either dedicated memory formation or skew-insensitive distance metrics (Vaquet & Hammer, 2020; Roseberry et al., 2019; Abolfazli & Ntoutsi, 2020). Genetic programming has been successfully used for inducing robust classifiers from the stream (Jedrzejowicz & Jedrzejowicz, 2020), as well as for increasing skew-insensitive rule interpretability and recovery speed from concept drift (Cano & Krawczyk, 2019).
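The Hellinger splitting criterion scores a candidate split by how differently it partitions each class, independently of the class priors, which is what makes it skew-insensitive. A minimal sketch for a binary split (a simplified reading of the criterion, with class counts assumed non-zero):

```python
import math

def hellinger_split_score(pos_left, pos_right, neg_left, neg_right):
    """Hellinger distance between the per-class partition distributions
    induced by a binary split; higher means a more discriminative split."""
    pos, neg = pos_left + pos_right, neg_left + neg_right
    return math.sqrt(
        (math.sqrt(pos_left / pos) - math.sqrt(neg_left / neg)) ** 2 +
        (math.sqrt(pos_right / pos) - math.sqrt(neg_right / neg)) ** 2)

# A split that isolates the minority class scores high regardless of skew:
print(hellinger_split_score(pos_left=95, pos_right=5,
                            neg_left=50, neg_right=4950))  # ~1.17
```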

Cost-sensitive methods have been applied to streaming decision trees. Krawczyk and Skryjomski (2017) proposed replacing leaves with perceptrons that use cost-sensitive threshold adjustment of class-based outputs. Their cost matrix is adapted in an online fashion to the evolving imbalance ratio, while multiple expositions of difficult instances are used to improve adaptation. Alternatively, Gaussian cost-sensitive decision trees combine cost and accuracy into a hybrid criterion during their training (Guo et al., 2013). Another approach uses Online Multiple Cost-Sensitive Learning (OMCSL) (Yan et al., 2017), where cost matrices for all classes are adjusted incrementally according to a sliding window. A recent framework proposed two-stage cost-sensitive learning, where a cost matrix is used for both online feature selection and classification (Sun et al., 2020). Finally, cost-sensitive approaches have been combined with Extreme Learning Machine algorithms via weighting matrices and misclassification costs (Li-wen et al., 1994).
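A common thread in these methods is thresholding class-based outputs with costs that track the evolving imbalance ratio. The sketch below illustrates the core idea under simple assumptions (the cost assignment is illustrative, not a specific published scheme): with false-negative cost \(c_{FN}\) and false-positive cost \(c_{FP}\), the Bayes-optimal rule predicts the minority class when \(p \ge c_{FP}/(c_{FP}+c_{FN})\).

```python
class CostSensitiveThreshold:
    """Assumes class 1 is the (positive) minority and class 0 the majority."""
    def __init__(self):
        self.counts = [1.0, 1.0]          # Laplace-smoothed class counts

    def observe(self, y):
        self.counts[y] += 1               # track the evolving class frequencies

    def decide(self, p_pos):
        c_fn = self.counts[0] / self.counts[1]   # missing a minority instance
        c_fp = 1.0                               # costs more than a false alarm
        threshold = c_fp / (c_fp + c_fn)         # Bayes-optimal cost threshold
        return int(p_pos >= threshold)
```

As the stream becomes more imbalanced, the threshold for predicting the minority class drops, counteracting the accumulated majority bias.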

One-class classification is an interesting solution to class imbalance, where class-specific models are used to describe either the minority class or all the classes (achieving a one-class decomposition of multi-class problems) (Krawczyk et al., 2018). One-class classifiers can be used in data stream mining scenarios and display good reactivity to concept drift (Krawczyk & Wozniak, 2015). One can use adaptive online one-class Support Vector Machines to track minority classes and their changes over time (Klikowski & Woźniak, 2020), or combine one-class classification with ensembles, over-sampling, and instance selection (Czarnowski, 2022). One-class classifiers can also be combined with active learning to select the most informative instances from the stream for class modeling (Gao, 2015). Anomaly detection, similar in its assumptions to one-class classification, can also be used to identify minority and majority instances in the stream (Liang et al., 2021).

4.3 Similar domains

When talking about learning from imbalanced data streams, it is necessary to mention two similar domains in contemporary machine learning, namely continual learning and long-tailed recognition.

Similarities to continual learning. It is important to mention that data stream mining can often be viewed as task-free continual learning (Krawczyk, 2021). While imbalanced problems have not yet been widely discussed in this setup, some works notice the importance of handling skewed class distributions for continual deep learning (Chrysakis & Moens, 2020; Kim et al., 2020; Arya & Hanumat Sastry, 2022; Priya & Uthra, 2021).

Similarities to long-tailed recognition. The extreme case of multi-class imbalance is known as long-tailed recognition (Yang et al., 2022). It deals with situations where there are hundreds or thousands of classes, with a progressively increasing imbalance ratio and the smallest classes being extremely underrepresented compared to the majority ones (hence a long-tailed class distribution of instances). This problem is mainly discussed in the context of deep learning, where various decomposition strategies (Zhu et al., 2022), loss functions (Zhao et al., 2022), or cost-sensitive solutions (Peng et al., 2022) are utilized. Currently, only a few works discuss the combined challenge of continual learning from long-tailed distributions (Kim et al., 2020).

5 Ensembles for imbalanced data streams

Combining multiple classifiers into an ensemble is one of the most powerful approaches in modern machine learning, leading to improved predictive performance, generalization capabilities, and robustness. Ensembles have proven themselves to be highly effective for data streams, as they offer unique ways of managing concept drift and class imbalance (Krawczyk et al., 2017). The former can be achieved by adding new classifiers or updating the existing ones, while the latter is achieved by combining classifiers with different skew-insensitive approaches (Brzeziński & Stefanowski, 2018; Grzyb et al., 2021; Du et al., 2021).

Ensembles for data streams can be categorized by the following design choices:

  • Classifier pool generation. There are two major approaches for generating a pool of classifiers for forming an ensemble: heterogeneous and homogeneous (Bian & Wang, 2007). Heterogeneous solutions ensure the diversity of the pool by using different classifier models, aiming at exploiting their individual strengths at forming decision boundaries. Homogeneous solutions select a specific type of classifier (e.g., a popular choice is decision trees) and then ensure diversity among them by modification of the training set. This is usually achieved by one of two popular solutions: bagging and boosting. Bagging (bootstrap aggregating) trains multiple independent base learners in parallel and combines their predictions using an aggregation function (e.g. a simple average or a simple majority vote). Boosting trains the base learners sequentially: each model in the sequence is fitted giving more importance to observations that were poorly handled by the previous models. Predictions are combined using a deterministic strategy (e.g. weighted majority voting). It is worth noting that while the majority of the methods are based on either a heterogeneous pool or homogeneous weak learners, there exist alternative approaches, such as generating hybrid pools (using multiple types of models, but also generating multiple learners for each of them) (Luong et al., 2020) and using projections (Korycki & Krawczyk, 2021b).

  • Feature space modification. This defines what feature space input is used by the base classifiers. They can be trained on the full feature space (where their diversity must be ensured in another way), on feature subspaces, or on completely new feature embeddings (e.g. artificially created feature spaces).

  • Ensemble line-up. This defines how ensembles are managed during continual learning from streams. Voting procedures can be used for dynamic adjustment of base learners' importance. Ensembles can be fixed, meaning that each base learner is continuously updated but never removed. Alternatively, one can use a dynamic setup, where the worst classifiers are pruned and replaced by new ones trained on more recent instances. Finally, all of the mentioned techniques can be combined to create hybrid architectures capable of better responsiveness to concept drift.

For imbalanced data streams, ensembles are usually combined with techniques mentioned in the previous section. Figure 3 presents a taxonomy (Krawczyk et al., 2017; Gomes et al., 2017a) based on how ensembles are built for data streams and how this can be connected with the previously discussed approaches to handle drifting and imbalanced streams.

Fig. 3 Taxonomy of ensemble definition for imbalanced data streams

The most popular approach lies in combining resampling techniques with Online Bagging (Wang et al., 2015, 2016; Wang & Pineau, 2016). Similar strategies can be applied to Adaptive Random Forest (Gomes et al., 2017b), Online Boosting (Klikowski & Woźniak, 2019; Gomes et al., 2019), Dynamic Weighted Majority (Lu et al., 2017), Dynamic Feature Selection (Wu et al., 2014), Adaptive Random Forest with resampling (Ferreira et al., 2019), Kappa Updated Ensemble (Cano & Krawczyk, 2020), Robust Online Self-Adjusting Ensemble (Cano & Krawczyk, 2022), Deterministic Sampling Classifier with weighted Bagging (Klikowski & Wozniak, 2022), Dynamic Ensemble Selection (Jiao et al., 2022; Han et al., 2023) or any ensemble that can incrementally update its base learners (Ancy & Paulraj, 2020; Li et al., 2020). It is interesting to note that preprocessing approaches enhance diversity among base classifiers (Zyblewski et al., 2019). Alternatively, cost-sensitive solutions can be used together with ensembles such as Adaptive Random Forest (Loezer et al., 2020).
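The core of these bagging variants is Oza and Russell's Online Bagging, where each base learner trains on every instance a Poisson-distributed number of times. The sketch below adds an OOB-style adjustment (after Wang et al., 2015) that raises \(\lambda\) for minority instances in proportion to the current class frequencies; UOB would instead lower it for majority instances. Details are simplified for illustration, and base learners follow the incremental interface sketched in Sect. 2.

```python
import numpy as np

class OnlineBaggingOOB:
    def __init__(self, make_learner, n_estimators=10, seed=0):
        self.learners = [make_learner() for _ in range(n_estimators)]
        self.counts = np.ones(2)           # smoothed per-class frequencies
        self.rng = np.random.default_rng(seed)

    def learn(self, x, y):
        self.counts[y] += 1
        lam = 1.0
        if self.counts[y] <= self.counts.min():        # minority instance:
            lam = self.counts.max() / self.counts[y]   # oversample, lambda > 1
        for model in self.learners:
            for _ in range(self.rng.poisson(lam)):     # Poisson(lambda) updates
                model.learn(x, y)

    def predict(self, x):
        votes = np.bincount([m.predict(x) for m in self.learners], minlength=2)
        return int(np.argmax(votes))
```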

The effectiveness of ensembles for imbalanced data streams can be further increased by using dedicated combination schemes or adaptive chunk-based learning (Lu et al., 2020). Weights assigned to each base classifier can be continuously updated to reflect their current competencies on minority classes (Ren et al., 2018). A reinforcement learning mechanism can be used to increase the weights of the base classifiers that perform better on the minority class (Zhang et al., 2019). One can use a hybrid approach that combines resampling minority instances with dynamically weighting base classifiers based on their predictive performance on sliding windows of minority samples (Yan et al., 2022). Dynamic selection of classifiers and their related preprocessing techniques can be a very effective tool for handling concept drift, as it exploits the diversity among base classifiers (Zyblewski et al., 2021; Zyblewski & Woźniak, 2021).

While the vast majority of the mentioned ensembles use Hoeffding Decision Trees (or their variants) as base classifiers, there are several skew-insensitive ensembles dedicated to neural networks. ESOS-ELM (Mirza et al., 2015) maintains randomized neural networks that are trained on balanced subsets of the incoming stream. Cost-sensitive neural networks can be initialized using different random weights and then incrementally improved with new instances (Ghazikhani et al., 2013). OSELM (Li-wen et al., 1994) classifiers can be combined using diverse initialization to generate a more robust compound classifier (Wang et al., 2021).

Finally, ensembles have found applications in imbalanced data streams with limited access to class labels. CALMID is a robust framework that deals with limited label access, concept drift, and class imbalance by dynamically inducing new base classifiers with weighting of the most relevant instances (Liu et al., 2021). Another approach uses reinforcement learning (Zhang et al., 2022) to select instances for updating the ensemble under labeling constraints. In multi-class imbalance settings, self-training semi-supervised methods (Vafaie et al., 2020) were applied for self-labeling driven by a small subset of labeled instances. This can also be realized by an abstaining mechanism that temporarily removes uncertain classifiers, dynamically adjusting the abstaining criterion in favor of minority classes (Korycki et al., 2019).

6 Experimental setup

The experimental study was designed to evaluate the performance of data stream mining algorithms under varied imbalanced scenarios and difficulties. We aim at gaining a better understanding of the data difficulties and how they impact the classifiers. We address the following research questions (RQ):

  • RQ1: How do different levels of class imbalance ratio affect the algorithms?

  • RQ2: How do static versus dynamic imbalance ratios influence the classifiers?

  • RQ3: How do instance-level difficulties impact the classifiers?

  • RQ4: How do algorithms adapt to simultaneous concept drift and imbalance ratio changes?

  • RQ5: Are there differences in performance between imbalanced generators and real-world streams?

  • RQ6: Is there a trade-off between accuracy and the computational and memory complexities?

  • RQ7: What are the lessons learned? Which algorithm should I use on my dataset?

To answer these questions, we formulate a set of benchmark problems building on experiments proposed in previous studies, together with new ones that assess additional data difficulties in two-class and multi-class imbalanced data streams. One of the major issues in this research area is the lack of standardized and agreed-upon procedures and benchmarks on how to evaluate these algorithms holistically. Therefore, we evaluate a comprehensive set of benchmark problems that covers an exhaustive list of data difficulties in imbalanced data streams. The experimental study in Sect. 7 is divided into the corresponding experiments, whereas Sect. 8 discusses the lessons learned and recommendations.

6.1 Algorithms

The experiments comprise 24 state-of-the-art algorithms for data streams, including the best-performing general-purpose ensembles and algorithms specifically designed for imbalanced streams. The algorithms are presented in Table 2 with their characteristics according to the established taxonomies. Specific properties of the ensemble models are presented in Table 3. All algorithms are implemented in MOA (Bifet et al., 2010b). The source code of the algorithms and the experiments is publicly available on GitHub to facilitate the transparency and reproducibility of this research. All results, interactive plots, and tables are available on the website. Algorithms were run on a cluster with 2300 AMD EPYC2 cores, 12 TB RAM, and CentOS 7. No individual hyperparameter optimization was conducted for any algorithm. All algorithms use the parameter settings recommended by their authors in their respective implementations. All ensembles are evaluated with the same parameter settings of 10 base classifiers using a Hoeffding tree as the base learner. We acknowledge that algorithms often depend on parameters that may have a significant impact on the results obtained. Some methods use random generators that require an initial random seed. Different seeds will produce different results, and multiple seeds should be run when the number of benchmarks is small, due to the central limit theorem. Other methods have parameters that affect classifier learning (e.g. the split confidence of the Hoeffding tree) that should be chosen more carefully when fitting a particular dataset. Due to the large number of benchmarks, experiments, and data sizes, the results reported in the paper are the median over 5 runs (5 seeds). Complete results to facilitate future comparisons, as well as detailed information about the specific parameter configurations, are available on the GitHub repository.

Table 2 Data stream algorithms and their taxonomy
Table 3 Ensemble algorithms and their taxonomy

6.2 Generators

To evaluate the classifiers in specific and controlled scenarios, we prepared data stream generators with different imbalance and drift settings. Nine generators in MOA (Bifet et al., 2010b) plus one generator proposed by Brzeziński et al. (2021) were used. These generators are presented in Table 4, with their number of attributes, classes, and whether they can generate internal concept drifts. All generators are evaluated on a stream of 200,000 instances. For generators with a configurable number of attributes, the default value in the table was used. The number of classes was adjusted according to the experiment (2 for binary class experiments and 5 for multi-class experiments).

Table 4 Specifications of the data stream generators

6.3 Performance evaluation

The algorithms were evaluated using the test-then-train model, where each instance is first used to test and then to update the classifier in an online manner (instance by instance). We measured seven performance metrics (Accuracy, Kappa, G-Mean, AUC, PMAUC, WMAUC, and EWMAUC). Complete results are available on the website https://people.vcu.edu/~acano/imbalanced-streams. However, due to space limitations, in the manuscript we show results for Kappa, G-Mean, and the Area Under the Curve (AUC), calculated over a sliding window of 500 instances. We also acknowledge that there are different schools of thought regarding the best selection of performance metrics for imbalanced data. Our argument is that, in order to have a comprehensive evaluation of classifier performance on imbalanced datasets, one should not rely on a single metric, whichever it is, since all metrics are biased one way or another and focus on assessing different aspects. Therefore, in our study, we report pairs of metrics that we have observed to exhibit complementary behaviors.

Kappa is often used to evaluate classifiers in imbalanced settings (Japkowicz, 2013; Brzeziński et al., 2018, 2019). It evaluates the classifier performance by computing the inter-rater agreement between the successful predictions and the statistical distribution of the data classes, correcting for agreements that occur by mere statistical chance. Kappa values range from \(-100\) (total disagreement) through 0 (agreement expected by chance) to 100 (total agreement), as defined in Eq. 1.

$$\begin{aligned} Kappa = \displaystyle \frac{n\displaystyle \sum _{i=1}^c{x_{ii}}-\displaystyle \sum _{i=1}^c{x_{i.}x_{.i}}}{n^2-\displaystyle \sum _{i=1}^c{x_{i.}x_{.i}}} \cdot 100 \end{aligned}$$
(1)

where \(x_{ii}\) is the count of cases in the main diagonal of the confusion matrix, n is the number of examples, c is the number of classes, and \(x_{.i}\), \(x_{i.}\) are the column and row total counts, respectively. Kappa punishes homogeneous predictions, which is very important to detect in imbalanced scenarios but can be too drastic in penalizing misclassifications on difficult data. Moreover, Kappa provides better insights in detecting changes in the distribution of classes in multi-class imbalanced data. However, some authors recommend avoiding Kappa, as its values vary depending not only on the performance of the model in question but also on the level of class imbalance in the data, which can make analyses difficult (Luque et al., 2019).

To strike a balance between the performance of classifiers on the majority and minority classes, many researchers consider null-bias metrics such as sensitivity and specificity (Brzeziński & Stefanowski, 2018). These metrics are based on the confusion matrix: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Sensitivity, also called recall, is the ratio of correctly classified instances from the minority class (true positive rate) defined in Eq. 2. Specificity is the ratio of instances correctly classified from the majority class (true negative rate) defined in Eq. 3. The geometric mean (G-Mean) is the square root of the product of the two metrics, as defined in Eq. 4. This measure tries to maximize the accuracy on each of the classes while keeping these accuracies balanced. G-Mean is a recommended null-bias metric for class imbalance (Luque et al., 2019). For multi-class data, the geometric mean is computed over the class-wise sensitivities. However, this introduces the problem that as soon as the recall for one class is 0, the whole geometric mean becomes 0. Therefore, it is much more complicated to use in multi-class experiments with a large number of classes; consequently, AUC is preferred.

$$\begin{aligned} Sensitivity&= Recall = \displaystyle \frac{TP}{TP + FN} \end{aligned}$$
(2)
$$\begin{aligned} Specificity& = \displaystyle \frac{TN}{TN + FP} \end{aligned}$$
(3)
$$\begin{aligned} G{-}Mean&= \displaystyle \sqrt{Sensitivity \times Specificity} \end{aligned}$$
(4)
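For concreteness, both metrics can be computed directly from a sliding-window confusion matrix; the sketch below implements Eq. 1 and Eqs. 2-4 (the example matrix is illustrative).

```python
import numpy as np

def kappa(cm):
    """Eq. 1; cm[i, j] = instances of class i predicted as class j."""
    n = cm.sum()
    agree = np.trace(cm)
    chance = (cm.sum(axis=1) * cm.sum(axis=0)).sum()   # sum of x_i. * x_.i
    return (n * agree - chance) / (n ** 2 - chance) * 100

def g_mean(cm):
    """Eq. 4 for the binary case; class 1 is the (positive) minority."""
    tp, fn = cm[1, 1], cm[1, 0]
    tn, fp = cm[0, 0], cm[0, 1]
    sensitivity = tp / (tp + fn)                       # Eq. 2
    specificity = tn / (tn + fp)                       # Eq. 3
    return np.sqrt(sensitivity * specificity)

cm = np.array([[940, 10],      # majority: 940 correct, 10 false positives
               [ 25, 25]])     # minority: half of 50 instances recovered
print(kappa(cm), g_mean(cm))   # ~57.1 and ~0.70
```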

The Area Under the Curve (AUC) is invariant to changes in class distribution and provides a statistical interpretation for scoring classifiers. However, to measure the ranking ability of classifiers, AUC needs to sort the data and iterate through each example. We employ the prequential AUC formulation proposed by Brzeziński and Stefanowski (2017), which uses a sorted tree structure with a sliding window. The AUC formulation was extended by Wang and Minku (2020) for multi-class problems, defining the Prequential Multi-Class AUC (PMAUC) as Eq. 5.

$$\begin{aligned} PMAUC = \displaystyle \frac{1}{C(C-1)} \cdot \sum _{i\ne j} A (i\mid j) \end{aligned}$$
(5)

where \(A (i\mid j)\) is pairwise AUC when treating class i as the positive class and class j as negative, and C is the number of classes. Extensions of the PMAUC calculation include Weighted Multi-class AUC (WMAUC) and Equal Weighted Multi-class AUC (EWMAUC) (Wang & Minku, 2020).
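A direct, non-incremental reading of Eq. 5 can be sketched as follows (the efficient prequential variant with a sorted tree structure is more involved). Class labels are assumed to be integers indexing the score columns.

```python
import numpy as np
from scipy.stats import rankdata

def pairwise_auc(scores_pos, scores_neg):
    """Mann-Whitney formulation of the AUC for one (positive, negative) pair."""
    s = np.concatenate([scores_pos, scores_neg])
    ranks = rankdata(s)                          # average ranks handle ties
    n_pos, n_neg = len(scores_pos), len(scores_neg)
    return (ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def pmauc(scores, y):
    """Eq. 5: average A(i|j) over all ordered class pairs, using the scores
    for class i to rank instances of class i (positive) vs class j (negative)."""
    y = np.asarray(y)
    classes = np.unique(y)
    C = len(classes)
    total = 0.0
    for i in classes:
        for j in classes:
            if i != j:
                total += pairwise_auc(scores[y == i, i], scores[y == j, i])
    return total / (C * (C - 1))
```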

Both AUC and G-Mean are blind to the level of class imbalance, while Kappa takes the class distribution into account but is harder to interpret. Therefore, in cases of extreme imbalance ratios, the Kappa metric can be very dissimilar to the G-Mean and AUC, which means a classifier can have a high AUC but a very low Kappa value. This is very useful for understanding the behavior of a classifier under high imbalance ratios and how different metrics exhibit complementary facets of the classification performance. Therefore, it is important to evaluate the algorithms using both metrics in order to counterbalance overestimation. Hence, in the experiments presented in the manuscript, we evaluated the classifiers with G-Mean and Kappa for binary class scenarios and with PMAUC and Kappa for multi-class scenarios. Metrics were calculated prequentially (Gama et al., 2013) using a sliding window of 500 examples. Complete results for all metrics (Accuracy, Kappa, G-Mean, AUC, PMAUC, WMAUC, and EWMAUC) are available on the website https://people.vcu.edu/~acano/imbalanced-streams for analysis and comparison with future works.

7 Results

This section presents the experimental results from the set of benchmarks proposed to answer the research questions. Section 7.1 shows the experiments on binary class imbalanced streams. Section 7.2 shows the experiments on multi-class imbalanced streams. Finally, Sect. 7.3 shows overall results and an aggregated comparison of all algorithms.

Due to the very large number of experiments conducted in this work, we present in the manuscript a selection of the most representative results. The experiments are organized to show three levels of detail. First, a more detailed comparison of the top five methods. Second, an aggregated comparison of the top ten methods. Third, a summary of the comparison among all methods. Complete results for all experiments on all algorithms, datasets/generators, and metrics are available on the website.

7.1 Binary class experiments

The first set of experiments focuses on binary class problems with a positive minority class and a negative majority class. These experiments include static imbalance ratio, dynamic imbalance ratio, instance-level difficulties, concept drift and static imbalance ratio, concept drift and dynamic imbalance ratio, and real-world binary class imbalanced datasets.

7.1.1 Static imbalance ratio

Goal of the experiment. This experiment was designed to address RQ1 and evaluate the robustness of the classifiers under different levels of static class imbalance without concept drift. It is expected that classifiers designed to tackle class imbalance will present better robustness to different levels of imbalance, i.e., achieve a stable performance regardless of the imbalance ratio. To evaluate this, we prepared the generators presented in Table 4 with static imbalance ratios (the ratio of the size of the majority class to the minority class, as defined by Zhu et al. (2020)) of {5, 10, 20, 50, 100}. This allows us to assess how each classifier performs under specific levels of class imbalance. Figure 4 illustrates the performance of five selected algorithms with increasing levels of static imbalance ratio. Table 5 presents the average G-Mean and Kappa for the top 10 classifiers for each of the evaluated imbalance ratios and the overall rank of the algorithms. Figure 5 provides a comparison of all algorithms for each level of imbalance ratio. The axes of the ellipse represent the G-Mean and Kappa metrics. The bigger the axes, the better the rank of the algorithm on the metrics. The more rounded the ellipse, the more agreement between the metrics. Finally, the color represents a gradient of the product of the two metrics' ranks, from red (worse) to green (better).

Discussion

Impact of approach to class imbalance. First, we analyze the impact of the different skew-insensitive mechanisms used by the analyzed ensembles on their robustness to various levels of static imbalance under stationary assumptions. Looking at resampling-based methods, we can observe a clear distinction between methods based on blind and informative approaches. Ensembles utilizing blind approaches usually drop in performance with an increasing imbalance ratio. Taking UOB as an example, one can see discrepancies between the G-Mean and Kappa metrics. On G-Mean, UOB maintains its predictive performance, to the point that for very high imbalance ratios it outperforms other approaches. However, on the Kappa metric, the performance of UOB deteriorates significantly with each increase in class disproportion. This shows that UOB produces a good true positive ratio but, proportionally, a larger number of false positives. We can explain that by the limitations of undersampling under extreme class imbalance: to balance the current distribution, one must aggressively discard majority instances. In static problems, the higher the disproportion between classes, the higher the chance of discarding relevant majority examples. Moreover, in a streaming setting we analyze the imbalance ratio in an online manner, so UOB is not able to counter the bias towards the majority class accumulated over time by undersampling incoming instances one by one. Its counterpart OOB shows the opposite behavior, returning the best results on the Kappa metric. Additionally, for high imbalance ratios, OOB starts displaying balanced performance on both metrics. This shows that blind oversampling in online scenarios is capable of better and faster countering of the bias accumulated over time. Among informative resampling methods, only SMOTE-OB returns satisfactory performance. On the Kappa metric, it can outperform UOB but does not fare well against OOB. All other algorithms that use SMOTE-based resampling perform even worse. This allows us to conclude that blind oversampling performs best of all data-level mechanisms in terms of robustness to static imbalance.

Fig. 4 Robustness to different levels of static class imbalance ratio (G-Mean and Kappa)

Table 5 G-Mean and Kappa averages of all 10 streams on static class imbalance ratio
Fig. 5 Comparison of all 24 algorithms for different levels of static class imbalance ratio. The axes of the ellipse represent the G-Mean and Kappa metrics. The bigger the axes, the better the rank of the algorithm on the metrics. The more rounded the ellipse, the more agreement between the metrics. The color gradient represents the product of both metrics' ranks (Color figure online)

Among the algorithm-level solutions, CSARF displays the best results for the G-Mean metric, outperforming all reference methods. However, it does not hold its performance when evaluated using Kappa. This is another striking example of the discrepancies between those metrics and how they highlight different aspects of imbalanced classification. Alternative algorithm-level approaches, such as ROSE and CALMID, while performing worse on G-Mean, offer a more balanced performance on both metrics at once. Additionally, they display good robustness to increasing imbalance ratios. Therefore, algorithm selection for data streams with static imbalance is far from trivial, as one must choose between methods that perform very well on only one of the metrics, or a well-rounded method that, while not excelling on any single metric, offers more even performance.

Finally, out of the standard ensembles with no skew-insensitive mechanisms, LB returned the best predictive performance, outperforming several methods dedicated to imbalanced data streams. This did not hold for other methods, such as SRP or ARF, which displayed no robustness to increasing imbalance ratios.

Impact of ensemble architecture. When we look at the overall best-performing methods in every scenario, we can see a dominance of ensembles based on bagging or hybrid architectures. Bagging offers an easy and effective way of maintaining instance-based diversity among base learners, which benefits both data- and algorithm-level approaches and leads to high robustness under various levels of class imbalance. Within bagging methods, only OUOB can be seen as an outlier. We can explain this using our observations from the previous paragraph: undersampling and oversampling offer contrary performance (one favoring G-Mean and the other Kappa). Therefore, by combining these two approaches we obtain an ensemble that is driven by two conflicting mechanisms. Boosting-based ensembles are usually the worst-performing ones. We can explain this by the fact that the boosting mechanism focuses on correcting the errors of the previous classifier in a chain. When dealing with high imbalance ratios, the errors are driven by a small number of minority instances, leading to sample sizes too small to effectively improve the performance. As minority instances are usually misclassified, assigning high weights to them leads to a high error on the majority class by increasing the number of false positives. In the end, boosting-based ensembles will consist of classifiers biased towards one of the classes. Without proper selection or weighting mechanisms, it is impossible to maintain robustness to high imbalance ratios with such classifiers in the ensemble pool.
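To make these weight dynamics concrete, below is a simplified sketch of the Oza-Russell online boosting update that underlies several of the examined boosting ensembles; base learners are assumed to expose scikit-learn-style partial_fit/predict, and the function name is ours. It shows how a repeatedly misclassified minority instance has its weight inflated along the chain.

```python
import numpy as np

rng = np.random.default_rng(1)

def ozaboost_update(models, lam_sc, lam_sw, x, y):
    """One step of Oza-Russell online boosting (simplified sketch).
    lam_sc[m] / lam_sw[m] accumulate the weight of instances model m
    classified correctly / wrongly. A hard minority instance keeps its
    weight lam inflated before reaching the next chain member."""
    lam = 1.0
    for m, model in enumerate(models):
        for _ in range(rng.poisson(lam)):
            model.partial_fit([x], [y], classes=[0, 1])
        if model.predict([x])[0] == y:
            lam_sc[m] += lam
            eps = lam_sw[m] / (lam_sc[m] + lam_sw[m])
            lam *= 1.0 / (2.0 * (1.0 - eps))   # shrink weight when correct
        else:
            lam_sw[m] += lam
            eps = lam_sw[m] / (lam_sc[m] + lam_sw[m])
            lam *= 1.0 / (2.0 * eps)           # inflate weight when wrong
```

With few minority instances, almost all of the inflated weight concentrates on them, so later chain members are trained on a tiny, heavily re-weighted sample: exactly the bias mechanism described above.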

7.1.2 Dynamic imbalance ratio

Goal of the experiment. This experiment was designed to address RQ2 and to evaluate how classifiers behave under dynamic imbalance ratios. Even though many existing methods were designed to deal with a static imbalance ratio, they lack mechanisms that allow adaptation to time-varying changes in the imbalance ratio. To evaluate this, we prepared four scenarios: (i) increasing the imbalance ratio {1, 5, 10, 20, 50, 100}, (ii) increasing then decreasing the imbalance ratio {1, 5, 10, 20, 50, 100, 50, 20, 10, 5, 1}, (iii) flipping the imbalance ratio, in which the majority becomes the minority class and vice versa {100, 50, 20, 10, 5, 1, 0.2, 0.1, 0.05, 0.02, 0.01}, and (iv) flipping then reflipping the imbalance ratio, in which the majority becomes the minority class and then flips back to become the majority again {100, 50, 20, 10, 5, 1, 0.2, 0.1, 0.05, 0.02, 0.01, 0.02, 0.05, 0.1, 0.2, 1, 5, 10, 20, 50, 100}. In this experiment, we also evaluated two types of drift: gradual and sudden. This allows us to analyze how the classifiers cope with dynamic imbalance ratio changes and how they adapt when the majority and minority classes change roles. Figures 6, 7, 8, 9 present the G-Mean and Kappa over time for the five selected classifiers on the generators, for both types of drift, under (i) increasing imbalance ratio, (ii) increasing then decreasing imbalance ratio, (iii) flipping imbalance ratio, and (iv) flipping then reflipping imbalance ratio, respectively. To increase readability, the line plots were smoothed using a moving average of 20 data points. Table 6 presents the average G-Mean and Kappa for the top 10 classifiers for each of the evaluated dynamic scenarios and the overall rank of the algorithms. Figure 10 provides an overall comparison among all algorithms.
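For illustration, the sketch below shows one way such schedules can be turned into a label stream, assuming each stage spans an equal share of the stream and switches suddenly; the function names and stage lengths are our own and do not reproduce the exact generator configuration used in the benchmark.

```python
import numpy as np

def minority_probability(ir):
    """IR is the majority/minority size ratio; an IR below 1 means the
    class roles have flipped. P(minority) = 1 / (1 + IR)."""
    return 1.0 / (1.0 + ir)

# schedules mirroring scenarios (i)-(iii); (iv) extends (iii) symmetrically
schedules = {
    "increasing":            [1, 5, 10, 20, 50, 100],
    "increasing_decreasing": [1, 5, 10, 20, 50, 100, 50, 20, 10, 5, 1],
    "flipping":              [100, 50, 20, 10, 5, 1, 0.2, 0.1, 0.05, 0.02, 0.01],
}

def class_sequence(schedule, n_instances, rng=np.random.default_rng(0)):
    """Draw a label sequence whose skew follows the schedule (sudden shifts)."""
    stage_len = n_instances // len(schedule)
    labels = []
    for ir in schedule:
        p = minority_probability(ir)
        labels.extend(rng.binomial(1, p, size=stage_len))  # 1 = minority
    return np.array(labels)
```

A gradual variant would interpolate the imbalance ratio between consecutive stages instead of switching it at the stage boundary.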

Fig. 6 G-Mean and Kappa on increasing class imbalance ratio with gradual and sudden drift

Fig. 7 G-Mean and Kappa on increasing then decreasing class imbalance ratio with gradual and sudden drift

Discussion

Impact of approach to class imbalance. In our second experiment, it is interesting to note that the best-performing classifiers are similar to the ones in the static scenario, with the difference lying in how quickly they can recover from the imbalance drift. Regarding data-level approaches, UOB and OOB achieve good results, even without explicit mechanisms for handling concept drift in the imbalance ratio. OUOB once again did not display satisfactory results, mainly because of its inability to switch between different resampling approaches, which leads to a slower response to changes. SMOTE-based methods had diverging performances. C-SMOTE and OSMOTE cannot handle an increasing imbalance ratio, losing their performance over time and not being able to cope with the increasing disproportion among classes. This can be explained by the fact that an increasing imbalance ratio leads to a lower number of minority instances that could be used for the generation of relevant and diverse artificial instances. SMOTE-OB was among the best-performing classifiers. This can be explained by SMOTE-OB combining undersampling with oversampling, leading to smaller disproportions between classes and more homogeneous k-nearest neighborhoods used for instance generation.

For algorithm-level modification methods, CSARF is the best-performing one according to G-Mean but suffers under the Kappa metric. When dynamic changes are introduced, ROSE presents the most balanced results according to both metrics. In the previous experiments, ROSE was also one of the best classifiers, but its underlying change adaptation mechanisms and usage of dynamic sliding windows lead to significant improvements for non-stationary imbalance, especially when dealing with high imbalance ratios and flipping class roles.

Impact of ensemble architecture. Experiments with the dynamic imbalance ratio confirm our previous observations regarding the most robust architecture choice for ensembles. Boosting-based methods return even worse performance when dealing with an evolving disproportion between classes. We can explain this by the fact that each classifier in the boosting chain may be built using different class ratios, thus further reinforcing the small sample size problem for minority classes observed for static imbalance. This allows us to conclude that boosting-based ensembles are not best suited for handling difficult imbalanced streams. Bagging-based and hybrid architectures perform significantly better, with bagging being the dominating solution. It is very interesting to see that, regardless of the skew-insensitive mechanism used, bagging-based ensembles (or hybrid architectures like ROSE that utilize bagging) deliver superior performance. This can be explained by the diversity among base classifiers, which allows for anticipating different local characteristics of decision boundaries. Therefore, when the imbalance ratio increases or decreases, there is a high chance that some of the base classifiers (and thus the subsets of instances that they use) offer better generalization and faster adaptation to evolving disproportions between classes.

Impact of drift speed in class imbalance. We can observe that most of the examined algorithms offer similar performance on all types of drifts. Some of the methods do not have explicit mechanisms for change adaptation, which leads to their slower recovery from changes. However, in the long run, there were no significant differences between sudden and gradual drift adaptations for any method on G-Mean or Kappa. The third analyzed scenario, with flipping the majority and minority classes, has a major impact on the analyzed classifiers. ARF significantly suffers on both metrics, showing that its adaptation mechanisms are not suitable for settings where classes can change roles over time. The same observation holds for CSARF, which displays an increasing gap between performance on the majority and minority classes, as it is not able to effectively adapt its cost matrix to such changes and penalizes the wrong class over time. Considering only G-Mean, flipping classes did not impact UOB much, demonstrating that undersampling displays potential robustness to a switching minority class. However, the same cannot be said for the Kappa metric, leading us to conclude that UOB tends to prioritize one class even when the roles flip. ROSE was the most robust and stable classifier for all possible changes in class distribution, being capable of avoiding huge drops in performance on both metrics due to its per-class sliding window adaptation.

Fig. 8 G-Mean and Kappa on flipping class imbalance ratio with gradual and sudden drift

Fig. 9 G-Mean and Kappa on flipping and reflipping class imbalance ratio with gradual and sudden drift

Table 6 G-Mean and Kappa averages of all 10 streams on dynamic class imbalance ratio
Fig. 10 Comparison of all 24 algorithms for dynamic class imbalance ratio. Color gradient represents the product of both metrics (Color figure online)

7.1.3 Instance-level difficulties

Goal of the experiment. This experiment addresses RQ3 and evaluates the robustness of the classifiers to instance-level difficulties (Brzeziński et al., 2021). We evaluated the Brzeziński generator with borderline instances, rare instances, and both combined at the same time. The ratio of difficult instances for scenarios with only rare or only borderline instances is {0%, 20%, 40%, 60%, 80%, 100%}. In the combined scenario they represent {0%, 20%, 40%} of rare and borderline instances each, e.g., 20% means there are 20% rare instances and 20% borderline instances. Difficult instances were created for the minority class to present a challenging scenario for the classifier. We evaluated the influence on classifiers combined with static and dynamic imbalance ratios. Borderline instances pose a challenge to the classifier because they lie in the uncertainty area of the decision space and strongly impact the induction of the classification boundaries. Rare instances overlap with the majority class. Additionally, instances of the minority class are distributed in clusters; moving, splitting, and merging these clusters leads to new challenges for the classifiers, since the decision boundary moves accordingly.
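Instance types of this kind are commonly operationalized through the safe/borderline/rare/outlier typology based on the class composition of each minority instance's nearest neighborhood. The sketch below is an illustrative batch version of that rule (function name and thresholds follow the usual k = 5 convention; it is not the generator's internal code).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def minority_types(X, y, minority=1, k=5):
    """Label each minority instance as safe/borderline/rare/outlier from the
    number of same-class neighbours among its k nearest neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority])
    types = []
    for row in idx:
        same = int(np.sum(y[row[1:]] == minority))  # row[0] is the point itself
        if same >= 4:
            types.append("safe")
        elif same >= 2:
            types.append("borderline")
        elif same == 1:
            types.append("rare")
        else:
            types.append("outlier")
    return types
```

Under this view, raising the borderline or rare ratio in the generator directly shifts minority instances into neighborhoods dominated by the majority class.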

Figures 11 and 12 present the performance of the five selected classifiers with an increasing presence of borderline and rare instances, respectively, under a static imbalance ratio. Figures 13, 14, 15 illustrate the performance of the same classifiers with changes in the spatial distribution of minority class instances under static imbalance. Table 7 presents the average G-Mean and Kappa for the top 10 classifiers for each imbalance ratio and a given instance-level difficulty, and their average ranking. The overall performance of all classifiers regarding each of the instance difficulties is presented in Figs. 16, 17, 18, 19, 20, in which the axes of the ellipse represent the G-Mean and Kappa metrics (the more rounded, the better agreement) and the color represents the product of both metrics.

Fig. 11 Robustness to borderline instances for static imbalance ratio (G-Mean and Kappa)

Fig. 12 Robustness to rare instances for static imbalance ratio (G-Mean and Kappa)

Considering the instance-level difficulties combined with a dynamic imbalance ratio, Fig. 21 illustrates the performance of the classifiers with an increasing imbalance ratio. Table 8 presents the average G-Mean and Kappa for the top 10 classifiers for an increasing imbalance ratio in the presence of instance-level difficulties, and their overall ranking. To summarize, Fig. 22 shows the overall performance of all classifiers for each instance-level difficulty.

Discussion

Fig. 13 G-Mean and Kappa on moving minority clusters for static imbalance ratio

Fig. 14 G-Mean and Kappa on splitting minority clusters for static imbalance ratio

Fig. 15 G-Mean and Kappa on merging minority clusters for static imbalance ratio

Impact of approach to class imbalance. First, let us look at how different mechanisms for ensuring robustness to class imbalance tend to perform under diverse data-level difficulties. While guided resampling solutions usually perform better than their blind counterparts, here we can see that most approaches based on SMOTE tend to fail. This can be explained by the reliance of SMOTE on the neighborhood. Borderline and rare instances create non-homogeneous neighborhoods characterized by high overlapping and classification uncertainty. Oversampling such areas enhances these undesirable qualities instead of simplifying the classification task. This is especially visible for rare instances, where SMOTE-based methods deliver the worst performance. Rare instances do not form homogeneous neighborhoods and in a streaming setup may appear infrequently, leading to a lack of both spatial and temporal coherence. This undermines the basic assumptions of SMOTE-based algorithms and renders them ineffective. Only SMOTE-OB can handle both types of minority instances, as well as various minority clusters. This can be explained by its use of bags of instances (subsets) for training, which may offer better separation of instances and impose partial coherence on artificially generated instances. Blind oversampling employed by OOB also performs well under data-level difficulties, as it does not rely on neighborhood analysis. However, when dealing with rare instances, OOB tends to fail significantly, as it amplifies these often-scattered instances, leading to overfitted decision boundaries. Where OOB and UOB excel is minority class clusters: due to their online nature, they can swiftly adapt to changes in cluster structures and resample even small drifts to make them viable for their base learners. Algorithms based on training modifications, such as HDVFDT, ROSE, or CALMID, can handle all types of difficulties well. Their robustness lies in their data manipulation and usage of modified mechanisms for training, which display all-around, yet implicit, robustness to such challenges. Especially for the Kappa metric, ROSE and CALMID can be seen as good choices for these scenarios. The cost-sensitive approach of CSARF displays the best performance on the G-Mean metric, yet suffers under Kappa evaluation. This shows that CSARF strongly focuses on the minority class, but this hits back in the form of an increased number of false positives. So, while it is capable of handling minority difficulties and clusters via cost penalty-led training, it increases the false positives in these overlapping or uncertain regions.
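To ground the neighborhood argument, below is a minimal, hypothetical sketch of SMOTE-style interpolation over a sliding buffer of recent minority instances (all names are ours). When the buffered minority examples are rare or borderline, the nearest neighbours chosen here straddle majority regions, so the interpolated point is likely to land in exactly the overlapping, uncertain space described above.

```python
import numpy as np
from collections import deque

minority_window = deque(maxlen=200)   # hypothetical buffer of recent minority x

def smote_like_sample(window, k=5, rng=np.random.default_rng(0)):
    """Generate one synthetic minority instance by SMOTE-style interpolation
    between a stored example and one of its k nearest minority neighbours.
    Requires at least two buffered minority instances."""
    X = np.asarray(window, dtype=float)
    base = X[rng.integers(len(X))]
    dist = np.linalg.norm(X - base, axis=1)
    neighbours = X[np.argsort(dist)[1:k + 1]]        # skip base itself
    neighbour = neighbours[rng.integers(len(neighbours))]
    return base + rng.random() * (neighbour - base)  # random point on the segment
```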

Table 7 G-Mean and Kappa averages on borderline, rare, moving, splitting, merging minority clusters for static imbalance ratio

Impact of ensemble architecture. We can observe that all best-performing methods for various types of data-level difficulties are based either on bagging or hybrid architectures. All boosting methods are among the worst-performing ones. This can be explained by the nature of boosting, as it focuses on correcting the mistakes of the previous classifier in the ensemble. Rare, borderline, or clustered minority instances will always introduce high uncertainty into the training procedure. This may significantly destabilize boosting, as by focusing on correcting errors on those uncertain instances it continuously introduces other errors, locking itself in a cycle of never reducing the overall error. Bagging methods offer a natural partitioning of instances, allowing them to break up difficult neighborhoods or clusters and introduce more instance-level diversity into base classifiers. This aids the mechanisms used for handling class imbalance, making bagging methods more robust to scenarios where learning difficulties lie in the spatial characteristics of data.

Comparison with standard ensembles. Interestingly, general-purpose ensembles display better robustness to various instance-level difficulties than over half of the classifiers dedicated to imbalanced data streams. LB, ARF, and KUE can relatively effectively handle both types of difficult instances, as well as various types of evolving clusters within the minority class. They always significantly outperform all methods based on boosting and most approaches using informative oversampling (except for SMOTE-OB). Of course, we can observe a drop in their performance with the increase of imbalance ratios, yet even for IR = 100 they can perform better than several dedicated approaches. Their robustness to instance types can be explained by the fact that all three mentioned ensembles use instance subsets for training their base classifiers. Such subsampling may implicitly lead to sparser neighborhoods (reducing overlapping and uncertainty) and thus to a reduction of difficulty levels for certain instances. The robustness to evolving clusters can be explained by the concept drift adaptation mechanisms employed by LB, ARF, and KUE. Changes in minority class clusters can be picked up by their drift detectors, leading to adaptation to the current state of the minority class. Therefore, any splitting or merging of clusters will be picked up as a change in data distribution and managed by simple online adaptation to the most recent instances. Finding that such general-purpose ensembles display significant robustness to data-level difficulties stands as a testament to how well designed those methods are. However, to excel in such challenging learning scenarios, highly specialized ensemble models enhanced with skew-insensitive mechanisms can deliver much better performance.

Fig. 16 Comparison of all 24 algorithms for borderline instances on static class imbalance ratio. Axes of the ellipse represent G-Mean and Kappa metrics. Color gradient represents the product of both metrics (Color figure online)

Fig. 17 Comparison of all 24 algorithms for rare instances on static class imbalance ratio. Axes of the ellipse represent G-Mean and Kappa metrics. Color gradient represents the product of both metrics (Color figure online)

Relationships between instance-level difficulties and imbalance ratios. When analyzing algorithms for their robustness to data-level difficulties, we must understand the relationship between these difficulties and the class imbalance ratio. Ideally, we are looking for a method that is insensitive to changing imbalance ratios and displays stable robustness to data-level difficulties. Most of the existing algorithms do not possess this quality, displaying either drops in performance with an increasing imbalance ratio (e.g., MICFOAL) or a lack of stability (e.g., VFC-SMOTE). The most reliable methods are CSARF, SMOTE-OB, OOB, and ROSE, which offer stable, or even improving, robustness with increasing imbalance. It is important to note that CSARF performance is skewed towards G-Mean, while the remaining methods tend to perform well on both metrics.

Fig. 18 Comparison of all 24 algorithms for moving minority clusters on static class imbalance ratio. Axes of the ellipse represent G-Mean and Kappa metrics. Color gradient represents the product of both metrics (Color figure online)

Fig. 19 Comparison of all 24 algorithms for splitting minority clusters on static class imbalance ratio. Axes of the ellipse represent G-Mean and Kappa metrics. Color gradient represents the product of both metrics (Color figure online)

Fig. 20 Comparison of all 24 algorithms for merging minority clusters on static class imbalance ratio. Axes of the ellipse represent G-Mean and Kappa metrics. Color gradient represents the product of both metrics (Color figure online)

What difficulties are the most challenging. While analyzing the performance of the methods, we can see significant drops in performance in two scenarios: when dealing with rare instances and with splitting/merging clusters. Rare instances are one of the biggest challenges for any imbalanced learning algorithm, as they combine small sample size, class overlapping, and the potential presence of noise. With an increasing ratio of rare instances, the minority class in the stream starts losing any coherent structure, converging towards a collection of sparsely distributed and spatially uncorrelated instances, more akin to a cloud of points than any structure. This makes the formulation of decision boundaries especially difficult and requires dedicated mechanisms that can either learn under small sample size or create more coherent representations (either via resampling, as in OOB, or via instance buffers, as in ROSE). Minority clusters pose an even bigger challenge, as they force classifiers to track sub-concepts in minority classes (each cluster should be treated as a sub-concept). Both cases require fast adaptation and are strongly aided by the presence of an underlying drift detector. With splitting clusters, previously learned decision boundaries become too general and are not able to capture the emergence of sub-concepts in the minority class. With merging clusters, we are left with overly complex decision boundaries that are not able to generalize well over the current state of the stream.

Fig. 21 G-Mean and Kappa on increasing borderline, rare, moving, splitting, and merging minority clusters and increasing imbalance ratio

Table 8 G-Mean and Kappa averages on borderline, rare, moving, splitting, merging minority clusters and increasing imbalance ratio
Fig. 22 Comparison of all 24 algorithms on borderline, rare, moving, splitting, merging minority clusters and increasing imbalance ratio. Color gradient represents the product of both metrics (Color figure online)

7.1.4 Concept drift and static imbalance ratio

Goal of the experiment. This experiment aims to address RQ4 and to evaluate the robustness of the data stream classifiers to static imbalance in the presence of concept drift. Even though the classifiers are designed to deal with imbalance ratios, they also have mechanisms to deal with concept changes. Concept drift affects decision boundaries, thus leading to a more challenging skewed learning scenario with a higher degree of overlap between classes. To evaluate this, we prepared the same generators used in experiment Sect. 7.1.1 with two types of concept drift: gradual and sudden. They were combined with the static imbalance ratios examined in experiment Sect. 7.1.1. Figure 23 illustrates the G-Mean and Kappa over time for the five selected classifiers with static imbalance ratio under the presence of concept drift. Table 9 presents the G-Mean and Kappa for the top 10 classifiers and each imbalance ratio, as well as their overall ranking. Figure 24 summarizes the overall performance of all classifiers in this scenario.
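As background on how gradual versus sudden drifts are typically injected, MOA-style generators blend an old and a new concept with a sigmoid whose width controls the drift speed. The snippet below is an illustrative sketch under that assumption; the generator callables and default parameter values are hypothetical.

```python
import numpy as np

def new_concept_probability(t, position, width):
    """Probability of drawing from the post-drift concept at instant t;
    a small width approximates a sudden drift, a large width a gradual one."""
    return 1.0 / (1.0 + np.exp(-4.0 * (t - position) / width))

def next_instance(t, old_concept, new_concept, position=50_000, width=1_000,
                  rng=np.random.default_rng(0)):
    """old_concept / new_concept are hypothetical callables yielding one (x, y)."""
    if rng.random() < new_concept_probability(t, position, width):
        return new_concept()
    return old_concept()
```

Since only the concept generators change while the label-sampling priors stay fixed, this construction drifts the decision boundaries without altering the imbalance ratio, matching the setup of this experiment.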

Discussion

Impact of approach to class imbalance. In this experiment, we extend the analysis of the robustness of classifiers to various imbalance ratios by adding concept drift affecting the decision boundaries. It is important to notice that the drift did not influence the disproportion between classes. OOB and UOB, two methods that offered excellent performance for stationary imbalanced streams, suffer a significant drop in performance when handling non-stationary problems. OOB had the biggest performance drop under concept drift for higher imbalance ratios. This is expected, since both methods do not have mechanisms to deal with changes in the feature distribution. While changes in the imbalance ratio can be tackled by resampling approaches, these do not allow for any efficient adaptation to evolving decision boundaries. Classifiers based on informed resampling, such as C-SMOTE, OSMOTE, and VFC-SMOTE, offered only slightly better performance than the mentioned blind resampling ensembles. This shows that under the presence of concept drift, adaptation mechanisms play a more important role than the solutions used to tackle class imbalance. For algorithm-level methods, CSARF demonstrated the best results, thanks to its underlying implicit mechanisms for handling non-stationary data. While, similarly to the previous experiments, CSARF suffered under Kappa evaluation, this time it was the second-best regarding this metric. ROSE remained the most balanced classifier, displaying robustness to changes since it can adapt both to concept drift and to the imbalance ratio. ARF, ARFR, LB, and SRP achieved decent results in a scenario with concept drift; however, their performance drops as the imbalance ratio increases.

Fig. 23 Robustness to concept drift with static class imbalance ratio (G-Mean and Kappa)

Table 9 G-Mean and Kappa averages of all 10 streams for concept drift with static class imbalance ratio
Fig. 24 Comparison of all 24 algorithms for concept drift with static class imbalance ratio. Axes of the ellipse represent G-Mean and Kappa metrics. Color gradient represents the product of both metrics (Color figure online)

Impact of ensemble architecture. As observed in the previous experiments, boosting-based methods deliver the worst performance among all ensemble architectures. This can be explained by drift destabilizing the boosting classifier chains, as errors made by previous classifiers may no longer be meaningful for updating their successors. There is a need to improve drift adaptation procedures for boosting-based ensembles so they can become competitive with their bagging peers. While bagging-based architectures are still at the core of the best-performing methods, we can see the increasing dominance of hybrid architectures in concept drift scenarios. While all of them use bagging, they combine it with dynamic weighting of the base classifiers and a dynamic line-up, demonstrating that a combination of several mechanisms is necessary to tackle class imbalance and concept drift at the same time.

Relationship between concept drift and imbalance ratios. In the context of this experiment, it is crucial to analyze and understand the interplay between the concept drift impacting the class boundaries and the static imbalance ratios affecting the disproportion between them. When focusing on how the classifiers tackle concept drift, we do not see significant differences between the ones utilizing implicit or explicit drift detection. This shows that there is no obvious choice of adaptation mechanism and that the classifier performance for drifting and imbalanced streams is a product of the learning architecture, the drift adaptation mechanism, and the approach to tackling class imbalance. We can see that popular classifiers for drifting data streams, such as ARF, LB, or SRP, cannot handle increasing imbalance ratios. At the same time, solutions dedicated to online learning from imbalanced data streams, such as UOB or OOB, cannot deal with the non-stationary nature of data streams. The best-performing methods, such as ROSE, CSARF, and SMOTE-OB, combine adaptation and skew-insensitive mechanisms for all-round robustness.

7.1.5 Concept drift and dynamic imbalance ratio

Goal of the experiment. This experiment was designed to complement the previous experiment and to completely address RQ2 and RQ4, examining the classifiers in the presence of concept drift combined with a dynamic imbalance ratio. Combining concept drift with simultaneous changes in the class imbalance poses a complex challenge to classifiers. To evaluate this, we prepared the same generators as in experiment Sect. 7.1.4 with gradual and sudden concept drift, and combined them with the dynamic increasing imbalance ratio proposed in experiment Sect. 7.1.2. Figure 25 illustrates the performance of the selected classifiers with dynamic increasing imbalance ratio under the presence of concept drift. Table 10 presents the G-Mean and Kappa for the top 10 classifiers for each type of concept drift and the average ranking for each evaluated metric. Figure 26 provides an overall comparison of all classifiers in the proposed scenario.

Discussion

Impact of approach to class imbalance. Let us focus on the changes in the behavior of classifiers as compared with the previous case of evolving class imbalance without explicit concept drift. All methods based on blind resampling display drops in performance, as they usually lack explicit mechanisms for handling concept drift, leading to their deterioration over time. SMOTE-based methods followed the behavior observed in previous experiments, mainly because concept drift combined with an increasing imbalance ratio may lead to temporal incoherence, which exacerbates the problems of oversampling. Only SMOTE-OB displayed satisfactory robustness to a simultaneously evolving imbalance ratio and concept drift, while additionally achieving a good balance between the Kappa and G-Mean metrics.

Classifiers based on training modifications, such as ROSE and CALMID, displayed robustness to concept drift and dynamic imbalance ratio, especially for the G-Mean metric. Their training procedures provide reliability in a scenario where multiple changes happen simultaneously. The cost-sensitive approach of CSARF presents outstanding results regarding the G-Mean metric, with an average rank of almost 1. Nevertheless, when analyzing the Kappa metric, we can see the shortcomings of CSARF, where it ranks third. This shows that the CSARF adaptation to evolving data characteristics is not balanced over both classes. ROSE displays balanced performance on both metrics, which can be explained by its combination of a concept drift detector with balanced buffers for each of the classes, allowing for equalized performance on the majority and minority classes.

Impact of ensemble architecture. This highly difficult scenario further shows that bagging-based and hybrid architectures are the only ones capable of handling drifting and evolving class imbalance. Their superiority over boosting methods becomes even more evident in these experiments. Another interesting observation is the increasing gap between dynamic and static ensemble line-ups. Here we can see that most of the best-performing methods use dynamic replacement of ensemble members. This can be explained by the fact that when concept drift is combined with evolving class imbalance, especially under rapid changes, it is more efficient to train a new classifier from scratch and replace the weakest one, instead of trying to adapt the existing members to a vastly different new concept.
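A dynamic line-up can be reduced to a very small rule. The sketch below is a generic illustration of the replace-the-weakest strategy discussed above; the function name and the optimistic starting weight are ours, and it is not the exact mechanism of any single examined method.

```python
def replace_weakest(ensemble, weights, new_learner):
    """On a detected change, drop the lowest-weighted member and insert a
    fresh learner to be trained from scratch on the new concept."""
    worst = min(range(len(ensemble)), key=weights.__getitem__)
    ensemble[worst] = new_learner
    weights[worst] = 1.0          # optimistic starting weight for the newcomer
    return ensemble, weights
```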

Fig. 25 G-Mean and Kappa on concept drift with increasing imbalance ratio

Table 10 G-Mean and Kappa averages of all 10 streams for concept drift with increasing class imbalance ratio
Fig. 26 Comparison of all 24 algorithms for concept drift with increasing class imbalance ratio. Color gradient represents the product of both metrics (Color figure online)

Impact of concept drift speed. As mentioned in the previous observation, the speed of changes (the velocity of concept drift) significantly impacts the classifiers. We observed that all of the classifiers tend to react worse to gradual drift, while displaying better robustness to sudden drift. While this observation may be surprising, we can explain it by taking a deeper look at how the adaptation mechanisms work in these ensembles. Under sudden concept drift, we can observe a rapid deterioration of the ensemble performance. However, new instances coming from a stable concept are readily available, allowing for recovery and adaptation with a sufficient sample size. When dealing with gradual drift, classifiers do not see the new, fully formed concept so quickly. Therefore, the adaptation process becomes more tedious, as the sample size from the new concept may not be big enough. This may mislead some pruning or weighting mechanisms, forcing costly false adaptations. While in the case of gradual drift we do not observe one single drop in performance, the negative impact of the change is prolonged over time and thus may add up to a bigger challenge for the classifier in the long run.

Relationship between concept drift and increasing class imbalance. In this scenario, each classifier must simultaneously handle concept drift (impacting the decision boundaries) and an evolving class imbalance ratio (impacting both skew-insensitive mechanisms and decision boundaries). This creates a trade-off, with classifiers displaying different behavior patterns. Some methods, like LB or KUE, display high adaptability to concept drift. Others, like OOB, focus on robustness to evolving class imbalance. The most balanced method, offering the best trade-off between those two factors, is ROSE, followed by SMOTE-OB and CSARF.

7.1.6 Real-world binary class imbalanced datasets

Goal of the experiment. This experiment was designed to address RQ5 and to evaluate the performance of the classifiers on 19 real-world imbalanced data streams. The previous experiments focused on analyzing how the classifiers cope with various learning difficulties present in imbalanced data streams using synthetic generators, allowing us to inspect how the classifiers behave in specific and controlled scenarios. Meanwhile, real-world datasets pose specific challenges to classifiers, as they are not generated in a controlled environment. They are characterized by a combination of various learning difficulties that appear with varying intensity and frequency. Their imbalance ratio changes over time, while concept drift may oscillate among different types with varying speed. Therefore, assessing the performance of all classifiers on real-world data is a major step towards their holistic evaluation. The real-world data streams employed in the experiments are popular benchmarks for imbalanced data stream classifiers, and their specifications are presented in Table 11. Figure 27 illustrates the performance of the five selected classifiers on the real-world datasets. Table 12 presents the performance of the top 10 classifiers on each dataset. Figure 28 summarizes the overall performance of all classifiers in the real-world scenario.

Table 11 Real-world binary datasets specifications

Characteristics of real-world imbalanced data streams. Before analyzing the classifiers' performance on real-world datasets, it is important to point out the difference between artificial and real-world imbalanced data streams. Generators are probabilistic and base the generation of instances on prior probabilities taken from the parametric imbalance ratio. The appearance of instances in the stream is dictated strictly by these priors, leading to bounded windows in which minority and majority instances appear. In real-world datasets this does not happen, since they were collected to model observations of specific phenomena and do not follow such clear probabilistic mechanisms. All of this poses unique challenges to classifiers, such as the latency with which instances from a specific class arrive, or long periods during which instances from only one class appear. This configuration of data streams presents many more challenges for streaming classifiers. Such benchmarks allow us to gain insights about the classifiers by examining them under unique and challenging conditions.
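One simple way to quantify this class-arrival latency is to measure, for each class, the longest stretch of the stream in which it never appears. The helper below is an illustrative sketch under that definition (the function name is ours).

```python
import numpy as np

def max_class_gap(labels, cls):
    """Longest run of consecutive instances in which class `cls` never
    appears: a simple proxy for class-arrival latency."""
    labels = np.asarray(labels)
    arrivals = np.flatnonzero(labels == cls)
    if arrivals.size == 0:
        return len(labels)
    padded = np.concatenate(([-1], arrivals, [len(labels)]))
    return int((np.diff(padded) - 1).max())
```

On generator-produced streams this gap stays close to the expected inter-arrival time implied by the priors, while on real-world streams it can span thousands of instances for the minority class.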

Fig. 27 G-Mean and Kappa on binary class imbalanced datasets

Table 12 G-Mean and Kappa on binary class imbalanced datasets
Fig. 28 Comparison of all 24 algorithms for binary class imbalanced datasets. Color gradient represents the product of both metrics (Color figure online)

Discussion

Impact of approach to class imbalance. First, it is interesting to note that, on average, all examined methods displayed much better Kappa than G-Mean performance. We can observe that ensembles utilizing blind resampling, such as OOB and UOB, returned poor performance over real-world data streams. We can explain this by their purely online nature paired with catastrophic forgetting: these ensembles adapt their resampling strategy to the newest arriving instances and are thus not able to retain any memory of previously seen concepts. As in real-world scenarios instances do not arrive in stratified windows, one of the classes may disappear for a while. This confuses such ensembles and leads to a strong skew towards one class that is very difficult to overcome via online blind resampling. Methods based on informed resampling, such as C-SMOTE and SMOTE-OB, displayed satisfactory results, showing that their learning mechanisms are robust to various characteristics of real-world streams. SMOTE-OB also achieved balanced results on both metrics, demonstrating high stability and reliability in real-world cases. Interestingly, C-SMOTE underperformed in the previous synthetic cases, showing discrepancies between artificial and real-world domains. This allows us to conclude that there is a need for further research on real-world imbalanced streams and for capturing more realistic benchmarks that reflect various learning difficulties.

When analyzing algorithm-level modification classifiers, ROSE displayed the best results, especially for the Kappa metric. While for synthetic datasets ROSE was consistently among the best methods, for real-world cases its robustness to a variety of learning difficulties allowed it to demonstrate its full potential. ROSE stores buffers for each class independently, which helps in scenarios with high latency of instances. CSARF remained one of the best-performing classifiers, displaying the best results on G-Mean and being among the best regarding Kappa.

The worst-performing algorithms in the real-world scenario differ from what we saw in the previous scenarios (with the exception of OADA, which remains the weakest classifier). Algorithms that achieved average to good performance in other experiments, such as KUE, HDVFDT, and OBA, did not maintain their performance over real-world datasets.

It is interesting to see that ARF, which was not among the best-performing classifiers in the experiments with synthetic data, was the third-best classifier regarding Kappa. This shows that in the used real-world datasets the impact of concept drift was much more significant than the impact of class imbalance, allowing a method focusing purely on adaptation to changes to rank so high.

Impact of ensemble architecture. Real-world datasets allow us to evaluate how each type of ensemble architecture deals with streams under multiple difficulties appearing at the same time. We can see that all ensemble-based methods display much better performance on average than in the previous experiments. This is especially true of boosting-based methods, which reduced the performance gap to the top-performing algorithms. However, bagging-based and hybrid ensembles are still the superior choices. This shows how these architectures offer better robustness in scenarios where data does not follow uniform characteristics over extended periods.

7.2 Multi-class experiments

The second set of experiments focuses on multi-class problems, where the relationships among the many classes may vary over time (Lango & Stefanowski, 2022). Multi-class imbalanced data is more difficult and less frequently studied than its binary counterpart. There are relative imbalance ratios among classes, and overlapping of the minority and majority classes becomes a greater issue (Santos et al., 2023; Stefanowski, 2021; Lipska & Stefanowski, 2022). These experiments include static imbalance ratio, dynamic imbalance ratio, concept drift with static imbalance ratio, concept drift with dynamic imbalance ratio, an analysis of the impact of the number of classes, real-world multi-class datasets, and semi-synthetic multi-class imbalanced datasets. The number of examined algorithms in this set of experiments is reduced to 15, following their multi-class capabilities shown in Table 2.

7.2.1 Static imbalance ratio

Goal of the experiment. This experiment was designed to address RQ1 and to evaluate the performance and robustness of the classifiers to static class imbalance in a scenario with multiple classes. In multi-class settings, class imbalance can be even more challenging than in binary settings, since multiple classes can be underrepresented. Moreover, the relations among the classes are no longer obvious, since one class may be a majority when compared to some classes but a minority with respect to others. This allows us to analyze how each classifier behaves under specific class distributions. To evaluate this, we prepared three multi-class generators {Hyperplane, RandomRBF, and RandomTree}, all of them with 5 classes using the class distribution {50, 20, 10, 5, 1}. Figure 29 illustrates the performance of the five selected algorithms for each multi-class stream. Table 13 summarizes the performance of the top 10 classifiers for each generator and their average ranking regarding each metric. For overall comparison, Fig. 30 presents the aggregated performance of all classifiers. The axes of the ellipse represent the PMAUC and Kappa metrics (the more rounded, the better agreement) and the color represents the product of both metrics.

Discussion

Impact of class imbalance approach. First, we need to observe that the performance of the algorithms on multi-class problems differs significantly from the binary problems. This shows that multi-class imbalanced data streams pose a series of unique challenges, requiring the development of specific mechanisms dedicated to tackling more than two classes. Simple adaptations of binary mechanisms tend to fail and underperform, especially when dealing with a large number of classes.

For blind resampling methods, we can see a drop in their performance. OOB returns mediocre results, much below the ranks observed in binary scenarios. UOB becomes completely unusable in multi-class problems, failing to achieve any acceptable predictive power. This shows that when dealing with multiple distributions, blind resampling methods cannot capture the complex relationships among classes and tend to further increase the difficulty factors (such as class overlapping or noise). This happens because blind resampling approaches consider only a single class, thus discarding valuable information about the other classes. There is a need to develop novel resampling algorithms dedicated specifically to multi-class data streams.

ARFR was the best algorithm regarding the Kappa metric and the second-best regarding PMAUC. Its weighting mechanism led to good robustness to the multi-class imbalance ratio, as it assigns importance to every tree in the ensemble based on the class distribution (independently of the number of classes). The cost-sensitive CSARF displays the best performance on the PMAUC metric, yet suffers under Kappa evaluation. This shows that CSARF focuses on the minority classes at the cost of a larger number of false positives. The best three classifiers are based on ARF, showing that it is very reliable in multi-class scenarios. Among classifiers based on training modifications, only ROSE achieved good results, demonstrating that keeping buffers for each class is a good choice for this scenario. On the other hand, HDVFDT and GHVFDT were among the worst. It is worth mentioning that CALMID and MICFOAL were not able to outperform the mentioned classifiers, despite being specifically designed for multi-class imbalanced data streams.

Impact of ensemble architecture. Ensembles are once again predominant among the best-performing methods for multi-class imbalanced streams. Within bagging-based methods, only LB underperformed. However, LB is a general-purpose ensemble, so it was expected not to display robustness on par with dedicated skew-insensitive solutions. KUE and SRP could satisfactorily handle static multi-class imbalance. It is also interesting to note that most bagging methods displayed balanced performance considering Kappa and PMAUC, demonstrating that their natural partitioning of instances contributes to a balanced performance among all classes. We have much less information on boosting-based ensembles, as only one of the examined classifiers was suitable for multi-class problems. However, this single case performed poorly, allowing us to assume that the performance of boosting-based methods follows the trends from the binary scenarios.

Fig. 29 PMAUC and Kappa on multi-class static imbalance ratio

Table 13 PMAUC and Kappa on multi-class static imbalance ratio
Fig. 30 Comparison of all 15 algorithms for multi-class static class imbalance ratio. Color gradient represents the product of both metrics (Color figure online)

7.2.2 Dynamic imbalance ratio

Goal of the experiment. This experiment was designed to complement the previous experiment and address RQ2 by evaluating the robustness of the classifiers to dynamic changes in the imbalance ratio with multiple classes. To evaluate this, we prepared three multi-class generators {Hyperplane, RandomRBF, and RandomTree}, all of them with 5 classes shifting the imbalance ratio through the following distributions: {{50, 20, 10, 5, 1}, {20, 10, 5, 1, 50}, {10, 5, 1, 50, 20}, {5, 1, 50, 20, 10}, {1, 50, 20, 10, 5}}. Both sudden and gradual speeds of change were evaluated. This allows us to analyze how classifiers cope with dynamic imbalance ratio changes and how they adapt. Figure 31 illustrates the prequential PMAUC and Kappa for each generator over time for the selected classifiers. Table 14 presents the performance of the top 10 classifiers and their average ranking. To summarize, Fig. 32 shows the overall performance of all classifiers.
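The rotation of class roles above can be expressed compactly. The sketch below is an illustrative way to derive the per-stage class priors (names are ours, and stage boundaries are assumed to split the stream evenly); each cyclic shift makes every class take every role over the course of the stream.

```python
import numpy as np

def stage_priors(stage, base=(50, 20, 10, 5, 1)):
    """Class priors for a given stage; each stage cyclically shifts the base
    ratios, e.g. stage 1 yields (20, 10, 5, 1, 50) before normalization."""
    shift = stage % len(base)
    ratios = np.array(base[shift:] + base[:shift], dtype=float)
    return ratios / ratios.sum()

def draw_label(stage, rng=np.random.default_rng(0)):
    p = stage_priors(stage)
    return rng.choice(len(p), p=p)   # class index sampled with stage priors
```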

Discussion

Impact of class imbalance approach. Interestingly, the average performance of all evaluated classifiers is higher under the dynamic imbalance than under the static skewness ratio. We can explain this by the fact that an evolving imbalance and changing class roles lead to each of the classes being the majority class for a given period of time, thus allowing better exposure of each class to the classifier, as well as countering the small sample size problem (which is a big challenge for multi-class imbalanced data).

Blind resampling methods repeated the trends observed in the previous experiment, with OOB returning acceptable performance and UOB failing to deliver predictive power. Undersampling, by reducing the size of the majority class, can exacerbate the small sample size difficulty instead of temporarily alleviating it. This prevents UOB from capitalizing on the stages of the stream in which a minority class becomes a majority one.

Ensembles based on ARF maintain their very good performance and robustness to the evolving imbalance ratio. ARFR displayed one of the best performances, showing that adding a resampling level substantially enhanced the robustness to drifting class imbalance. CSARF exhibited great performance on the PMAUC metric, but again failed to return satisfactory Kappa. Algorithms based on training modifications, like ROSE and CALMID, showed better robustness and ability to handle drifting imbalance ratios. CALMID excels on the Kappa metric, but drops several ranks under PMAUC. ROSE presented the most balanced results in this scenario regarding both metrics, and can therefore be seen as the most reliable and trustworthy choice for multi-class imbalanced data streams.

Impact of ensemble architecture. As we saw in the previous experiments, bagging methods are among the best-performing ones, which can easily be seen in the overall figure (Fig. 32), where 7 classifiers form a cluster on the bottom-left side of the distribution, all of them being bagging-based ensembles. Hybrid architectures, represented by CALMID and MICFOAL, which were designed specifically for multi-class imbalanced streams, significantly improve their performance when dealing with evolving imbalance ratios.

Impact of drift speed in class imbalance. Considering the speed of changes in class imbalance ratios, we can notice that the impact of speed is marginal for most of the classifiers. Some methods, mainly the ones without any drift handling mechanisms, respond more slowly, but this did not translate into significant changes in predictive performance. The PMAUC metric seems to be more sensitive to the differentiation between gradual and sudden changes, while Kappa values are similar for both speeds. We can explain this by the fact that PMAUC does not consider the class imbalance ratios, thus responding differently to varying speeds of change. Kappa offers a more stable monitoring of the stream changes, unaffected by the velocity of the imbalance ratio evolution.

Fig. 31 PMAUC and Kappa on multi-class shifting imbalance ratio

Table 14 PMAUC and Kappa on multi-class shifting imbalance ratio
Fig. 32 Comparison of all 15 algorithms for multi-class shifting class imbalance ratio. Color gradient represents the product of both metrics (Color figure online)

7.2.3 Concept drift and static imbalance ratio

Goal of the experiment. This experiment was designed to complement the previous experiments, address RQ2 and RQ4, and evaluate the behavior of the classifiers in a scenario with multiple classes in the presence of concept drift and a static imbalance ratio. Concept drift leads to changes in decision boundaries, challenging the classifiers to cope with and react to change. To evaluate this, we prepared three stream generators similarly to experiment Sect. 7.2.1, plus a concatenation of all three streams, and introduced concept drifts along the stream, either gradually or suddenly. Figure 33 presents the performance of the five selected classifiers for each evaluated drifting stream. Table 15 provides the PMAUC and Kappa for the top 10 classifiers for both types of drift and their average value and ranking. Figure 34 illustrates the overall performance of all classifiers.

Discussion

Impact of class imbalance approach. Concept drift poses an increased difficulty in multi-class scenarios, as it changes the complex relationships among classes. Multi-class problems tend to have much more complex decision boundaries than their binary counterparts, and thus adaptation to drift requires more training instances or more time.

When analyzing resampling-based approaches, we can observe a significant drop in predictive power for both OOB and UOB. We already established that UOB is incapable of handling multi-class problems, but the additional presence of concept drift positions it among the worst-performing classifiers. This follows our observations from the binary experiments, where we showed that the lack of explicit or implicit drift adaptation mechanisms in OOB and UOB inhibits their ability to learn from non-stationary data. ARFR returned the best results among resampling-based algorithms, while remaining competitive with the other top-performing classifiers.

The basic version of ARF displayed a loss of performance, showing that this algorithm cannot handle changes appearing in multiple classes at once, especially when these classes are skewed. Its cost-sensitive modification maintained the very good performance observed in the previous experiments, additionally improving under the Kappa metric. This shows that CSARF is capable of an efficient adaptation to concept drift. CALMID and MICFOAL displayed good results, being methods natively designed for multi-class scenarios. ROSE was among the best-performing algorithms, without relying on resampling or cost-sensitive modifications. This shows that the ROSE mechanisms, mainly effective classifier replacement and class-based buffers, allow for improved robustness in drifting and imbalanced multi-class scenarios.

Impact of ensemble architecture. Once again, bagging-based and hybrid architectures tend to dominate the experimental study. Even methods such as LB and SRP returned decent results, despite their lack of skew-insensitive mechanisms. This shows that well-designed drift adaptation goes a long way in every streaming scenario and that bagging-based architectures can utilize their diversity to better anticipate drift occurrence. Two exceptions to this rule are KUE, which, as we observed in the binary experiments, cannot perform well under concept drift combined with imbalance, and UOB, which does not adapt well to multi-class imbalance.

Impact of concept drift speed. We can see that the speed of concept drift does not significantly affect the results of individual classifiers. However, we can see different behavior of the metrics as compared to the previous experiment. Here, Kappa reacts differently to gradual and sudden drifts, showing that the speed of evolution of class boundaries can be picked up by Kappa analysis. This allows us to conclude that when concept drift is combined with imbalance, both PMAUC and Kappa become sensitive to the speed of changes.

Fig. 33 PMAUC and Kappa on concept drift and multi-class static imbalance ratio

Table 15 PMAUC and Kappa on concept drift and multi-class static imbalance ratio
Fig. 34 Comparison of all 15 algorithms on concept drift and multi-class static class imbalance ratio. Color gradient represents the product of both metrics

7.2.4 Concept drift and dynamic imbalance ratio

Goal of the experiment. This experiment was designed to complement the previous experiments and address RQ4 by evaluating the behavior of the classifiers in a scenario with multiple classes in the presence of concept drift and a dynamic imbalance ratio. Besides concept drift, changes in the imbalance ratio pose obstacles for classifiers that have to deal with multiple changes in the data distribution. To evaluate this, we prepared three stream generators similarly to experiment Sect. 7.2.2, but introduced concept drifts along the stream gradually and suddenly. Figure 35 illustrates the PMAUC and Kappa metrics of the selected classifiers for each evaluated drifting stream. Table 16 presents the PMAUC and Kappa for the top 10 classifiers for both types of drift and their average value and ranking. Figure 36 provides the overall performance of all classifiers.

Discussion

Impact of class imbalance approach. Regarding blind resampling methods, we can see that the combination of concept drift and evolving imbalance ratios led to a significant deterioration of the OOB results, showing that blind oversampling cannot adapt well to changes happening in both the feature space and the class characteristics. UOB was impacted even more significantly, making it the worst classifier in this scenario. ARFR is still among the best-performing methods; however, we can see a small drop in performance compared to the previous experiment. This shows that informed resampling techniques still require more work regarding the adaptation to both drifting and evolving imbalance ratios, as especially the class role switching became challenging for ARFR.

When analyzing algorithm-level solutions, we can see that CSARF, while still performing well on PMAUC, displayed reduced performance on Kappa. This shows that it cannot handle evolving imbalance ratios and class roles well, having a high bias towards the initial roles of the classes. CALMID and MICFOAL improved their relative ranking with respect to the previous experiments, showing that they are resilient enough to handle both challenges at the same time.

ROSE is a clear winner in this scenario, showing the best robustness to multiple types of changes affecting the data stream. Its adaptation and skew-insensitive mechanisms allow it to efficiently handle the combination of concept drift and dynamic class imbalance, easily adapting to the new incoming concepts, even with changed class roles.

Impact of ensemble architecture. Once again, we can see a clear dominance of bagging-based and hybrid architectures. However, this difficult learning scenario gives us a very unexpected insight: SRP and LB are able to outperform CALMID and MICFOAL. This is highly surprising, as the former methods are general-purpose classifiers, while the latter were specifically designed to handle imbalanced multi-class streams. Additionally, KUE achieved performance similar to dedicated skew-insensitive ensembles. This allows us to conclude that the combination of a bagging-based or hybrid architecture with an effective drift adaptation mechanism is a leading factor in the performance of ensemble classifiers for drifting and dynamically imbalanced streams. Therefore, it is crucial for future researchers not to focus solely on how to handle class imbalance, but first on how to handle non-stationary characteristics, and then to make this adaptation mechanism skew-insensitive.

Impact of concept drift speed. Once again, we are unable to see a clear relationship between the speed of concept drift and classifier performance. Even under sudden drifts, most of the examined methods were able to quickly recover and return to their performance before the change. Therefore, end results are similar for any speed of change. The differences can be observed very locally during the drift occurrence, but they did not have a long-lasting effect on any classifier.

Relationship between concept drift and shifting imbalance ratio. This scenario combines two types of changes, creating a more realistic and challenging setting. Therefore, we need to understand the impact of each type of change on the underlying classifier. Analyzing the results, we can see that most existing algorithms are characterized by a trade-off: they focus either on adaptation to changes or on robustness to class imbalance. Only ROSE and SRP displayed balanced performance on both tasks. This supports our previous conclusion that there is a need to design novel methods where adaptation and skew-insensitiveness are solved as a joint problem.

Fig. 35 PMAUC and Kappa on concept drift and multi-class shifting imbalance ratio

Table 16 PMAUC and Kappa on concept drift and multi-class shifting imbalance ratio
Fig. 36 Comparison of all 15 algorithms on concept drift and multi-class shifting class imbalance ratio. Color gradient represents the product of both metrics (Color figure online)

7.2.5 Impact of the number of classes

Goal of the experiment. This experiment was designed to evaluate the robustness of the classifiers to different numbers of classes under the presence of concept drift and a dynamic imbalance ratio. Combining those learning difficulties with a varying number of classes allows us to evaluate how classifiers deal with a higher number of classes and to examine whether it affects their learning mechanisms. To evaluate this, we used the generators from Sect. 7.2.2, each evaluated with the following numbers of classes: {3, 5, 10, 20, 30}. Figure 37 illustrates the performance of five selected classifiers for each number of classes. Table 17 summarizes the performance of the top 10 classifiers for each number of classes and their average ranking for each metric. For overall comparison, Fig. 38 presents the aggregated performance of all classifiers. Axes of the ellipse represent the PMAUC and Kappa metrics (the more rounded, the better) and the color represents the product of both metrics.

Discussion

Impact of class imbalance approach. We can see that a high number of classes poses a significant challenge for most of the examined methods. For resampling-based approaches, we observe that OOB and UOB cannot handle any higher number of classes, returning the worst performance, on par with single-tree classifiers (GHVFDT and HDVFDT). ARFR maintains its performance as the number of classes increases, showing that the combination of informed resampling with an ARF-based architecture allows for memorizing more complex decision boundaries, while combating bias using well-placed artificial instances.

When looking at the algorithm-level modifications, we can see that CSARF, previously one of the best algorithms, displays no robustness to an increasing number of classes. This shows the limitations of cost-sensitive approaches: as the number of classes increases, the cost matrix must grow with it. Large cost matrices dilute the meaning of the penalties, reducing their influence on the learning process until they no longer counter the bias towards majority classes. We can conclude that existing cost-sensitive methods are not suitable for handling multi-class imbalanced streams with a high number of classes. CALMID, despite being designed for multi-class problems, cannot handle an increasing number of classes and returns performance similar to ensembles based on blind resampling. ROSE and MICFOAL displayed the best robustness to a high number of classes. ROSE, especially for the Kappa metric, is a safe choice for scenarios with an elevated number of classes.

Impact of ensemble architecture. Our analysis of the ensembles under an increasing number of classes showed once again the dominance of bagging-based and hybrid architectures. In this scenario, hybrid approaches became dominant, with ROSE displaying the best results due to its combination of working on both instance and feature subspaces, together with per-class memory buffers that maintain balanced class representations.

Impact of the high number of classes. With the increasing number of classes, we can see a clear break point when the number of classes exceeds 20. This shows that all classifiers could handle the increasing number of classes up to a certain point, after which their capabilities for memorizing new concepts and generalizing over all classes begin to rapidly deteriorate. For scenarios with 30 classes, most of the methods start returning highly unsatisfactory results. Interestingly, for these cases we can observe very good performance of standard classifiers, such as SRP. When analyzing ranks, we can see that SRP and ROSE are the two best-performing ensembles when handling a high number of classes. While we provided an explanation for the better performance of ROSE, it is very surprising to see that SRP performs on par with it. We can explain this by the fact that both ROSE and SRP use feature subspaces, which can be seen as lower-dimensional projections of a difficult learning task. In such lower-dimensional subspaces, the decision boundaries among classes may be simplified, leading to better generalization capabilities. This follows the observation made in Korycki and Krawczyk (2021b), where it was postulated that low-dimensional representations can overcome class imbalance without any dedicated skew-insensitive mechanisms.
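As an illustration of this mechanism, the sketch below (our simplification with hypothetical names; batch scikit-learn trees stand in for the incremental Hoeffding trees actually used by SRP and ROSE) shows how each ensemble member can be confined to a random feature subspace, i.e., a lower-dimensional projection of the task.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

class SubspaceMember:
    """One ensemble member restricted to a random feature subspace."""

    def __init__(self, n_features, subspace_size):
        self.features = rng.choice(n_features, size=subspace_size, replace=False)
        self.model = DecisionTreeClassifier(max_depth=10)

    def fit(self, X, y):
        self.model.fit(X[:, self.features], y)
        return self

    def predict(self, X):
        return self.model.predict(X[:, self.features])

def majority_vote(members, X):
    # Combine members by plain majority voting over their subspace predictions
    votes = np.stack([m.predict(X) for m in members])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```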

Fig. 37 Impact of the number of classes on PMAUC and Kappa under concept drift and multi-class shifting imbalance ratio

Table 17 PMAUC and Kappa averages on the number of classes under concept drift and multi-class shifting imbalance ratio
Fig. 38 Comparison of all 15 algorithms on the number of classes under concept drift and multi-class shifting imbalance ratio. Axes of the ellipse represent PMAUC and Kappa metrics. Color gradient represents the product of both metrics (Color figure online)

7.2.6 Real-world multi-class imbalanced datasets

Goal of the experiment. This experiment was designed to address RQ5 and to evaluate the performance of the classifiers on 18 multi-class real-world imbalanced and drifting data streams. The previous experiments focused on analyzing how the classifiers deal with multiple learning difficulties in multi-class data streams, which allowed us to examine their behavior in very specific and controlled scenarios. With data stream generators we have full control over the created data, but we cannot reproduce the conditions present in real-world problems, which are characterized by various learning difficulties merging at varying frequency and intensity. The real-world data streams employed in this experiment are popular benchmarks for data stream classifiers, and their specifications are presented in Table 18. The PMAUC and Kappa for the five selected classifiers are presented in Fig. 39. Table 19 provides the average PMAUC and Kappa for the top 10 classifiers for each dataset. Figure 40 illustrates the overall performance of all classifiers on all real-world datasets.

Characteristics of real-world data streams. When analyzing the performance of classifiers on real-world datasets, it is worth bringing up the difference between artificial streams and real-world imbalanced data streams. In real-world datasets, data was collected to model observations of a specific phenomenon and does not follow clear probabilistic mechanisms the way stream generators do. Also, in a multi-class real-world scenario, the relations between features and classes are not as clearly defined as they are in artificial generators. This benchmark allows us to gain insights about the classifiers by examining them under real, unique, and challenging conditions.

Table 18 Real-world multi-class datasets specifications

Discussion

Impact of class imbalance approach. Similar to the previously analyzed binary case, real-world datasets bring a combination of various challenges in addition to the multi-class nature of the analyzed streams. However, contrary to our observations from the binary experiments, we cannot single out any of the evaluated classifiers as clearly better than its peers. It is also noticeable that, on average, PMAUC was very similar for all classifiers, while Kappa values tend to highlight more differences among algorithms. This shows that Kappa is an effective metric for multi-class imbalanced data streams, allowing us to gain more insight into how each of the algorithms is performing.

Analyzing the resampling-based approaches, we can see that UOB returned unsatisfactory results, confirming our observations regarding its inability to cope with multiple classes. OOB returned much better predictive power, however only for datasets with a relatively small number of classes. This confirms our previous observation that blind resampling methods are not suitable for problems characterized by a high number of classes. Interestingly, ARFR returned much better results, but on a similar level to standard ARF. This shows that the major reason behind the success of ARFR lies not in the chosen informative resampling, but in a good selection of the ensemble architecture.

For algorithm-level approaches, CSARF remained among the best-performing classifiers, displaying an excellent PMAUC, but falling behind when it comes to the Kappa evaluation. ROSE, CALMID, and MICFOAL presented highly satisfactory results. It is worth noting that ROSE did not perform as well as in previous scenarios, which can be explained by the lack of specific learning difficulties in the analyzed real-world data streams (as ROSE excels in very difficult problems). CALMID and MICFOAL demonstrated better performance than on artificial domains, showing that their mechanisms lead to good performance on real-world problems.

Impact of ensemble architecture. While this experiment follows all our previous observations, we should focus on a comparison between general-purpose and skew-insensitive ensembles. Similarly to the experiment with a high number of classes, we can observe very good performance of general-purpose ensembles on real-world imbalanced benchmarks. CSARF displayed the best results on real-world datasets regarding PMAUC but the worst regarding Kappa. This shows that, on the analyzed benchmarks, adaptation to change and the ability to better separate classes in lower-dimensional subspaces can return performance at least as good as dedicated mechanisms for tackling class imbalance.

Fig. 39 PMAUC and Kappa on multi-class imbalanced datasets

Table 19 PMAUC and Kappa on multi-class imbalanced datasets
Fig. 40 Comparison of all 15 algorithms for multi-class imbalanced datasets. Color gradient represents the product of both metrics (Color figure online)

7.2.7 Semi-synthetic multi-class imbalanced datasets

Goal of the experiment. This experiment was designed to address RQ5 in more depth and to evaluate the robustness of the classifiers to semi-synthetic data streams (Korycki & Krawczyk, 2020). We used all 9 multi-class semi-synthetic data streams proposed therein. These benchmarks simulate critical class-ratio changes and concept drifts. This allows us to analyze how the classifiers cope with dynamic changes and concept drifts based on real-world data streams, and how they adapt to those changes. Figure 41 illustrates the performance of five selected algorithms on the semi-synthetic data streams. Table 20 presents the average PMAUC and Kappa for the top 10 classifiers for each of the evaluated streams and the overall rank of the algorithms. Figure 42 provides a comparison of all algorithms.

Discussion

Impact of class imbalance approach. Semi-synthetic benchmarks allowed us to use real-world data to create much more challenging scenarios with rapidly evolving imbalance ratios. Thus, we preserved the desirable characteristics of real-world problems (such as mixed types of drift) but enhanced them with a much more challenging problem from the imbalance standpoint. When analyzing the results, we can see that the classifiers formed two clusters with respect to their predictive performance.

For resampling-based methods, we can see that UOB and OOB returned opposite performance, despite sharing a similar core. Here, we can see the superiority of oversampling, which confirms the observations in Korycki and Krawczyk (2020), where the authors of these semi-synthetic benchmarks postulated that smart oversampling is the best solution. ARFR again returned performance very similar to standard ARF, highlighting that its predictive power can mainly be attributed to its robust core design.

For algorithm-level methods, CSARF was the best-performing classifier regarding PMAUC, while surprisingly also displaying good results on Kappa. ROSE, CALMID, and MICFOAL displayed great performance, showing that their hybrid mechanisms are capable of efficiently handling rapid changes in imbalance ratios within real-world datasets.

Impact of ensemble architecture. By adding sudden and extreme changes to real-world benchmark datasets, we could see an increase in the gap between the best and worst performing methods. Similarly to the previous experiments, we can observe excellent performance returned by SRP, showing significant potential in using low-dimensional representations for imbalanced data streams, a direction so far explored only in Korycki and Krawczyk (2021b).

Fig. 41 PMAUC and Kappa on semi-synthetic multi-class imbalanced datasets

Table 20 PMAUC and Kappa on semi-synthetic multi-class imbalanced datasets
Fig. 42 Comparison of all 15 algorithms for semi-synthetic multi-class imbalanced datasets. Color gradient represents the product of both metrics (Color figure online)

7.3 Overall comparison

Goal of the experiment. The previous experiments discussed how individual underlying data properties affected the performance of the classifiers. The goal of this experiment is to perform a joint comparison of the algorithms and to identify performance trends and divergences that will allow us to make recommendations to end-users. Moreover, we analyze the computational and memory complexity of the algorithms to address RQ6. The goal of any algorithm for data streams is to simultaneously maximize the classification metrics while minimizing runtime and memory consumption (Krempl et al., 2014). However, these are often conflicting objectives: highly accurate methods often require long runtimes, which is not acceptable for real-time high-speed data streams. Table 21 shows the runtime and memory consumption of the 24 algorithms for both binary and multi-class imbalanced streams. Figures 43 and 44 present a pairwise joint comparison of the algorithms' ranks on G-Mean, PMAUC, Kappa, runtime, and memory consumption across all experiments. Figure 45 shows a circular stacked barplot with the ranks for the four metrics; the bigger the stack, the better the aggregated performance. The circular barplot displays the algorithms sorted clockwise by stack size.

Discussion

Classification metrics. All the above experiments showcased the importance of using not just a single metric for evaluating classifiers for imbalanced data streams, but a set of diverse and complementary metrics. G-mean and PMAUC are strongly correlated with each other and follow the same trends, making the use of both redundant. However, by adding the Kappa metric we gained additional insight into specific characteristics of the evaluated classifiers, allowing us to better understand which of the classifiers favor only minority classes and which return balanced performance over all analyzed classes.
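To illustrate the complementarity, consider a minimal sketch (our own example with made-up counts) computing both metrics from a confusion matrix: a majority-biased classifier on a 95:5 stream reaches 95.5% accuracy, yet both Kappa and G-mean expose its poor minority-class behavior.

```python
import numpy as np

def kappa(cm):
    """Cohen's Kappa from a confusion matrix (rows: true, cols: predicted)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    p_observed = np.trace(cm) / n
    p_chance = (cm.sum(axis=0) @ cm.sum(axis=1)) / n**2   # agreement by chance
    return (p_observed - p_chance) / (1.0 - p_chance)

def g_mean(cm):
    """Geometric mean of per-class recalls."""
    cm = np.asarray(cm, dtype=float)
    recalls = np.diag(cm) / cm.sum(axis=1)
    return recalls.prod() ** (1.0 / len(recalls))

# Majority-biased classifier on a 95:5 stream: accuracy 0.955, but
# Kappa ~ 0.17 and G-mean ~ 0.32 reveal the neglected minority class.
cm = [[950, 0],
      [ 45, 5]]
print(kappa(cm), g_mean(cm))
```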

The two best-performing classifiers across all experiments were ROSE and CSARF. ROSE returned the single best performance regarding the Kappa metric and one of the best for the other metrics. This allows us to conclude that ROSE is a well-rounded classifier that demonstrates robustness to the various learning difficulties embedded in imbalanced and drifting data streams, both binary and multi-class. CSARF returned excellent results in both types of experiments on the G-Mean (for binary tasks) and PMAUC (for multi-class tasks) metrics. However, its rank dropped significantly under the Kappa evaluation, showing that CSARF is driven by its performance on minority classes rather than balanced performance on all of them. Furthermore, CSARF becomes unsuitable for scenarios with a very high number of classes.

Table 21 Comparison of runtime (seconds per 1,000 instances) and memory consumption (RAM-Hours)
Fig. 43 Overall comparison of algorithms' ranks for G-Mean/PMAUC versus Kappa and Memory Consumption versus Runtime on binary and multi-class imbalanced benchmarks. Color gradient represents the product of each pair of metrics (Color figure online)

Fig. 44 Overall comparison of algorithms' ranks for G-Mean/PMAUC/Kappa versus Runtime/Memory Consumption on binary and multi-class imbalanced benchmarks. Color gradient represents the product of each pair of metrics (Color figure online)

Fig. 45 Overall comparison of stacked algorithms' ranks for G-Mean/PMAUC, Kappa, runtime, and memory on binary and multi-class imbalanced benchmarks. Algorithms are sorted clockwise by the stacked ranks, best to worst. Equal weight for the four metrics

Other highly ranked classifiers included SMOTE-OB and OOB for binary scenarios and ARFR for multi-class ones. SMOTE-OB was the only SMOTE-based classifier that ranked among the top performers, showing that SMOTE-based resampling for drifting streams needs further development to achieve success, especially under instance-level difficulties. OOB did not obtain good results in multi-class scenarios, with close-to-average performance. Since SMOTE-OB does not support multi-class problems, we could not evaluate it in this scenario. For multi-class imbalanced data streams, ARFR returned excellent results. This can be explained by the ability of its architecture to deal with multi-class scenarios and adapt to changes in multiple classes. This, combined with an informed resampling approach, leads to an effective classifier capable of handling multiple skewed classes in the stream.

OADA can be pointed out as the worst classifier regarding classification metrics in both settings. This gives us insights into the limitations of boosting-based ensembles for imbalanced data streams, where various learning difficulties destabilize the boosting procedure and lead to low predictive power.

Computational and memory complexity. When evaluating a classifier for data stream mining, we have to take into account the resources needed to run it. In the streaming setting we often deal with situations where memory or computational power is limited; thus we may not choose the best classifier, but the one that fits our scenario. HDVFDT and GHVFDT are characterized by very small memory usage and fast runtime. This happens because they are tree-based classifiers, which are naturally lightweight, with simple structure and low-cost prediction mechanisms. UOB can also be seen as a relatively low-cost classifier, which is justified by its removal of samples from the data stream in order to balance it. Therefore, we reduce the size of each batch and obtain more compact base classifiers.

When analyzing the classifiers that require the highest computational resources, we can see that they are dominated by oversampling-based approaches. This is an expected observation, as oversampling increases the size of an already big data stream by generating a high number of artificial instances. An additional increase in computational cost lies in the oversampling method itself: all SMOTE-based approaches rely on nearest neighbor computation to generate artificial instances, which leads to significant increases in their complexity. Among the remaining oversampling approaches, OUOB and OSMOTE consumed the highest amount of resources, which can be explained by their employing multiple resampling mechanisms combined with dynamic switching between them and drift detectors. Out of the classifiers that do not rely on resampling, OADA was the most computationally heavy one. Although its memory consumption is similar to other ensemble approaches, its runtime was higher than that of its peers. This is another argument against using current boosting-based algorithms for imbalanced data streams.
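Runtime figures of this kind can be reproduced with a simple prequential (test-then-train) loop. The sketch below assumes a River-style incremental classifier exposing predict_one/learn_one and reports seconds per 1,000 instances, matching the unit used in Table 21; it is an illustration, not the exact measurement harness used in the study.

```python
import time

def prequential_eval(model, stream, report_every=1_000):
    """Test-then-train loop reporting accuracy and seconds per 1,000 instances."""
    correct, start = 0, time.perf_counter()
    for i, (x, y) in enumerate(stream, 1):
        correct += int(model.predict_one(x) == y)   # test on the instance first ...
        model.learn_one(x, y)                       # ... then use it for training
        if i % report_every == 0:
            secs_per_1k = (time.perf_counter() - start) / (i / 1_000)
            print(f"{i}: accuracy={correct / i:.3f}, {secs_per_1k:.3f} s / 1k instances")
```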

Relationship between predictive power and computational and memory complexity: is there a trade-off? We can see that the two analyzed criteria are often in direct opposition to each other: the most lightweight classifiers are also among the worst performing ones. How, then, can one strike a balance between predictive power and computational complexity? How should one select the best trade-off for imbalanced data streams? To select the most suitable classifier for a given data stream, we cannot always pick the best performer on classification metrics, due to resource restrictions. For example, SMOTE-OB obtained excellent results regarding classification metrics, but often required more than 256GB of RAM per run, a prohibitive number for many real-world scenarios. On the other hand, we cannot simply choose a lightweight classifier if it does not offer good predictive power for the problem (e.g., single tree-based classifiers are not competitive with most ensembles for imbalanced data streams).

Analyzing all our experiments, we aim to select classifiers that balance both sides. We can clearly see that OOB, UOB, ROSE, and CALMID presented the best trade-off between predictive performance and computational complexity in the binary and multi-class experiments. ROSE presented the best overall performance when using equal weights for the predictive performance and complexity metrics. While ranked second, OOB demanded more memory and runtime when building the classifier due to its oversampling. UOB undersamples the majority class, which reduces the runtime complexity of classifier learning; however, as shown above, undersampling in multi-class imbalanced data does not perform as well as in the binary scenario. ROSE and CALMID rely on highly efficient hybrid architectures and do not employ any costly mechanisms such as oversampling or an adaptive cost-sensitive matrix. When focusing on the predictive metrics only, ROSE, SMOTE-OB, and CSARF perform best on binary-class streams, while ARFR, ROSE, and CSARF perform best on multi-class streams.
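The equal-weight aggregation behind this comparison can be expressed in a few lines; the ranks below are hypothetical placeholders, not values from Table 21 or Fig. 45.

```python
import numpy as np

# Hypothetical per-criterion ranks (1 = best); columns: G-Mean/PMAUC, Kappa,
# runtime, memory. Equal weights, lower mean rank = better overall trade-off.
ranks = {"ROSE":  [2, 1, 5, 6],
         "CSARF": [1, 9, 8, 7],
         "UOB":   [8, 7, 2, 1]}

for name, r in sorted(ranks.items(), key=lambda kv: np.mean(kv[1])):
    print(f"{name}: mean rank = {np.mean(r):.2f}")
```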

8 Lessons learned

In order to address RQ7 and summarize the knowledge we extracted through the extensive experimental evaluation, this section presents the lessons learned and recommendations for future researchers.

Design of the experimental study. To gain insights into the performance of classifiers and fairly evaluate them for imbalanced data streams, a properly designed experimental testbed is crucial. The experimental evaluation must be done in a holistic and comprehensive manner that assesses the robustness of the classifiers to the most important challenges embedded in imbalanced data streams. These must include: (i) static and dynamic imbalance ratios with switching class roles; (ii) instance-level difficulties; (iii) various types and speeds of concept drift; (iv) binary and multi-class scenarios; (v) increasing numbers of classes; and (vi) real-world datasets. Only such a comprehensive evaluation allows for comparing new classifiers to the existing state-of-the-art. For the sake of reproducible research, this paper offers a ready-to-use testbed, available on GitHub, that allows for easy and reproducible evaluation of new classifiers designed for imbalanced data streams.

Class imbalance approach. Our experiments showed that the top-performing methods included two approaches based on training modifications (ROSE and CALMID), two approaches based on resampling (ARFR and SMOTE-OB), and one cost-sensitive method (CSARF). This is a very interesting outcome, as it shows that any of the existing approaches to class imbalance can achieve excellent robustness, and thus echoes the no-free-lunch theorem: there is no single best way of tackling class imbalance in drifting data streams. Each of these solutions has its merits and works best in slightly different settings. In the next section we formulate recommendations on which algorithms should be used in which scenarios. For future research it is important to understand which characteristics of each successful algorithm led to its superior performance, as those characteristics should be preserved and further developed when designing new classifiers.

Desirable properties of data-level solutions. When analyzing the resampling-based algorithms, we can see the dominance of oversampling approaches, in both their blind and informative versions. Blind oversampling has a much lower computational cost and good reactivity to concept drift; however, it fails in multi-class scenarios, especially with a high number of classes. Informative oversampling based on SMOTE, when combined with ensembles, offers very high predictive power, being able to handle instance-level difficulties and adapt to various types of non-stationary stream characteristics. This comes at the price of extremely high computational complexity (mainly due to the distance calculations), as well as being currently designed only for binary problems.
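The core of blind online oversampling can be summarized in a few lines. The sketch below follows the OOB idea of inflating the Poisson rate of online bagging for minority instances using time-decayed class sizes; it is our simplified reading of that mechanism, not the reference implementation, and the decay value is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def oversampling_rate(y, class_size, decay=0.9):
    """OOB-style Poisson rate: inflate training weight for minority instances.

    class_size: dict mapping class label -> time-decayed size estimate,
    e.g. initialized uniformly over the known labels.
    """
    for k in class_size:  # update decayed per-class size estimates
        class_size[k] = decay * class_size[k] + (1.0 - decay) * (k == y)
    return max(class_size.values()) / max(class_size[y], 1e-9)

# Online-bagging update for one base learner and one labeled instance (x, y):
# k = rng.poisson(oversampling_rate(y, class_size))
# for _ in range(k):
#     learner.learn_one(x, y)
```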

Desirable properties of algorithm-level solutions. When analyzing the algorithm-level solutions, we can see that the two dominant approaches were based either on modifying the training method or on cost-sensitive classification. ROSE stands as a primary example of effective training modification, as it offers top performance over a plethora of analyzed scenarios and the best robustness to various learning difficulties. This can be attributed to the combination of diversity assurance for base classifiers (on both the instance and feature levels), an effective classifier replacement scheme (where pruning can replace multiple classifiers at once), and the absence of any resampling scheme (instead using class-specific buffers that allow for handling a high number of classes). Those modifications allowed ROSE to strike a balance between predictive power (across all metrics) and computational complexity. The cost-sensitive solution realized within CSARF showed that the combination of an efficient design with a cost matrix leads to a highly competitive classifier that offers great adaptation to concept drift and does not rely on any resampling. However, current limitations of cost-sensitive approaches include a bias towards G-mean/PMAUC (while underperforming on Kappa) and an inability to effectively handle a higher number of classes.
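The cost-sensitive decision rule itself is simple. The sketch below (with illustrative inputs, not CSARF's actual cost-matrix adaptation) picks the class minimizing expected cost, which also hints at why such methods favor minority recall at the expense of Kappa.

```python
import numpy as np

def cost_sensitive_predict(proba, cost):
    """Pick the class minimizing expected misclassification cost.

    proba: (n_classes,) posterior estimates; cost[i][j]: cost of predicting j
    when the true class is i (zero diagonal). Both are illustrative inputs.
    """
    expected_cost = np.asarray(proba) @ np.asarray(cost)
    return int(np.argmin(expected_cost))

# Skewed posteriors plus a high cost for missing the minority class (index 1):
proba = [0.85, 0.15]
cost = [[0, 1],
        [10, 0]]
print(cost_sensitive_predict(proba, cost))   # 1: minority wins despite low posterior
```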

Ensemble architectures. All experiments pointed to the dominance of bagging-based and hybrid ensemble architectures (note that the most successful hybrid architectures were also rooted in bagging). Both static and dynamic ensemble setups worked well with bagging initialization, showing that it leads to the creation of diverse base learners that can perform well under concept drift and various learning challenges. Furthermore, ensembles that added feature space diversification on top of bagging, such as ROSE or ARFR, were among the top performers. This shows that feature space manipulation is a highly promising direction. Boosting proved to be the least efficient, unable to cope with high imbalance ratios or data-level difficulties.

Adaptation to concept drift versus robustness to class imbalance. The most challenging scenarios were those in which dynamic class imbalance was combined with concept drift. Here we could observe that the classifiers focused either on drift adaptation or on handling bias towards majority classes. Interestingly, classifiers with very good adaptation mechanisms tend to perform slightly better in these complex scenarios than their counterparts that focus mainly on robustness to imbalance.

Data-level difficulties. Instance-level characteristics can be very disruptive to existing algorithms for imbalanced data streams. They should be analyzed not only as individual instances, but also as subconcepts within the minority class that can evolve over time (e.g., merge or split). We can see that resampling-based solutions tend to perform well under these difficulties, mirroring observations for static data. However, none of the algorithms could explicitly use the instance-level characteristics to their advantage, as suggested by Krawczyk and Skryjomski (2017).

Handling high number of classes. When analyzing the robustness of classifiers to a very high number of classes, we observed that SRP, a general-purpose ensemble with no skew-insensitive mechanisms, returns one of the best performances. This, combined with the very good performance of ROSE, allows us to conclude that for multi-class imbalanced problems with a very high number of classes, using lower-dimensional representations may simplify the learning task. Using feature subspaces may also lead to more diverse capturing of relationships among classes. This confirms the observations made by Korycki and Krawczyk (2021b), who discussed the merit of low-dimensional embeddings for extremely imbalanced and difficult data streams.

Classifier evaluation. Evaluating a classifier on imbalanced data streams requires the use of multiple diverse and complementary metrics. In our testbed we argue for the use of Kappa and G-Mean/PMAUC. These metrics assess different and complementary perspectives; if only one is provided, the evaluation of a classifier is biased towards measuring either how it performs on the minority class under a high imbalance ratio (Kappa) or how it balances majority and minority class performance (PMAUC/G-mean). We showed how, under high imbalance ratios, Kappa significantly penalizes false positives, whereas G-Mean tolerates a larger proportion of false positives.

Computational and memory complexity. One must take into account the trade-off between predictive power and computational complexity. The algorithms requiring the lowest resource consumption are among the weakest ones (such as skew-insensitive versions of Adaptive Very Fast Decision Trees). On the other hand, some of the best performing classifiers are characterized by almost prohibitive computational complexity (e.g., SMOTE-OB). ROSE, CALMID, and OOB presented the best trade-off between resource consumption and predictive power.

9 Recommendations

After analyzing all the scenarios and evaluating the different approaches to class imbalance, we summarize recommendations to help future researchers when designing their own algorithms to tackle imbalanced data streams and other learning difficulties:

Choose the best off-the-shelf algorithms. If you are looking for efficient classifiers for solving real-world imbalanced data streams, or for effective reference methods for your experiments, it is important to be aware of the most efficient off-the-shelf solutions. Based on our exhaustive experimental study, we can recommend ROSE, CSARF, OOB, ARFR, and CALMID as ready-to-use and effective classifiers. We especially recommend ROSE due to its balanced performance, great trade-off between predictive power and computational cost, excellent robustness in all analyzed scenarios, and ease of use thanks to its autonomously self-adaptive parameters.

Analyze the dynamics of the imbalance ratio. In data streams where the imbalance ratio is static, oversampling and training-modification methods return excellent performance, and ensembles based on bagging and hybrid architectures are a good choice. When it comes to evolving imbalance ratios, we need more sophisticated mechanisms that adapt to the changing imbalance ratio. Here we can see a dominance of algorithm-level solutions that offer a dynamic ensemble line-up with effective pruning, such as ROSE. A simple way to monitor such dynamics is sketched below.
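A minimal sketch of such monitoring, assuming hypothetical values for the decay factor and the flip margin (the hysteresis that prevents noise from being flagged as a role switch):

```python
def track_imbalance(labels, n_classes, decay=0.99, flip_margin=0.1):
    """Track time-decayed class priors and flag class-role switches."""
    prior = [1.0 / n_classes] * n_classes
    majority = 0
    for t, y in enumerate(labels):
        prior = [decay * p + (1.0 - decay) * (k == y) for k, p in enumerate(prior)]
        candidate = max(range(n_classes), key=lambda k: prior[k])
        # Require a clear margin before declaring that class roles switched
        if candidate != majority and prior[candidate] > prior[majority] + flip_margin:
            print(f"t={t}: class roles switched ({majority} -> {candidate})")
            majority = candidate
    return prior
```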

Consider the presence of concept drift. Our experiments showed that many skew-insensitive classifiers suffer due to their lackluster adaptation mechanisms. On the other hand, general-purpose classifiers can display surprisingly good performance in specific cases, showing the impact of recovery from concept drift. This allows us to recommend paying close attention to embedding an efficient concept drift adaptation mechanism into your method. Regardless of how robust your skew-insensitive mechanism is, it alone will not be sufficient to cope with the drifting nature of imbalanced data streams.

Check for instance-level difficulties. Instance-level difficulties in data streams pose significant challenges to most classifiers (Brzeziński et al., 2021). It is crucial to analyze your stream to understand whether such factors are present. We noticed that methods based on oversampling tend to handle instance-level difficulties particularly well. However, none of them can directly take advantage of such challenging instances to improve adaptation and robustness. Existing research suggests that incorporating such information while learning from imbalanced streams may be highly beneficial (Krawczyk & Skryjomski, 2017). Therefore, we recommend truly understanding the nature of the streams you are working with and focusing on how you can leverage this information to make your classifiers more robust. One heuristic for such analysis is sketched below.
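One practical check is the neighborhood-based typology of Napierala and Stefanowski (safe/borderline/rare/outlier), a static-data heuristic that can be applied per sliding window; the sketch below is our adaptation for such a window, not a mechanism taken from any of the evaluated stream classifiers.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def minority_types(X, y, minority_label, k=5):
    """Tag minority instances by how many of their k=5 nearest neighbors share
    their class (X, y: one window of the stream, as numpy arrays)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_label])   # query points are in X
    types = []
    for row in idx:
        same = int(np.sum(y[row[1:]] == minority_label))   # row[0] is the point itself
        types.append({5: "safe", 4: "safe", 3: "borderline",
                      2: "borderline", 1: "rare", 0: "outlier"}[same])
    return types
```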

Consider the number of classes. There is a significant difference between developing methods for binary and for multi-class imbalanced data streams. While some algorithms work well regardless of the number of classes (e.g., ROSE), others are very sensitive to it and their performance deteriorates significantly with an increase in the number of classes (e.g., CSARF). Multi-class data streams will require the development of dedicated resampling algorithms, just as in the static scenarios (Krawczyk et al., 2020). Existing resampling methods work well mainly in binary cases and do not translate well to a higher number of classes. Finally, most of the existing classifiers assume a fixed number of classes. This should be considered when dealing with emerging and disappearing classes, as existing classifiers need to be extended with dedicated mechanisms to handle this phenomenon (Masud et al., 2009, 2010a, b).

Think outside of the box. While data-level and algorithm-level solutions are the most popular approaches to handling class imbalance, there are other promising directions to explore. Instead of focusing on yet another online resampling method or cost-sensitive modification, explore alternative solutions. Our experiments showed the high promise behind low-dimensional representations for imbalanced data streams, as first explored by Korycki and Krawczyk (2021b). This is just the tip of the iceberg in developing novel techniques tailored to imbalanced data streams that do not follow the two most popular directions.

Use fair and holistic evaluation. New classifiers for imbalanced data streams should always be compared both with the popular methods (e.g., OOB or UOB) and with the most recently published and top performing ones (as of the time of this study, these include ROSE, CSARF, or OOB). It is important to use an established experimental setup and follow the best practices in this field. This paper provides reproducible code for the entire testbed, along with all examined classifiers and datasets. This is the first standardized approach for evaluating classifiers for imbalanced data streams. We recommend that future researchers simply plug their new methods into our framework to ensure a fair and holistic evaluation of newly proposed methods.

Do not neglect using general-purpose ensembles as reference. Our experiments showed that general-purpose ensembles can return surprisingly good performance for non-stationary imbalanced data streams, due to their well-designed drift adaptation mechanisms. Therefore, it is important to use them as a point of reference to see if the proposed skew-insensitive mechanism actually contributes significantly to the performance of a new classifier.

Use multiple performance metrics. There are many performance metrics for evaluating imbalanced data streams, including Kappa, G-Mean, PMAUC, WMAUC, and EWMAUC. Section 6.3 presented the different aspects these performance metrics assess and acknowledged the biases in individual metrics. We recommend using multiple metrics exhibiting complementary behavior rather than picking a single metric.

Ensure reproducible research. Reproducible research is the key towards the advancement of the machine learning community. If you want your method to have an impact, always provide the source code on GitHub and use popular frameworks such as MOA (Bifet et al., 2010b), River (Montiel et al., 2020), Stream-learn (Ksieniewicz & Zyblewski, 2022), and Scikit-multiflow (Montiel et al., 2018). This ensures that other researchers can use your classifier and that it can be easily embedded into existing frameworks for comparison with other methods.

One size does not fit all. This survey paper presents a very large experimental evaluation covering as many imbalanced data scenarios as possible in order to compare the existing methods in the state of the art. It is neither our intention nor realistic to require that every study from now on use the full set of benchmarks. Our goal is that future works can build on our recommendations and include those proposed benchmarks that are appropriate for each work, acknowledging that not all of them are necessary or suitable for all studies.

10 Open challenges and future directions

After formulating recommendations regarding the currently available algorithms, we will now present and discuss open challenges and future directions for learning from imbalanced data streams.

Informative and fast resampling. Our experimental study showed that current undersampling-based methods underperform for imbalanced data streams, especially when faced with multiple classes. There is a need to develop novel and informative undersampling approaches that can adapt to concept drift and efficiently tackle dynamic class imbalance, while preserving a desirably low computational complexity. Current informative oversampling methods are rooted in SMOTE, offering good improvements in predictive power at a high computational cost. We should develop novel oversampling methods that do not rely on nearest neighbor computation, thus reducing the computational complexity and alleviating the limitations of SMOTE (Krawczyk et al., 2020).
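As one speculative illustration of this direction (not an evaluated method), oversampling can avoid neighbor searches altogether, e.g., by jittering stored minority instances with feature-scaled Gaussian noise, which costs O(1) per synthetic instance instead of SMOTE's O(n) neighbor search.

```python
import numpy as np

rng = np.random.default_rng(7)

def gaussian_oversample(X_min, n_new, scale=0.1):
    """Neighbor-free oversampling sketch: jitter real minority instances with
    noise scaled by the per-feature standard deviation of the minority class."""
    base = X_min[rng.integers(len(X_min), size=n_new)]
    noise = rng.normal(0.0, scale * X_min.std(axis=0) + 1e-12, size=base.shape)
    return base + noise
```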

Proactive instead of reactive tackling of dynamic class imbalance. Existing methods focus on adaptation to both concept drift and dynamic class imbalance after the change has taken place. But is it possible to anticipate the change? Can we predict how the class imbalance will evolve over time and offer a proactive approach? This would significantly reduce the recovery time after changes in data streams and lead to more robust classifiers.

Improving boosting-based ensembles. We have discussed how existing boosting-based ensembles perform poorly for imbalanced data streams. Yet boosting is one of the most successful ensemble architectures and deserves a second chance. We hope that the weaknesses of boosting identified in this paper will help other researchers develop more suitable classifiers based on this architecture, capable of fast adaptation to changes and overcoming small sample size in minority classes.

Handling evolving number of classes. While we investigated the impact of the number of classes on imbalanced problems, we have not touched upon dynamic changes in class numbers (Masud et al., 2009, 2010a, b). In data stream scenarios, classes may emerge, disappear, and recur over time. An evolving number of classes combined with a dynamic imbalance ratio creates an extremely challenging scenario that requires new and flexible models capable of detecting and incorporating new classes into their structures, as well as forgetting outdated classes and remembering recurring ones (Masud et al., 2011, 2012; Al-Khateeb et al., 2012; Sun et al., 2016). We envision strong parallels with continual and lifelong learning approaches (Korycki & Krawczyk, 2021a).
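A starting point could be an expandable one-vs-rest wrapper that spawns a new binary learner the first time an unseen label arrives. The sketch below assumes River-style learners exposing learn_one/predict_proba_one and deliberately omits the forgetting and recurrence handling that a full solution would need.

```python
class EmergingClassWrapper:
    """Expandable one-vs-rest ensemble for streams with emerging classes."""

    def __init__(self, learner_factory):
        self.factory = learner_factory
        self.models = {}                      # class label -> binary learner

    def learn_one(self, x, y):
        if y not in self.models:              # first occurrence: add a model
            self.models[y] = self.factory()
        for label, model in self.models.items():
            model.learn_one(x, int(label == y))

    def predict_one(self, x):
        scores = {c: m.predict_proba_one(x).get(1, 0.0)
                  for c, m in self.models.items()}
        return max(scores, key=scores.get) if scores else None
```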

Fairness in imbalanced data streams. Algorithmic fairness is a subject of intense research (Iosifidis et al., 2021), aiming at creating non-biased classifiers that do not rely on protected attributes. Recent works suggest that algorithmic fairness and class imbalance are two sides of the same coin, as protected information is often carried by underrepresented, minority groups. Fairness in data stream mining could benefit from enhancing existing methods with skew-insensitive approaches, as both domains aim at countering bias in data.

Online skew-insensitive feature selection. We have noticed a superior performance of ensembles based on reduced feature subspaces, especially for difficult multi-class problems. While existing methods are based on randomized approaches, there is a need to develop efficient online feature selection methods that are insensitive to class imbalance. This would allow not only the creation of more compact classifiers and the filtering of irrelevant features, but also the elimination of features that increase bias towards the majority class. This could further be expanded into scenarios where the feature space size evolves over time.

Beyond binary and multi-class imbalanced data streams. Most of the existing research on imbalanced data streams focuses on binary and multi-class classification. However, multiple other tasks in data streams may be subject to data imbalance. Multi-label data is inherently imbalanced and calls for dedicated methods capable of handling multi-target outputs (Alberghini et al., 2022). Regression from streams is also frequently subject to imbalance in the form of rare values, as the frequencies of specific ground truths may evolve over time (Branco et al., 2017; Aminian et al., 2021). Finally, streaming time series also require dedicated resampling and skew-insensitive methods to facilitate robust predictions.

11 Conclusions

Summary. In this paper, we offered an exhaustive and informative experimental review of classification methods for imbalanced data streams. We designed a robust experimental framework, publicly available for reproducibility, to evaluate state-of-the-art classifiers in varied scenarios, to understand how each aspect of imbalanced data streams affects the performance of classifiers, and to provide a template for future researchers to evaluate their newly proposed classifiers against the state of the art. Using this framework, we performed an experimental comparison of 24 algorithms in multiple scenarios to analyze their behavior and discuss their performance trends and divergences. The classifiers were evaluated on 515 benchmarks with different difficulties, such as dynamic and static imbalance ratios, the presence or absence of concept drift, data-level difficulty factors, and real-world problems. All these settings were evaluated both in isolation and combined, in binary and multi-class scenarios, to gain insights into how they affect the underlying learning mechanisms of data stream classifiers. Throughout the experiments, we demonstrated which approaches work or do not work for each scenario; for example, undersampling techniques were undermined in multi-class scenarios, while dynamic ensemble methods such as ROSE performed well across many different settings, demonstrating robustness. Our proposed experimental framework gave us insights into all the classifiers and how they perform in different scenarios, so that future researchers can follow the same standard of evaluation when proposing classifiers for imbalanced data streams, achieving the most transparent and complete results possible.

Towards the future of reproducible research in data stream mining. We proposed a standardized and holistic framework for evaluating imbalanced data streams. We strongly believe that this is a crucial step towards unifying the community working in this domain, offering a flexible tool for long-time practitioners and an easy way to get started for newcomers. The guidelines and recommendations formulated in this paper should allow for more streamlined and effective improvement of existing algorithms and development of new solutions. Only as a community working together can we truly advance our understanding of data streams and design truly impactful, well-rounded, and thoroughly evaluated algorithms that will be used in both academia and industry.

We hope that our framework will begin to grow over time with new algorithms, problems, and benchmarks being added by the community. There are still many questions unanswered in this domain and many open challenges for the future. We look forward to discovering new knowledge together.