1 Introduction

With the increasing role of computers in our everyday life we obtain more and more potentially useful information that can aid us in tasks such as supporting key decisions, finding interesting patterns, and discovering anomalies. The human brain can hardly match the capacities of computers in processing all the available information, which brings new challenges for machine learning. Therefore, computers are trained for tasks such as recognizing Spam, helping with medical diagnosis, predicting electricity consumption, and discovering frauds.

Recently, machine learning has undergone a change in the focus and application of its techniques. More and more attention shifts from off-line mining of static data towards ubiquitous mining and real-time applications dealing with data represented by streams. Data streams refer to a process where instances arrive continuously and possibly infinitely over time. Therefore, we might not have all the data available at once and consequently the data require sequential access rather that random access. New approaches and stream extensions to off-line methods emerge in order to improve performance under constraints given by the stream environment. The approaches for streams need to possess special characteristics such as to be able to produce a model while scanning the data only once, the model must be available at any point of time, must be up-to-date, and all must be able to run under computational and memory constraints (Gama 2010).

A common behavior in streaming data is that it evolves over time with unknown dynamics. This evolving nature of data might change the functional mapping between attributes and classes, a phenomenon known as concept drift. This is an issue that up-to-date streaming learning system must take into account. Almost everything in this world changes and so can the concepts in data, whenever they are being observed for long enough time. The influence of seasonal change, economic change, health conditions or wear of machinery can be the cause of decreasing quality of the models built based on previous observations. Therefore, fast model adaptation is an advantage for decision learning in most real world problems.

In machine learning, and especially in the classification tasks, various types of decision trees are among the most popular tools that are widely used. They are hierarchical structures with decisions in nodes and labels in leaves. Not only do they provide very good predictive capabilities, but also evince high degree of interpretability due to their comprehensible visualization. Consequently, the paths from root of the tree to the leaves can be rewritten into set of unordered IF-THEN rules. Those rules capture the main characteristics of the decision problem and its relevant features. Moreover, each unordered rule can be handled independently of the others in the set. A common and straightforward approach is to generate decision rules from decision trees. Another way to obtain the rule set is to induce them directly from data, similarly as when learning the tree.

Indeed, the popularity, quality and interpretability of decision trees are preserved in the stream mining scenarios. Among the best known and most used models for data stream classification are the algorithms based on Hoeffding trees (Domingos and Hulten 2000) (HT). These incremental algorithms automatically adapt to concept drift just by expanding the tree. The adaptation of HT, however, is rather slow. Faster adaptation might be achieved by employing explicit drift detection methods (Gama et al. 2004; Baena-Garcia et al. 2006; Hinkley 1970). Nevertheless, using explicit detection usually requires rebuilding the current tree, which could be computationally expensive. The aforementioned rule sets did not attract much attention from the community in the context of stream mining as opposed to the trees. Yet they have the advantage of having individual rules that can be managed independently. Therefore, in decision rules the implicit adaptation feature of the trees remains. Moreover, the set of rules can be altered more easily. Instead of rebuilding the classifier from scratch or executing a complicated change of the structure in the tree, individual rules which are considered outdated can be simply removed. The explicit change detection of individual rules can provide much faster adaptation mechanisms to changes and can also serve as rule set pruning mechanism.

This article is based on three conference papers. The first two introduced the algorithm for learning decision rules from data streams (Gama and Kosina 2011) and proposed a modification to the algorithm (Kosina and Gama 2012b). The work presented a classifier with different evaluation criteria and it was redesigned to be able to learn rules for each class. These algorithms were not specifically designed for time changing data and were not tested under such conditions. The adaptation of the rule learning algorithm to a time changing data was the aim of the third paper (Kosina and Gama 2012a). This work summarizes previous versions, proposes another modification of the algorithm, and provides more extensive and in-depth evaluation.

The paper is organized as follows. Section 2 discusses the related work in rule learning and handling time changing data. The VFDR algorithm and its extensions are presented in Sect. 3. The proposed algorithms are evaluated and compared to other stream classification algorithms in Sect. 4. The lessons learned and future work are discussed in Sect. 5.

2 Related work

This section presents the related works in the stream classification, the rule learning algorithms and their different approaches to multi-class problems, and the last part focuses on drift detection in time-changing data.

2.1 Stream classification algorithms

One of the most influential algorithms for classification in data streams was proposed by Domingos and Hulten (2000). His work presented what is known as very fast decision tree (VFDT), an on-line version of a decision tree classifier appropriate for data stream processing. VFDT is learned by recursively replacing leaves with decision nodes. Each leaf stores the sufficient statistics about attribute-values. The sufficient statistics are required by a heuristic function that evaluates the merit of split-tests based on the attribute-values. When an example is available it traverses the tree from the root to a leaf evaluating the corresponding attribute at each node, and following the branch according to the attribute’s value of the example. When the example reaches a leaf the sufficient statistics are updated. The tree replaces leaves with test nodes when the Hoeffding bound condition is satisfied. The popularity of VFDT attracted a lot of attention and different extensions and improvements were proposed. Hulten et al. (2001) presented CVFDT, a decision tree learner for mining data streams with non-stationary distributions. CVFDT learns a model consistent with a sliding window of recent examples. When the concept is changing and a split that was previously selected would no longer be the best, the CVFDT algorithm starts learning an alternate subtree with a new best attribute as its root. The subtree replaces the original one when it becomes more accurate. The algorithm keeps the model up-to-date when there are large and frequent changes in a concept. Gama et al. (2003) introduced VFDTc algorithm that extended the base incremental tree learning algorithm in two ways. The first extension was to equip leaves with binary search trees in order to handle numerical attributes. The second improvement introduced the use, in the leaves of the tree, of Naive Bayes (NB) classifiers trained on the examples that fall into the leaf. This functionality significantly improves the predictive accuracy of the tree. The possible weakness that the statistics kept within the node using the binary tree can be relatively large when the examples have many unique values for the numerical attribute.

A recent system called IBLStreams proposed by Shaker and Hüllermeier (2012) introduces instance based learning classifier for data streams. Instance based learning is inherently incremental. This is a memory based (or lazy) approach to learning that consists of adding or removing instances from the model. IBLStream selects the case base (instances that form the classifier) based on temporal relevance, spatial relevance, and consistency. The idea is that the most recent examples are the most relevant; it aims to have more or less uniform coverage of the instance space, and to remove data that seem to be inconsistent with the current concept. New examples are added to the case base and the redundant (neighboring) examples are checked for removal. The neighborhood is given by \(k_{cand}\) nearest neighbors. The most recent are excluded from the candidates for removal, because it is difficult to distinguish between noise and examples of potentially new concept. The classifier used a drift detection technique (statistical process control, SPC, described later in this paper) computed over last 100 training instances to detect abrupt changes. If a change is detected IBLStream removes larger number of instances from the base. The extent of the change (%) is estimated by the difference of minimum classification error over last 100 training instances and the error over last 20 training instances. The examples to be removed are then chosen at random according to a distribution which is spatially uniform but temporally skewed. The advantage of IBLStream are the flexibility of the learner and the adaptation capabilities with very good accuracy. The weakness of the classifier can be the classification time and interpretability of the model.

2.2 Rule learning

A widely used strategy for learning rules consists of deriving rules from decision trees, as it is done in Quinlan (1993). Any decision tree can be easily transformed into a collection of rules. Each rule corresponds to the path from the root to a leaf, and there are as many rules as leaves. This process generates a set of rules with the same complexity as the decision tree. However, it has been shown that the antecedents of individual rules may contain irrelevant conditions. C4.5rules (Quinlan 1993) uses an optimization procedure to simplify conditions. Previous empirical studies (Quinlan 1993) have demonstrated that the set of rules is both simpler and more accurate than the initial tree. Frank and Witten (1998) present a method for generating rules from decision trees without using global optimization. The basic idea is to generate a decision tree in a breadth-first order, select the best rule, remove the examples covered by the rule and iteratively induce further rules for the remaining instances.

There are several algorithms in the literature for learning decision listsFootnote 1 (Rivest 1987; Clark and Niblett 1989; Cohen 1995; Domingos 1996; Weiss and Indurkhya 1998) with different capabilities and approaches to handle multi-class problems. CN2 (Clark and Niblett 1989) is one of the first rule learning systems. It uses a top-down, general to specific approach. CN2 evaluates every possible complex, i.e., conjunction of attribute tests (if-conditions) of a rule, based on information-theoretic entropy measure. This classifier is able to learn multiple class-problems. Each rule predicts the most common class of the examples covered by the rule. In its improved version, presented in Clark and Boswell (1991), the main loop of rule set algorithm learn a model (a rule set) for each class separately. In each iteration, the examples of one class are chosen as positive and all the others are labeled as negative examples. Only positive examples that satisfy a learned rule are removed from the training set for next iteration of the rule search. Similarly, RIPPER (Cohen 1995) decomposes the problem in one vs. all fashion. The classes are ordered in increasing order of their frequency in the training set. After learning rules that separate the minority class, covered examples are removed and the algorithm proceeds with the next class. This process stops when a last single class remains which is then assigned as default class. Another technique to reduce multi-class problem is pairwise classification or round-robin (Fürnkranz 2001). The idea is to learn a classifier for each pair of classes thus transforming the original \(c\)-class problem into \(\frac{c(c-1)}{2}\) two-class problems. All binary classifiers are used to classify test examples. The predictions are aggregated using uniform voting. The predictions of the classifiers learned on the true class of an example are expected to outweigh those that were trained to recognize examples of different classes. If rule learning algorithm is not class-symmetric (i.e., problem of discriminating class \(i\) from class \(j\) is different than discriminating \(j\) from \(i\)), the authors suggest double round robin approach. The classifiers are learned for both problems thus obtaining in total \(c(c-1)\) classifiers. The recent book (Fürnkranz et al. 2012) presents a comprehensive description of the most relevant issues and works in rule learning system.

Former incremental rule learners include STAGGER (Schlimmer and Granger 1986), the first system designed expressly for coping with concept drift, the FLORA family of algorithms (Widmer and Kubat 1996) with FLORA3 being the first system able to deal with recurring contexts, and the AQ-PM family (Maloof and Michalski 2004). All these systems are able to incorporate new information, but they need to maintain in memory previous examples and cannot deal with high-speed data streams. The first rule learner designed for processing data streams is the system Facil (Ferrer et al. 2005). Facil uses a bottom-up, specific to general, search strategy, similar to AQ-PM. The decision model consists of a set of rules and, for each rule, a set of positive and negative examples that define the borders of the rule. The core of this approach is that rules may be inconsistent by storing positive and negative examples which are very near one another (border examples). This approach is similar to the AQ11-PM system (Kolter and Maloof 2003; Maloof and Michalski 2004), which selects positive examples from the boundaries of its rules (hyper-rectangles) and stores them in memory. When new examples arrive, AQ11-PM combines them with those held in memory, applies the AQ11 algorithm to modify the current set of rules, and selects new positive examples from the corners, edges, or surfaces of such hyper-rectangles (extreme examples). The idea of creating new rules based on border examples is similar to the double induction approach introduced by Lindgren and Boström (2004). Instead of using frequency based or NB based decision, when more rules predict different labels for an example, they proposed to induce a new set of rules from the examples covered by the conflicting rules. Then the new rules are used to classify the examples that are covered by the conflicting rules. Facil uses a forgetting mechanism that can be either explicit or implicit. Explicit forgetting takes place when the examples are older than a user defined threshold. Implicit forgetting is performed by removing examples that are no longer relevant as they do not enforce any concept description boundary. Facil

2.3 Adaptive methods

The stream mining community has already introduced many different approaches to deal with the phenomenon of concept drift. Approaches such as sliding windows and example weights (Klinkenberg 2004) are widely used to maintain a classifier consistent to the most recent data. Bifet and Gavalda (2009) proposed the ADWIN algorithm, a detector and estimator which automatically adapts to current rate of change by keeping the window of recent examples of variable length. It employs Hoeffding bound to guarantee that the window has maximal length without a change inside the window. Other methods can explicitly detect change-points or small time-windows where the concept to learn has changed. A classifier can be equipped with such drift detection method, e.g., based on error rate (Gama et al. 2004) or distance between classification errors (Baena-Garcia et al. 2006), and forgetting mechanism.

As pointed out by Wang et al. (2003), a drawback of decision trees is that even a slight drift of the target function may trigger several changes in the model and severely compromise learning efficiency. On the other hand, ensemble methods avoid expensive revisions by weighting the members, but may run the risk of building unnecessary learners when virtual drifts are present in data. Bifet et al. (2009) presented two new decision tree ensemble methods: ADWIN bagging and adaptive-size Hoeffding tree bagging. The former extends on-line bagging (Oza and Russell 2001) with ADWIN change detector, which works as an estimator for the weights of the boosting method. The worst performing classifier is removed from the ensemble when change is detected and it is replaced by a new one. The latter uses HT of different maximum sizes since smaller trees adapt faster to changes and larger work better for long periods with little or no change.

3 Very fast decision rules algorithm

In this section we present the classification rule learning system for data streams—very fast decision rules VFDR and variants. The rule learning algorithm strives to provide a flexible classifier for data streams that would produce output easy to interpret and that could adapt quickly to changes in the underlying concept.

As in many other systems a rule in VFDR is an implication of the form \(A \Rightarrow C\). The \(A\) part of a rule is a conjunction of literals, that is, conditions based on attribute values. For numerical attributes, each literal is of the form \(X_i > v\), or \(X_i \le v\) for some feature \(X_i\) and some constant \(v\). For categorical attributes VFDR produces literals of the form \(X_i = v_j\) where \(v_j\) is a value in the domain of \(X_i\). The \(C\) part of a rule \(r\) is not a constant as in most of rule based systems, but the class is given by a function—either majority class or NB. This is the most distinctive feature of VFDR.

In the first part of this section we describe the base algorithm VFDR-Base. The following subsections describe modifications for multi-class problems and time evolving data streams.

3.1 The basic algorithm

The VFDR-Base algorithm is designed for high-speed data streams. It learns ordered or unordered rule sets. It needs only one scan of data and is able to provide any-time classifications.

3.1.1 Growing a set of rules

The algorithm begins with an empty rule set \((RS)\) and a \(default\; rule\;\{\}\rightarrow \mathcal{L}\), where \(\mathcal{L}\) is initialized to \(NULL\). \(\mathcal{L}\) is a data structure that contains information used to classify the test instances, and the sufficient statistics needed to expand the rule.

As already said, each learned rule \((r)\) is a conjunction of literals, that are conditions based on attribute values, and an \(\mathcal{L}_r\). If all the literals are true for a given example, then the example is said to be covered by the rule. The labeled examples covered by a rule \(r\) are used to update \(\mathcal{L}_r\). A rule is expanded with the literal that has the highest gain measure of the examples covered by the rule. \(\mathcal{L}_r\) accumulates the sufficient statistics to compute the gain measure of all possible literals. \(\mathcal{L}_r\) is a data structure that contains: an integer that stores the number of examples covered by the rule; a vector to compute \(p(c_k)\), i.e., the probability of observing examples of class \(c_k\); a matrix \(p(X_i=v_j|c_k)\) to compute the probability of observing value \(v_j\) of a nominal attribute \(X_i\) per class; and a binary search tree to compute the probability of observing values greater than \(v_j\) of continuous attribute \(X_i,\; p(X_i > v_j|c_k)\), per class. The information maintained in \(\mathcal{L}_r\) is similar to the sufficient statistics used by Gama et al. (2006).

The number of observations, after which a rule can be expanded or new rule can be induced, is determined by the Hoeffding bound. It guarantees that, with probability at least \(1-\delta \), the true mean of a random variable \(x\) with a range \(R\) will not differ from the sample mean of size \(N\) by more than:

$$\begin{aligned} \epsilon = \sqrt{\frac{R^{2}\ln (1/\delta )}{2N}}. \end{aligned}$$

It is not efficient to check for the sufficient number of examples with every incoming example, therefore this is done only after every \(N_{min}\) observations.

The set of rules \((RS)\) is learned in parallel as described in Algorithm 4. We consider two cases: learning ordered or unordered set of rules. In the former, every labeled example updates statistics of the first rule that covers it. In the latter, every labeled example updates statistics of all the rules that cover it. If a labeled example is not covered by any rule, the default rule is updated.

The expansion of a rule is done using Algorithm 2 that employs the aforementioned Hoeffding bound. For each attribute \(X_i\) the value of split evaluation function \(G\) is computed for each attribute value \(v_j\). The best and the second best literal merit named as \(g_{best}\) and \(g_{2best}\) respectively are used for the Hoeffding bound condition. If the best merit is better the second best with given confidence, i.e. satisfies condition \(g_{best}-g_{2best} > \epsilon \), the rule is expanded with condition \(X_a = v_j\) and the class of the rule is assigned according to the majority class of observations of \(X_a = v_j\).

figure a
figure b

3.1.2 Classification strategies

Assume that a rule \(r\) covers a test example. The example will be classified using the information in \(\mathcal{L}_r\) of that rule. The simplest strategy uses the distribution of the classes stored in \(\mathcal{L}_r\), and classify the example in the class with maximum estimated \(p(c_k)\). This strategy only uses the information about class distributions and does not look for the attribute-values; therefore it uses only a small part of the available information. In a more informed strategy, a test example is classified with the class that maximizes the posteriori probability given by Bayes rule assuming the independence of the attributes given the class. There is a simple motivation for this option. \(\mathcal {L}\) stores information about the distribution of the attributes given the class usually for hundreds or even thousands of examples, before expanding the rule and re-initializing the counters. NB takes into account not only the prior distribution of the classes, but also the conditional probabilities of the attribute-values given the class. The information available in each rule is therefore better utilized. Given the example \(\varvec{x}=(x_{1},\ldots ,x_{j})\) and applying Bayes theorem, we obtain:

$$\begin{aligned} P(c_{k}|\varvec{x}) \propto P(c_{k})\prod P(x_{j}|c_{k}). \end{aligned}$$

Using NB in VFDT like classification algorithms is a well-known technique since it was introduced in Gama et al. (2003). One of its greatest advantages is the boost in any-time learning property because even though the learned rule set might not be robust enough or the individual rules might not provide sufficient information for expert interpretation (not being specialized enough, i.e., having only one or few conditions), it may already be able of highly informed predictions based on NB classification.

There are different ways to determine and rank the predictions and their associated confidences from the rule set in case multiple rules fire. The set of rules learned by VFDR-Base can employ three different classification strategies as defined in PMML (Data Mining Group 2011): First Hit, Weighted Sum, and Weighted Max.

  • First Hit uses the first firing rule to determine the predicted class, and the confidence is the weight of that rule.

  • Weighted Sum calculates the total weight for each class by summing the weights for each firing rule. The prediction with the highest total weight is then selected. The confidence is the total weight of the winning class divided by the number of firing rules.

  • Weighted Max selects the firing rule with the highest weight. The confidence returned is the confidence of the selected rule.

If two firing rules have the same weight (Weighted Max), or two or more classes are assigned the same weight (Weighted Sum) the winner is chosen alphabetically. As in Clark and Boswell (1991), the ordered rules use the First Hit strategy, while the unordered rules use the Weighted Sum strategyFootnote 2.

3.2 One versus all rule learning

This modified version of the base VFDR-Base employs one vs. all strategy for multi-class learning (VFDR-OA). In this strategy, the examples of class \(c_k \in C\) are positive and \(\forall c_l \in C, c_l\ne c_k\) are negative. It considers a rule expansion for each class \(c \in C_r\), where \(C_r\) is the set of classes observed at rule \(r\). This version is able to produce more rules in one call of ExpandRule. The number of rules induced from one rule \(r\) in \(RS\) in such a call is at most \(|C_r|\).

The process to select new conditions for a rule works as follows. For each attribute \(X_i\) the value of gain function \(G = { Gain}(r',r)\) adopted from FOIL (Quinlan 1991) is computed. The change in gain between rule \(r\) and a candidate rule after adding a new condition \(r'\) is defined as:

$$\begin{aligned} Gain(r',r) = s \times \left( \log _2 \frac{N'_{+}}{N'} - \log _2 \frac{N_{+}}{N} \right) \end{aligned}$$

where \(N\) is the number of examples covered by \(r\) and \(N_{+}\) is the number of positive examples in them, \(N'_{+}\) and \(N'\) represent the same for \(r'\), and \(s\) is the number of true positives in \(r\) that are still true positives in \(r'\), which in this case corresponds to \(N'_{+}\).

We are interested only in positive gain, therefore we consider the minimum of the gain function as zero and the maximum for a given rule is \(N_{+} \times \left( -\log _2 \frac{N_{+}}{N} \right) \). We can then normalize the positive gain as:

$$\begin{aligned} GainNorm(r',r) = \frac{Gain(r',r)}{N_{+} \times \left( -\log _2 \frac{N_{+}}{N} \right) }. \end{aligned}$$

In the case of expanding a non-default rule, that is a rule which already contains conditions, the algorithm works as follows. The rule was induced for a certain class that was considered as positive. To keep this class of interest in the rule set, the class is maintained as positive for the next computation of the merit. More specifically the difference is that an ordered set, in Algorithm 7, considers only the positive class. The unordered rule set in Algorithm 8 additionally computes the merits of other classes considered as the positive and can expand the rule also with the best condition for such classes. The additional expansions are allowed only when the rule has already been expanded with the original class at that call of ExpandRule. The default rule does not have any previous class considered as positive thus the search for new rules proceeds until the merit for all possible classes are checked.

In other words, the process creates multiple rules for different classes marked as positive, but not necessarily for all the available classes in one call. The advantage is that the algorithm is able to learn more rules and more specialized (complex) rules from fewer examples. As a result it can achieve higher accuracy at the possible cost of producing larger sets.

The search for the best and second best values considers the value of GainNorm for a given class. If \(g_{best}^k\) is the true best gain measure, i.e., satisfies condition \(g_{best}^{c_k}-g_{2best}^{c_k} > \epsilon \) for a given class \(c_k\), the rule is expanded with condition \(X_a = v_j \Rightarrow c_k\).

figure c
figure d

3.2.1 Classifier with multiple sets

Another modification further develops the one vs. all approach in rule stream learning. While the previously described version computed the merit for each class in one vs. all fashion, the following learner creates a separate rule set for each of the classes in data. Consequently the learning and prediction phase reflect certain differences. This modification is denoted as VFDR-MS (multiple sets). The aim of these modifications is to have rules for all classes with less exhaustive expansion search.

Learning phase The classifier contains multiple rule sets, i.e., rule subsets and one default rule. For each class in data there is a subset characterized by it (subset class) The rule subset distinguishes between its subset class, which is considered as a positive class, and all the other classes that are considered as a negative class, i.e., each \(\mathcal{L}_r\) of a rule \(r\) in a subset contains information about a two-class problem. When a new example arrives, it updates statistics of rules that cover the example in all the subsets. If the class of the example is the same as the subset class the example is used as is. If the classes differ, the class of the example for learning such a rule is changed to negative class label. Therefore, a rule grows only when the statistics for the subset class indicate the expansion.

The default rule of the classifier keeps sufficient statistics covering all classes, i.e., \(\{\}\rightarrow \mathcal{L}\) stores summarized information about a multi-class problem. A new rule can be induced for multiple subsets at one time. More specifically, every time default rule observes \(N_{min}\) examples, the classifier computes the FOIL gain for each class as positive and the others as negative so that each subset can possibly obtain a new rule.

This approach represents a balanced strategy. In general, the rule set grows larger with fewer examples than the basic algorithm, but not as large as the VFDR-OA version.

Prediction phase The rule learner with multiple sets uses slightly modified approach to obtain predictions. Similarly to previous versions the individual rules from each of the rule subsets can use either majority class (MC) prediction or NB prediction. When the class predicted by (MC) or (NB) does not correspond to the subset class the weight of such firing rule is not considered that is it is treated as if the rule did not cover the example. In other words, if the class predicted by the individual rule is the negative class, it indicates that the rule is not appropriate to predict the class of a given example. The rule subset itself then provides prediction of its subset class supported by the weight given by the rules that covered the example. Subset weights are determined based on the three strategies for rule learning: First Hit, Weighted Max, or Weighted Sum. Then the final prediction of the classifier is selected by using the subset class with maximum weight. If none of the rules from any of the rule subsets provide weight, i.e., there was no rule that would cover the example and predicted the class of its subset, then default rule provides the prediction based on MC or NB.

3.3 Adaptive very fast decision rules

In this section we present an extension to VFDR that can be applied to any of the previously described versions of the classifier. The aim of the AVFDR (adaptive VFDR) extension is the adaptation of the rule learner to the phenomenon of concept drift.

3.3.1 Rule adaptation in the presence of drift

Data streams are characterized by their evolving nature. The ability to detect and react to concept drift is crucial when modeling continuous flows of data generated by processes with unknown dynamics.

Adaptation in rule models is facilitated due to their modularity. As opposed to decision trees the set of rules offer the possibility to remove individual rules without the need for rebuilding the entire model. This characteristic is very important for incremental learning from evolving data, because it allows faster adaptation of the model.

The previous algorithms were adapting only implicitly by inferring new rules and specializing the existing ones. AVFDR extension embeds an explicit drift detection into the learning process as presented in Algorithm 9. In addition to \(\mathcal{L}_r\), a drift detection mechanism tracks performance of each rule \(r\) during learning. The method employed in AVFDR is the SPC (Gama et al. 2004) (Statistical Process Control) described in Algorithm 10. With every labeled training example covered by a rule, the rule makes prediction and updates its error rate. SPC monitors the error rate and manages two registers during training: \(p_{min}\) and \(s_{min}\), where \(p\) is error rate and \(s\) is SD. Every time a new example \(i\) is covered by the rule those values are updated when

$$\begin{aligned} p_{i} + s_{i} < p_{min} + s_{min}. \end{aligned}$$

The learning process of a given rule can be in one of the following 3 states: In-Control, Out-of-Control, or in Warning. We follow the 3-sigma rule (Grant and Leavenworth 1996):

$$\begin{aligned} p_{i} + s_{i} \ge p_{min} + \alpha \times s_{min}, \end{aligned}$$

where \(\alpha = 2\) changes the state to Warning and \(\alpha = 3\) signals that the Out-of-Control state is reached.

In the warning state, the rule stops learning until the status of the rule becomes again In-Control. The default rule stores the examples during warning in a short-term memory. If the rule reaches Out-of-Control it implies that the performance of the particular rule has degraded significantly and can negatively influence the quality of predictions of the classifier therefore it is removed from the rule set. This control enables to keep the rule set up-to-date and prevent the rule set from excessive growth. The default rule’s statistic is re-initialized with the examples from short-term memory if it reaches the Out-of-Control state.

figure e
figure f

3.4 Illustrative example

In this section we present the functionalities of the VFDR on a simple example for better understanding of the whole process. The illustrative problem is an artificial dataset with class given by the logical rule \((A \wedge B) \vee (C \wedge D)\), where the attributes \(A,B,C,D \in \{true, false\}\) are randomly generated. The number of examples is 1,500.

The learned rule sets are presented in Tables 1 and 2 for unordered and ordered rule sets respectively. The different VFDR variants produce different rule sets due to the different search strategies for the best conditions.

Table 1 Unordered rule sets generated by different versions of VFDR
Table 2 Ordered rule sets generated by different versions of VFDR

From Table 1 we conclude that all the rules are consistent with the target concept. In this simple example, the basic approach differs only in one rule from VFDR-MS, although the rule was induced in slightly different way. VFDR-OA requires fewer examples to produce rule sets with larger coverage. Therefore, VFDR-OA on the same dataset generated larger rule set with more specialized rules, which is an advantage in more difficult tasks.

Table 2 presents ordered rule sets learned on the illustrative dataset. Note that when interpreting the ordered sets, special care needs to be taken because each rule implicitly includes the negation of the previous rules. All the sets are again consistent with the concept of learning data. VFDR-OA creates only two rules that describe the class True (T). The two rules are sufficient because the rest of the cases are treated by default rule as F. Nevertheless, VFDR-MS searches for rules explaining all classes and that is why there are the two rules for T, but also two rules for F, leaving nothing to be resolved by the default rule.

In order to provide an illustration of VFDR learning process, we examine the growth of a rule set over time. The example shown in Table 3 presents the unordered rules learned by VFDR-MS. The upper part of the table corresponds to the stationary dataset. Each column represents a state of the rule set during the learning process. The rows labeled with numbers represent a time stamp specified by the number of training examples processed by the learner at the point when the rule set evolved—it either added a new rule, specialized existing one or removed a rule. At each presented point, new rules or rules that added a literal at that point are highlighted in bold. Note that the learner creates one rule for each class. Since the data set represents a two class problem and the attributes have only two values, new rules are added to both rule subsets and have the opposite values.

Table 3 Illustrative example of the rule set growth using VFDR-MS

In our illustrative example we can proceed to a non-stationary data test. In order to illustrate the learning process on a non-stationary dataset we extend the learning dataset used above by adding another 1,500 examples for which the rule defining the target concept is \((A \wedge \lnot B) \vee (C \wedge D)\). The evolution of the rule set after the first stationary 1,500 examples corresponds to the bottom part of the Table 3. When the drift appears, we can observe the process of reaction to the drift by removing incorrect rules and inducing new ones. Note that the rules describing the part of the space unaffected by the drift are kept in the rule set. Therefore, the rule set learner reveals the local changes. The comparison of prequential error-rate in Fig. 1 illustrates the faster adaptation of AVFDR even in this simple example, where the change in data is easily handled also by further growth of the rule set.

Fig. 1
figure 1

Prequential error of VFDR-MS and AVDFR-MS classifiers in illustrative dataset. Adaptive version reacts faster to a change

4 Experimental evaluation

In this section we analyze the performance of the proposed rule learning algorithms for data streams. We would like to answer following research questions:

  • What is the behavior of the VFDR variants?

  • How does the rule algorithm perform compared to stream classification algorithms on stationary datasets?

  • What is the performance on non-stationary datasets compared to the data stream classifiers?

Since there are many combinations of the various versions, we first analyze the behavior of the different versions of VFDR. We chose one version as the best representative of the group and compare it to the streaming classifiers in further experiments.

Next, we compare the selected version of decision rules from Sect. 3, with and without explicit drift detection, against standard stream classifiers: a decision tree classifier VFDTc (Gama et al. 2003), a NB, and an instance based classifier IBLStream (IBLS), recently proposed by Shaker and Hüllermeier (2012). These experiments answer the research questions.

The following subsections provide information about experimental methodology and results from the experiments.

4.1 Experimental methodology

We will present two sets of experiments. The first set of experiments aim to select the best representative from the VFDR algorithms. The main focus is on the second set of experiments, where the selected rule learning algorithm is compared to standard streaming algorithms.

4.1.1 Datasets

In order to compare the classifiers we use large scale artificial and real world datasets. The real world datasets were previously used in other works when testing on-line learning algorithms as they represent large datasets and may contain drifts. The main characteristics of the datasets are summarized in Table 4. A more detailed description of the datasets appears in the Appendix 1. The artificial datasets can be generated with or without a concept drift; therefore they can be used for evaluation on stationary and non-stationary data.

Table 4 Overview of the datasets

4.1.2 Algorithms

We use the notation VFDR and AVFDR for the version without drift detection and adaptive version respectively. The suffixes -Base, -OA, and -MS refer to the base version, the one vs. all and multiple class versions, and suffixes -UN and -OR refer to unordered and ordered respectively. VFDTc+SPC and NB+SPC stand for the decision tree and NB classifiers with SPC drift detection method respectively. All algorithms were implemented in Java as an extension for KNIME (Berthold et al. 2009) except for IBLS, which is implemented in MOA framework. The common parameters for VFDR and VFDTc are confidence \(\delta = 0.000001\), tie breaking constant \(\tau = 0.05,\; N_{min} = 200\) and the functional leaves use NB after observing at least one example. These parameters are selected based on default values in MOA. IBLS is used with default parameters except for using AdaptK parameter.

4.1.3 Performance metrics

The evaluation method in the experiments is the predictive sequential (prequential) method (Gama et al. 2009). In prequential evaluation whenever a training example is available, the classifier first makes a prediction. After, the prediction is compared to the class of the example and the error-rate is updated. Finally, the classifier is trained with the example. We report the error and SD in the tables. The best results are presented in bold.

The significance of the observed differences is tested with Friedman (1937, 1940) test to compare multiple classifiers on multiple datasets based on average ranks as suggested by Demšar (2006). When the null hypothesis is rejected, we use the post-hoc Nemenyi test (Nemenyi 1963). We also use two-tailed paired t test to demonstrate differences between classifiers with and without drift detection.

The size of the classifiers is presented in a following way. Each rule is closely related to a leaf in the decision tree, because they contain almost the same information and the same functionality. Thus the size of the classifiers can be compared by considering the number of rules in a set and number of leaves in a tree.

We compare the learning times of the classifiers in prequential evaluation in CPU time seconds running on Intel i5-3210M 3.1GHz DC, 8GB DDR3, Ubuntu 13.04.

4.2 Comparison of VFDR algorithms

4.2.1 Stationary data

First, we focus on comparing the different modifications of VFDR classifiers as well as their respective ordered and unordered versions. The goal is to select the most promising classifier so that it is used in further comparisons. Table 5 presents the prequential error rates of the classifiers tested on the artificial stationary datasets. Comparing the respective variants of ordered and unordered classifiers using paired t test we obtain that the unordered is significantly better on three of the five datasets in the base version of the algorithm (VFDR-Base-UN) and significantly worse in one case. The VFDR-MS-UN version is also better in three cases, but worse in the remaining two. The unordered VFDR-OA-UN is significantly better in all five datasets and also achieves the best average results from all the variants on all the datasets. We can conclude that the unordered rule sets generally achieve better results and furthermore, they are considered more flexible and more interpretable. Since they are not dependent on other rules in the set, they are better suited for the adaptive extension of the rule learning algorithm. Comparing the average ranks with Friedman test, we obtain \(\chi ^2_F = 13.11\) and \(F_F = 4.41\) with critical value \(F(5,20)=2.711\) and so we reject the null hypothesis. The result of post-hoc Nemenyi test with critical distance \(CD =3.372\) is that VFDR-OA-UN is significantly better than VFDR-OA-OR, VFDR-Base-UN, and VFDR-Base-OR.

Table 5 The prequential error of the ordered and unordered VFDR classifiers on stationary data

The size of the ordered classifiers is smaller than the size of unordered sets in Table 6. The differences between VFDR-Base-UN and VFDR-Base-OR are smaller than in the case of VFDR-MS, VFDR-OA, which is expected since both create multiple rules in their respective unordered variants.

Table 6 Number of rules of the ordered and unordered VFDR classifiers on stationary data

The learning times, Table 7, in the prequential evaluation of ordered rule set classifiers are smaller because when learning or predicting, the classifiers generally execute less tests whether a rule covers an example. Firstly they usually have smaller sets and secondly once a rule that covers the example is found, the search stops.

Table 7 The learning times in seconds in prequential evaluation of the ordered and unordered VFDR classifiers on stationary data

4.2.2 Non-stationary data

The main interest when using streaming algorithm is in the situation when data is non-stationary. We focus on the unordered rule sets and their adaptive variants. Table 8 presents the prequential error of the classifiers. VFDR-OA-UN and AVFDR-OA-UN variants achieve the best average results. The average rank comparison demonstrates that for \(\alpha =0.05\), the \(\chi ^2_F = 20.77\) and \(F_F = 19.64\) with critical value \(F(5,20)=2.711\), AVFDR-OA-UN is better than AVFDR-MS-UN and VFDR-MS-UN. With the critical distance \(CD=3.372\) for the post-hoc test VFDR-OA-UN is also better than AVFDR-MS-UN.

Table 8 The prequential error of the AVFDR classifiers on non-stationary data

The adaptive versions remove rules thus it is expected that the sizes of the adaptive classifiers are smaller than the sizes of the classifiers without drift detection as can be observed in Table 9. Consequently the classifiers with drift detection are faster in the prequential evaluation (Table 10).

Table 9 Number of rules of the AVFDR classifiers on non-stationary data
Table 10 The learning times in seconds in prequential evaluation of the AVFDR classifiers on non-stationary data

Based on the results from this and the previous section, we select AVFDR-OA-UN and VFDR-OA-UN classifiers as the most promising classifiers and compare them to other streaming classifiers.

4.3 Comparison between streaming classification algorithms

In this section, we compare the VFDR-OA-UN and AVFDR-OA-UN classifiers with standard streaming classification algorithms. In the first part we evaluate the performance on stationary datasets. The main focus of the proposed classifiers is non-stationary data, which is evaluated in the second part of this section.

4.3.1 Stationary data

The Table 11 presents results from five runs on the artificial data sets generated without drift. Among the classifiers which do not employ any explicit drift detection method, VFDR-OA-UN ranks as the top classifier. Moreover, comparing the VFDTc and VFDR-OA-UN algorithms, which have similar learning approach, VFDR-OA-UN exhibits better performance. We notice that NB can learn the hyperplane concept much better than the other classifiers.

Table 11 Prequential error of classifiers without drift detection on stationary data

In the next experiment we compare the algorithms equipped with a drift detection method. Because the datasets do not incorporate drifts, the results of the classifiers with drift detection method are very similar to those without it. The small differences are caused mostly by incorrect drift detections in some cases. Note that RBF is devised so that it is not easy to capture the concept with a decision tree model. Therefore, neither VFDTc nor VFDR-OA-UN achieves the performance of IBLS. On average VFDR-OA-UN ranked the best followed by IBLS, however they did not rank significantly better than the other algorithms.

We anticipate that the real world data contain concept drift therefore the order of examples is important. For illustration, we can shuffle the order of examples using random seeds. It is expected that the possible concept drift is removed and the performance of classifiers with and without drift detection is not different. The tables with results can be found in Appendix 2. The p values of the respective paired t tests of classifiers with and without drift detection demonstrates that for \(\alpha = 0.05\), the only significant difference is on bank data and intrusion. The performance of AVFDR-OA-UN compared to VFDR-OA-UN generally does not differ on stationary datasets and real world datasets. The main difference is in the number of rules the two algorithms produce. VFDR-OA-UN produces larger models than VFDR-OA-UN. While the number of rules in AVFDR-OA-UN still exceeds the number of leaves in VFDTc in many cases, the size of the model is reduced by removing the rules with decreasing performance.

4.3.2 Evaluation on non-stationary and real world data

Most of the current problems concerning mining data streams are treated as data with a non-stationary distribution, because if a phenomenon is observed for long enough time, it is highly probable that some change will appear. Therefore it is important to analyze the behavior under such conditions. Apart from the artificial data generated with drifts, this section includes large real world data, in which the presence of concept drift is not known but can be expected. We separately analyze the results obtained from the experiments on artificial datasets and on real world data as they represent more challenging problems.

Examining the results on the artificial data from the first part of Table 12 we can observe that VFDR-OA-UN performs very well since it adapts very fast even without the drift detection. Also in the real world datasets, the VFDR-OA-UN outperforms NB, and achieves slightly better results than VFDTc.

Table 12 Prequential error rates of classifiers on non-stationary and real-world datasets

The right part of Table 12 presents results of classifiers with drift detection. The upper part shows that VFDR-OA-UN has the best average rank, but the IBLS algorithm also works well on the artificial datasets. Again the most noticeable difference is in RBF datasets. Comparing the average ranks of classifiers with drift detection on artificial datasets we obtain \(\chi ^2_F = 1.32\) and \(F_F = 1.32\), while the critical value for \(\alpha = 0.05\) is 3.49. Therefore the null-hypothesis is not rejected.

The variability of real world datasets shows that each of the classifiers for time changing data has different strengths and weaknesses. Overall the AVFDR-OA-UN achieves the best average rank. The IBLS achieves best result in forestCovtype. The NB+SPC and VDFTc+SPC perform similarly since often the tree does not grow very much before the drift is detected thus the NB in the the root (functional leaf) is responsible for the prediction for many examples. The statistical tests for comparison of classifiers with drift detection on real world data achieve value of \(\chi ^2_F = 10.05\) and \(F_F = 5.04\) with critical value \(F(3,21)=3.072\) and so we reject the null hypothesis. The post-hoc Nemenyi test with critical distance \(CD=1.658\) suggests that AVFDR-OA-UN is significantly better than IBLS.

We can analyse the differences between AVFDR-OA-UN and VFDR-OA-UN. The artificial datasets can be randomly generated to contain drift and therefore provide a suitable data for the analysis. Using t test we get the p value \(= 0.048\) for hyperplane datasets, which signals that there is a statistical evidence for rejecting the null-hypothesis that they have the same accuracy. The p value for SEA datasets is \(0.0002\) and we can reject the null-hypothesis, and also in LED, where the p value \(= 0.0099\) The null-hypothesis is not rejected in waveform datasets because the p value \(= 0.2573\). Finally, the p value in RBF is 0.0002 and we can reject the null-hypothesis. The conclusion is that AVFDR-OA-UN mostly performs better or the same in the presence of concept drift with the benefit of producing less rules and learning faster.

Testing the differences between the two algorithms on the real datasets, we get mean difference of 0.9363 with SD of 4.2445. The p value for the t test is 0.5525, therefore we cannot reject that the two classifiers achieve the same error. The advantages of AVFDR-OA-UN are the smaller classifier models (in term s of number of rules) and faster learning times.

In order to examine the robustness to noise we introduced various levels of artificial noise in the hyperplane dataset. The hyperplane is chosen for being numerical and not being as ‘simple’ as SEA dataset, and the generator itself provides the option to choose arbitrary noise level. In the Table 13 we observe that increasing the noise level does not have a strong influence on the number of the rules neither in VFDR-OA-UN nor in AVFDR-OA-UN. The noise also did not deteriorate the performance of the AVFDR-OA-UN compared to VFDR-OA-UN and the number of removed rules does not increase with the increasing noise. Figure 2 shows the influence of varying noise in hyperplane dataset on the accuracy of the classifiers. The robustness to noise is very similar for VFDTc+SPC and AVFDR-OA-UN algorithms. It is confirmed by the comparison of the two algorithms on three artificial datasets with varying noise plotted in Fig. 3.

Table 13 Hyperplane varying the noise level
Fig. 2
figure 2

Varying noise in hyperplane dataset

Fig. 3
figure 3

Comparison of noise robustness of AVFDT-OA-UN and VFDTc+SPC

Since AVFDRs adapt to changes within their rules the detections are more frequent than in classifiers tracking the changes in the performance of the whole classifier. Thus the change in data can affect many rules. All those rules that model the feature space under change. In order to have the idea of the process, we plot bars for each drift detected by AVFDRs and standard stream classifiers with SPC.

In the SEA dataset it is known that the change occurs exactly after every 15,000 examples and therefore we can examine the behavior of classifiers with drift detection mechanisms on Fig. 4. NB+SPC reacts exactly to each of the drifts with detection delay depending on the severity of the drift. AVFDR-MS-UN did not detect any change when first drift occurred, but it reacted even faster than NB+SPC for the next two changes. AVFDR-OA-UN induces more rules and thus it is expected there would be more changes detected (more rules influenced by the change). This is confirmed by the experiment. The closer to the change point the more frequently the drift detections in the rules occur. Some detections appear also later (some rules cover smaller space and are updated infrequently), nevertheless the occurrences are sparser.

Fig. 4
figure 4

Drift detections in SEA dataset

Table 14 presents the number of rules of unordered sets and the number of leaves of the tree (the average number in case of the artificial datasets). As previously mentioned, VFDR-OA-UN generally produces large number of rules. In most of the cases the number is higher than number of leaves of VFDTc. We can observe that the highest number of rules appears in large sets with many classes since the algorithm allows the rules to expand for each class. The obvious benefit of using AVFDR-OA-UN is in the noticeable reduction of the size. With the exception of artificial LED dataset, the size of the AVFDR-OA-UN is not much different from the tree consequently the memory consumption is very similar. Moreover, it is possible to set the maximum number of rules in the set and choose an appropriate strategy for the rule replacement in order to limit the requirements for computational resources or limit the maximum number of classes for which one rule can expand.

Table 14 Number of rules of rule classifiers and leaves of VFDTc on non-stationary data

Another dimension of analyses is the learning times. Table 15 presents the learning times (in CPU time seconds) of prequential evaluation of the classifiers. The IBLS algorithm has the highest learning times with prequential evaluation, but note that it is implemented in a different framework.

Table 15 Learning times of classifiers in seconds

We can conclude that the times of VFDR-OA-UN are higher than the learning times of VFDTc and NB, but still sufficient for processing thousands of examples. The time complexity in comparison to trees increases with the number of rules which need to be checked if they cover the train/test example. The highest running times are on datasets with many classes, consequently the classifier generates many rules for each of the classes. It is desired to keep the rule set reasonably small so that the increase is negligible. In our case it is done by the use of the adaptive version. We can clearly observe that the advantage of AVFDR-OA-UN is that it removes potentially incorrect rules thus keeps the set smaller and consequently the learning and prediction times are reduced. Other possibility could be to have the limit the maximum number of rules in the set.

5 Conclusions

In this paper we present the VFDR system, an on-line, any-time and one-pass algorithms for learning decision rules in data streams. We present a base version, and two modifications to the basic algorithm, which focus on learning specialized rules for each class. The extended algorithms exhibit better results compared to baseline algorithm. Moreover, we describe an adaptive extension for the rule learning classifier AVFDR. In this extension each rule is equipped with explicit drift detection method, which is able to react to changes in data. The detection technique serves not only as a faster adaptation method, but also as a rule pruning mechanism. It allows keeping high performance when learning from evolving data and reduces the size by reducing the number of rules of the classifier. The detection method can provide relevant information about the structural changes in the process generating data. The rule learning algorithms we present generate highly flexible models with comparable performance against the standard stream mining techniques. The decision rules presented in this paper offer an interesting, interpretable and modular alternative to well-known and widely used techniques, like decision trees.