Very fast decision rules for classification in data streams

Kosina, Petr; Gama, João

doi:10.1007/s10618-013-0340-z

Very fast decision rules for classification in data streams

Published: 03 December 2013

Volume 29, pages 168–202, (2015)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Very fast decision rules for classification in data streams

Download PDF

Petr Kosina^1,2 &
João Gama^1,3

1900 Accesses
31 Citations
1 Altmetric
Explore all metrics

Abstract

Data stream mining is the process of extracting knowledge structures from continuous, rapid data records. Many decision tasks can be formulated as stream mining problems and therefore many new algorithms for data streams are being proposed. Decision rules are one of the most interpretable and flexible models for predictive data mining. Nevertheless, few algorithms have been proposed in the literature to learn rule models for time-changing and high-speed flows of data. In this paper we present the very fast decision rules (VFDR) algorithm and discuss interesting extensions to the base version. All the proposed versions are one-pass and any-time algorithms. They work on-line and learn ordered or unordered rule sets. Algorithms designed to work with data streams should be able to detect changes and quickly adapt the decision model. In order to manage these situations we also present the adaptive extension (AVFDR) to detect changes in the process generating data and adapt the decision model. Detecting local drifts takes advantage of the modularity of the rule sets. In AVFDR, each individual rule monitors the evolution of performance metrics to detect concept drift. AVFDR prunes rules whenever a drift is signaled. This explicit change detection mechanism provides useful information about the dynamics of the process generating data, faster adaptation to changes and generates more compact rule sets. The experimental evaluation demonstrates that algorithms achieve competitive results in comparison to alternative methods and the adaptive methods are able to learn fast and compact rule sets from evolving streams.

eRules: A Modular Adaptive Classification Rule Learning Algorithm for Data Streams

RILL: Algorithm for Learning Rules from Streaming Data with Concept Drift

Computationally Efficient Rule-Based Classification for Continuous Streaming Data

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

With the increasing role of computers in our everyday life we obtain more and more potentially useful information that can aid us in tasks such as supporting key decisions, finding interesting patterns, and discovering anomalies. The human brain can hardly match the capacities of computers in processing all the available information, which brings new challenges for machine learning. Therefore, computers are trained for tasks such as recognizing Spam, helping with medical diagnosis, predicting electricity consumption, and discovering frauds.

Recently, machine learning has undergone a change in the focus and application of its techniques. More and more attention shifts from off-line mining of static data towards ubiquitous mining and real-time applications dealing with data represented by streams. Data streams refer to a process where instances arrive continuously and possibly infinitely over time. Therefore, we might not have all the data available at once and consequently the data require sequential access rather that random access. New approaches and stream extensions to off-line methods emerge in order to improve performance under constraints given by the stream environment. The approaches for streams need to possess special characteristics such as to be able to produce a model while scanning the data only once, the model must be available at any point of time, must be up-to-date, and all must be able to run under computational and memory constraints (Gama 2010).

A common behavior in streaming data is that it evolves over time with unknown dynamics. This evolving nature of data might change the functional mapping between attributes and classes, a phenomenon known as concept drift. This is an issue that up-to-date streaming learning system must take into account. Almost everything in this world changes and so can the concepts in data, whenever they are being observed for long enough time. The influence of seasonal change, economic change, health conditions or wear of machinery can be the cause of decreasing quality of the models built based on previous observations. Therefore, fast model adaptation is an advantage for decision learning in most real world problems.

In machine learning, and especially in the classification tasks, various types of decision trees are among the most popular tools that are widely used. They are hierarchical structures with decisions in nodes and labels in leaves. Not only do they provide very good predictive capabilities, but also evince high degree of interpretability due to their comprehensible visualization. Consequently, the paths from root of the tree to the leaves can be rewritten into set of unordered IF-THEN rules. Those rules capture the main characteristics of the decision problem and its relevant features. Moreover, each unordered rule can be handled independently of the others in the set. A common and straightforward approach is to generate decision rules from decision trees. Another way to obtain the rule set is to induce them directly from data, similarly as when learning the tree.

Indeed, the popularity, quality and interpretability of decision trees are preserved in the stream mining scenarios. Among the best known and most used models for data stream classification are the algorithms based on Hoeffding trees (Domingos and Hulten 2000) (HT). These incremental algorithms automatically adapt to concept drift just by expanding the tree. The adaptation of HT, however, is rather slow. Faster adaptation might be achieved by employing explicit drift detection methods (Gama et al. 2004; Baena-Garcia et al. 2006; Hinkley 1970). Nevertheless, using explicit detection usually requires rebuilding the current tree, which could be computationally expensive. The aforementioned rule sets did not attract much attention from the community in the context of stream mining as opposed to the trees. Yet they have the advantage of having individual rules that can be managed independently. Therefore, in decision rules the implicit adaptation feature of the trees remains. Moreover, the set of rules can be altered more easily. Instead of rebuilding the classifier from scratch or executing a complicated change of the structure in the tree, individual rules which are considered outdated can be simply removed. The explicit change detection of individual rules can provide much faster adaptation mechanisms to changes and can also serve as rule set pruning mechanism.

This article is based on three conference papers. The first two introduced the algorithm for learning decision rules from data streams (Gama and Kosina 2011) and proposed a modification to the algorithm (Kosina and Gama 2012b). The work presented a classifier with different evaluation criteria and it was redesigned to be able to learn rules for each class. These algorithms were not specifically designed for time changing data and were not tested under such conditions. The adaptation of the rule learning algorithm to a time changing data was the aim of the third paper (Kosina and Gama 2012a). This work summarizes previous versions, proposes another modification of the algorithm, and provides more extensive and in-depth evaluation.

The paper is organized as follows. Section 2 discusses the related work in rule learning and handling time changing data. The VFDR algorithm and its extensions are presented in Sect. 3. The proposed algorithms are evaluated and compared to other stream classification algorithms in Sect. 4. The lessons learned and future work are discussed in Sect. 5.

2 Related work

This section presents the related works in the stream classification, the rule learning algorithms and their different approaches to multi-class problems, and the last part focuses on drift detection in time-changing data.

2.1 Stream classification algorithms

One of the most influential algorithms for classification in data streams was proposed by Domingos and Hulten (2000). His work presented what is known as very fast decision tree (VFDT), an on-line version of a decision tree classifier appropriate for data stream processing. VFDT is learned by recursively replacing leaves with decision nodes. Each leaf stores the sufficient statistics about attribute-values. The sufficient statistics are required by a heuristic function that evaluates the merit of split-tests based on the attribute-values. When an example is available it traverses the tree from the root to a leaf evaluating the corresponding attribute at each node, and following the branch according to the attribute’s value of the example. When the example reaches a leaf the sufficient statistics are updated. The tree replaces leaves with test nodes when the Hoeffding bound condition is satisfied. The popularity of VFDT attracted a lot of attention and different extensions and improvements were proposed. Hulten et al. (2001) presented CVFDT, a decision tree learner for mining data streams with non-stationary distributions. CVFDT learns a model consistent with a sliding window of recent examples. When the concept is changing and a split that was previously selected would no longer be the best, the CVFDT algorithm starts learning an alternate subtree with a new best attribute as its root. The subtree replaces the original one when it becomes more accurate. The algorithm keeps the model up-to-date when there are large and frequent changes in a concept. Gama et al. (2003) introduced VFDTc algorithm that extended the base incremental tree learning algorithm in two ways. The first extension was to equip leaves with binary search trees in order to handle numerical attributes. The second improvement introduced the use, in the leaves of the tree, of Naive Bayes (NB) classifiers trained on the examples that fall into the leaf. This functionality significantly improves the predictive accuracy of the tree. The possible weakness that the statistics kept within the node using the binary tree can be relatively large when the examples have many unique values for the numerical attribute.

A recent system called IBLStreams proposed by Shaker and Hüllermeier (2012) introduces instance based learning classifier for data streams. Instance based learning is inherently incremental. This is a memory based (or lazy) approach to learning that consists of adding or removing instances from the model. IBLStream selects the case base (instances that form the classifier) based on temporal relevance, spatial relevance, and consistency. The idea is that the most recent examples are the most relevant; it aims to have more or less uniform coverage of the instance space, and to remove data that seem to be inconsistent with the current concept. New examples are added to the case base and the redundant (neighboring) examples are checked for removal. The neighborhood is given by $k_{cand}$ nearest neighbors. The most recent are excluded from the candidates for removal, because it is difficult to distinguish between noise and examples of potentially new concept. The classifier used a drift detection technique (statistical process control, SPC, described later in this paper) computed over last 100 training instances to detect abrupt changes. If a change is detected IBLStream removes larger number of instances from the base. The extent of the change (%) is estimated by the difference of minimum classification error over last 100 training instances and the error over last 20 training instances. The examples to be removed are then chosen at random according to a distribution which is spatially uniform but temporally skewed. The advantage of IBLStream are the flexibility of the learner and the adaptation capabilities with very good accuracy. The weakness of the classifier can be the classification time and interpretability of the model.

2.2 Rule learning

A widely used strategy for learning rules consists of deriving rules from decision trees, as it is done in Quinlan (1993). Any decision tree can be easily transformed into a collection of rules. Each rule corresponds to the path from the root to a leaf, and there are as many rules as leaves. This process generates a set of rules with the same complexity as the decision tree. However, it has been shown that the antecedents of individual rules may contain irrelevant conditions. C4.5rules (Quinlan 1993) uses an optimization procedure to simplify conditions. Previous empirical studies (Quinlan 1993) have demonstrated that the set of rules is both simpler and more accurate than the initial tree. Frank and Witten (1998) present a method for generating rules from decision trees without using global optimization. The basic idea is to generate a decision tree in a breadth-first order, select the best rule, remove the examples covered by the rule and iteratively induce further rules for the remaining instances.

There are several algorithms in the literature for learning decision lists^{Footnote 1} (Rivest 1987; Clark and Niblett 1989; Cohen 1995; Domingos 1996; Weiss and Indurkhya 1998) with different capabilities and approaches to handle multi-class problems. CN2 (Clark and Niblett 1989) is one of the first rule learning systems. It uses a top-down, general to specific approach. CN2 evaluates every possible complex, i.e., conjunction of attribute tests (if-conditions) of a rule, based on information-theoretic entropy measure. This classifier is able to learn multiple class-problems. Each rule predicts the most common class of the examples covered by the rule. In its improved version, presented in Clark and Boswell (1991), the main loop of rule set algorithm learn a model (a rule set) for each class separately. In each iteration, the examples of one class are chosen as positive and all the others are labeled as negative examples. Only positive examples that satisfy a learned rule are removed from the training set for next iteration of the rule search. Similarly, RIPPER (Cohen 1995) decomposes the problem in one vs. all fashion. The classes are ordered in increasing order of their frequency in the training set. After learning rules that separate the minority class, covered examples are removed and the algorithm proceeds with the next class. This process stops when a last single class remains which is then assigned as default class. Another technique to reduce multi-class problem is pairwise classification or round-robin (Fürnkranz 2001). The idea is to learn a classifier for each pair of classes thus transforming the original $c$-class problem into $\frac{c(c-1)}{2}$ two-class problems. All binary classifiers are used to classify test examples. The predictions are aggregated using uniform voting. The predictions of the classifiers learned on the true class of an example are expected to outweigh those that were trained to recognize examples of different classes. If rule learning algorithm is not class-symmetric (i.e., problem of discriminating class $i$ from class $j$ is different than discriminating $j$ from $i$), the authors suggest double round robin approach. The classifiers are learned for both problems thus obtaining in total $c(c-1)$ classifiers. The recent book (Fürnkranz et al. 2012) presents a comprehensive description of the most relevant issues and works in rule learning system.

Former incremental rule learners include STAGGER (Schlimmer and Granger 1986), the first system designed expressly for coping with concept drift, the FLORA family of algorithms (Widmer and Kubat 1996) with FLORA3 being the first system able to deal with recurring contexts, and the AQ-PM family (Maloof and Michalski 2004). All these systems are able to incorporate new information, but they need to maintain in memory previous examples and cannot deal with high-speed data streams. The first rule learner designed for processing data streams is the system Facil (Ferrer et al. 2005). Facil uses a bottom-up, specific to general, search strategy, similar to AQ-PM. The decision model consists of a set of rules and, for each rule, a set of positive and negative examples that define the borders of the rule. The core of this approach is that rules may be inconsistent by storing positive and negative examples which are very near one another (border examples). This approach is similar to the AQ11-PM system (Kolter and Maloof 2003; Maloof and Michalski 2004), which selects positive examples from the boundaries of its rules (hyper-rectangles) and stores them in memory. When new examples arrive, AQ11-PM combines them with those held in memory, applies the AQ11 algorithm to modify the current set of rules, and selects new positive examples from the corners, edges, or surfaces of such hyper-rectangles (extreme examples). The idea of creating new rules based on border examples is similar to the double induction approach introduced by Lindgren and Boström (2004). Instead of using frequency based or NB based decision, when more rules predict different labels for an example, they proposed to induce a new set of rules from the examples covered by the conflicting rules. Then the new rules are used to classify the examples that are covered by the conflicting rules. Facil uses a forgetting mechanism that can be either explicit or implicit. Explicit forgetting takes place when the examples are older than a user defined threshold. Implicit forgetting is performed by removing examples that are no longer relevant as they do not enforce any concept description boundary. Facil

2.3 Adaptive methods

The stream mining community has already introduced many different approaches to deal with the phenomenon of concept drift. Approaches such as sliding windows and example weights (Klinkenberg 2004) are widely used to maintain a classifier consistent to the most recent data. Bifet and Gavalda (2009) proposed the ADWIN algorithm, a detector and estimator which automatically adapts to current rate of change by keeping the window of recent examples of variable length. It employs Hoeffding bound to guarantee that the window has maximal length without a change inside the window. Other methods can explicitly detect change-points or small time-windows where the concept to learn has changed. A classifier can be equipped with such drift detection method, e.g., based on error rate (Gama et al. 2004) or distance between classification errors (Baena-Garcia et al. 2006), and forgetting mechanism.

As pointed out by Wang et al. (2003), a drawback of decision trees is that even a slight drift of the target function may trigger several changes in the model and severely compromise learning efficiency. On the other hand, ensemble methods avoid expensive revisions by weighting the members, but may run the risk of building unnecessary learners when virtual drifts are present in data. Bifet et al. (2009) presented two new decision tree ensemble methods: ADWIN bagging and adaptive-size Hoeffding tree bagging. The former extends on-line bagging (Oza and Russell 2001) with ADWIN change detector, which works as an estimator for the weights of the boosting method. The worst performing classifier is removed from the ensemble when change is detected and it is replaced by a new one. The latter uses HT of different maximum sizes since smaller trees adapt faster to changes and larger work better for long periods with little or no change.

3 Very fast decision rules algorithm

In this section we present the classification rule learning system for data streams—very fast decision rules VFDR and variants. The rule learning algorithm strives to provide a flexible classifier for data streams that would produce output easy to interpret and that could adapt quickly to changes in the underlying concept.

As in many other systems a rule in VFDR is an implication of the form $A \Rightarrow C$. The $A$ part of a rule is a conjunction of literals, that is, conditions based on attribute values. For numerical attributes, each literal is of the form $X_i > v$, or $X_i \le v$ for some feature $X_i$ and some constant $v$. For categorical attributes VFDR produces literals of the form $X_i = v_j$ where $v_j$ is a value in the domain of $X_i$. The $C$ part of a rule $r$ is not a constant as in most of rule based systems, but the class is given by a function—either majority class or NB. This is the most distinctive feature of VFDR.

In the first part of this section we describe the base algorithm VFDR-Base. The following subsections describe modifications for multi-class problems and time evolving data streams.

3.1 The basic algorithm

The VFDR-Base algorithm is designed for high-speed data streams. It learns ordered or unordered rule sets. It needs only one scan of data and is able to provide any-time classifications.

3.1.1 Growing a set of rules

The algorithm begins with an empty rule set $(RS)$ and a $default\; rule\;\{\}\rightarrow \mathcal{L}$, where $\mathcal{L}$ is initialized to $NULL$. $\mathcal{L}$ is a data structure that contains information used to classify the test instances, and the sufficient statistics needed to expand the rule.

As already said, each learned rule $(r)$ is a conjunction of literals, that are conditions based on attribute values, and an $\mathcal{L}_r$. If all the literals are true for a given example, then the example is said to be covered by the rule. The labeled examples covered by a rule $r$ are used to update $\mathcal{L}_r$. A rule is expanded with the literal that has the highest gain measure of the examples covered by the rule. $\mathcal{L}_r$ accumulates the sufficient statistics to compute the gain measure of all possible literals. $\mathcal{L}_r$ is a data structure that contains: an integer that stores the number of examples covered by the rule; a vector to compute $p(c_k)$, i.e., the probability of observing examples of class $c_k$; a matrix $p(X_i=v_j|c_k)$ to compute the probability of observing value $v_j$ of a nominal attribute $X_i$ per class; and a binary search tree to compute the probability of observing values greater than $v_j$ of continuous attribute $X_i,\; p(X_i > v_j|c_k)$, per class. The information maintained in $\mathcal{L}_r$ is similar to the sufficient statistics used by Gama et al. (2006).

The number of observations, after which a rule can be expanded or new rule can be induced, is determined by the Hoeffding bound. It guarantees that, with probability at least $1-\delta $, the true mean of a random variable $x$ with a range $R$ will not differ from the sample mean of size $N$ by more than:

$$\begin{aligned} \epsilon = \sqrt{\frac{R^{2}\ln (1/\delta )}{2N}}. \end{aligned}$$

It is not efficient to check for the sufficient number of examples with every incoming example, therefore this is done only after every $N_{min}$ observations.

The set of rules $(RS)$ is learned in parallel as described in Algorithm 4. We consider two cases: learning ordered or unordered set of rules. In the former, every labeled example updates statistics of the first rule that covers it. In the latter, every labeled example updates statistics of all the rules that cover it. If a labeled example is not covered by any rule, the default rule is updated.

The expansion of a rule is done using Algorithm 2 that employs the aforementioned Hoeffding bound. For each attribute $X_i$ the value of split evaluation function $G$ is computed for each attribute value $v_j$. The best and the second best literal merit named as $g_{best}$ and $g_{2best}$ respectively are used for the Hoeffding bound condition. If the best merit is better the second best with given confidence, i.e. satisfies condition $g_{best}-g_{2best} > \epsilon $, the rule is expanded with condition $X_a = v_j$ and the class of the rule is assigned according to the majority class of observations of $X_a = v_j$.

3.1.2 Classification strategies

Assume that a rule $r$ covers a test example. The example will be classified using the information in $\mathcal{L}_r$ of that rule. The simplest strategy uses the distribution of the classes stored in $\mathcal{L}_r$, and classify the example in the class with maximum estimated $p(c_k)$. This strategy only uses the information about class distributions and does not look for the attribute-values; therefore it uses only a small part of the available information. In a more informed strategy, a test example is classified with the class that maximizes the posteriori probability given by Bayes rule assuming the independence of the attributes given the class. There is a simple motivation for this option. $\mathcal {L}$ stores information about the distribution of the attributes given the class usually for hundreds or even thousands of examples, before expanding the rule and re-initializing the counters. NB takes into account not only the prior distribution of the classes, but also the conditional probabilities of the attribute-values given the class. The information available in each rule is therefore better utilized. Given the example $\varvec{x}=(x_{1},\ldots ,x_{j})$ and applying Bayes theorem, we obtain:

$$\begin{aligned} P(c_{k}|\varvec{x}) \propto P(c_{k})\prod P(x_{j}|c_{k}). \end{aligned}$$

Using NB in VFDT like classification algorithms is a well-known technique since it was introduced in Gama et al. (2003). One of its greatest advantages is the boost in any-time learning property because even though the learned rule set might not be robust enough or the individual rules might not provide sufficient information for expert interpretation (not being specialized enough, i.e., having only one or few conditions), it may already be able of highly informed predictions based on NB classification.

There are different ways to determine and rank the predictions and their associated confidences from the rule set in case multiple rules fire. The set of rules learned by VFDR-Base can employ three different classification strategies as defined in PMML (Data Mining Group 2011): First Hit, Weighted Sum, and Weighted Max.

First Hit uses the first firing rule to determine the predicted class, and the confidence is the weight of that rule.
Weighted Sum calculates the total weight for each class by summing the weights for each firing rule. The prediction with the highest total weight is then selected. The confidence is the total weight of the winning class divided by the number of firing rules.
Weighted Max selects the firing rule with the highest weight. The confidence returned is the confidence of the selected rule.

If two firing rules have the same weight (Weighted Max), or two or more classes are assigned the same weight (Weighted Sum) the winner is chosen alphabetically. As in Clark and Boswell (1991), the ordered rules use the First Hit strategy, while the unordered rules use the Weighted Sum strategy^{Footnote 2}.

3.2 One versus all rule learning

This modified version of the base VFDR-Base employs one vs. all strategy for multi-class learning (VFDR-OA). In this strategy, the examples of class $c_k \in C$ are positive and $\forall c_l \in C, c_l\ne c_k$ are negative. It considers a rule expansion for each class $c \in C_r$, where $C_r$ is the set of classes observed at rule $r$. This version is able to produce more rules in one call of ExpandRule. The number of rules induced from one rule $r$ in $RS$ in such a call is at most $|C_r|$.

The process to select new conditions for a rule works as follows. For each attribute $X_i$ the value of gain function $G = { Gain}(r',r)$ adopted from FOIL (Quinlan 1991) is computed. The change in gain between rule $r$ and a candidate rule after adding a new condition $r'$ is defined as:

$$\begin{aligned} Gain(r',r) = s \times \left( \log _2 \frac{N'_{+}}{N'} - \log _2 \frac{N_{+}}{N} \right) \end{aligned}$$

where $N$ is the number of examples covered by $r$ and $N_{+}$ is the number of positive examples in them, $N'_{+}$ and $N'$ represent the same for $r'$, and $s$ is the number of true positives in $r$ that are still true positives in $r'$, which in this case corresponds to $N'_{+}$.

We are interested only in positive gain, therefore we consider the minimum of the gain function as zero and the maximum for a given rule is $N_{+} \times \left( -\log _2 \frac{N_{+}}{N} \right) $. We can then normalize the positive gain as:

$$\begin{aligned} GainNorm(r',r) = \frac{Gain(r',r)}{N_{+} \times \left( -\log _2 \frac{N_{+}}{N} \right) }. \end{aligned}$$

In the case of expanding a non-default rule, that is a rule which already contains conditions, the algorithm works as follows. The rule was induced for a certain class that was considered as positive. To keep this class of interest in the rule set, the class is maintained as positive for the next computation of the merit. More specifically the difference is that an ordered set, in Algorithm 7, considers only the positive class. The unordered rule set in Algorithm 8 additionally computes the merits of other classes considered as the positive and can expand the rule also with the best condition for such classes. The additional expansions are allowed only when the rule has already been expanded with the original class at that call of ExpandRule. The default rule does not have any previous class considered as positive thus the search for new rules proceeds until the merit for all possible classes are checked.

In other words, the process creates multiple rules for different classes marked as positive, but not necessarily for all the available classes in one call. The advantage is that the algorithm is able to learn more rules and more specialized (complex) rules from fewer examples. As a result it can achieve higher accuracy at the possible cost of producing larger sets.

The search for the best and second best values considers the value of GainNorm for a given class. If $g_{best}^k$ is the true best gain measure, i.e., satisfies condition $g_{best}^{c_k}-g_{2best}^{c_k} > \epsilon $ for a given class $c_k$, the rule is expanded with condition $X_a = v_j \Rightarrow c_k$.

3.2.1 Classifier with multiple sets

Another modification further develops the one vs. all approach in rule stream learning. While the previously described version computed the merit for each class in one vs. all fashion, the following learner creates a separate rule set for each of the classes in data. Consequently the learning and prediction phase reflect certain differences. This modification is denoted as VFDR-MS (multiple sets). The aim of these modifications is to have rules for all classes with less exhaustive expansion search.

Learning phase The classifier contains multiple rule sets, i.e., rule subsets and one default rule. For each class in data there is a subset characterized by it (subset class) The rule subset distinguishes between its subset class, which is considered as a positive class, and all the other classes that are considered as a negative class, i.e., each $\mathcal{L}_r$ of a rule $r$ in a subset contains information about a two-class problem. When a new example arrives, it updates statistics of rules that cover the example in all the subsets. If the class of the example is the same as the subset class the example is used as is. If the classes differ, the class of the example for learning such a rule is changed to negative class label. Therefore, a rule grows only when the statistics for the subset class indicate the expansion.

The default rule of the classifier keeps sufficient statistics covering all classes, i.e., $\{\}\rightarrow \mathcal{L}$ stores summarized information about a multi-class problem. A new rule can be induced for multiple subsets at one time. More specifically, every time default rule observes $N_{min}$ examples, the classifier computes the FOIL gain for each class as positive and the others as negative so that each subset can possibly obtain a new rule.

This approach represents a balanced strategy. In general, the rule set grows larger with fewer examples than the basic algorithm, but not as large as the VFDR-OA version.

Prediction phase The rule learner with multiple sets uses slightly modified approach to obtain predictions. Similarly to previous versions the individual rules from each of the rule subsets can use either majority class (MC) prediction or NB prediction. When the class predicted by (MC) or (NB) does not correspond to the subset class the weight of such firing rule is not considered that is it is treated as if the rule did not cover the example. In other words, if the class predicted by the individual rule is the negative class, it indicates that the rule is not appropriate to predict the class of a given example. The rule subset itself then provides prediction of its subset class supported by the weight given by the rules that covered the example. Subset weights are determined based on the three strategies for rule learning: First Hit, Weighted Max, or Weighted Sum. Then the final prediction of the classifier is selected by using the subset class with maximum weight. If none of the rules from any of the rule subsets provide weight, i.e., there was no rule that would cover the example and predicted the class of its subset, then default rule provides the prediction based on MC or NB.

3.3 Adaptive very fast decision rules

In this section we present an extension to VFDR that can be applied to any of the previously described versions of the classifier. The aim of the AVFDR (adaptive VFDR) extension is the adaptation of the rule learner to the phenomenon of concept drift.

3.3.1 Rule adaptation in the presence of drift

Data streams are characterized by their evolving nature. The ability to detect and react to concept drift is crucial when modeling continuous flows of data generated by processes with unknown dynamics.

Adaptation in rule models is facilitated due to their modularity. As opposed to decision trees the set of rules offer the possibility to remove individual rules without the need for rebuilding the entire model. This characteristic is very important for incremental learning from evolving data, because it allows faster adaptation of the model.

The previous algorithms were adapting only implicitly by inferring new rules and specializing the existing ones. AVFDR extension embeds an explicit drift detection into the learning process as presented in Algorithm 9. In addition to $\mathcal{L}_r$, a drift detection mechanism tracks performance of each rule $r$ during learning. The method employed in AVFDR is the SPC (Gama et al. 2004) (Statistical Process Control) described in Algorithm 10. With every labeled training example covered by a rule, the rule makes prediction and updates its error rate. SPC monitors the error rate and manages two registers during training: $p_{min}$ and $s_{min}$, where $p$ is error rate and $s$ is SD. Every time a new example $i$ is covered by the rule those values are updated when

$$\begin{aligned} p_{i} + s_{i} < p_{min} + s_{min}. \end{aligned}$$

The learning process of a given rule can be in one of the following 3 states: In-Control, Out-of-Control, or in Warning. We follow the 3-sigma rule (Grant and Leavenworth 1996):

$$\begin{aligned} p_{i} + s_{i} \ge p_{min} + \alpha \times s_{min}, \end{aligned}$$

where $\alpha = 2$ changes the state to Warning and $\alpha = 3$ signals that the Out-of-Control state is reached.

In the warning state, the rule stops learning until the status of the rule becomes again In-Control. The default rule stores the examples during warning in a short-term memory. If the rule reaches Out-of-Control it implies that the performance of the particular rule has degraded significantly and can negatively influence the quality of predictions of the classifier therefore it is removed from the rule set. This control enables to keep the rule set up-to-date and prevent the rule set from excessive growth. The default rule’s statistic is re-initialized with the examples from short-term memory if it reaches the Out-of-Control state.

3.4 Illustrative example

In this section we present the functionalities of the VFDR on a simple example for better understanding of the whole process. The illustrative problem is an artificial dataset with class given by the logical rule $(A \wedge B) \vee (C \wedge D)$, where the attributes $A,B,C,D \in \{true, false\}$ are randomly generated. The number of examples is 1,500.

The learned rule sets are presented in Tables 1 and 2 for unordered and ordered rule sets respectively. The different VFDR variants produce different rule sets due to the different search strategies for the best conditions.

Table 1 Unordered rule sets generated by different versions of VFDR

Very fast decision rules for classification in data streams

Abstract

Similar content being viewed by others

eRules: A Modular Adaptive Classification Rule Learning Algorithm for Data Streams

RILL: Algorithm for Learning Rules from Streaming Data with Concept Drift

Computationally Efficient Rule-Based Classification for Continuous Streaming Data

Explore related subjects

1 Introduction

2 Related work

2.1 Stream classification algorithms

2.2 Rule learning

2.3 Adaptive methods

3 Very fast decision rules algorithm

3.1 The basic algorithm

3.1.1 Growing a set of rules

3.1.2 Classification strategies

3.2 One versus all rule learning

3.2.1 Classifier with multiple sets

3.3 Adaptive very fast decision rules

3.3.1 Rule adaptation in the presence of drift

3.4 Illustrative example

4 Experimental evaluation

4.1 Experimental methodology

4.1.1 Datasets

4.1.2 Algorithms

4.1.3 Performance metrics

4.2 Comparison of VFDR algorithms

4.2.1 Stationary data

4.2.2 Non-stationary data

4.3 Comparison between streaming classification algorithms

4.3.1 Stationary data

4.3.2 Evaluation on non-stationary and real world data

5 Conclusions

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendices

Appendix 1: Datasets

1.1 Artificial datasets

1.2 Real-world datasets

Appendix 2: Results from tests on shuffled real world datasets

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation