
1 Introduction

The idea of adaptive learning in dynamical environments has recently received increasing attention in different research communities, for example, in the database and data mining community under the slogan of “learning from data streams” [17, 18], and in the computational intelligence community under the notion of “evolving fuzzy systems” [4, 5, 24, 25]. Despite small differences regarding the basic assumptions and the technical setting, the emphasis of goals and performance criteria, and the focus on specific types of applications, the key motivation of these and related fields is the idea of a system that learns incrementally, and maybe even in real-time, on a continuous stream of data, and which is able to properly adapt itself to changes of environmental conditions or properties of the data-generating process. Systems with these properties have been developed for different machine learning and data mining problems, such as clustering [1], classification [22], and frequent pattern mining [10].

Domingos and Hulten [15] list a number of properties that an ideal stream mining system should possess, and suggest corresponding design decisions:

  • The system uses only a limited amount of memory.

  • The time to process a single record is short and ideally constant.

  • The data is volatile, and a single data record is accessed only once.

  • The model produced in an incremental way is equivalent to the model that would have been obtained through common batch learning (on all data records seen so far).

  • The learning algorithm reacts to concept drift [32] (i.e., any change of the underlying data-generating process) in a proper way and maintains a model that always reflects the current concept.

Given the existence of a number of sophisticated and partly quite complicated methods for learning on data streams, it is surprising that one of the simplest approaches to machine learning, namely the instance-based (case-based) learning paradigm, has received very little attention so far—all the more since the nearest neighbor estimation principle, the core of this paradigm, is a standard method in machine learning, pattern recognition, and related fields. In this chapter, we elaborate on the potential of the instance-based approach to supervised learning within the context of data streams and propose an efficient instance-based learning algorithm for classification and regression. To this end, we build on [6], in which our approach to classification was introduced.

The remainder of the paper is organized as follows: The next section recalls the basic ideas of instance-based learning, along with a short discussion of its possible advantages and disadvantages in a streaming context. Our approach to instance-based learning on data streams, IBL-DS, is introduced in Sect. 8.3. In Sect. 8.4, we provide some information about the MOA (Massive Online Analysis) framework for mining data streams, in which IBL-DS is implemented. Experimental results are presented in Sect. 8.5, followed by a summary in Sect. 8.6; the distance function used by IBL-DS is described in Sect. 8.7.

2 Instance-Based Learning

The term instance-based learning (IBL) stands for a family of machine learning algorithms, including well-known variants such as memory-based learning, exemplar-based learning and case-based learning [23, 27, 28]. As the term suggests, in instance-based algorithms special importance is attached to the concept of an instance [3]. An instance or exemplar can be thought of as a single experience, such as a pattern (along with its classification) in pattern recognition or a problem (along with a solution) in case-based reasoning.

As opposed to model-based machine learning methods which induce a general model (theory) from the data and use that model for further reasoning, IBL algorithms simply store the data itself. They defer the processing of the data until a prediction (or some other type of query) is actually requested, a property which qualifies them as a lazy learning method [2]. Predictions are then derived by combining the information provided by the stored examples.

Such a combination is typically accomplished by means of the nearest neighbor (NN) estimation principle [11]. Consider the following setting: Let \(\mathcal{X}\) denote the instance space, where an instance corresponds to the description x of an object (usually although not necessarily in attribute-value form). \(\mathcal{X}\) is endowed with a distance measure \(\Delta (\cdot ,\cdot )\), i.e., \(\Delta (x,{x}^{{\prime}})\) is the distance between instances \(x,{x}^{{\prime}}\in \mathcal{X}\). \(\mathcal{Y}\) is the output space, and \(\langle x,y\rangle \in \mathcal{X}\times \mathcal{Y}\) is called a labeled instance, a case, or an example. In classification, \(\mathcal{Y}\) is a finite (usually small) set comprised of m classes \(\{{\lambda }_{1},\ldots ,{\lambda }_{m}\}\), whereas \(\mathcal{Y} = \mathbb{R}\) in regression.

The current experience of the learning system is represented in terms of a set \(\mathcal{D}\) of examples ⟨x i , y i ⟩, \(1 \leq i \leq n = \vert \mathcal{D}\vert \). From a machine learning point of view, \(\mathcal{D}\) plays the role of the training set of the learner. More precisely, since not all examples will necessarily be stored by an instance-based learner, \(\mathcal{D}\) is only a subset of the training set. In case-based reasoning, it is also referred to as the case base.

Finally, suppose that a novel instance \({x}_{0} \in \mathcal{X}\) (a query) is given. The NN principle prescribes to estimate the corresponding output y 0 by the output of the nearest (most similar) sample instance. The k-nearest neighbor (k-NN) approach is a slight generalization, which takes the k ≥ 1 nearest neighbors of x 0 into account. That is, an estimate \({y}_{0}^{\mathrm{est}}\) of y 0 is derived from the set \({\mathcal{N}}_{k}({x}_{0})\) of the k nearest neighbors of x 0. In classification, this is usually done by means of a majority vote, i.e.,

$${y}_{0}^{\mathrm{est}} =\arg {\max }_{ \lambda \in \mathcal{L}}\#\{{x}_{i} \in {\mathcal{N}}_{k}({x}_{0})\,\vert \,{y}_{i} = \lambda \},$$
(8.1)

with \(\mathcal{L} =\{ {\lambda }_{1},\ldots ,{\lambda }_{m}\}\) the set of class labels, whereas in regression, a weighted average of the outputs of the neighbors is predicted:

$${y}_{0}^{\mathrm{est}} ={ \sum \limits _{{x}_{i}\in {\mathcal{N}}_{k}({x}_{0})}}w({x}_{i}) \cdot {y}_{i},$$
(8.2)

with

$$w({x}_{i}) = \frac{f(\Delta ({x}_{i},{x}_{0}))} {{\sum \nolimits }_{{x}_{j}\in {\mathcal{N}}_{k}({x}_{0})}f(\Delta ({x}_{j},{x}_{0}))}.$$

Here, f( ⋅) is a decreasing function \({\mathbb{R}}_{+} \rightarrow {\mathbb{R}}_{+}\), which means that the smaller Δ(x i , x 0), the stronger the weight of y i .
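To make the two prediction rules concrete, the following minimal sketch implements them in Java (chosen because the implementation discussed later builds on the Java-based MOA framework). All names are illustrative and not part of IBL-DS, and the weight function f(d) = 1/(d + ε) is merely one admissible choice of a decreasing function.

```java
// Sketch of the k-NN prediction rules (8.1) and (8.2); brute-force search, Euclidean distance.
import java.util.*;

public class KnnSketch {

    // A stored example <x, y>: feature vector plus output (class label or real value).
    record Example(double[] x, double y) {}

    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static List<Example> nearestNeighbors(List<Example> caseBase, double[] query, int k) {
        List<Example> sorted = new ArrayList<>(caseBase);
        sorted.sort(Comparator.comparingDouble(e -> distance(e.x(), query)));
        return sorted.subList(0, Math.min(k, sorted.size()));
    }

    // Classification: majority vote among the k nearest neighbors, cf. (8.1).
    static double classify(List<Example> caseBase, double[] query, int k) {
        Map<Double, Integer> votes = new HashMap<>();
        for (Example e : nearestNeighbors(caseBase, query, k)) votes.merge(e.y(), 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // Regression: distance-weighted average of the neighbors' outputs, cf. (8.2).
    static double regress(List<Example> caseBase, double[] query, int k) {
        double num = 0.0, den = 0.0;
        for (Example e : nearestNeighbors(caseBase, query, k)) {
            double w = 1.0 / (distance(e.x(), query) + 1e-9);   // f(d) = 1/(d + eps)
            num += w * e.y();
            den += w;
        }
        return num / den;
    }
}
```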

Recall the aforementioned key requirements for learning and data mining algorithms on data streams: Above all, such algorithms must be incremental, highly adaptive, and they must be able to deal with concepts that may change over time. Is lazy, instance-based learning preferable to eager, model-based learning under these conditions? Unfortunately, this question cannot be answered unequivocally.

Obviously, IBL algorithms are inherently incremental, since adaptation basically comes down to adding or removing observed cases. Thus, incremental learning and model adaptation is simple and cheap in the case of IBL. As opposed to this, incremental learning is much more difficult to realize for most model-based approaches. Even though incremental versions do exist for a number of well-known learning methods, such as decision tree induction [30], the incremental update of a model is often quite complex and in many cases assumes the storage of a considerable amount of additional information.

The training efficiency of lazy learners does not come for free, however. Compared with model-based approaches, IBL has higher computational costs when it comes to answering new queries. In fact, the latter requires finding the k nearest neighbors of the query, and even though this retrieval step can be supported by efficient data and indexing structures, it remains costly in comparison with deriving a model-based prediction.

Consequently, IBL might be preferable in a data stream application if the amount of incoming data is large compared with the number of queries to be answered, i.e., if model updating is the dominant factor. On the other hand, if queries must be answered frequently and under tight time constraints, whereas a need for updating the model due to newly observed examples rarely occurs, a model-based method might be the better choice.

Regarding the handling of concept drift, a definite answer cannot be given either. Appropriately reacting to concept drift requires, apart from its discovery, flexible updating and adaptation strategies. In instance-based learning, model adaptation basically comes down to editing the case base, that is, adding new and/or deleting old examples. Whether or not this can be done more efficiently than adapting another type of model, such as a classification tree or a neural network, does of course strongly depend on the particular model at hand. In any case, maintaining an implicit concept description by storing observations, as done by IBL, facilitates “forgetting” examples that seem to be outdated. In fact, such examples can simply be removed, while retracting the influence of outdated examples is usually more difficult in model-based approaches. In a neural network, for example, a new observation causes an update of the network weights, and this influence on the network cannot simply be cancelled later on.

3 Instance-Based Learning on Data Streams

This section introduces our approach to instance-based learning on data streams, referred to as IBL-DS. Our learning scenario consists of a data stream that permanently produces examples, potentially with a very high arrival rate, and a second stream producing query instances to be classified. The key problem for our learning system is to maintain an implicit concept description in the form of a case base (memory). Before presenting details of IBL-DS, some general aspects and requirements of concept adaptation (case-base maintenance) in a streaming context will be discussed.

3.1 Concept Adaptation

The simplest adaptive learners are those using sliding windows of fixed size. Since the update is very simple, these learners are also very fast. On the other hand, the assumption that the data which is currently relevant forms a fixed-sized window, i.e., that it consists of a fixed number of consecutive observations, is quite restrictive. In fact, by fixing the number of examples in advance, it is impossible to optimally adapt the size of the case base to the complexity of the concept to be learned, and to react to changes of this concept appropriately. Moreover, being restricted to selecting a subset of successive observations in the form of a window, it is impossible to disregard a portion of observations in the middle (e.g., outliers) while retaining preceding and succeeding blocks of data.

To avoid both of the aforementioned drawbacks, nonwindow-based approaches are needed that do not only adapt the size of the training data but also have the liberty to select an arbitrary subset of examples from the data seen so far. Needless to say, such flexibility does not come for free. Apart from higher computational costs, additional problems such as avoiding an unlimited growth of the training set and, more generally, trading off accuracy against efficiency, have to be solved.

Instance-based learning seems to be attractive in light of the above requirements, mainly because of its inherently incremental nature and the simplicity of model adaptation. In particular, since in IBL an example has only local influence, the update triggered by a new example can be restricted to a local region around that observation.

Regarding the updating (editing) of the case base in IBL, an example should in principle be retained if it improves the predictive performance (classification accuracy) of the classifier; otherwise, it should be removed. Unfortunately, this criterion cannot be used directly, since the (future) usefulness of an example in this sense is simply not known. Instead, existing approaches fall back on suitable indicators of usefulness:

  • Temporal relevance: According to this indicator, recent observations are considered as potentially more useful and, hence, are preferred to older examples.

  • Spatial relevance: The relevance of an example can also depend on its position in the instance space. This is the case, for example, if a concept drift only affects a part of the instance space. Besides, a more or less uniform coverage of the instance space is usually desirable, especially for local learning methods. In IBL, examples can be redundant in the sense that they do not change the nearest neighbor classification of any query. More generally (and less stringently), one might consider a set of examples redundant if they are closely neighbored in the instance space and, hence, have a similar region of influence. In other words, a new example in a region of the instance space already occupied by many other examples is considered less relevant than a new example in a sparsely covered region.

  • Consistency: An example should be removed if it seems to be inconsistent with the current concept, e.g., if its own output (strongly) differs from those in its neighborhood.

Many algorithms use only one indicator, either temporal relevance (e.g., window-based approaches), spatial relevance (e.g., Lightweight Frequency Counting, LWF), or consistency (e.g., Instance-Based learning algorithm 3, IB3). A few methods also use a second indicator, e.g., the approach of Klinkenberg (temporal relevance and consistency), but only the window-based system FLORA4 (Floating Rough Approximation) uses all three aspects.

3.2 IBL-DS

In this section, we describe the main ideas of IBL-DS, our approach to IBL on data streams that not only takes all of the aforementioned three indicators into account but also meets the efficiency requirements of the data stream setting.

IBL-DS optimizes the composition and size of the case base autonomously. On arrival of a new example ⟨x 0, y 0⟩, this example is first added to the case base. Moreover, it is checked whether other examples might be removed, either because they have become redundant or because they are outliers (noisy data). To this end, a set C of examples within a neighborhood of x 0 is considered as candidates; this neighborhood is given by the k cand nearest neighbors of x 0, determined according to a distance measure Δ (see Sect. 8.7). The most recent examples are excluded from removal, owing to the difficulty of distinguishing potentially noisy data from the beginning of a concept change. Even though unexpected observations will be made in both cases, noise and concept change, these observations should be removed only in the former but not in the latter case.

In the classification scenario, the most frequent class among the k cand youngest examples in a larger test environment of size \({k}_{\mathrm{test}} = {({k}_{\mathrm{cand}})}^{2} + {k}_{\mathrm{cand}}\) is determined. If this class corresponds to the current class y 0, those candidates in C are removed that have a different class label and do not belong to the k cand youngest examples in the larger test environment. Furthermore, to guarantee an upper bound on the size of the case base, the oldest element of the similarity environment is deleted, regardless of its class, whenever the upper bound would be exceeded by adding the new example. The similarity environment is the set of instances in the vicinity of the query instance, while the test environment can be seen as the union of the similarity environments of the neighboring instances.
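The following sketch illustrates this editing step for the classification case under simplifying assumptions: the case base is a plain list scanned linearly, distances are Euclidean, and all class and variable names are ours. The actual IBL-DS implementation relies on an indexing structure and on the further refinements described in [6].

```java
// Simplified sketch of the case-base edit triggered by a new labeled example.
import java.util.*;

public class EditSketch {

    record Example(double[] x, int label, long timestamp) {}

    static void insertAndEdit(List<Example> caseBase, Example newEx, int kCand, int maxSize) {
        caseBase.add(newEx);

        int kTest = kCand * kCand + kCand;                       // size of the larger test environment
        List<Example> testEnv = nearest(caseBase, newEx.x(), kTest);
        List<Example> candidates = nearest(caseBase, newEx.x(), kCand);

        // the k_cand youngest examples of the test environment are protected from removal
        List<Example> byAge = new ArrayList<>(testEnv);
        byAge.sort(Comparator.comparingLong(Example::timestamp).reversed());
        Set<Example> youngest = new HashSet<>(byAge.subList(0, Math.min(kCand, byAge.size())));

        // most frequent class among these youngest examples
        Map<Integer, Integer> votes = new HashMap<>();
        for (Example e : youngest) votes.merge(e.label(), 1, Integer::sum);
        int majority = Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();

        // if the new example confirms the majority class, drop inconsistent, unprotected candidates
        if (majority == newEx.label()) {
            caseBase.removeIf(e -> candidates.contains(e)
                    && e.label() != newEx.label()
                    && !youngest.contains(e));
        }

        // enforce the upper bound: remove the oldest example of the similarity environment
        if (caseBase.size() > maxSize) {
            candidates.stream().min(Comparator.comparingLong(Example::timestamp))
                      .ifPresent(caseBase::remove);
        }
    }

    static List<Example> nearest(List<Example> base, double[] q, int k) {
        List<Example> sorted = new ArrayList<>(base);
        sorted.sort(Comparator.comparingDouble(e -> dist(e.x(), q)));
        return sorted.subList(0, Math.min(k, sorted.size()));
    }

    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```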

In the regression scenario, the k cand youngest examples in the neighborhood set C determine a confidence interval \(\left [\bar{y} - {z}_{\alpha /2}\, \frac{\sigma }{\sqrt{{k}_{\mathrm{cand}}}},\ \bar{y} + {z}_{\alpha /2}\, \frac{\sigma }{\sqrt{{k}_{\mathrm{cand}}}}\right ],\) where \(\bar{y}\) is the average target value of these examples and σ is their standard deviation. A target value y 0 outside this interval indicates an unexpected change in the neighborhood at the time this instance was generated. In this case, instances whose target values do not belong to the confidence interval are removed from the larger test environment.
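For the regression case, the corresponding consistency test can be sketched as follows; the quantile \({z}_{\alpha /2}\) is supplied by the caller (e.g., 1.96 for a 95% interval), and the class and method names are ours.

```java
// Sketch of the regression consistency test: the k_cand youngest neighbors define a
// confidence interval around their mean; a target value outside it signals a local change.
public class RegressionConsistency {

    static boolean outsideInterval(double[] neighborOutputs, double y0, double zAlphaHalf) {
        int k = neighborOutputs.length;

        double mean = 0.0;
        for (double y : neighborOutputs) mean += y;
        mean /= k;

        double var = 0.0;
        for (double y : neighborOutputs) var += (y - mean) * (y - mean);
        double sigma = Math.sqrt(var / k);

        double halfWidth = zAlphaHalf * sigma / Math.sqrt(k);
        return Math.abs(y0 - mean) > halfWidth;                  // outside [mean - h, mean + h]
    }
}
```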

Using this strategy, the algorithm is able to adapt to concept drift but will also have a high accuracy for nondrifting data streams. Still, these two situations—drifting and stable concept—are to some extent conflicting with regard to the size of the case base: If the concept to be learned is stable, classification accuracy will increase with the size of the case base. On the other hand, a large case base turns out to be disadvantageous in situations where concept drift occurs, and even more in the case of concept shift. In fact, the larger the case base is, the more outdated examples will have to be removed and, hence, the more sluggish the adaptation process will be.

For this reason, we try to detect an abrupt change of the concept using a statistical test as in [19, 20]. If a corresponding change has been detected, a large number of examples will be removed instantaneously from the case base. In the classification scenario, the test is performed as follows: We maintain the prediction error p and standard deviation \(s = \sqrt{\frac{p(1-p)} {100}}\) for the last 100 training instances. Let p min denote the smallest among these errors and s min the associated standard deviation. A change is detected if the current value of p is significantly higher than p min. Here, statistical significance is determined by testing the null hypothesis H 0 : p ≤ p min against the alternative hypothesis H 1 : p > p min. This is accomplished by using a standard (one-sided) z-test, i.e., the condition to be tested is \(p + s > {p}_{\min } + {z}_{\alpha }{s}_{\min }\), where α is the level of confidence (we use α = 0.999).
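A simplified illustration of this test is given below. The window length of 100 instances and the one-sided condition follow the description above; the rule for updating p min and s min (whenever p + s reaches a new minimum, as in [19]) is an assumption of the sketch rather than a statement about the exact bookkeeping in IBL-DS.

```java
// Sketch of drift detection based on the error rate over the last 100 training instances.
import java.util.ArrayDeque;
import java.util.Deque;

public class DriftDetector {

    private final Deque<Integer> window = new ArrayDeque<>();   // 1 = error, 0 = correct
    private double pMin = Double.MAX_VALUE;
    private double sMin = Double.MAX_VALUE;
    private final double zAlpha;                                 // e.g. about 3.09 for alpha = 0.999

    DriftDetector(double zAlpha) { this.zAlpha = zAlpha; }

    /** Feed one training instance's prediction outcome; returns true if a change is detected. */
    boolean update(boolean predictionWasWrong) {
        window.addLast(predictionWasWrong ? 1 : 0);
        if (window.size() > 100) window.removeFirst();
        if (window.size() < 100) return false;                   // wait until the window is full

        double p = window.stream().mapToInt(Integer::intValue).average().orElse(0.0);
        double s = Math.sqrt(p * (1 - p) / 100);

        if (p + s < pMin + sMin) { pMin = p; sMin = s; }         // remember the best state so far

        return p + s > pMin + zAlpha * sMin;                     // one-sided z-test condition
    }
}
```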

Finally, in case a change has been detected, we try to estimate its extent in order to determine the number of examples that need to be removed. More specifically, we delete \({p}_{\mathrm{dif}}\) percent of the current examples, where \({p}_{\mathrm{dif}}\) is the difference between p min and the classification error on the last 20 instances; the latter serves as an estimate of the current classification error. Examples to be removed are chosen at random according to a distribution that is spatially uniform but temporally skewed; see [6] for details.

In the regression scenario, the above test is conducted with the mean absolute error instead of the classification rate, and the percentage of examples to be removed is determined by the relative increase of this error.

4 MOA

IBL-DS is implemented under the MOA (Massive Online Analysis) framework, an open-source software framework for mining and analyzing large data sets in a stream-like manner. MOA is written in Java and is closely related to WEKA [31], the Waikato Environment for Knowledge Analysis, one of the most widely used machine learning toolkits.

MOA supports the development of classifiers that can learn either in a purely incremental mode, or in batch mode first (on an initial part of a data stream) and incrementally afterward. The implementation of an evolving classifier is supported by a Java interface called UpdateableClassifier, which reflects the online learning setting, in which each instance is accessed only once. A few incremental classifiers are already included in MOA, notably the Hoeffding tree [22], a state-of-the-art classifier often used as a baseline in experimental studies. Some meta-learning techniques are implemented, too, such as online bagging and boosting, both for static [26] and evolving streams [8].

4.1 Stream Generators

MOA supports the simulation of data streams by means of synthetic stream generators. An example is the Hyperplane generator that was originally used in [22]. It generates data for a binary classification problem, taking a random hyperplane in d-dimensional Euclidean space as a decision boundary; a certain percentage of instances is corrupted with noise.

Another important stream generator is the RandomTree generator. Its underlying model is a decision tree for a desired number of attributes and classes. The tree is built by splitting on randomly chosen attributes and then giving random class labels to the leaf nodes. Instances are generated with uniformly distributed values in the attributes while the class label is determined by the tree.

MOA offers the ConceptDriftStream procedure for simulating concept drift. The idea underlying this procedure is to mix two pure distributions in a probabilistic way, smoothly varying the corresponding probability degrees. In the beginning, examples are taken from the first pure stream with probability 1, and this probability is decreased in favor of the second stream in the course of time. More specifically, the probability is controlled by means of the sigmoid function

$$f(t) ={ \left (1 + {e}^{-4\left (t-{t}_{0}\right )/w}\right )}^{-1}.$$

This function has two parameters: t 0 is the midpoint of the change process, while w is the length of this process.
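A toy sketch of this mixing mechanism (our own illustration, not MOA's generator code): at time t, an example is drawn from the second, “new” stream with probability f(t) and from the first, “old” stream otherwise.

```java
// Sketch of probabilistically mixing two concepts with a sigmoid transition.
import java.util.Random;

public class DriftMixSketch {

    static double probabilityOfNewConcept(long t, long t0, double w) {
        return 1.0 / (1.0 + Math.exp(-4.0 * (t - t0) / w));
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        long t0 = 500_000;                                       // midpoint of the change
        double w = 100_000;                                      // length of the change
        for (long t = 0; t <= 1_000_000; t += 250_000) {
            double p = probabilityOfNewConcept(t, t0, w);
            boolean fromNewStream = rnd.nextDouble() < p;
            System.out.printf("t=%d  P(new)=%.3f  drawn from %s stream%n",
                              t, p, fromNewStream ? "new" : "old");
        }
    }
}
```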

4.2 Model Evaluation

The evaluation of an evolving classifier is clearly a nontrivial issue. In fact, compared to standard batch learning, simple one-dimensional performance measures such as classification accuracy are not immediately applicable, or at least not able to capture the time-varying behavior of a classifier in a proper way. MOA offers different solutions for this problem.

The holdout procedure is a generalization of the cross-validation procedure commonly used in batch learning. Here, the training and the testing phase of a classifier are interleaved as follows: the classifier is trained incrementally on a block of M instances and then evaluated (but no longer adapted) on the next N instances, then again trained on the next M and tested on the subsequent N instances, and so forth. Thus, it becomes possible to monitor the performance of the model as time progresses; this information can also be used as an indicator of possible changes of the underlying concept [7, 9].

While the holdout procedure uses an instance either for training or for testing, each instance is used for both in the prequential approach [12]: First, the model is evaluated on the instance, and then a single incremental learning step is carried out. The prequential error is advocated in [21], where it is also shown to converge to the holdout measure when using a sliding window or a fading factor (exponential weighting).
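As an illustration, the following sketch implements the prequential test-then-train loop against a minimal, hypothetical Classifier interface; in MOA itself, this role is played by its classifier interface and built-in evaluation tasks.

```java
// Sketch of prequential evaluation: every instance is first used for testing, then for training.
import java.util.Iterator;

public class PrequentialSketch {

    interface Classifier {
        int predict(double[] x);
        void trainOnInstance(double[] x, int label);
    }

    record Labeled(double[] x, int label) {}

    static double prequentialErrorRate(Iterator<Labeled> stream, Classifier clf) {
        long n = 0, errors = 0;
        while (stream.hasNext()) {
            Labeled inst = stream.next();
            if (clf.predict(inst.x()) != inst.label()) errors++;  // test first ...
            clf.trainOnInstance(inst.x(), inst.label());          // ... then train
            n++;
        }
        return n == 0 ? 0.0 : (double) errors / n;
    }
}
```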

5 Experiments

In this section, we compare IBL-DS with state-of-the-art learners in terms of performance and handling of concept drift, namely Hoeffding trees for classification [22] and the FLEXFIS approach for regression [24]. The Hoeffding tree is a decision tree approach suitable for learning on data streams, whereas FLEXFIS constructs and maintains a specific kind of fuzzy rule-based model, namely a model of the Takagi–Sugeno type [29]. Our study is not meant as an extensive empirical evaluation that supports statistically valid conclusions. Instead, it is only supposed to serve an illustration purpose. We refer to [6] for more experiments with classification problems.

We use IBL-DS in its default setting unless otherwise stated (in some binary classification problems, we try different values for the maximum size of the instance base). Experiments are conducted not only with real data sets, but also with synthetic data. An important advantage of synthetic data is that it allows for conducting experiments in a controlled way and, therefore, for investigating the performance of a method under specific conditions. In particular, synthetic data is useful for simulating a concept drift.

The experiments are performed in the MOA framework, using the holdout procedure for measuring predictive accuracy. The parameters M and N vary depending on the size of the data set (we take M = 5,000 and N = 1,000 in the first two experiments with synthetic data). For the experiments with real data, these parameters are adapted to the size of the respective data set; see Table 8.1 for an overview of the main characteristics of these data sets. The real data sets are standard benchmarks taken from the StatLib archive and the UCI repository [16]. Since they do not have an inherent temporal order, we average the performance curves over 100 randomly shuffled versions of these data sets.

5.1 Classification

5.1.1 Synthetic Data

The first two experiments are based on synthetic data with different characteristics (i.e., different types of decision boundaries). The first experiment uses data taken from the hyperplane generator. The ConceptDriftStream procedure, mixing streams produced by two different hyperplanes, simulates a rotating hyperplane. Using this procedure, we generated 12,000,000 examples connecting two hyperplanes in four-dimensional space, with t 0 = 500,000 and w = 100,000.

We compare the performance of two different settings of IBL-DS, one with a value of 400 for the maximum size of the instance base and the other one with 5,000. Figure 8.1 shows that both versions of IBL-DS initially outperform the Hoeffding tree. The Hoeffding tree is also more affected by the concept drift, showing a more pronounced “valley” in the performance curve, and also taking more time to recover. IBL-DS recognizes and adapts to the concept drift quite early, recovering its original performance as soon as the drift is over.

Table 8.1 Summary of the data sets used in the experiments

In a second experiment, we use the random tree generator to produce examples. Obviously, this generator is favorable for the Hoeffding tree. Again, the same ConceptDriftStream procedure is used, but this time mixing two random tree generators. As can be seen in Fig. 8.2, the Hoeffding tree is now able to outperform IBL-DS in the first phase of the learning process, reaching an accuracy close to 100%, which is not unexpected given that the Hoeffding tree is ideally tailored to this kind of data. Once again, however, the Hoeffding tree is much more affected by the concept drift than IBL-DS. Both variants of IBL-DS suffer a drop of about 15% in classification rate and recover quickly during the phase of the drift, whereas the Hoeffding tree loses about 40% of its accuracy.

5.1.2 Real Data

In this experiment, we used the Shuttle data from the Statlog repository, for which the task is to predict the class of a shuttle. The data set is highly imbalanced, with 80% of the instances belonging to one class and the remaining 20% distributed among six other classes; in order to obtain a binary problem, we grouped these six classes into a single one. The new problem thus consists of predicting whether a shuttle belongs to the majority class or not. Both algorithms were initially trained on 300 instances in batch mode; for the holdout evaluation, we used M = 200 and N = 50. Figure 8.3 shows the results averaged over 100 randomly shuffled versions of the data set. As can be seen, IBL-DS starts with a very strong performance, close to 99% accuracy; the Hoeffding tree reaches this accuracy, too, but not before observing three quarters of the whole stream.

Fig. 8.1 Classification rate on the hyperplane data (binary)

Fig. 8.2 Classification rate on the RandomTree data (binary)

The wine quality data is an ordinal classification problem, in which a wine (characterized by several chemical properties) is put into a discrete category ranging from 10 (best) to 0 (worst). We turned this problem into a binary classification task by grouping the top-5 and bottom-6 classes. Actually, the data set consists of two subsets, one for white wine and one for red wine. For both data sets, the initial learning is done on 300 instances. In all our experiments on the wine quality data, we average the results over 100 randomly shuffled versions. For the evaluation on the red wine data, we used M = 100 and N = 25, because this data set is relatively small (about 1,600 examples); for white wine, we used M = 200 and N = 50. Figure 8.4 shows the results of both experiments. As can be seen, IBL-DS is clearly superior to Hoeffding trees on these data sets.

Fig. 8.4 Classification rate on the wine quality data set (binary)

For evaluating the multiclass case, we used the same real data sets as above, but without grouping the output categories. As can be seen from Fig. 8.5, the performance of both IBL-DS and Hoeffding trees on the wine data is lower than in the binary case, an observation that is clearly expected. Still, IBL-DS remains superior over the whole stream. For the Shuttle data, Fig. 8.6 shows that the performance of IBL-DS remains almost the same as in the binary case, whereas the Hoeffding tree again starts with a low classification rate and never exceeds the 85% mark.

Fig. 8.3 Classification rate on the Shuttle data (binary)

Fig. 8.5 Classification rate on the wine quality data set (multiclass)

5.2 Regression

For the case of regression, we modified the hyperplane generator in MOA as follows: The output for an instance x is not determined by the sign of \({w}^{T}x\), where w is the normal vector of the hyperplane, but by the absolute value \(\vert {w}^{T}x\vert \). In other words, the problem is to predict the distance to the hyperplane. As an alternative, we also tried \({({w}^{T}x)}^{2}\), i.e., the squared distance. Again, ConceptDriftStream was used for simulating a concept drift by mixing two streams.
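A small sketch of this data-generating process (our own simplification, not MOA's generator code): the normal vector and the feature values are drawn uniformly at random, and the target is either \(\vert {w}^{T}x\vert \) or its square.

```java
// Sketch of the regression variant of the hyperplane generator.
import java.util.Random;

public class RegressionHyperplaneSketch {

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int d = 4;                                               // dimension used in the experiments
        double[] w = rnd.doubles(d, -1, 1).toArray();            // normal vector of the hyperplane
        for (int n = 0; n < 3; n++) {
            double[] x = rnd.doubles(d, 0, 1).toArray();
            double dot = 0.0;
            for (int i = 0; i < d; i++) dot += w[i] * x[i];
            double yLinear = Math.abs(dot);                      // (piecewise) linear target
            double yQuadratic = dot * dot;                       // quadratic target
            System.out.printf("|w^T x| = %.3f, (w^T x)^2 = %.3f%n", yLinear, yQuadratic);
        }
    }
}
```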

Figures 8.7 and 8.8 show the performance of IBL-DS and FLEXFIS, in terms of the root mean squared error (RMSE), for the (piecewise) linear and the quadratic case (and dimension d = 4), respectively. As can be seen, FLEXFIS performs quite well in the linear case. This behavior is expected and can easily be explained by its model structure (FLEXFIS uses fuzzy rules with linear functions as consequent parts). What is more interesting, however, is the observation that IBL-DS is much less affected by the concept drift, both in the linear and the quadratic case. In fact, while FLEXFIS deteriorates significantly and needs quite some time to recover, the performance of IBL-DS remains almost unchanged.

Fig. 8.6 Classification rate on the Shuttle data (multiclass)

Fig. 8.7 RMSE for the hyperplane data (regression, linear case)

As a real data set, we again used the wine data, this time treating the quality level as a numerical value. Figure 8.9 shows that IBL-DS is slightly worse than FLEXFIS [24] on these two data sets.

6 Summary

We have presented an instance-based algorithm for classification and regression on data streams. This algorithm, called IBL-DS, has a number of desirable properties that are not, at least not as a whole, shared by existing alternative methods. The experiments presented in [6], complemented by those in this paper, suggest that IBL-DS is very flexible and thus able to adapt to an evolving environment quickly, a point of utmost importance in the data stream context. In particular, two specially designed editing strategies are used in combination in order to successfully deal with both gradual concept drift and abrupt concept shift. Besides, IBL-DS is relatively robust and produces good results when used with its default parameter setting. An implementation of IBL-DS under the MOA framework, along with a documentation, can be downloaded at the following address: http://www.uni-marburg.de/fb12/kebi/research/software/iblstreams/.

Fig. 8.8 RMSE for the hyperplane data (regression, quadratic case)

Fig. 8.9 RMSE for the wine quality data set (regression)

7 Distance Function

The distance function used in IBL-DS is an incremental variant of SVDM (Simple Value Difference Metric), a simplified version of the VDM (Value Difference Metric) distance measure [28] that was successfully used in the classification algorithm RISE [13, 14]. Let an instance x be specified in terms of features \({F}_{1},\ldots ,{F}_{\ell }\), i.e., as a vector \(x = ({f}_{1},\ldots ,{f}_{\ell }) \in {D}_{1} \times \cdots \times {D}_{\ell }\).

Numerical features F i with domain \({D}_{i} = \mathbb{R}\) are first normalized by the mapping \({f}_{i}\mapsto {f}_{i}/(\max -\min )\), where max and min denote, respectively, the largest and smallest value of F i observed so far; these values are permanently updated. The distance \({\delta }_{i}({f}_{i},{f}_{i}^{{\prime}})\) is then defined as the Euclidean distance between the normalized values of f i and f i ′.

For a discrete attribute F j , the distance between two values f j and f j ′ is defined by the following measure:

$${\delta }_{j}\left ({f}_{j},{f}_{j}^{{\prime}}\right ) ={ \sum \limits _{k=1}^{m}}\left \vert P\left ({\lambda }_{k}\,\vert \,{F}_{j} = {f}_{j}\right ) - P\left ({\lambda }_{k}\,\vert \,{F}_{j} = {f}_{j}^{{\prime}}\right )\right \vert ,$$

where m is the number of classes and P(λ | F = f) denotes the probability of class λ given the value f of attribute F. Finally, the distance between two instances x and x ′ is given by the mean squared distance

$$\Delta (x,{x}^{{\prime}}) = \frac{1}{\ell }{\sum \limits _{i=1}^{\ell }}{\delta }_{i}{\left ({f}_{i},{f}_{i}^{{\prime}}\right )}^{2}.$$
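To summarize the computation, the following sketch puts the three pieces together: range normalization for numeric features, SVDM for discrete ones, and the mean of the squared per-feature distances. The class-conditional probabilities, which IBL-DS maintains incrementally, are simply passed in as precomputed tables here; all names are illustrative.

```java
// Sketch of the combined SVDM-based distance used for mixed numeric/discrete features.
import java.util.Map;

public class SvdmDistanceSketch {

    // numeric feature: absolute difference after normalization by the observed value range
    static double numericDistance(double f, double fPrime, double min, double max) {
        double range = max - min;
        return range == 0.0 ? 0.0 : Math.abs(f - fPrime) / range;
    }

    // discrete feature: sum over classes of |P(class | f) - P(class | f')|
    static double svdmDistance(Map<Integer, Double> probGivenF,
                               Map<Integer, Double> probGivenFPrime, int numClasses) {
        double d = 0.0;
        for (int k = 0; k < numClasses; k++) {
            d += Math.abs(probGivenF.getOrDefault(k, 0.0) - probGivenFPrime.getOrDefault(k, 0.0));
        }
        return d;
    }

    // overall distance: mean of the squared per-feature distances
    static double combine(double[] perFeatureDistances) {
        double s = 0.0;
        for (double delta : perFeatureDistances) s += delta * delta;
        return s / perFeatureDistances.length;
    }
}
```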