
1 Introduction

In real-world data analysis, data often arrive continuously in streams, following a probability distribution that can change over time. Such data are known as data streams. Depending on how the data distribution changes, different phenomena can occur, such as concept drift [8]. In these situations, it is important to adapt the classification model to the current stream; otherwise, its predictive performance can decrease over time.

Several algorithms proposed for data stream mining are based on online learning [4, 7, 8, 9]. Some of them are based on the kNN (k-Nearest Neighbors) algorithm. In data stream mining, the kNN algorithm maintains a sliding window with a certain amount of labeled data, which is used as its training data.

Other algorithms from the literature deal with concept drift by explicitly detecting changes in parts of the stream, comparing the current concept with previous concepts from time to time [9]. Some of them continuously compute the classification error of the model. To do so, they assume that the labels of the data arriving in the stream are available.

However, there is a cost associated with the data labeling process that can become prohibitive or unfeasible when data arrive at high speed or in high volume. In online classification, labeling incoming instances can have a high cost [11]. The lack of labels makes measuring the classification error a difficult task.

Despite its simplicity, kNN has been widely used in the literature because it is nonparametric, which favors its use in scenarios with little available information and with concepts that change over time [4]. However, the use of a sliding window may discard instances carrying relevant information about persistent concepts. Furthermore, the size of a sliding window affects how efficiently it can be used.

This article proposes SWC (Sliding Window Clusters), a kNN-based method that implements a sliding window storing a reduced number of instances. SWC summarizes data streams by creating a set of clusters, each one representing similar labeled instances. Instances close to the decision border of each cluster are also stored, so they can be used to adapt the model to concept drift.

An experimental evaluation shows that SWC can increase predictive performance and reduce both memory and time consumption compared with related methods based on kNN and sliding windows.

This paper is structured as follows. Section 2 presents previous related work using kNN and sliding windows. Data streams and concept drift are introduced in Sect. 3. The proposed method is described in Sect. 4. Section 5 presents the experimental setup and analyzes the obtained results. Finally, Sect. 6 draws the main conclusions and points out future work directions.

2 Related Work

This section briefly presents previous work using kNN for data stream classification under concept drift. These works use variations of the sliding window to store the training instances.

One alternative in online learning is to randomly select instances to maintain in or discard from the sliding window. This is the case of PAW (Probabilistic Approximate Window) [4], a method that uses a probabilistic measure to decide which instances are discarded from the sliding window when a new instance arrives. Thus, the size of the window is variable, and its content represents a mix of outdated and recent relevant instances. The kNN\(_{W}\) method combines PAW with the kNN classifier.
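The idea can be conveyed with a minimal sketch. The code below is an illustrative probabilistic window, assuming a fixed per-arrival forgetting probability (the actual decay scheme of PAW in [4] differs); it shows how the window size becomes variable and biased toward recent instances.

```python
import random
from collections import deque

class ProbabilisticWindow:
    """Sketch of a PAW-like probabilistic sliding window.

    Each stored instance is forgotten with a small probability whenever
    a new instance arrives, so the window keeps a mix of recent and a
    few surviving older instances, and its size varies over time.
    """

    def __init__(self, forget_prob=0.01):
        self.forget_prob = forget_prob  # illustrative drop probability
        self.window = deque()           # stored (x, y) training pairs

    def add(self, x, y):
        # Probabilistically discard previously stored instances,
        # then store the new one.
        self.window = deque(
            (xi, yi) for xi, yi in self.window
            if random.random() > self.forget_prob
        )
        self.window.append((x, y))
```

A kNN classifier such as kNN\(_{W}\) would then search `window` for the nearest neighbors of each test instance.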

Another related method, ADWIN (ADaptive sliding WINdowing) [2], is a concept drift tracker able to monitor changes in data streams. The algorithm automatically grows the sliding window when no change is detected in the stream. When a change is detected, the algorithm shrinks the sliding window, forgetting the sub-window that has become outdated. In the combination of kNN with PAW and ADWIN [4], kNN\(_{WA}\), ADWIN is used to keep only the data related to the most recent concept in the stream; the remaining instances are discarded.
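The following sketch conveys the core idea in simplified form; it is not Bifet and Gavaldà's exact ADWIN algorithm (which uses exponential histograms for efficiency), but it checks every cut point of a window of recent error indicators and drops the older part when the sub-window means differ by more than a Hoeffding-style bound.

```python
import math
from collections import deque

def adwin_like_shrink(errors, delta=0.002):
    """Simplified ADWIN-style window shrinking (sketch).

    errors: sequence of 0/1 classification-error indicators.
    Returns the window, shrunk to its recent part if a change
    in the mean error is detected at some cut point.
    """
    w = list(errors)
    for cut in range(1, len(w)):
        left, right = w[:cut], w[cut:]
        mean_gap = abs(sum(left) / len(left) - sum(right) / len(right))
        # Hoeffding-style threshold for the difference of two means.
        m = 1.0 / (1.0 / len(left) + 1.0 / len(right))
        eps = math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / delta))
        if mean_gap > eps:
            return deque(right)  # change detected: keep only recent data
    return deque(w)              # no change: keep the whole window
```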

A deficiency of updating instances using a sliding window is the possibility of forgetting old but relevant information. To avoid this loss, another method, named SAM (Self Adjusting Memory) [9], adapts a model to concept drift by explicitly separating current and past information. SAM uses two memory structures to store information: a short-term memory and a long-term memory. The short-term memory contains data associated with the current concept, while the long-term memory maintains knowledge (old models) from past concepts. SAM is coupled with a kNN classifier.
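A minimal sketch of the dual-memory idea follows; the class and method names are illustrative, and the real SAM additionally adapts the short-term memory size, cleans contradicting samples, and compresses the long-term memory by clustering.

```python
class DualMemoryKNN:
    """Sketch of a SAM-like dual memory (illustrative names only)."""

    def __init__(self):
        self.stm = []  # short-term memory: data of the current concept
        self.ltm = []  # long-term memory: knowledge from past concepts

    def learn(self, x, y):
        # New instances always enter the short-term memory.
        self.stm.append((x, y))

    def consolidate(self, outdated):
        # Instances discarded from the STM move to the LTM instead of
        # being forgotten, preserving past concepts for later reuse.
        self.ltm.extend(outdated)

    def training_set(self):
        # A kNN search can be run over the union of both memories.
        return self.stm + self.ltm
```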

Implementations of these kNN variations for data stream mining are available in the MOA framework. Due to memory and computational limitations, the implementations use a fixed-size window of 1000 labeled instances.

3 Problem Formalization

A possibly unbounded amount of data can sequentially arrive in a data stream. These data can often undergo changes in their distribution over time, which may require the adaptation of the current model [8].

Formally, a data stream is a potentially infinite sequence of instances, which can be represented by [7]:

$$ D_{tr} = \left\{ (X_{1}, y_{1}), (X_{2}, y_{2}), \ldots , (X_{tr}, y_{tr})\right\} $$

where \(X_{tr}\) is an instance arriving at time tr and \(y_{tr}\) is its target class. Due to finite resources, each instance can be processed only once.

Concept drift is a change in the joint probability distribution of instances and target classes [8]. Formally, the distribution P over an instance X and its label y at a given time tr can change, so that a model built at time tr may be outdated at time \(tr+1\):

$$ \exists X : P_{tr}(X,y) \ne P_{tr+1}(X,y) $$
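As a concrete toy example of this definition, the generator below (an illustration, not one of the benchmark streams used later) produces a one-dimensional stream in which the class-conditional distributions swap at a chosen point, so \(P_{tr}(X,y) \ne P_{tr+1}(X,y)\) across the change.

```python
import numpy as np

def drifting_stream(n=1000, drift_at=500, seed=0):
    """Toy stream with an abrupt concept drift at time `drift_at`.

    Before the drift, class 1 is centered at +2 and class 0 at -2;
    afterwards the centers are swapped, changing P(X, y).
    """
    rng = np.random.default_rng(seed)
    for tr in range(n):
        y = int(rng.random() < 0.5)
        # The class centers flip when tr crosses drift_at.
        center = 2.0 if (tr < drift_at) == bool(y) else -2.0
        x = rng.normal(loc=center, scale=1.0)
        yield tr, x, y
```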

4 Methodology

In data stream mining, an ideal classifier should be able to learn the current concept in feasible time without forgetting relevant past information [4].

The proposed method is described in Algorithm 1. Instead of storing all instances that fit in a sliding window (representing both old and current concepts), SWC stores compressed information about concepts, together with instances close to the uncertainty border of each class. Like the previous methods, SWC is combined with the kNN classifier in the MOA framework [3].

Algorithm 1. Pseudocode of the SWC method.

Fig. 1. Instances \(X_{1}\) and \(X_{2}\) are stored in the sliding window. The first instance, \(X_{1}\), is close to cluster \(C_{1}\) and inside its radius. The second instance, \(X_{2}\), is outside the cluster area, but close to the uncertainty border.

A more detailed description of how SWC works is presented next. Initially, all instances arriving from the stream are stored in the form of clusters. The clusters are created using the CluStream algorithm [1], with an added constraint requiring that each cluster contain only instances from the same class.

As data arrive in the stream, a probability parameter, \(\rho \), is used to decide whether a new instance, \(X_{tr}\), will be incorporated into the model W. If \(X_{tr}\) is inside the radius of an existing cluster, the instance is incorporated into this cluster. However, if \(X_{tr}\) is outside the cluster, but close to an uncertainty border, \(X_{tr}\) is stored alone in the model, outside the existing clusters. The uncertainty border is defined as the area outside the radius of a cluster, but inside a given threshold.

As illustrated in Fig. 1, if an instance, \(X_{1}\), is inside the radius of the closest cluster, it is incorporated into that cluster; if an instance, \(X_{2}\), falls in the uncertainty border instead, it is stored alone.

It must be observed that not all instances in the stream are included in the sliding window. For each instance arriving in the stream, SWC randomly decides whether the instance will be learned or not. A similar procedure is used in [4], with a probability \(\rho = 0.5\). SWC uses a lower probability, \(\rho = 0.2\), since the learning process can be carried out at this rate without significant loss of predictive performance, but with a lower processing cost.
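The update step just described can be summarized in the following sketch, written against the paper's description; the model and cluster helpers (`closest_cluster`, `radius`, `incorporate`, `store_instance`) are assumed interfaces for illustration, not the authors' MOA implementation.

```python
import random

def swc_update(model, x, y, rho=0.2, threshold=1.1):
    """Sketch of one SWC update for a labeled instance (x, y)."""
    if random.random() > rho:
        return                          # instance not selected for learning
    # Nearest cluster of the same class (clusters are single-class).
    c = model.closest_cluster(x, y)
    d = c.distance_to_center(x)
    if d <= c.radius:
        c.incorporate(x)                # inside the radius: absorb into cluster
    elif d <= threshold * c.radius:
        model.store_instance(x, y)      # inside the uncertainty border:
                                        # keep alone as a possible drift
    # otherwise the instance is discarded
```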

5 Experimental Evaluation

This section experimentally compares SWC with other methods implemented in the MOA framework that combine kNN with a sliding window, namely kNN, kNN\(_{W}\), kNN\(_{WA}\) and SAM. The evaluation protocol used was Interleaved Test-Then-Train, suited to incremental learning [4].
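For reference, a minimal sketch of this protocol is shown below, assuming a classifier exposing `predict` and `learn` methods (an assumed interface); each instance is used for testing before it is used for training, so accuracy is always measured on unseen data.

```python
def interleaved_test_then_train(classifier, stream):
    """Prequential (Interleaved Test-Then-Train) evaluation sketch."""
    correct = total = 0
    for x, y in stream:
        if classifier.predict(x) == y:   # test on the instance first...
            correct += 1
        total += 1
        classifier.learn(x, y)           # ...then train on it
    return correct / total if total else 0.0
```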

5.1 Datasets

Table 1 describes the datasets used in the experiments. Before the streaming, in an offline phase, all methods started with a batch of labeled data representing 10% of each dataset. The remaining data arrived in the stream. Real and artificial datasets were used.

Table 1. Characteristics of datasets evaluated.

Artificial Datasets

The SEA Concepts dataset [10] has four concepts. A concept drift occurs every 15,000 instances, with a different threshold for each concept.

The Rotating Hyperplane dataset is based on a hyperplane in d-dimensional space that continuously changes its position and orientation. It is available in the MOA framework and was used in [4, 9].

Moving RBF is a dataset, generated by the MOA framework, based on Gaussian distributions with random initial positions, weights and standard deviations. Over time, these Gaussian distributions change. This dataset was used in [4, 9].

Mixed Drift [4, 9] is a mix of three datasets: Interchanging RBF, Moving Squares and Transient Chessboard. Data from each dataset are alternately presented in the stream.

Real World Datasets

The Forest Cover Type dataset [6] is a well-known benchmark for the evaluation of data stream mining algorithms, frequently used to validate proposed methods [4, 9, 11].

The Airlines dataset contains data from US flight control [11]. It has two classes, one indicating that a flight will be delayed and the other that the flight will arrive on time.

5.2 Results and Discussion

The proposed method, SWC, is compared with the methods kNN, kNN\(_{W}\), kNN\(_{WA}\) and SAM. For all methods, one nearest neighbor (\(k = 1\)) is adopted. The remaining parameters use default values, including a fixed window size (\(w = 1000\)).

The \(\rho \) parameter of SWC, the probability of updating the model, was set to achieve an acceptable trade-off between accuracy and time cost. A threshold parameter \(T = 1.1\), defining the uncertainty border, is also set for each cluster. The threshold is multiplied by the radius of each cluster and indicates how much the cluster can expand. Both parameters were explained in Sect. 4.

Experiments were performed to select the value of \(\rho \) for SWC. Figure 2 shows that accuracy increases with \(\rho = 0.5\), meaning that an instance has a 50% chance of being learned by the model. However, the selected value was \(\rho = 0.2\), which results in a better balance between accuracy and time cost.

Fig. 2. SWC accuracy on all datasets, with the \(\rho \) value varying from \(5\%\) to \(50\%\).

Table 2 shows the average accuracy and total time cost of each method. Accuracy is measured as the proportion of instances correctly classified under the interleaved test-then-train evaluation [4].

The results show that SWC is competitive with the state-of-the-art SAM and is considerably faster. The baseline kNN method presented the worst predictive performance, which was expected, since it does not learn over time. However, it is a good baseline to measure how much time each of the other methods takes to learn new instances.

The methods kNN\(_{W}\) and kNN\(_{WA}\) present similar accuracy rates. However, kNN\(_{WA}\) has a higher cost, due to the use of ADWIN.

Finally, although SAM and SWC obtained similar predictive accuracies, in some cases SWC was better. Besides, SWC is faster, due to the use of only one sliding window with compressed concepts and relevant instances.

Table 2. Accuracy and time cost (in seconds) for each method.

To assess the statistical significance of these results, the Friedman rank sum test combined with the Nemenyi post-hoc test [5], both at a significance level of 5%, was applied. A p-value of 0.000441 was obtained in the Friedman test, showing a significant difference among the five methods. Additionally, the Nemenyi post-hoc test, Table 3, showed a statistically significant difference for the pair SWC\(\succ \)kNN. There is no significant difference between the remaining pairs. However, we emphasize that SWC\(\succ \)kNN\(_{W}\) and SWC\(\succ \)kNN\(_{WA}\) have relatively low p-values (less than 10%).
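This kind of analysis can be reproduced in outline with SciPy and the scikit-posthocs package, as sketched below with random placeholder accuracies standing in for the values of Table 2.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Placeholder accuracies standing in for Table 2:
# rows are datasets, columns are kNN, kNN_W, kNN_WA, SAM, SWC.
rng = np.random.default_rng(0)
acc = rng.uniform(0.6, 0.9, size=(6, 5))

stat, p = friedmanchisquare(*acc.T)  # omnibus test over the five methods
print(f"Friedman p-value: {p:.6f}")
if p < 0.05:
    # Pairwise Nemenyi post-hoc comparisons at the 5% level.
    print(sp.posthoc_nemenyi_friedman(acc))
```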

Table 3. P-values obtained for the multiple comparison post-hoc Nemenyi test.

6 Conclusion and Future Work

This paper presented a new method, SWC, based on k-Nearest Neighbors, which implements a sliding window that stores fewer training instances than related methods. SWC stores in the sliding window clusters and instances close to the uncertainty border of each class. The clusters are compressed stable concepts and the instances represent possible drifts of these concepts.

Considering accuracy, time and storage cost, SWC was experimentally compared with state-of-the-art related methods. According to the experimental results, SWC presented higher predictive performance, with lower processing and memory costs than the compared methods.

As future work, the authors want to distinguish concept drift from novelty detection and to study an efficient alternative for discarding outdated information. Besides, they intend to include an unsupervised concept drift tracker.