
1 Introduction

The classification of data streams with concept drift is one of the main challenges of data mining [1,2,3]. In various real-world applications, including network intrusion detection, spam filtering, and credit card fraud detection [4], the cost and time of labeling make it unrealistic for experts to label all instances. Therefore, a semi-supervised classification algorithm that can handle concept drift plays a critical role in data stream mining.

To overcome these challenges, many methods have been reported in recent years, but they still have some limitations. Firstly, many existing methods [5, 6] use clustering to label the unlabeled instances, which requires the number of clusters to be assigned in advance and kept constant while the data stream is processed. Secondly, many methods ignore the impact of high-density clusters on concept drift detection [5, 7].

Specifically, in SUN [5], K-modes is first used to form clusters and label the unlabeled data. However, K-modes requires the number of clusters to be set in advance and kept unchanged during the processing of the data stream, which is unreasonable in many real-world applications: owing to the dynamic nature of data streams, new concept clusters may appear while old ones disappear. Secondly, the average deviation between the historical and new concept clusters is adopted to detect concept drifts, which ignores changes in high-density clusters, even though such changes are more likely to result in concept drifts.

In light of these limitations, we propose an approach for semi-supervised classification of data streams based on adaptive density peak clustering (SSCADP). The framework of SSCADP is the same as that of SUN, with two main differences.

Firstly, to generate concept clusters at the leaves of a Hoeffding tree, a density peak clustering method [8] and a change detection technique [9] are combined to adaptively locate the cluster centers, instead of using K-modes. Secondly, we consider that changes in clusters with higher density are more likely to reflect changes in the data distribution; hence, an improved detection method based on SUN is proposed, in which an adaptive weighted average strategy assigns higher weights to higher-density clusters.

The rest of this paper is organized as follows. Section 2 presents related work. Section 3 describes the proposed algorithm in detail, including adaptively locating the cluster centers, labeling the unlabeled samples, and the concept drift detection method. Experiments and results are presented in Sect. 4. Finally, Sect. 5 summarizes this paper and outlines future work.

2 Related Work

The proposed approaches for semi-supervised classification of data streams with concept drift can be broadly divided into decision tree-based and non-decision tree-based methods. Decision tree-based methods such as SUN [5] and REDLLA [7] adopt a Hoeffding tree as the base classifier. During the construction of the base classifier, unlabeled data are labeled by clustering at the leaves and then used for concept drift detection and for updating the base classifier; concept drifts are detected based on the deviation between historical concept clusters and new ones. Others, like Sco-forest [10], extend the Co-forest algorithm to handle evolving data streams: the concomitant ensemble selects samples with high classification confidence and labels them to update the corresponding base classifier. If a concept drift is detected by Adwin2 [11], the base classifier with the worst accuracy is discarded.

The non-decision tree-based methods usually use cluster models for data stream classification. Reasc [12] maintains an ensemble of cluster-based classifiers; when the ensemble is updated, the cluster-based classifier with the worst accuracy is removed. SPASC [6] maintains a classifier pool composed of weighted cluster-based models, whose weights are adjusted adaptively according to the correctness of classification. SCBELS [13] maintains an ensemble of cluster-based classifiers constructed by BIRCH [14] and incrementally updated during the classification of the data stream; local structural information of the data is taken into account to deal with concept drifts. [15] proposed to dynamically maintain a set of micro-clusters: each instance is used to update the model, and outdated or low-reliability micro-clusters are removed to adapt to the evolving concepts of data streams.

Among other models, ECU [16] constructs an ensemble model that combines both classifiers and clusters for classification. [17] proposed a neural network framework for streaming data classification in which each layer consists of a generative network, a discriminant structure, and a bridge.

3 Proposed Algorithm

3.1 The Framework of the Proposed Algorithm

In this paper, a data stream is represented as \({{\textit{\textbf{D}}}}=\{{{\textit{\textbf{D}}}}^{0},{{\textit{\textbf{D}}}}^{1},{{\textit{\textbf{D}}}}^{2},\ldots ,{{\textit{\textbf{D}}}}^{t},\ldots \}\), in which \({{\textit{\textbf{D}}}}^{t}=\{{{\textit{\textbf{x}}}}_{1}^{t},{{\textit{\textbf{x}}}}_{2}^{t},\ldots ,{{\textit{\textbf{x}}}}_{m}^{t}\}\) denotes the data batch collected at time t. SSCADP is described in Algorithm 1. It employs a Hoeffding tree as its base classifier. After \({{\textit{\textbf{D}}}}^{t}\) is classified by the Hoeffding tree, each instance in it is sorted into a leaf and the corresponding statistics of the leaf are updated. When the number of instances that have arrived at the tree reaches dp, the labeled instances in each leaf l are utilized to label the unlabeled instances in it, and concept drift detection is then performed at the leaf. Next, a pruning strategy is applied to the Hoeffding tree, and if the number of instances in a new leaf reaches \(n_{min}\), the leaf attempts to split. After splitting, the updated Hoeffding tree is ready for the new concept.

Algorithm 1 (pseudocode not reproduced).
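To make the workflow above concrete, the following is a minimal Python sketch of the processing loop, assuming the batch/period settings described above. The HoeffdingTree class and all helper names are hypothetical placeholders, not the authors' implementation; the labeling and drift detection steps are detailed in Sects. 3.2 and 3.3.

```python
# Hypothetical sketch of the SSCADP processing loop (Algorithm 1).
# HoeffdingTree and every helper name are assumed, not the authors' code.
def sscadp_stream(batches, dp=200, n_min=200):
    tree = HoeffdingTree()                       # base classifier
    arrived = 0
    for batch in batches:                        # batch D^t at time t
        preds = [tree.classify(x) for x in batch]
        for x in batch:
            leaf = tree.sort_to_leaf(x)          # route x to a leaf
            leaf.update_statistics(x)
        arrived += len(batch)
        if arrived >= dp:                        # detection period reached
            arrived = 0
            for leaf in tree.leaves():
                label_unlabeled(leaf)            # Sect. 3.2
                leaf.drift_detected = detect_drift(leaf)  # Sect. 3.3
            tree.prune_drifted_subtrees()        # bottom-up pruning
            for leaf in tree.leaves():
                if leaf.n_instances >= n_min:
                    leaf.try_split()             # Hoeffding split test
        yield preds
```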

3.2 Adaptively Locate Cluster Centers and Label Unlabeled Instances

When a detection period is reached, a clustering method named clustering by fast search and find of density peaks (CFSDP) [8] and a change detection technique [9] are combined to adaptively locate the cluster centers. After the concept clusters at each leaf are created, graph-based label propagation [18] is applied to label the unlabeled data in each cluster. If a cluster contains no labeled instance, all unlabeled instances in it are assigned the majority label of the closest cluster.
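As an illustration of the per-leaf labeling step, the sketch below substitutes scikit-learn's LabelPropagation for the graph-based method of [18]; it is a stand-in under that assumption, not the exact algorithm used in the paper.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

def label_leaf(X, y):
    """Label the unlabeled instances at a leaf.
    X: instance matrix; y: labels, with -1 marking unlabeled instances."""
    model = LabelPropagation(kernel='rbf')  # stand-in for the method of [18]
    model.fit(X, y)                # propagate labels over a similarity graph
    return model.transduction_     # inferred labels for every instance
```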

The basic idea of CFSDP is that cluster centers should lie in regions of high data density and be far away from each other. CFSDP is based on two quantities: (1) \(\rho _{i}\), the local density of the i-th instance; and (2) \(\delta _{i}\), the minimum distance between the i-th instance and any instance with higher density. \(\rho _{i}\) and \(\delta _{i}\) are defined as

$$\begin{aligned} \rho _{i}=\sum \limits _{j\in I_{s}\setminus \{i\}}\exp \{-(d_{ij}/d_{c})^{2}\}, \quad \delta _i = \mathop {\min }\limits _{j:\rho _j > \rho _i} (d_{ij}). \end{aligned}$$
(1)

where \(d_{ij}\) is the Euclidean distance between the i-th and j-th instances, \(d_{c}\) is the cutoff distance, set empirically in the same way as in CFSDP, and \(I_{s}\) is the set of all instance indexes. A cluster center is thus characterized by both \(\rho _{i}\) and \(\delta _{i}\) being as large as possible. The \( \rho _{i}-\delta _{i}\) plot (decision graph) provides a visual way to determine the number of clusters.
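A direct implementation of Eq. (1) might look as follows; this is a sketch, with the convention of assigning the highest-density point the maximum distance as its \(\delta \), as in CFSDP [8].

```python
import numpy as np

def density_and_delta(X, d_c):
    """Compute rho_i and delta_i of Eq. (1) for instance matrix X."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise d_ij
    rho = np.exp(-(d / d_c) ** 2).sum(axis=1) - 1.0  # subtract exp(0) for j = i
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        # highest-density point: delta set to the maximum distance, as in CFSDP
        delta[i] = d[i, higher].min() if higher.any() else d[i].max()
    return rho, delta
```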

In Function 1, \({{\gamma }_{i}}={{\rho }_{i}}{{\delta }_{i}}\) is computed for each instance and all \({\gamma }_{i}\) are sorted in ascending order. CFSDP assumes that the sorted \({\gamma }_{i}\) values follow a power-law distribution, and hence a jump point in the sorted sequence can be found.

Function 1 (pseudocode not reproduced).

After the cluster centers are determined, clustering is performed and label propagation is then conducted.

To find the jump point of the sorted \({\gamma }_{i}\), we adopt the idea of change detection in SAND [9], which detects the location of the most significant change in a series of values along a direction. We further assume that the sorted \(\gamma _{i}\) follow a Pareto distribution. Once the jump point is found, the number of clusters is determined. As shown in Fig. 1, the point in the red circle is the jump point, and the points above it are the center points.

The probability density function of the Pareto distribution can be expressed as \(f(x,a,k)=a{{k}^{a}}{{x}^{-(a+1)}}\), where a is the shape parameter and k is the scale parameter. The corresponding log probability density function is \(\log f(x,a,k)=a\log (k)+\log (a)-(a+1)\log (x)\). Define a random variable \(\gamma \sim \mathrm{Pareto}(a,k)\), with \(\{\gamma _{i}\}\) the observed values of \(\gamma \). The maximum likelihood estimates of the parameters are calculated as follows, where N is the total number of observed values.

$$\begin{aligned} {\hat{k}_{MLE}} = \mathop {\min }\limits _{1 \le i \le N} \{ {\gamma _i}\}, \ {\hat{a}_{MLE}} = N/\sum \nolimits _{i = 1}^N {(\ln {\gamma _i} - \ln {{\hat{k}}_{MLE}})} \end{aligned}$$
(2)
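In code, the estimates of Eq. (2) and the log-likelihood under the fitted model can be computed as follows; this sketch is reused by the jump-point search below.

```python
import numpy as np

def pareto_mle(gamma):
    """Maximum likelihood estimates of Eq. (2) for Pareto(a, k)."""
    gamma = np.asarray(gamma, dtype=float)
    k_hat = gamma.min()
    # epsilon guards a degenerate window of identical values
    a_hat = len(gamma) / (np.sum(np.log(gamma) - np.log(k_hat)) + 1e-12)
    return a_hat, k_hat

def pareto_loglik(gamma, a, k):
    """Sum over observations of log f(x,a,k) = a*log(k) + log(a) - (a+1)*log(x)."""
    gamma = np.asarray(gamma, dtype=float)
    return len(gamma) * (a * np.log(k) + np.log(a)) - (a + 1) * np.sum(np.log(gamma))
```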
Fig. 1. Jump point detection.

To detect the jump point, Function 2 divides the sorted \(\gamma \) sequence into two sub-windows at index k, for k from \(N/2\) to \(N-3\). The k value that maximizes the statistical difference between the two sub-windows is taken as the index of the jump point.
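The text does not spell out the difference statistic, so the sketch below assumes it is the gain in Pareto log-likelihood when the two sub-windows are fitted separately rather than jointly, in the spirit of the SAND change detector [9]; that choice is an assumption. It reuses the pareto_mle and pareto_loglik helpers sketched after Eq. (2).

```python
import numpy as np

def find_jump_point(gamma_sorted):
    """Return the assumed jump-point index in the ascending-sorted gamma values."""
    N = len(gamma_sorted)
    a0, k0 = pareto_mle(gamma_sorted)
    base = pareto_loglik(gamma_sorted, a0, k0)       # single-model fit
    best_k, best_gain = None, -np.inf
    for k in range(N // 2, N - 3):                   # split range from the text
        left, right = gamma_sorted[:k], gamma_sorted[k:]
        a_l, k_l = pareto_mle(left)
        a_r, k_r = pareto_mle(right)
        gain = (pareto_loglik(left, a_l, k_l)
                + pareto_loglik(right, a_r, k_r) - base)
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k    # instances above this index become cluster centers
```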

Function 2 (pseudocode not reproduced).

3.3 Concept Drift Detection

A concept drift detection method is installed at each leaf. Before introducing the details, several variables need to be defined. \({r_{hist}}\) and \({r_{new}}\) denote the radii of the set of historical concept clusters \(C_{hist}\) and of the new ones \(C_{new}\), and \(n_{hist}\) and \(n_{new}\) denote the numbers of clusters in \(C_{hist}\) and \(C_{new}\). \(r_k\) denotes the radius of a cluster and is computed as the average Euclidean distance from all instances in the cluster to its center: \({r_k} = \sum \nolimits _{i = 1}^{\left| {{C_k}} \right| } {\sqrt{\sum \nolimits _{j = 1}^{D} {({c_{kj}} - {x_{ij}}} {)^2}} } /\left| {{C_k}} \right| \), where \( {x_i} = \{ {x_{i1}}, {x_{i2}}, \ldots , {x_{iD}}\} \in {C_k}\) is the i-th instance in cluster \(C_{k}\), D is the attribute dimension, \(c_{k}\) is the cluster center of \(C_{k}\), and \(|C_{k}|\) is the total number of instances in \(C_{k}\). \({r_{hist}}\mathrm{{ = }}\sum \nolimits _{i = 1}^{n_{hist}} {{r_i}} / n_{hist}\), and \(r_{new}\) is calculated in the same way. dist measures the average distance between the two concept cluster sets: \(dist = (\sum \nolimits _{i = 1}^{{n_{new}}} {\min [\sqrt{\sum \nolimits _{j = 1}^D {{{({c_{ij}} - {c_{kj}})}^2}} } ,{c_k} \in {C_{hist}},1 \le k \le {n_{hist}}])/{n_{new}}}\), where \({c_k}\) and \({c_i}\) denote cluster centers in \({C_{hist}}\) and \({C_{new}}\), respectively. In SUN, a value of dist greater than \(\mathrm{max}(r_{hist},r_{new})\) indicates concept drift.
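For concreteness, the cluster radius \(r_k\) and SUN's average distance between cluster sets can be computed as below; a sketch in which clusters are represented by their instance matrices and center vectors.

```python
import numpy as np

def cluster_radius(X, center):
    """r_k: mean Euclidean distance from a cluster's instances to its center."""
    return np.linalg.norm(X - center, axis=1).mean()

def avg_set_distance(centers_new, centers_hist):
    """SUN's dist: each new center's distance to its closest historical
    center, averaged over the n_new new clusters."""
    d = np.linalg.norm(centers_new[:, None, :] - centers_hist[None, :, :], axis=-1)
    return d.min(axis=1).mean()
```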

In our algorithm, however, it is assumed that changes of points with higher density are more likely to reflect changes in the data distribution. Therefore, dist is redefined as \(dist = \sum \nolimits _{i = 1}^{{n_{new}}} {{w_i}dis{t_i}}\) to capture the change in the data distribution more accurately, where \(dist_i\) and \(w_i\) are calculated by (3) and (4), respectively. \(dist_i\) is the distance of the cluster center \(c_{i}\) to \(C_{hist}\). \({\rho _{{c_i}}}\) is the average density of cluster \(C_{i}\) and of \(C_{k}\), the closest cluster to \(C_{i}\) in \(C_{hist}\); \({\rho _{{c_{ij}}}}\) is the density of the j-th instance belonging to \(C_{i}\); and \(n_{i}\) and \(n_{k}\) are the total numbers of instances in \(C_{i}\) and \(C_{k}\), respectively. Hence, larger \(w_i\) and \(dist_i\) indicate concept drift. As in SUN, if the value of dist exceeds \(\mathrm{max}(r_{hist},r_{new})\), a real concept drift is considered to have occurred. The process of drift detection is described in Function 3.

(3)
(4)
Function 3 (pseudocode not reproduced).

dist can be utilized to detect concept drifts caused by a change of \(\mathrm {P}(x)\). In addition, since concept drift can also be caused by a change in the distribution of class labels, the class labels of the historical concept clusters and the new ones are compared when no drift is detected by dist. If the class labels of \(C_{hist}\) and \(C_{new}\) are completely opposite, this is also defined as a real concept drift.
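Equations (3) and (4) are not reproduced above, so the following sketch only mirrors the surrounding text: \(dist_i\) is taken as the distance from \(c_i\) to its closest historical center, and \(w_i\) is assumed to be the density \(\rho _{c_i}\) normalized to sum to one; both forms are assumptions, as is the pairing of each new cluster with its closest historical cluster in the label-flip check.

```python
import numpy as np

def drift_detected(centers_new, centers_hist, rho_new, r_hist, r_new,
                   labels_new, labels_hist):
    d = np.linalg.norm(centers_new[:, None, :] - centers_hist[None, :, :], axis=-1)
    dist_i = d.min(axis=1)                 # distance of each c_i to C_hist
    w = rho_new / rho_new.sum()            # ASSUMED form of Eq. (4)
    dist = float(np.sum(w * dist_i))       # weighted average of Sect. 3.3
    if dist > max(r_hist, r_new):
        return True                        # drift caused by a change of P(x)
    # drift caused by a label flip: compare each new cluster's label with
    # that of its closest historical cluster (this pairing is an assumption)
    nearest = d.argmin(axis=1)
    return all(labels_new[i] != labels_hist[j] for i, j in enumerate(nearest))
```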

After a bottom-up search finds all drift leaves, a pruning method adjusts the tree to cope with concept drifts. Each level of the tree is traversed once, checking each leaf for concept drift from bottom to top until the root is reached. If concept drift has been detected at all child nodes of a node, these child nodes are pruned, and the resulting new leaf node may split again later.
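A recursive sketch of this bottom-up pruning might look as follows; the node fields are hypothetical names.

```python
def prune_drifted(node):
    """Bottom-up pruning: prune a node's children when drift was detected at
    all of them; returns whether this node now counts as drifted."""
    if node.is_leaf():
        return node.drift_detected
    flags = [prune_drifted(child) for child in node.children]  # recurse first
    if all(flags):
        node.children = []            # prune: the node becomes a new leaf
        node.drift_detected = True    # so pruning can continue one level up
        return True
    return False
```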

4 Experiments

In this paper, all synthetic datasets are generated by MOA [19]. The suffixes xxx-abr, xxx-gra and xxx-inc denote the abrupt, gradual and incremental concept drift types, respectively. In the datasets with gradual drifts, it takes 5000 instances to change from one concept to another. In addition, to verify the performance of SSCADP on datasets where the number of clusters varies markedly, we generate a Gaussian dataset with dynamically changing clusters and concept changes. Figure 2(a), (b) and (c) show three different distributions that evolve along the time sequence, with positive instances as red '+' and negative instances as green '\(\times \)'; each subfigure contains 600 instances.

Table 1 shows the properties of all datasets. For Sea, four concepts are generated by setting \(\theta \) = 8, 9, 7, and 9.5. For Sine, the label is 0 if \(a\sin (b x_1+\theta )+c>x_2\) and 1 otherwise; four concepts are generated by setting \(a=b=1\), \(c=\theta =0\); the same setting with the class labels reversed; \(a=0.3\), \(b=3\pi \), \(c=0.5\), \(\theta =0\); and the same setting with the class labels reversed. For HyperPlane-abr and HyperPlane-gra, four concepts are generated by setting \(w^1= (0, 0.5, 0.5)\), \(w^2= (0, 1, 0)\), \(w^3= (1, 0, 0)\) and \( w^4= (0.5, 0, 0.5)\), while for HyperPlane-inc, \(d=10\). For Agrawal, functions 1, 2, 5 and 6 are selected as concepts 1, 2, 3 and 4. The real-world datasets Weather and Electricity are also used. For each synthetic dataset, 10 copies are randomly generated, while for each real dataset the labeled instances are randomly selected over 10 runs.
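As an illustration, one Sine concept could be generated as follows; a sketch in which the uniform [0, 1] attribute range follows MOA's Sine generator, and the flip flag produces the "class labels reversed" concepts.

```python
import numpy as np

def sine_concept(n, a=1.0, b=1.0, c=0.0, theta=0.0, flip=False, seed=None):
    """Generate n instances of one Sine concept: label 0 if
    a*sin(b*x1 + theta) + c > x2, otherwise 1; flip reverses the labels."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(0.0, 1.0, n)
    x2 = rng.uniform(0.0, 1.0, n)
    y = np.where(a * np.sin(b * x1 + theta) + c > x2, 0, 1)
    if flip:
        y = 1 - y
    return np.column_stack([x1, x2]), y
```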

In this paper, \(n_{min}\) is the minimum number of instances required before a leaf attempts a split test, and it is set to 200. dp is the detection period, with \(dp=200\) set empirically. \(\alpha =0.95\) is the confidence used in the clustering algorithm.

Fig. 2. Changes in clusters.

4.1 Experimental Results

SSCADP is compared with three baseline methods: SUN, SPASC, and Reasc. Three groups of experiments are conducted to evaluate the accuracy, the impact of the label ratio, and concept drift tracking. In the first and last groups, to simulate the limited labeled data available in real applications, the label ratio of all datasets is set to 0.1.

Accuracy. The accumulative accuracy is used to evaluate the performance of the baseline methods and SSCADP. Table 2 shows the detailed results. For all datasets, each result is obtained by averaging over 10 runs.

It can be observed that SSCADP performs better than the baseline algorithms on almost all datasets. The Friedman test is conducted on the results in Table 2, with the average ranks also shown in Table 2; SSCADP achieves the best rank. The test statistic is \(F_F=10.654\) and the critical value for \(\alpha =0.05\) is 2.892, so we can reject the null hypothesis that there is no difference among the performances of the algorithms. Furthermore, the Nemenyi test gives \(CD=1.35\), which means SSCADP performs significantly better than SUN and Reasc. There is no significant difference in performance between SSCADP and SPASC.
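The \(F_F\) statistic reported here is the Iman-Davenport correction of the Friedman statistic; a sketch of the computation follows, assuming an accuracy matrix with one row per dataset and one column per algorithm.

```python
from scipy.stats import friedmanchisquare

def iman_davenport_ff(acc):
    """acc: list of rows, one per dataset, each holding one accuracy per
    algorithm; returns the Iman-Davenport F_F statistic."""
    n_datasets, n_algos = len(acc), len(acc[0])
    chi2, _ = friedmanchisquare(*zip(*acc))   # one sample per algorithm
    return (n_datasets - 1) * chi2 / (n_datasets * (n_algos - 1) - chi2)
```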

Table 1. Properties of the datasets
Table 2. Accumulative accuracy (%) on all datasets.

Impact of Label Ratio. Considering the influence of the labeling ratio on classification accuracy, we compare the algorithms with the label ratio set to 0.05 and 0.2. Detailed results are shown in Table 3. In the 0.05 case, the Friedman test gives \(F_F=3.175\), with a critical value of 2.892 for \(\alpha =0.05\), indicating that the performances of the algorithms differ significantly. Furthermore, the Nemenyi test gives \(CD=1.35\), so SSCADP performs significantly better than SUN. These results indicate that even when very few labels are available, SSCADP achieves better performance, making it suitable for semi-supervised classification of data streams.

In the 0.2 case, the Friedman test gives \(F_F=6.567\), with a critical value of 2.892 for \(\alpha =0.05\), again indicating that the performances of the algorithms differ significantly. Furthermore, the Nemenyi test gives \(CD=1.35\), so SSCADP performs significantly better than SUN and Reasc. There is no significant difference in performance between SSCADP and SPASC in either the 0.05 or the 0.2 case.

Table 3. Impact of the label ratio on accumulative accuracy (%).

Concept Drift Tracking. The drift tracking performance of all algorithms on all datasets with abrupt drifts is shown in Fig. 3. The numbers at the bottom identify the concepts, and the vertical lines indicate the locations of concept drifts. On Sea, when new concepts arrive, the accuracy of SSCADP declines less in most cases, especially between concepts 3 and 4, which are quite different. On HyperPlane, SSCADP performs well in most cases except for concept 2. On Agrawal, the accuracy of SSCADP also declines less in most cases. On Sine, SSCADP does not perform well, since it is a single-model method and there are large differences between the concepts; SPASC, which adopts an ensemble model and can deal with recurring drifts, achieves better performance here.

Fig. 3. Drift tracking graph.

5 Conclusions

In this paper, we propose SSCADP to handle the semi-supervised classification of data streams. SSCADP adaptively locates cluster centers by considering both the local density of a point and its distance to points of higher density, and detects concept drifts by combining the density of clusters with the distance between the historical clusters and the new ones. Experimental results show that SSCADP achieves better performance than the baseline algorithms on most datasets. In the future, we will focus on how to effectively combine labeled and unlabeled data to detect concept drift, and explore how to deal with recurring concept drifts more effectively.