Introduction

Traditional computing technology focuses on quantitative, deterministic problems and is poorly suited to the imprecise and uncertain problems found in biological systems [1]. Cognitive computing, by contrast, is a data-centric computing model that aims to give machines the ability to learn by themselves and thereby support more natural human–computer interaction. Among the many kinds of data, the data stream is an important type in daily life, and it differs from the traditional notion of data. Traditional data is static, stored stably in a database, and can be processed many times. In a data stream, the data arrives consecutively and changes continually as it flows [2, 3]. Data streams are commonly described as real-time, continuous, ordered sequences; large volume and uncertain arrival rates are further characteristic features.

The key to cognitive computing is data. By analyzing and processing large amounts of data, a machine becomes more intelligent [4]. A representative cognitive computing system is the “Watson” supercomputer developed by IBM, which beat human contestants on the “Jeopardy” TV show in 2011. In many practical applications, the distribution of data varies over time. For example, in news, blogs, BBS and other online media, most of the topics people discuss change dynamically. Even for the same topic, such as “fashion” or “high tech,” the content is not exactly the same as it was a year ago. In Internet data analysis this is known as concept drift, and it poses profound challenges for machine learning. Traditional clustering algorithms cannot meet the clustering requirements of dynamic data streams [5, 6]. On the one hand, for fitting or predicting future data, a learner trained on historical data cannot be applied directly to future data as in traditional learning problems, because the independent and identically distributed (i.i.d.) assumption does not hold; on the other hand, from a modeling point of view, the probability of the sample set cannot simply be written as the product of the individual sample probabilities, again because the i.i.d. assumption fails [7]. To cluster data from data streams, traditional theories and methods must be modified and improved, or new clustering algorithms proposed.

After years of research, the analysis of data stream clustering has made great progress. Many low-dimensional and small complex data stream problems have been deeply studied [8, 9]. In 2-D computer-assisted animation production, Yu Jun et al. present a semi-supervised patch alignment framework, which introduces pair-wise constraints to improve the performance of correspondence construction [10, 11]. Manifold learning based on graph is a promising method in extracting features from images. As the density of data points’ distribution may be different in different regions, Yu et al. [12] develop a novel sparse patch alignment strategy for the embedding of data lying in multiple manifolds. Table 1 shows the comparison of existing data stream clustering algorithms.

Table 1 Comparison of data stream clustering algorithms

However, there are still many difficulties to be resolved in irregular mixed high-dimensional data streams [20]. Such complex hybrid data streams bring new requirements for data stream clustering: First, as hybrid data streams contain a lot of noise data, how to filter out the noise and identify the correct data becomes more difficult [21]; second, the distribution of data streams is irregular, which requires more refined methods to describe the data structure [22]. In addition, the number of generated clusters is unknown in data streams. The cluster number will change along with the ever-changing data flows, which also increases the clustering complexity of uncertain data streams [23, 24].

Faced with these challenges, this paper mainly focuses on the analysis of irregular data streams and proposes an adaptive density data stream clustering algorithm—ADStream algorithm. ADStream combines the advantages of density clustering [25] and affinity propagation clustering [26]. In this algorithm, the initial cluster centers are determined by an improved affinity propagation method, avoiding the negative influence of random initialization. Besides, ADStream is adaptive to the density characteristics of data streams and can well deal with the irregular data such as noise or outliers.

The rest of this paper is organized as follows. Section 2 introduces the related concepts of ADStream algorithm. Section 3 presents the detailed framework of adaptive density data stream clustering algorithm. Section 4 provides experiments of data stream clustering on MOA platform and real-world machine learning data sets. Conclusions are drawn in Sect. 5.

Related Concepts of ADStream Algorithm

In many practical applications, the state of a data stream varies over time, so the cluster number at each time point is hard to predict. The knowledge contained in recent tuples is often more valuable than that in historical tuples. Therefore, in this research, an attenuation window model is used to record the data stream as time goes by. Table 2 lists some important notations used in this paper.

Table 2 Important notations used in this paper

Definition 1

Data weight: In the sliding window model, data points in the data stream change over time, and the fading function is \(f(t) = 1/{2^{\lambda t}} < \varepsilon\) (\(\lambda > 0\)). The larger the value of λ, the lower the importance of historical data. Let v be the arrival rate of the data stream, namely the number of data points captured by the window per unit time; then the data weight of the data stream can be expressed as

$$\omega = v \cdot \sum\limits_{t = 0}^{{t_c}} {{2^{ - \lambda t}}} = v \cdot \frac{{1 - {2^{ - \lambda ({t_c} + 1)}}}}{{1 - {2^{ - \lambda }}}} \approx \frac{v}{{1 - {2^{ - \lambda }}}}\quad \left( {{t_c}\,{\text{is the current time; the approximation holds for large}}\,{t_c}} \right)$$
(1)
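The sum in Eq. (1) is a geometric series, so it can be checked numerically. The following is a minimal sketch, not the authors' implementation; the function names are hypothetical. It evaluates the finite sum term by term and compares it with the large-\({t_c}\) limit \(v/(1 - {2^{ - \lambda }})\).

```python
def fading(t: float, lam: float) -> float:
    """Fading function f(t) = 2^(-lam * t), with lam > 0."""
    return 2.0 ** (-lam * t)


def stream_weight(v: float, lam: float, t_c: int) -> float:
    """Data weight v * sum_{t=0}^{t_c} f(t), summed term by term."""
    return v * sum(fading(t, lam) for t in range(t_c + 1))


def stream_weight_limit(v: float, lam: float) -> float:
    """Limit of the geometric sum as t_c grows: v / (1 - 2^(-lam))."""
    return v / (1.0 - 2.0 ** (-lam))
```

A larger λ makes the series converge faster, which matches the remark that historical data then carries less importance.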

Definition 2

Dimension radius: Let \({x_{ij}}\) be the value of data point \({x_i}\) on the jth dimension and \({r_j}\) the dimension radius on that dimension; then the division on the jth dimension of point \({x_i}\) is the interval \(\left( {{x_{ij}} - {r_j},{x_{ij}} + {r_j}} \right)\). \(R = ({r_1},{r_2}, \cdots ,{r_d})\) is the dimension radius vector composed of the d dimension radii.

Definition 3

Density units and aggregation blocks: In high-dimensional data streams, define a unit length for each dimension and divide the d-dimensional space into density units \(Den\left( o \right)\). A density unit is the hypercube that starts at a d-dimensional vector \({o_i}\) and extends one unit length along the positive direction of every dimension. This hypercube is called an aggregation block, denoted by \(Cub\left( {{o_i},\vec r} \right)\), where \(\vec r\) is the vector of unit lengths on each dimension.
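Definitions 2 and 3 can be illustrated with a small sketch. It assumes that density units tile the space on a regular grid anchored at the origin, which the text does not state explicitly; the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def within_radius(x: Sequence[float], y: Sequence[float],
                  radii: Sequence[float]) -> bool:
    """True if y lies in the open interval (x_j - r_j, x_j + r_j) on every dimension."""
    return all(abs(yj - xj) < rj for xj, yj, rj in zip(x, y, radii))


def density_unit_origin(x: Sequence[float],
                        unit: Sequence[float]) -> Tuple[float, ...]:
    """Start vector o of the density unit containing x.

    Assumes a regular grid anchored at the origin, with cell side
    unit[j] on dimension j (an illustrative assumption).
    """
    return tuple(math.floor(xj / uj) * uj for xj, uj in zip(x, unit))
```

For example, with unit lengths (0.5, 1.0) the point (0.7, 2.3) falls into the unit starting at (0.5, 2.0).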

Affinity propagation (AP) algorithm is a novel clustering algorithm [27]. It can adaptively find and update the central exemplars of data streams by transmitting information between data points. And the cluster number will be detected automatically without user specification. AP algorithm has two important parameters: damping factor \(\lambda\) and preference parameter \(p\). Selecting an appropriate value for damping factor is important to the clustering quality. In this section, an improved damping factor for AP algorithm is applied to the online process of data stream clustering.

Definition 4

Shrinkage factor and similarity measure: in order to accelerate the convergence of AP algorithm, we introduce the shrinkage factor \(\rho\) to update the messages passed between data points during the clustering [28]:

$$\rho = \frac{2}{{\left| {2 - \phi - \sqrt {{\phi^2} - 4\phi } } \right|}},\,\phi > 4$$
(2)
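As a quick numerical check of Eq. (2), the sketch below evaluates the shrinkage factor for a given \(\phi\). The function name is hypothetical.

```python
import math


def shrinkage_factor(phi: float) -> float:
    """Shrinkage (constriction) factor rho of Eq. (2); requires phi > 4."""
    if phi <= 4.0:
        raise ValueError("phi must be greater than 4")
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))
```

With \(\phi = 4.5\) the square root equals 1.5 and the formula gives \(\rho = 0.5\); with \(\phi = 4.1\) it gives roughly 0.73.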

There are two kinds of passing messages: “responsibility” and “availability.” Their updating formulas are as follows:

$${r^{(t)}}(i,j) = \left( {1 - \lambda } \right)\left( {S(i,j) - \mathop {\max }\limits_{k \ne j} \left\{ {{a^{(t - 1)}}(i,k) + S(i,k)} \right\}} \right) + \rho \times \lambda \times {r^{(t - 1)}}(i,j)$$
(3)
$${a^{(t)}}\left( {i,j} \right) = \left( {1 - \lambda } \right)\min \left\{ {0,{r^{(t - 1)}}\left( {j,j} \right) + \sum\limits_{k \notin \{ i,j\} } {\max \left\{ {0,{r^{(t - 1)}}\left( {k,j} \right)} \right\}} } \right\} + \rho \times \lambda \times {a^{(t - 1)}}\left( {i,j} \right)$$
(4)

where \(S = {r^{\left( t \right)}}\left( {i,j} \right) + {a^{\left( t \right)}}\left( {i,j} \right)\) is the pair-wise similarity of data points. For data \(x = \left( {{x_1},{x_2}, \ldots ,{x_d}} \right)\) and data \(y = \left( {{y_1},{y_2}, \ldots ,{y_d}} \right)\), \(\omega \left( {{x_i},{y_i}} \right)\) is the attribute weight between x and y.
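One iteration of the message updates in Eqs. (3) and (4) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the index restrictions (\(k \ne j\) in the maximum, \(k \notin \{ i,j\}\) in the sum) and the self-availability rule follow standard affinity propagation, and the damped term of the availability uses the previous availability, in line with the damping rule given in the convergence analysis. The function names are hypothetical, and at least two data points are required.

```python
from typing import List

Matrix = List[List[float]]


def update_responsibility(S: Matrix, A: Matrix, R_old: Matrix,
                          lam: float, rho: float) -> Matrix:
    """One step of Eq. (3): damped responsibility with shrinkage factor rho."""
    n = len(S)
    R_new = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Strongest competing candidate exemplar k != j for point i.
            best_other = max(A[i][k] + S[i][k] for k in range(n) if k != j)
            R_new[i][j] = ((1.0 - lam) * (S[i][j] - best_other)
                           + rho * lam * R_old[i][j])
    return R_new


def update_availability(R: Matrix, A_old: Matrix,
                        lam: float, rho: float) -> Matrix:
    """One step of Eq. (4): damped availability with shrinkage factor rho."""
    n = len(R)
    A_new = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Positive support for j from all points other than i and j.
            support = sum(max(0.0, R[k][j]) for k in range(n) if k not in (i, j))
            if i == j:
                raw = support  # self-availability, standard AP rule (assumption)
            else:
                raw = min(0.0, R[j][j] + support)
            A_new[i][j] = (1.0 - lam) * raw + rho * lam * A_old[i][j]
    return A_new
```

Setting `rho = 1.0` recovers the usual damped affinity propagation update, so the shrinkage factor only rescales the contribution of the previous iteration.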

As data streams are always changing, the initial clusters produced by AP algorithm also require constant maintenance and updating. Here a density-based clustering algorithm is used to merge or delete these generated clusters to capture the uneven density distribution of data streams.

Definition 5

Reference point of density cluster: Use the n representative points of the clusters calculated by AP algorithm as input to density-based clustering and generate m new clusters: \(\left( {{c_1},{c_2}, \ldots ,{c_m}} \right)\). The data is then recorded by m two-tuple structures \(\left( {{c_i},{k_i}} \right)\), where \({k_i}\) is the number of data points attached to \({c_i}\) in the cluster. \(\left( {{c_i},{k_i}} \right)\) is called the reference point of the density cluster.

Definition 6

Density micro-cluster: Assume the data stream objects \({x_{i1}},{x_{i2}}, \ldots ,{x_{in}}\) arrive at times \({t_{i1}},{t_{i2}}, \ldots ,{t_{in}}\) and fall into the density unit with \({o_i}\) as its starting point. At time \({t_i}\), the characteristics of the density unit are recorded in the structure \(\left( {{o_i},H,\overline {C{F^1}} ,\overline {C{F^2}} ,S,{t_i}} \right)\), where \(\overline {C{F^1}} = \sum\limits_{j = 1}^n {{x_{ij}}{2^{ - \lambda \left( {{t_i} - {t_{ij}}} \right)}}}\), \(\overline {C{F^2}} = \sum\limits_{j = 1}^n {x_{ij}^2{2^{ - \lambda \left( {{t_i} - {t_{ij}}} \right)}}}\), and \(S = {r^{\left( t \right)}}\left( {i,j} \right) + {a^{\left( t \right)}}\left( {i,j} \right)\). \(H\) is the frequency of the corresponding attributes of the data objects. When the data weight \(\omega > \xi\) and the data similarity \(S > \varepsilon\), the density unit becomes a density grid micro-cluster; when \(0 \leqslant \omega \leqslant \xi\) and \(0 \leqslant S \leqslant \varepsilon\), the density unit is called a candidate density micro-cluster.
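The faded sums \(\overline {C{F^1}}\) and \(\overline {C{F^2}}\) of Definition 6 can be computed directly from the stored points and their arrival times. The sketch below is a hypothetical helper for illustration; it recomputes the sums from scratch rather than maintaining them incrementally.

```python
from typing import List, Sequence, Tuple


def faded_features(points: Sequence[Sequence[float]],
                   arrival_times: Sequence[float],
                   t_now: float,
                   lam: float) -> Tuple[List[float], List[float]]:
    """Return (CF1, CF2): per-dimension faded linear and squared sums."""
    d = len(points[0])
    cf1 = [0.0] * d
    cf2 = [0.0] * d
    for x, t in zip(points, arrival_times):
        w = 2.0 ** (-lam * (t_now - t))  # fading weight of this point
        for k in range(d):
            cf1[k] += w * x[k]
            cf2[k] += w * x[k] ** 2
    return cf1, cf2
```

For instance, with \(\lambda = 1\), points (1, 2) and (3, 4) arriving at times 0 and 1, and current time 1, the weights are 0.5 and 1, giving \(\overline {C{F^1}} = (3.5, 5.0)\) and \(\overline {C{F^2}} = (9.5, 18.0)\).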

To simplify the algorithm’s calculation complexity, the unit length of each dimension is determined according to the value of parameter \(\varepsilon\) in the literature [10]. \(\varepsilon\) is a threshold and \(\varepsilon \in \left[ {0,1} \right]\). We also introduce this method to our algorithm. If the data similarity in each dimension is not less than \(\varepsilon\), the data similarity on global attributes satisfies the same condition.

Adaptive Density Data Stream Clustering Algorithm

In order to deal with irregular complex data streams, we improve the density-based DenStream algorithm and propose an Adaptive Density data Stream clustering algorithm (ADStream). ADStream algorithm introduces two main concepts: “density micro-cluster” and “time frame” structure. It divides the data stream clustering process into the online part (micro-clustering) and off-line part (macro-clustering). The online part handles the newly arrived data in real time and stores these statistical results periodically; the off-line part uses these statistical results, combined with user-entered parameters, to approximately calculate the clustering results of a certain time in the past.

Online Clustering Stage

Newly arrived data objects of the data stream are assigned to the corresponding grid according to their attributes. In each grid, the similarity of the data and the weights of the attributes in the density unit are updated by AP algorithm. Based on the similarity matrix and the calculated weights, a density clustering algorithm determines the density micro-clusters and candidate density micro-clusters. Because a data stream is potentially unbounded and is updated in real time, a method for classifying the most recent data objects is essential. When a data point arrives, we determine from its attribute values whether to incorporate it into a density micro-cluster or a density unit. For a density micro-cluster that receives new data, its eigenvector should be updated to \(\left( {{o_i},H + h,\overline {C{F^1}} + x,\overline {C{F^2}} + {x^2},S + 1,{t_i}} \right)\). For a density unit that receives new data, its eigenvector also needs to be updated. Meanwhile, the algorithm checks whether the unit meets the conditions of a density micro-cluster or a candidate density micro-cluster and makes the appropriate changes.

As new data continuously arrives from the data stream, the number of density units, density micro-clusters and candidate density micro-clusters grows, and more storage space is needed to record them. However, most data received long ago is of little use, because the information it carries decays over time. Such data can be deleted or recorded only approximately. There are two principles for reducing the stored data: (1) direct degradation to a candidate density micro-cluster: if the feature weights of a density micro-cluster fall below the threshold, it is directly degraded to a candidate micro-cluster; (2) modification of the eigenvectors of density micro-clusters: if a density unit or density micro-cluster receives no data update, its eigenvector is modified to \(\left( {{o_i},H,\overline {C{F^1}} \times {2^{ - \lambda \Delta t}},\overline {C{F^2}} \times {2^{ - \lambda \Delta t}},S \times {2^{ - \lambda \Delta t}},{t_i}} \right)\), where \(\Delta t\) is the time interval between eigenvector modifications. To reduce resource consumption, generally let \(\Delta t = \frac{1}{\lambda }\log \left( {\frac{\xi }{{\xi - 1}}} \right)\). If a density micro-cluster receives no data updates for a very long time, its eigenvector keeps attenuating until the feature weights fall below the threshold, so the density micro-cluster gradually degrades into a candidate density micro-cluster.
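The attenuation step and the check interval can be sketched as follows. The logarithm base is not stated in the text; base 2 is assumed here so that it matches the fading function \({2^{ - \lambda t}}\). Under that assumption \(\Delta t\) satisfies \(\xi \cdot {2^{ - \lambda \Delta t}} = \xi - 1\), that is, it is the time a weight at the threshold takes to lose one unit. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def check_interval(lam: float, xi: float) -> float:
    """Delta t = (1 / lam) * log2(xi / (xi - 1)); base 2 is an assumption."""
    if xi <= 1.0:
        raise ValueError("xi must be greater than 1")
    return math.log2(xi / (xi - 1.0)) / lam


def decay_features(cf1: List[float], cf2: List[float], s: float,
                   lam: float, dt: float) -> Tuple[List[float], List[float], float]:
    """Attenuate CF1, CF2 and S by the factor 2^(-lam * dt)."""
    factor = 2.0 ** (-lam * dt)
    return [c * factor for c in cf1], [c * factor for c in cf2], s * factor
```

With the experimental settings \(\lambda = 0.001\) and \(\xi = 5\), this gives \(\Delta t \approx 322\) time units.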

To save memory and retain the most recently arrived information, the algorithm deletes density units as follows. When there are many density units, calculate the similarity S and weight \(\omega\) of each one; record a density unit that meets both the similarity and weight conditions as a density micro-cluster, record a density unit that meets only the weight condition as a candidate density micro-cluster, and delete the density units that meet neither. For the candidate density micro-clusters, if no new data arrives for a period of time, sort them by the weight \(\omega\) of their eigenvectors and delete them in ascending order of weight until the memory requirements are met.

Off-line Clustering Stage

At the off-line clustering stage, the connected density units are obtained by the density algorithm. Interconnected density units can be merged into one clustering unit, whose combined attributes can be represented by the geometric distance of the two density units [34-36]. Two density units are regarded as connected, and can be combined, only if their attributes still meet the threshold conditions after merging. In the algorithm, connectivity is transitive: for density units \({d_i}\), \({d_j}\) and \({d_k}\), if \({d_i}\) and \({d_j}\) are connected and \({d_j}\) and \({d_k}\) are connected, then \({d_i}\) and \({d_k}\) are connected.
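Because connectivity is transitive, grouping connected density units amounts to finding connected components. One common way to do this is a union-find structure; the sketch below is an illustration of that technique, not necessarily how the authors implement the merge, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def merge_connected(n_units: int,
                    connected_pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group unit indices 0..n_units-1 into transitively connected clusters."""
    parent = list(range(n_units))

    def find(a: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j in connected_pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    groups: Dict[int, List[int]] = {}
    for u in range(n_units):
        groups.setdefault(find(u), []).append(u)
    return sorted(groups.values())
```

Here the pairs passed in are assumed to have already passed the threshold test on the merged attributes.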

Concrete Steps of ADStream Algorithm

Algorithm name: ADStream data stream clustering algorithm

Algorithm input: Data Flow D, Parameter \(\lambda\), \(\xi\), \(\varepsilon\)

Algorithm process:

(1) procedure ADStream(\(D\), \(\lambda\), \(\xi\), \(\varepsilon\))
(2)   time = 0 // initial time
(3)   while (\(D \ne NULL\))
(4)     Read the initial data of a period of time \({D_1}\) and construct the initial data similarity matrix;
(5)     Use AP algorithm for the initial clustering of the initial data and initialize a cache area;
(6)     Calculate the density and eigenvector of each cluster built at initial clustering;
(7)     Read each dimension attribute of new data \(x = \left( {{x_1},{x_2}, \ldots ,{x_d}} \right)\) and use AP algorithm to calculate the similarity between data; put the data into the clusters of step (5); if they do not meet the conditions for incorporation into the existing clusters, put them into a temporary buffer;
(8)     Modify and update the eigenvector of each cluster;
(9)     if (the number of density units reaches the limit)
(10)      Calculate similarity \(S\) and weights \(\omega\) of all density units;
(11)      if (\(S > \varepsilon \& \& \omega > \xi\))
(12)        The density unit is a density micro-cluster;
(13)      else if (\(\omega > \xi\))
(14)        The density unit is a candidate density micro-cluster;
(15)      else
(16)        Remove density unit and reclaim memory;
(17)      end if
(18)    if (\(\left( {{t_i}\bmod \Delta t} \right) = = 0\)) // check the attenuation of density unit
(19)      if (\(\omega < \xi\))
(20)        Remove density unit and reclaim memory;
(21)      end if
(22)    end if
(23)    for (i = 0; i < number of density units; i++)
(24)      for (j = 0; j < number of density units; j++)
(25)        if (i and j are adjacent)
(26)          Calculate the similarity \(S\) and weights \(\omega\) of the merged density units;
(27)          if (\(S > \varepsilon \& \& \omega > \xi\))
(28)            Merge density unit i and density unit j;
(29)          end if
(30)        end if
(31)      end for
(32)    end for
(33)  Output the clustering results
(34) end procedure
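The pruning branch in steps (9) to (17) can be sketched as a small function. It assumes that each density unit has already been summarized by a similarity value \(S\) and a weight \(\omega\); the dictionary layout and the function name are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple


def prune_units(units: Dict[Hashable, Tuple[float, float]],
                xi: float, eps: float) -> Tuple[List[Hashable], List[Hashable]]:
    """Split units into (density micro-clusters, candidate micro-clusters).

    `units` maps a unit key to (S, omega). Units meeting neither
    condition are dropped, mirroring step (16).
    """
    micro: List[Hashable] = []
    candidate: List[Hashable] = []
    for key, (s, w) in units.items():
        if s > eps and w > xi:
            micro.append(key)        # steps (11)-(12)
        elif w > xi:
            candidate.append(key)    # steps (13)-(14)
        # otherwise the unit is removed and its memory reclaimed
    return micro, candidate
```

For example, with \(\xi = 5\) and \(\varepsilon = 0.5\), a unit with \((S,\omega ) = (0.8, 6)\) is kept as a density micro-cluster, one with \((0.3, 7)\) becomes a candidate, and one with \((0.9, 2)\) is removed.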

Convergence and Computational Complexity Analysis of ADStream Algorithm

Although new data arrives continuously in a data stream, the total number of data sub-blocks waiting to be clustered is finite. We therefore only need to show that the clustering on each subset converges; it then follows that the algorithm converges over the whole clustering process [29]. In the online clustering stage of ADStream, AP algorithm is used to calculate the similarities of newly arrived data objects and the weights of attributes. The similarity matrix and weights are then used to identify density micro-clusters and candidate density micro-clusters. In each iteration t of AP algorithm, the responsibility \({r^{(t)}}\) and availability \({a^{(t)}}\) are combined in a weighted update with their values from the previous iteration, \({r^{(t - 1)}}\) and \({a^{(t - 1)}}\): \({r^{(t)}} = (1 - \lambda ){r^{(t)}} + \lambda {r^{(t - 1)}}\), \({a^{(t)}} = (1 - \lambda ){a^{(t)}} + \lambda {a^{(t - 1)}}\) (where \(\lambda \in \left[ {0,1} \right]\), with default value 0.5), which reflects the effect of the damping factor λ. Another role of λ is to improve convergence: when AP algorithm exhibits numerical oscillations (the number of produced clusters swings during the iterative process) and fails to converge, increasing λ can eliminate the oscillation [26].

When traditional AP algorithm falls into an oscillation, λ has to be increased manually and the program rerun until the algorithm converges. Another way to avoid oscillations is to set λ close to 1 directly, but then the updates of responsibility \({r^{(t)}}\) and availability \({a^{(t)}}\) become very slow, which increases the iteration number and running time of the algorithm. Maurice Clerc’s research shows that using a constriction factor can effectively ensure algorithm convergence [30]. Therefore, we introduce the constriction factor (formula 2) into the updating formulas of responsibility and availability. When an oscillation occurs, the damping factor is adjusted automatically to help the algorithm escape it. Oscillation detection is critical to this adaptive damping technique, but the characteristics of oscillation are very difficult to describe. We therefore define non-oscillation characteristics instead, which are easier to describe: the number of generated cluster exemplars declines or stays unchanged during the iterative process (this is also the behavior of an algorithm moving toward convergence). To record the occurrences of non-oscillation characteristics in the iterative process, we design a movable monitoring window \({K_b}(j)\) (j = 1,2,…,t, where t is the window width), which records t consecutive iterations. For example, if the non-oscillation characteristic emerges in the ith iteration, \({K_b}(i) = 1\); otherwise, \({K_b}(i) = 0\). The criterion for judging whether an oscillation occurs is as follows: if the number of entries with \({K_b}(i) = 1\) in \({K_b}\) is smaller than two-thirds of the window width, an oscillation has emerged. This criterion is a tolerance design that allows for a few occasional oscillations and for the unstable stage at the beginning of the algorithm.
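The monitoring window can be sketched with a fixed-length queue. Two choices below are assumptions not specified in the text: the first recorded iteration, which has no predecessor, is counted as non-oscillating, and the two-thirds criterion is applied only once the window is full. The class name is hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class OscillationMonitor:
    """Sliding window K_b over the most recent `width` AP iterations."""

    def __init__(self, width: int) -> None:
        self.width = width
        self.flags: Deque[int] = deque(maxlen=width)
        self.prev_count: Optional[int] = None

    def record(self, n_exemplars: int) -> None:
        """Store 1 if the exemplar count declined or stayed unchanged, else 0."""
        steady = self.prev_count is None or n_exemplars <= self.prev_count
        self.flags.append(1 if steady else 0)
        self.prev_count = n_exemplars

    def oscillating(self) -> bool:
        """True if fewer than two-thirds of the full window are flagged 1."""
        if len(self.flags) < self.width:
            return False  # window not yet full (assumption)
        return 3 * sum(self.flags) < 2 * self.width
```

A caller would invoke `record` once per iteration with the current number of exemplars and adjust the damping when `oscillating()` returns true.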

Next is the computational complexity analysis of ADStream algorithm. Firstly, some symbols should be described in advance:

  • D: feature space dimensions of data samples;

  • S: the size of data sub-block arrived each time;

  • C: the number of data clusters that the whole data stream sample set contains;

  • s: the number of data sub-blocks that ADStream algorithm needs to traverse.

For a newly arrived data sub-block, ADStream algorithm iteratively calculates the pair-wise similarity \({s_{ij}}\), the cluster center \({c_{ik}}\), and the feature weighted coefficient \({\omega _{ik}}\) of the S newly arrived data samples in the sub-block. The computational complexity of this procedure is O((S + C)CD). If the maximum iteration number of ADStream algorithm is M, the complexity of dividing a single data sub-block into clusters is O((S + C)CDM). Since there are s data sub-blocks in the data stream for the proposed algorithm to traverse, the final complexity of ADStream algorithm is O(s(S + C)CDM).
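To get a feel for how the bound O(s(S + C)CDM) scales, the sketch below evaluates the expression inside the O for given sizes. The numbers in the example are purely illustrative and are not taken from the experiments; constant factors are ignored, so the result is an order-of-magnitude indicator only.

```python
def cost_estimate(s: int, S: int, C: int, D: int, M: int) -> int:
    """Evaluate s * (S + C) * C * D * M, the term inside the O(.) bound."""
    return s * (S + C) * C * D * M
```

For hypothetical values s = 10, S = 1000, C = 5, D = 34 and M = 100, the expression evaluates to 170,850,000; doubling the number of sub-blocks s doubles it, whereas doubling C more than doubles it.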

Experimental Analysis

To test the effectiveness of the proposed adaptive density data stream clustering algorithm, we first evaluate it on simulated data streams using the MOA data stream clustering platform and compare it with the classic DenStream algorithm [16]. The algorithm is implemented and run with JDK 1.6, Eclipse SDK 3.4.1, WEKA 3.7.7 (Waikato Environment for Knowledge Analysis) and MOA-20120301 (Massive Online Analysis). The operating system is Windows XP, and the experimental machine has a 2.6 GHz Intel CPU and 2 GB RAM.

The parameters of ADStream algorithm are set as: attenuation factor \(\lambda = 0.001\), shrinkage factor \(\rho = 0.5\), similarity threshold \(\varepsilon = 0.5\), minPoint = 10, weight threshold \(\xi = 5\), initPoint = 1000, window size horizon = 1000. The parameters of DenStream algorithm are set as: \(\varepsilon = 0.01\), \(\mu = 1.1\), \(\beta = 0.001\), initPoint = 1000, horizon = 1000.

The visualization results generated by ADStream algorithms and DenStream algorithm on MOA simulated environment are shown in Fig. 1.

Fig. 1 Clustering results of ADStream algorithm and DenStream algorithm. a1 Clustering results of ADStream algorithm, b1 Clustering results of DenStream algorithm, a2 Clustering results of ADStream algorithm, b2 Clustering results of DenStream algorithm

According to Fig. 1, the clustering quality of ADStream algorithm is visibly better than that of DenStream algorithm. DenStream algorithm is not sensitive to noise and often excessively deletes micro-clusters, which results in poor clustering accuracy. The clustering results in Fig. 1 also show that ADStream algorithm can detect clusters over a broad region. With the help of AP algorithm, ADStream algorithm adaptively finds suitable cluster centers, which guide the density method to group neighboring points together and generate appropriate clusters.

The comparison of clustering purity of ADStream algorithm and DenStream algorithm is shown in Fig. 2. The blue curve represents ADStream algorithm, and the red curve represents DenStream algorithm. The simulation results indicate that ADStream algorithm is much more stable than DenStream algorithm with the continuous arrival of data streams, and the clustering purity of ADStream is relatively high, which means ADStream algorithm is not susceptible to outliers and has strong robustness for the ever-changing data stream structures.

Fig. 2 Clustering purity of ADStream algorithm and DenStream algorithm
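The text does not spell out how clustering purity is computed. A commonly used definition, assumed here, assigns each cluster its most frequent true class and reports the fraction of points that belong to the dominant class of their cluster. The function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable],
           class_labels: Sequence[Hashable]) -> float:
    """Fraction of points whose class is the majority class of their cluster."""
    if len(cluster_ids) == 0 or len(cluster_ids) != len(class_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    by_cluster: Dict[Hashable, Counter] = {}
    for c, y in zip(cluster_ids, class_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    dominant = sum(max(counts.values()) for counts in by_cluster.values())
    return dominant / len(cluster_ids)
```

For six points in two clusters, where one cluster holds classes (a, a, b) and the other (b, b, b), the purity is 5/6.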

In addition to the data stream simulation platform, we also compare the clustering accuracy of ADStream algorithm with DenStream algorithm [16] and P-Stream algorithm [19] on the KDD-CUP’98 and KDD-CUP’99 data sets of the UCI machine learning database. KDD-CUP’98 is a relatively stable data set. It stores information on charitable donations, with 95,412 records and 481 dimensions in total. Clustering on this data set can reflect the similarity of donation behaviors. As in [14], 56 dimensions are selected in the experiment, and the input order of the records is used to simulate the arrival order of a data stream. KDD-CUP’99 is a network intrusion detection data set with significant data evolution. It consists of the original records of TCP connections in a local-area network, is distributed irregularly, and contains noise data. The data set covers 23 different network intrusions or attacks and has 34 continuous attributes (excluding 7 discrete attributes). In the experiment, 10% of the KDD-CUP’99 data set is used for data stream clustering analysis, 49,032 test records in total. The parameters of ADStream algorithm are set as: similarity threshold \(\varepsilon = 0.5\), attenuation factor \(\lambda = 0.001\), shrinkage factor \(\rho = 0.5\), weight threshold \(\xi = 5\). The speed of reading the data stream is set to 200 records per second. The average clustering accuracy of these algorithms on the KDD-CUP’98 and KDD-CUP’99 data sets is shown in Fig. 3.

Fig. 3 Clustering accuracy of algorithms on different data sets. a Clustering on KDD-CUP’98 data set, b Clustering on KDD-CUP’99 data set

Figure 3 shows that the clustering accuracy of ADStream algorithm is generally higher than that of DenStream algorithm and P-Stream algorithm. ADStream algorithm uses the micro-cluster analysis mechanism to determine, according to the threshold, whether a micro-cluster becomes a core micro-cluster or an outlier cluster, and deletes expired clusters. This maintains the potential micro-clusters in the data stream and removes noise points at the same time. The algorithm then analyzes the potential micro-clusters and updates the core micro-clusters and outlier clusters, which ensures the high quality of the clustering results. DenStream and P-Stream algorithms, in contrast, lack a mechanism for distinguishing potential micro-clusters. They therefore consume a large amount of memory to process noise, and inappropriate division affects the precision of clustering. Figure 4 shows the average clustering time of these algorithms on the KDD-CUP’98 and KDD-CUP’99 data sets.

Fig. 4 Time cost of algorithms on different data sets. a Clustering on KDD-CUP’98 data set, b Clustering on KDD-CUP’99 data set

It can be seen from Fig. 4 that along with the increase in arrived data streams, the clustering time of algorithms is increasing as well. But for a certain amount of data streams, ADStream algorithm spends less time to generate appropriate clusters compared with DenStream algorithm and P-Stream algorithm. ADStream algorithm does not need to initialize the cluster number. With the help of the improved affinity propagation method, ADStream algorithm can dynamically adjust the cluster number and adaptively determine cluster centers according to the relationships among data points. Besides, ADStream algorithm introduces the sliding window mechanism and sets the attenuation factor so that data streams decay with time. The data in current window will have higher weights, and the decay rate of their weights will decrease; if the data have been out of the window, the decay rate will increase. Combined with the mutual transformation of density micro-cluster and candidate density micro-cluster, the time complexity of clustering procedure can be effectively reduced.

Conclusion

This paper reviews the development of current data stream clustering and proposes an adaptive density data stream clustering algorithm—ADStream. ADStream is composed of two stages: online micro-clustering and off-line macro-clustering. In the online part, the dynamic data streams are analyzed in a sliding window, and an improved affinity propagation clustering is applied to adaptively calculate the initial micro-clusters; in the off-line part, the clustering results in different time granularities are generated and updated by density grid clustering. The experiments show that ADStream algorithm has strong abilities of detecting clusters in complex hybrid data streams. ADStream algorithm performs quite well on both artificial and real-world data sets compared with DenStream and P-Stream algorithm.

Although the proposed ADStream clustering algorithm is effective, some problems still need further research. For example, the influence of various parameter settings on the algorithm should be investigated; the robustness of the algorithm should be improved and the negative impact of noise in complex data streams on the clustering reduced; and whether the algorithm works well in diverse real-world environments remains to be tested.