1 Introduction

In recent years, the data stream has emerged as a new form of data and is attracting more and more attention. This kind of information has significant research value and practical importance in many aspects of daily life, such as monitoring information from network media transmission, sensor readings transmitted from coal mines, website access logs, economic information produced by finance and securities companies, weather forecast information, and so on. Because this form of data is massive and updated in real time, traditional clustering methods cannot be used to process it, so the need to develop new clustering methods is becoming increasingly urgent.

Data stream mining is a real-time process aimed at extracting interesting patterns from rapid and dynamic data. The biggest challenge is how to find valuable information in massive data streams in a single scan. As an effective tool for data stream mining, data stream clustering has attracted wide attention in recent years. Although the research on data stream clustering is still in its infancy, it has made great progress. Many effective data stream clustering algorithms have been proposed, and the experimental results are quite satisfactory (Bifet et al. 2009; Sun et al. 2008). These algorithms can be applied in many fields, such as climate monitoring, agriculture, engineering control and so on (Talbot et al. 1999; Hui et al. 2008). The latest research achievements in this field need to be followed up and summarized. This paper therefore gives an overview of the various data stream clustering algorithms and analyzes their pros and cons. Finally, we put forward some promising research directions for data stream clustering and point out the key issues and difficulties of the research.

This paper is organized as follows: the second part summarizes classic data stream clustering algorithms; the third part analyzes new data stream clustering algorithms; the fourth part describes future work; and the final part concludes the paper.

2 Classic data stream clustering algorithms

Compared with traditional data forms, the data stream has its own characteristics. Traditional data is static and stable; it can be accessed at any time and processed more than once. A data stream is dynamic and flows like a stream, which means it is sequential and changes over time. “Real-time”, “continuous” and “ordered” are frequently used to describe data streams; “large volume”, “potentially unbounded” and “uncertain arrival rate” are also obvious features. In view of these characteristics, traditional clustering algorithms no longer meet the requirements of data stream clustering, so we have to improve traditional methods or even develop new algorithms. The concept of data stream clustering was first proposed in 2000, and research in this field has since attracted broad attention (Guha et al. 2003). Data stream clustering can be categorized in different ways. According to the characteristics of the data, it can be divided into the following categories (Ordonez 2003; Guha et al. 2003; Gaber et al. 2005; Babcock et al. 2002): density-based clustering, probability-based clustering, correlation-based clustering and mixed attribute-based clustering. There are also different methods for processing data streams, such as the sliding window method, dynamic grid method, multi-classifier method, and dissimilarity matrix method. All of these methods share the same research goal: to find an effective way to deal with massive high-dimensional data streams, or to solve the clustering problems posed by data that is uncertain, dynamic, and distributed in arbitrary shapes.

In the early development of data stream clustering, the research focus was to construct a suitable data structure for data streams based on traditional methods. The Birch, LocalSearch, Stream and CluStream algorithms are classic data stream clustering methods proposed in this early period.

The Birch algorithm (Zhang et al. 1996) is a typical integrated hierarchical clustering algorithm. Its basic idea is to calculate the distances between a new data point and all known data points, then compare these distances with a threshold to determine the category of the new data point. Birch uses a clustering feature (CF) tree and clustering features to group the data points. The clustering feature vector is defined as \(CF=(n, LS, SS)\), where \(n\) is the number of points, \(LS\) is their linear sum, and \(SS\) is their square sum. The CF tree stores the characteristics of the hierarchical clustering and has two parameters: the branching factor B and the threshold T. The CF tree can be constructed dynamically and does not require all data to be stored in advance, so it is very suitable for clustering data streams.
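To make the CF notion concrete, the following Python sketch shows how the triple \(CF=(n, LS, SS)\) can be maintained incrementally and how a centroid and radius are derived from it; the class and function names are illustrative and do not come from the original Birch implementation.

```python
import numpy as np

# Minimal sketch of the Birch clustering feature CF = (n, LS, SS).
# Names are illustrative, not taken from the original Birch code.

class ClusteringFeature:
    def __init__(self, dim):
        self.n = 0                    # number of points absorbed
        self.ls = np.zeros(dim)       # linear sum of the points
        self.ss = 0.0                 # sum of squared norms of the points

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):
        # CF vectors are additive, which is what makes incremental updates cheap.
        self.n += other.n
        self.ls = self.ls + other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of the points from the centroid, derived from (n, LS, SS).
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - float(c @ c), 0.0)))

def try_absorb(cf, x, threshold):
    """Absorb x into cf only if the resulting radius stays below the threshold T."""
    trial = ClusteringFeature(len(cf.ls))
    trial.merge(cf)
    trial.add(x)
    if trial.radius() <= threshold:
        cf.add(x)
        return True
    return False
```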

The complexity of the Birch algorithm is \(O(n)\), and it can produce good clustering results with only one pass over the data. However, Birch does not work effectively for data of arbitrary shape: when the sample data is non-convex, its performance is often unsatisfactory due to poor clustering accuracy. To address these deficiencies, researchers have done a lot of work, and the M-Birch algorithm was proposed (Ling et al. 2007). M-Birch is an extension of Birch that can dynamically adjust the threshold value so that the clustering results are optimized. Experiments show that M-Birch performs better than Birch, but it is still unable to fundamentally improve the quality of the clustering. In 2003, Barbará proposed that a data stream clustering algorithm must meet three requirements: compact representation, fast processing, and dealing with outliers quickly and accurately (Barbará 2003). Most improved algorithms cannot satisfy all three requirements very well.

As early as around 2000, Guha et al. (2003) proposed the LocalSearch algorithm. It uses a divide-and-conquer strategy to obtain the clustering results of data streams. This algorithm can only describe data streams; it cannot reflect how they change, still less forecast or analyze their trends. So in 2002 a new algorithm, Stream, was proposed by O’Callaghan et al. (2002). It is a continuation and development of the LocalSearch algorithm. Moreover, O’Callaghan also showed that in most cases the clustering performance of the Stream algorithm is better than that of Birch.
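The divide-and-conquer idea behind LocalSearch/Stream can be illustrated roughly as follows: each chunk of the stream is clustered on its own, only weighted centers are retained, and the centers are re-clustered at the end. In the sketch below, scikit-learn's k-means stands in for the k-median/LSEARCH subroutine used in the original papers, and the chunking and parameter values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of the divide-and-conquer strategy: cluster each chunk, keep weighted
# centers only, and re-cluster the retained centers at the end.

def stream_cluster(chunks, k, centers_per_chunk=20):
    level_centers, level_weights = [], []
    for chunk in chunks:                                 # one pass over the stream
        m = min(centers_per_chunk, len(chunk))
        km = KMeans(n_clusters=m, n_init=5).fit(chunk)
        level_centers.append(km.cluster_centers_)
        # Each retained center is weighted by the number of points it represents.
        level_weights.append(np.bincount(km.labels_, minlength=m))
    centers = np.vstack(level_centers)
    weights = np.concatenate(level_weights)
    final = KMeans(n_clusters=k, n_init=10).fit(centers, sample_weight=weights)
    return final.cluster_centers_

# Usage: feed the stream as fixed-size chunks that fit in memory.
rng = np.random.default_rng(0)
chunks = [rng.normal(loc, 1.0, size=(500, 2)) for loc in (0.0, 5.0, 10.0) for _ in range(3)]
print(stream_cluster(chunks, k=3))
```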

In 2003, Guha further improved the Stream algorithm (Guha et al. 2003), but he ignored the fact that a data stream is time-varying and presents different patterns at different times, so his study did not achieve a breakthrough. Then Aggarwal et al. (2003) proposed the CluStream algorithm to solve the problem of data stream clustering. This algorithm is a framework in which the clustering process is divided into two stages: online micro-clustering and offline macro-clustering. In this way, CluStream can cluster not only the recent data but also the data of a specified time period. The first stage is online micro-clustering, after which the micro-clusters are stored in a structure called the pyramidal time frame. In this step, the idea of micro-clustering is similar to Birch in that both use feature vectors to summarize sub-clusters, so they share the same shortcomings. The second stage is offline macro-clustering, in which the micro-clustering results are further analyzed according to the user’s specific requirements. As the second stage is based on the first, the quality of the micro-clustering results is vital to the entire data stream clustering.
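A hedged sketch of a CluStream-style micro-cluster follows: the Birch CF triple is extended with temporal sums, so that snapshots stored in the pyramidal time frame can later be subtracted from one another to obtain the statistics of a user-specified horizon. Field names are illustrative.

```python
import numpy as np

# Sketch of a CluStream-style micro-cluster: a CF triple extended with
# temporal sums; subtracting an older snapshot yields the statistics of the
# data that arrived in between.

class MicroCluster:
    def __init__(self, dim):
        self.n = 0
        self.cf1x = np.zeros(dim)   # linear sum of the points
        self.cf2x = np.zeros(dim)   # per-dimension sum of squares
        self.cf1t = 0.0             # linear sum of timestamps
        self.cf2t = 0.0             # sum of squared timestamps

    def insert(self, x, t):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.cf1x += x
        self.cf2x += x * x
        self.cf1t += t
        self.cf2t += t * t

    def subtract(self, older):
        # Statistics of the data that arrived between the older snapshot and now.
        diff = MicroCluster(len(self.cf1x))
        diff.n = self.n - older.n
        diff.cf1x = self.cf1x - older.cf1x
        diff.cf2x = self.cf2x - older.cf2x
        diff.cf1t = self.cf1t - older.cf1t
        diff.cf2t = self.cf2t - older.cf2t
        return diff
```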

CluStream can only use discrete attribute values; for continuous attribute values it loses some data information. So Aggarwal et al. (2004) proposed the HPStream algorithm, which can process high-dimensional data streams with continuous attribute values. HPStream introduces the idea of subspace clustering and creates a fading (decay) structure to classify data streams. When dealing with high-dimensional data streams, HPStream performs better than CluStream. HPStream abandons the snapshot storage technique used in the CluStream framework and uses a projection method to reduce the data dimensionality. However, it requires the average clustering dimensionality as an input parameter, whose value is difficult to determine in practice. Besides, it does not solve the clustering problem for non-convex data or data of arbitrary shape. The CD-Stream algorithm (Sun et al. 2004) proposed by Sun Huanliang is a data stream clustering method based on space partitioning. This algorithm outperforms others on data streams of non-convex shape, but because it clusters in the whole space, it cannot handle high-dimensional data streams very well.
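The fading behaviour can be illustrated with a short sketch: each cluster's statistics are aged by an exponential decay factor before a new point is absorbed, in the spirit of decay-based algorithms such as HPStream. The decay rate used here is an assumed tuning parameter, not a value from the original paper.

```python
# Sketch of exponential fading: stored statistics are aged by
# 2^(-decay_rate * elapsed_time) before each update, so old data
# gradually loses influence on the cluster.

class FadingCluster:
    def __init__(self, dim, decay_rate=0.1):
        self.decay_rate = decay_rate
        self.weight = 0.0
        self.weighted_sum = [0.0] * dim   # decayed linear sum of the points
        self.last_update = 0.0

    def insert(self, x, t):
        factor = 2.0 ** (-self.decay_rate * (t - self.last_update))
        self.weight = self.weight * factor + 1.0
        self.weighted_sum = [s * factor + xi for s, xi in zip(self.weighted_sum, x)]
        self.last_update = t

    def center(self):
        return [s / self.weight for s in self.weighted_sum]

# Usage: later points dominate the faded centroid.
c = FadingCluster(dim=1, decay_rate=0.1)
for t, x in enumerate([[0.0], [0.0], [10.0], [10.0]]):
    c.insert(x, float(t))
print(c.center())   # above the plain average of 5.0, since recent points weigh more
```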

There are many improved CluStream algorithms, and a typical one is A-Clustream (Zhu et al. 2006; Wu et al. 2009), an arbitrary-shape data stream clustering algorithm proposed by Zhu Weiheng in 2006. This algorithm improves the time efficiency and clustering accuracy of CluStream. In 2007, Ni Weiwei improved existing algorithms and proposed the CLUSMD algorithm (Ni et al. 2007; Yang et al. 2010). It uses the k-means algorithm to partition the data stream, saves the representative points of each partition and discards the other data points to respect the memory limit, and finally groups these representative points according to their density. CLUSMD solves the problem that data streams are not evenly distributed in high-dimensional space. Yang and Zhou (2007) proposed a mixed-attribute data stream clustering algorithm, HClustream, which performs well on standard data sets.

In 2004, Motoyoshi et al. (2004) presented a data stream clustering algorithm based on regression analysis. Its advantage is that it does not depend on initial values or on merging clusters; however, it is only suitable for data streams in which partial regressions exist. In 2005, Song proposed an online data stream clustering algorithm based on probability density (Song and Wang 2005). This algorithm builds a new data structure framework based on the Gaussian mixture model. The above algorithms can only solve certain problems and are only valid in specific areas; therefore, they have not been widely applied or attracted wide attention.

3 New data stream clustering algorithms

With the development of data stream clustering algorithms, research now focuses on the following aspects. The first is clustering irregularly distributed data, which is also the most important research direction in the field and has not yet achieved a breakthrough. The second is how to detect outliers and exclude the interference of noisy data. Researchers have done a lot of work in these two areas and proposed many efficient algorithms.

In 2005, Aggarwal tried to use a subspace projection technique (Aggarwal et al. 2005) to address the “curse of dimensionality”. After that, he proposed the UMicro algorithm for clustering uncertain data streams (Aggarwal and Yu 2008a; Aggarwal et al. 2004). UMicro extends the CF structure into the ECF structure, which can describe the uncertain part of data streams. In China, Tang Changjie proposed probability intervention strategies (Wang et al. 2011, 2009) for uncertain data streams, and suggested that hot partitions should be separated and that Kolmogorov complexity can be used in text data stream mining (Wang et al. 2011).
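As a rough illustration of the idea behind the ECF, the sketch below extends a clustering feature with accumulated per-dimension error variances so that expected distances can reflect the uncertainty of the tuples; this field layout is our own simplification and may differ from the published ECF structure.

```python
import numpy as np

# Simplified illustration of an uncertainty-aware clustering feature: besides
# the usual sums, the error variance reported with each tuple is accumulated.

class UncertainCF:
    def __init__(self, dim):
        self.n = 0
        self.sum_x = np.zeros(dim)     # sum of reported (expected) values
        self.sum_x2 = np.zeros(dim)    # sum of squared reported values
        self.sum_var = np.zeros(dim)   # sum of per-dimension error variances

    def insert(self, x, var):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x
        self.sum_var += np.asarray(var, dtype=float)

    def expected_sq_radius(self):
        # Expected second moment about the centroid, inflated by the average
        # per-tuple uncertainty.
        c = self.sum_x / self.n
        return (self.sum_x2 + self.sum_var) / self.n - c * c
```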

Based on these studies, in 2010 Zhang et al. (2010) put forward the EMicro algorithm. This algorithm takes into account not only the distance between tuples and the attribute-level uncertainty of data tuples, but also the tuples’ own characteristic, existence-level uncertainty. As the ECF cannot handle existence-level uncertain data streams, EMicro uses the UCF (uncertain clustering feature) data structure (Chang et al. 2007; Cao et al. 2007) to deal with the probabilities in the data stream. In addition, this algorithm also develops a novel method to detect outliers and introduces new evaluation standards for clustering quality.

EMicro uses a buffer mechanism to deal with outliers. Its specific strategy is to keep two buffers in memory, storing clusters and outliers respectively. The cluster selection method uses the principle of gravity and provides the Find Optimal Cluster algorithm. EMicro takes into account both the distance factor and the probability factor when assigning a new tuple to an existing micro-cluster, and comes up with a new way to handle outliers and the data storage structure. Compared with previous uncertain data stream clustering algorithms, EMicro improves on many points and performs well on data streams.
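A hypothetical sketch of the two-buffer strategy is given below: clusters live in one buffer, suspected outliers in another, and an outlier entry is promoted to a cluster once it has absorbed enough tuples. The distance test and the promotion rule are illustrative and are not EMicro's exact Find Optimal Cluster procedure.

```python
import numpy as np

# Hypothetical two-buffer outlier handling: the thresholds and the promotion
# rule are illustrative assumptions, not EMicro's published procedure.

class Bucket:
    def __init__(self, x):
        self.n = 1
        self.total = np.asarray(x, dtype=float).copy()

    def center(self):
        return self.total / self.n

    def insert(self, x):
        self.n += 1
        self.total += np.asarray(x, dtype=float)

def assign(x, clusters, outliers, radius=1.0, promote_at=5):
    x = np.asarray(x, dtype=float)
    for buf, label in ((clusters, "clustered"), (outliers, "buffered")):
        for b in buf:
            if np.linalg.norm(b.center() - x) <= radius:
                b.insert(x)
                if buf is outliers and b.n >= promote_at:
                    outliers.remove(b)          # dense enough: promote it
                    clusters.append(b)
                return label
    outliers.append(Bucket(x))                  # no match anywhere: new outlier entry
    return "new outlier"
```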

The data in data streams is uncertain. If the data appears with a certain probability, the data flow can be defined as a probabilistic data stream (Cormode and Garofalakis 2007; Jayram et al. 2008, 2007). In 2009, P-Stream, a new data stream clustering algorithm, was proposed by Dai et al. (2009). They introduce the concepts of strong cluster, weak cluster and excessive cluster for the probabilistic tuples in data streams, and design a cluster selection strategy that can effectively assign each arriving tuple to an appropriate cluster.

Given an uncertain tuple sequence \(S=\{\langle v^{1}, p^{1}\rangle, \ldots, \langle v^{k}, p^{k}\rangle, \ldots\}\), where \(v^{i}\) is the value of a tuple and \(p^{i}\in[\theta, 1]\) is its existence probability, consider a cluster of data tuples \(C=\{\langle v^{1}, p^{1}\rangle, \ldots, \langle v^{n}, p^{n}\rangle\}\). Its existence probability \(EP^{C}\) is defined as the average existence probability of all its tuples. For a given parameter \(a\) \((0\le a\le 1)\), if cluster \(C\) satisfies \(EP^{C}\ge a\), then \(C\) is called a strong cluster, and \(a\) is the lower bound of strong clusters. P-Stream not only gives the concepts of strong cluster, weak cluster and excessive cluster, but also designs a method to find candidate clusters and uses a Model Outdated Process to deal with outliers. When a candidate cluster for a tuple \(\langle v, p\rangle\) is selected, the Find Candidate Cluster algorithm considers not only \(D_i\) (the distance between \(\langle v, p\rangle\) and the center of candidate cluster \(C_i\)) but also the value of \(EP^{C_i}\); CluStream, by contrast, only considers the distance factor when selecting candidate clusters. In P-Stream, the Find Candidate Cluster strategy first sorts the clusters \(C_i\) into a sequence \(C'=\{C_{j_1}, \ldots, C_{j_k}\}\) and then uses heuristic rules to determine the candidate clusters. It can be shown that the new strategy finds more strong clusters than the original method when the SSQ (sum of squared distances) is bounded.
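The cluster-strength notions above can be illustrated with a small sketch: \(EP^{C}\) is computed as the average tuple probability, a cluster is strong when \(EP^{C}\ge a\), and a simple candidate-selection rule mixes distance with cluster strength. The selection rule shown here is a stand-in for P-Stream's heuristic, not the published procedure.

```python
# A cluster is a list of (value, probability) pairs; EP is the average tuple
# probability, and a cluster is strong when EP >= a.

def existence_probability(cluster):
    return sum(p for _, p in cluster) / len(cluster)

def is_strong(cluster, a):
    return existence_probability(cluster) >= a

def pick_candidate(tup, clusters, a):
    # Prefer clusters that would remain strong after absorbing the tuple and
    # are close to it; otherwise fall back to the nearest cluster.
    def dist(c):
        center = sum(v for v, _ in c) / len(c)
        return abs(center - tup[0])
    strong = [c for c in clusters if is_strong(c + [tup], a)]
    pool = strong if strong else clusters
    return min(pool, key=dist)

# Usage with one-dimensional tuple values for brevity.
c1 = [(1.0, 0.9), (1.2, 0.8)]
c2 = [(5.0, 0.4), (5.1, 0.5)]
print(is_strong(c1, 0.7), is_strong(c2, 0.7))             # True False
print(pick_candidate((1.1, 0.95), [c1, c2], 0.7) is c1)   # True
```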

Because of the flaws of CluStream in processing outliers, P-Stream uses an adaptive model outdated process. In the offline clustering stage, the probability information of the data stream is converted into statistical information via micro-cluster snapshots. When a clustering request \((t,\,k)\) arrives, the stored time indexes are used to retrieve from disk the micro-cluster snapshot closest to time \(t\). Since merging strong clusters still yields a strong cluster, the clustering does not need to consider the relationship between the existence probabilities of clusters. Compared with CluStream, P-Stream can find a reasonable time to store micro-cluster snapshots, and it handles outliers in a more optimized way. So for the clustering of arbitrarily shaped data streams, P-Stream should have considerable room for research and development.

Chen and Shi (2010) proposed wavelet-synopsis-based clustering of parallel data streams. This algorithm relies on two concepts: the hierarchical amnesic summary (HAS) and the discrete wavelet transform (DWT) (Guha and Harb 2005; Karras and Mamoulis 2005). The discrete wavelet transform is a data compression method, and the oblivion (forgetting) characteristic (Bulut and Singh 2003; Palpanas et al. 2004) of the hierarchical amnesic structure makes it possible to dynamically maintain the hierarchical data nodes of a stream. Using these two tools, a synopsis structure called W-HAS (wavelet-based HAS) can be constructed over the hierarchical data stream. The approximate distances and clustering centers are then computed dynamically, and W-HAS is applied to the online clustering of parallel data streams.

The idea of the dynamic maintenance is to continuously receive arriving data at a given level; once the data nodes at that level reach a certain number, the oldest N data nodes are merged to generate a node at the next higher level. Obviously, the lower the level and the shorter the data stream sequence it covers, the better the approximation between the corresponding summary information and the original information. As the hierarchy deepens, the approximation between the summary and the original information becomes worse and worse, which reflects the amnesic characteristic of this algorithm for data streams. This feature must be kept under control, otherwise the inaccuracy becomes large and the clustering results unsatisfactory. So the authors define the relative reconstruction error E of the data information and require \(E<\varepsilon \) (\(\varepsilon \) is a parameter). The W-HAS clustering algorithm consists of three steps: calculate the wavelet coefficients of the clustering centers; extract data nodes from the original data streams and calculate the wavelet coefficients of the normalized data streams; dynamically update the wavelet coefficients of the data streams and use the k-means method to obtain the final clusters.
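The amnesic maintenance described above can be sketched as follows: incoming values fill the lowest level, and whenever a level holds more than a fixed number of nodes, the oldest ones are merged (here by simple Haar-style averaging) into a node at the next level, so that older data is kept only at coarser resolution. The merge rule and the capacities are illustrative assumptions.

```python
import numpy as np

# Sketch of amnesic hierarchical maintenance: recent data stays exact, older
# data is repeatedly merged upward into coarser nodes.

class AmnesicSummary:
    def __init__(self, merge_size=2, max_per_level=4):
        self.levels = [[]]               # levels[i] holds nodes of resolution i
        self.merge_size = merge_size
        self.max_per_level = max_per_level

    def append(self, value):
        self.levels[0].append(float(value))
        self._compact(0)

    def _compact(self, level):
        nodes = self.levels[level]
        while len(nodes) > self.max_per_level:
            oldest = nodes[:self.merge_size]      # merge the oldest nodes first
            del nodes[:self.merge_size]
            if level + 1 == len(self.levels):
                self.levels.append([])
            self.levels[level + 1].append(float(np.mean(oldest)))
            self._compact(level + 1)

summary = AmnesicSummary()
for v in range(20):
    summary.append(v)
print(summary.levels)   # recent values are exact, older values are merged upward
```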

The W-HAS structure and its additive data nodes reduce the space complexity of the algorithm, and with a small space complexity it can meet the environmental requirements of data stream clustering. Moreover, because W-HAS saves the entire synopsis structure of the data stream, we can select the proper data nodes to analyze the required portions of the stream and are not limited to a fixed-length sliding window, which is an innovation compared with sliding-window-based data stream clustering.

In addition to the above studies, in China, Zhu et al. (2011) proposed a double-window-based classification algorithm for concept-drifting data streams in 2011; Zhang et al. (2011) studied continuous dynamic skyline queries over data streams; and Han et al. (2011) explored load-shedding strategies for sliding-window joins over data streams. Outside China, Kavitha and Punithavalli (2010) were committed to the study of time series data streams and proposed feasible data stream clustering algorithms. All of them have achieved remarkable results.

4 Future work

As we know, data stream clustering is a difficult task in the real world and has recently attracted wide attention. The studies reviewed here provide a way to investigate the existing algorithms and techniques for time series data stream clustering and help to find directions for future enhancement. Future research can be directed to the following aspects:

1) High dimensionality affects the efficiency and accuracy of an algorithm, so for high-dimensional data streams it is necessary to find an effective dimension reduction method.

2) Adaptive partitioning and sliding techniques require a large amount of computation. How to reduce the computational complexity of data stream clustering is worthy of further research.

3) In most applications the data information is usually not sufficient for statistical analysis, so developing new methods to deal with uncertain data streams is also an important research direction.

4) Time series data streams are very common in the real world, but their study has just begun. We should design an effective approach to predicting the next value in a time series.

5) Subspace clustering can effectively deal with massive high-dimensional data, so introducing subspace methods into data stream clustering is also a good idea.

6) Combine data stream clustering with specific applications and explore appropriate algorithms to solve practical problems; test and improve data stream clustering algorithms through practical use.

5 Conclusion and prospect

Data stream clustering is a product of technological progress and has developed along with information science. After more than ten years of research, it has made notable achievements. This paper provides an overview of the various data stream clustering algorithms proposed in recent years, and introduces the theoretical frameworks of these algorithms as well as their applications in daily life.

However, as a novel clustering approach, data stream clustering is still immature and imperfect. Data information should be extracted accurately and in real time, but it is hard for data stream clustering algorithms to meet this requirement. In reality, most data streams are of arbitrary shape and unevenly distributed, and uncertain data and noise points in data streams also interfere with the clustering process. These aspects remain the focal points and difficulties of future research on data stream clustering. In this paper, the uniqueness and drawbacks of past studies and some possible topics for further study are also discussed. In the future, we will study data streams in more depth and try to develop an effective algorithm to solve the current problems of data stream clustering.