EvolveCluster: an evolutionary clustering algorithm for streaming data

Nordahl, Christian; Boeva, Veselka; Grahn, Håkan; Persson Netz, Marie

doi:10.1007/s12530-021-09408-y

Original Paper
Open access
Published: 13 November 2021

Volume 13, pages 603–623, (2022)

Evolving Systems


Christian Nordahl ORCID: orcid.org/0000-0001-7199-8080¹,
Veselka Boeva¹,
Håkan Grahn¹ &
…
Marie Persson Netz¹


Abstract

Data has become an integral part of our society in the past years, arriving faster and in larger quantities than before. Traditional clustering algorithms rely on the availability of entire datasets to model them correctly and efficiently. Such requirements are not possible in the data stream clustering scenario, where data arrives and needs to be analyzed continuously. This paper proposes a novel evolutionary clustering algorithm, entitled EvolveCluster, capable of modeling evolving data streams. We compare EvolveCluster against two other evolutionary clustering algorithms, PivotBiCluster and Split-Merge Evolutionary Clustering, by conducting experiments on three different datasets. Furthermore, we perform additional experiments on EvolveCluster to further evaluate its capabilities on clustering evolving data streams. Our results show that EvolveCluster manages to capture evolving data stream behaviors and adapts accordingly.


1 Introduction

In recent years, data has become an integral part of our daily lives. Due to advances in hardware infrastructures, there are endless possibilities available to collect any type of data at a rapid pace. Examples of streaming data sources include weather sensors, mobile applications, Instagram posts, electricity consumption, shopping records, etc. (Bifet et al. 2010b).

These data streams are endless information sources that arrive in a timely fashion. Incoming data tends to be unlabeled, as it requires too much effort to label it by hand. The immense amount of data proves to be hard to manage with traditional supervised machine learning algorithms, which has driven the emergence of unsupervised learning techniques. Unsupervised learning is a branch of machine learning where algorithms learn by themselves, identifying the underlying structure of a dataset. Depending on the application, the results from the unsupervised learning algorithms can be used directly for analysis or as an intermediary step to gain an understanding of the data. One of the branches of unsupervised learning is the task of clustering analysis.

Clustering is the process of grouping data instances into groups based on their similarity to each other (Gama 2010). Intuitively, instances within a cluster are more similar to each other than to instances belonging to another cluster (Jain et al. 1999; Zubaroglu and Atalay 2021). The objective of clustering algorithms is to detect the underlying characteristics of the instances that make each cluster unique. Traditional clustering algorithms, such as k-means (Lloyd 1982), require the entire dataset to be available. In data stream clustering, the data arrives incrementally in such high quantity and at such a pace that traditional clustering methods cannot cope with it (Gama 2010).

In the evolving data stream scenario, we have a continuous data stream that contains changes over time. These changes cause traditional offline models to become obsolete over time as the new data no longer conforms to the data on which the model was trained. Incremental clustering algorithms, such as the one introduced by Lughofer (2008), are one way to address the problem of evolving data streams. These algorithms process elements on a step-wise basis and inject them into the existing clustering solution, updating the clusters by merging or splitting if needed.

Many applications, however, do not necessarily need the rapid adaptations that incremental clustering algorithms provide. Instead, by approaching the data stream segmentally, it is possible to view how the data is changing over time directly. For example, electricity providers can identify if and how the electricity consumption trend has altered in a single household, neighborhood, or an entire city, and determine if any remediation is required. Likewise, an online retailer can identify consumer shopping trends and how they change over time. When an overarching view of how the data aligns with previous structures and how it changes over time is desirable, there is a lesser need for direct updates to the clustering models.

In this study, we propose a novel evolutionary clustering algorithm capable of modeling data streams containing evolving data, entitled EvolveCluster, a continuation of our previous work (Boeva and Nordahl 2019). Instead of processing elements individually, we collect data over a defined period (creating segments) to trace how the data evolves. Two similar approaches have been identified to compare and evaluate the proposed algorithm, namely PivotBiCluster (Ailon et al. 2012) and Split-Merge Evolutionary Clustering (Boeva et al. 2019). Both these algorithms address the evolving data stream scenario by dividing the data into segments. In contrast to EvolveCluster, PivotBiCluster and Split-Merge Clustering map previously clustered data segments to fit with the newly arrived data segment. Both these algorithms combine the current data segment with the previous one by identifying similarities between the clusters from the two segments. However, the main drawback of both PivotBiCluster and Split-Merge Clustering is that they require each data segment to be clustered in advance.

Our main contributions are as follows:

We introduce a new algorithm, entitled EvolveCluster, that is especially targeted at evolving data streams. The design of the algorithm makes it easy to understand how trends and patterns appear in the data segments (Sect. 4.2).
We provide a thorough analysis of the computational complexity of the proposed algorithm (Sect. 4.3).
We evaluate the performance of EvolveCluster, PivotBiCluster, and Split-Merge Clustering, and we identify and discuss their strengths and weaknesses (Sects. 6 and 7).

2 Background

In this section we provide the necessary background information. First a description of clustering analysis is provided, with a specific definition of k-medoids. We continue by introducing concept drift, dissimilarity measures, and conclude this section with an explanation of evaluation and validation measures.

2.1 Clustering algorithms

Clustering algorithms are designed to identify an underlying structure of data and use the detected relationships within the structure to group the data points into distinct groups. These algorithms usually decide by themselves how to divide the data into subgroups, an unsupervised approach to increase knowledge about the data. There are numerous ways of approaching this task and we can group them into five major categories: density-based, grid-based, hierarchical, model-based, and partitioning algorithms (Berkhin 2006). This study focuses on partitioning algorithms due to the characteristics of the proposed evolutionary clustering algorithm (see Sect. 4).

Partitioning algorithms differ from the other algorithm types in their need to define the number of clusters in advance. The number of clusters, usually denoted as k, is a parameter given to the algorithms when they are initialized. But, identifying an appropriate k in advance is not easy. A common approach to identify a suitable k is having the algorithm execute multiple times with an increasing k value. More sophisticated methods exist, where the data is analyzed in advance with an initialization algorithm that estimates how many clusters are present in the dataset (Arthur and Vassilvitskii 2006). These initialization algorithms, however, do not promise to produce an optimal solution.

One of the most prominent examples of a partitioning algorithm is the k-means algorithm. k-means starts by assigning k initial cluster centroids, either randomly or by an initialization algorithm. Each data point is then assigned to a cluster based on its distance to the centroids. The solution is refined by first electing a new cluster centroid, based on the mean values of the data objects in the cluster, and then redistributing the data points accordingly. k-means refines the solution until changes are no longer made or until a maximum number of iterations has been reached.

k-medoids, or Partitioning Around Medoids (Vinod 1969), is similar to k-means and generally seen as a sister algorithm. k-medoids, however, uses actual data points as cluster centroids instead of creating synthetic centroids. This approach makes the algorithm more robust than k-means, being less susceptible to noise and outliers.
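To make the assign-and-refine loop concrete, here is a minimal Python sketch of a k-medoids variant. It uses the simple alternating scheme (assign points to the nearest medoid, then re-elect each medoid), which is an assumption made for brevity and not necessarily the swap-based PAM procedure; the names `assign`, `elect_medoid`, and `k_medoids` are hypothetical.

```python
import math

def assign(points, medoids):
    """Group point indices by their nearest medoid (medoids are point indices)."""
    clusters = [[] for _ in medoids]
    for i, p in enumerate(points):
        nearest = min(range(len(medoids)),
                      key=lambda c: math.dist(p, points[medoids[c]]))
        clusters[nearest].append(i)
    return clusters

def elect_medoid(points, members):
    """Pick the member with the smallest total distance to the other members."""
    return min(members,
               key=lambda c: sum(math.dist(points[c], points[j]) for j in members))

def k_medoids(points, initial_medoids, max_iter=100):
    """Alternate assignment and medoid election until the medoids stop changing."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(points, medoids)
        new_medoids = [elect_medoid(points, m) for m in clusters if m]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)
```

Because every medoid is an actual data point, a single distant outlier can shift a medoid by at most one position in the data, whereas it would pull a mean-based centroid toward itself.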

2.2 Concept drift

One of the phenomena present in data streams, especially in evolving data streams, is how the data changes and evolves. This non-stationarity of data over time is referred to as concept drift (Khamassi et al. 2018). Depending on the data and how it changes over time, different types of concept drifts may exist in the streams. Wadewale and Desai (2015) divided concept drift into six categories: sudden, incremental, gradual, recurring, blip, and noise. Sudden concept drifts are abrupt changes to the data, while incremental and gradual drifts happen more slowly. A recurring drift is a sudden, gradual, or incremental drift that happens periodically. Blip and noise are defined as outliers and random instances that should be filtered out, respectively.

Concept drift is a crucial aspect of learning from evolving data streams. As the data evolves, the algorithm needs to be capable of adapting and continuously learning about the underlying data structure to model it correctly. In addition to our comparative experiments (Sects. 6.1–6.3), we perform an additional set of experiments to focus solely on EvolveCluster’s ability to model data streams where concept drift is present (see Sect. 6.4).

2.3 Dissimilarity measures

Calculating the distance, or dissimilarity, between two objects is a requirement to enable the use of distance-based clustering algorithms such as k-medoids (Vinod 1969). These measures provide a numerical value that indicates how dissimilar or distant two data objects are. Numerous variants of measures exist, and their usage depends on the data itself. Two of the most common measures are the $L_1$ and $L_2$, commonly referred to as Manhattan and Euclidean distance (ED) (Wang et al. 2013), respectively. These two measures are relatively simple and tend to be very effective when the dataset has a lower dimensionality.

Based on the application, a variety of dissimilarity measures exist. Concerns such as dimensionality, computational efforts, type of data, etc., factor in choosing the measure (Shirkhorshidi et al. 2015). For instance, an electricity consumption dataset is a time series dataset. If the shape of the consumption, i.e., behavior, is of interest, a measure such as DTW (Sakoe and Chiba 1978) is an eligible candidate measure. As an elastic measure, DTW could identify similar behaviors that occur at different times of day as closely related. If instead a strict measure was used, such as ED, those similar behaviors would likely not be identified as similar. However, if the behaviors must occur at the exact same time each day to be identified as similar, ED would be a better choice of measure (Nordahl et al. 2019).

The datasets used in this study vary considerably in their type and number of data points. None of them, however, is considered to be a high-dimensional dataset. Thus, we decided to use ED (Wang et al. 2013) on our datasets to focus on the algorithms and their properties. ED is defined as follows:

$$\begin{aligned} \text {ED} \left( q, p \right) = \sqrt{\sum _{i=1}^{n}{\left( q_i - p_i \right) ^{2}}}, \end{aligned}$$

(1)

where q and p are two n-dimensional data vectors and $q_i$ and $p_i$ are the i-th components of q and p, respectively.
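As a small illustration of Eq. 1, the following Python sketch computes ED for two vectors, with the $L_1$ (Manhattan) distance alongside for comparison. The function names are illustrative assumptions and not part of the original study.

```python
import math

def euclidean_distance(q, p):
    """Eq. 1: square root of the summed squared component differences."""
    if len(q) != len(p):
        raise ValueError("q and p must have the same number of dimensions")
    return math.sqrt(sum((q_i - p_i) ** 2 for q_i, p_i in zip(q, p)))

def manhattan_distance(q, p):
    """L1 distance: sum of absolute component differences."""
    return sum(abs(q_i - p_i) for q_i, p_i in zip(q, p))
```

Both measures compare the vectors component by component, which is why they are strict with respect to timing when applied to time series.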

2.4 Cluster validation measures

The data mining literature provides a wide range of different cluster validation measures, which are broadly divided into two major categories: external and internal (Jain and Dubes 1988). External validation measures have the benefit of providing an independent assessment of clustering quality, evaluating the clustering results with respect to a pre-specified structure. Within the external evaluation, there are two distinct classes of measures: unary and binary (Handl et al. 2005). Unary measures often take a clustering solution as input and compare it against the ground truth. The clustering solution can be evaluated with regard to both the purity and the completeness of the clusters. F$_1$ is one example of such a validation measure (Chinchor 1992). In addition to unary measures, a number of indices that assess the consensus between two partitioning solutions, based on the pairwise assignment of data points, are provided in the data mining literature. Most of these indices are symmetric, making them well-suited for assessing the similarity of two clustering solutions; the Jaccard Index is a good example (Jaccard 1912).

Internal validation techniques, on the other hand, avoid the need for using such additional knowledge. They evaluate the clustering solutions based upon the same information that was used to create the clusters, enabling them to evaluate the quality of the produced clustering solutions in different ways. Internal measures can be divided into four categories based on how they evaluate clustering solutions: compactness, separation, connectedness, and stability of the cluster partitions. A detailed overview of different types of validation measures is available in (Halkidi et al. 2001; Vendramin et al. 2010).

Traditionally, many researchers working in data stream clustering apply well-known validation measures, such as Sum of Squared Errors (SSE) and purity, to evaluate the clustering solutions produced by their data stream clustering algorithms (Silva et al. 2013). Specific validation measures do, however, exist in the realm of data stream clustering. In 2011, the Cluster Mapping Measure (CMM) was proposed as an effective measure for data stream clustering (Kremer et al. 2011). CMM is a combination score of missed objects, misplaced objects, and noise inclusion, and is based on the ground truth. More recently, several adaptations of well-known validation measures have been proposed, including Silhouette Index (Da Silva et al. 2020), Davies-Bouldin (Da Silva et al. 2020), and Xie-Beni (Moshtaghi et al. 2019). All of the aforementioned measures are, however, designed for incremental clustering algorithms. The algorithm proposed in this paper divides the stream into fixed-sized segments and analyzes and evaluates each segment individually. Because the stream is divided into segments by design, it can be argued that it is not necessary to adopt the incremental validation measures. EvolveCluster does not process elements incrementally. Instead, each segment is statically clustered, which allows us to utilize traditional validation measures on each segment. Furthermore, as the algorithm only operates on entire segments and not individual instances, there is no directly applicable way to validate with these types of measures.

In the coming sub-sections, we describe and define the evaluation measures we apply in the study. We apply two external (F$_1$-measure and Jaccard Index) and two internal (Silhouette Index and Average Intra-Cluster Distance) cluster validation measures to the clustering solutions generated by our experiments.

2.4.1 F$_1$-measure

F$_1$ is the harmonic mean of the precision and recall values of each cluster. Consider two clustering solutions, $A = \{A_1,\ldots ,A_k\}$ and $B = \{B_1,\ldots ,B_l\}$, of the same dataset. We define A as the known partitioning of the dataset and B as the partitioning produced by the applied clustering algorithm. We then define the F$_1$ for a cluster $B_j$ as:

$$\begin{aligned} \text {F}_1\left( B_j\right) = \frac{2 |A_i \cap B_j|}{|A_i| + |B_j|}, \end{aligned}$$

(2)

where $A_i$ is the cluster containing the maximum number of objects from $B_j$.

To evaluate the overall F$_1$ score for the clustering solution B, two common approaches are used, micro and macro average. Both versions are similar, but the macro average sees all classes as equal while the micro average corrects the score by each individual class’s frequency. In this study, the datasets used (see Sect. 5.1.1) have a fairly even distribution of the corresponding classes. Therefore, the macro F$_1$ is used and for the clustering solution B it is defined as:

$$\begin{aligned} \text {F}_1\left( B\right) = \frac{1}{l} \sum _{j=1}^{l}\text {F}_1\left( B_j\right) , \end{aligned}$$

(3)

where l is the number of clusters within B. The F$_1$ score has a value between 0 and 1, with 1 indicating a perfect score.
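Eqs. 2 and 3 can be sketched in a few lines of Python. The representation of each cluster as a set of object identifiers, and the names `f1_cluster` and `macro_f1`, are assumptions made for this illustration.

```python
def f1_cluster(known, produced_cluster):
    """Eq. 2: F1 of one produced cluster B_j against its best-matching known cluster A_i."""
    # A_i is the known cluster sharing the most objects with B_j.
    best = max(known, key=lambda a: len(a & produced_cluster))
    return 2 * len(best & produced_cluster) / (len(best) + len(produced_cluster))

def macro_f1(known, produced):
    """Eq. 3: unweighted mean of the per-cluster F1 scores."""
    return sum(f1_cluster(known, b) for b in produced) / len(produced)
```

For instance, with known clusters {0, 1, 2} and {3, 4, 5} and produced clusters {0, 1} and {2, 3, 4, 5}, the per-cluster scores are 4/5 and 6/7, and the macro average is their mean.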

2.4.2 Jaccard Index

For evaluating the stability of a clustering solution, Jaccard Index (JI) is a suitable candidate. Given two clustering solutions produced from the same dataset, A and B, we define JI between A and B as follows:

$$\begin{aligned} J\left( A, B \right) = \frac{|A \cap B|}{|A \cup B|}, \end{aligned}$$

(4)

where $|A \cap B|$ is the number of data points with the same class in the same clusters in A and B, and $|A \cup B|$ is the total number of data points in the same clusters in A and B. JI ranges from 0 to 1, where a higher value indicates a higher similarity between the clustering solutions.
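One common way to compute JI between two clustering solutions is the pair-counting form, where the intersection and union are taken over pairs of data points placed in the same cluster. The sketch below assumes this pair-counting reading, which may differ in detail from the exact counting used in the study; the label-list representation and the function names are likewise assumptions.

```python
from itertools import combinations

def co_clustered_pairs(labels):
    """All index pairs (i, j), i < j, that share a cluster label."""
    return {(i, j)
            for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def jaccard_index(labels_a, labels_b):
    """Pair-counting Jaccard Index between two labelings of the same data points."""
    pairs_a = co_clustered_pairs(labels_a)
    pairs_b = co_clustered_pairs(labels_b)
    union = pairs_a | pairs_b
    # Two solutions with no co-clustered pairs at all are treated as identical.
    return len(pairs_a & pairs_b) / len(union) if union else 1.0
```

The measure is symmetric in its two arguments and does not depend on how the cluster labels are named, only on which points are grouped together.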

2.4.3 Silhouette Index

Silhouette Index (SI) is a cluster validity index that is used to determine the quality of any clustering solution $C = \{C_1, \ldots , C_k\}$. It produces a score that is based on the compactness of each cluster and the separation between the clusters (Rousseeuw 1987). SI for a clustering solution C is defined as:

$$\begin{aligned} SI(C) = \frac{1}{m} \sum _{i=1}^{m} \frac{b_i - a_i}{\max \left( a_i, b_i\right) }, \end{aligned}$$

(5)

where m is the number of objects in the dataset, $a_i$ is the average distance from object i to the other objects in its cluster, and $b_i$ is the minimum average distance from i to the objects of the other clusters. SI ranges from -1 to 1, where a value closer to 1 indicates a better clustering solution, and a negative value indicates that there are misplaced data points within the clustering solution.
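A direct, unoptimized Python sketch of Eq. 5 follows. It assumes Euclidean distance, at least two clusters, and the common convention that an object alone in its cluster contributes 0; the function name and the label-list representation are illustrative assumptions.

```python
import math

def silhouette_index(points, labels):
    """Eq. 5: mean of (b_i - a_i) / max(a_i, b_i) over all objects."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: contributes 0 by convention (assumption)
        # a_i: average distance to the other objects in the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b_i: smallest average distance to the objects of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)
```

Well-separated, compact clusters give a score close to 1, while assigning points to the wrong side of a gap pushes the score below zero.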

2.4.4 Average Intra-Cluster Distance

Similarly to SI, the Average Intra-Cluster Distance (IC-av) measures how compact the produced clusters are. In contrast to SI, it does not assume a spherical shape of the produced clusters (Baya and Granitto 2013). Instead of calculating the radius around the clusters, IC-av produces a Minimum Spanning Tree (MST) of all data points based on the distance between the objects in the dataset. The edges containing the distance between the data points in the tree are then used to determine the compactness of the clusters in the clustering solution. For a particular clustering solution $C = \{C_1,\ldots , C_k\}$, IC-av is defined as:

$$\begin{aligned} \text {IC-av}\,\left( C\right) = \sum _{r=1}^{k}\frac{1}{n_r}\sum _{i,j\in C_r} d^{2}_{ij}, \end{aligned}$$

(6)

where $n_r$ is the number of objects in cluster $C_r$ ($r = 1, 2,\ldots , k$) and $d_{ij}$ is the maximum edge distance, i.e., the longest edge in the path joining objects i and j in the MST. IC-av produces a score between zero and the maximum value of the edges in the MST and should be minimized.
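The MST-based computation can be sketched as below. This is an unoptimized illustration under stated assumptions: Euclidean distance, Prim's algorithm for the MST, and the inner sum of Eq. 6 taken over unordered pairs of objects within a cluster. The function names are hypothetical.

```python
import math
from itertools import combinations

def minimum_spanning_tree(points):
    """Prim's algorithm; returns adjacency lists {node: [(neighbour, weight), ...]}."""
    n = len(points)
    adj = {i: [] for i in range(n)}
    in_tree = {0}
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        w, u, v = min(
            (math.dist(points[u], points[v]), u, v)
            for u in in_tree for v in range(n) if v not in in_tree
        )
        adj[u].append((v, w))
        adj[v].append((u, w))
        in_tree.add(v)
    return adj

def max_edge_from(adj, src):
    """Longest edge on the unique MST path from src to every other node."""
    longest = {src: 0.0}
    stack = [src]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in longest:
                longest[v] = max(longest[u], w)
                stack.append(v)
    return longest

def ic_av(points, clusters):
    """Eq. 6, summing squared maximum edge distances over unordered pairs (assumption)."""
    adj = minimum_spanning_tree(points)
    score = 0.0
    for members in clusters:
        total = 0.0
        for i in members:
            longest = max_edge_from(adj, i)
            total += sum(longest[j] ** 2 for j in members if j > i)
        score += total / len(members)
    return score
```

Because $d_{ij}$ depends only on the longest hop along the MST path, an elongated but densely connected cluster still scores as compact, which is the property that distinguishes IC-av from radius-based measures.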

3 Related work

In this section, we provide a review of studies related to our work. At the end of the section, we specifically review the two algorithms PivotBiCluster and Split-Merge Evolutionary Clustering.

3.1 Evolving data streams

The data stream clustering scenario differs from traditional clustering because the data is usually not available in its entirety. Additionally, the data in the stream arrives at such a rapid pace and in such large quantities that it is impossible to keep the data in the main memory (Gama 2010; Bifet et al. 2010b). Traditional algorithms, such as k-means (Lloyd 1982) and DBSCAN (Ester et al. 1996), rely on the entire dataset being present. A naive approach to apply traditional algorithms on data streams would be to re-cluster the entire solution at each increment. However, this approach is infeasible with regard to both time constraints and the resources needed by the algorithms (Mousavi et al. 2015; Zubaroglu and Atalay 2021).

In addition to the quantity and rapidness of data in data streams, evolving data streams have the additional dynamics of non-stationary data over time, also known as concept drift (Khamassi et al. 2018). Multiple approaches have been investigated to capture the dynamic aspects of evolving data streams, including (O’callaghan et al. 2002; Gama et al. 2011; Kriegel et al. 2011; Ghesmoune et al. 2015; Zhou et al. 2008; Lühr and Lazarescu 2009; Angelov and Zhou 2008). More specifically, Lughofer (2008) proposes an incremental algorithm, where each increment causes the affected cluster to be split and merged separately, which was further developed in (Lughofer 2012). Similarly, Aaron et al. (2014) extend the k-means algorithm to a dynamic incremental clustering algorithm. A common idea for capturing evolving data streams’ dynamic nature is to use incremental algorithms and add functionality to modify clusters by splitting and merging (Aaron et al. 2014; Lühr and Lazarescu 2009). In general, these algorithms are divided into two components: Online and Offline (Zubaroglu and Atalay 2021). The online component of the algorithms produces micro-clusters that stay up to date with each increment of data objects that arrives, and the offline component runs periodically to finalize the clustering solution based on the produced micro-clusters.

3.2 Window based models

Generally, within data stream clustering, especially in contrast to traditional clustering, it can be more efficient to focus on the recent data instead of the entire stream. Several window models exist, but the following three are the most popular: damped window, sliding window, and landmark window (Zubaroglu and Atalay 2021). The damped window models approach the data limitation by incorporating a weight factor when the data is processed. No object is removed from the window, but older data objects have lower importance for the model. The weighting is usually performed by a negative exponential function, such as $f(t)=2^{-\lambda t}$. SNCStream (Barddal et al. 2015) and its extension SNCStream+ (Barddal et al. 2016) operate in a damped window scenario in a single-pass manner. They are based on social network theory and use homophily to identify non-hyperspherical clusters. Similarly, pcStream (Mirsky et al. 2015) is also defined to operate in a damped window mode. pcStream dynamically detects and manages temporal contexts by fusing sensor data streams to infer the present concepts and detects new concepts as they emerge.
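The damped window weighting can be illustrated with a short Python sketch of $f(t)=2^{-\lambda t}$, where t is interpreted here as the age of an object. The helper names and the weighted-mean usage are assumptions for illustration and are not taken from any of the cited algorithms.

```python
def decay_weight(age, lam):
    """Weight f(t) = 2^(-lambda * t) of an object that arrived `age` time units ago."""
    return 2.0 ** (-lam * age)

def weighted_mean(values, ages, lam):
    """Mean of the values in which older objects count less (illustrative use of the weights)."""
    weights = [decay_weight(age, lam) for age in ages]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

With this function, an object's weight halves every $1/\lambda$ time units, so $\lambda$ directly controls how quickly old data fades from the model.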

The sliding window models approach the data stream in a similar way as damped windows but can be seen as stricter. Instead of having a decaying function, the sliding window is of a fixed size, and all objects within the sliding window have the same level of importance. When an object is added at one end of the window, another object is removed at the other end of the window, providing a window sliding over the stream.
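A fixed-size sliding window maps naturally onto a bounded double-ended queue. The following Python sketch is a minimal illustration; the class name `SlidingWindow` is hypothetical, and the window size is assumed to be at least 1.

```python
from collections import deque

class SlidingWindow:
    """Keeps the `size` most recent objects, all with equal importance."""

    def __init__(self, size):
        self.items = deque(maxlen=size)

    def add(self, obj):
        """Append obj and return the object pushed out of the window, if any."""
        is_full = len(self.items) == self.items.maxlen
        evicted = self.items[0] if is_full else None
        self.items.append(obj)  # deque drops the oldest item automatically when full
        return evicted
```

Returning the evicted object lets a clustering model subtract its contribution, which is how the window appears to slide over the stream.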

DenStream (Cao et al. 2006) and its extension HDDStream (Ntoutsi et al. 2012), which handles high-dimensional data, follow an online-offline design and are based on the DBSCAN clustering algorithm. The online procedures of the algorithms produce micro-clusters based on the density, which are later fed to the offline procedures, where the real clusters are created. WCDS (Cardoso et al. 2017) also follows the online-offline approach, creating micro-clusters in the online phase, but its offline phase is based on an agglomerative clustering algorithm to define its top-level clusters.

In contrast, landmark window models divide the data by assigning fixed landmarks where all data between two landmarks is a window. When a landmark is reached and a window ends, the succeeding window starts from that landmark point. A typical approach for landmark window-based clustering algorithms is to use the divided data stream to cluster them separately and use the produced centroids for that segment as representation, usually with partition-based algorithms.
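Landmark windowing with fixed-size segments can be sketched as a Python generator. The fixed segment size is an assumption for this illustration (landmarks could equally be time-based), and the function name is hypothetical.

```python
def landmark_windows(stream, segment_size):
    """Yield consecutive, non-overlapping segments of the stream."""
    segment = []
    for obj in stream:
        segment.append(obj)
        if len(segment) == segment_size:
            # Landmark reached: close this window and start the next one here.
            yield segment
            segment = []
    if segment:
        yield segment  # trailing, partially filled window
```

Each yielded segment can then be clustered on its own, for example with a partition-based algorithm, and represented by its centroids.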

The Stream framework was one of the earliest methods for stream clustering (Guha and Mishra 2016). The data stream is divided into segments, and each segment is clustered by k-median. The produced cluster centers from the segments are added into buckets representing a prototype array, ending up with $k \cdot i$ medians, where i is the number of clustered segments. Whenever the number of stored medians surpasses a parameter m, k-median is run on the prototype array to produce a median-of-medians solution. Stream LSearch is an extension of the Stream method, where a more effective subroutine for the underlying k-median, called LSearch, was introduced (O’callaghan et al. 2002). In 2015, a stream adaptation of the k-means++ algorithm, called StreamKM++ (Anderson and Koh 2015), was proposed. StreamKM++ creates a coreset tree by sampling a subset of the segment and solves the optimization problem on that subset without touching the rest of the segment. These sets are then stored in buffers that are merged whenever a new segment is clustered. StreamXM is a continuation of StreamKM++ and operates similarly, with X-means as the underlying clustering algorithm (Anderson and Koh 2015).

DUCstream also divides the stream into segments that are manageable for the system memory, but its underlying structure is instead a density-based algorithm (Gao et al. 2005). DUCstream partitions the data space into units and maps the incoming objects to the units; the more objects mapped to a unit, the denser it is. These dense units are then used to perform clustering.

None of the methods mentioned above are explicitly tailored for our specified problem. They aim to model the entire stream with a single clustering solution as best as possible. We instead intend to divide the stream into segments, cluster them separately, and use the clustered segments to see how the stream evolves. With a clustering solution produced for each segment, it is easier to analyze the data between segments and trace how clusters have remained, changed, disappeared, or appeared. We have identified two approaches that similarly address the data stream clustering problem, namely the PivotBiCluster (Ailon et al. 2012) and Split-Merge Evolutionary Clustering (Boeva et al. 2019) algorithms.

3.3 PivotBiCluster

The first algorithm we compare with is PivotBiCluster (Ailon et al. 2012), an algorithm related to Bipartite Correlation Clustering (BCC) (Amit 2004). BCC builds upon the notion of taking two clustering solutions and combining them into a larger solution, either by directly combining two clusters from different solutions or by dividing a cluster from one solution into several clusters in the other solution. The combination is decided by the correlation between the clusters of the different clustering solutions.

Referring to our problem statement, located in Sect. 4.1, PivotBiCluster assumes two data segments have been clustered beforehand, e.g., $D_0$ and $D_1$, thus producing $C_0$ and $C_1$. These two clustering solutions ($C_0$ and $C_1$) are then given to PivotBiCluster, which tries to combine them, creating $C'_1$, by merging clusters from each solution based on how similar they are to each other. The correlation clustering can be applied over and over; thus, in the formalized problem statement, PivotBiCluster continues to create a large clustering solution by using $C'_1$ in combination with $C_2$ to produce $C'_2$.

One of the drawbacks of the PivotBiCluster algorithm is a lack of the ability to split a cluster into several others in the other clustering solutions. This drawback was the primary motivation of the Split-Merge Evolutionary Clustering algorithm.

3.4 Split-merge evolutionary clustering

The Split-Merge Evolutionary Clustering (Split-Merge Clustering) algorithm builds upon the idea present in BCC clustering algorithms, with the addition of splitting a cluster into multiple clusters in the other clustering solution (Boeva et al. 2019). This means that if a larger cluster exists within one of the clustering solutions, it can be split up into multiple clusters in the algorithm’s output. Similar to the approach of PivotBiCluster, the Split-Merge Clustering algorithm also assumes the incoming data from the data stream D is clustered in advance. Additionally, just as with PivotBiCluster, the clustering can occur continuously, thus creating a final large clustering solution that contains all the data elements from the dataset. Another option is to take intermediary steps to filter out the data from the previous segment(s) to create a model that better reflects the data currently present in the data stream.

One of the significant benefits of Split-Merge Clustering, compared to PivotBiCluster, is the splitting of clusters. The authors claim that with that addition, the algorithm is less sensitive to under- and over-clustering of each data segment, as the clusters are now more easily modified over time.

In contrast to our proposed algorithm, which we present in the following section, the Split-Merge Clustering algorithm links the old and new clustering solutions together. Depending on what type of data is being analyzed, this can be counter-productive if the clustering aims not to solely focus on current trends in the data.

4 An evolutionary clustering algorithm

4.1 Problem statement

Let us formalize the evolving data scenario we aim to address. Assume that D is a continuous stream of data, and a vector of features represents each data point. $D_0$ is the initial data segment which has been partitioned into k clusters, $C_0 = \{C_{00}, \ldots , C_{0k}\}$. Additionally, $D_1,\ldots ,D_t$, where $t \rightarrow \infty$, are continuous segments of data in the stream to be partitioned. Our objective is to produce a clustering solution, or clustering solutions, modeling how the data evolves.

4.2 EvolveCluster: an evolutionary clustering algorithm

In this section, we formally describe the proposed sequential partitioning algorithm, entitled EvolveCluster. The main idea of EvolveCluster is to allow a continuous data behavior to be easily modeled, by incorporating gained knowledge from the previous data segments, in the form of the cluster centroids, to influence the clustering of the new data segment. Using previous centroids, we can trace how the clusters evolve as the clusters are related over the data segments. Likewise, with each segment being clustered individually, it is easy to identify reoccurring trends between segments and changes in the data. The algorithm idea is schematically illustrated in Fig. 1.

Similar to both PivotBiCluster and Split-Merge Clustering, EvolveCluster divides the data stream into individual data segments. Likewise, the initialization of EvolveCluster requires the first data segment to be clustered in advance. The remaining data segments are, however, clustered within EvolveCluster. Each segment is sequentially clustered, using a partitioning algorithm, with the aid of the cluster centroids (seeds) from the previous segment. EvolveCluster assumes the new data segment contains at least some of the structure from the previous segment by incorporating the clustering structure from the previous segment. The following operations are conducted on each new data segment:

The data points of the segment are initially clustered by seeding with the cluster centroids of the previous segment;
The old centroids are removed and any empty clusters are deleted;
New centroids for the clusters are elected, and the clustering solution is refined.
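The three operations above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, Euclidean distance is assumed, and the refinement uses a simple alternating k-medoids scheme (elect medoids, reassign points) rather than the swap-based refinement of the original k-medoids algorithm.

```python
import math


def initial_partition(segment, prev_centroids):
    """Assign each point to its closest previous centroid.

    Clusters are lists of indices into `segment`, so the old centroids are
    never members. Clusters that attract no points are dropped.
    """
    clusters = [[] for _ in prev_centroids]
    for idx, point in enumerate(segment):
        nearest = min(range(len(prev_centroids)),
                      key=lambda c: math.dist(point, prev_centroids[c]))
        clusters[nearest].append(idx)
    return [c for c in clusters if c]


def elect_medoid(segment, members):
    """Return the member with the smallest total distance to its cluster."""
    return min(members,
               key=lambda m: sum(math.dist(segment[m], segment[o])
                                 for o in members))


def refine(segment, clusters, max_iter=100):
    """Alternate medoid election and reassignment until the partition is stable."""
    for _ in range(max_iter):
        medoids = [elect_medoid(segment, c) for c in clusters]
        new_clusters = [[] for _ in medoids]
        for idx, point in enumerate(segment):
            nearest = min(range(len(medoids)),
                          key=lambda c: math.dist(point, segment[medoids[c]]))
            new_clusters[nearest].append(idx)
        new_clusters = [c for c in new_clusters if c]
        if new_clusters == clusters:
            break
        clusters = new_clusters
    return clusters, [elect_medoid(segment, c) for c in clusters]
```

A seed that lies far from every point of the new segment attracts nothing, so its cluster disappears in `initial_partition`; this is how a behavior that has left the stream is removed from the model in this sketch.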

The refined clustering solution undergoes a “trial-and-error” approach to detect if any clusters should be split into two by applying a 2-means clustering algorithm on a per-cluster basis. The 2-means clustering algorithm is initialized with the two data points in the cluster that are furthest apart. If the clustering solution containing the split clusters is deemed the better clustering solution, by a validation measure, it is kept. Otherwise, it is discarded. The algorithmic steps conducted at each data segment are defined in Algorithm 1.
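A single split trial could look like the sketch below. It is an illustration under stated assumptions rather than a transcription of Algorithm 1: the function names are hypothetical, Euclidean distance is assumed, the Silhouette Index serves as the validation measure, and an alternating 2-medoids routine stands in for the 2-means step.

```python
import math


def silhouette(points, clusters):
    """Mean Silhouette Index of a partition given as lists of point indices."""
    if len(clusters) < 2:
        return -1.0  # assumption: a single cluster gets the worst score
    total = 0.0
    for ci, members in enumerate(clusters):
        if len(members) == 1:
            continue  # points in singleton clusters score 0 by convention
        for p in members:
            a = sum(math.dist(points[p], points[q])
                    for q in members) / (len(members) - 1)
            b = min(sum(math.dist(points[p], points[q])
                        for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / sum(len(c) for c in clusters)


def furthest_pair(points, members):
    """Return the two members with the largest pairwise distance."""
    return max(((p, q) for i, p in enumerate(members) for q in members[i + 1:]),
               key=lambda pq: math.dist(points[pq[0]], points[pq[1]]))


def two_medoids(points, members, seeds, max_iter=100):
    """Split `members` in two by alternating assignment and medoid election."""
    medoids = list(seeds)
    halves = [[], []]
    for _ in range(max_iter):
        halves = [[], []]
        for p in members:
            d0 = math.dist(points[p], points[medoids[0]])
            d1 = math.dist(points[p], points[medoids[1]])
            halves[0 if d0 <= d1 else 1].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split, e.g. duplicated points
        new_medoids = [min(h, key=lambda m: sum(math.dist(points[m], points[o])
                                                for o in h))
                       for h in halves]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return halves


def try_split(points, clusters, target):
    """Keep the split of clusters[target] only if the Silhouette Index improves."""
    members = clusters[target]
    if len(members) < 2:
        return clusters
    halves = two_medoids(points, members, furthest_pair(points, members))
    if not all(halves):
        return clusters
    candidate = clusters[:target] + halves + clusters[target + 1:]
    if silhouette(points, candidate) > silhouette(points, clusters):
        return candidate
    return clusters
```

In this sketch a cluster that actually holds two well-separated groups is replaced by its two halves, whereas splitting a compact cluster lowers the score and the original solution is returned unchanged.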

4.3 Computational complexity

In this section, we examine the computational costs of the clustering and splitting operations of the proposed algorithm. The computational complexity depends on the underlying clustering algorithm. The approach proposed in this study uses k-medoids, a distance-based partitioning algorithm. k-medoids requires a distance matrix of size $n \times n$ to be computed, where n is the number of elements. Computing and storing the distance matrix dominates both the running time and the memory consumption of the algorithm, with complexities of $O\left( n^2d\right)$ and $O\left( n^2\right)$, respectively, where d is the dimension of the feature space.
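The two costs can be read directly off a plain implementation of the distance matrix. The sketch below assumes Euclidean distance and uses a hypothetical function name; it is written with the standard library only, for clarity rather than speed.

```python
import math


def distance_matrix(points):
    """Full pairwise Euclidean distance matrix.

    Time:   O(n^2 d), since there are O(n^2) pairs and each distance
            costs O(d) operations.
    Memory: O(n^2), one stored value per ordered pair.
    """
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The matrix is symmetric, so each pair is computed once
            # and mirrored; this halves the work but not the order.
            matrix[i][j] = matrix[j][i] = math.dist(points[i], points[j])
    return matrix
```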

In this study, we propose the use of k-medoids, whose complexity has been thoroughly studied (Schubert and Rousseeuw 2019). We can divide k-medoids into two parts: i) Initialization and ii) Refinement. The initialization according to the original implementation, which attempts to identify a beneficial starting point, has a complexity of $O\left( n^2k\right)$, where k is the number of clusters. When initial medoids are provided, or randomly chosen, the complexity instead becomes $O\left( nk\right)$. The refinement process, however, remains the same as originally defined, with a complexity of $O(\left( n-k\right) ^2ki)$, where i is the number of iterations performed in the refinement. This simplifies to $O\left( n^2ki\right)$ as $k \ll n$.

Here, we present the computational complexity of a single iteration of EvolveCluster. Suppose n is the number of data instances in the entire dataset and $n'$ is the number of instances in each data segment, where $n' \ll n$. The initial clustering occurs in two steps, InitialPartition and RefineSolution, as defined in Algorithm 1 (steps one and two, respectively). InitialPartition assigns each data object in the current segment to the closest centroid, removes the initial centroids, and deletes any empty clusters, with a computational cost of $O(n'k + k + k) \rightarrow O(n'k)$. RefineSolution is a direct implementation of the original k-medoids algorithm, giving a complexity of $O(n'^2ki)$. The initial clustering of EvolveCluster then becomes $O(n'k + n'^2ki)$, which can be simplified to $O(n'^2ki)$.

The split criterion of EvolveCluster is calculated once outside the loop and once each time a split is performed inside the loop. In the proposed approach, we use the SI as our measure for the split criterion, which has a computational complexity of $O\left( 2n'+n'^2\right)$. EvolveCluster splits a cluster by first identifying the two elements that are the furthest apart, $O(n'^2)$. k-medoids is then applied with $k=2$, using the two identified elements as initial centroids, i.e. $O(n'k + n'^2ki)$. A single iteration of the splitting loop then becomes $O(n'^2 + n'k + n'^2ki + 2n' + n'^2)$, which we can simplify to $O(n'^2ki)$.

As each cluster in the produced clustering solution $C_t$ undergoes at least one split attempt, the splitting loop of EvolveCluster runs at least k times. This gives a lower bound for splitting of $O(k(n'^2ki)) \rightarrow O(n'^2k^2i)$. The upper bound, on the other hand, is dramatically higher. In the worst case, a split is performed in every iteration, which causes the final clustering solution to consist solely of singleton clusters, i.e., $n'$ iterations. The upper bound then becomes $O(n'(n'^2ki)) \rightarrow O(n'^3ki)$. Finally, the total complexity for each increment of the EvolveCluster algorithm, with the inclusion of the distance matrix calculation, is $O(n'^2ki + n'^2k^2i + n'^2d) \rightarrow O(n'^2(k^2i + ki + d)) \rightarrow O(n'^2(k^2i + d))$. If we instead use the upper bound for splitting, the complexity of EvolveCluster becomes $O(n'^2ki + n'^3ki + n'^2d) \rightarrow O(n'^3ki)$.
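To give a feeling for the gap between the two bounds, the sketch below evaluates their dominant terms for concrete parameter values. The function names and the parameter name `n_seg` (standing for $n'$) are chosen for this example, and the returned numbers are growth terms only: asymptotic notation hides constant factors, so they should not be read as operation counts.

```python
def lower_bound_term(n_seg, k, i, d):
    """Dominant term n'^2 (k^2 i + d): one split attempt per cluster."""
    return n_seg ** 2 * (k ** 2 * i + d)


def upper_bound_term(n_seg, k, i):
    """Dominant term n'^3 k i: a split in every iteration, down to singletons."""
    return n_seg ** 3 * k * i


# Example values (assumed): 1000 points per segment, 5 clusters,
# 10 refinement iterations, 2 features.
low = lower_bound_term(1000, 5, 10, 2)
high = upper_bound_term(1000, 5, 10)
```

For these values the lower-bound term is $2.52 \times 10^{8}$ and the upper-bound term is $5 \times 10^{10}$, a factor of roughly 200, which grows linearly with the segment size $n'$ when k, i and d are held fixed.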

Similarly to EvolveCluster, the complexity of Evolutionary Split-Merge Clustering depends on the underlying clustering algorithm (Boeva et al. 2019). Evolutionary Split-Merge Clustering adds the additional computational overhead $O((k'+k')n')$. In combination with k-medoids, its complexity becomes $O((k'+k')n' + n'^2ki) \rightarrow O(n'^2ki)$, which is in line with the lower bound complexity derived for EvolveCluster. The authors of PivotBiCluster, on the contrary, have not reported a complexity analysis for the clustering scenario (Ailon et al. 2012). Thus, we have no direct comparison to perform.

5 Data and experimental designs

We perform two sets of experiments to investigate the effectiveness of EvolveCluster. The first experiment compares EvolveCluster with two similar clustering algorithms on three different datasets to analyze their differences and performance. In our second experiment, we analyze how EvolveCluster handles different concept drift scenarios by generating a synthetic data stream.

5.1 Experiment 1: comparative analysis

5.1.1 Data

We evaluate and compare the performance of the proposed EvolveCluster algorithm to two other clustering algorithms (PivotBiCluster and Split-Merge Clustering) on three different datasets, described in Table 1. The first is the S1 dataset, a 2-dimensional synthetic dataset created by Fränti and Virmajoki (2006). This dataset is chosen to investigate the algorithms' ability to identify new clusters as they arrive in the data stream, and how well they cluster a constant type of behavior over time.

The second dataset is a subset of the Covertype dataset, available at the UCI repository (Hettich and Bay 1999). The motivation behind the use of this dataset is mainly to have a direct comparison to both the PivotBiCluster and Split-Merge Clustering algorithms, as the authors of the latter algorithm have performed experiments upon it in their paper (Boeva et al. 2019). However, it is also chosen due to its larger number of data points in combination with a higher dimensionality of its features compared to the S1 dataset.

Finally, the third dataset is a real-world electricity consumption dataset, the Domestic Electrical Load Metering, Hourly Data (DELMH) (Toussaint 2019). DELMH contains consumption from a large number of households and metering stations in South Africa covering the period from 1994 to 2014, with measurements taken up to every 5 minutes. It is worth noting that the single household with the longest consumption period amounts to roughly two years' worth of consumption. This type of dataset is one of the main target areas for our proposed algorithm.

All information about the used datasets in their original form is presented in Table 1.

Table 1 Information regarding number of features and instances of each dataset in their original form. The number of instances in the DELMH dataset are individual measurements from 71 up to 2940 concurrent households over 21 years, varying between 1 measurement up to 12076 per household
