
1 Introduction

The development of technology and the increasing number of devices connected to one general-purpose network have made it possible to download and store massive amounts of constantly streaming data. Regardless of the information type, which might be readings from a sensor mesh or news outlets [1], it always has to be processed into a resource that users can further utilize – knowledge. It has therefore become inevitable for the topic to be widely researched, leading to the development of data mining techniques for data streams.

Undoubtedly, one of the discussed challenges is to design algorithms capable of continuous data stream processing [2]. The proposed models must therefore adjust to constantly changing data characteristics, a phenomenon known as concept drift [3]. Moreover, models have to provide a prediction before the next batch of data is available, which constrains the algorithm in terms of computational complexity.

The vast majority of recent works focus on data stream classification, introducing various approaches for dealing with concept drift. Many of the proposed methods employ classifier ensembles [4] and dynamic selection of classifiers [5]. Moreover, various methods are based on concept drift detectors [6] and on prior probability estimation [7], which indicate when the data characteristics are changing.

Similarly, data streams have also been researched in the context of the clustering task [8]. The proposed algorithms often adapt existing methods to data stream processing [9]. At the same time, research on semi-supervised clustering with pair-wise constraints [10] has mainly focused on the integration of constraints into classical algorithms [11] and on methods based on optimization algorithms [12]. The adaptation of those algorithms to data streams has been proposed only in a few recent works.

One of the main techniques for this task is C-DenStream [13], which adapts the DBSCAN algorithm by employing micro-clusters constructed on consecutive chunks. Another example is the SemiStream [14] algorithm, based on initial clustering with MPCk-means, followed by the detection of o-clusters and their assignment to s-clusters and m-clusters. Finally, the CE-Stream algorithm [15] extends the E-Stream algorithm, but it is limited to the must-link constraints set.

Noting that there is still a lack of pair-wise constrained clustering algorithms for data streams, the main contribution of this work is a proposal for employing the COPk-means and PCk-means algorithms for data stream clustering.

2 Algorithm

The data stream \(\mathcal{D}\mathcal{S}\) is a sequence of data chunks \(\mathcal{D}\mathcal{S} = \{ DS_1, DS_2, \ldots , DS_k\}\). Each data chunk contains a set of samples described by a feature vector X, for which the clustering algorithm \(\kappa (X)\) assigns a label describing a cluster C. Additionally, each chunk is provided with two lists of pair-wise constraints, ml and cl, denoting must-link pairs and cannot-link pairs, respectively. These extend the clustering context with expert knowledge, which does not provide cluster labels directly but only declares a relation between two samples.
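For illustration, a minimal sketch of how a single chunk with its constraint lists could be represented in Python is shown below; the class and field names are only assumptions introduced for the later sketches, not part of the proposed algorithms.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ConstrainedChunk:
    """One data chunk DS_t together with its pair-wise constraints."""
    X: np.ndarray                            # samples, shape (n_samples, n_features)
    ml: list = field(default_factory=list)   # must-link pairs as (i, j) index tuples
    cl: list = field(default_factory=list)   # cannot-link pairs as (i, j) index tuples

# hypothetical chunk: samples 0 and 1 should share a cluster, 0 and 2 should not
chunk = ConstrainedChunk(
    X=np.array([[0.1, 0.2], [0.15, 0.22], [5.0, 5.1]]),
    ml=[(0, 1)],
    cl=[(0, 2)],
)
```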

The k-means clustering algorithm is an inductive learning method. The model is trained iteratively to minimize the intra-cluster variance and maximize the inter-cluster variance. In the first phase of the algorithm, k cluster centroids are initialized as points in the feature space. Then, each observation is assigned to a cluster by finding the nearest centroid. After this procedure, new centroids are calculated by averaging all assigned observations. The procedure is repeated until the model reaches convergence, which means that the new centroids cannot be shifted any further. An additional stop condition, which guarantees algorithm execution in a feasible time, is the maximum number of iterations. The algorithm pseudocode is presented in Algorithm 1.

Algorithm 1. Pseudocode of the k-means clustering algorithm.
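For reference, a minimal NumPy sketch of the loop described above (random initialization, nearest-centroid assignment, centroid update, convergence or iteration-limit stop) is given below; it is an illustration only and not a transcription of Algorithm 1.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=None):
    """Plain k-means: random init, nearest-centroid assignment, mean update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(max_iter):
        # assign each sample to the nearest centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recompute each centroid as the mean of its assigned samples
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.linalg.norm(new_centroids - centroids) < tol:   # convergence reached
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```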

Constraints can be integrated into the k-means algorithm using one of two rules, defined as follows:

  • hard rule: cluster labels assigned by the algorithm cannot be inconsistent with the constraints. This rule is used by the COPk-means (copk) algorithm. Each label assignment is verified for feasibility against the constraints. If one of the constraints is violated, the algorithm terminates, leaving the rest of the samples unassigned.

  • soft rule: cluster labels can be inconsistent with the constraints, but constraint violation is penalized as an additional term in the minimized criterion. This rule is used by the PCk-means (pck) algorithm. The criterion minimized by the algorithm is defined as follows:

    $$\mathcal{J}_{pckm} = \frac{1}{2} \sum_{x_i \in X} \Vert x_i - \mu_{l_i} \Vert^2 + \sum_{(x_i, x_j) \in \textsc{ml}} w_{ij} \mathbbm{1}[l_i \ne l_j] + \sum_{(x_i, x_j) \in \textsc{cl}} \bar{w}_{ij} \mathbbm{1}[l_i = l_j]$$
    (1)

    This approach ensures that all samples are assigned to the estimated clusters; a minimal sketch of computing this criterion is given after the list.
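The sketch below computes the criterion from Eq. (1) for a given assignment; for simplicity, the weights \(w_{ij}\) and \(\bar{w}_{ij}\) are assumed to be constant, which is a simplification of the general formulation.

```python
import numpy as np

def pckm_objective(X, labels, centroids, ml, cl, w=1.0, w_bar=1.0):
    """J_pckm (Eq. 1): distortion plus penalties for violated constraints."""
    distortion = 0.5 * np.sum(np.linalg.norm(X - centroids[labels], axis=1) ** 2)
    ml_penalty = sum(w for i, j in ml if labels[i] != labels[j])       # violated must-links
    cl_penalty = sum(w_bar for i, j in cl if labels[i] == labels[j])   # violated cannot-links
    return distortion + ml_penalty + cl_penalty
```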

Another crucial part of the algorithm is selecting a proper cluster initialization method. The common methods are:

  • random - which is very fast, but a poor initial match may lead to very slow convergence.

  • kmeans++ - allowing a better estimation of the initial centroids, but it introduces additional computation overhead.

For constrained clustering problems, centroids can also be initialized based on the transitive closure of the must-link constraints, averaging the largest spanning trees created during this procedure. The complexity of such a method depends on the underlying spanning-tree search algorithm. A possible realization is sketched below.
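In this sketch, must-link pairs are treated as edges of a graph, its connected components (the transitive closure) are found with a simple union-find structure, and the k largest components are averaged into initial centroids. The union-find choice is an assumption; any component or spanning-tree search would serve the same purpose.

```python
import numpy as np

def ml_closure_init(X, ml, k):
    """Initialize centroids from the transitive closure of must-link constraints."""
    parent = list(range(len(X)))

    def find(i):                       # union-find root lookup with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in ml:                    # union the endpoints of every must-link pair
        parent[find(i)] = find(j)

    components = {}
    for i in range(len(X)):
        components.setdefault(find(i), []).append(i)

    # average the k largest components to obtain the initial centroids
    largest = sorted(components.values(), key=len, reverse=True)[:k]
    return np.array([X[idx].mean(axis=0) for idx in largest])
```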

Since the goal is to adapt the algorithms to process data streams, it is necessary to make changes to the initialization and the modification of the centroids in the copk and pck algorithms. The authors propose two new algorithms, namely copk-s and pck-s, introducing modifications to the described procedures.

Initialization. In sequential data chunk processing, the model created on the previous chunk has already determined centroids that can be used for the next one. It can be assumed that those centroids are close to optimal, and, at the same time, reusing them reduces the additional computational effort.

Assignment. The base algorithms assume that k clusters will be formed. However, a sudden concept drift may lead to the complete disappearance of one of the clusters. It may turn out that no sample will be assigned to a given centroid, which at the same time prevents it from being shifted any further. Therefore, an additional modification is introduced in the proposed algorithms: centroids to which no samples were assigned are omitted, but the same centroid will be reused for clustering in the next chunk. Both modifications are sketched below.
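In the following sketch, `constrained_kmeans` is a placeholder standing for either copk or pck, and its signature is assumed for illustration only.

```python
import numpy as np

def process_stream(chunks, k, constrained_kmeans):
    """Sketch of copk-s / pck-s chunk processing with centroid reuse."""
    centroids = None
    predictions = []
    for chunk in chunks:
        if centroids is None:
            # first chunk: standard initialization (e.g. random or kmeans++)
            labels, centroids = constrained_kmeans(chunk.X, k, chunk.ml, chunk.cl)
        else:
            # later chunks: reuse the centroids determined on the previous chunk
            labels, new_centroids = constrained_kmeans(
                chunk.X, k, chunk.ml, chunk.cl, init=centroids)
            for c in range(k):
                # centroids without assigned samples keep their previous position
                if np.any(labels == c):
                    centroids[c] = new_centroids[c]
        predictions.append(labels)
    return predictions
```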

3 Experiments

The proposed methods of adapting the k-means algorithms were tested in a series of experiments in order to find answers to the following research questions:

RQ1: What impact do the proposed modifications have on the algorithm's ability to adapt to the occurring concept drifts?

RQ2: What impact do the proposed modifications have on the clustering quality?

3.1 Research Protocol

The following experimental protocol was formulated to answer the research questions. Both synthetic streams – for which it was possible to observe the algorithm's operation under a defined behavior of the concept drift – and real-life data were used, allowing for an actual quality assessment of the proposed methods.

Synthetic data streams were generated with both concept drift and dynamic imbalance, based on an artificial classification problem. Additionally, two further sets were created in which the defined clusters were generated from normal distributions with given means. This modification was aimed at creating a set dedicated to the clustering algorithm, bearing in mind that the objective of the algorithm is to correctly determine the local concentration of observations and not – as opposed to classification – to determine the correct labels.

There are no real data sets for the constrained clustering task with constraints provided by an actual expert. Therefore, it was necessary to generate the constraints from a given set of labels for both the synthetic and real streams. Each set may introduce a natural disproportion between ml and cl, which is related to the number of classes and their volumes. However, with a complete set of labels, it is possible to determine \( \left( {\begin{array}{c}n\\ 2\end{array}}\right) \) pairs of n samples and the constraint type based on the equality of their labels. Therefore, the proposed method of generating the constraints uses the percentage of possible sample connections, which defines the number of pairs selected randomly and transformed into constraints.
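A possible implementation of this constraint generation is sketched below: a fixed fraction of all \(\left( {\begin{array}{c}n\\ 2\end{array}}\right)\) pairs in a chunk is drawn at random, and each drawn pair becomes a must-link or a cannot-link constraint depending on whether its class labels agree. Function and parameter names are illustrative.

```python
from itertools import combinations
import numpy as np

def generate_constraints(y, ratio=0.01, seed=None):
    """Draw `ratio` of all possible sample pairs and split them into ml / cl lists."""
    rng = np.random.default_rng(seed)
    pairs = list(combinations(range(len(y)), 2))            # all n-choose-2 pairs
    selected = rng.choice(len(pairs), size=int(ratio * len(pairs)), replace=False)
    ml, cl = [], []
    for idx in selected:
        i, j = pairs[idx]
        (ml if y[i] == y[j] else cl).append((i, j))          # equal labels -> must-link
    return ml, cl
```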

The quality of clustering on the selected data streams was assessed in a sequential protocol typical for this type of research. The incoming data are divided into chunks of equal size. The model is trained using the entire chunk, while the evaluation process follows the protocol for evaluating clustering algorithms. Additionally, for all tested methods, the constraint lists are provided for each data chunk, and the set of labels itself is only used to compute the clustering metric.
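A sketch of this sequential protocol, assuming equally sized chunks, the constraint generation sketched above, and the Adjusted Rand Index implementation from scikit-learn, might look as follows; `cluster_chunk` is a placeholder for any of the evaluated methods.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def evaluate_stream(X, y, cluster_chunk, chunk_size=200, ratio=0.01):
    """Chunk-wise evaluation: cluster each chunk with its constraints, score with ARI."""
    scores = []
    for start in range(0, len(X) - chunk_size + 1, chunk_size):
        X_chunk = X[start:start + chunk_size]
        y_chunk = y[start:start + chunk_size]
        ml, cl = generate_constraints(y_chunk, ratio=ratio)  # labels used only here...
        labels = cluster_chunk(X_chunk, ml, cl)              # ...and for the metric below
        scores.append(adjusted_rand_score(y_chunk, labels))
    return np.array(scores)
```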

3.2 Experimental Setup

Eight data streams were used for the experiments, four of which were synthetically generated, and four were prepared based on real data sets obtained from MOA and CSE. A detailed description of the data sets is presented in Table 1.

Table 1. Summary of benchmark datasets.

Selected data sets were divided into chunks of 200 samples. The real-life data streams were limited to the first 200 chunks, which preserved the original problem characteristics and provided a convenient length for reliable analysis.

The constraints were generated for each chunk using the original problem labels. The ratio of generated constraints to all possible constraints was constant along the stream, and each data set was evaluated in three configurations of \(1\%\), \(2\%\), and \(5\%\) of possible constraints. The metric selected to measure and compare the clustering quality of the evaluated algorithms was the Adjusted Rand Index (ARI).

The tested algorithms and the experimental environment were implemented in the Python programming language, using the scikit-learn [16] and stream-learn [17] libraries. The repository allowing for the replication of the presented results is available on GitHub.

3.3 Results

The graphs show ARI scores for both proposed methods and the reference approaches. In addition, for each method, the metric values are shown (a) for the entire data stream length and (b) averaged over all data chunks. This approach allows for a detailed analysis of the algorithm performance while enabling comparison in the context of the entire problem. Finally, all presented runs were smoothed using the cumulative sum.

Synthetic Datasets. Research on synthetic sets was carried out to study the algorithm behavior concerning the occurring drifts. The results obtained for synthetic data streams with concept drift are presented in Fig. 1.

One may observe that for the analyzed problem, pck-s achieved the best results. It should be underlined that in the case of Blobs Gradual, the clustering quality of the base pck and of the proposed method was equal. Moreover, all the algorithms performed poorly around the cluster intersection – which was expected behavior – but ARI significantly improved when 5% of constraints were provided. At the same time, attention should be paid to the large variance of the obtained metric, indicating its potentially high impact on model diversification. Furthermore, despite the inferior clustering quality of copk, a significant improvement is noticeable for copk-s. However – addressing RQ1 – it cannot be unequivocally stated that the proposed methods allow for faster adaptation to the concept drift.

For the Classification Gradual problem, it can be seen that both pck-s and copk-s are better than the base algorithms and more stable – especially in the case where 2% of constraints were used. In the runs with 1% of constraints, an interesting behavior is visible at the beginning of the stream – the overall model quality improves in the early run. It may suggest that the proposed initialization method causes some delay in achieving the algorithm's convergence, which stabilizes only after some time.

It is also essential to pay attention to the influence of the number of constraints on the tested algorithms. In all cases, increasing this parameter did not affect the clustering quality of copk and had only a slight effect on copk-s, while significantly improving the pck and pck-s algorithms, leading to a stable result in Classification Gradual with 5% of constraints.

For the Blobs Gradual problem, the pck and pck-s algorithms achieve promising results using 5% of constraints and minimize their variance while becoming insensitive to the emerging imbalance. It is also worth noting that, in contrast to the previously analyzed problem, a much smaller decrease in the quality of pck-s clustering can be observed at the time of imbalance. A similar relationship can be observed in Classification Gradual. In the scenarios where the percentage of constraints is 0.5% and 1%, there is a sudden increase in the metric as the classes approach the prior equilibrium, followed by a rapid decrease. In the last case, pck-s adjusts to the imbalance the fastest, giving the metric the most stable value.

Fig. 1. Results for synthetic datasets.

Real Datasets. The results of experiments carried out on real data streams are presented in Fig. 2. The results analysis is carried out to address RQ2. It is impossible to indicate the best clustering algorithm for the forest covertype and electricity sets when 1% and 2% of constraints were selected. Noticeable differences between the algorithms appear only when 5% of constraints are considered, which significantly improves the pck and pck-s algorithms. An interesting drop in the quality of pck clustering on the covtypeNorm set starts in chunk 75, which cannot be observed for pck-s.

Significant differences can be observed in the case of the other two sets. It can be noted that for airlines, all methods performed poorly, but there is an increase in the clustering quality for copk and copk-s with an increasing number of constraints. Thus, this observation seems to be an exception to the rule observed in the previous sets. However, it should be noted that the metric increased only by 0.1 between the lowest and the highest percentage of constraints.

The most interesting observations can be made on the KddCup99 set, for which copk-s also turned out to be the best clustering algorithm. An interesting relationship can also be seen for this set, according to which adding constraints degrades the quality of clustering. Most likely, it is related to the specific behavior of the algorithm related to constraint feasibility [18].

Fig. 2. Results for real datasets.

4 Conclusions

Two clustering algorithms for data streams, pck-s and copk-s, were presented in this work. Both algorithms are adapted to a task rarely discussed in the literature. The series of studies showed the ability of the proposed methods to adapt to the existing concept drifts and to obtain better results than the base algorithms.

The presented results provide the basis for further research. One of the aspects discussed in this work is the proposal to extend the proposed methods so that they are better adapted to the changing nature of the constraints. In some cases, the clustering quality showed significant variance throughout the run. An appropriate selection of constraints may be crucial for stabilizing the results.

In addition, the practical aspect of using the proposed methods should be emphasized in the context of other machine learning tasks, especially active learning. Evaluating only the relation between two samples is easier for an expert than assigning them to imposed classes, which reduces the cost of obtaining such knowledge.