Abstract
Contemporary society is increasingly dependent on digital media and on tools supporting everyday activities, which causes a massive increase in incoming data, both in volume and frequency. Owing to this trend, unsupervised machine learning methods for data stream clustering have become a popular research topic in recent years. At the same time, semi-supervised constrained clustering is rarely considered in data stream clustering. To address this gap, the authors propose adaptations of k-means constrained clustering algorithms for imbalanced data stream clustering. The proposed algorithms were evaluated in a series of experiments on synthetic and real data clustering, which verified their ability to adapt to occurring concept drifts.
1 Introduction
The development of technology and the increasing number of devices connected to one general-purpose network have made it possible to collect and store massive amounts of constantly streaming data. Regardless of the information type – readings from a sensor mesh or news outlets [1] – it must always be processed into a resource that users can further utilize: knowledge. Therefore, it has become inevitable for the topic to be widely researched, leading to the development of data mining techniques for data streams.
Undoubtedly, one of the discussed challenges is to design algorithms capable of continuous data stream processing [2]. Such models must adjust to constantly changing data characteristics, a phenomenon named concept drift [3]. Moreover, models have to provide a prediction before the next batch of data is available, which constrains the algorithm in terms of computational complexity.
The vast majority of recent works focus on data stream classification, introducing various approaches for dealing with concept drift. Many of the proposed methods employ classifier ensembles [4] and dynamic selection of classifiers [5]. Moreover, various methods are based on concept drift detectors [6] and prior probability estimation [7], which indicate when the data characteristics are changing.
Similarly, data streams have also been researched in the context of the clustering task [8]. Proposed algorithms often adapt existing methods to data stream processing [9]. At the same time, research on semi-supervised clustering with pair-wise constraints [10] has mainly focused on the integration of constraints into classical algorithms [11] and on methods based on optimization algorithms [12]. The adaptation of those algorithms to data streams has been proposed only in a few recent works.
One of the main techniques for this task is C-DenStream [13], which adapts the DBSCAN algorithm, employing micro-clusters constructed on consecutive chunks. Another example is the SemiStream algorithm [14], based on initial clustering by MPCk-means, followed by the detection of o-clusters and their assignment to s-clusters and m-clusters. Finally, the CE-Stream algorithm [15] extends the E-Stream algorithm, but it is limited to must-link constraint sets.
Noting that there is still a lack of pair-wise constrained clustering algorithms for data streams, the main contribution of this work is a proposal for employing the COPk-means and PCk-means algorithms in data stream clustering.
2 Algorithm
The data stream \(\mathcal{D}\mathcal{S}\) is a sequence of data chunks \(\mathcal{D}\mathcal{S} = \{ DS_1, DS_2, \ldots , DS_k\}\). Each data chunk contains a set of samples described by a feature vector X, for which the clustering algorithm \(\kappa (X)\) assigns a label describing a cluster C. Additionally, each chunk is also provided with two lists of pair-wise constraints ml and cl, denoting must-link pairs and cannot-link pairs, respectively. Those extend the clustering context with expert knowledge, which does not provide cluster labels directly but only declares a relation between two samples.
The k-means clustering algorithm is an inductive learning method. The model is trained iteratively to minimize the intra-cluster variance and maximize the inter-cluster variance. In the first phase of the algorithm, k cluster centroids are initialized as points in the feature space. Then, each observation is assigned to a cluster by finding the nearest centroid. After this procedure, the new centroids are calculated by averaging all assigned observations. The procedure is repeated until the model reaches convergence, which means that the centroids can no longer be shifted. An additional stop condition, which guarantees algorithm execution in a feasible time, is the maximum number of iterations. The algorithm pseudocode is presented in Algorithm 1.
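The iterative procedure described above can be sketched in a few lines. The snippet below is a minimal NumPy illustration (the `kmeans` helper and the `init` parameter are hypothetical names introduced here), not the implementation used in the paper:

```python
import numpy as np

def kmeans(X, k, max_iter=100, init=None, rng=None):
    """Minimal k-means sketch: assign samples to the nearest centroid,
    recompute centroids as cluster means, stop on convergence or max_iter."""
    rng = np.random.default_rng(rng)
    if init is None:
        # random initialization: k distinct samples become the centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centroids = np.array(init, dtype=float)
    for _ in range(max_iter):  # additional stop condition: maximum iterations
        # assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: average of assigned samples (empty clusters stay put)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # convergence: centroids stop shifting
            break
        centroids = new
    return labels, centroids
```

The explicit `init` argument only makes the example deterministic; the random initialization branch corresponds to the first phase described above.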
Constraints can be integrated into the k-means algorithm using one of two rules, defined as follows:
- hard rule: cluster labels assigned by the algorithm cannot be inconsistent with constraints. This rule is used by the COPk-means (copk) algorithm. Each label assignment is verified for feasibility against the constraints. If any constraint is violated, the algorithm terminates, leaving the rest of the samples unassigned.
- soft rule: cluster labels can be inconsistent with constraints, but constraint violation is penalized as an additional factor in the minimized criterion. This rule is used by the PCk-means (pck) algorithm. The criterion minimized by the algorithm is defined as follows:
$$\begin{aligned} \mathcal {J}_{pckm} = \frac{1}{2} \sum _{x_i \in X} \Vert x_i - \mu _{l_i} \Vert ^2 + \sum _{(x_i, x_j) \in ml} w_{ij} \mathbb {1}[l_i \ne l_j] + \sum _{(x_i, x_j) \in cl} \bar{w}_{ij} \mathbb {1}[l_i = l_j] \end{aligned}$$
(1)

This approach ensures that all samples are assigned to estimated clusters.
Another crucial part of the algorithm is selecting a proper cluster initialization method. The common methods are:
- random – very fast, but a poor initial match may lead to very slow convergence.
- kmeans++ – allows a better estimation of the initial centroids, but introduces additional computational overhead.
For constrained clustering problems, centroids can also be initialized based on the transitive closure of the must-link constraints, averaging the members of the broadest spanning trees created during this procedure. The complexity of such a method depends on the underlying spanning-tree search algorithm.
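A possible sketch of this initialization is given below, assuming a union-find pass over the must-link graph; `closure_init` is a hypothetical helper name, and the largest connected neighbourhoods are averaged into centroids:

```python
import numpy as np

def closure_init(X, ml, k):
    """Initial centroids from the transitive closure of must-link constraints:
    samples connected through ml pairs form one neighbourhood, and the k
    largest neighbourhoods are averaged into candidate centroids."""
    parent = list(range(len(X)))

    def find(i):
        # find the root of i's neighbourhood, compressing the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in ml:  # union step: merge the two neighbourhoods
        parent[find(a)] = find(b)

    groups = {}
    for i in range(len(X)):
        groups.setdefault(find(i), []).append(i)
    # broadest neighbourhoods first; average their member samples
    comps = sorted(groups.values(), key=len, reverse=True)[:k]
    return np.array([X[idx].mean(axis=0) for idx in comps])
```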
Since the goal is to adapt the algorithms to data stream processing, it is necessary to modify the initialization and update of the centroids in the copk and pck algorithms. The authors propose two new algorithms, namely copk-s and pck-s, which introduce modifications to the described procedures.
Initialization. In sequential data chunk processing, the model created on the previous chunk has already determined centroids that can be used in the next one. It can be assumed that those centroids are close to optimal and, at the same time, reusing them reduces additional computational effort.
Assignment. Base algorithms assume that k clusters will be formed. However, a sudden concept drift may lead to the complete disappearance of one of the clusters. It may turn out that no sample will be assigned to a given centroid, which prevents it from being shifted further. Therefore, an additional modification is introduced to the proposed algorithms: centroids to which no samples were assigned are omitted, but the same centroids are reused for clustering in the next chunk.
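The two modifications can be summarized in a small wrapper. The sketch below is hypothetical: `StreamKMeansWrapper`, `process_chunk`, and `base_fit` are names introduced here, and a plain one-pass k-means update (`one_step`) stands in for the constrained copk/pck fitting procedure so the example runs:

```python
import numpy as np

def one_step(X, centroids):
    """One assignment + update pass of plain k-means (stand-in for the
    constrained copk/pck fitting procedure)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    new = centroids.copy()
    for j in range(len(centroids)):
        if np.any(labels == j):
            new[j] = X[labels == j].mean(axis=0)
    return labels, new

class StreamKMeansWrapper:
    """Chunk-processing loop of copk-s/pck-s: centroids from the previous
    chunk warm-start the next one, and a centroid left without any assigned
    samples is skipped in this chunk but kept in place for the next."""

    def __init__(self, base_fit, k):
        self.base_fit = base_fit
        self.k = k
        self.centroids = None

    def process_chunk(self, X):
        if self.centroids is None:
            # first chunk: simple initialization from the first k samples
            self.centroids = X[:self.k].astype(float)
        labels, fitted = self.base_fit(X, self.centroids)
        for j in range(self.k):
            if np.any(labels == j):  # only shifted centroids are updated
                self.centroids[j] = fitted[j]
        return labels
```

The key line is the emptiness check: a vanished cluster's centroid is not overwritten (nor discarded), so it can still attract samples if the concept reappears in a later chunk.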
3 Experiments
The proposed methods of adapting the k-means algorithms were tested in a series of experiments in order to find answers to the following research questions:
- RQ1: What impact do the proposed modifications have on the algorithms' ability to adapt to occurring concept drifts?
- RQ2: What impact do the proposed modifications have on the clustering quality?
3.1 Research Protocol
The following experimental protocol was formulated to answer the research questions. Both synthetic streams – for which it was possible to observe the algorithms' operation under a defined concept drift behavior – and real-life data were used, allowing for an actual quality assessment of the proposed methods.
Synthetic data streams were generated with both concept drift and dynamic imbalance, based on an artificial classification problem. Additionally, two further sets were created in which the defined clusters were generated from normal distributions at given means. This modification aimed to create sets dedicated to clustering algorithms, bearing in mind that the objective of clustering is to correctly determine local concentrations of observations and not – as opposed to classification – to determine the correct labels.
There are no real data sets for the constrained clustering task with constraints provided by an actual expert. Therefore, it was necessary to generate constraints based on a given set of labels for both the synthetic and real streams. Each set may introduce a natural disproportion between ml and cl, related to the number of classes and their volumes. However, with a complete set of labels, it is possible to determine \( \left( {\begin{array}{c}n\\ 2\end{array}}\right) \) pairs from n samples and derive the constraint type from whether their labels match. Therefore, the proposed constraint-generation method uses a percentage of all possible sample connections, which defines the number of pairs selected randomly and transformed into constraints.
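Under these assumptions, constraint generation can be sketched as follows; `generate_constraints` is a hypothetical helper name, and `ratio` is the percentage of all possible pairs expressed as a fraction:

```python
import itertools
import numpy as np

def generate_constraints(y, ratio, rng=None):
    """Sample `ratio` of all n-choose-2 pairs and turn each into a must-link
    (same label) or cannot-link (different labels) constraint."""
    rng = np.random.default_rng(rng)
    pairs = list(itertools.combinations(range(len(y)), 2))
    n_pairs = int(ratio * len(pairs))
    chosen = rng.choice(len(pairs), size=n_pairs, replace=False)
    ml, cl = [], []
    for idx in chosen:
        i, j = pairs[idx]
        # the constraint type follows directly from label agreement
        (ml if y[i] == y[j] else cl).append((i, j))
    return ml, cl
```

Note that the ml/cl disproportion mentioned above emerges automatically: with few large classes, same-label pairs (and thus must-links) dominate the random sample.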
The quality of clustering on the selected data streams was assessed in a sequential protocol typical for this type of research. The incoming data are divided into chunks of equal size, and the model is trained using the entire chunk, while the evaluation process follows the protocol for evaluating clustering algorithms. Additionally, for all tested methods, constraint lists are provided for each data chunk, and the set of labels itself is used only to compute the clustering metric.
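A minimal sketch of this sequential protocol is shown below; `evaluate_stream` is a hypothetical name, and plain (unconstrained) scikit-learn `KMeans` stands in for the evaluated constrained algorithms:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def evaluate_stream(chunks, k):
    """Sequential protocol sketch: each (X, y) chunk is clustered in full and
    scored against the withheld labels with ARI; labels are used only for
    scoring (and, in the constrained variants, to derive the ml/cl lists)."""
    scores = []
    for X, y in chunks:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        scores.append(adjusted_rand_score(y, model.labels_))
    return scores
```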
3.2 Experimental Setup
Eight data streams were used for the experiments, four of which were synthetically generated, and four were prepared based on real data sets obtained from the MOA and CSE repositories. A detailed description of the data sets is presented in Table 1.
Selected data sets were divided into chunks of 200 samples. The real-life data streams were limited to the first 200 chunks, which preserved the original problem characteristics and provided a convenient length for reliable analysis.
The constraints were generated for each chunk using original problem labels. The ratio of generated constraints to all possible constraints was constant along the stream, but each data set was evaluated in three configurations of \(1\%\), \(2\%\), and \(5\%\) of possible constraints. The metric selected to measure and compare the clustering quality of evaluated algorithms was Adjusted Rand Index (ARI).
The tested algorithms and the experimental environment were implemented in the Python programming language, using the scikit-learn [16] and stream-learn [17] libraries. The repository allowing for replication of the presented results is available on GitHub.
3.3 Results
The graphs show ARI scores for both the proposed methods and the reference approaches. In addition, for each method, the metric values are shown (a) over the entire data stream length and (b) averaged over all data chunks. This approach allows for a detailed analysis of the algorithm performance while enabling comparison in the context of the entire problem. Finally, all presented runs were smoothed using the cumulative sum.
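The smoothing step can be illustrated as a running average computed from a cumulative sum; `cumulative_mean` is a hypothetical helper name used only for this illustration:

```python
import numpy as np

def cumulative_mean(scores):
    """Running average of per-chunk scores, computed via a cumulative sum;
    used here only to smooth the plotted metric runs."""
    scores = np.asarray(scores, dtype=float)
    return np.cumsum(scores) / np.arange(1, len(scores) + 1)
```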
Synthetic Datasets. Research on synthetic sets was carried out to study the algorithm behavior concerning the occurring drifts. The results obtained for synthetic data streams with concept drift are presented in Fig. 1.
One may observe that for the analyzed problem, pck-s achieved the best results. It should be underlined that in the case of Blobs Gradual, the clustering quality of the base pck and of the proposition was equal. Moreover, all the algorithms performed poorly around the cluster intersection – which was expected behavior – but ARI improved significantly when 5% of constraints were provided. Attention should also be paid to the large variance of the obtained metric, which indicates its potentially high impact on model diversification. At the same time, despite the inferior clustering quality of copk, a significant improvement is noticeable for copk-s. However – addressing RQ1 – it cannot be unequivocally stated that the proposed methods allow for faster adaptation to concept drift.
For the Classification Gradual problem, it can be seen that both pck-s and copk-s are better than the base algorithms and more stable – especially where 2% of constraints were used. In the runs with 1% of constraints, an interesting behavior is visible at the beginning of the stream – an improvement in overall model quality in the early run. It may suggest that the proposed initialization method causes some delay in achieving the algorithm's convergence, which stabilizes only after some time.
It is also essential to pay attention to the impact of the number of constraints on the tested algorithms. In all cases, increasing this parameter did not affect the clustering quality of copk and had only a slight effect on copk-s, while significantly improving the pck and pck-s algorithms, leading to a stable result in Classification Gradual with 5% of constraints.
For the Blobs Gradual problem, the pck and pck-s algorithms achieve promising results using 5% of constraints and minimize their variance while becoming insensitive to the emerging imbalance. It is also worth noting that, in contrast to the previously analyzed problem, a much slighter decrease in the quality of pck-s clustering can be observed at the time of imbalance. A similar relationship can be observed in Classification Gradual. In the scenarios where the percentage of constraints is 0.5% and 1%, there is a sudden increase in the metric as the classes approach the prior equilibrium, followed by a rapid decrease. In the last case, pck-s adjusts to the imbalance the fastest, giving the metric the most stable value.
Real Datasets. The results of experiments carried out on real data streams are presented in Fig. 2. The results are analyzed to address RQ2. It is impossible to indicate the best clustering algorithm for the forest covertype and electricity sets when 1% and 2% of constraints were selected. Noticeable differences between the algorithms appear only when considering 5% of constraints, which significantly improves the pck and pck-s algorithms. An interesting drop in the quality of pck clustering on the covtypeNorm set starts at chunk 75, which cannot be observed for pck-s.
Significant differences can be observed in the case of the other two sets. It can be noted that for airlines, all methods performed poorly, but the clustering quality of copk and copk-s increases with an increasing number of constraints. Thus, this observation seems to be an exception to the rule observed in the previous sets. However, it should be noted that the metric only increased by 0.1 between the lowest and highest percentage of constraints.
The most interesting observations can be made on the KddCup99 set, for which copk-s also turned out to be the best clustering algorithm. An interesting relationship can also be seen for this set: adding constraints degrades the quality of clustering. Most likely, it is related to the specific behavior of the algorithm concerning constraint feasibility [18].
4 Conclusions
Two clustering algorithms for data streams, pck-s and copk-s, were presented in this work. Both algorithms are adapted to a task rarely discussed in the literature. The series of studies showed the proposed methods' ability to adapt to occurring concept drifts and to obtain better results than the base algorithms.
The presented results provide a basis for further research. One direction is to extend the proposed methods to better adapt to the changing nature of the constraints. In some cases, the clustering quality showed significant variance throughout the run, and an appropriate selection of constraints may be crucial for stabilizing the results.
In addition, the practical aspect of using the proposed methods should be emphasized in the context of other machine learning tasks, especially for active learning. Evaluating only the relation between two samples is easier for an expert than assigning them to imposed classes, which will reduce the cost of obtaining such knowledge.
References
Ksieniewicz, P., Zyblewski, P., Choraś, M., Kozik, R., Giełczyk, A., Woźniak, M.: Fake news detection from data streams. In: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, Glasgow, July 2020
Krawczyk, B., Minku, L.L., Gama, J., Stefanowski, J., Woźniak, M.: Ensemble learning for data stream analysis: a survey. Inf. Fusion 37, 132–156 (2017)
Ramírez-Gallego, S., Krawczyk, B., García, S., Woźniak, M., Herrera, F.: A survey on data preprocessing for data stream mining: current status and future directions. Neurocomputing 239, 39–57 (2017)
Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavaldà, R.: New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD 2009, pp. 139–148. Association for Computing Machinery, New York (2009)
Zyblewski, P., Sabourin, R., Woźniak, M.: Preprocessed dynamic classifier ensemble selection for highly imbalanced drifted data streams. Inf. Fusion 66, 138–154 (2021)
Guzy, F., Woźniak, M., Krawczyk, B.: Evaluating and explaining generative adversarial networks for continual learning under concept drift. In: International Conference on Data Mining Workshops (ICDMW), pp. 295–303 (2021)
Komorniczak, J., Zyblewski, P., Ksieniewicz, P.: Prior probability estimation in dynamically imbalanced data streams. In: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE, Shenzhen, July 2021
Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., Carvalho, A.C.D., Gama, J.: Data stream clustering: a survey. ACM Comput. Surv. 46(1), 1–31 (2013)
Cao, F., Ester, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise. In: Proceedings of the 2006 SIAM International Conference on Data Mining, pp. 328–339 (2006)
Davidson, I.: A survey of clustering with instance level constraints. ACM Trans. Knowl. Discov. Data (41) (2007)
González, S., García, S., Li, S.-T., John, R., Herrera, F.: Fuzzy k-nearest neighbors with monotonicity constraints: moving towards the robustness of monotonic noise. Neurocomputing 439, 106–121 (2021)
González-Almagro, G., Luengo, J., Cano, J.-R., García, S.: Enhancing instance-level constrained clustering through differential evolution. Appl. Soft Comput. 108, 107435 (2021)
Ruiz, C., Menasalvas, E., Spiliopoulou, M.: C-DenStream: using domain knowledge on a data stream. In: Gama, J., Costa, V.S., Jorge, A.M., Brazdil, P.B. (eds.) DS 2009. LNCS (LNAI), vol. 5808, pp. 287–301. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04747-3_23
Halkidi, M., Spiliopoulou, M., Pavlou, A.: A semi-supervised incremental clustering algorithm for streaming data. In: Tan, P.-N., Chawla, S., Ho, C.K., Bailey, J. (eds.) PAKDD 2012. LNCS (LNAI), vol. 7301, pp. 578–590. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30217-6_48
Sirampuj, T., Kangkachit, T., Waiyamai, K.: CE-stream : evaluation-based technique for stream clustering with constraints. In: The 2013 10th International Joint Conference on Computer Science and Software Engineering (JCSSE), pp. 217–222. IEEE, Khon Kaen, Thailand, May 2013
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Ksieniewicz, P., Zyblewski, P.: Stream-learn-open-source Python library for difficult data stream batch analysis. arXiv:2001.11077 [cs, stat], January 2020
Davidson, I., Wagstaff, K.L., Basu, S.: Measuring constraint-set utility for partitional clustering algorithms. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 115–126. Springer, Heidelberg (2006). https://doi.org/10.1007/11871637_15
Acknowledgement
This work was supported by the Polish National Science Centre under the grant No. 2017/27/B/ST6/01325 as well as by the statutory funds of the Department of Systems and Computer Networks, Wroclaw University of Science and Technology.
© 2022 Springer Nature Switzerland AG
Wojciechowski, S., González-Almagro, G., García, S., Woźniak, M. (2022). Adapting K-Means Algorithm for Pair-Wise Constrained Clustering of Imbalanced Data Streams. In: García Bringas, P., et al. Hybrid Artificial Intelligent Systems. HAIS 2022. Lecture Notes in Computer Science(), vol 13469. Springer, Cham. https://doi.org/10.1007/978-3-031-15471-3_14