
1 Introduction

Data stream mining is a recent methodology that deals with the analysis of large volumes of ordered sequences of data records. Data streams are a manifestation of Big Data, which is characterized by the four ‘V’ dimensions, namely Volume, Velocity, Variety and Veracity [1]. In particular, data stream mining assumes that the volume of the sequence of data is so large that records can be used only a few times (or just once) for the analysis. Data streams are produced by sensor networks, e-mails, online transactions, network traffic, weather forecasting, health monitoring, social networks, etc., to cite only the most common applications made available by current technology [2, 3].

The requirement of using each data record only a few times when extracting useful information calls for the development of special-purpose data analysis methods, which should not require storing the whole stream of data in memory [4,5,6]. One approach to analyzing data streams exploits the incremental generation of informational patterns, which represent a synthesized view of all data records analyzed in the past and progressively evolve as new data records become available. Incremental and on-line algorithms are potentially useful to deal with the continuous arrival of data in rapid, time-varying, and potentially unbounded streams, since they continuously incorporate new information into their model [7, 8].

Data stream mining is applied to different tasks, such as classification, clustering and frequent pattern mining. In this paper, we focus on the classification of data records in a stream, which has been deeply studied in the literature [4, 9,10,11,12,13,14]. Unlike most works in the literature, which focus on supervised methods [15, 16], we concentrate on semi-supervised methods, as we do not assume that all data records are completely labeled; indeed, in many contexts labeled samples are difficult or expensive to obtain, while unlabeled data are relatively easy to collect. For example, it is quite easy to collect new sensor data coming from continuous streams, but it may be difficult or even impossible to manually label all such data. Semi-supervised learning in the context of data streams is relatively new when compared to supervised and unsupervised learning [17,18,19,20]. Although several semi-supervised learning methods have been developed in the literature [21], only a few of them have been applied to classify data streams [22, 23]. Moreover, there are few attempts at using fuzzy clustering for data stream mining, even though fuzzy clustering could be particularly useful to capture continuous changes in the clustering structure [24,25,26,27,28].

Based on the idea of combining the benefits of semi-supervised learning and fuzzy clustering, we recently developed an incremental semi-supervised clustering method for data stream classification [29], which applies the Semi-Supervised Fuzzy C-Means algorithm (SSFCM) [30] to data chunks. The method was further refined by enabling the dynamic determination of the number of clusters through an appropriate splitting procedure, leading to the DISSFCM (Dynamic Incremental Semi-Supervised FCM) algorithm [31]. In essence, DISSFCM applies SSFCM to data chunks, each corresponding to a fixed-size collection of contiguous data records coming from a stream. Furthermore, SSFCM is modified in order to allow the incremental evolution of clusters; cluster quality is evaluated by the reconstruction error, so that when the quality degrades below a threshold a splitting procedure is applied to divide a low-quality cluster into two higher-quality clusters. While splitting is effective in improving the quality of clusters, its repeated application without a counter-balance may induce many small-scale clusters that do not represent meaningful patterns.

In this paper we enhance DISSFCM by introducing a merging procedure that merges clusters when there are too many of them or when some clusters contain too few data records. Clusters are merged only when they are sufficiently close, so as not to hamper the overall quality of the cluster structure.

The organization of the rest of the paper is as follows. Section 2 presents our method for data stream classification and its extension proposed in this work. In Sect. 3 the effectiveness of the extended method is evaluated on a benchmark dataset. The last section draws the conclusion and outlines future work.

2 Dynamic Incremental Semi-Supervised FCM

In this section we describe the complete DISSFCM (Dynamic Incremental Semi-Supervised FCM) algorithm [31], including a merging mechanism to avoid small-scale clusters and improve the structure of clusters.

DISSFCM assumes that data belonging to C different classes are continuously available over time and are processed as chunks. Namely, a chunk of \(N_1\) data is available at time \(t_1\), a chunk of \(N_2\) data is available at \(t_2\), and so on. We denote by \(X_t\) the data chunk available at time t. No assumption is made on the size of chunks, which may vary from one chunk to another. One key feature of DISSFCM is the possibility to exploit partial supervision when available. Namely, when some pre-labeled data are available in a chunk, their labels can be used to drive the clustering process. The presence of pre-labeled data is not mandatory, but it should be ensured in the first chunk in order to properly initialize the cluster prototypes.

The core of DISSFCM is the SSFCM (Semi-Supervised FCM) algorithm [30], which is applied incrementally so as to enable a continuous update of clusters based on new data chunks. At each time step, SSFCM granulates the data in the current chunk by producing a set of K clusters represented by K labeled prototypes \(P=\{\mathbf{p }_1,\mathbf{p }_2,\ldots , \mathbf{p }_K\}\) that are representative of the local data chunk they model. Each prototype \(\mathbf p _k\) is a medoid, i.e. the data point closest to the center \(\mathbf c _k\). Before starting the clustering process, K labeled data points are randomly chosen to initialize the prototypes, so that each cluster prototype is associated with a class label (\(K=C\)). To take into account the evolution of the data during the incremental clustering process, the cluster prototypes discovered from the previous chunk are used as pre-labeled prototypes for the current chunk.
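To make the prototype carry-over concrete, the following sketch (in Python/NumPy, not part of the original implementation) shows how medoid prototypes could be extracted from a chunk and passed on as pre-labeled prototypes for the next chunk; the ssfcm routine is a hypothetical placeholder for the Semi-Supervised FCM step.

import numpy as np

def medoid_prototypes(X, centers):
    # For each cluster center, pick the chunk point closest to it as the prototype (medoid).
    D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (N, K) distances
    return X[np.argmin(D, axis=0)]                                   # (K, d) medoids

def process_stream(chunks, prototypes, proto_labels, ssfcm):
    # Incremental loop: cluster each chunk with SSFCM (placeholder routine),
    # re-using the labeled prototypes of the previous chunk as initialization.
    for X, partial_labels in chunks:              # partial_labels may be mostly unknown
        centers, U = ssfcm(X, prototypes, proto_labels, partial_labels)
        prototypes = medoid_prototypes(X, centers)
        # splitting / merging and chunk classification would be applied here
        yield prototypes, U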

To better take into account the data evolution, DISSFCM is equipped with a splitting mechanism [31] that is applied to the current clusters in order to divide a low-quality cluster into two higher-quality clusters. The cluster quality is evaluated in terms of the reconstruction error [30]:

$$\begin{aligned} V_k = \sum _{\mathbf {x}_{j} \in C_{k}} {\Vert \mathbf {x}_{j} - \mathbf {\hat{x}}_{j} \Vert ^2} \end{aligned}$$
(1)

that measures the difference between the original data \(\mathbf {x}_{j}\) and their “reconstructed” counterpart \(\mathbf {\hat{x}}_{j}\) that is derived using the clustering outcome (prototypes and membership degrees) as follows:

$$\begin{aligned} \mathbf {\hat{x}}_{j} = \frac{\sum ^{K}_{k=1}{u_{jk}^m\mathbf {p}_{k}}}{\sum ^{K}_{k=1}{u_{jk}^m}} \end{aligned}$$
(2)

The splitting mechanism is activated when the reconstruction error on the current chunk exceeds the reconstruction error computed on the previous chunk by more than a tolerance value \(\epsilon \). This means that the current number of clusters is not enough to effectively represent the data, hence the number of clusters should be increased.
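As an illustration (not taken from the original implementation), the reconstruction error of Eqs. (1)–(2) and the split trigger can be sketched in Python/NumPy as follows, assuming that each data point is attributed to the cluster of its maximum membership:

import numpy as np

def reconstruction_error(X, P, U, m=2.0):
    # X: (N, d) chunk, P: (K, d) prototypes, U: (N, K) fuzzy memberships.
    Um = U ** m
    X_hat = (Um @ P) / Um.sum(axis=1, keepdims=True)      # Eq. (2): reconstructed data
    sq_err = np.sum((X - X_hat) ** 2, axis=1)             # ||x_j - x_hat_j||^2
    hard = np.argmax(U, axis=1)                           # attribute x_j to its closest cluster
    return np.array([sq_err[hard == k].sum() for k in range(P.shape[0])])  # Eq. (1), per cluster

# Split trigger (sketch): V_curr and V_prev are the per-cluster errors on the
# current and previous chunk, eps is the tolerance.
# if V_curr.sum() > V_prev.sum() + eps:
#     k_star = np.argmax(V_curr)   # cluster with the worst reconstruction ability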

The cluster having the highest value of the reconstruction error, i.e. the cluster with the lowest reconstruction ability, is selected as the candidate for splitting. The split is performed by applying conditional fuzzy clustering [32] to the data belonging to that cluster, so as to create two novel prototypes. If we denote by \(S^*\) the set of data belonging to the cluster \(k^*\) selected for splitting and by \(\mathbf {z}_{1}\) and \(\mathbf {z}_{2}\) the two novel prototypes, conditional clustering minimizes the following objective function:

$$\begin{aligned} J=\sum ^{2}_{k=1} \sum _{j \in S^*}f_{jk}^{m}\Vert \mathbf {x}_{j} - \mathbf {z}_{k} \Vert ^2 \end{aligned}$$
(3)

under the constraint \(f_{j1}+f_{j2} = u_{jk^*}\) where \(f_{jk}\) is the membership degree of \(\mathbf {x}_{j}\) to the new cluster k. The objective function (3) is minimized by iteratively computing the membership values \(f_{jk}\) and the prototypes \(\mathbf {z}_{k}\) according to:

$$\begin{aligned} f_{jk} = \frac{u_{jk^*}}{ \sum ^{2}_{c=1} { \left( \frac{\Vert \mathbf {x}_{j} - \mathbf {z}_{k} \Vert }{\Vert \mathbf {x}_{j} - \mathbf {z}_{c} \Vert } \right) } ^{2/(m-1)} } \end{aligned}$$
(4)

and

$$\begin{aligned} \mathbf {z}_{k} = \frac{ \sum _{j \in S^*}{ f_{jk}^{m}\mathbf {x}_{j} } }{ \sum _{j \in S^*}{ f_{jk}^{m}} } , \qquad k = 1,2; \end{aligned}$$
(5)

After conditional clustering, the prototype \(\mathbf {p}_{k^*}\) is replaced by the two novel prototypes \(\mathbf {z}_{1}\) and \(\mathbf {z}_{2}\), which inherit the class label of \(\mathbf {p}_{k^*}\). Then the membership values \(u_{jk}\) are recomputed as in SSFCM. The splitting can be repeated until the reconstruction error drops below the previous value; at most a pre-fixed number \(N_s\) of splits is allowed for each chunk.
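A possible implementation of the split via conditional fuzzy clustering (Eqs. (3)–(5)) is sketched below in Python/NumPy; the number of iterations and the random initialization of the two prototypes are assumptions of this sketch, not prescriptions of the original algorithm.

import numpy as np

def conditional_split(X_star, u_star, m=2.0, n_iter=50, eps=1e-12, seed=0):
    # X_star: (n, d) data of the cluster selected for splitting,
    # u_star: (n,) memberships of these data to the original cluster k*.
    rng = np.random.default_rng(seed)
    Z = X_star[rng.choice(len(X_star), size=2, replace=False)]   # two initial prototypes
    for _ in range(n_iter):
        D = np.linalg.norm(X_star[:, None, :] - Z[None, :, :], axis=2) + eps   # (n, 2)
        # Eq. (4): memberships constrained so that f_j1 + f_j2 = u_jk*
        ratios = (D[:, :, None] / D[:, None, :]) ** (2.0 / (m - 1.0))           # (n, 2, 2)
        F = u_star[:, None] / ratios.sum(axis=2)
        # Eq. (5): update the two prototypes as weighted means of the cluster data
        Fm = F ** m
        Z = (Fm.T @ X_star) / Fm.sum(axis=0)[:, None]
    return Z, F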

Since a repeated application of the splitting without a counter-balance may induce many small-scale clusters that do not represent meaningful patterns, in this work we enhance DISSFCM by introducing a merging procedure that merges clusters when there are too many clusters or when some clusters contain too few data records of a chunk. Clusters are merged only when their prototypes are close, so as not to hamper the overall quality of the cluster structure. The merging mechanism is activated when one of the following conditions is met:

  1. the number of clusters exceeds a predefined threshold \(\theta \);
  2. the number of data records belonging to a cluster is below a predefined threshold \(\lambda \).

In case 1, we select the nearest prototypes having the same class label as candidates for merging. We denote by \(\mathbf p _s\) and \(\mathbf p _t\) the nearest prototypes among all the current cluster prototypes sharing the same label. The new prototype \(\mathbf p \) obtained by merging \(\mathbf p _s\) and \(\mathbf p _t\) is given by the following formula:

$$\begin{aligned} \mathbf p = \frac{\sum ^{N}_{i=1} (u_{is} + u_{it})^m \mathbf x _i}{\sum ^{N}_{i=1} (u_{is} + u_{it})^m} \end{aligned}$$
(6)

where \(u_{is}\) and \(u_{it}\) are the membership values of \(\mathbf x _i\) to cluster s and cluster t, respectively. In case 2, the prototype of the cluster with a low number of data records is merged with the closest cluster prototype, again using Eq. (6). In both cases, the merging reduces the number of clusters by one. The merging is repeated until there are no small clusters and the number of clusters does not exceed the threshold; however, at most a pre-fixed number \(N_m\) of merges is allowed for each chunk.
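The merge of two clusters according to Eq. (6) can be sketched as follows (Python/NumPy, a hypothetical helper not taken from the original implementation):

import numpy as np

def merge_prototypes(X, U, s, t, m=2.0):
    # X: (N, d) chunk data, U: (N, K) memberships; s and t are the indices of the
    # two clusters to be merged (e.g. the nearest prototypes sharing a class label).
    w = (U[:, s] + U[:, t]) ** m          # combined memberships, weights of Eq. (6)
    return (w @ X) / w.sum()              # new prototype replacing p_s and p_t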

Algorithm 1 (pseudocode figure).

Fig. 1. Outline of DISSFCM. (Color figure online)

The overall scheme of DISSFCM enhanced with merging is shown in Fig. 1 and described in Algorithm 1. The algorithm requires the data stream as a sequence of chunks and an initial collection of labeled prototypes such that each class label is represented by at least one prototype. After the application of SSFCM clustering (Step 6), the resulting prototypes are labeled automatically thanks to the semi-supervised nature of SSFCM. The derived labeled prototypes are the basis for the classification process (Step 20): each data sample in the current chunk is matched against all prototypes and assigned the class label of the best-matching prototype, where matching is based on the standard Euclidean distance. At the end, the algorithm returns the most recent collection of prototypes, reflecting the data structure of the last data chunk. Notice that the returned collection can be used as input for a new run of the algorithm as long as new data are available from the stream.
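The nearest-prototype classification of a chunk (Step 20) amounts to a simple Euclidean matching, sketched below in Python/NumPy (illustrative, not the original code):

import numpy as np

def classify_chunk(X, prototypes, proto_labels):
    # Match each sample against all labeled prototypes and return the label of
    # the closest one, using the standard Euclidean distance.
    D = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)  # (N, K)
    return np.asarray(proto_labels)[np.argmin(D, axis=1)]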

3 Experimental Results

Numerical experiments were conducted to evaluate the effectiveness of the proposed algorithm in data stream classification. The Optical Recognition of Handwritten Digits dataset has been considered. It contains 5,620 images of handwritten digits belonging to 10 classes (namely, 0, 1, 2, \(\dots \), 9). We used \(10\%\) of the samples as the test set, and we partitioned the remaining \(90\%\) into a fixed number of chunks in order to simulate a data stream. The class distribution was preserved both in the chunks and in the test set.

The accuracy measure has been used to evaluate the classification results:

$$Acc= \frac{\left| \left\{ \mathbf x _{j}|y_{j}=a_{j} \right\} \right| }{N_{t}}$$

where \(\mathbf x _{j}\) is the j-th data point, \(y_{j}\) is its true class label, \(a_{j}\) is its predicted class label, and \(N_{t}\) is the total number of data points. After the t-th chunk has been processed, accuracy is computed not only on the test set, but also on the t-th chunk and on the previously processed chunks.

The purity external clustering measure has been used to evaluate the extent to which clusters contain a single class, after each chunk arrival. To compute purity, each cluster \(C_k\) is assigned the class \(a_k\) of its prototype, and the accuracy of this assignment is measured by counting the number of correctly assigned data points and dividing by the cardinality of the cluster:

$$Pur(k)= \frac{\left| \left\{ \mathbf x _{j} \mid \mathbf x _{j}\in C_k \wedge y_{j}=a_{k} \right\} \right| }{|C_k|}$$

Then an average purity is computed over all the clusters.
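For completeness, the two evaluation measures can be computed as in the following sketch (Python/NumPy, assuming that each data point has been attributed to the cluster of its best-matching prototype):

import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of data points whose predicted label equals the true label (Acc).
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def average_purity(cluster_ids, y_true, proto_labels):
    # Each cluster is assigned the class a_k of its prototype; its purity is the
    # fraction of member points actually carrying that class. The average is
    # taken over the non-empty clusters.
    cluster_ids, y_true = np.asarray(cluster_ids), np.asarray(y_true)
    purities = [np.mean(y_true[cluster_ids == k] == label)
                for k, label in enumerate(proto_labels)
                if np.any(cluster_ids == k)]
    return float(np.mean(purities))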

Table 1. Parameters of the enhanced DISSFCM algorithm.

We carried out some preliminary experiments by varying the parameters of the DISSFCM algorithm. Table 1 summarizes the experimental settings. A first evaluation was done by observing the reconstruction error. As an example, Fig. 2 shows the trend of the reconstruction error with #Chunk = 15 and \(\epsilon = 50\). Green dots correspond to the error after processing the current chunk, blue dots indicate the error after a split, and yellow dots the error after a merge. The numbers on the dots indicate the number of prototypes (clusters). It can be seen that every time the reconstruction error exceeds the previous value plus the tolerance \(\epsilon \), a split is activated and a new cluster is created (the number of clusters reported on the blue dot is increased by one). When a cluster with a small number of samples occurs, a merge is activated and the number of clusters is reduced. It can also be seen that most peaks occur when a new chunk arrives. This means that DISSFCM is still learning the correct model to fit the data, and it improves the model as soon as a new chunk (i.e. more training data) arrives.

We observe that the split and merge steps help the model to fit the data. This can be better observed in Fig. 3, where the average purity values obtained on single chunks during the learning process are reported. It can be seen that after processing the fifth chunk, the average purity decreases. When the sixth chunk arrives, one split and one merge are applied (Fig. 2), raising the purity value. The same behavior can be observed after the 14-th and 15-th chunks are processed. The processing of all the chunks ends with 18 cluster prototypes that are used to represent the 10 original classes. The number of cluster prototypes for each class is reported in Table 2.

Fig. 2. Trend of the reconstruction error \(V_{max}\) with #Chunk = 15 and \(\epsilon \) = 50. (Color figure online)

Table 2. Number of cluster prototypes for each class at the end of the incremental process with #Chunk = 15, \(\epsilon \) = 50.

Table 3 reports the accuracy computed on the chunks at each step \(t_i\) during the incremental process with #Chunk = 15 and \(\epsilon \) = 50. Bold values represent the accuracy on the current chunk. We observe that the model is properly adapted to the newly arrived chunk. At each time step we also evaluated the classification accuracy of the current model on the previously seen chunks, to verify whether the model still fits the old data.

Table 3. Accuracy obtained on single chunks during the incremental process, with #Chunk = 15, \(\epsilon \) = 50.
Fig. 3. Average purity obtained on single chunks during the incremental process, with #Chunk = 15, \(\epsilon \) = 50, on the training and test sets. (Color figure online)

To assess the effectiveness of DISSFCM, we evaluated the classification accuracy of the final models for each configuration of parameters (#Chunk, \(\epsilon \)). Results are summarized in Table 4. Both on the test set and on the training set we can observe that the impact of the tolerance \(\epsilon \) is higher when the number of chunks grows (i.e. when the number of samples in each chunk decreases). Indeed, the accuracy values with 5 and 10 chunks are stable when varying \(\epsilon \), while with 15 and 20 chunks the accuracy is more sensitive to the value of \(\epsilon \). This behavior can be better observed in the plots of Fig. 4, which show the trend of the accuracy on the test set during the processing of the chunks for different values of the tolerance \(\epsilon \).

This is explained by observing that the higher the number of chunks, the smaller the number of samples in each chunk; therefore the algorithm has fewer samples to learn from. Thus the number of samples in each chunk affects the stability of the algorithm. With 5 and 10 chunks (many data per chunk) the algorithm keeps the same behavior as new chunks arrive (Fig. 4(a) and (b)). As the number of chunks increases (and hence the number of data in each chunk decreases), the algorithm becomes more unstable and needs more time to converge to an accurate model (Fig. 4(c) and (d)).

Table 4. Classification accuracy on the whole training set (a) and the test set (b), varying the number of chunks and the tolerance \(\epsilon \).
Fig. 4. Accuracy on the test set varying \(\epsilon \) for #Chunk equal to 5 (a), 10 (b), 15 (c) and 20 (d). (Color figure online)

Finally, DISSFCM enhanced with merging was compared with its previous version [31] and with ILFM (Incremental Learning Fuzzy Measures) [33], a supervised incremental method based on Choquet integrals for classifying data streams. Comparative results with #Chunk = 15, \(\epsilon = 50\) and labeling = 75% are plotted in Fig. 5.

It can be seen that the introduction of the merging mechanism in DISSFCM slightly deteriorates the classification results with respect to the previous version, which only applies splits. However, the final classification model provided by the novel version of DISSFCM is very simple (18 clusters) in comparison to the final model obtained by the previous version of DISSFCM, which was based on 70 clusters.

The models obtained by DISSFCM were also compared to the model built by ILFM. It can be seen that the classification accuracy of ILFM is slightly better. However, ILFM is a supervised method and thus requires completely labeled data, which are difficult to obtain in real applications, whereas DISSFCM works with partially labeled data. Moreover, the model produced by ILFM is an ensemble of classifiers, hence it is far more complex than our model. Overall, DISSFCM achieves a good balance between accuracy and complexity of the classification model, while taking into account the evolution of data.

Fig. 5. Comparison of the enhanced DISSFCM against its previous version (no merge) and ILFM. (Color figure online)

4 Conclusions

In this work we have described DISSFCM, a dynamic incremental semi-supervised version of the standard FCM clustering algorithm that is suitable for data stream classification. DISSFCM enables the structure of clusters to change dynamically: when the reconstruction error of the data given a clustering structure becomes inadequate, the most troublesome clusters are split into finer-grained clusters that better represent the data. Moreover, when too few samples are grouped in a cluster, a merge step is activated to reduce the number of groups. A preliminary numerical analysis has shown that the split tolerance \(\epsilon \) influences the accuracy results when the chunk size is small. Finally, it has been observed that the merge mechanism has a small negative impact on the accuracy of the model when compared with DISSFCM without merging. However, in the face of such an accuracy reduction we observe a significant simplification of the final model (18 clusters for DISSFCM with split and merge, against 70 for DISSFCM with split only). Similar considerations can be derived by comparing DISSFCM (with merge) and ILFM.

Future work will be devoted to analyzing the influence of the chunk composition on DISSFCM, so as to better take into account real data stream scenarios, where the incoming chunks may have different sizes and may contain data with inhomogeneous class distributions. Moreover, further research is ongoing on the introduction of mechanisms to detect outliers, concept drift and the emergence of new classes.