1 Introduction

Stream mining is the process of finding complex structure within a large volume of data where the data evolves and arrives in an unbounded stream. A data stream is a sequence of continuously arriving data that imposes a single-pass restriction: random access to the data is not feasible, and it is impractical to store all of the arriving data. Instead, we store cluster features, or synopses, that typically include descriptive statistics for a cluster. In many cases, data stream algorithms also have to observe space and time constraints. Data arriving in streams often contains noise and outliers, so data stream clustering should be able to detect, distinguish and filter such data before the clustering task.

In this paper, we also address subspace clustering, an extension of feature selection that tries to identify clusters in different subspaces of the same dataset. As with feature selection, subspace clustering needs a search method and an evaluation criterion. In addition, subspace clustering must somehow restrict the scope of the evaluation criterion so that different subspaces can be considered for each distinct cluster [1].

The rest of the paper is organized as follows: in Sect. 2, we present the related literature. In Sect. 3, we explain our algorithm, Soft Subspace Neural Gas for Data Stream Clustering (S2G-Stream), and we evaluate its performance in Sect. 4. Section 5 describes the resources provided for reproducibility, and we conclude this paper in Sect. 6.

2 Related works

2.1 Subspace clustering

Subspace clustering consists of finding each cluster in a subspace composed of relevant features, where a feature may be relevant for one or more clusters. Sophisticated heuristics have been developed to determine the subspaces associated with the clusters, and based on the way these subspaces are determined, subspace clustering methods are classified into two main categories: hard subspace clustering (HSC) and soft subspace clustering (SSC). SSC algorithms perform clustering in high-dimensional spaces by assigning a weight to each feature to measure the contribution of individual features to the formation of a particular cluster [1]. In HSC, all features contribute equally to the clustering process. Figure 1 shows a classification of subspace clustering algorithms.

HSC algorithms are divided into bottom-up and top-down search methods [2]. Top-down approaches determine a first clustering using all the features. A weight associated with each feature is then used in a new phase of an iterative process to reassign the observations to the clusters. The main difficulty in this category is determining the number of clusters and the number of features forming the subspace associated with each cluster. PROCLUS is a top-down approach [3] that uses three phases (initialization, iteration and refinement). In the first phase, a greedy algorithm selects a set of potential medoids. Then, a set of dimensions is computed for each medoid so that the points assigned to that medoid best form a cluster in the subspace determined by those dimensions. Finally, new dimensions are computed for each medoid based on the obtained clusters, points are reassigned to the medoids, and outliers are removed.

Bottom-up approaches use clustering methods based on a grid over the observation space, defined by a histogram for each dimension. Intervals whose density of observations exceeds a threshold, set a priori, denote clusters for each feature. Starting from one-dimensional clusters, these are combined iteratively to form clusters in higher-dimensional subspaces [4]. Although these algorithms can find arbitrarily shaped subspace clusters, they fail to scale with the number of dimensions. CLIQUE [5] partitions every single dimension of the data space into equal-sized units using a fixed-size grid. A unit is considered dense if the number of points in it exceeds the density support threshold. The maximal set of adjacent dense units forms a cluster.

SSC (soft subspace clustering) algorithms assign a weight to each dimension in the clustering process, viewed as the degree of contribution of that dimension to the corresponding cluster. After the clustering process, the weights identify the subspaces of the different clusters. The purpose of these algorithms is to select important features from the whole dataset; from this perspective, soft subspace clustering can be viewed as multiple feature weighting clustering. SSC methods can be classified into three categories [1], as presented in Fig. 1.

Firstly, CSSC (conventional subspace clustering) uses feature weighting in a two-step clustering process: it first uses some weighting strategy to find subspaces, and then performs clustering on the obtained subspace (separated feature weighting). Clustering can also be obtained by performing the two processes simultaneously (coupled feature weighting) [1]. C-k-means (convex k-means) uses separated feature weighting [6]. It involves two separate processes, subspace identification and clustering in the subspace, and begins by assigning a set of weights to each data point, which is not practical for high-dimensional data. Another method, W-k-means, uses coupled feature weighting [7], which means that the weights are updated adaptively during the clustering process.

Secondly, in ISSC (independent subspace clustering) methods, each cluster has its own weight vector forming its subspace. This kind of clustering uses various weighting schemes, such as fuzzy and entropy weighting. AWFCM [8] uses fuzzy weighting and takes all features into account in the clustering task; the feature selection is carried out during the learning phase, providing information about the influence of the selected features.

Finally, XSSC (extended subspace clustering) has been proposed to enhance the performance of CSSC and ISSC, employing various strategies to improve the clustering process. In [9], the authors present IEWKM (improved entropy weighting k-means), which generates a weight for each attribute in each cluster within the framework of a k-means-type algorithm. Coevolutionary SSC [10] introduces a weighting system based on a coevolutionary learning technique. The coevolutionary algorithm extends evolutionary methods to deal with complex problems: it uses several populations, each of which evolves in an environment that depends on the other populations; a population is combined with its collaborators to form a complete solution, and an objective function is evaluated. Ensemble learning approaches [11] combine the benefits of the other two types of methods; their goal is to use the opinion of several partitions to reach a consensus or a synthesis. These partitions can be obtained by varying the clustering algorithms, the initial parameters, the subsets of data, etc.

Table 1 presents a comparison between some subspace clustering methods.

Fig. 1
figure 1

Hierarchy of subspace clustering algorithms

Table 1 Comparison between subspace clustering algorithms

2.2 Data stream clustering

In this subsection, we discuss some data stream clustering algorithms and present a comparison between these methods. HDDStream [15], SDStream [16], Stream Optics [17] and DENGRIS-Stream [18] are density-based clustering algorithms. They consider clusters as dense regions of the data space that are separated by low-density regions. These algorithms use temporal windows to consider only recent data; the user sets parameters such as the window size, the threshold value and the radius. E-Stream [19] is a hierarchical algorithm. It represents each cluster as a fading cluster, which can be removed or split, and two clusters can also be merged during the process. HUE-Stream [20], an extension of E-Stream to data streams with uncertainty, associates a histogram with both numerical and categorical attributes. A two-component partitioning clustering method named CluStream [21] clusters data using an online micro-clustering and an offline macro-clustering component. The first phase builds a summary of the data stream; the second phase uses these statistics as well as other inputs to create clusters. SWClustering [22] is a clustering algorithm for evolving data streams over sliding windows. It introduces a new data structure called the exponential histogram of cluster features (EHCF): the exponential histogram is used to handle the in-cluster evolution, and the temporal cluster features represent the change in cluster distribution. The proposed EHCF synopsis can provide sufficient information to calculate the final clusters. GCHDS [23] is a grid-based clustering algorithm for high-dimensional data streams. It maintains a grid structure on the data stream in main memory to create a synopsis of the stream. The initial grid structure GS is built on the first chunk of data points in the data stream. GCHDS uses a simple heuristic method to find dimensions that are useful for clustering; each cell in the grid structure is then projected into this subspace, and the connected cells are labeled as clusters. Historical data in the grid structure GS fades by a factor \(\epsilon \). The clustering steps are performed whenever there is a request for clustering, and if the number of data points in a cluster is less than a parameter p, the cluster is merged into another cluster. In SVStream [24], the data elements of a stream are mapped to a kernel space, and the support vectors are used as summary information of the historical elements to construct cluster boundaries of arbitrary shapes. SVStream is based on support vector clustering (SVC) and support vector domain description (SVDD).

IDEStream [25] is an ensemble clustering method that first uses the k-means algorithm to group N points into \( \sqrt{N} \) micro-clusters. It then estimates the density of each incoming feature vector using the degree of dispersion in order to detect outliers, and aggregates the clusterings of two time windows into one; the same process is then repeated. The goal is to find the clustering that minimizes the variation of information (VI) between clusterings, and a reinforcement learning equation is used for this purpose. The authors of [26] propose G-Stream, based on growing neural gas. By introducing the notion of a reservoir to temporarily store distant points and by applying a fading function, nodes can be created or removed during the learning process. Tables 2 and 3 list the differences between some data stream clustering methods.

Table 2 Comparison between data stream clustering algorithms
Table 3 Comparison between the processes of each data stream clustering algorithm

2.3 GNG-based clustering algorithms

The incremental growing neural gas algorithm [33] has been proposed to follow the evolution of the graph. The method creates a new node each time the distance between the new data point and the existing nodes is higher than a threshold. This threshold is a global parameter that corresponds to the average distance of the data to the center of the dataset. It has to be pre-defined, which makes it difficult to guess the right value to make the graph grow correctly. To address this weakness, I2GNG [34] associates a threshold variable with each neuron; the only problem is that those thresholds have to be initialized at the beginning of the process. AING [35] is an incremental GNG that introduces an adaptive, parameter-free distance threshold. It automatically learns the distance threshold of each node based on its neighbors and the data points assigned to the node of interest. The algorithm overcomes the shortcoming of an excessive number of neurons by condensing them based on a probabilistic criterion and building a new topology with fewer neurons, thus saving time and memory resources.

3 Model proposition

In this section, we introduce S2G-Stream, based on the growing neural gas (GNG) model and taking into account the block structuring of features. We assume that features and blocks of features contribute at different levels to the determination of clusters; these contributions are measured by weights for each cluster. We assume that the data stream consists of a sequence \({\mathcal {X}} = \{{\mathbf {x}}_1,{\mathbf {x}}_2,\ldots ,{\mathbf {x}}_n\} \) of n (potentially infinite) elements arriving at times \(t_1, t_2, \ldots , t_n\), where \({\mathbf {x}}_i = (x_{i}^1, x_{i}^2, \ldots ,x_{i}^d )\). At each time, S2G-Stream is represented by a graph \({\mathcal {C}}\) with K nodes, where each node represents a cluster. Each node \(c \in {\mathcal {C}}\) is associated with: (1) a prototype \({\mathbf {w}}_c=(w_c^1, w_c^2, \dots , w_c^d)\) representing its position; (2) a weight \(\pi _c\); (3) an error variable error(c) representing the distance between this node and the assigned data points. For each pair of nodes (c, r), we denote by \(\delta (c,r)\) the length of the shortest path linking c and r on the graph. \({{\mathcal {K}}}^{\mathrm{T}}(\delta ) = {{\mathcal {K}}}(\delta /T)\) is the neighborhood function, where T controls the width of \({\mathcal {K}}\).
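To make the graph state concrete, the following minimal Scala sketch shows one possible representation of a node and of the graph maintained by S2G-Stream; the type and field names are ours and do not reflect the released Clustering4Ever implementation.

// Hypothetical sketch of the S2G-Stream graph state (names are ours, not the
// Clustering4Ever API). Each node keeps a prototype w_c, a weight pi_c decayed
// by the fading function, and an accumulated error used for node insertion.
case class Node(
  prototype: Array[Double], // w_c = (w_c^1, ..., w_c^d)
  var pi: Double,           // pi_c, node weight
  var error: Double         // error(c), accumulated distance to assigned points
)

// Edges are undirected and carry their creation time, from which their age is derived.
case class Edge(a: Int, b: Int, createdAt: Long)

case class Graph(nodes: Map[Int, Node], edges: Set[Edge])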

Based on [36, 37], S2G-Stream introduces a double weighting system for features, denoted by \(\beta \), and subspaces, denoted by \(\alpha \). From these two types of weights, the subspaces of the clusters can be revealed. We propose two weighting models: the local weighting model (LWM), where the weights depend on the clusters, so each feature and each block has a different weight vector for each cluster; and the global weighting model (GWM), where the weights are independent of the clusters, so each feature and each block has a single weight shared by all clusters. Table 4 describes the notations used in our method S2G-Stream.

Table 4 Notations used in S2G-Stream

3.1 Main contribution

In the previous work [26], the authors consider all features equally important to the clustering task. However, some features or subspaces of features might be more influential in the clustering process. Our contribution in this paper is to introduce a double weighting system so that relevant features and subspaces contribute more to the clustering process; the contribution is made by introducing the weights into the cost function of our algorithm. In our previous works [38, 39], we proposed a local weighting model that depends on each cluster. The weights differ from one cluster to another, which makes it difficult to obtain scores with which to compare subspaces and features; therefore, further experiments could not be conducted. The method presented in this paper has the following merits compared to the previous works:

  • The introduction of the global and local weighting model. We used these weights as scores to conduct more experiments. The subspace clustering method with the global weighting model was used as a dimensionality reduction method.

  • For stream subspace clustering, more tests were conducted in this paper to answer the following questions: Does the order of the data significantly impact the data stream analysis? Does the overlap between data windows impact the data stream analysis? Is the type of weighting model important for the clustering process?

3.2 Cost function

Based on [36, 37], we propose to minimize the new cost function defined below for data batch \({\mathcal {X}}^{(t+1)}=\{ {\mathbf {X}}_1, {\mathbf {X}}_2, \ldots , {\mathbf {X}}_{t+1}\}\) with two weighting models. Figure 2 illustrates the difference between the two weighting models presented below.

3.2.1 Local weighting model (LWM)

In the local weighting model, \(\alpha \) is a \(K \times P\) matrix where \(\alpha _{c}^b\) is the weight of block b in node c. \(\beta \) is a \(K \times d\) matrix partitioned into blocks \(\beta _b\) of size \(K \times d_b\), where \(\beta _{cb}^j\) \((j=1,\ldots ,d_b)\) is the weight of the jth feature of block b for node c, with \(\sum _{j=1}^{d_b} \beta _{cb}^{j}=1\) and \(\sum _{b=1}^{P} \alpha _{c}^{b}=1\), \(\forall c \in {\mathcal {C}}\).

$$\begin{aligned}&\jmath _{(\mathrm{LWM})}^{(t+1)}(\phi ,{\mathcal {W}},\alpha ,\beta ) \nonumber \\&\quad = \sum _{c \in {\mathcal {C}}} \sum _{b = 1}^{P} \sum _{{\mathbf {x}}_i \in {\mathcal {X}}^{(t+1)}} {{\mathcal {K}}}^{\mathrm{T}}(\delta (c,\phi ({\mathbf {x}}_i))) \nonumber \\&\qquad \alpha _{c}^b {{\mathcal {D}}}_{\beta _{cb} } + J_{cb}+I_{c} \end{aligned}$$
(1)

where:

$$\begin{aligned} I_{c} = \lambda \sum _{b = 1}^{P} \alpha _{c}^b \log (\alpha _{c}^b) \\ J_{cb} = \eta \sum _{j = 1}^{d_b} \beta _{c}^j \log (\beta _{c}^j) \end{aligned}$$

and:

$$\begin{aligned} {{\mathcal {D}}}_{\beta _{cb}} = \sum _{j=1}^{d_b} \beta _{c}^j(x_{i}^j - \omega _{c}^j)^2 \end{aligned}$$

\(I_{c}\) and \(J_{cb}\), respectively, represent the weighted negative entropies associated with the block weight vectors and the feature weight vectors. The parameters \(\lambda \) and \(\eta \) are used to adjust the relative contributions made by the features and blocks to the clustering.
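As an illustration, the sketch below computes the three ingredients of the LWM cost for one node and one block: the block-restricted weighted distance \({\mathcal {D}}_{\beta _{cb}}\) and the entropy terms \(I_c\) and \(J_{cb}\). It is a plain Scala transcription of the formulas above under our own naming conventions, not the released code.

// Sketch of the LWM penalty terms for a single node c (illustrative names).
object LwmTerms {
  // D_{beta_cb}: weighted squared distance of x to prototype w restricted to block b;
  // blockIdx holds the indices of the features of block b, betaCB their weights.
  def dBeta(x: Array[Double], w: Array[Double],
            blockIdx: Array[Int], betaCB: Array[Double]): Double =
    blockIdx.zip(betaCB).map { case (j, bj) => bj * math.pow(x(j) - w(j), 2) }.sum

  // I_c = lambda * sum_b alpha_c^b * log(alpha_c^b)  (negative entropy of block weights)
  def iC(alphaC: Array[Double], lambda: Double): Double =
    lambda * alphaC.map(a => a * math.log(a)).sum

  // J_cb = eta * sum_j beta_cb^j * log(beta_cb^j)  (negative entropy of feature weights)
  def jCB(betaCB: Array[Double], eta: Double): Double =
    eta * betaCB.map(b => b * math.log(b)).sum
}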

3.2.2 Global weighting model (GWM)

In order to see whether the weights can be meaningful and show the importance of blocks and features when they are independent of the prototypes \({\mathcal {W}}\), we present in Eq. (2) another cost function where the weights \(\alpha \) and \(\beta \) are global and do not depend on the prototypes \({\mathcal {W}}\).

For this model, \(\alpha \) is a \(1 \times P\) matrix where \(\alpha _b\) is the weight of block b. \(\beta \) is a \(1 \times d\) matrix partitioned into blocks \(\beta _b\) of size \(1 \times d_b\), where \(\beta _{b}^j\) \((j=1,\ldots ,d_b)\) is the weight of the jth feature in block b, with \(\sum _{j=1}^{d_b} \beta _{b}^{j}=1\) and \(\sum _{b=1}^{P} \alpha _{b}=1\).

$$\begin{aligned}&\jmath _{(\mathrm{GWM})}^{(t+1)}(\phi ,{\mathcal {W}},\alpha ,\beta ) \nonumber \\&\quad = \sum _{c \in C} \sum _{b = 1}^{P} \quad \sum _{{\mathbf {x}}_i \in {\mathcal {X}}^{(t+1)}} {{\mathcal {K}}}^{\mathrm{T}}(\delta (c,\phi ({\mathbf {x}}_i))) \nonumber \\&\qquad \alpha _b {{\mathcal {D}}}_{\beta _{b} } + J _{b}+I \end{aligned}$$
(2)

where:

$$\begin{aligned} I = \lambda \sum _{b = 1}^{ P} \alpha _b \log (\alpha _b)\\ J_{b} = \eta \sum _{j = 1}^{ d_b} \beta ^j \log (\beta ^j) \end{aligned}$$

and:

$$\begin{aligned} {{\mathcal {D}}}_{\beta _{b}} = \sum _{j=1}^{d_b} \beta ^j(x_{i}^j - \omega _{c}^j)^2 \end{aligned}$$
Fig. 2
figure 2

Difference between global and local weighting models. \(\alpha _{c}^b\) is the weight of the subspace b in the node c, and \(\beta _{cb}^j(j=1,\ldots ,d_b)\) is the weight of the jth feature in the subspace b for the node c. \({w}_c\) is the prototype of node c

3.3 Optimization algorithm

The optimization of the cost function is performed alternately for each batch \({\mathcal {X}}^{(t+1)}\) in four steps corresponding to the four parameters \({\mathcal {W}},\phi ,\alpha \) and \(\beta \) (a simplified code sketch of these four steps is given after the list below):

  1. 1.

    Assignment function For a fixed \({\mathcal {W}},\alpha \) and \(\beta \), the assignment function \(\phi ({\mathbf {x}}_{i})\) is described below for both local and global weighting models. In order to reduce the computational cost, neighboring nodes are not considered in the assignment.

    • Local weighting model

      The assignment function for LWM is described in Eq. (3)

      $$\begin{aligned} \phi ({\mathbf {x}}_{i}) = \arg \min _{c \in {\mathcal {C}}}\left( \sum _{b=1}^{P}\alpha _{c}^{b}\sum _{j=1}^{d_b} \beta _{c}^j \left( x_{i}^j - \omega _{c}^j \right) ^2 \right) \end{aligned}$$
      (3)
    • Global weighting model

      The assignment function for GWM is described in Eq. (4)

      $$\begin{aligned} \phi ({\mathbf {x}}_{i}) = \arg \min _{c \in {\mathcal {C}}}\left( \sum _{b=1}^{P}\alpha _{b}\sum _{j=1}^{d_b} \beta ^j \left( x_{i}^j - \omega _{c}^j \right) ^2 \right) \end{aligned}$$
      (4)
  2. 2.

    Update prototypes \({\mathcal {W}}\): For a fixed \(\phi ,\alpha \) and \(\beta \) the prototypes \({\mathbf {w}}_c\) are updated for every batch of data following the equation defined below. Since the prototypes are independent of the weights, we have only one update function for both models.

    $$\begin{aligned} {\mathbf {w}}_{c}^{(t+1)}=\frac{{\mathbf {w}}_{c}^{(t)}n_{c}^{(t)}\gamma +\sum _{r \in C} {{\mathcal {K}}}^{\mathrm{T}}(\delta (r,c)){\mathbf {w}}_{r}^{(t)}m_{r}^{(t)}}{n_{c}^{(t)}\gamma +\sum _{r \in C} {{\mathcal {K}}}^{\mathrm{T}}(\delta (r,c))m_{r}^{(t)}} \end{aligned}$$
    (5)

    where \({\mathbf {w}}_{c}^{(t)}\) is the previous prototype, \(n_{c}^{(t)}\) is the number of points assigned to the cluster, \({\mathbf {w}}_{r}^{(t)}\) is the previous prototype for the cluster r (which is a neighbor of c), and \(m_{r}^{(t)}\) is the number of points added to the cluster r in the current batch: \(n_{c}^{(t+1)} = n_{c}^{(t)}+m_{c}^{(t)} \).

  3. 3.

    Update weights \(\alpha \): for a fixed \(\phi \), \({\mathcal {W}}\) and \(\beta \), we minimize the objective function (1) with respect to \(\alpha _{c}^{b}\), the weight of block b in the c-th cluster, subject to the constraint \(\sum _{b=1}^{P} \alpha _{c}^{b}=1\). We form the Lagrangian by isolating the terms which contain \(\alpha \) and adding Lagrange multipliers \(\mu \) as follows:

    $$\begin{aligned} {\mathcal {L}}_{\mathrm{LWM}}(\alpha ,\mu ) &= \jmath ^{(t+1)}(\phi ,{\mathcal {W}},\alpha ,\beta ) \nonumber \\&- \sum _{c \in {\mathcal {C}}} \mu _c \left( \sum _{b=1}^P \alpha _{c}^{b} - 1\right) \end{aligned}$$
    (6)

    Taking the derivative with respect to \(\alpha _{c}^{b}\) and setting it to zero yields a minimum of \(\alpha _{c}^{b}\) as follows.

    • Local weighting model

      $$\begin{aligned} \alpha _{c}^{b} = \frac{{\mathrm{e}}^{\frac{-D_{cb}}{\lambda } }}{\sum _{s=1}^{P} {\mathrm{e}}^{\frac{-D_{cs}}{\lambda } }} \end{aligned}$$
      (7)

      with

      $$\begin{aligned} D_{cb} &= \sum _{{\mathbf {x}}_{i} \in {\mathcal {X}}^{(t)}}{{\mathcal {K}}}^{\mathrm{T}}(\delta (\phi ({\mathbf {x}}_i),c)) \nonumber \\&\sum _{j=1}^{d_b}\beta _{c}^j \left( {\mathbf {x}}_{i}^j - w_{c}^{j}\right) ^2 \end{aligned}$$
      (8)
    • Global weighting model

      $$\begin{aligned} \alpha _{b} = \frac{{\mathrm{e}}^{\frac{-D_{b}}{\lambda } }}{\sum _{s=1}^{P} {\mathrm{e}}^{\frac{-D_{s}}{\lambda } }} \end{aligned}$$
      (9)

      with

      $$\begin{aligned} D_{b} &= \sum _{c \in {\mathcal {C}}} \sum _{{\mathbf {x}}_{i} \in {\mathcal {X}}^{(t)}}{{\mathcal {K}}}^{\mathrm{T}}(\delta (\phi ({\mathbf {x}}_i),c)) \nonumber \\&\sum _{j=1}^{d_b}\beta ^j \left( {\mathbf {x}}_{i}^j - w_{c}^{j}\right) ^2 \end{aligned}$$
      (10)
  4. 4.

    Update weights \(\beta \): for a fixed \(\phi \), \({\mathcal {W}}\) and \(\alpha \), we minimize the objective function (1) with respect to \(\beta ^{j}_{cb}\) (the weight of feature j of block b in the c-th cluster), subject to the constraint \(\sum _{j=1}^{d_b} \beta _{cb}^{j}=1\). We form the Lagrangian by isolating the terms which contain \(\beta \) and adding Lagrange multipliers \(\mu \) as follows:

    $$\begin{aligned} {\mathcal {L}}_{\mathrm{LWM}}(\beta ,\mu ) &= \jmath ^{(t+1)}(\phi ,{\mathcal {W}},\alpha ,\beta ) \nonumber \\&- \sum _{c \in {\mathcal {C}}} \sum _{b=1}^{P} \mu _{cb} \left( \sum _{j=1}^{d_b} \beta ^{j}_{cb} - 1\right) \end{aligned}$$
    (11)

    Taking the derivative with respect to \(\beta ^{j}_{cb}\) and setting it to zero yields a minimum of \(\beta ^{j}_{cb}\) as follows:

    • Local weighting model

    The update function for LWM is presented in Eq. (12)

    $$\begin{aligned} \beta ^{j}_{c} = \frac{{\mathrm{e}}^{\frac{-E_{c}^j}{\eta }}}{\sum _{h \in P_j} {\mathrm{e}}^{\frac{-E_{c}^h}{\eta } }} \end{aligned}$$
    (12)

    with

    $$\begin{aligned} E_{c}^j = \sum _{{\mathbf {x}}_i \in {\mathcal {X}}^{(t)}}\alpha _{c}^{b_j}{{\mathcal {K}}}^{\mathrm{T}}(\delta (\phi ({\mathbf {x}}_i),c)) \left( {\mathbf {x}}_{i}^j - w_{c}^{j}\right) ^2 \end{aligned}$$
    (13)

    where \(b_j\) is the block where the jth feature belongs.

    • Global weighting model

      The update function for GWM is presented in Eq. (14):

      $$\begin{aligned} \beta ^{j} = \frac{{\mathrm{e}}^{\frac{-E_j}{\eta }}}{\sum _{h \in P_j} {\mathrm{e}}^{\frac{-E_{h}}{\eta } }} \end{aligned}$$
      (14)

      with

      $$\begin{aligned} E_{j} = \sum _{c \in {\mathcal {C}}} \sum _{{\mathbf {x}}_i \in {\mathcal {X}}^{(t)}}\alpha _{b_j}{{\mathcal {K}}}^{\mathrm{T}}(\delta (\phi ({\mathbf {x}}_i),c)) \left( {\mathbf {x}}_{i}^j - w_{c}^{j}\right) ^2 \end{aligned}$$
      (15)
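To summarize the four steps above, the following simplified Scala sketch runs one batch of the alternating optimisation for the global weighting model. It collapses the neighbourhood kernel \({\mathcal {K}}^{\mathrm{T}}\) to the winner node and reads the second term of Eq. (5) as the sum of the points assigned to the node in the current batch, so it illustrates the update structure rather than the released implementation; all names are ours.

// Simplified one-batch sketch of the GWM optimisation (illustrative, not the
// released code): step 1 assigns points (Eq. (4)), step 2 updates prototypes
// (Eq. (5) without the neighbourhood term), steps 3-4 recompute alpha and beta
// through their softmax-like closed forms (Eqs. (9) and (14)).
object GwmBatchSketch {
  type Vec = Array[Double]

  // weighted distance used by the assignment function (Eq. (4))
  def dist(x: Vec, w: Vec, alpha: Vec, beta: Vec, blocks: Array[Array[Int]]): Double =
    blocks.indices.map { b =>
      alpha(b) * blocks(b).map(j => beta(j) * math.pow(x(j) - w(j), 2)).sum
    }.sum

  // Step 1: each point goes to its closest prototype
  def assign(batch: Array[Vec], w: Array[Vec], alpha: Vec, beta: Vec,
             blocks: Array[Array[Int]]): Array[Int] =
    batch.map(x => w.indices.minBy(c => dist(x, w(c), alpha, beta, blocks)))

  // Step 2: gamma-discounted update of one prototype from its newly assigned points
  def updatePrototype(wOld: Vec, nOld: Double, gamma: Double, pts: Array[Vec]): Vec = {
    val m = pts.length.toDouble
    if (m == 0) wOld
    else wOld.indices.map { j =>
      (wOld(j) * nOld * gamma + pts.map(_(j)).sum) / (nOld * gamma + m)
    }.toArray
  }

  // Steps 3 and 4: the closed-form weight updates share the same softmax-like shape,
  // exp(-cost / temperature) normalised to sum to one (Eqs. (9) and (14))
  def softmin(costs: Vec, temperature: Double): Vec = {
    val e = costs.map(c => math.exp(-c / temperature))
    val s = e.sum
    e.map(_ / s)
  }
}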

3.4 S2G-Stream algorithm

S2G-Stream extends the G-Stream algorithm [40] to subspace clustering by introducing block and feature entropy weighting. Starting with two nodes, as each new data point arrives, we link the nearest and the second-nearest nodes by an edge, and the nearest node, together with its topological neighbors, is moved towards the data point. We present below the main functions of the S2G-Stream algorithm.

3.4.1 Fading function

Most data stream algorithms consider the most recent data more important, as it better reflects the changes in the data distribution. For that, the notion of time windows is used. Three window models are commonly studied in data streams: landmark, sliding and damped [41]. We consider the damped window model, in which the weight of each node decreases exponentially with time via a fading function with a decay factor \(0<\gamma <1\).

$$\begin{aligned} \pi _{c}^{(t+1)}=\pi _{c}^{(t)}\gamma \end{aligned}$$
(16)

If the weight of a node is less than a threshold value, this node is considered outdated and removed (along with its links).
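A minimal sketch of this damped-window step, under our own naming, might look as follows: each node weight is multiplied by \(\gamma \) as in Eq. (16), and nodes falling below the threshold are dropped together with their edges.

// Illustrative fading-and-pruning step (Eq. (16)); pi maps node ids to weights.
object Fading {
  def fadeAndPrune(pi: Map[Int, Double], edges: Set[(Int, Int)],
                   gamma: Double, minWeight: Double): (Map[Int, Double], Set[(Int, Int)]) = {
    val faded = pi.map { case (c, w) => c -> w * gamma }          // pi_c <- pi_c * gamma
    val kept  = faded.filter { case (_, w) => w >= minWeight }    // drop outdated nodes
    val keptEdges = edges.filter { case (a, b) => kept.contains(a) && kept.contains(b) }
    (kept, keptEdges)
  }
}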

3.4.2 Edge management

An edge linking two nodes can be strengthened or removed. Its age grows with the exponential function \(2^{\tau _{age}(t-t_0)}\), where \(\tau _{age} > 0\) defines the growth rate of the age over time, t denotes the current time, and \(t_0\) is the creation time of the edge. A new edge can be added to connect two nodes, and an edge is removed if its age exceeds the maximum age.
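The corresponding age computation can be sketched as follows; this is our reading of the growth rule, with illustrative names.

// Illustrative edge-age rule: age grows as 2^(tauAge * (t - t0)) and edges older
// than ageMax are removed. The map associates each edge with its creation time t0.
object EdgeAging {
  def age(t: Long, t0: Long, tauAge: Double): Double =
    math.pow(2.0, tauAge * (t - t0))

  def pruneOldEdges(edges: Map[(Int, Int), Long], t: Long,
                    tauAge: Double, ageMax: Double): Map[(Int, Int), Long] =
    edges.filter { case (_, t0) => age(t, t0, tauAge) <= ageMax }
}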

figure a
Fig. 3
figure 3

An insertion of 3 nodes during the same time window

3.4.3 Node insertion

Nodes can be inserted into the graph between the two nodes having the highest error values. Figure 3 illustrates the insertion of 3 nodes. The node insertion process is described in Algorithm 2.

figure b
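Our reading of this insertion step can be sketched as follows; placing the new prototype at the midpoint of the two highest-error prototypes is an assumption borrowed from standard GNG insertion, and Algorithm 2 may place the new node differently.

// Hypothetical node-insertion sketch: q and f are the two nodes with the highest
// error values; the new prototype is their midpoint.
object NodeInsertion {
  def insert(prototypes: Map[Int, Array[Double]],
             errors: Map[Int, Double],
             newId: Int): (Int, Array[Double]) = {
    val sorted = errors.toSeq.sortBy(-_._2).map(_._1)
    val (q, f) = (sorted(0), sorted(1))
    val w = prototypes(q).zip(prototypes(f)).map { case (a, b) => (a + b) / 2 }
    newId -> w
  }
}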

The complete description of the S2G-Stream algorithm can be found in Algorithm 3.

figure c

4 Experimental results

4.1 Datasets and quality criteria

The S2G-Stream method described in this article was implemented in Spark/Scala and is available on the Clustering4Ever GitHub repository.Footnote 1 We evaluated the clustering quality of S2G-Stream on several synthetic and real datasets. The synthetic datasets DS1 and DS2 were generated using this tool.Footnote 2 The real datasets were taken from the UCI repository [42] and are described below:

  • Waveform Each class is generated from a combination of 2 of 3 “base” waves; the second 20 features contain noise (mean 0, variance 1).

  • Image segmentation (IS) Image data described by high-level numeric-valued attributes. The instances were drawn randomly from a database of 7 outdoor images. The images were hand-segmented to create a classification for every pixel. Each instance is a 3x3 region.

  • Cardiotocography (CTG) 2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians and a consensus classification label assigned to each of them.

  • pendigits Pen-based recognition of handwritten digits dataset: a digit database of 250 samples from 44 writers (Table 5).

    Table 5 Description of datasets used in experimentation

For the quality measures, we used the normalized mutual information (NMI) [43] and the adjusted Rand index (ARAND) [44]. NMI provides a measure that is independent of the number of clusters, as compared to purity. It reaches its maximum value of 1 only when the two sets of labels have a perfect one-to-one correspondence. The NMI of a clustering solution C is calculated as follows:

$$\begin{aligned} {\mathrm{NMI}}(Y, C) = \frac{2 \times I(Y; C)}{H(Y) + H(C)} \end{aligned}$$
(17)

where Y are the true labels and C the labels predicted by the algorithm, \(I(Y; C) = H(Y) - H(Y|C)\), and H(C) is the entropy of the partition, calculated as follows:

$$\begin{aligned} H(C) = - \sum _{k=1}^{K} \frac{n_k}{N}\log \left( \frac{n_k}{N}\right) \end{aligned}$$
(18)

where \(n_k\) is the number of points assigned to partition k. The ARAND index is a measure of agreement between two partitions: one given by the clustering process and the other defined by external criteria. It is computed from the following contingency table, where \(n_{ks}\) represents the number of points assigned to both cluster k of partition C and cluster s of partition \(C^{'}\) (Table 6).

Table 6 Contingency matrix between two partitions C and \(C^{'}\) of r and s clusters, respectively

The adjusted Rand index is calculated as follows:

$$\begin{aligned}&{\mathrm{ARAND}} \nonumber \\&\quad = \frac{ \sum _{ij} { \left( {\begin{array}{c}{n_{ij}}\\ 2\end{array}}\right) } - \left[ \sum _{i} { \left( {\begin{array}{c}{a_{i}}\\ 2\end{array}}\right) } \sum _{j} \left( {\begin{array}{c}{b_{j}}\\ 2\end{array}}\right) \right] / { \left( {\begin{array}{c}{n}\\ 2\end{array}}\right) } }{ \frac{1}{2} \left[ \sum _{i} { \left( {\begin{array}{c}{a_{i}}\\ 2\end{array}}\right) } + \sum _{j} { \left( {\begin{array}{c}{b_{j}}\\ 2\end{array}}\right) } \right] - \left[ \sum _{i} { \left( {\begin{array}{c}{a_{i}}\\ 2\end{array}}\right) } \sum _{j} { \left( {\begin{array}{c}{b_{j}}\\ 2\end{array}}\right) } \right] / {\left( {\begin{array}{c}{n}\\ 2\end{array}}\right) } } \end{aligned}$$
(19)
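For completeness, both measures can be computed directly from the label vectors as in the following sketch, a straightforward transcription of Eqs. (17)–(19) with our own names, not the evaluation code used in the experiments.

// Illustrative computation of NMI (Eq. (17)) and ARAND (Eq. (19)) from true
// labels y and predicted labels c.
object ClusteringScores {
  private def comb2(x: Double): Double = x * (x - 1) / 2.0
  private def entropy(p: Iterable[Double]): Double = -p.map(v => v * math.log(v)).sum

  def nmi(y: Array[Int], c: Array[Int]): Double = {
    val n   = y.length.toDouble
    val pY  = y.groupBy(identity).map { case (k, v) => k -> v.length / n }
    val pC  = c.groupBy(identity).map { case (k, v) => k -> v.length / n }
    val pYC = y.zip(c).groupBy(identity).map { case (k, v) => k -> v.length / n }
    // mutual information I(Y;C) = sum p(y,c) log( p(y,c) / (p(y) p(c)) )
    val mi = pYC.map { case ((a, b), p) => p * math.log(p / (pY(a) * pC(b))) }.sum
    2 * mi / (entropy(pY.values) + entropy(pC.values))
  }

  def arand(y: Array[Int], c: Array[Int]): Double = {
    val n = y.length.toDouble
    val sumIj = y.zip(c).groupBy(identity).map { case (_, v) => comb2(v.length) }.sum
    val sumA  = y.groupBy(identity).map { case (_, v) => comb2(v.length) }.sum
    val sumB  = c.groupBy(identity).map { case (_, v) => comb2(v.length) }.sum
    val expected = sumA * sumB / comb2(n)
    (sumIj - expected) / (0.5 * (sumA + sumB) - expected)
  }
}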

4.2 Experimental settings

Assuming large high-dimensional data arrives as a continuous stream, S2G-Stream divides the streaming data into batches and processes each batch continuously. The batch size depends on the available memory and the size of the original dataset. We set the time interval between two batches to 1 s. The parameters of S2G-Stream are described in Table 4. We repeated our experiments with different initializations and chose those giving the best results. We set \(\mu =3\), \(\gamma =0.99\) and \(age_{max}=250\). \(\lambda \), \(\eta \) and the batch size for each dataset are described in Table 7. The weights \(\alpha \) and \(\beta \) are initialized randomly under the two constraints \(\sum _{j=1}^{d_b} \beta _{cb}^{j}=1\) and \(\sum _{b=1}^{P} \alpha _{c}^{b}=1\), \(\forall c \in {\mathcal {C}}\).
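One simple way to draw such an initialisation is to normalise random positive vectors so that each constraint sums to one; the sketch below uses our own names and is not necessarily the initialisation of the released code.

// Illustrative random initialisation of alpha (one simplex of size P per node)
// and beta (one simplex per block per node), satisfying the two constraints.
object WeightInit {
  private val rng = new scala.util.Random()

  def randomSimplex(size: Int): Array[Double] = {
    val raw = Array.fill(size)(rng.nextDouble() + 1e-9)
    val s = raw.sum
    raw.map(_ / s)                        // entries are positive and sum to 1
  }

  // k nodes, blockSizes(b) = d_b features in block b
  def init(k: Int, blockSizes: Array[Int]): (Array[Array[Double]], Array[Array[Array[Double]]]) = {
    val alpha = Array.fill(k)(randomSimplex(blockSizes.length))
    val beta  = Array.fill(k)(blockSizes.map(randomSimplex))
    (alpha, beta)
  }
}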

Table 7 Initialization of parameters \(\lambda \) and \(\eta \) and batch size for each dataset

4.3 Clustering evaluation

To show the effectiveness of our method, we compared it to different subspace and stream clustering algorithms. As stream clustering methods, we chose CluStream and DStream from R package streamMOA.Footnote 3 For subspace clustering methods, we implemented 2S-SOM [36] in the Scala language, and the code is available on the C4E GitHub repository. We compared the method to CLIQUE,Footnote 4 PROCLUSFootnote 5, and W-K-means.Footnote 6 We also compared our method to the GNG algorithm. For each algorithm, we repeated the experiments 10 times on each dataset. The results are reported in Tables 8 and 9. It is noticeable that S2G-Stream gives better results than the other methods except for DStream on CTG and Waveform with NMI metric and on CTG and DS1 with ARAND metric.

We observe that the global weighting model of S2G-Stream (GWM) performs better than the CluStream algorithm, but DStream gives better results in most cases. These results stem from the fact that S2G-Stream detects noisy features, which allows relevant features to contribute more to the clustering. The notion of fading also improves the clustering quality of our method compared to the other methods by reducing the impact of irrelevant data. For the subspace clustering methods, our method outperforms most of the methods, except for CLIQUE on the Waveform dataset with the ARAND metric and W-K-means on CTG with the ARAND metric. We recall that all the subspace algorithms and GNG make several iterations over the data, while our algorithm makes just one pass over the data. The four subspace clustering algorithms provide good detection of the relevant subspaces, but their clustering process is unable to detect noise and arbitrarily shaped clusters. Our method detects both relevant subspaces and features, and these detected subspaces contribute to the clustering process and improve the clustering quality.

Table 8 Comparing S2G-Stream local (LWM) and global weighting model (GWM) with different stream algorithms and GNG
Table 9 Comparing S2G-Stream local (LWM) and global weighting model (GWM) with different subspace algorithms

Figures 4, 5, 6, 7, 8 and 9 show a comparison of our models with the CluStream and DStream algorithms in terms of NMI and ARAND for each window. For almost all cases, the NMI and ARAND of S2G-Stream (LWM) are higher than those of CluStream and DStream, except for some windows and some datasets (e.g., DStream on Waveform with NMI). This is due to the evolution of the nodes in S2G-Stream and its capacity to remove noisy points and outdated prototypes over time, while the number of nodes of CluStream and DStream remains static. The results of S2G-Stream (GWM) are better than those of CluStream on most windows, while the DStream algorithm outperforms this second model in certain cases. S2G-Stream shows a greater ability to partition high-dimensional data and is more stable in subspace clustering analysis.

Fig. 4
figure 4

Evolution of NMI and ARAND for waveform dataset compared with CluStream and DStream algorithms

Fig. 5
figure 5

Evolution of NMI and ARAND for IS dataset compared with CluStream and DStream algorithms

Fig. 6
figure 6

Evolution of NMI and ARAND for CTG dataset compared with CluStream and DStream algorithms

Fig. 7
figure 7

Evolution of NMI and ARAND for pendigits dataset compared with CluStream and DStream algorithms

Fig. 8
figure 8

Evolution of NMI and ARAND for DS1 dataset compared with CluStream and DStream algorithms

Fig. 9
figure 9

Evolution of NMI and ARAND for DS2 dataset compared with CluStream and DStream algorithms

4.4 Detection of subspaces

For this section, we restrict ourselves to the Waveform and CTG datasets, since the blocks of these datasets are defined in their descriptions. The CTG dataset describes fetal cardiotocograms and is composed of 3 blocks: block 1 contains seven features related to the heart rate of the fetus, block 2 contains four features describing heart rate variability, and block 3 is composed of 10 features defining histograms of the fetal cardiotocograms. The Waveform dataset is composed of 2 blocks of 20 features, where the second block is composed of noisy features.

4.4.1 Local weighting model

Figure 10 represents the prototypes \({\mathcal {W}}\), the \(\beta \) weights and the \(\alpha \) weights for the final batch of the CTG dataset. We observe in Fig. 10b that the weights of features (8, 9, 10, 11), which are, respectively, ASTV,Footnote 7 MSTV,Footnote 8 ALTVFootnote 9 and MLTV,Footnote 10 are higher than the weights of the other features for most clusters, indicating that these four features influence the clustering process more than the other features. In Fig. 10c, we observe that the weight \(\alpha \) of the second block, which contains these four features, is also higher than the weights of the other two blocks. We conclude from this experiment that heart rate variability drives the clustering of fetal cardiotocograms.

Fig. 10
figure 10

Results of local weights \(\alpha \) and \(\beta \), and prototypes \({\mathcal {W}}\) for the final batch for CTG dataset. Each color represents a node

Figure 11 illustrates the capability of S2G-Stream to detect noise on the Waveform dataset. It represents the prototypes \({\mathcal {W}}\), the weights \(\beta \) and the weights \(\alpha \) for the final batch of the Waveform dataset. We can clearly observe in Fig. 11b that the weights of the first 20 features increase over time, while the weights of the 20 other features decrease; the weights of the 20 noisy features are much lower than the weights of the other features. In Fig. 11c, the weight \(\alpha \) of block 1 is higher than the weight of block 2, which makes sense since the second block contains only noisy features. In summary, our algorithm assigns higher weights to the first 20 features, while the weights of the noisy features are lower.

Fig. 11
figure 11

Results of local weights \(\alpha \) and \(\beta \), and prototypes \({\mathcal {W}}\) for the final batch for Waveform dataset. Every color represents a node

4.4.2 Global weighting model

To show the effectiveness of the global weighting model, we report in Figs. 12 and 13 the weights \(\alpha \), \(\beta \) and prototypes \({\mathcal {W}}\) for the final batch for CTG and Waveform datasets, for multiple training epochs.

For the CTG dataset, the blocks are detected more accurately than with the local weighting model. The weight of the second block and of the features it contains is always higher than the weights of the other blocks and their features. The prototypes \({\mathcal {W}}\) are the same as in the first model, since the prototype update function is shared by both models.

Fig. 12
figure 12

Results of global weights \(\alpha \) and \(\beta \), and prototypes \({\mathcal {W}}\) for the final batch for CTG dataset for the second weighting model

For the Waveform dataset, the blocks are also well detected, but the model assigns lower weights to some relevant features detected by LWM (features 1–5). This is due to the fact that the model does not take the nodes and their preferences into consideration. We can assume that the first model is better adapted to this kind of dataset.

Fig. 13
figure 13

Results of global weights \(\alpha \) and \(\beta \), and prototypes \({\mathcal {W}}\) for the final batch for waveform dataset for the second weighting model

4.4.3 Feature selection with GWM

Even though the results of GWM are lower than those of LWM, as seen in Table 8, GWM provides a global weighting that we use below to select the best features, which shows the value of GWM for subspace detection.

Based on the previous experiments, we ran S2G-Stream on the reduced dataset only, i.e., the relevant blocks selected from the dataset based on the average of the block weights \(\alpha \). We present the NMI and ARAND results compared with the results on the whole dataset in Fig. 14. For the CTG dataset, both the NMI and ARAND results on the reduced dataset are better than the results using all features and blocks, although block 1 combined with block 2 yields a lower ARAND value. We conclude that block 1 gives better results than the other blocks, which confirms the previous results. For the Waveform dataset, the NMI and ARAND measured on the reduced dataset (block 1) are higher than on the whole dataset, since the noisy features in the whole dataset degrade the results. The same is observed for the IS dataset. For pendigits, the NMI and ARAND results on the reduced dataset are lower, since both blocks contain features that are required for pen-based recognition.
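The reduction step itself is straightforward; the sketch below (illustrative names, with the number of blocks to keep as a hypothetical parameter) ranks blocks by their global \(\alpha \) weight and keeps only the features of the best-ranked blocks.

// Illustrative feature selection from GWM block weights: keep the features of
// the `keep` blocks with the highest alpha weight.
object BlockSelection {
  def selectFeatures(alpha: Array[Double],        // one global weight per block
                     blocks: Array[Array[Int]],   // feature indices of each block
                     keep: Int): Array[Int] = {
    val bestBlocks = alpha.zipWithIndex.sortBy(-_._1).take(keep).map(_._2)
    bestBlocks.flatMap(b => blocks(b)).sorted
  }

  // project a data point onto the selected features
  def reduce(x: Array[Double], selected: Array[Int]): Array[Double] =
    selected.map(j => x(j))
}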

Fig. 14
figure 14

NMI and ARAND for all real datasets compared with results on reduced datasets

4.5 Clustering evolution

Figure 15 shows an example of the evolution of the S2G-Stream graph on the Waveform dataset (using Sammon's nonlinear mapping) as the data flows (colored points represent labeled data points, and black points represent nodes of the graph, with edges in black lines). We can clearly see that S2G-Stream, beginning with two randomly chosen nodes (Fig. 15a), gradually recognizes the structure of the data stream (Fig. 15b, c). At the end of the training, the topology covers the whole data structure (Fig. 15d). It is noticeable that our method manages to recognize the structures of the data stream and can separate them clearly; it can also detect arbitrarily shaped clusters.

Fig. 15
figure 15

Evolution of graph creation of S2G-Stream on waveform dataset

4.6 Execution time

Figure 16 shows the execution times of S2G-Stream and the other stream clustering algorithms. The CluStream algorithm has the shortest execution time, but our method is faster than DStream on all the datasets, while S2G-Stream outperforms all the algorithms in terms of the quality results shown above. S2G-Stream is therefore nondominated across all datasets: no other algorithm produces better results within equal or less time.

Fig. 16
figure 16

Execution time of S2G-Stream compared to other stream algorithms

4.7 Evolving data streams

In order to demonstrate the effectiveness of S2G-Stream in clustering evolving data streams, we evaluate in this subsection our method on the same datasets used above, with the points ordered by their class (i.e., the data points of the first class arrive first, then those of the second, third, etc.). Old classes disappear over time due to the use of the fading function. We report in Table 10 the results of S2G-Stream on these datasets with and without ordering the classes. We ran both models in each case, on both ordered and unordered datasets, and compared them to CluStream and DStream based on the NMI and ARAND measures. The results of this comparison are presented in Figs. 17 and 18.

We can see that the NMI and ARAND values for the sorted datasets are slightly lower than for the non-sorted datasets, but with no large differences. For CluStream and DStream, in contrast, the clustering performance is strongly affected by the order of the points. We conclude from this experiment that our method is well adapted to evolving data streams for most datasets.

Table 10 Comparing results of S2G-Stream with and without sorted classes
Fig. 17
figure 17

NMI for different datasets with and without ordering classes compared with CluStream and DStream algorithms

Fig. 18
figure 18

ARAND for different datasets with and without ordering classes compared with CluStream and DStream algorithms

4.8 Clustering over sliding windows

Most data stream algorithms consider the most recent data more important, as it better reflects the changes in the data distribution. Therefore, the notion of a sliding window is introduced in order to analyze only the most recent data points together with the model obtained from the previous ones. Each window t contains P points, and we consider that these P recent points contain some points from the previous window. Figure 19 illustrates the principle of the sliding window model.

Fig. 19
figure 19

Sliding window model with windows overlap

The overlap ratio between two windows is the percentage of points that are kept from the previous window and re-used for learning together with the new points. If at each step the M oldest points are removed and M new points are appended, then the overlap ratio is defined as \((1 - M/P)\). We tested S2G-Stream (LWM) and S2G-Stream (GWM) with different overlap ratios for each dataset; the results are reported in Tables 11 and 12. We observe that NMI and ARAND increase with the overlap ratio for the small and medium-sized datasets: the overlap between windows helps overcome the limited number of points and, as a result, enhances the performance of the method. For large datasets, we observe a slight drop in performance, since the overlap between windows makes learning from large datasets more difficult.
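A sketch of this window construction, with the overlap ratio as a parameter and purely illustrative names, follows.

// Illustrative sliding window with overlap: the M oldest of the P current points
// are dropped and M new points are appended, so the overlap ratio is 1 - M/P.
object SlidingWindow {
  def nextWindow[T](current: Vector[T], incoming: Vector[T], overlap: Double): Vector[T] = {
    val p = current.length
    val m = math.max(1, math.round((1 - overlap) * p).toInt)  // number of points replaced
    current.drop(m) ++ incoming.take(m)
  }
}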

Table 11 NMI and ARAND of S2G-Stream (LWM) while changing the overlap percentage of sliding windows for each dataset
Table 12 NMI and ARAND of S2G-Stream (GWM) while changing the overlap percentage of sliding windows for each dataset

5 Reproducibility

To facilitate further experiments and reproducible research, we provide our contributions through an open-source API that contains several clustering algorithms, including both versions of S2G-Stream presented in this paper (local and global) and the 2S-SOM implemented in Spark/Scala, together with the API documentation, at Clustering4Ever.Footnote 11

Moreover, Spark Notebooks of our method and other clustering algorithms are available at Clustering4Ever notebooks.Footnote 12

6 Conclusion

In this paper, we have proposed S2G-Stream, an efficient method for online subspace clustering of evolving data streams with two models of feature and block weighting (global and local). We used the obtained weights as scores to conduct further experiments, and the subspace clustering method with the global weighting model was used as a dimensionality reduction method. We also studied the impact of the order of the data points and of window overlap on the clustering quality. The experimental evaluation demonstrates the effectiveness and efficiency of S2G-Stream in discovering clusters of arbitrary shapes as well as relevant features and blocks. The global weighting model gave us a comprehensive view of the features and blocks, which we used to select the best subset of features and enhance the clustering performance.

The performance of S2G-Stream in terms of clustering quality, compared to other relevant data stream algorithms, is promising. We plan to extend S2G-Stream to deal with binary, categorical and mixed data streams. We also plan to introduce multi-objective clustering into our method, which is an efficient way to detect arbitrarily shaped clusters and provide better groupings by optimizing more than one objective function, and to combine it with subspace clustering methods.