
1 Introduction

In recent years, topics connected with data stream mining have attracted much attention from machine learning researchers [1, 6, 9,10,11,12,13,14,15,16,17, 24, 27,28,29,30,31,32,33,34, 37, 38]. In data streams it is quite common that the underlying data distribution changes over time. This phenomenon is known in the literature as ‘concept drift’ [40, 41]. Another characteristic of streaming data is its potentially infinite size and the high rate at which it arrives at the system. Therefore, proper data stream mining algorithms should be resource-aware [8, 15].

The data stream can be defined as a sequence of data elements

$$\begin{aligned} S = (x_{0},x_{1},x_{2},\dots ). \end{aligned}$$
(1)

In this paper the case of unsupervised learning is considered, which means that there are no class labels assigned to data elements. Hence, the data stream element \(x_{n}\) is a D-dimensional vector defined as follows

$$\begin{aligned} x_{n} = \left[ x^{1}_{n},\dots ,x_{n}^{D}\right] . \end{aligned}$$
(2)

The data stream mining algorithm should be able to react to possible changes in data distribution [16, 20, 41]. Concept drift handling can be realized in two ways, i.e. passively or actively. In the passive approach, a concept drift reaction mechanism is incorporated into the data stream mining algorithm itself. It can be done by applying sliding window techniques [7] or ensemble methods [28, 29]. In the active approach, an external drift detector functions in parallel to the main algorithm. Based on the information from the detector, the algorithm can decide to rebuild the model. Many active drift detectors focus on monitoring the current accuracy of the model. If it decreases, the model is considered invalid and needs to be modified. Examples of such methods are the Drift Detection Method (DDM) [18] and the Early Drift Detection Method (EDDM) [3]. Other active drift detection methods rely on statistical tests. They look for possible changes in the parameters of data probability density functions. Among others, the Welch, the Kolmogorov-Smirnov [27], and the Page-Hinkley tests are often used [19].

In this paper, we present an active concept drift detection approach based on autoencoders. This approach is similar to the one presented in [23], in which the Restricted Boltzmann Machine (RBM) [39] was applied as a drift detector. That idea was further extended with resource-awareness strategies to deal with fast data streams [26] and with missing values [25]. Autoencoders are important neural network models often used in the field of deep learning [21]. They are used, for example, in greedy pretraining methods for learning deep neural networks [4, 5]. Similarly to RBMs, a properly trained autoencoder encodes in its middle layer information about the data probability distribution. It can then be used to check whether data from another part of the stream belong to the same distribution as the part on which the autoencoder was trained. This can be verified, for example, by measuring the reconstruction error or the cross-entropy of the new data.

The rest of the paper is organized as follows. In Sect. 2 autoencoders are presented in more detail. In Sect. 3 methods for concept drift detection using autoencoders are proposed. In Sect. 4 the results obtained in numerical simulations are demonstrated. Sect. 5 concludes the paper and indicates possible future research directions.

2 Autoencoders

The autoencoder is a kind of feed-forward neural network which is trained to reconstruct its input data [22]. There are many types of autoencoders [21], such as the denoising autoencoder [2], the sparse autoencoder [35], and the contractive autoencoder [36]. Generally, the autoencoder is composed of two parts: the encoder and the decoder. The aim of the encoder is to transform the input data in a nonlinear way, extracting its most important features. The decoder is trained to map the output of the encoder back to the input data as closely as possible. In this paper, we consider autoencoders consisting of \(L+1\) layers. Let \(y_{j}^{(l)}(x_{n})\) denote the value of the j-th neuron in the l-th layer obtained for input data \(x_{n}\). Then, the values of neurons in the l-th layer can be computed using the following formula

$$\begin{aligned} y^{(l)}_{j}(x_{n}) = \sigma \left( \sum _{i=1}^{N_{l-1}}w^{(l)}_{ij}y^{(l-1)}_{i}(x_{n})+b_{j}^{(l)}\right) ,\; l = 1,\dots ,L, \end{aligned}$$
(3)

where \(w_{ij}^{(l)}\) are the weights of synapses between the \((l-1)\)-th and the l-th layers, \(b_{j}^{(l)}\) are the biases, and \(N_{l}\) is the size of the l-th layer. Function \(\sigma (z)\) is an activation function. Many different types of activation functions can be applied in an autoencoder. In this paper, we apply the most common one, i.e. the sigmoid function defined below

$$\begin{aligned} \sigma (z)=\frac{1}{1+\exp (-z)}. \end{aligned}$$
(4)

The 0-th layer is the input layer, i.e. \(y_{j}^{(0)}(x_{n}) = x_{n}^{j}\). The size of the output layer is equal to the size of the input layer, i.e. \(N_{L} = N_{0} = D\). If we do not apply any sparsity constraints while learning the autoencoder, the middle layer should be the smallest one, i.e.

$$\begin{aligned} D = N_{0}> N_{1}> \dots > N_{\frac{L}{2}}< \dots< N_{L-1} < N_{L} = D \end{aligned}$$
(5)
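
To make the forward pass concrete, a minimal NumPy sketch of formulas (3)–(5) is given below; the function names, the initialization scale, and the example layer sizes are illustrative assumptions rather than part of the method itself.

```python
import numpy as np

def sigmoid(z):
    # Eq. (4)
    return 1.0 / (1.0 + np.exp(-z))

def init_autoencoder(layer_sizes, rng):
    # Random initialization for a bottleneck architecture as in Eq. (5),
    # e.g. layer_sizes = [20, 15, 20]; the initialization scale is an
    # illustrative choice, not prescribed in the text.
    weights = [rng.normal(0.0, 0.1, size=(layer_sizes[l - 1], layer_sizes[l]))
               for l in range(1, len(layer_sizes))]
    biases = [np.zeros(layer_sizes[l]) for l in range(1, len(layer_sizes))]
    return weights, biases

def forward(x, weights, biases):
    # Layer-by-layer propagation of Eq. (3); y[0] is the input (0-th layer)
    # and y[-1] is the reconstruction produced by the L-th layer.
    y = [np.asarray(x, dtype=float)]
    for W, b in zip(weights, biases):
        y.append(sigmoid(y[-1] @ W + b))
    return y
```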

The autoencoder is trained as a usual feed-forward neural network, by minimizing a cost function. Let \(C(x_{n})\) be the cost function value obtained for the data element \(x_{n}\). Then, for example, a weight \(w_{ij}^{(l)}\) is updated in each step using the following formula

$$\begin{aligned} w_{ij}^{(l)} := w_{ij}^{(l)} - \eta \frac{\partial C(x_{n})}{\partial w_{ij}^{(l)}}, \end{aligned}$$
(6)

where \(\eta \) is the learning rate. An analogous formula applies to the biases.

In practice, the learning is often performed on minibatches of data instead of single data elements. Let us assume that the minibatch \(M_{t}\) consists of B consecutive data elements

$$\begin{aligned} M_{t} = (x_{tB},\dots ,x_{(t+1)B-1}). \end{aligned}$$
(7)

Then, the neural network parameters are updated using the arithmetic average of the gradients computed for all data elements from the minibatch

$$\begin{aligned} w_{ij}^{(l)} := w_{ij}^{(l)} - \eta \frac{1}{B}\sum _{m=tB}^{(t+1)B-1}\frac{\partial C(x_{m})}{\partial w_{ij}^{(l)}}. \end{aligned}$$
(8)
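
A possible NumPy sketch of the averaged update of formula (8) is given below; the gradient routine `grad_fn` is an assumed placeholder (e.g. standard backpropagation through the layers defined by (3)), not a specific library call.

```python
import numpy as np

def sgd_minibatch_step(params, minibatch, grad_fn, eta=0.05):
    # One update according to Eq. (8): average the per-element gradients
    # over the minibatch and take a single gradient-descent step.
    # `grad_fn(params, x)` is an assumed placeholder returning gradients
    # with the same structure as `params`.
    B = len(minibatch)
    grads = [np.zeros_like(p) for p in params]
    for x in minibatch:
        for acc, g in zip(grads, grad_fn(params, x)):
            acc += g
    return [p - eta * acc / B for p, acc in zip(params, grads)]
```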

Many different types of cost functions can be used. In this paper, we consider two of them. The cross-entropy:

$$\begin{aligned} C_{CE}(x_{n}) = -\sum _{j=1}^{D}\left[ x_{n}^{j}\log _{2}\left( y^{(L)}_{j}(x_{n})\right) +\left( 1-x_{n}^{j}\right) \log _{2}\left( 1-y^{(L)}_{j}(x_{n})\right) \right] , \end{aligned}$$
(9)

and the reconstruction error:

$$\begin{aligned} C_{RE}(x_{n}) = \sqrt{\sum _{j=1}^{D}\left( x_{n}^{j} - y^{(L)}_{j}(x_{n})\right) ^{2}}. \end{aligned}$$
(10)
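
Both cost functions can be computed directly from the reconstruction \(y^{(L)}(x_{n})\) produced by the autoencoder; the short NumPy sketch below illustrates them (the clipping constant `eps` is our own numerical safeguard, not part of the definitions).

```python
import numpy as np

def cross_entropy_cost(x, y_out, eps=1e-12):
    # Eq. (9); `y_out` is the reconstruction y^(L)(x_n).
    y = np.clip(y_out, eps, 1.0 - eps)
    return -np.sum(x * np.log2(y) + (1.0 - x) * np.log2(1.0 - y))

def reconstruction_error(x, y_out):
    # Eq. (10): Euclidean distance between the input and its reconstruction.
    return np.sqrt(np.sum((x - y_out) ** 2))
```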

The monitoring of cost function values can be used to detect potential drifts in data distribution. The detection procedure is described in the next section.

3 Concept Drift Detection with the Autoencoder

We expect that an autoencoder properly trained on one part of the data stream can be applied to monitor possible changes in the data distribution of the following parts. For any cost function C (given by (9) or (10)), the average value of this cost function for a minibatch \(M_{t}\) of size B is given by

$$\begin{aligned} C(M_{t}) = \frac{1}{B}\sum _{x_{m}\in M_{t}}C(x_{m}) \end{aligned}$$
(11)

Apart from the cost function itself, we also want to investigate the trend Q(C, t) of cost function C over subsequent minibatches. To ensure that the trend is computed only on the basis of the most recent data, we additionally apply a forgetting mechanism, as it was done in [23]. The trend at time t is given by the following formula

$$\begin{aligned} Q(C,t) = \frac{\overline{n}_{t}\overline{TC}_{t} - \overline{T}_{t}\overline{C}_{t}}{\overline{n}_{t}\overline{T^{2}}_{t} - \left( \overline{T}_{t}\right) ^{2}}. \end{aligned}$$
(12)

In the standard regression procedure, the terms in formula (12) are given simply by the appropriate arithmetic averages. In our case, because of the forgetting mechanism, the arithmetic averages are replaced by the following recurrent formulas

$$\begin{aligned} \overline{TC}_{t} = \lambda \overline{TC}_{t-1} + t\,C(M_{t}),\; \overline{TC}_{0} = 0, \end{aligned}$$
(13)
$$\begin{aligned} \overline{T}_{t} = \lambda \overline{T}_{t-1} + t,\; \overline{T}_{0} = 0, \end{aligned}$$
(14)
$$\begin{aligned} \overline{C}_{t} = \lambda \overline{C}_{t-1} + C(M_{t}),\; \overline{C}_{0} = 0, \end{aligned}$$
(15)
$$\begin{aligned} \overline{T^{2}}_{t} = \lambda \overline{T^{2}}_{t-1} + t^{2},\; \overline{T^{2}}_{0} = 0, \end{aligned}$$
(16)
$$\begin{aligned} \overline{n}_{t} = \lambda \overline{n}_{t-1} + 1,\; \overline{n}_{0} = 0, \end{aligned}$$
(17)

where \(\lambda \) is a forgetting factor taking values lower than 1. The trend of the cost function should be close to 0 as long as there is no concept drift in the data stream. After the drift occurs, Q(C, t) is expected to increase to some positive value. Since we consider two cost functions in this paper, two trends are analyzed as well, i.e. \(Q(C_{CE},t)\) and \(Q(C_{RE},t)\).
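
The trend computation requires only a few scalar accumulators; a minimal sketch is given below, where `cost_Mt` stands for the average minibatch cost \(C(M_{t})\) from (11) and the class name is our own.

```python
class TrendDetector:
    # Recurrent statistics of Eqs. (13)-(17) and the trend Q(C, t) of Eq. (12),
    # with forgetting factor `lam` (lambda in the text).
    def __init__(self, lam=0.998):
        self.lam = lam
        self.TC = self.T = self.C = self.T2 = self.n = 0.0
        self.t = 0

    def update(self, cost_Mt):
        # `cost_Mt` is the average minibatch cost C(M_t) from Eq. (11).
        self.t += 1
        t, lam = self.t, self.lam
        self.TC = lam * self.TC + t * cost_Mt      # Eq. (13)
        self.T  = lam * self.T + t                 # Eq. (14)
        self.C  = lam * self.C + cost_Mt           # Eq. (15)
        self.T2 = lam * self.T2 + t ** 2           # Eq. (16)
        self.n  = lam * self.n + 1.0               # Eq. (17)
        denom = self.n * self.T2 - self.T ** 2     # denominator of Eq. (12)
        if denom == 0.0:
            return 0.0
        return (self.n * self.TC - self.T * self.C) / denom
```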

4 Experimental Results

In this section, we present preliminary results of experiments conducted on simple synthetic datasets with concept drift. The datasets were generated in the same manner as in [23], i.e. based on synthetic Boltzmann machines with randomly chosen parameters. At the beginning, we applied two Boltzmann machines, which were then used to generate two datasets with static probability distributions

$$\begin{aligned} S_{1} = \left( x_{1,1},x_{1,2},x_{1,3}\dots \right) ,\end{aligned}$$
(18)
$$\begin{aligned} S_{2} = \left( x_{2,1},x_{2,2},x_{2,3},\dots \right) . \end{aligned}$$
(19)

The dimensionality of the data in both sets is \(D=20\). The datasets \(S_{1}\) and \(S_{2}\) were used to generate two datasets with concept drifts. The dataset with sudden concept drift is denoted as \(S_{s} = \left( x_{s,1},x_{s,2},\dots \right) \), where \(x_{s,n}\) is taken from \(S_{1}\) if \(n < 500000\) and from \(S_{2}\) if \(n \ge 500000\). The dataset with gradual concept drift is denoted as \(S_{g} = \left( x_{g,1},x_{g,2},\dots \right) \), where \(x_{g,n}\) is taken from \(S_{1}\) if \(n < 500000\) and from \(S_{2}\) if \(n \ge 600000\). In the interval [500000; 600000) the data element \(x_{g,n}\) is drawn from \(S_{1}\) with probability \(\frac{600000\,-\,n}{100000}\) and from \(S_{2}\) with probability \(\frac{n\,-\,500000}{100000}\). Hence, the two datasets \(S_{s}\) and \(S_{g}\) are constructed in such a way that in both of them the concept drift occurs after processing 500000 data elements.
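
For illustration, a possible sketch of drawing a single element of the gradual-drift stream \(S_{g}\) is shown below; the sampler functions for \(S_{1}\) and \(S_{2}\) are assumed to be given (for the sudden-drift stream \(S_{s}\) the mixing interval simply collapses to a hard switch at the 500000-th element).

```python
import numpy as np

rng = np.random.default_rng(0)

def gradual_stream_element(n, sample_s1, sample_s2,
                           start=500_000, end=600_000):
    # Draw x_{g,n}: before `start` from S_1, after `end` from S_2, and in the
    # interval [start, end) from S_1 with probability (end - n) / (end - start).
    # `sample_s1` / `sample_s2` are assumed samplers for the two static
    # distributions (here, the synthetic Boltzmann machines).
    if n < start:
        return sample_s1()
    if n >= end:
        return sample_s2()
    p_s1 = (end - n) / (end - start)
    return sample_s1() if rng.random() < p_s1 else sample_s2()
```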

In the experiments, we used a simple autoencoder with three layers (\(L=2\) in the notation of Sect. 2). The sizes of the 0-th and the 2-nd layers are equal to \(N_{0} = N_{2} = D = 20\), to match the dimensionality of the data. The middle layer is of size \(N_{1} = 15\). The learning rate and the minibatch size are set to \(\eta = 0.05\) and \(B = 20\), respectively. In each experiment, the autoencoder is trained only on the first 400000 data elements. Then, it is switched to the monitoring mode (no learning is performed, only the values of the cost functions are analyzed). The forgetting factor \(\lambda \) for the trend measures was set to 0.998.
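
Combining the earlier sketches, the monitoring phase of this setup could look roughly as follows; `stream`, `weights`, and `biases` are assumed to be already available (the parameters trained on the first 400000 elements), and the snippet is only an illustration of the procedure, not the exact implementation used in the experiments.

```python
# `stream` is assumed to be an array of shape (N, 20) holding the data
# elements; `weights` and `biases` come from the training phase.
B = 20
detector = TrendDetector(lam=0.998)
reconstruct = lambda x: forward(x, weights, biases)[-1]

for t in range(400_000 // B, len(stream) // B):
    minibatch = stream[t * B:(t + 1) * B]
    c_ce = np.mean([cross_entropy_cost(x, reconstruct(x)) for x in minibatch])
    q_ce = detector.update(c_ce)   # trend Q(C_CE, t) of Eq. (12)
    # a sustained rise of q_ce above its near-zero baseline signals a drift
```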

4.1 Sudden Concept Drift Detection

The first experiment was conducted on the \(S_{s}\) dataset with the sudden drift. The results obtained for the cross-entropy cost function are presented in Fig. 1. As can be seen, the value of the cross-entropy decreases while the autoencoder learns, until the 400000-th data element. Then the autoencoder monitors the incoming data. Since the distribution of the data does not change until the 500000-th element, the cross-entropy values fluctuate around some fixed level. Then the sudden drift occurs and the cross-entropy suddenly reaches higher values. This is more clearly visible in Fig. 1(b), where the trend of the cross-entropy values is presented. The peak at \(n=500000\) matches the occurrence of the sudden drift.

Fig. 1. Cross-entropy values and trend as a function of the number of processed data elements, for dataset \(S_{s}\) with sudden concept drift.

In the case of the reconstruction error, the results can additionally be compared with the drift detector based on the RBM, presented in [23]. We applied the same parameters as for the autoencoder. The RBM was trained using the Contrastive Divergence (CD-k) algorithm with \(k = 1\). The results are shown in Fig. 2.

Fig. 2. Reconstruction error values and trend as a function of the number of processed data elements, for dataset \(S_{s}\) with sudden concept drift.

The autoencoder appears to outperform the RBM-based approach in both considered aspects. In Fig. 2(a) we can observe that the values of the reconstruction error are lower for the autoencoder. This might be important if we wanted to use the network simultaneously for other purposes besides concept drift detection. In Fig. 2(b) it is visible that the peak signaling the sudden change is significantly higher for the autoencoder.

4.2 Gradual Concept Drift Detection

In the next experiment, analogous simulations were carried out for the dataset \(S_{g}\) with gradual concept drift. The results concerning the cross-entropy are presented in Fig. 3, whereas in Fig. 4 the comparisons of reconstruction error values are demonstrated.

Fig. 3. Cross-entropy values and trend as a function of the number of processed data elements, for dataset \(S_{g}\) with gradual concept drift.

Fig. 4. Reconstruction error values and trend as a function of the number of processed data elements, for dataset \(S_{g}\) with gradual concept drift.

In this experiment, the values of the cost functions rise smoothly while the gradual drift takes place between the 500000-th and the 600000-th data elements. Figure 4 confirms the superiority of the autoencoder over the RBM-based approach also in the case of gradual drift.

5 Conclusions and Future Work

In this paper, the autoencoder was proposed as a drift detector for time-changing data streams. First, the neural network learns a model of the initial part of the stream. Then it is used to monitor possible changes in the following parts. The variations of the cost function values are analyzed. If the changes turn out to be large, it means that a concept drift has occurred, and this can be a signal for the data stream mining algorithm to rebuild its current model. Two cost functions were applied in this paper, i.e. the cross-entropy and the reconstruction error. The preliminary results obtained in experiments carried out on simple synthetic datasets confirmed that the autoencoder can be successfully used as a concept drift detector. It was demonstrated that it is capable of handling both sudden and gradual concept drifts. In future research, we plan to extend the proposed method to make it more resource-aware for fast-changing data streams and to make it able to deal with missing values.