Abstract
In this paper, the problem of concept drift detection in data stream mining algorithms is considered. The autoencoder is proposed to be applied as a drift detector. Autoencoders are neural networks trained to reconstruct their input data. As a side effect, they learn compact nonlinear codes, which summarize the most important features of the input data. We suspect that an autoencoder properly trained on one part of the data stream can then be used to monitor possible changes in the following stream parts. The changes are analyzed by monitoring variations of the autoencoder cost function. Two cost functions are applied in this paper: the cross-entropy and the reconstruction error. Preliminary experimental results show that the proposed autoencoder-based detector is able to handle different types of concept drift, e.g. sudden or gradual drifts.
1 Introduction
In recent years, topics connected with data stream mining have attracted much attention of machine learning researchers [1, 6, 9,10,11,12,13,14,15,16,17, 24, 27,28,29,30,31,32,33,34, 37, 38]. In data streams it is quite common that the underlying data distribution changes over time. This is known in the literature as 'concept drift' [40, 41]. Other characteristics of streaming data are its potentially infinite size and the high rate at which elements arrive at the system. Therefore, proper data stream mining algorithms should be resource-aware [8, 15].
The data stream can be defined as a sequence of data elements

$$S = \left( s_{1}, s_{2}, \dots \right) . \quad (1)$$
In this paper the case of unsupervised learning is considered, which means that there are no classes assigned to data elements. Hence, the data stream element \(s_{n}\) is a D-dimensional vector defined as follows

$$s_{n} = x_{n} = \left[ x_{n}^{1}, \dots , x_{n}^{D}\right] . \quad (2)$$
The data stream mining algorithm should be able to react to possible changes in data distribution [16, 20, 41]. Concept drift handling can be realized in two ways, i.e. passively or actively. In the passive approach, a concept drift reaction mechanism is incorporated in the data stream mining algorithm itself. It can be done by applying sliding window techniques [7] or ensemble methods [28, 29]. In the active approach, there is an external drift detector functioning in parallel to the main algorithm. Based on the information from the detector, the algorithm can make a decision about rebuilding the model. Many active drift detectors focus on monitoring the current accuracy of the model. If it decreases, then the model has become invalid and needs to be modified. Examples of such methods are the Drift Detection Method (DDM) [18] and the Early Drift Detection Method (EDDM) [3]. Other active drift detection methods rely on statistical tests. They look for possible changes in the parameters of data probability density functions. Among others, the Welch test, the Kolmogorov-Smirnov test [27], and the Page-Hinkley test [19] are often used.
In this paper, we present an active concept drift detection approach based on autoencoders. This approach is similar to the one presented in [23], in which the Restricted Boltzmann Machine (RBM) [39] was applied as a drift detector. This idea was further extended with resource-awareness strategies to deal with fast data streams [26] or to deal with missing values [25]. Autoencoders are important neural network models often used in the field of deep learning [21]. They are used, for example, in greedy-pretraining methods of learning deep neural networks [4, 5]. Similarly to RBMs, the properly learned autoencoder contains in its middle layer the information about the data probability distribution. It can be then used to check whether the data from another part of the data stream belong to the same distribution as the part on which the autoencoder was trained. This can be checked, for example, by measuring the reconstruction error or the cross-entropy of new data.
The rest of the paper is organized as follows. In Sect. 2 autoencoders are presented in more detail. In Sect. 3 methods for concept drift detection using autoencoders are proposed. In Sect. 4 the results obtained in numerical simulations are demonstrated. Sect. 5 concludes the paper and indicates possible future research directions.
2 Autoencoders
The autoencoder is a kind of feed-forward neural network that is trained to reconstruct its input data [22]. There are many types of autoencoders [21], like the denoising autoencoder [2], the sparse autoencoder [35], or the contractive autoencoder [36]. Generally, the autoencoder is composed of two parts: the encoder and the decoder. The aim of the encoder is to transform input data in a nonlinear way, extracting the most important features of the data. The decoder is trained to map the output of the encoder back to the input data as closely as possible. In this paper, we consider autoencoders consisting of \(L+1\) layers. Let \(y_{j}^{(l)}(x_{n})\) denote the value of the j-th neuron in the l-th layer obtained for input data \(x_{n}\). Then, the values of neurons in the l-th layer can be computed using the following formula

$$y_{j}^{(l)}(x_{n}) = \sigma \left( \sum _{i=1}^{N_{l-1}} w_{ij}^{(l)} \, y_{i}^{(l-1)}(x_{n}) + b_{j}^{(l)} \right) , \quad j = 1, \dots , N_{l}, \quad (3)$$
where \(w_{ij}^{(l)}\) are the weights of synapses between the \((l-1)\)-th and the l-th layers, \(b_{j}^{(l)}\) are the biases, and \(N_{l}\) is the size of the l-th layer. Function \(\sigma (z)\) is an activation function. Many different types of activation functions can be applied in the autoencoder. In this paper, we apply the most common one, i.e. the sigmoid function defined below

$$\sigma (z) = \frac{1}{1 + e^{-z}}. \quad (4)$$
The 0-th layer is the input layer, i.e. \(y_{j}^{(0)}(x_{n}) = x_{n}^{j}\). The size of the output layer is equal to the input layer size, i.e. \(N_{L} = N_{0} = D\). If we do not apply any sparsity constraints while learning the autoencoder, the middle layer should be of the smallest size, i.e.

$$N_{L/2} = \min _{l = 0, \dots , L} N_{l}. \quad (5)$$
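To make the architecture concrete, the forward pass can be sketched in NumPy as follows; the toy sizes (\(D = 4\), middle layer of size 2) and all names are illustrative choices of ours, not the paper's settings:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute the activations y^(l) of all layers of the autoencoder.

    weights[l] has shape (N_{l-1}, N_l), biases[l] has shape (N_l,).
    Returns the list of layer activations; y[0] is the input itself.
    """
    y = [x]
    for W, b in zip(weights, biases):
        y.append(sigmoid(y[-1] @ W + b))
    return y

# Toy example: D = 4 inputs, a middle layer of size 2, output back to D = 4.
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.1, (4, 2)), rng.normal(0.0, 0.1, (2, 4))]
biases = [np.zeros(2), np.zeros(4)]
x = rng.random(4)
reconstruction = forward(x, weights, biases)[-1]
```

Since the output layer uses the sigmoid as well, the reconstruction lies in \((0, 1)^{D}\), which matches input data normalized to the unit interval.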
The autoencoder is trained as a usual feed-forward neural network by minimizing a cost function. Let \(C(x_{n})\) be the cost function value obtained for the data element \(x_{n}\). Then, for example, a weight \(w_{ij}^{(l)}\) would be updated in each step using the following formula

$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta \, \frac{\partial C(x_{n})}{\partial w_{ij}^{(l)}}, \quad (6)$$
where \(\eta \) is the learning rate. An analogous formula applies to the biases.
In practice, the learning is often performed on minibatches of data instead of single data elements. Let us assume that the minibatch \(M_{t}\) consists of B subsequent data elements

$$M_{t} = \left( x_{(t-1)B+1}, \dots , x_{tB}\right) . \quad (7)$$
Then, the neural network parameters are updated using the arithmetic average of the gradients obtained for all data elements from the minibatch

$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \frac{\eta }{B} \sum _{x_{n} \in M_{t}} \frac{\partial C(x_{n})}{\partial w_{ij}^{(l)}}. \quad (8)$$
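The minibatch update can be sketched for a three-layer autoencoder trained with the squared reconstruction error; the backpropagation details and all names below are our own, a minimal sketch rather than the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(X, W1, b1, W2, b2, eta=0.05):
    """One minibatch update for a three-layer autoencoder trained with
    C_RE(x) = sum_j (x^j - y^j)^2; the gradients are averaged over the
    B rows of X, matching the arithmetic-average rule above."""
    B = X.shape[0]
    Y1 = sigmoid(X @ W1 + b1)   # middle (encoding) layer
    Y2 = sigmoid(Y1 @ W2 + b2)  # output layer (reconstruction)
    # Backpropagate the squared error through both sigmoid layers.
    d2 = 2.0 * (Y2 - X) * Y2 * (1.0 - Y2)
    d1 = (d2 @ W2.T) * Y1 * (1.0 - Y1)
    W2 -= eta * (Y1.T @ d2) / B
    b2 -= eta * d2.mean(axis=0)
    W1 -= eta * (X.T @ d1) / B
    b1 -= eta * d1.mean(axis=0)
```

The arrays are updated in place; calling `sgd_step` repeatedly on consecutive minibatches drawn from the stream implements the training phase.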
Many different types of cost functions can be used. In this paper, we consider two of them. The cross-entropy:

$$C_{CE}(x_{n}) = - \sum _{j=1}^{D} \left[ x_{n}^{j} \log y_{j}^{(L)}(x_{n}) + \left( 1 - x_{n}^{j}\right) \log \left( 1 - y_{j}^{(L)}(x_{n})\right) \right] \quad (9)$$
and the reconstruction error:

$$C_{RE}(x_{n}) = \sum _{j=1}^{D} \left( x_{n}^{j} - y_{j}^{(L)}(x_{n})\right) ^{2}. \quad (10)$$
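The two cost functions can be sketched in NumPy as follows; the per-element sums are averaged over the rows of a minibatch, and the small `eps` guard against \(\log 0\) is our own addition:

```python
import numpy as np

def cross_entropy(X, Y):
    """C_CE averaged over a minibatch: X are inputs in [0, 1],
    Y the autoencoder outputs (sigmoid activations)."""
    eps = 1e-12  # numerical guard against log(0)
    return -float(np.mean(np.sum(X * np.log(Y + eps)
                                 + (1.0 - X) * np.log(1.0 - Y + eps),
                                 axis=1)))

def reconstruction_error(X, Y):
    """C_RE averaged over a minibatch: per-element sums of squared
    differences between inputs and reconstructions."""
    return float(np.mean(np.sum((X - Y) ** 2, axis=1)))
```

For a perfect reconstruction the reconstruction error is exactly zero, while the cross-entropy reaches its minimum \(-\sum _{j} [x^{j} \log x^{j} + (1-x^{j}) \log (1-x^{j})]\), which is positive unless the inputs are binary.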
The monitoring of cost function values can be used to detect potential drifts in data distribution. The detection procedure is described in the next section.
3 Concept Drift Detection with the Autoencoder
We suspect that the autoencoder properly trained on one part of the data stream can be applied to monitor possible changes in data distribution in the following parts. For any cost function C (given by (9) or (10)), the average value of this cost function for a minibatch \(M_{t}\) of size B is given by

$$C(M_{t}) = \frac{1}{B} \sum _{x_{n} \in M_{t}} C(x_{n}). \quad (11)$$
Apart from the cost function itself, we also want to investigate the trend Q(C, t) of cost function C over subsequent minibatches. To ensure that the trend is computed only on the basis of the most recent data, we additionally apply a forgetting mechanism, as was done in [23]. The trend at time t is given by the following formula

$$Q(C, t) = \frac{\overline{tC}_{t} - \bar{t}_{t} \, \bar{C}_{t}}{\overline{t^{2}}_{t} - \left( \bar{t}_{t}\right) ^{2}}, \quad (12)$$

i.e. the slope of the least-squares regression line of the cost values \(C(M_{t})\) against the minibatch index t.
In the standard regression procedure, the terms in formula (12) are given simply by the appropriate arithmetic averages. In our case, because of the applied forgetting mechanism, the arithmetic averages are replaced by recurrent formulas of the following form (shown for \(\bar{C}_{t}\); the terms \(\bar{t}_{t}\), \(\overline{t^{2}}_{t}\), and \(\overline{tC}_{t}\) are updated analogously)

$$n_{t} = \lambda n_{t-1} + 1, \qquad \bar{C}_{t} = \frac{\lambda \, n_{t-1} \, \bar{C}_{t-1} + C(M_{t})}{n_{t}}, \quad (13)$$
where \(\lambda \) is a forgetting factor taking values lower than 1. The trend of the cost function should stay close to 0 as long as there is no concept drift in the data stream. After a drift occurs, Q(C, t) is expected to increase to some positive value. Since we consider two cost functions in this paper, two trends are analyzed as well, i.e. \(Q(C_{CE},t)\) and \(Q(C_{RE},t)\).
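As a sketch of the monitoring procedure, the following class maintains exponentially forgotten averages of \(t\), \(t^{2}\), \(C\), and \(tC\) and returns the regression slope; the exact recurrences, class, and attribute names are our own illustrative choices, one plausible realization of the scheme described above:

```python
class TrendMonitor:
    """Slope Q(C, t) of cost-function values over the minibatch index t,
    computed as a least-squares regression slope on exponentially
    forgotten averages (a sketch; lam plays the role of lambda < 1)."""

    def __init__(self, lam=0.998):
        self.lam = lam
        self.n = 0.0       # effective (decayed) number of observations
        self.t_bar = 0.0   # forgotten average of t
        self.t2_bar = 0.0  # forgotten average of t^2
        self.c_bar = 0.0   # forgotten average of C(M_t)
        self.tc_bar = 0.0  # forgotten average of t * C(M_t)
        self.t = 0

    def update(self, cost):
        """Consume the cost of the next minibatch, return current Q(C, t)."""
        self.t += 1
        t = self.t
        w = self.lam * self.n
        self.n = w + 1.0
        self.t_bar = (w * self.t_bar + t) / self.n
        self.t2_bar = (w * self.t2_bar + t * t) / self.n
        self.c_bar = (w * self.c_bar + cost) / self.n
        self.tc_bar = (w * self.tc_bar + t * cost) / self.n
        denom = self.t2_bar - self.t_bar ** 2
        if denom == 0.0:
            return 0.0  # undefined slope at the very first minibatch
        return (self.tc_bar - self.t_bar * self.c_bar) / denom
```

On a stationary stream the returned slope fluctuates around 0; when the cost starts rising after a drift, the slope becomes positive.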
4 Experimental Results
In this section, we present preliminary results of experiments conducted on simple synthetic datasets with concept drift (see Footnote 1). The datasets were generated in the same manner as in [23], i.e. based on synthetic Boltzmann machines with randomly chosen parameters. At the beginning, we applied two Boltzmann machines, which were then used to generate two datasets with static probability distributions

$$S_{1} = \left( x_{1,1}, x_{1,2}, \dots \right) , \qquad S_{2} = \left( x_{2,1}, x_{2,2}, \dots \right) . \quad (14)$$
The dimensionality of the data in both sets is \(D=20\). The datasets \(S_{1}\) and \(S_{2}\) were used to generate two datasets with concept drifts. The dataset with sudden concept drift is denoted as \(S_{s} = \left( x_{s,1},x_{s,2},\dots \right) \), where \(x_{s,n}\) is taken from \(S_{1}\) if \(n < 500000\) and from \(S_{2}\) if \(n \ge 500000\). The dataset with gradual concept drift is denoted as \(S_{g} = \left( x_{g,1},x_{g,2},\dots \right) \), where \(x_{g,n}\) is from \(S_{1}\) if \(n < 500000\) and from \(S_{2}\) if \(n \ge 600000\). In the interval [500000; 600000) the data element \(x_{g,n}\) is drawn from \(S_{1}\) with probability \(\frac{600000\,-\,n}{100000}\) and from \(S_{2}\) with probability \(\frac{n\,-\,500000}{100000}\). Hence, the two datasets \(S_{s}\) and \(S_{g}\) are constructed in such a way that the concept drift occurs after processing 500000 data elements.
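The construction of the gradual-drift stream \(S_{g}\) can be sketched generically; the samplers below stand in for the synthetic Boltzmann machine generators used in the paper, and the function name and signature are our own:

```python
import numpy as np

def drifting_stream(sample_s1, sample_s2, n_total=1_000_000,
                    drift_start=500_000, drift_end=600_000, seed=0):
    """Yield a stream with gradual drift between two static sources.

    sample_s1 / sample_s2 are callables returning one data element each.
    Inside the drift window [drift_start, drift_end), element n comes
    from S1 with probability (drift_end - n) / (drift_end - drift_start).
    Setting drift_start == drift_end gives a sudden drift instead.
    """
    rng = np.random.default_rng(seed)
    for n in range(1, n_total + 1):
        if n < drift_start:
            yield sample_s1()
        elif n >= drift_end:
            yield sample_s2()
        else:
            p_s1 = (drift_end - n) / (drift_end - drift_start)
            yield sample_s1() if rng.random() < p_s1 else sample_s2()
```

With the paper's parameters (`drift_start=500_000`, `drift_end=600_000`) the mixture probability decays linearly over the 100000-element transition window, exactly as described above.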
In the experiments, we used a simple autoencoder with three layers, i.e. \(L=2\). The sizes of the 0-th and the 2-nd layers are equal to \(N_{0} = N_{2} = D = 20\), to match the dimensionality of the data. The middle layer is of size \(N_{1} = 15\). The learning rate and the minibatch size are set to \(\eta = 0.05\) and \(B = 20\), respectively. In each experiment, the autoencoder is trained only on the first 400000 data elements. Then, it is switched to the monitoring mode (no learning is performed; only the values of the cost functions are analyzed). The forgetting factor \(\lambda \) for the trend measures was set to 0.998.
4.1 Sudden Concept Drift Detection
The first experiment was conducted on the \(S_{s}\) dataset, with the sudden drift. The results obtained for the cross-entropy cost function are presented in Fig. 1. As can be seen, the value of the cross-entropy decreases while the autoencoder learns, until the 400000-th data element. Then it monitors the incoming data. Since the distribution of data does not change until the 500000-th element, the cross-entropy values fluctuate around some fixed level. Then the sudden drift occurs and the cross-entropy abruptly reaches higher values. This is more clearly visible in Fig. 1(b), where the trend of the cross-entropy values is presented. The peak at \(n=500000\) matches the occurrence of the sudden drift.
In the case of the reconstruction error, the results can additionally be compared with the RBM-based drift detector presented in [23]. We applied the same parameters as for the autoencoder. The RBM was trained using the Contrastive Divergence (CD-k) method with \(k = 1\). The results are shown in Fig. 2.
The autoencoder appears to outperform the RBM-based approach in both considered aspects. In Fig. 2(a) we can observe that the values of the reconstruction error are lower for the autoencoder. This might be important if we wanted to use the network simultaneously for other purposes besides concept drift detection. In Fig. 2(b) it is visible that the detection peak for the sudden change is significantly higher for the autoencoder.
4.2 Gradual Concept Drift Detection
In the next experiment, analogous simulations were carried out for the dataset \(S_{g}\) with gradual concept drift. The results concerning the cross-entropy are presented in Fig. 3, whereas in Fig. 4 the comparisons of reconstruction error values are demonstrated.
In this experiment, the values of the cost functions rise smoothly while the gradual drift takes place, between the 500000-th and the 600000-th data elements. Figure 4 confirms the superiority of the autoencoder over the RBM-based approach also in the case of gradual drift.
5 Conclusions and Future Work
In this paper, the autoencoder was proposed as a drift detector in time-changing data streams. First, the neural network learns the model on the initial part of the stream. Then it is used to monitor possible changes in the following parts. The variations of the cost function values are analyzed. If the changes turn out to be large, it means that a concept drift has occurred, and it can be a signal for the data stream mining algorithm to rebuild the current model. Two cost functions were applied in this paper, i.e. the cross-entropy and the reconstruction error. The preliminary results obtained in experiments carried out on simple synthetic datasets confirmed that the autoencoder can be successfully used as a concept drift detector. It was demonstrated that it is capable of handling both sudden and gradual concept drifts. In future research, we plan to extend the proposed method to make it more resource-aware for fast-changing data streams and to make it able to deal with missing values.
Notes
- 1. All the experiments were conducted using our own software, implemented in C/C++, which can be found at www.iisi.pcz.pl/~mjaworski.
References
Aggarwal, C.: Data Streams: Models and Algorithms. Springer, New York (2007)
Alain, G., Bengio, Y.: What regularized auto-encoders learn from the data-generating distribution. J. Mach. Learn. Res. 15(1), 3563–3593 (2014)
Baena-García, M., del Campo-Ávila, J., Fidalgo, R., Bifet, A., Gavaldà, R., Morales-Bueno, R.: Early drift detection method. In: Fourth International Workshop on Knowledge Discovery from Data Streams, vol. 6, pp. 77–86 (2006)
Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1), 1–127 (2009)
Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: Proceedings of the 19th International Conference on Neural Information Processing Systems. NIPS 2006, pp. 153–160. MIT Press, Cambridge, MA, USA (2006)
Bifet, A.: Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. Frontiers in Artificial Intelligence and Applications. IOS Press, Amsterdam, Berlin (2010)
Bifet, A., Gavaldà, R.: Learning from time-changing data with adaptive windowing. In: Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 443–448 (2007)
Bilski, J., Kowalczyk, B., Grzanek, K.: The parallel modification to the Levenberg-Marquardt algorithm. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds.) Artificial Intelligence and Soft Computing, pp. 15–24. Springer, Cham (2018)
Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 71–80 (2000)
Duda, P., Rutkowski, L., Jaworski, M., Rutkowska, D.: On the Parzen Kernel-based probability density function learning procedures over time-varying streaming data with applications to pattern classification. IEEE Trans. Cybern. 50(4), 1683–1696 (2020)
Duda, P., Jaworski, M., Cader, A., Wang, L.: On training deep neural networks using a streaming approach. J. Artif. Intell. Soft Comput. Res. 10(1), 15–26 (2020)
Duda, P., Jaworski, M., Rutkowski, L.: Convergent time-varying regression models for data streams: tracking concept drift by the recursive Parzen-based generalized regression neural networks. Int. J. Neural Syst. 28(02), 1750048 (2018)
Duda, P., Jaworski, M., Rutkowski, L.: Knowledge discovery in data streams with the orthogonal series-based generalized regression neural networks. Inf. Sci. 460–461, 497–518 (2018)
Dyer, K.B., Capo, R., Polikar, R.: COMPOSE: a semisupervised learning framework for initially labeled nonstationary streaming data. IEEE Trans. Neural Netw. Learn. Syst. 25(1), 12–26 (2014)
Gaber, M., Zaslavsky, A., Krishnaswamy, S.: Mining data streams: a review. Sigmod Rec. 34(2), 18–26 (2005)
Gałkowski, T., Krzyżak, A., Filutowicz, Z.: A new approach to detection of changes in multidimensional patterns. J. Artif. Intell. Soft Comput. Res. 10(2), 125–136 (2020). https://doi.org/10.2478/jaiscr-2020-0009
Gama, J.: A survey on learning from data streams: current and future trends. Prog. Artif. Intell. 1(1), 45–55 (2012)
Gama, J., Medas, P., Castillo, G., Rodrigues, P.: Learning with drift detection. In: Bazzan, A.L.C., Labidi, S. (eds.) SBIA 2004. LNCS (LNAI), vol. 3171, pp. 286–295. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-28645-5_29
Gama, J., Sebastião, R., Rodrigues, P.P.: Issues in evaluation of stream learning algorithms. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD 2009, pp. 329–338. ACM, New York (2009)
Gomes, J., Gaber, M., Sousa, P., Menasalvas, E.: Mining recurring concepts in a dynamic feature space. IEEE Trans. Neural Netw. Learn. Syst. 25(1), 95–110 (2014)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016). http://www.deeplearningbook.org
Hinton, G.E., Zemel, R.S.: Autoencoders, minimum description length and Helmholtz free energy. In: Proceedings of the 6th International Conference on Neural Information Processing Systems. NIPS 1993, pp. 3–10. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1993)
Jaworski, M., Duda, P., Rutkowski, L.: On applying the restricted Boltzmann machine to active concept drift detection. In: Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence Honolulu, USA, pp. 3512–3519 (2017)
Jaworski, M., Duda, P., Rutkowski, L.: New splitting criteria for decision trees in stationary data streams. IEEE Trans. Neural Netw. Learn. Syst. 29(6), 2516–2529 (2018)
Jaworski, M., Duda, P., Rutkowska, D., Rutkowski, L.: On handling missing values in data stream mining algorithms based on the restricted Boltzmann machine. In: Gedeon, T., Wong, K.W., Lee, M. (eds.) ICONIP 2019. CCIS, vol. 1143, pp. 347–354. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-36802-9_37
Jaworski, M., Rutkowski, L., Duda, P., Cader, A.: Resource-aware data stream mining using the restricted Boltzmann machine. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds.) ICAISC 2019. LNCS (LNAI), vol. 11509, pp. 384–396. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20915-5_35
Lemaire, V., Salperwyck, C., Bondu, A.: A survey on supervised classification on data streams. In: Zimányi, E., Kutsche, R.-D. (eds.) eBISS 2014. LNBIP, vol. 205, pp. 88–125. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-17551-5_4
Ludwig, S.A.: Applying a neural network ensemble to intrusion detection. J. Artif. Intell. Soft Comput. Res. 9(3), 177–188 (2019)
Pietruczuk, L., Rutkowski, L., Jaworski, M., Duda, P.: How to adjust an ensemble size in stream data mining? Inf. Sci. 381(C), 46–54 (2017)
Rafajłowicz, E., Rafajłowicz, W.: Testing (non-) linearity of distributed-parameter systems from a video sequence. Asian J. Control 12(2), 146–158 (2010)
Rafajłowicz, E., Rafajłowicz, W.: Iterative learning in repetitive optimal control of linear dynamic processes. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2016. LNCS (LNAI), vol. 9692, pp. 705–717. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-39378-0_60
Rafajłowicz, E., Rafajłowicz, W.: Iterative learning in optimal control of linear dynamic processes. Int. J. Control 91(7), 1522–1540 (2018)
Rafajłowicz, E., Wnuk, M., Rafajłowicz, W.: Local detection of defects from image sequences. Int. J. Appl. Math. Comput. Sci. 18(4), 581–592 (2008)
Ramírez-Gallego, S., Krawczyk, B., García, S., Woźniak, M., Herrera, F.: A survey on data preprocessing for data stream mining: current status and future directions. Neurocomputing 239, 39–57 (2017)
Ranzato, M., Poultney, C., Chopra, S., LeCun, Y.: Efficient learning of sparse representations with an energy-based model. In: Proceedings of the 19th International Conference on Neural Information Processing Systems. NIPS 2006, pp. 1137–1144. MIT Press, Cambridge (2006)
Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: explicit invariance during feature extraction. In: Proceedings of the 28th International Conference on Machine Learning. ICML 2011, pp. 833–840. Omnipress, Madison (2011)
Rutkowski, L., Jaworski, M., Pietruczuk, L., Duda, P.: A new method for data stream mining based on the misclassification error. IEEE Trans. Neural Netw. Learn. Syst. 26(5), 1048–1059 (2015)
Rutkowski, L., Pietruczuk, L., Duda, P., Jaworski, M.: Decision trees for mining data streams based on the McDiarmid’s bound. IEEE Trans. Knowl. Data Eng. 25(6), 1272–1279 (2013)
Smolensky, P.: Information processing in dynamical systems: foundations of harmony theory. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pp. 194–281. MIT Press, Cambridge (1986)
Tsymbal, A.: The problem of concept drift: definitions and related work. Technical report. TCD-CS-2004-15. Computer Science Department, Trinity College Dublin, Ireland (2004)
Zliobaite, I., Bifet, A., Pfahringer, B., Holmes, G.: Active learning with drifting streaming data. IEEE Trans. Neural Netw. Learn. Syst. 25(1), 27–39 (2014)
Acknowledgments
This work was supported by the Polish National Science Centre under grant no. 2017/27/B/ST6/02852.
© 2020 Springer Nature Switzerland AG
Jaworski, M., Rutkowski, L., Angelov, P. (2020). Concept Drift Detection Using Autoencoders in Data Streams Processing. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2020. Lecture Notes in Computer Science(), vol 12415. Springer, Cham. https://doi.org/10.1007/978-3-030-61401-0_12
DOI: https://doi.org/10.1007/978-3-030-61401-0_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61400-3
Online ISBN: 978-3-030-61401-0