1 Introduction

In recent years, data stream mining has become a very interesting and challenging branch of data mining [3, 14, 15, 16]. In this paper, we define a data stream as a sequence of data elements

$$\begin{aligned} S = (\mathbf{s}_{1},\mathbf{s}_{2},\dots ), \end{aligned}$$
(1)

which is potentially of infinite size. Each data element is a D-dimensional vector of binary values

$$\begin{aligned} \mathbf{s}_{n} = \left[ s_{n,1},\dots ,s_{n,D}\right] \in \{0,1\}^{D}. \end{aligned}$$
(2)

A proper data stream mining algorithm should ensure the best possible trade-off between accuracy and resource consumption. In the literature, many algorithms based on traditional machine learning or data mining tools have been proposed, e.g. neural networks trained with stochastic gradient descent [4], decision trees [10] or ensemble methods [12].

The problem of data stream mining becomes even more challenging if the underlying data distribution can change over time [17]. In this paper, we focus on applying the Restricted Boltzmann Machine (RBM) to detect possible changes in the data distribution. This idea was first proposed in [8] and extended in [9] to allow dealing with labeled data. In [11], the resource-awareness of the RBM in the data stream scenario was investigated. Here, we continue this line of research by proposing modifications of the RBM learning algorithm to handle data streams with missing values.

The RBM is a special type of a wider class of neural networks called Boltzmann Machines [7]. It consists of two layers of neurons: the visible one, consisting of D neurons \(\mathbf{v} = \left[ v_{1},\dots ,v_{D}\right] \), and the hidden one, formed by H hidden units \(\mathbf{h} = \left[ h_{1},\dots ,h_{H}\right] \). For each possible state \((\mathbf{v},\mathbf{h})\) of the RBM, an energy can be calculated, defined as follows

$$\begin{aligned} E(\mathbf{v},\mathbf{h}) = -\sum_{i=1}^{D}v_{i}a_{i} - \sum_{j=1}^{H}h_{j}b_{j} - \sum_{i=1}^{D}\sum_{j=1}^{H}v_{i}h_{j}w_{ij}, \end{aligned}$$
(3)

where \(w_{ij}\) are the RBM weights, and \(a_{i}\) and \(b_{j}\) are the visible and hidden biases, respectively. The energy function is used to define a probability distribution over \((\mathbf{v},\mathbf{h})\)

$$\begin{aligned} P(\mathbf{v},\mathbf{h}) = \frac{\exp \left( -E(\mathbf{v},\mathbf{h})\right) }{Z}, \end{aligned}$$
(4)

where Z is a normalization constant, known as the partition function.
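To make the energy-based formulation concrete, the following sketch evaluates the energy (3) and the unnormalized numerator of (4) for a toy RBM. This is a minimal NumPy illustration of ours, not code from the paper; all names are assumptions.

```python
import numpy as np

def energy(v, h, W, a, b):
    """Energy E(v, h) of a joint RBM state, as in Eq. (3)."""
    return -v @ a - h @ b - v @ W @ h

# Toy example: D = 3 visible units and H = 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))   # weights w_ij
a = np.zeros(3)                          # visible biases a_i
b = np.zeros(2)                          # hidden biases b_j

v = np.array([1.0, 0.0, 1.0])
h = np.array([0.0, 1.0])
# Numerator of Eq. (4); dividing by Z would give P(v, h).
print(np.exp(-energy(v, h, W, a, b)))
```

Let us now assume that the data stream (1) is partitioned into minibatches of size B, i.e. the t-th minibatch is given by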

$$\begin{aligned} S_{t} = \left( \mathbf{s}_{Bt+1},\dots ,\mathbf{s}_{Bt+B}\right) ,\; t = 0,1,\dots . \end{aligned}$$
(5)

Then, the cost function for \(S_{t}\) is given by the following formula

$$\begin{aligned} C(S_{t}) = -\frac{1}{B}\log P(S_{t}) = -\frac{1}{B}\sum_{n=1}^{B}\log \sum_{\mathbf{h}}P(\mathbf{v}=\mathbf{s}_{Bt+n},\mathbf{h}) \end{aligned}$$
(6)

and its gradient with respect to the weight \(w_{ij}\) is expressed as follows (see, e.g., [2, 4])

$$\begin{aligned} \frac{\partial {C(S_{t})}}{\partial {w_{ij}}} = \sum_{\mathbf{v},\mathbf{h}}P(\mathbf{v},\mathbf{h})v_{i}h_{j} - \frac{1}{B}\sum_{n=1}^{B}\sum_{\mathbf{h}}P(\mathbf{h}|\mathbf{v}=\mathbf{s}_{Bt+n})\,s_{Bt+n,i}\,h_{j}. \end{aligned}$$
(7)

The first term on the right-hand side (the ‘negative phase’) is intractable to compute and can be approximated by the Contrastive Divergence (CD) algorithm [5]. In this paper, we propose modifications of the CD algorithm which allow the RBM to handle incomplete data.

The rest of the paper is organized as follows. In Sect. 2, the CD algorithm for learning the RBM is recalled, together with the way it is used to approximate the gradient of the RBM cost function. In Sect. 3, two modifications are proposed which allow the RBM to handle incomplete data. Preliminary results of the experimental verification of the presented methods are demonstrated in Sect. 4. Conclusions are discussed in Sect. 5.

2 Contrastive Divergence Learning Algorithm

As can be seen in (7), the gradient \(\frac{\partial C}{\partial w_{ij}}\) of the cost function consists of two terms, each being an expectation with respect to a different probability distribution. The second term, called the ‘positive phase’, requires inferring the states of the hidden units from a data element, as presented in Algorithm 1.

[Algorithm 1: inferring the states of the hidden units from a data element]
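As the pseudocode itself is not reproduced above, the inference step can be sketched as follows: each hidden unit is switched on with probability \(\sigma (b_{j} + \sum _{i}v_{i}w_{ij})\), where \(\sigma \) denotes the logistic sigmoid. A minimal NumPy sketch of ours (not the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b, rng):
    """Sample hidden states given a visible vector (cf. Algorithm 1).

    P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij); W has shape (D, H).
    """
    p_h = sigmoid(b + v @ W)                          # shape (H,)
    h = (rng.random(p_h.shape) < p_h).astype(float)   # Bernoulli sample
    return h, p_h
```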

The first term of the gradient \(\frac{\partial C}{\partial w_{ij}}\), called the ‘negative phase’, is intractable to compute. In the CD algorithm, it is approximated by Gibbs sampling [1], as presented in Algorithm 2.
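The Gibbs sampling alternates between sampling the hidden layer given the visible one and vice versa. A hedged sketch of a CD-style chain of K such transitions, in the same NumPy style and under the same assumptions as above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(v0, W, a, b, K, rng):
    """Run K Gibbs steps starting from data element v0 (cf. Algorithm 2)."""
    v = v0
    for _ in range(K):
        p_h = sigmoid(b + v @ W)                        # sample hidden layer
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(a + W @ h)                        # sample visible layer
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v
```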

[Algorithm 2: Gibbs sampling for the negative phase]
[Algorithm 3: updating the gradients]

For both phases, the gradient values can be updated using the procedure presented in Algorithm 3 (where \(sgn=-1\) and \(sgn=1\) correspond to the positive and negative phases, respectively). Finally, the CD algorithm for minibatch \(S_{t}\), which combines all of the previously mentioned components, is presented in Algorithm 4.
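A rough rendering of this update procedure is given below; the sgn argument selects the phase and each sample is weighted by 1/B, so that the minibatch averages of Eq. (7) are accumulated. This is again our sketch, with assumed names:

```python
import numpy as np

def update_gradients(grad_W, grad_a, grad_b, v, p_h, sgn, B):
    """Accumulate one sample's contribution to the gradients (cf. Algorithm 3).

    sgn = -1 for the positive phase, sgn = +1 for the negative phase.
    """
    grad_W += sgn * np.outer(v, p_h) / B
    grad_a += sgn * v / B
    grad_b += sgn * p_h / B
    return grad_W, grad_a, grad_b
```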

3 RBM for Handling Incomplete Data

[Algorithm 4: the CD algorithm for minibatch \(S_{t}\)]

In the practical guide for training RBMs [6], several methods for inferring missing values were proposed. However, none of them seems fast enough to be suitable for data stream mining tasks. In the sequel, we propose two modifications of the CD algorithm that make it able to handle data streams with missing values.

For each minibatch of data elements \(S_{t}\), we assume that there exists a corresponding minibatch of masks \(M_{t} = \left( \mathbf{m}_{Bt+1},\dots ,\mathbf{m}_{Bt+B}\right) \). Each mask \(\mathbf{m}_{n}\) is a D-dimensional vector of \(\{TRUE,FALSE\}\) values. If \(m_{n,i}\) is TRUE, then the value of \(s_{n,i}\) is unknown. When necessary, this value is assumed by default to be equal to 0 until it is restored. The first modification of the basic CD algorithm is the introduction of a restoring function, presented in Algorithm 5. This procedure performs Gibbs sampling; however, only the unknown units of the visible layer are updated.
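Under our reading of this description, the restoring step could be sketched as follows; only the masked visible units are resampled, while the observed values stay fixed (the function name is ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def restore(v, mask, W, a, b, Q, rng):
    """Restore missing visible values by partial Gibbs sampling (cf. Algorithm 5).

    mask[i] == True marks a missing value; such entries start at 0 and are
    the only visible units resampled during the Q Gibbs steps.
    """
    v = np.where(mask, 0.0, v)                          # default for unknown units
    for _ in range(Q):
        p_h = sigmoid(b + v @ W)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(a + W @ h)
        v_sampled = (rng.random(p_v.shape) < p_v).astype(float)
        v = np.where(mask, v_sampled, v)                # keep observed values fixed
    return v
```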

The second proposed modification changes the way the gradients are updated. In the basic CD method, gradient updates are calculated as the arithmetic average over the whole minibatch of data (as in Algorithm 3). In our approach, we introduce dimension-dependent minibatch sizes: the size of the minibatch for the i-th dimension is equal to the number of data elements for which the mask in the i-th dimension is FALSE

$$\begin{aligned} B_{i}(M_{t}) = \sum_{n=1}^{B}1_{\{m_{Bt+n,i}=FALSE\}}. \end{aligned}$$
(8)

Let \(\mathbf{B}=(B_{1},\dots ,B_{D})\) be the D-dimensional vector of dimension-dependent minibatch sizes. The method for updating the gradients, which takes the missing values into account, is presented in Algorithm 6.
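Equation (8) and the dimension-dependent update can be sketched as follows, assuming boolean masks of shape (B, D) with TRUE marking a missing value; the function names are ours, not the paper's:

```python
import numpy as np

def dimension_batch_sizes(masks):
    """B_i of Eq. (8): the number of elements whose i-th dimension is observed."""
    return (~masks).sum(axis=0)                         # shape (D,)

def masked_update_w(grad_W, v, p_h, mask, sgn, B_dim):
    """Masked gradient update for the weights (cf. Algorithm 6).

    A sample contributes to the gradient of w_ij only if its i-th visible
    dimension is observed, and the average is taken over B_i instead of B.
    """
    observed = (~mask).astype(float)
    scale = np.divide(observed, B_dim, out=np.zeros_like(observed),
                      where=B_dim > 0)                  # 1/B_i on observed dims
    grad_W += sgn * np.outer(scale * v, p_h)
    return grad_W
```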

The final form of the Contrastive Divergence algorithm for data with missing values, which we abbreviate here as CDM, is presented in Algorithm 7.

[Algorithms 5–7: the restoring procedure, the dimension-dependent gradient update, and the CDM algorithm]

Compared to the standard CD algorithm, it requires several additional arguments: the minibatch of masks \(M_{t}\) corresponding to the minibatch of data \(S_{t}\), and the number of steps Q of the restoring procedure. The last three parameters, i.e. Rest, PosMask, and NegMask, are boolean flags which allow the previously discussed modifications of the CDM algorithm to be turned on or off.
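To fix ideas, a possible end-to-end rendering of one CDM step is given below. It composes the sketches from the previous sections and reflects only our interpretation of Algorithm 7; all names and some details (e.g. averaging the hidden bias gradients over the full minibatch) are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cdm_minibatch(S_t, M_t, W, a, b, eta, K=1, Q=1, rng=None,
                  Rest=True, PosMask=True, NegMask=True):
    """One CDM update over minibatch S_t with masks M_t (cf. Algorithm 7).

    S_t: (B, D) binary data; M_t: (B, D) boolean masks (True = missing).
    Rest, PosMask and NegMask switch the proposed modifications on or off.
    """
    if rng is None:
        rng = np.random.default_rng()
    B, D = S_t.shape
    B_dim = (~M_t).sum(axis=0)                          # Eq. (8)
    gW, ga, gb = np.zeros_like(W), np.zeros_like(a), np.zeros_like(b)

    for v, mask in zip(S_t, M_t):
        v = np.where(mask, 0.0, v)
        if Rest:                                        # restoring step (Algorithm 5)
            for _ in range(Q):
                h = (rng.random(b.shape) < sigmoid(b + v @ W)).astype(float)
                v_s = (rng.random(a.shape) < sigmoid(a + W @ h)).astype(float)
                v = np.where(mask, v_s, v)

        def dim_weights(use_mask):
            # 1/B_i on observed dimensions, or the plain 1/B average.
            if not use_mask:
                return np.full(D, 1.0 / B)
            return np.where(B_dim > 0, (~mask) / np.maximum(B_dim, 1), 0.0)

        p_h = sigmoid(b + v @ W)                        # positive phase (sgn = -1)
        w_pos = dim_weights(PosMask)
        gW -= np.outer(w_pos * v, p_h)
        ga -= w_pos * v
        gb -= p_h / B                                   # hidden biases: full-batch average

        v_neg = v                                       # negative phase: K Gibbs steps
        for _ in range(K):
            h = (rng.random(b.shape) < p_h).astype(float)
            v_neg = (rng.random(a.shape) < sigmoid(a + W @ h)).astype(float)
            p_h = sigmoid(b + v_neg @ W)
        w_neg = dim_weights(NegMask)
        gW += np.outer(w_neg * v_neg, p_h)              # sgn = +1
        ga += w_neg * v_neg
        gb += p_h / B

    W -= eta * gW                                       # gradient descent step
    a -= eta * ga
    b -= eta * gb
    return W, a, b
```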

4 Experimental Results

In this section, we present preliminary results of the experimental verification of the presented methods. The numerical simulations were carried out on the MNIST dataset [13], which contains 60000 gray-scale images of handwritten digits of size \(28 \times 28\). In the experiments, we treat the dataset as a stream: the order of the data is randomly shuffled, and the data are processed in minibatches of size \(B=20\). A mask of missing values was assigned to each data element. The mask had the form of a square of \(z \times z\) pixels, whose position on the image was chosen uniformly at random among all possible locations. The parameters for learning the RBM were set as follows: \(D=784\), \(H=40\), \(K=1\), \(Q=1\), and the learning rate \(\eta = 0.05\). We applied the standard stochastic gradient descent method with momentum, with the friction parameter set to \(\gamma = 0.9\).
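The random square masks described above might be generated along the following lines (a sketch under our assumptions; the paper does not provide this code):

```python
import numpy as np

def square_mask(side=28, z=6, rng=None):
    """Boolean mask with a random z-by-z square of missing pixels (True = missing).

    The square's position is drawn uniformly among all valid locations,
    matching the experimental setup described above.
    """
    if rng is None:
        rng = np.random.default_rng()
    top = rng.integers(0, side - z + 1)
    left = rng.integers(0, side - z + 1)
    mask = np.zeros((side, side), dtype=bool)
    mask[top:top + z, left:left + z] = True
    return mask.reshape(-1)                             # flatten to D = side * side
```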

Looking at Algorithm 7, one can see that there are many possible variants of the proposed CDM algorithm. In the simulations, we focus on the following three configurations, including the standard CD algorithm:

  • CD: \(Rest=FALSE\), \(PosMask=FALSE\), \(NegMask=FALSE\);

  • CDM(TFF): \(Rest=TRUE\), \(PosMask=FALSE\), \(NegMask=FALSE\);

  • CDM(TTT): \(Rest=TRUE\), \(PosMask=TRUE\), \(NegMask=TRUE\).

The algorithms were evaluated in the prequential manner using the reconstruction error. For a given minibatch of data \(S_{t}\), a set of reconstructions \(\tilde{S}_{t}=(\tilde{\mathbf{s}}_{Bt+1},\dots ,\tilde{\mathbf{s}}_{Bt+B})\) is first obtained using the RBM. Then, the average reconstruction error is expressed as follows

$$\begin{aligned} R(S_{t}) = \frac{1}{B}\sum _{n=1}^{B}\sum _{i=1}^{D}\left( s_{Bt+n,i}-\tilde{s}_{Bt+n,i}\right) ^{2}. \end{aligned}$$
(9)
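Assuming the reconstructions \(\tilde{S}_{t}\) have already been obtained from the RBM, computing Eq. (9) is straightforward (our sketch):

```python
import numpy as np

def reconstruction_error(S_t, S_rec):
    """Average reconstruction error R(S_t) of Eq. (9).

    S_t and S_rec are arrays of shape (B, D) holding the original
    minibatch and its RBM reconstructions, respectively.
    """
    return np.sum((S_t - S_rec) ** 2) / S_t.shape[0]
```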

In the first experiment, the considered algorithms were run with three different sizes of the missing values mask: \(z = 2\), \(z = 6\) and \(z = 14\). The performance of each algorithm for the various values of z is compared in Fig. 1. As can be seen, for each algorithm the reconstruction error is positively correlated with the amount of missing data in the data elements. Let us now look at the results of this experiment from another perspective. In Fig. 2, the algorithms are compared with each other for each considered value of z. Although the values of the reconstruction error fluctuate significantly in each case, it can be noticed that the algorithm with all the previously discussed mechanisms turned on (i.e. the CDM(TTT) algorithm) is slightly better than the other two, whereas the standard CD algorithm is always the worst. This is seen most clearly for the case with the largest masks (i.e. \(z=14\)). Although the differences are not striking, it can be concluded that the proposed modifications improve the performance of the CD algorithm when incomplete data have to be handled.

Fig. 1. Reconstruction error obtained for various sizes of missing values masks for the three considered algorithms.

Fig. 2. Reconstruction error of the CD, CDM(TFF) and CDM(TTT) algorithms for three different sizes of missing values masks.

5 Conclusions

In this paper, we considered the problem of mining stream data with missing values using the Restricted Boltzmann Machine (RBM), focusing our analysis on the Contrastive Divergence (CD) learning algorithm. To make it able to handle incomplete data, we proposed two modifications. The first one introduces an additional Gibbs sampling procedure at the beginning of the processing of each data element; however, only those units of the visible layer are updated for which the value of the corresponding dimension of the data element is missing. In the second modification, the fixed minibatch size is replaced by minibatches with dimension-dependent sizes. This means that not all data elements from the minibatch take part in updating the gradients of the RBM weights or visible layer biases. The proposed methods were verified experimentally, demonstrating their usability for concept drift detection in data streams with incomplete data.