
1 Introduction

A Recommender System (RS) is an effective tool for dealing with the information overload problem [1]. As the main thrust of RS research, collaborative filtering (CF) methods are widely used because they require no domain knowledge. CF-based approaches can be divided into two branches: neighborhood-based and model-based [2]. Neighborhood-based approaches focus on spotting similarity relationships among either users or items. Herlocker et al. [3] carried out a comprehensive study of user-based collaborative algorithms. Sarwar et al. [4] presented the item-based collaborative approach, which pre-computes the item-item similarities.

In contrast to neighborhood-based methods, singular value decomposition (SVD) [6] is the most prevalent model-based method for matrix factorization (MF) [5]. SVD performs well in building accurate models for RS by exploiting an elaborate yet succinct algebraic structure. Nevertheless, the SVD method is only well-defined when the matrix is complete, so it still faces the problem of sparseness. Imputation [7] is a technique that produces a relatively denser matrix by filling the missing ratings with baseline estimations. However, inaccurate imputation might distort the original data considerably.

In recent years, research on neural networks has made breakthroughs in deep architectures. Deep belief networks (DBN) [8] have inspired researchers to integrate neural networks with CF for learning patterns or extracting useful features: [9] uses Restricted Boltzmann Machines (RBM) to perform CF, and [10] extended this work to a non-IID framework. The autoencoder (AE), a special kind of neural network, has achieved good performance in text processing and image classification tasks. In this paper, we employ the AE as a user preference learning component in collaborative filtering and develop a stacked denoising autoencoder (SDAE) based model to alleviate the sparseness issue in RS.

2 Traditional Autoencoder

A traditional autoencoder is a feedforward neural network, illustrated in Fig. 1. An AE can learn representations of the raw data by reconstructing the input data at its output. Usually it has a three-layered structure: the input and output layers have the same number of neurons, and a hidden layer sits in the middle.

Fig. 1. Traditional autoencoder

Training an autoencoder means minimising the reconstruction error, i.e., solving the following optimisation problem:

$$ L(x,\hat{x}) = \frac{1}{2m}\sum\limits_{i = 1}^{m} {\left\| {\hat{x}^{i} - x^{i} } \right\|^{2} } + \frac{\lambda }{2}\left(\left\| {W^{(1)} } \right\|^{2} + \left\| {W^{(2)} } \right\|^{2} \right),\qquad \mathop {\arg \min }\limits_{\theta ,\theta '} L(x,\hat{x}) $$
(1)

To obtain the parameters \( {\text{W}}^{(l)} ,b^{(l)} ,l = 1,2 \) in (1), stochastic gradient descent (SGD) is an efficient way to train the autoencoder. The key step of SGD is computing the partial derivatives, which we obtain with the backpropagation algorithm. The steps are as follows (a small code sketch is given after the list):

For each training example \( x^{i} \)

  1. Feedforward pass of the inputs, giving the activations of each layer: \( x^{i} \), \( h^{i} \), \( \hat{x}^{i} \).

  2. Backpropagate the errors:

    (1) Compute the error of the output layer,

      $$ \delta^{(2)} = (\hat{x}^{i} - x^{i} ) \cdot \hat{x}^{i} \cdot (1 - \hat{x}^{i} ) $$
      (2)

    (2) Backpropagate the output error \( \delta^{(2)} \) to the hidden layer,

      $$ \delta^{(1)} = (W^{(2)} )^{T} \cdot \delta^{(2)} \cdot h^{i} \cdot (1 - h^{i} ) $$
      (3)

  3. Compute the desired partial derivatives,

    $$ \frac{\partial L}{{\partial W^{(1)} }} = \delta^{(1)} (x^{i} )^{T} ,\frac{\partial L}{{\partial b^{(1)} }} = \delta^{(1)} $$
    (4)
    $$ \frac{\partial L}{{\partial W^{(2)} }} = \delta^{(2)} (h^{i} )^{T} ,\frac{\partial L}{{\partial b^{(2)} }} = \delta^{(2)} $$
    (5)

  4. After T rounds of iterations, when the value of (1) finally becomes stable, we take \( h^{i} \), the higher-level code of the original \( x^{i} \), as a more compact and effective feature representation of the raw input \( x^{i} \).
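
To make the procedure above concrete, the following minimal NumPy sketch (our own illustration, not the original implementation) trains a three-layer autoencoder with sigmoid activations using per-example SGD and the backpropagation equations (2)-(5); the optional `targets` argument anticipates the denoising variant of Sect. 3, where reconstruction error is measured against the clean input. All names and hyper-parameter values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, epochs=50, lr=0.1, lam=1e-4, seed=0, targets=None):
    """Per-example SGD training of a three-layer autoencoder (Eqs. 1-5).
    If `targets` is given (e.g. clean inputs for a DAE), the reconstruction
    error is measured against it instead of against X itself."""
    rng = np.random.RandomState(seed)
    T = X if targets is None else targets
    m, n = X.shape
    W1 = rng.normal(scale=0.1, size=(n_hidden, n)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n, n_hidden)); b2 = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):
            x, t = X[i], T[i]
            h = sigmoid(W1 @ x + b1)                    # feedforward: x -> h -> x_hat
            x_hat = sigmoid(W2 @ h + b2)
            d2 = (x_hat - t) * x_hat * (1 - x_hat)      # output error, Eq. (2)
            d1 = (W2.T @ d2) * h * (1 - h)              # hidden error, Eq. (3)
            W2 -= lr * (np.outer(d2, h) + lam * W2); b2 -= lr * d2   # Eq. (5) + weight decay
            W1 -= lr * (np.outer(d1, x) + lam * W1); b1 -= lr * d1   # Eq. (4) + weight decay
    return W1, b1, W2, b2

# toy usage: 100 examples with 20 features compressed to 8 hidden units
X = np.random.rand(100, 20)
W1, b1, W2, b2 = train_autoencoder(X, n_hidden=8, epochs=10)
```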

3 Constructing the SDAE

SDAE stands for stacked denoising autoencoder, a variant of the traditional autoencoder. We can impose constraints on an AE to discover more useful representations. By altering the number of units in the hidden layer, we obtain a lossy compressed feature representation. However, merely altering the size of the bottleneck layer does not guarantee consistently good features. [11] proposed a strategy of partially corrupting the clean input to construct a denoising autoencoder (DAE). Three noise models are commonly used to corrupt the input: additive Gaussian noise, salt-and-pepper noise, and masking noise.
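
As a concrete illustration, the sketch below (a minimal NumPy example, with all names our own) corrupts an input array with each of these noise models; the corruption ratio and noise level are illustrative choices, and the salt-and-pepper case assumes inputs scaled to [0, 1].

```python
import numpy as np

def corrupt(x, kind="masking", ratio=0.3, sigma=0.1, rng=np.random):
    """Apply one of the three common DAE noise models to an input array."""
    if kind == "gaussian":                 # additive isotropic Gaussian noise
        return x + rng.normal(0.0, sigma, size=x.shape)
    noisy = x.astype(float).copy()
    idx = rng.rand(*x.shape) < ratio       # elements chosen for corruption
    if kind == "salt_and_pepper":          # corrupted elements forced to min (0) or max (1)
        noisy[idx] = rng.randint(0, 2, size=int(idx.sum()))
    else:                                  # masking noise: corrupted elements set to 0
        noisy[idx] = 0.0
    return noisy

x = np.random.rand(4, 6)
x_masked = corrupt(x, kind="masking", ratio=0.3)
```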

The training process of a DAE is the same as that of an AE, except that it takes a corrupted input \( \tilde{x} \) generated from a particular noise model and then minimises the reconstruction error between its outputs and the original clean inputs; the overall loss function stays the same. Figure 2 shows a schematic of this procedure.

Fig. 2. The schematic of denoising autoencoder

Building on these foundations, even higher-level representations can be learnt by stacking denoising autoencoders. An SDAE is a deeper neural network constructed from multiple levels of DAEs, in which the hidden output of each level is fed as input to the next.

Greedy layer-wise training [12] is an efficient way to obtain the parameters of an SDAE or SAE. After training the first-level autoencoder on corrupted input, its learnt encoding operator \( f_{\theta } \) acts on the clean input to produce the first hidden layer vector, which is then used to train the second-level DAE and learn the encoder \( f_{\theta }^{(2)} \); the procedure is repeated from there. After training the stack of encoders, backpropagation is performed through the whole system to fine-tune the parameters globally. Figure 3 illustrates the schematic for training the SDAE.
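
Under the same illustrative assumptions, greedy layer-wise pre-training might be sketched as follows; it reuses the hypothetical `train_autoencoder` and `corrupt` functions from the earlier sketches and omits the final global fine-tuning pass.

```python
import numpy as np

def encode(X, W1, b1):
    """Apply a learnt encoder f_theta to a (clean) input batch."""
    return 1.0 / (1.0 + np.exp(-(X @ W1.T + b1)))

def pretrain_sdae(X, layer_sizes, corruption=0.3):
    """Greedy layer-wise pre-training of a stack of denoising autoencoders."""
    stack, clean = [], X
    for n_hidden in layer_sizes:
        # each DAE sees a masked version of the previous clean code,
        # but is trained to reconstruct the clean code itself
        noisy = corrupt(clean, kind="masking", ratio=corruption)
        W1, b1, W2, b2 = train_autoencoder(noisy, n_hidden, targets=clean)
        stack.append((W1, b1))
        # the learnt encoder acts on the *clean* input to feed the next level
        clean = encode(clean, W1, b1)
    return stack   # global backpropagation fine-tuning would follow

stack = pretrain_sdae(np.random.rand(100, 20), layer_sizes=[16, 8])
```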

Fig. 3. Schematic for training the SDAE

3.1 SDAE Applied in Collaborative Filtering

In view of the ability of the AE to mine the inherent structure of data in many domains, we bring the SDAE into collaborative filtering, aiming to excavate the delicate structure of user preferences, obtain a compressed representation of the user behaviour vector, and then apply these high-level features to the recommendation task.

We model a user's preference behaviour as a vector of ratings over items, denoted \( p_{{u_{i} }} \in \Re^{n} \). For example, in a 5-star rating system for movies, suppose a target user \( u_{t} \) has the rated item collection \( I_{{u_{t} }} = \{ i_{1} ,i_{3} ,i_{4} ,i_{5} \} \) with \( r_{t,1} = 5,r_{t,3} = 2,r_{t,4} = 3,r_{t,5} = 4 \); then we get \( p_{{u_{t} }} = (5,NULL,2,3,4,NULL, \cdots ,NULL)^{T} \).

As shown above, owing to the sparseness of the rating matrix in RS, the user behaviour preference vector contains a large number of NULL or missing values. There are many reasons for a rating to be missing beyond simply not liking the item, so we cannot fill these NULLs with “0” and feed such input to the autoencoder; otherwise, all these zeros would be treated as negative preference, leading to a strong bias in the system.

To address the issue stated above, we extend the user preference vector to a user preference matrix, denoted \( p_{{u_{i} }} \in \Re^{n \times S} \), where S is the numerical rating scale. In this \( n \times S \) matrix, if there exists a rating \( r_{i,j} \) on item \( i_{j} \) by user \( u_{i} \) with value c, then we set \( p_{{u_{i} ,i_{j} }}^{c} = 1 \), and 0 otherwise:

$$ p_{{u_{i} ,i_{j} }}^{c} = \left\{ {\begin{array}{*{20}c} {1,} & {if\,r_{i,j} = c,} & {i_{j} \in I_{{u_{i} }} } \\ {0,} & {if\,r_{i,j} \ne c,} & {i_{j} \in I_{{u_{i} }} } \\ {0,} & {} & {i_{j} \notin I_{{u_{i} }} } \\ \end{array} } \right. $$
(6)

As for the values of \( p_{{u_{i} ,i_{j} }}^{c} \), consider the same example as before in a 5-star (S = 5) rating system for movies. For the partial feature coming from item \( i_{1} \) in the combined user preference matrix, we get \( p_{{u_{t} ,i_{1} }}^{5} = 1 \) and set the remaining entries to 0, giving the vector \( p_{{u_{t} ,i_{1} }} = (0, 0, 0, 0, 1) \); proceeding likewise for the other items, we get \( p_{{u_{t} ,i_{3} }} = (0, 1, 0, 0, 0) \), \( p_{{u_{t} ,i_{4} }} = (0, 0, 1, 0, 0) \), \( p_{{u_{t} ,i_{5} }} = (0, 0, 0, 1, 0) \). The entries for items \( i_{other} \notin I_{{u_{i} }} \), whose ratings are missing, can likewise be derived according to (6). Finally, we combine the partial feature vectors into a feature matrix.

Note that the value “0” is different from the binary indicator vector \( {\mathbf{0}} \). We also suppose \( 0 \times NULL = 0 \), which will be useful in the subsequent computation of the modified AE.
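
As a concrete illustration of (6), the following sketch (with hypothetical names) expands a user's sparse rating dictionary into the \( n \times S \) preference matrix; items with missing ratings simply remain all-zero rows.

```python
import numpy as np

def preference_matrix(ratings, n_items, scale=5):
    """Expand sparse ratings {item_index: rating value} into the n x S matrix of Eq. (6)."""
    P = np.zeros((n_items, scale))
    for item, r in ratings.items():
        P[item, int(r) - 1] = 1.0   # rating value c activates exactly one of the S columns
    return P                        # items with missing ratings stay as all-zero rows

# the example from the text: u_t rated items i_1, i_3, i_4, i_5 with 5, 2, 3, 4
ratings_ut = {0: 5, 2: 2, 3: 3, 4: 4}        # 0-based item indices
P_ut = preference_matrix(ratings_ut, n_items=8)
print(P_ut[0])                                # -> [0. 0. 0. 0. 1.], i.e. p^5 = 1 for item i_1
```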

In this article, we modify the autoencoder to adapt it to the NULL-value (sparseness) scenario illustrated above. The particular modifications are as follows.

1. For a particular user \( u_{i} \), only the input units for the items rated by that user are activated, and only those data are fed to the SDAE for encoding.

2. When building each level AE of the SDAE and decoding the hidden layer of each local DAE, we likewise only reconstruct the ratings for the items rated by user \( u_{i} \), which means that only the error coming from the items rated by the current user is backpropagated.

In this modified model, every user in the system effectively has their own personal autoencoder based on their own collection of rated items, and each such autoencoder has only a single training case. The weights and biases belonging to one user-specific autoencoder contribute to the entire global neural network. Thus, if an item was co-rated by many users, these users share the corresponding weights for that item along the path from input to output in their own autoencoder networks. Figure 4 illustrates the modified AE model, and a small code sketch is given below.
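
The following sketch illustrates the two modifications for a single-level AE with sigmoid units (a simplification of the stacked model): each user's flattened preference matrix is the sole training case of "their" autoencoder, and a mask restricts both the active inputs and the backpropagated error to rated items. All function and variable names here are our own illustrations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step_rated_only(p_u, rated_mask, W1, b1, W2, b2, lr=0.1):
    """One SGD step of the modified AE: only the units of rated items are
    active, and only their reconstruction error is backpropagated."""
    x = p_u * rated_mask                                     # unrated (NULL) units contribute 0
    h = sigmoid(W1 @ x + b1)
    p_hat = sigmoid(W2 @ h + b2)
    d2 = (p_hat - p_u) * p_hat * (1 - p_hat) * rated_mask    # error only at rated positions
    d1 = (W2.T @ d2) * h * (1 - h)
    W2 -= lr * np.outer(d2, h); b2 -= lr * d2
    W1 -= lr * np.outer(d1, x); b1 -= lr * d1
    return W1, b1, W2, b2

# usage with a toy 8-item, 5-star preference matrix (the user rated items 0 and 2)
P = np.zeros((8, 5)); P[0, 4] = 1.0; P[2, 1] = 1.0
p_u = P.flatten()
rated_mask = np.repeat(P.any(axis=1), 5).astype(float)
rng = np.random.RandomState(0)
W1, b1 = rng.normal(scale=0.1, size=(6, 40)), np.zeros(6)
W2, b2 = rng.normal(scale=0.1, size=(40, 6)), np.zeros(40)
W1, b1, W2, b2 = sgd_step_rated_only(p_u, rated_mask, W1, b1, W2, b2)
```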

Fig. 4. Modified AE model in collaborative filtering

After all the above preparation, we can train the basic DAEs and then stack them to obtain the SDAE. For each user \( u_{i} \), we substitute \( p_{{u_{i} }} \) for \( x^{i} \) and carry out the feedforward step illustrated in Sect. 2; the final output \( \hat{p}^{c}_{{u_{i} i_{j} }} \in [0,1] \), substituted for \( \hat{x}^{i} \), represents the probability of item \( i_{j} \) being rated with value c. It can be regarded as a confidence level expressing not only whether the user would rate the item but also the extent of preference the user would show when rating it.

By limiting the number of hidden neurons d′ to a scale much smaller than, yet reasonable with respect to, the total number of items \( n \), the modified AE transforms the sparse user behaviour vector with NULL values into a dense feature vector. This decreases the dimensionality of the user behaviour vector space from n to d′, in which small perturbations inherited from noise in the data are eliminated, leaving only the strongest effects and robust dependency features. Consequently, it reduces the storage and computational requirements of the subsequent collaborative filtering workflow.

3.2 Generate Recommended List and Prediction

After the SDAE workflow, we obtain a compact and efficient feature vector learnt from the user's original preference behaviour. For a particular user \( u_{i} \), we binarise the hidden feature vector \( h^{{u_{i} }} \) into \( bicode_{{u_{i} }} \); the binarisation process is given as follows, where 0.5 is the threshold that decides whether a hidden unit was highly activated by some latent feature.

$$ bicode_{{u_{i} k}} = \left\{ {\begin{array}{*{20}c} {1,} & {if\,h_{k}^{{u_{i} }} \ge 0.5} \\ {0,} & {otherwise} \\ \end{array} } \right.\quad 1 \le k \le d^{\prime} $$
(7)

We use a modified Hamming distance between these \( bicode_{{u_{i} }} \) vectors, normalised by the code length d′, as the similarity metric, formulated as

$$ sim(u_{1} ,u_{2} ) = 1 - \frac{{Hamm(bicode_{{u_{1} }} ,bicode_{{u_{2} }} )}}{{d^{\prime}}} $$
(8)
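
A small sketch of (7) and (8), under the assumption that each user's hidden code is a NumPy vector of length d′; the function names are illustrative.

```python
import numpy as np

def binarize(h, threshold=0.5):
    """Eq. (7): 1 where the hidden unit is highly activated, 0 otherwise."""
    return (h >= threshold).astype(np.uint8)

def hamming_similarity(code_u, code_v):
    """Eq. (8): 1 minus the Hamming distance normalised by the code length d'."""
    return 1.0 - np.count_nonzero(code_u != code_v) / len(code_u)

h_u1 = np.array([0.9, 0.2, 0.7, 0.1])
h_u2 = np.array([0.8, 0.6, 0.55, 0.05])
print(hamming_similarity(binarize(h_u1), binarize(h_u2)))   # -> 0.75
```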

In the last step, we use the similarity between users to generate prediction ratings and the recommendation list. Given a particular user \( u_{i} \) and an item \( i_{j} \) with \( i_{j} \notin I_{{u_{i} }} \), we define \( {\rm{deg}}_{{u_{i} ,i_{j} }} \) to express the degree to which item \( i_{j} \) is worth recommending to user \( u_{i} \).

$$ {\rm{deg}}_{{u_i},{i_j}} = \frac{{\sum\limits_{v \in Nbr({u_i},{i_j})} {sim(u_{i},v)} }}{K} $$
(9)

where \( Nbr(u_{i},i_{j}) \) is the set of neighbour users who have rated item \( i_{j} \) and share the top-K highest similarity with user \( u_{i} \). The parameter K is the neighbourhood size, which needs to be optimised and determined experimentally. We then take the Top-N items in descending order of \( {\rm{deg}}_{{u_{i} ,i_{j} }} \) as the recommendation list for the target user.

We also define the prediction rating \( score_{{u_{i} ,i_{j} }} \) as

$$ score_{{u_{i} ,i_{j} }} = \bar{r}_{{u_{i} }} + \frac{{\sum\limits_{{v \in Nbr(u_{i} ,i_{j} )}} {(r_{{v,i_{j} }} - \bar{r}_{v} )sim(u_{i} ,v)} }}{{\sum\limits_{{v \in Nbr(u_{i} ,i_{j} )}} {sim(u_{i} ,v)} }} $$
(10)
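
The sketch below computes (9) and (10) for one user-item pair, assuming precomputed binary codes, a per-user ratings dictionary, and per-user rating means; all data structures and names are illustrative assumptions rather than the original implementation.

```python
import numpy as np

def degree_and_score(u, item, codes, ratings, user_means, K=30):
    """Eq. (9) recommendation degree and Eq. (10) predicted score for (u, item)."""
    # candidate neighbours: other users who have rated the item
    raters = [v for v in ratings if v != u and item in ratings[v]]
    sims = {v: 1.0 - np.count_nonzero(codes[u] != codes[v]) / len(codes[u])
            for v in raters}
    nbr = sorted(sims, key=sims.get, reverse=True)[:K]    # Nbr(u_i, i_j): top-K raters
    degree = sum(sims[v] for v in nbr) / K                # Eq. (9)
    denom = sum(sims[v] for v in nbr)
    if denom == 0:
        return degree, user_means[u]                      # no usable neighbours
    score = user_means[u] + sum((ratings[v][item] - user_means[v]) * sims[v]
                                for v in nbr) / denom     # Eq. (10)
    return degree, score

# illustrative data: binary codes, per-user ratings, per-user rating means
codes = {1: np.array([1, 0, 1, 0]), 2: np.array([1, 1, 1, 0]), 3: np.array([0, 0, 1, 1])}
ratings = {1: {10: 4}, 2: {7: 5, 10: 3}, 3: {7: 2}}
means = {1: 4.0, 2: 4.0, 3: 2.0}
deg, score = degree_and_score(1, 7, codes, ratings, means, K=2)   # -> 0.625, 4.6
```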

4 Experiments and Discussion

4.1 Dataset and Evaluation

We use the MovieLens datasets provided by the GroupLens research team to evaluate the performance of our SDAE-CF approach and compare the results with several baseline CF-based methods. To mimic different extents of sparseness of the rating matrix, we experiment on the MovieLens 100 K, 1 M, and 10 M datasets, whose statistics are summarised in Table 1.

Table 1. Statistics of the MovieLens datasets

We randomly sample 80% of each user's ratings as the training set, and the remaining 20% are used as the testing set. We generate five independent splits and report the averaged performance in our evaluations. We first take the MAE as the evaluation metric for the user preference prediction experiments. We then use recall as an additional performance measure, particularly for the Top-N recommendation task [13]. The evaluation rules are listed as follows, where \( Test_{u} \) denotes the items rated by the user in the testing set and \( I_{r} \) is the set of Top-N recommendations generated from the corresponding training set, denoted \( Train_{u} \).

$$ MAE = \frac{{\sum\nolimits_{u \in U} {\sum\nolimits_{{i_{j} \in Test_{u} }} {\left| {r_{{u,i_{j} }} - score_{{u,i_{j} }} } \right|} } }}{{\sum\nolimits_{u \in U} {\left| {Test_{u} } \right|} }},\,recall = \frac{{\sum\nolimits_{u \in U} {\left| {I_{r} \cap Test_{u} } \right|} }}{{\sum\nolimits_{u \in U} {\left| {Test_{u} } \right|} }} $$
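
A minimal sketch of the two evaluation rules, assuming the test ratings, predicted scores, and Top-N lists are held in per-user dictionaries; the data layout is an illustrative assumption.

```python
def mae(test, predictions):
    """Mean absolute error over all (user, item) pairs of the testing set."""
    num = sum(abs(r - predictions[u][i]) for u, items in test.items()
              for i, r in items.items())
    den = sum(len(items) for items in test.values())
    return num / den

def recall_at_n(test, recommended):
    """Fraction of held-out test items that appear in the Top-N recommendation lists."""
    hits = sum(len(set(recommended[u]) & set(test[u])) for u in test)
    total = sum(len(test[u]) for u in test)
    return hits / total

# illustrative usage
test = {1: {7: 4.0, 9: 3.0}, 2: {5: 5.0}}
preds = {1: {7: 3.5, 9: 3.0}, 2: {5: 4.0}}
recs = {1: [7, 2, 3], 2: [8, 4]}
print(mae(test, preds), recall_at_n(test, recs))   # -> 0.5  0.333...
```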

4.2 Results and Discussion

In this section, we present the prediction quality of our SDAE-CF model. The benchmark methods include three conventional CF-based methods: the pure user-based method [2], the pure item-based method [3], and the SVD feature method [5].

To enable a fair comparison, the experiments are set up as follows. Following [2], we set the neighbourhood size to 30 and use Pearson correlation as the similarity metric for the user-based benchmark algorithm. For [3], we choose the adjusted cosine as the similarity metric. The third baseline, the SVD feature method, is a latent factor model with intuitive analogies to our model, which led us to first experiment on a middle-sized dataset, MovieLens 1 M, to determine the optimal number of hidden units for our model and to set the number of latent factors in SVD in line with it. The sensitivity of MAE to the number of hidden units in a basic DAE is illustrated in Fig. 5, showing that 200 hidden units give better prediction performance and that there is no obvious difference beyond 200. We therefore set the number of latent factors in the SVD feature method to 200.

Fig. 5. The sensitivity of the number of hidden units

Table 2 presents the MAE results for all three baseline methods and our SDAE-based model; the fundamental AE and DAE structures are also included. Empirically, the table shows that the DAE is relatively more accurate than the basic AE when faced with higher-dimensional data. As Table 2 shows, DAE and SDAE seem to underperform on the lower-dimensional datasets, yet they still slightly outperform the other alternatives. The purpose of introducing noise in the autoencoder is to learn more robust features from the input. However, if the input itself is not strong enough, introducing noise achieves nothing and may cause unrecoverable damage to the original input. In general, AE-based models perform at nearly the same level as the SVD feature method in terms of MAE.

Table 2. Performance on MovieLens datasets by MAE

The second form of interest prediction is the Top-N recommendation task. We carry out experiments on the MovieLens 10 M dataset to obtain the Top-N recommended items directly according to (9), which requires no further prediction score calculation, and we vary N over the range of 5 to 40. Table 3 reports the recall results when the length of the recommendation list is 20 and 40. Figure 6 illustrates that the SDAE-based method outperforms the other three baseline methods on the recall metric, followed by the SVD feature method. However, there is only a small performance gap between SDAE and SVD. The two traditional neighbourhood-based methods are in an inferior position in this respect, because in a sparser dataset the correlation-based similarity calculation is hindered by the lack of enough co-rated items between users.

Table 3. Performance on MovieLens 10 M dataset by recall*
Fig. 6. Performance comparison by recall

5 Conclusion

In this paper, we analysed the problems of traditional collaborative filtering methods and proposed the SDAE-CF model to alleviate them. On the one hand, using the SDAE to encode the user preference vector breaks the limitation of similarity calculation among users depending on commonly rated items, thus alleviating the loss of potential information. On the other hand, since the modified similarity based on the Hamming distance can be calculated in constant time, the need to store similarities extensively is reduced, which extends the scalability of traditional correlation-based algorithms. Experiments on MovieLens datasets of three different scales show that SDAE-CF has potential for dealing with high-dimensional datasets and achieves relatively good performance in both score prediction and the Top-N recommendation task.

For future work, we will consider integrating content-based techniques, which can provide a richer data foundation for extracting more useful latent features, and further confirm the applicability of the SDAE-CF model.