1 Introduction

Recommender systems help users discover potentially preferred items from a large variety of items on the basis of their tastes [1]. Collaborative filtering (CF) is one of the key techniques for building personalized recommender systems, owing to its accuracy and scalability [2]. The essence of CF is to infer a user's preferences from the behavior data of that user and of other users. Most conventional CF methods are based on matrix factorization (MF) [3], which projects users and items into a shared latent space and represents each user or item by a latent feature vector [4]. However, MF-based methods suffer from rating sparsity, which limits the accuracy of the learned latent user/item representations. To address rating sparsity, many works incorporate users' and items' side information into conventional MF models. For more accurate extraction of latent factors from side information, previous studies employ latent Dirichlet allocation (LDA) [5, 6], Bayesian personalized ranking [7, 8], the denoising autoencoder (DAE) [9] and the stacked denoising autoencoder (SDAE) [10,11,12,13] to model the side information of users or items. Nevertheless, these methods use the inner product to model interactions between users and items, which restricts their capability of capturing non-linearity [14]. To model non-linear interactions, various approaches apply deep neural networks and achieve promising performance, such as neural collaborative filtering (NCF) [14], deep matrix factorization (DMF) [15], neural factorization machine (NFM) [16], DeepFM [17], JRL [18], GCMC [19], DeepCoNN [20], ConvNCF [21] and IRGAN [22]. Nonetheless, these deep neural networks cannot capture the uncertainty of latent user/item representations.

Recently, several works have taken advantage of deep generative models, such as the variational autoencoder (VAE) [23], to perform the CF task. VAE is a non-linear probabilistic model that can capture uncertainty, and its non-linearity enables the exploration of non-linear probabilistic latent-variable models on large-scale recommendation datasets, as in the collaborative variational autoencoder (CVAE) [24], CLVAE [25], VAECF [26] and VAE-HPrior [27]. Despite the effectiveness of these VAE-based methods, several drawbacks remain. CVAE directly uses the inner product to model interactions, which hinders it from learning non-linear interactions between users and items. CLVAE and VAECF exploit only the rating information, which leads to poor performance when the rating matrix is extremely sparse. VAECF and VAE-HPrior model only users' behaviors to generate predictions, which makes them unable to recommend items to a new user. Besides, VAECF assumes the same Gaussian prior for all users, resulting in poor latent user representations [28].

To address these problems, we devise a deep hybrid framework, neural variational collaborative filtering (NVCF), and propose three NVCF-based instantiations with side information for top-k recommendation. Different from the user/item generative processes in most existing VAE-based methods, we model the generative processes of both users and items through a unified neural variational model with a parallel structure, which can effectively learn non-linear latent representations of users and items for CF. The side information of users and items is incorporated into their latent factors through a deep neural network for the neural CF task, so NVCF can mitigate rating sparsity and learn better latent representations of users and items. The parameters of the prior networks are learned from data, allowing them to embed users' preferences and items' features into the latent factors of users and items, respectively. To infer the posteriors of the latent user and item factors, we derive a Stochastic Gradient Variational Bayes (SGVB) algorithm, so that the parameters of our model can be learned effectively by back-propagation. The rest of this paper is organized as follows: Sect. 2 provides an overview of related work on CF models. Sect. 3 presents our models and discusses the parameter learning process. Sect. 4 presents experimental results and discussions, followed by conclusions and future work in Sect. 5.

2 Related work

In recent years, deep learning methods have attained tremendous achievements in various fields [29, 30]. Due to the ability of neural networks to discover non-linear and subtle relationships in user-item feedback, many works utilize neural networks for the CF task. To incorporate item content information into latent item factors, collaborative deep learning (CDL) [10] integrates SDAE into probabilistic matrix factorization (PMF), which can balance the influences of user ratings and side information. Collaborative deep ranking (CDR) [11] utilizes a pairwise framework with implicit feedback, which leverages deep feature representations of item content in Bayesian pairwise ranking. The deep collaborative filtering framework [12] utilizes deep feature learning to aid collaborative recommendation; it embeds the content information of both items and users, while CDL and CDR only consider item features. Recently, the additional stacked denoising autoencoder (aSDAE) [13] was proposed to incorporate side information into MF, jointly learning deep latent user/item factors from side information and performing the CF task on user ratings. GCMC [19] treats the recommendation problem as a link prediction task with graph CNNs, which can easily integrate user/item side information (such as social networks and item relationships) into the recommendation model.

Since the above methods apply the inner product to model user-item interactions, they cannot capture the complex structure of the interaction data. The NCF framework [14] was proposed to exploit both the linearity of MF and the non-linearity of MLP to capture linear and non-linear relationships between users and items. NFM [16] employs a Bi-Interaction layer to incorporate both user ratings and item content information. Based on factorization machines, DeepFM [17] seamlessly integrates a factorization machine and an MLP; it models high-order feature interactions via the deep neural network and low-order interactions via the factorization machine. For joint representations of users and items, JRL [18] places an MLP above the element-wise product of the user and item embeddings, where user and item side information is adopted to learn the corresponding representations with deep representation learning architectures. DeepCoNN [20] adopts two parallel CNNs to model user behaviors and item properties from review texts, which alleviates data sparsity and enhances interpretability by exploiting rich semantic representations of reviews. ConvNCF [21] utilizes the outer product instead of the dot product to model user/item interaction patterns, and applies CNNs over the result of the outer product to capture high-order correlations among embedding dimensions. IRGAN [22] is the first model to take advantage of generative adversarial networks for item recommendation. Due to the power of deep generative models to capture uncertainty and non-linearity [23], several works utilize them for the CF task. For example, CVAE [24] applies VAE to incorporate item content information into MF. CLVAE [25] extends VAE with augmented structures to model auxiliary information and implicit user feedback. VAECF [26] directly applies VAE to the CF task, and VAE-HPrior [27] incorporates user-dependent priors in the latent VAE space to encode users' preferences as functions of item reviews. Unlike previous VAE-based recommendation methods, this paper constructs the generative processes of users and items through a unified neural variational framework, which enables our model to capture both linear and non-linear latent representations of users and items.

3 Neural variational collaborative filtering with side information

In this section, we present the neural variational collaborative filtering (NVCF) framework, as shown in Fig. 1. NVCF contains two main components: the feature extraction module and the NVCF module. In the feature extraction process, NVCF learns and extracts user/item features through a unified deep generative framework with a parallel structure. The latent user/item vectors are then fed into the NVCF module to learn user-item relations and finally generate the rating prediction. The notation used throughout the paper is summarized in Table 1.

Fig. 1 NVCF framework

3.1 Notations

Given M users and N items, the latent factors of user and item are denoted by \(U = \{u_i|i=1, \ldots , M\} \in \mathbb {R}^{K \times M}\) and \(V = \{v_j|j=1, \ldots , N\} \in \mathbb {R}^{K \times N}\) respectively, where K denotes the dimensions of latent factors. For implicit feedback, the user rating matrix is denoted by \(R \in \mathbb {R}^{M \times N}\), where \(R_{ij} = 1\) indicates that the i-th user has interacted with the j-th item, otherwise \(R_{ij} = 0\). The user’s and item’s side information is denoted by two “bag-of-items” vectors over users and items, \(X =\{X_i|i=1, \ldots , M\}\in \mathbb {R}^{P \times M}\) and \(Y = \{Y_j|j=1, \ldots , N\}\in \mathbb {R}^{Q \times N}\) respectively, where P and Q are the dimensions of user side information and item side information respectively. Here, we call X and Y latent profile representation and latent content representation, respectively. Given R, X and Y, the problem is to infer latent factors \(u_i\) and \(v_j\), and then to predict the missing ratings \(\hat{R}\).

Table 1 Symbols and notations

3.2 Feature extraction

As mentioned in [26], most MF-based methods assume that the prior distributions of the user and item latent factors are standard Gaussians, and predict ratings only through user-item feedback. Some MF methods incorporate either users' or items' side information into rating prediction via linear regression, which limits the accuracy of inferring latent relations between users and items. To further improve prediction performance, our model incorporates both users' and items' side information into feature learning, which benefits the inference of latent user/item factors.

3.2.1 Generative model

To learn robust user and item features, a unified neural variational framework is built with a parallel structure. The generative process is similar to that of the deep latent Gaussian model [31]. For each user \(u_i\), the generative model starts by sampling a K-dimensional latent representation \(z_{u_i}\) from a standard Gaussian prior, i.e. \(z_{u_i} \sim N(0,\mathbb {I}_K)\). The observed variable \(X_i\) is generated from its latent variable \(z_{u_i}\) through an MLP (decoder) with generative parameters \(\theta\), i.e. \(X_i \sim p_{\theta }(X_i|z_{u_i})\). The likelihood \(p_{\theta }(X_i|z_{u_i})\) can be a multivariate Bernoulli distribution (for binary data) or a Gaussian distribution (for real-valued data). The generative process of the user profile is defined as follows:

  (1) For each layer \(l\in [1,L]\) of the generative network,

      (a) For each column n of the weight matrix \(W_l^d\), draw

          $$\begin{aligned} W_{l,n}^d \sim N(0,\lambda _{w}^{-1} \mathbb {I}_K) \end{aligned}$$

      (b) Draw the bias vector

          $$\begin{aligned} b_l^d \sim N(0,\lambda _{w}^{-1} \mathbb {I}_K) \end{aligned}$$

      (c) For each row i of \(h_l^d\), draw

          $$\begin{aligned} h_{l,i}^d \sim N(\sigma (h_{l-1,i}^d W_l^d + b_l^d), \lambda _s^{-1}\mathbb {I}_K ) \end{aligned}$$

  (2) For each \(X_i\),

      (a) If \(X_i\) is binary, draw

          $$\begin{aligned} X_i \sim B(\sigma (h_{L}^d W_{L+1}^d + b_{L+1}^d)) \end{aligned}$$

      (b) If \(X_i\) is real-valued, draw

          $$\begin{aligned} X_i \sim N(h_{L}^d W_{L+1}^d + b_{L+1}^d, \lambda _X^{-1}\mathbb {I}_K) \end{aligned}$$

where \(\lambda _w\), \(\lambda _s\) and \(\lambda _X\) are hyperparameters, and \(h_l^d\) denotes the hidden layers of the decoder. As in SDAE, \(\lambda _s\) is taken to infinity for computational efficiency.

The latent representation \(z_{u_i}\) is drawn from a Gaussian prior with zero mean and identity covariance: \(z_{u_i} \sim N(0,\mathbb {I}_K)\). The user's latent representation \(u_i\) is the sum of a latent user offset \(\epsilon _i\) and the latent user profile vector \(z_{u_i}\):

$$\begin{aligned} u_i = \epsilon _i + z_{u_i} \end{aligned}$$
(1)

The generative process of item content is similar to that of the user profile, and the item latent representation \(v_j\) is composed of a latent item offset and a latent item content vector: \(v_j = \epsilon _j + z_{v_j}\).
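To make the generative network concrete, the following is a minimal sketch of the user-side decoder, assuming PyTorch (the paper does not prescribe a framework); the hidden size, depth and output dimension are illustrative, not the paper's exact settings. The item-side decoder is identical up to dimensions, matching the parallel structure of the framework.

```python
import torch
import torch.nn as nn

class UserDecoder(nn.Module):
    """Generative network p_theta(X_i | z_u): maps a K-dim latent user
    vector back to a P-dim profile vector. Sizes are hypothetical."""
    def __init__(self, K=128, H=256, P=1682, binary=True):
        super().__init__()
        # Hidden layers h_1^d, ..., h_L^d of the decoder.
        self.net = nn.Sequential(
            nn.Linear(K, H), nn.ReLU(),
            nn.Linear(H, H), nn.ReLU(),
        )
        self.out = nn.Linear(H, P)  # h_L^d W_{L+1}^d + b_{L+1}^d
        self.binary = binary

    def forward(self, z_u):
        logits = self.out(self.net(z_u))
        # Binary profiles: Bernoulli mean via sigmoid;
        # real-valued profiles: Gaussian mean (identity output).
        return torch.sigmoid(logits) if self.binary else logits
```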

3.2.2 Inference model

The inference model is an MLP network (encoder) corresponding to the decoder in the generative model. For users, the inference process approximates the intractable posterior distribution \(p_{\theta }(z_{u_i}|X_i)\) induced by the generative network. Using the Stochastic Gradient Variational Bayes (SGVB) estimator, the posterior of the latent user profile variable \(z_{u_i}\) can be approximated by a tractable variational distribution \(q_{\phi }(z_{u_i}|X_i)\):

$$\begin{aligned} q_{\phi }(z_{u_i}|X_i) = N(\mu _{\phi }(X_i),diag(\sigma _{\phi }^2(X_i))) \end{aligned}$$
(2)

where \(\mu _{\phi } \in \mathbb {R}^K\) and \(\sigma _{\phi }^2 \in \mathbb {R}^K\) are the mean and variance of the approximate posterior, respectively; both are outputs of the inference model (i.e. non-linear functions of \(X_i\) with variational parameters \(\phi\)).

Similar to [23, 26], the inference process of \(z_u\) is defined as follows:

  (1) For each layer \(l\in [1,L]\) of the inference network,

      (a) For each column n of the weight matrix \(W_l^e\), draw

          $$\begin{aligned} W_{l,n}^e \sim N(0,\lambda _{w}^{-1} \mathbb {I}_K) \end{aligned}$$

      (b) Draw the bias vector

          $$\begin{aligned} b_l^e \sim N(0,\lambda _{w}^{-1} \mathbb {I}_K) \end{aligned}$$

      (c) For each row i of \(h_l^e\), draw

          $$\begin{aligned} h_{l,i}^e \sim N(\sigma (h_{l-1,i}^e W_l^e + b_l^e), \lambda _s^{-1}\mathbb {I}_K ) \end{aligned}$$

  (2) For each user \(u_i\),

      (a) Draw the latent mean vector

          $$\begin{aligned} \mu _i \sim N(h_L^e W_{\mu }^e + b_{\mu }^e, \lambda _s^{-1}\mathbb {I}_K ) \end{aligned}$$

      (b) Draw the latent covariance vector

          $$\begin{aligned} \log \sigma _i^2 \sim N(h_L^e W_{\sigma }^e + b_{\sigma }^e, \lambda _s^{-1}\mathbb {I}_K ) \end{aligned}$$

      (c) Draw the latent profile vector

          $$\begin{aligned} z_{u_i} \sim N(\mu _i, diag(\sigma _i^2)) \end{aligned}$$

As explained in [26], the evidence lower bound (ELBO) for \(X_i\) can be estimated using the SGVB estimator:

$$\begin{aligned} \mathcal {L}(\theta , \phi ; X_i) =&\mathbb {E}_{q_{\phi }(z_u|X_i)}[\log p(u_i|z_u) + \log p_{\theta }(X_i|z_u)] \nonumber \\&- \beta \cdot \mathbb {KL}(q_{\phi }(z_u|X_i)\Vert p(z_u)) \nonumber \\ \simeq&\log p(u_i|z_{u_i,l}) + \frac{1}{L} \sum \limits _{l=1}^{L}\log p_{\theta }(X_i|z_{u_i,l}) \nonumber \\&- \beta \cdot \mathbb {KL}(q_{\phi }(z_u|X_i)\Vert p(z_u)) \nonumber \\ \mathbb {KL}(q_{\phi }(z_u|X_i ) \Vert p(z_u)) =&\frac{1}{2} \sum \limits _{k=1}^{K}(\mu _{i,k}^2 + \sigma _{i,k}^2 - \log \sigma _{i,k}^2 - 1)\nonumber \\&z_{u_i,l} = \mu _i + \sigma _i \odot \varepsilon _{i,l} \end{aligned}$$
(3)

where \(\mathbb {KL}\) denotes the Kullback-Leibler divergence, \(\beta \in [0,1]\) is a parameter to control the regularization strength for addressing the posterior collapse problem [32], \(\varepsilon _{i,l} \sim N(0,\mathbb {I})\), and \(\odot\) represents the element-wise product.

The inference process for item content is similar to that for user profiles, and the ELBO for the item network can be derived analogously:

$$\begin{aligned} \mathcal {L}(\theta , \phi ; Y_j) =&\mathbb {E}_{q_{\phi }(z_v|Y_j)}[\log p(v_j|z_v) + \log p_{\theta }(Y_j|z_v)] \nonumber \\&- \beta \cdot \mathbb {KL}(q_{\phi }(z_v|Y_j)\Vert p(z_v)) \nonumber \\ \simeq&\log p(v_j|z_{v_j,l}) + \frac{1}{L} \sum \limits _{l=1}^{L}\log p_{\theta }(Y_j|z_{v_j,l}) \nonumber \\&- \beta \cdot \mathbb {KL}(q_{\phi }(z_v|Y_j)\Vert p(z_v)) \nonumber \\ \mathbb {KL}(q_{\phi }(z_v|Y_j) \Vert p(z_v)) =&\frac{1}{2} \sum \limits _{k=1}^{K}(\mu _{j,k}^2 + \sigma _{j,k}^2 - \log \sigma _{j,k}^2 - 1) \nonumber \\&z_{v_j,l} = \mu _j + \sigma _j \odot \varepsilon _{j,l} \end{aligned}$$
(4)
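As a sketch of the inference network and a single-sample SGVB estimate of the ELBO (Eqs. 3-4), again assuming PyTorch and a Bernoulli likelihood over binary profiles; the `UserDecoder` above and all sizes are illustrative, and the \(\log p(u_i|z_u)\) term is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UserEncoder(nn.Module):
    """Inference network q_phi(z_u | X_i): outputs mean and log-variance."""
    def __init__(self, P=1682, H=256, K=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(P, H), nn.ReLU())
        self.mu = nn.Linear(H, K)       # W_mu^e, b_mu^e
        self.logvar = nn.Linear(H, K)   # W_sigma^e, b_sigma^e

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.logvar(h)

def elbo(encoder, decoder, x, beta=0.2):
    """One-sample SGVB estimate of the ELBO for a batch of binary profiles."""
    mu, logvar = encoder(x)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps                    # reparameterization
    x_hat = decoder(z)                                        # Bernoulli mean
    rec = -F.binary_cross_entropy(x_hat, x, reduction='sum')  # log p_theta(x|z)
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1)
    return rec - beta * kl                                    # beta-weighted KL
```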

3.3 Side information embedded NVCF

Inspired by NCF, we propose three NVCF-based models to improve prediction performance: a generalized MF model with side information (sGMF), an MLP with side information (sMLP), and a fusion of sGMF and sMLP. The CF module of sGMF uses a computation similar to the inner product of MF, applying a linear kernel to model latent feature interactions. The CF process of sMLP concatenates the user and item latent vectors and then uses a non-linear kernel, an MLP network, to learn the interactions between user and item latent features. The CF part of the fused method combines sGMF and sMLP under the NVCF framework, where sGMF and sMLP share the same embedding layer and the outputs of their interaction functions are combined. All three models integrate side information to improve prediction performance.

3.3.1 sGMF

sGMF computes the element-wise product of the extracted user and item latent vectors and feeds the result to a fully connected neural layer. The element-wise product of the user and item latent vectors in the first neural CF layer is defined in Eq. (5); sGMF then projects the vector to the output layer, as shown in Eq. (6).

$$\begin{aligned} \varPsi _1(u_i,v_j)= & {} u_i \odot v_j \end{aligned}$$
(5)
$$\begin{aligned} \hat{R}_{ij} = a_{out}(\hbar ^{\top } \varPsi (u_i,v_j))= & {} a_{out}(\hbar ^{\top } (u_i \odot v_j)) \end{aligned}$$
(6)

where \(a_{out}\) denotes the activation function, \(\hbar\) denotes the edge weights of the output layer, and \(\hat{R}_{ij}\) denotes the predicted rating. If \(a_{out}\) is the identity function and \(\hbar\) is an all-ones vector, sGMF reduces exactly to MF.

Under the NVCF framework, \(a_{out}\) can be a non-linear activation function and \(\hbar\) can be learned from the training data, so sGMF has a more powerful learning capability than MF. Unlike the original GMF, which relies only on implicit feedback, sGMF incorporates both user and item side information into latent user/item representation learning and employs VAEs to extract the user and item latent vectors, which leads to better performance.
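A minimal sketch of the sGMF interaction head of Eqs. (5)-(6), assuming PyTorch; `u` and `v` are the latent vectors produced by the feature-extraction module, and \(a_{out}\) is taken to be a sigmoid.

```python
import torch
import torch.nn as nn

class SGMF(nn.Module):
    """Generalized MF head: element-wise product followed by the learned
    output weights hbar and a sigmoid. Sketch with illustrative K."""
    def __init__(self, K=128):
        super().__init__()
        self.hbar = nn.Linear(K, 1, bias=False)  # learned output weights

    def forward(self, u, v):
        return torch.sigmoid(self.hbar(u * v))   # Eqs. (5)-(6)
```

With `hbar` frozen to all ones and the sigmoid replaced by the identity, this reduces exactly to MF, as noted above.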

3.3.2 sMLP

sMLP extracts user and item features from auxiliary information in the same way as sGMF, but takes a different learning strategy in the NVCF module. Instead of combining the user and item latent vectors as in MF, sMLP concatenates the learned latent user vector \(u_i\) and latent item vector \(v_j\), then adopts an MLP network in the NVCF module to learn high-level user-item relations. The CF process of sMLP is defined as follows:

$$\begin{aligned} \mathcal {Z}_1&= \varPsi _1(u_i,v_j) = \left[ \begin{array}{c} u_i \\ v_j \end{array} \right] , \nonumber \\ \varPsi _2(\mathcal {Z}_1)&= a_2 ( W^{\top }_2 \mathcal {Z}_1 + b_2 ), \nonumber \\&\cdots \cdots \nonumber \\ \varPsi _{G}(\mathcal {Z}_{G-1})&= a_G ( W^{\top }_G \mathcal {Z}_{G-1} + b_G )\nonumber \\ \hat{R}_{ij}&= \sigma (\hbar ^{\top } \varPsi _{G}(\mathcal {Z}_{G-1})) \end{aligned}$$
(7)

where \(W_G\), \(b_G\), and \(a_G\) denote the weight matrix, bias vector, and activation function of the G-th layer, respectively, and \([\cdot ]\) denotes concatenation.

Different from the original MLP in [14], which depends only on implicit feedback, sMLP learns user-item relations through the combination of a VAE and an MLP network, where the VAE extracts user and item features from auxiliary information and the MLP performs the CF task. Thus, sMLP is able to learn the vital relations between users and items.
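A corresponding sketch of the sMLP head of Eq. (7), assuming PyTorch; the tower sizes are illustrative, not the paper's tuned configuration.

```python
import torch
import torch.nn as nn

class SMLP(nn.Module):
    """MLP head: concatenate u and v, then pass the result through a
    tower of fully connected layers before the output weights hbar."""
    def __init__(self, K=128, layers=(256, 128, 64)):
        super().__init__()
        dims = [2 * K] + list(layers)
        tower = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            tower += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.tower = nn.Sequential(*tower)
        self.hbar = nn.Linear(dims[-1], 1, bias=False)

    def forward(self, u, v):
        z1 = torch.cat([u, v], dim=-1)            # Z_1 = [u_i; v_j]
        return torch.sigmoid(self.hbar(self.tower(z1)))
```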

3.3.3 Fusion of sGMF and sMLP

Similar to NeuMF [14], the model combining sGMF with a single-layer sMLP can be formulated as follows:

$$\begin{aligned} \hat{R}_{ij} = \sigma \left( g^{\top } a(u_i \odot v_j) + W \left[ \begin{array}{c} u_i \\ v_j \end{array} \right] + b \right) \end{aligned}$$
(8)

However, sharing the embeddings of sGMF and sMLP might limit the performance of the fused model [14]. For datasets where the optimal embedding sizes of the two models differ considerably, this solution may fail to obtain the optimal ensemble. To give the fused model more flexibility, we allow sGMF and sMLP to learn separate embeddings and combine the two models by concatenating their last hidden layers. Figure 2 illustrates our proposed method, which is formulated as follows:

$$\begin{aligned} \varPsi ^{sGMF}&= u^{sG}_i \odot v^{sG}_j \nonumber \\ \varPsi ^{sMLP}&= a_G \left( W^{\top }_G \left( a_{G-1} \left( \cdots a_2 (W^{\top }_2 \left[ \begin{array}{c} u^{sM}_i \\ v^{sM}_j \end{array} \right] + b_2) \cdots \right) \right) + b_G \right) \nonumber \\ \hat{R}_{ij}&= \sigma \left( \hbar ^{\top } \left[ \begin{array}{c} \varPsi ^{sGMF} \\ \varPsi ^{sMLP} \end{array} \right] \right) \end{aligned}$$
(9)

where \(u^{sG}_i,v^{sG}_j\) and \(u^{sM}_i,v^{sM}_j\) represent user/item embeddings for sGMF and sMLP.

Fig. 2 Side information embedded neural variational MF (sNVMF) model

As discussed in [14], ReLU is adopted as the activation function of sMLP layers. This fusion model combines the linearity of MF and non-linearity of MLP for modelling user and item latent structures with side information, so we call it side information embedded neural variational MF (sNVMF).
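A sketch of the fused model of Eq. (9), assuming PyTorch; note that the two branches keep separate user/item latent vectors and only their last layers are concatenated before the output.

```python
import torch
import torch.nn as nn

class SNVMF(nn.Module):
    """Fusion head: an sGMF branch and an sMLP branch with separate
    embeddings, concatenated before a single output layer (Eq. 9)."""
    def __init__(self, K=128, layers=(256, 128, 64)):
        super().__init__()
        dims = [2 * K] + list(layers)
        tower = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            tower += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.tower = nn.Sequential(*tower)
        self.hbar = nn.Linear(K + dims[-1], 1, bias=False)

    def forward(self, u_g, v_g, u_m, v_m):
        psi_gmf = u_g * v_g                                   # sGMF branch
        psi_mlp = self.tower(torch.cat([u_m, v_m], dim=-1))   # sMLP branch
        fused = torch.cat([psi_gmf, psi_mlp], dim=-1)
        return torch.sigmoid(self.hbar(fused))
```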

3.4 Optimization

Generally, the loss function consists of the reconstruction error from feature extraction and the prediction error. The reconstruction error corresponds to the VAE loss functions for user and item feature extraction, which are given by their respective ELBOs. For convenience, the ELBOs of the user and item prior networks are denoted by \(\mathcal {L}_u\) and \(\mathcal {L}_v\), respectively.

In the prediction process, NVCF outputs the predicted rating \(\hat{R}_{ij}\) for each user-item pair \((u_i,v_j)\). Due to the nature of implicit feedback, user-item ratings can be regarded as binary labels: the implicit feedback is 1 if a user is relevant to an item, and 0 otherwise. The predicted \(\hat{R}_{ij}\) can therefore be regarded as the probability that a user is relevant to an item, so the output \(\hat{R}_{ij}\) must be constrained to the range [0, 1] by a sigmoid activation function. Similar to [14], the loss function is defined as follows.

$$\begin{aligned} \mathcal {L}_{\mathrm{s}} = - \sum _{(i,j) \in \mathcal {T}\cup \mathcal {T}^-} \left[ R_{ij} \log \hat{R}_{ij} + (1-R_{ij}) \log (1-\hat{R}_{ij}) \right] \end{aligned}$$
(10)

where \(\mathcal {T}\) denotes the set of observed instances and \(\mathcal {T}^-\) denotes a set of negative instances, which can be sampled from unobserved user-item interactions.
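Negative instances can be sampled uniformly from the unobserved interactions; a hedged sketch follows (the 4:1 ratio matches the setting in Sect. 4.1.3, and the helper name is ours).

```python
import random

def sample_negatives(observed, num_items, ratio=4):
    """For each observed (user, item) pair, draw `ratio` items the user
    has not interacted with. `observed` is a set of (user, item) pairs."""
    negatives = []
    for (u, _) in observed:
        for _ in range(ratio):
            j = random.randrange(num_items)
            while (u, j) in observed:       # resample until unobserved
                j = random.randrange(num_items)
            negatives.append((u, j))
    return negatives
```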

Minimizing this objective, which is exactly the binary cross-entropy loss, can be done with stochastic gradient descent (SGD). By employing a probabilistic treatment, NVCF casts recommendation with implicit feedback as a binary classification problem. Thus, the overall loss function for training NVCF is defined as

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{\mathrm{s}} + \lambda _u \cdot \mathcal {L}_u + \lambda _v \cdot \mathcal {L}_v \end{aligned}$$
(11)

where \(\lambda _u\) and \(\lambda _v\) are hyperparameters weighting the user and item reconstruction terms.
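A sketch of the overall objective, combining Eq. (10) with the (negated) ELBOs of Eqs. (3)-(4) as in Eq. (11); the helper name and default weights are ours.

```python
import torch.nn.functional as F

def nvcf_loss(r_hat, r, elbo_u, elbo_v, lam_u=1.0, lam_v=1.0):
    """Eq. (11): L = L_s + lambda_u * L_u + lambda_v * L_v, where the VAE
    losses L_u and L_v are the negated ELBOs of the user/item networks
    (maximizing an ELBO equals minimizing its negative)."""
    l_s = F.binary_cross_entropy(r_hat, r, reduction='sum')  # Eq. (10)
    return l_s + lam_u * (-elbo_u) + lam_v * (-elbo_v)
```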

3.5 Prediction

After model training and parameter learning, we can predict, for a user-item pair \((u_i,v_j)\), the probability that the user will interact with the item. Given a trained model and a user-item pair \((u_i,v_j)\) without any observed relation, the predicted rating can be written as follows:

$$\begin{aligned} \hat{R}_{ij} = \sigma \left( \hbar ^{\top } \left[ \begin{array}{c} \varPsi ^{sGMF} \\ \varPsi ^{sMLP} \end{array} \right] \right) , \ \hbar \leftarrow \left[ \begin{array}{c} \alpha h^{sGMF} \\ (1-\alpha ) h^{sMLP} \end{array} \right] \end{aligned}$$
(12)

where \(h^{sGMF}\) and \(h^{sMLP}\) denote the \(\hbar\) vectors of the pretrained sGMF and sMLP models, respectively, and \(\alpha\) is a hyperparameter determining the tradeoff between the two pretrained models.
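A sketch of this \(\alpha\)-weighted initialization, assuming the `SNVMF` module above; `h_gmf` and `h_mlp` stand for the output weight vectors of the pretrained branches.

```python
import torch

def init_fused_output(model, h_gmf, h_mlp, alpha=0.5):
    """Initialize the fused output weights hbar from the pretrained sGMF
    and sMLP output vectors, weighted by the tradeoff alpha (Eq. 12)."""
    with torch.no_grad():
        model.hbar.weight.copy_(
            torch.cat([alpha * h_gmf, (1 - alpha) * h_mlp], dim=-1))
```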

3.6 Computational complexity

The computational complexity of NVCF is \(max(\mathrm {O}_{{\mathrm{VAE}}_u},\mathrm {O}_{{\mathrm{VAE}}_v}) + \mathrm {O}_{\mathrm{NCF}}\). The first term is the complexity of the VAE part (feature extraction), with \(\mathrm {O}_{{\mathrm{VAE}}_u}=\mathrm {O}(MP^2+LH^2)\) and \(\mathrm {O}_{{\mathrm{VAE}}_v}=\mathrm {O}(NQ^2+LH^2)\), where M and N are the numbers of users and items, P and Q are the dimensions of the user and item side information, L is the number of MLP layers in the VAE, and H is the average hidden layer size. Since \(L,H\ll min(P,Q)\), the feature-extraction complexity is dominated by the squared terms \(MP^2\) and \(NQ^2\). The second term is the complexity of NCF, which is linear in the size of the rating matrix and the number of neural network layers, i.e. \(\mathrm {O}(\mathrm {NCF}) = \mathrm {O}(MN+MNG)\), where G is the number of layers; because \(G \ll min(M,N)\), this is dominated by the MN term. Thus, the overall computational complexity of NVCF is quadratic.

4 Experiments and results

4.1 Experimental settings

4.1.1 Datasets

In this section, four public datasets from GroupLens, Yelp and Epinions are used to evaluate our model: MovieLens-100K (ML100K), MovieLens-1M (ML1M), the Yelp Challenge Dataset (Yelp) and the Extended Epinions dataset (EPext). Table 2 summarizes the characteristics of the four datasets.

Table 2 Statistics of MovieLens, Yelp and Extended Epinions datasets

The ML100K and ML1M datasets have been widely used to evaluate CF-based recommendation algorithms. The former contains 943 users and 1682 movies with 100,000 ratings, while the latter includes 6040 users and 3706 movies with 1,000,209 ratings. Each rating is on a scale of 1-5, and each user has rated at least 20 movies. These two datasets contain explicit feedback, whereas our goal is to investigate the performance of learning from implicit feedback. Therefore, they are transformed into implicit data: each entry is marked as 1 if the corresponding rating is at least 4, and 0 otherwise. For side information, user demographics including age, occupation and gender are used as auxiliary user information, and movie descriptions (genres) are taken as auxiliary item information.
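As an illustration of this preprocessing, a sketch assuming pandas and the standard tab-separated ML100K ratings file:

```python
import pandas as pd

# MovieLens-100K ratings: user id, item id, rating (1-5), timestamp.
ratings = pd.read_csv('u.data', sep='\t',
                      names=['user', 'item', 'rating', 'ts'])
# Explicit-to-implicit conversion: ratings of at least 4 become 1, else 0.
ratings['implicit'] = (ratings['rating'] >= 4).astype(int)
```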

The Yelp dataset contains customer reviews of local businesses. Each review carries a rating from 1 to 5, and the user-item matrix is binarized with a threshold of 3. Reviews whose text is not written in English, and businesses other than restaurants, are filtered out. To reduce sparsity, users with fewer than 5 reviews and businesses rated by fewer than 30 users are removed. Moreover, we merge repeated ratings at different timestamps into the earliest one, so as to study the performance of recommending new items to a user. The final dataset contains 25,815 users, 25,677 items, and 730,791 ratings.

The EPext dataset contains users' ratings of the articles/reviews written about products on Epinions.com. This dataset is extremely sparse (its density is about 0.015%): it includes over 13,000,000 ratings on a 5-star scale from 120,492 users on 755,760 items (articles/reviews). The ratings are likewise binarized to 1 as implicit feedback. EPext also contains 717,667 trust relations and 123,705 distrust statements, which serve as user side information. For item side information, each article/review has an associated topic (subject), which is regarded as auxiliary information.

4.1.2 Baselines and evaluation metrics

To evaluate the proposed NVCF framework and its three instantiations, six representative CF models are selected as baselines.

BPR [7] optimizes the MF model with a pairwise ranking loss, to learn from implicit feedback. It is a highly competitive baseline for item recommendation.

mDCF [12] employs SDAE to extract features from user and item auxiliary information and uses MF to determine user-item latent relations.

NeuMF [14] is a model proposed within the NCF framework, which combines the hidden layers of GMF and MLP to learn the user-item interaction function.

NFM [16] generalizes factorization machine for CF, and combines the factorization machine and neural network to incorporate both feedback and content.

CVAE [24] is a Bayesian generative model that couples a deep generative model of item content with CF over ratings, bridging auxiliary information and the deep architecture.

VAECF [26] is a state-of-the-art method that directly applies VAE to CF with implicit feedback.

To evaluate the performance of our models, two common evaluation metrics for top-k recommendation are adopted: Hit Ratio (HR) [14] and Normalized Discounted Cumulative Gain (NDCG) [33]. HR@k is a recall-based metric that measures whether the test item appears in the top-k positions. NDCG@k assigns higher scores to items ranked closer to the top of the top-k list.
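For reference, a minimal sketch of both metrics for a single user's ranked list, assuming one held-out positive item per user:

```python
import math

def hr_at_k(ranked_items, positive, k=10):
    """HR@k: 1 if the held-out positive item appears in the top-k list."""
    return int(positive in ranked_items[:k])

def ndcg_at_k(ranked_items, positive, k=10):
    """NDCG@k with a single relevant item: 1/log2(rank+2) for a hit at
    zero-based position `rank`, 0 otherwise."""
    if positive in ranked_items[:k]:
        return 1.0 / math.log2(ranked_items.index(positive) + 2)
    return 0.0
```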

4.1.3 Parameter settings

For the training set, four negative instances are sampled for each positive instance. The model parameters are randomly initialized from a Gaussian distribution with mean 0 and standard deviation 0.01. Similar to [34], a mini-batch Adam method is employed to optimize the model, with the learning rate and batch size set to 0.001 and 128, respectively. In the feature extraction step, K is set to 128. The two generative networks each have two hidden layers with ReLU activations, and the two prior networks each have one hidden layer. The last layer of each generative network uses a sigmoid activation, and the parameter \(\beta\) is set to 0.2 to achieve the best VAE performance, following [26]. In the NVCF step, \(\alpha\) is set to 0.5, allowing sGMF and sMLP to contribute equally to the initialization of sNVMF, and the dimension of the latent vectors is defined as the number of neurons in the last NVCF layer. Specifically, the number of sMLP layers is set to 4 for the best performance [14].

4.2 Experimental results

4.2.1 Overall performance

In our experiments, each dataset is split into a training set and a testing set: a random sample of 80% of the ratings is used for training, and the remaining 20% for testing. Tables 3 and 4 list the top-k recommendation results of all methods on the four datasets, in terms of HR@5/NDCG@5 and HR@10/NDCG@10, respectively.

Table 3 Performance comparison between all methods on HR@5 and NDCG@5
Table 4 Performance comparison between all methods on HR@10 and NDCG@10

From Tables 3 and 4, we find that most neural network-based methods (NeuMF, NFM, VAECF, sGMF, sMLP and sNVMF) outperform the linear baselines, which demonstrates that deep neural networks help to learn more subtle and better latent user and item representations. We also find that although sMLP is slightly less robust than NFM in terms of HR@5 and NDCG@5 on ML100K, Yelp and EPext, it performs better than the other non-NVCF models in terms of HR@10 and NDCG@10 on all datasets. sGMF, sMLP and sNVMF clearly outperform the other baselines in most cases, and sNVMF achieves the best performance on all datasets under both metrics, which indicates the effectiveness of the NVCF framework for the CF task. Most VAE-based non-linear models (VAECF, sGMF, sMLP and sNVMF) achieve promising performance, which suggests that the Bayesian nature and non-linearity of neural networks facilitate inferring better latent preferences of users and items. Among the VAE models, our three NVCF-based methods outperform VAECF and CVAE on all metrics and datasets, which shows the advantages of our VAE-boosted NVCF framework.

4.2.2 Performance in cold-start scenarios

To evaluate our models in different cold-start scenarios, we form evaluation sets with different cold ratios. The datasets are split into a training set (80%), a validation set (10%) and a test set (10%). For 30% cold users, we randomly choose 30% of the samples in the test sets and assign each sample a new user id that appears only in that sample. We evaluate our models in the 30% cold-user (Cold-U) and 30% cold-item (Cold-V) scenarios on all datasets in terms of NDCG@5 and NDCG@10. BPR, NeuMF and VAECF use only rating information and cannot manage cold-start scenarios well, so they are not compared with our methods under the NVCF framework.

Table 5 Performance comparison between selected methods in cold-start on NDCG@5
Table 6 Performance comparison between selected methods in cold-start on NDCG@10

Tables 5 and 6 show the performance of our methods and the other hybrid methods in different cold-start scenarios. Because CVAE cannot handle the cold-user problem, we do not evaluate it in the cold-user scenario. Our methods significantly outperform the other methods for both cold items and cold users, and achieve more remarkable improvements over the other methods than in the case of testing on all users/items. These results indicate that using a neural network to model interactions between users and items works better than simply using the inner product, and demonstrate that our methods can provide high-quality recommendations in cold-start scenarios.

Fig. 3 sNVMF performance on HR@10 and NDCG@10 with K on the four datasets

4.2.3 Sensitivity analysis

In this section, we investigate the influence of the dimension of the latent space on the four datasets in terms of HR@10 and NDCG@10, sampling 80% of the data for training. Because sNVMF performs better than sGMF and sMLP, we focus on the performance of sNVMF for different dimensions of the latent vector. The dimension of the latent space K is set to 8, 16, 32, 64 and 128, respectively. According to Fig. 3, a larger dimension leads to better performance, and the optimal K of sNVMF for ML100K, ML1M, Yelp and EPext is 128. Thus, we set \(K = 128\) as the default for all datasets.

5 Conclusion

In this paper, we proposed a new hybrid deep framework, NVCF, for top-k recommendation, with three instantiations, sGMF, sMLP and sNVMF, which incorporates a unified deep generative model for hybrid deep collaborative filtering. The NVCF framework models the generative processes of both users and items, which enables it to generate recommendations under different cold-start scenarios. Our methods incorporate users' and items' side information through two parallel VAE networks, in order to mitigate rating sparsity and facilitate modeling user/item features. For inference, we proposed an SGVB approach to approximate the posteriors of the latent user and item variables. Owing to its Bayesian nature and non-linearity, NVCF can learn better latent user/item factors and deal with the cold-start problem from a fully Bayesian probabilistic view. Experimental results show that our methods achieve the best performance and can effectively handle the user/item cold-start problem.

In the future, we plan to incorporate more structural auxiliary information to further improve recommendation precision, for example employing knowledge graphs to integrate item knowledge and social networks with the user-item graph, establishing knowledge-aware connections between users and items. In addition, the NVCF framework is not limited to textual auxiliary information and can be extended to multimedia auxiliary information, such as videos and images. Recommender systems built on multimedia containing richer visual semantics can better capture users' preferences and provide more effective recommendations.