1 Introduction

Recommendation systems (RS) are of paramount importance in social networks and e-commerce platforms. An RS aims at inferring users’ preferences over items from their previous interactions. Traditional methods for RS fall into two classes [1]: content-based and collaborative filtering (CF). Content-based methods exploit the features of users and items and recommend items similar to those the user has liked before. CF methods exploit the rating histories of users with similar interests to make recommendations, and have been widely adopted in RS due to their impressive performance. Matrix factorization (MF) is one of the most successful and popular CF approaches: it first infers users’ and items’ latent factors from a user-item rating matrix and then recommends to each user the items whose latent factors are most similar to the user’s [2]. However, most MF methods, including probabilistic matrix factorization (PMF), which learns parameterized probabilistic matrices of users and items whose inner product reconstructs the ratings, suffer from poor latent representations of users and items and from data sparsity.

To overcome these problems, we propose a novel deep generative model, namely Neural Variational Matrix Factorization (NVMF), that incorporates side information (features) of both users and items to capture better latent representations for more effective collaborative-filtering recommendation. NVMF infers the posterior distributions (Bayesian estimation) of the latent factors of users and items through two end-to-end variational autoencoder neural networks, a “user neural network” and an “item neural network”, each consisting of a generative network (i.e., decoder) and an inference network (i.e., encoder), which model the generative processes of the user and item latent factors. The main contributions of this paper are:

  • We propose a novel deep generative model, namely neural variational matrix factorization (NVMF), that incorporates side information via a novel deep generative process to capture more subtle relations between latent factors and side information. To the best of our knowledge, this is the first work that combines a deep generative model with matrix factorization for collaborative filtering.

  • To infer NVMF effectively, we propose an inference algorithm that applies Stochastic Gradient Variational Bayes (SGVB) estimation to the user and item neural networks to approximate the parameters of NVMF and infer the latent factors of users and items, improving solution quality over traditional inference methods that optimize the variational parameters one by one. We derive the variational evidence lower bounds for our proposed model.

  • We conduct extensive experiments and performance evaluation. The experimental results show that our proposed NVMF method outperforms major state-of-the-art CF methods in recommendation accuracy.

The paper is organized as follows: Section 2 introduces related work; Section 3 gives preliminaries and problem statement; Section 4 describes the deep generative process of our user and item neural networks; Section 5 presents the inference process for estimating the posterior probability distributions of latent variables of users and items; Section 6 shows the results of experiments and performance evaluation. Section 7 concludes the paper. A conference version containing some preliminary results has been accepted by PAKDD 2019 [35].

2 Related work

On matrix factorization (MF) for collaborative filtering (CF), a series of improved methods has been proposed, including non-negative matrix factorization (NNMF) [3], max-margin matrix factorization [4] and, most notably, probabilistic matrix factorization (PMF) [5], which learns parameterized probabilistic matrices of users and items whose inner product reconstructs the ratings. To overcome the data sparsity problem suffered by these methods, hybrid methods that combine the advantages of different categories of CF methods have been proposed. Some hybrid methods [6,7,8,9,10] take advantage of both content-based and PMF methods and incorporate side information, such as the demographics of a user or the type of an item, into PMF. Despite their better performance compared with PMF, most hybrid methods apply Gaussian or Poisson distributions to model the generative process of users’ ratings and may therefore learn poor representations of users and items from complex data. Furthermore, most of these hybrid methods incorporate side information in a linear-regression manner, which prevents them from capturing the complex relations between latent factors and side information.

On the other hand, deep learning has achieved state-of-the-art results in various fields such as natural language processing [11], computer vision [12] and speech recognition [13, 14] owing to its powerful representation ability. Thus, some researchers have applied deep learning to the CF task. Deep Collaborative Filtering (DCF) [15] is a general deep architecture for hybrid CF that integrates PMF with deep neural networks. Wang et al. [16] proposed collaborative deep learning (CDL), which integrates a stacked denoising autoencoder (SDAE) into probabilistic matrix factorization and achieves state-of-the-art results on real-world datasets. The Collaborative Filtering Neural network (CFN) [17] incorporates side information to mitigate the matrix sparsity and cold-start problems. Recently, Dong et al. [18] proposed the additional stacked denoising autoencoder (aSDAE), which combines an autoencoder with matrix factorization and incorporates side information into each layer of the neural network, yielding strong results. However, these deep-learning-based methods are all deterministic and unable to capture the uncertainties of the latent representations of users and items. Moreover, stacking very deep neural networks makes such models hard to train and prone to overfitting. Thus, how to learn effective and robust latent representations of users and items needs further investigation.

Recently, deep generative models [19,20,21] have attracted significant research effort since they possess both the non-linearity of neural networks and, owing to their Bayesian nature, the ability to obtain more subtle latent representations of users and items. Li et al. [22] proposed the Collaborative Variational Autoencoder (CVAE), which utilizes a VAE to extract latent item information and incorporates it into matrix factorization. Liang et al. [23] proposed the VAE-CF model for the CF task. Chen et al. proposed a collective VAE [24] that incorporates side information into VAE-CF. However, these VAE-based methods do not consider the side information of both users and items. They simply use the same Gaussian prior for all users and items, which is unrealistic in most cases, can lead to poor latent representations, and can suffer from the posterior collapse problem [25] of VAEs.

In this paper, we address the above problems by proposing a new CF method, a neural-network-based variational matrix factorization, that models the relationship between latent factors and side information through a novel deep generative process so as to alleviate data sparsity and learn latent representations effectively. In addition, our method, a variant of the variational autoencoder for MF, differs from existing neural-network-based hybrid methods by treating the neural network as a complete Bayesian probabilistic framework, combining the advantages of deep learning and probabilistic matrix factorization to capture more subtle latent representations together with their uncertainties.

3 Preliminaries

We first introduce Probabilistic Matrix Factorization and then give our problem definition.

3.1 Probabilistic matrix factorization

Our proposed neural variational matrix factorization method is built on the well-known probabilistic matrix factorization (PMF) [26], which can be viewed as a parameterized projection of two probabilistic matrices such that their inner product reconstructs the observed ratings. The goal of PMF is to find two low-rank latent factor matrices U and V that approximate the user-item feedback matrix: \(\boldsymbol{R}\approx\boldsymbol{U}^{\top}\boldsymbol{V}\). PMF provides a probabilistic framework for learning the latent factor matrices by assuming a linear model with Gaussian observation noise and Gaussian priors on the latent factors (see Fig. 1, left); this amounts to minimizing the sum of squared errors between the observed feedback matrix R and the inner products of the latent factors (U, V), plus two quadratic regularization terms, as in the objective function (3). Specifically, PMF considers the conditional distribution of the observed rating matrix R given the latent factors U and V, and the prior distributions of U and V, as follows:

$$ \begin{array}{@{}rcl@{}} &&p(\boldsymbol{R}|\boldsymbol{U},\boldsymbol{V},\alpha) =\prod\limits_{i=1}^{M}\prod\limits_{j=1}^{N}\left[\mathcal{N}(R_{ij}|\boldsymbol{u_{i}}^{\top}\boldsymbol{v_{j}},\alpha^{-1}) \right]^{I_{ij}}, \end{array} $$
(1)
$$ \begin{array}{@{}rcl@{}} &&p(\boldsymbol{U}|\alpha_{1}) =\prod\limits_{i=1}^{M}\mathcal{N}(\boldsymbol{u_{i}}|0,\alpha_{1}^{-1}\boldsymbol{I}), ~~p(\boldsymbol{V}|\alpha_{2}) =\prod\limits_{j=1}^{N}\mathcal{N}(\boldsymbol{v_{j}}|0,\alpha_{2}^{-1}\boldsymbol{I}), \end{array} $$
(2)

where \(\mathcal {N}(\cdot , \cdot )\) represents a Gaussian distribution with the given mean and precision, and Iij is an indicator that equals 1 if user i has rated item j (i.e., the entry Rij is observed in R) and 0 otherwise. α, α1 and α2 are the precision parameters of the corresponding Gaussian distributions. PMF is learned by maximum a posteriori (MAP) estimation; maximizing the posterior distribution is equivalent to minimizing the following loss function:

$$ \begin{array}{@{}rcl@{}} \underset{\boldsymbol{U},\boldsymbol{V}}{\arg\min} \mathcal{L} (\boldsymbol{U}, \boldsymbol{V})& = & \frac{1}{2}\sum\limits_{i=1}^{M}\sum\limits_{j=1}^{N} I_{ij}(R_{ij}-\boldsymbol{u_{i}}^{\top} \boldsymbol{v_{j}})^{2}\\ &&+\frac{\lambda_{1}}{2}\sum\limits_{i=1}^{M}\left\| \boldsymbol{u_{i}}\right\|^{2}_{F} +\frac{\lambda_{2}}{2}\sum\limits_{j=1}^{N}\left\|\boldsymbol{v_{j}}\right\|^{2}_{F} , \end{array} $$
(3)

where ∥⋅∥F denotes the Frobenius norm, λ1 = α1/α and λ2 = α2/α. Several improved methods based on PMF have been proposed, such as Bayesian matrix factorization [8, 9] and hierarchical Bayesian matrix factorization [7], which incorporate side information into PMF and learn the full posterior of the latent factors U and V. However, these models incorporate side information in a linear-regression manner, which prevents them from capturing the subtle, non-linear relations between the latent factors of users and items and their side information. These models [7,8,9] also have too many local variational parameters to store and optimize. To address these drawbacks, in Section 4 we propose a neural variational matrix factorization (NVMF) that integrates side information into probabilistic MF via a deep generative model, together with a Stochastic Gradient Variational Bayes [19] inference algorithm that avoids optimizing every variational parameter individually.
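For concreteness, the following minimal sketch, which is our own and not part of the original formulation, minimizes the PMF objective (3) by simple gradient descent; all names and hyperparameter values are illustrative.

```python
import numpy as np

def pmf_map(R, mask, D=30, lam1=0.1, lam2=0.1, lr=0.01, epochs=100, seed=0):
    """MAP estimation for PMF: minimize the loss in Eq. (3) by gradient descent.

    R    : (M, N) rating matrix (zeros where unobserved)
    mask : (M, N) indicator I_ij, 1 where R_ij is observed
    """
    rng = np.random.default_rng(seed)
    M, N = R.shape
    U = 0.1 * rng.standard_normal((D, M))   # user latent factors
    V = 0.1 * rng.standard_normal((D, N))   # item latent factors
    for _ in range(epochs):
        E = mask * (R - U.T @ V)            # residuals on observed entries only
        U += lr * (V @ E.T - lam1 * U)      # gradient step on U
        V += lr * (U @ E - lam2 * V)        # gradient step on V
    return U, V

# usage: U, V = pmf_map(R, (R > 0).astype(float)); R_hat = U.T @ V
```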

Fig. 1

Graphical models of Probabilistic Matrix Factorization (left) and our model NVMF (right), where shaded nodes are observed variables

3.2 Problem definition and notations

The problem we address in this paper is to infer the posterior latent factors of users and items and predict the missing values in the user-item feedback matrix R, given R, F and G, via probabilistic matrix factorization. Accordingly, our neural-network-based neural variational matrix factorization, NVMF, is essentially a function Ξ that satisfies the following:

$$ \boldsymbol{R}, \boldsymbol{F}, \boldsymbol{G} \stackrel{\Xi}{\longrightarrow} \boldsymbol{U}, \boldsymbol{V}, $$
(4)

where R is a feedback matrix (e.g., a rating matrix) with M and N being the total numbers of users and items, respectively; \(\boldsymbol {F}\in \mathbb {R}^{P\times M}\) and \(\boldsymbol {G}\in \mathbb {R}^{Q\times N}\) are the side information matrices of all users and items, respectively, with P and Q being the dimensions of each user’s and each item’s side information; \( \boldsymbol {U}=[\boldsymbol {u}_{1},...\boldsymbol {u}_{M}]\in \mathbb {R}^{D\times M}\) and \(\boldsymbol {V}=[\boldsymbol {v}_{1},...\boldsymbol {v}_{N}]\in \mathbb {R}^{D\times N}\) are the two low-rank latent factor matrices of users and items, respectively, with D denoting the dimension of the latent factor space.

There are two types of feedback matrix: explicit feedback (\(\boldsymbol {R} \in \mathbb {R}^{M\times N}\)) and implicit feedback (R ∈ {0,1}M×N). Like recent recommendation methods [22, 27], we focus on implicit feedback, as it is the more challenging setting. For convenience, we represent each user i’s rating scores over all items, including the missing/unobserved ones, as \(\boldsymbol {s}_{i}^{u}=[R_{i1},...,R_{iN}]\in \mathbb {R}^{N\times 1}\), where Rij is an element of R. Similarly, we represent each item j’s rating scores from all users, including those who did not rate j, as \(\boldsymbol {s}_{j}^{v}=[R_{1j},...,R_{Mj}]\in \mathbb {R}^{M\times 1}\). We call \(\boldsymbol {s}_{i}^{u}\) and \(\boldsymbol {s}_{j}^{v}\) the collaborative information of user i and item j, respectively. Our task is thus to infer each user’s and item’s latent factors, ui and vj, from R, F and G, and then use ui and vj to predict the missing Rij.
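As an illustration of this setup, the small sketch below (our own; array names and toy shapes are assumptions) shows how a user’s collaborative information and features, and an item’s, are read from R, F and G.

```python
import numpy as np

# Toy shapes only; names are illustrative, not from the paper.
M, N, P, Q = 4, 5, 3, 2
rng = np.random.default_rng(0)
R = (rng.random((M, N)) > 0.7).astype(float)   # implicit feedback, M x N
F = rng.random((P, M))                          # user side information, P x M
G = rng.random((Q, N))                          # item side information, Q x N

def user_observation(i):
    """(s_i^u, f_i): user i's ratings over all items plus the user's features."""
    return R[i, :], F[:, i]

def item_observation(j):
    """(s_j^v, g_j): all users' ratings of item j plus the item's features."""
    return R[:, j], G[:, j]

s_u, f_i = user_observation(0)   # shapes (N,) and (P,)
s_v, g_j = item_observation(2)   # shapes (M,) and (Q,)
```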

Like most probabilistic models [7, 9, 10], our model involves three implementation steps: (1) define a generative model that describes the generative process of the observed variables (R); (2) infer the posterior of the latent variables (U and V) conditioned on the observed variables; (3) use the inferred posterior to predict the missing values in R. Note that, unlike traditional MF methods [2], which yield point estimates of the latent factors, the Bayesian matrix factorization methods mentioned above and our model infer the full posterior distributions of the latent factors and then use the inferred posteriors to predict the missing values.

Table 1 is a list of mathematical symbols used in the paper:

Table 1 Glossary

4 The generative process

The generative process of the probabilistic graphical model of our NVMF is shown in Fig. 1 (right). Unlike other matrix factorization models (Fig. 1, left) that directly utilize the rating values Rij in the user-item matrix R to infer the latent factors of users and items, our NVMF considers a more general setting in which the i-th user’s latent factor ui is jointly inferred from both the user’s rating history over all items \(\boldsymbol {s}_{i}^{u}\) (the collaborative information of user i) and the user’s features fi. Similarly, the j-th item’s latent factor vj is jointly inferred from both the users’ rating history on the item \(\boldsymbol {s}_{j}^{v}\) (the collaborative information of item j) and the item’s own features gj. Under this consideration, we first model the features of users and items through a latent variable model. Although we do not know the true distributions of user and item features, any distribution can be generated by mapping a standard Gaussian through a sufficiently complicated function [28].

Thus, for user i, given a standard Gaussian latent variable bi assigned to the user, the user’s features fi are generated from bi through a neural network, called the “user generative network” (see Fig. 2), governed by the parameters 𝜃 of the network, such that we have:

$$ \boldsymbol{b}_{i} \sim \mathcal{N}(0,\boldsymbol{I}_{K_{b}}),~~\boldsymbol{f}_{i}\sim p_{\theta}(\boldsymbol{f}_{i}|\boldsymbol{b}_{i}), $$
(5)

where \(\boldsymbol {I}_{K_{b}}\) is the covariance matrix, Kb is the dimension of bi, and the specific form of the probability of generating fi given bi, p𝜃(fi|bi), depends on the type of data. For instance, if fi is a binary vector, p𝜃(fi|bi) can be a multivariate Bernoulli distribution Ber(F𝜃(bi)), with F𝜃(⋅) being a highly non-linear function parameterized by the parameters 𝜃 of the neural network. Similarly, for item j, its features gj are generated from a standard Gaussian latent variable dj through another generative network, called the “item generative network” (see Fig. 2), governed by the parameters τ of the network, such that we have:

$$ \boldsymbol{d}_{j} \sim \mathcal{N}(0,\boldsymbol{I}_{K_{d}}),~~\boldsymbol{g}_{j}\sim p_{\tau}(\boldsymbol{g}_{j}|\boldsymbol{d}_{j}), $$
(6)

where \(\boldsymbol {I}_{K_{d}}\) is the covariance matrix and Kd is the dimension of dj. Traditional PMF assumes that the prior distributions of the user latent factor ui and the item latent factor vj are standard Gaussians and predicts ratings only from collaborative information such as the user-item feedback matrix. In our model, to further enhance performance, we believe the user’s features fi can also positively contribute to the inference of the latent factor ui, in addition to the collaborative information. Similarly, to better infer the j-th item’s latent factor vj, we also fully utilize the item’s features gj. Unlike most MF methods [7, 8, 29] that incorporate side information via linear regression, in order to capture more subtle latent relations we model the conditional priors p(ui|fi) and p(vj|gj) as Gaussian distributions such that \(p(\boldsymbol {u}_{i}|\boldsymbol {f}_{i})=\mathcal {N}(\mu _{u}(\boldsymbol {f}_{i}), {\Sigma }_{u}(\boldsymbol {f}_{i}))\) and \(p(\boldsymbol {v}_{j}|\boldsymbol {g}_{j})=\mathcal {N}(\mu _{v}(\boldsymbol {g}_{j}), {\Sigma }_{v}(\boldsymbol {g}_{j}))\), where

$$ \begin{array}{@{}rcl@{}} &&\mu_{u}(\boldsymbol{f}_{i})=F_{\mu_{u}}(\boldsymbol{f}_{i}), \qquad{\Sigma}_{u}(\boldsymbol{f}_{i})=\text{diag}(\exp(F_{\delta_{u}}(\boldsymbol{f}_{i}))), \end{array} $$
(7)
$$ \begin{array}{@{}rcl@{}} &&\mu_{v}(\boldsymbol{g}_{j})=G_{\mu_{v}}(\boldsymbol{g}_{j}), \qquad{\Sigma}_{v}(\boldsymbol{g}_{j})=\text{diag}(\exp(G_{\delta_{v}}(\boldsymbol{g}_{j}))), \end{array} $$
(8)

where \(F_{\mu _{u}}(\cdot )\) and \(F_{\delta _{u}}(\cdot )\) are two highly non-linear functions parameterized by μu and δu in a neural network, the user prior network, shared by all users, and \(G_{\mu _{v}}(\cdot )\) and \(G_{\delta _{v}}(\cdot )\) are the corresponding non-linear functions parameterized by μv and δv in another neural network, the item prior network, shared by all items. For simplicity, we write γ = {μu, δu} and ψ = {μv, δv}. For the collaborative information of user i, \(\boldsymbol {s}_{i}^{u}\), we assign a standard Gaussian latent variable ai and assume that the user latent factor ui can also affect the collaborative information. We therefore consider \(\boldsymbol {s}_{i}^{u}\) to be generated from both the standard Gaussian latent variable ai and the user latent factor ui, governed by the parameters α of the generative network (see Fig. 2 and its caption), such that we have:

$$ \boldsymbol{a}_{i} \sim \mathcal{N}(0,\boldsymbol{I}_{K_{a}}), ~~\boldsymbol{s}_{i}^{u}\sim p_{\alpha}(\boldsymbol{s}_{i}^{u}|\boldsymbol{a}_{i},\boldsymbol{u}_{i}), $$
(9)

where \(\boldsymbol {I}_{K_{a}}\) is the covariance matrix and Ka is the dimension of ai. Similarly, the j-th item’s collaborative information, \(\boldsymbol {s}_{j}^{v}\), is generated from its standard Gaussian latent variable cj and item latent factor vj, and is governed by the parameter β in the generative network such that we have:

$$ \boldsymbol{c}_{j} \sim \mathcal{N}(0,\boldsymbol{I}_{K_{c}}),~~\boldsymbol{s}_{j}^{v}\sim p_{\beta}(\boldsymbol{s}_{j}^{v}|\boldsymbol{c}_{j},\boldsymbol{v}_{j}), $$
(10)

where \(\boldsymbol {I}_{K_{c}}\) is the covariance matrix and Kc is the dimension of cj. As with the probability distribution p𝜃(fi|bi) in (5), the specific forms of the probability distributions in (6), (9) and (10) depend on the type of data. The rating Rij is drawn from a Gaussian distribution whose mean is the inner product of the latent factors of user i and item j, such that we have:

$$ p(R_{ij}|\boldsymbol{u}_{i},\boldsymbol{v}_{j})=\mathcal{N}(\boldsymbol{u}_{i}^{\top}\boldsymbol{v}_{j},C_{ij}^{-1}). $$
(11)

where Cij is the precision of the Gaussian distribution (so \(C_{ij}^{-1}\) is its variance), and, similar to collaborative topic modeling [27], Cij serves as a confidence parameter for the rating Rij, defined as:

$$ C_{ij}=\left\{ \begin{array}{lr} \varphi_{1} \quad \text{if} \quad R_{ij}\neq0,&\\ \varphi_{2} \quad \text{if} \quad R_{ij}=0,& \end{array} \right. $$
(12)

where φ1 and φ2 are parameters satisfying φ1 > φ2 > 0; the rationale is that Rij = 0 may mean either that user i is not interested in item j or that user i is unaware of it. Figure 1 shows the graphical model corresponding to the generative process defined above (i.e., (5)-(11)). Based on this generative process, the joint probability of all observed and latent variables can be written as follows:

$$ \begin{array}{@{}rcl@{}} && p(\boldsymbol{R},\boldsymbol{F},\boldsymbol{G},\mathcal{Z}; \boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\beta}, \boldsymbol{\psi})=\prod\limits_{i=1}^{M}\prod\limits_{j=1}^{N} \underbrace{p(\boldsymbol{a}_{i})p(\boldsymbol{b}_{i})p_{\boldsymbol{\theta}}(\boldsymbol{f}_{i}|\boldsymbol{b}_{i})p_{\boldsymbol{\gamma}}(\boldsymbol{u}_{i}|\boldsymbol{f_{i}})p_{\boldsymbol{\alpha}}(\boldsymbol{s}_{i}^{u}|\boldsymbol{a}_{i},\boldsymbol{u}_{i})}_{\text{for users}} \cdot \\ && \hspace{0.3in} \underbrace{p(\boldsymbol{c}_{j})p(\boldsymbol{d}_{j})p_{\boldsymbol{\tau}}(\boldsymbol{g}_{j}|\boldsymbol{d}_{j})p_{\boldsymbol{\psi}}(\boldsymbol{v}_{j}|\boldsymbol{g}_{j})p_{\boldsymbol{\beta}}(\boldsymbol{s}_{j}^{v}|\boldsymbol{c}_{j},\boldsymbol{v}_{j})}_{\text{for items}}p(R_{ij}|\boldsymbol{u}_{i},\boldsymbol{v}_{j}), \end{array} $$
(13)
Fig. 2

The architecture of NVMF: user network (left) and item network (right) infer latent factors for users and items, shaded rectangles are observed vectors

Thus the marginal distribution of all observed variables (R, F and G) is:

$$ \begin{array}{@{}rcl@{}} && p(\boldsymbol{R},\boldsymbol{F},\boldsymbol{G}; \boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\beta}, \boldsymbol{\psi})=\int \prod\limits_{i=1}^{M}\prod\limits_{j=1}^{N} \underbrace{p(\boldsymbol{a}_{i})p(\boldsymbol{b}_{i})p_{\boldsymbol{\theta}}(\boldsymbol{f}_{i}|\boldsymbol{b}_{i})p_{\boldsymbol{\gamma}}(\boldsymbol{u}_{i}|\boldsymbol{f_{i}})p_{\boldsymbol{\alpha}}(\boldsymbol{s}_{i}^{u}|\boldsymbol{a}_{i},\boldsymbol{u}_{i})}_{\text{for users}} \cdot \\ && \hspace{0.3in} \underbrace{p(\boldsymbol{c}_{j})p(\boldsymbol{d}_{j})p_{\boldsymbol{\tau}}(\boldsymbol{g}_{j}|\boldsymbol{d}_{j})p_{\boldsymbol{\psi}}(\boldsymbol{v}_{j}|\boldsymbol{g}_{j})p_{\boldsymbol{\beta}}(\boldsymbol{s}_{j}^{v}|\boldsymbol{c}_{j},\boldsymbol{v}_{j})}_{\text{for items}}p(R_{ij}|\boldsymbol{u}_{i},\boldsymbol{v}_{j}) d\mathcal{Z}, \end{array} $$
(14)

where \(\mathcal {Z}=\left \{\boldsymbol {U},\boldsymbol {V},\boldsymbol {A},\boldsymbol {B},\boldsymbol {C},\boldsymbol {D}\right \}\) is the set of all latent variables in (13) that need to be inferred, and \(\mathcal {Z}_{ij} =\left \{\boldsymbol {u}_{i},\boldsymbol {v}_{j},\boldsymbol {a}_{i}, \boldsymbol {b}_{i}, \boldsymbol {c}_{j},\boldsymbol {d}_{j}\right \}\).
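To make the generative process concrete, the following simplified PyTorch sketch (our own rendering under our own assumptions; layer sizes, hidden widths and the Bernoulli output are illustrative choices, since the paper only states that the output distribution depends on the data type) shows the user-side prior network of (7) and the generative network of (9); the item side is symmetric.

```python
import torch
import torch.nn as nn

class UserPriorNetwork(nn.Module):
    """Conditional prior p_gamma(u_i | f_i) = N(mu_u(f_i), diag(exp(delta_u(f_i))))."""
    def __init__(self, feat_dim, latent_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, latent_dim)        # F_{mu_u}
        self.log_var = nn.Linear(hidden, latent_dim)   # F_{delta_u}

    def forward(self, f):
        h = self.body(f)
        return self.mu(h), self.log_var(h)             # mean and log-variance of u_i

class UserGenerativeNetwork(nn.Module):
    """Decoder p_alpha(s_i^u | a_i, u_i); a Bernoulli output over the N items is assumed."""
    def __init__(self, num_items, a_dim, latent_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(a_dim + latent_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_items), nn.Sigmoid())

    def forward(self, a, u):
        return self.net(torch.cat([a, u], dim=-1))     # probabilities for s_i^u

# usage sketch: prior = UserPriorNetwork(P, D); decoder = UserGenerativeNetwork(N, K_a, D)
```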

5 The inference process

We now show how to infer the posterior distributions over the latent variables \(\mathcal {Z}\), find the parameters (𝜃, α, γ, τ, β, ψ) that maximize the marginal distribution in (16), and predict the ratings in R.

5.1 Estimation of latent variables and parameters

Based on the generative process defined in last section, the joint probability of all observation and the latent variables can be written as follows:

$$ \begin{array}{@{}rcl@{}} && p(\boldsymbol{R},\boldsymbol{F},\boldsymbol{G},\mathcal{Z}; \boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\beta}, \boldsymbol{\psi})=\prod\limits_{i=1}^{M}\prod\limits_{j=1}^{N} \underbrace{p(\boldsymbol{a}_{i})p(\boldsymbol{b}_{i})p_{\boldsymbol{\theta}}(\boldsymbol{f}_{i}|\boldsymbol{b}_{i})p_{\boldsymbol{\gamma}}(\boldsymbol{u}_{i}|\boldsymbol{f_{i}})p_{\boldsymbol{\alpha}}(\boldsymbol{s}_{i}^{u}|\boldsymbol{a}_{i},\boldsymbol{u}_{i})}_{\text{user part}} \cdot \\ && \hspace{0.3in} \underbrace{p(\boldsymbol{c}_{j})p(\boldsymbol{d}_{j})p_{\boldsymbol{\tau}}(\boldsymbol{g}_{j}|\boldsymbol{d}_{j})p_{\boldsymbol{\psi}}(\boldsymbol{v}_{j}|\boldsymbol{g}_{j})p_{\boldsymbol{\beta}}(\boldsymbol{s}_{j}^{v}|\boldsymbol{c}_{j},\boldsymbol{v}_{j})}_{\text{item part}} \underbrace{p(R_{ij}|\boldsymbol{u}_{i},\boldsymbol{v}_{j})}_{\text{PMF part}}, \end{array} $$
(15)

From (15), we can view our NVMF as three parts: the user part, the item part and the PMF part. The user part and the item part model the inner relations between side information and latent factors for users and items, respectively, while the PMF part models the interactions across users and items through their latent factors. Based on the joint distribution of all variables (15), the marginal distribution of all observed variables (R, F and G) is:

$$ \begin{array}{@{}rcl@{}} && p(\boldsymbol{R},\boldsymbol{F},\boldsymbol{G}; \boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{\gamma}, \boldsymbol{\tau}, \boldsymbol{\beta}, \boldsymbol{\psi})=\int \prod\limits_{i=1}^{M}\prod\limits_{j=1}^{N} p(\boldsymbol{a}_{i})p(\boldsymbol{b}_{i})p_{\boldsymbol{\theta}}\\ &&~~~\times(\boldsymbol{f}_{i}|\boldsymbol{b}_{i})p_{\boldsymbol{\gamma}}(\boldsymbol{u}_{i}|\boldsymbol{f_{i}})p_{\boldsymbol{\alpha}}(\boldsymbol{s}_{i}^{u}|\boldsymbol{a}_{i},\boldsymbol{u}_{i}) \cdot \\ &&~~~~p(\boldsymbol{c}_{j})p(\boldsymbol{d}_{j})p_{\boldsymbol{\tau}}(\boldsymbol{g}_{j}|\boldsymbol{d}_{j})p_{\boldsymbol{\psi}}(\boldsymbol{v}_{j}|\boldsymbol{g}_{j})p_{\boldsymbol{\beta}}(\boldsymbol{s}_{j}^{v}|\boldsymbol{c}_{j},\boldsymbol{v}_{j})p(R_{ij}|\boldsymbol{u}_{i},\boldsymbol{v}_{j}) d\mathcal{Z}, \end{array} $$
(16)

where \(\mathcal {Z}=\left \{\boldsymbol {U},\boldsymbol {V},\boldsymbol {A},\boldsymbol {B},\boldsymbol {C},\boldsymbol {D}\right \}\) is the set of all latent variables in (15) that need to be inferred, and \(\mathcal {Z}_{ij} =\left \{\boldsymbol {u}_{i},\boldsymbol {v}_{j},\boldsymbol {a}_{i}, \boldsymbol {b}_{i}, \boldsymbol {c}_{j},\boldsymbol {d}_{j}\right \}\). We focus on inferring the posterior distributions of the latent variables (U and V) in \(\mathcal {Z}\) and on finding the parameters (𝜃, α, γ, τ, β, ψ) that maximize the marginal distribution of the observed variables (R, F and G) in (16).

However, it is difficult to infer the latent variables in \(\mathcal {Z}\) using the traditional mean-field approximation, since our model has no conjugate probability distributions, which the traditional mean-field approach [30] requires, and the integral in the marginal distribution of the observed variables (R, F and G) in (16) is intractable. Inspired by the VAE [19], we use the Stochastic Gradient Variational Bayes (SGVB) estimator to estimate the posteriors of the user-related latent variables (ai, bi, ui) and the item-related latent variables (cj, dj, vj) by introducing two inference networks, the user inference network and the item inference network (see Fig. 2), parameterized by ϕ and λ, respectively.

To do this, assuming conditional independence, we first decompose the variational distribution q of the latent variables in \(\mathcal {Z}\) into two groups of variational distributions, qϕ and qλ, used by the two networks of our NVMF model, the user inference network and the item inference network (see Fig. 2):

$$ \begin{array}{@{}rcl@{}} q(\mathcal{Z}_{ij}|\mathcal{X}_{i}, \mathcal{Y}_{j}, R_{ij})& = &\underbrace{q_{\phi}(\boldsymbol{u}_{i}|\mathcal{X}_{i})q_{\phi}(\boldsymbol{a}_{i}|\mathcal{X}_{i})q_{\phi}(\boldsymbol{b}_{i}|\mathcal{X}_{i})}_{\text{for users}}\\ &&\cdot\underbrace{q_{\lambda}(\boldsymbol{v}_{j}|\mathcal{Y}_{j})q_{\lambda}(\boldsymbol{c}_{j}|\mathcal{Y}_{j})q_{\lambda}(\boldsymbol{d}_{j}|\mathcal{Y}_{j})}_{\text{for items}}, \end{array} $$
(17)

where \(\mathcal {X}_{i}=(\boldsymbol {s}_{i}^{u},\boldsymbol {f}_{i})\) represents the set of user observed variables, and \(\mathcal {Y}_{j}=(\boldsymbol {s}_{j}^{v},\boldsymbol {g}_{j})\) represents the set of item observed variables.

As in the VAE [19], each variational distribution is chosen to be a Gaussian \(\mathcal {N}(\boldsymbol {\mu },\boldsymbol {\Sigma })\) whose mean μ and covariance matrix Σ are outputs of the inference network. Thus, in our NVMF, for the latent variables related to the i-th user, we set:

$$ \begin{array}{@{}rcl@{}} &&q_{\phi}(\boldsymbol{u}_{i}|\mathcal{X}_{i})=\mathcal{N}(\mu_{\phi_{u_{i}}}(\mathcal{X}_{i}),\text{diag}(\exp(\delta_{\phi_{u_{i}}}(\mathcal{X}_{i})))), \end{array} $$
(18)
$$ \begin{array}{@{}rcl@{}} &&q_{\phi}(\boldsymbol{a}_{i}|\mathcal{X}_{i})=\mathcal{N}(\mu_{\phi_{a_{i}}}(\mathcal{X}_{i}),\text{diag}(\exp(\delta_{\phi_{a_{i}}}(\mathcal{X}_{i})))), \end{array} $$
(19)
$$ \begin{array}{@{}rcl@{}} &&q_{\phi}(\boldsymbol{b}_{i}|\mathcal{X}_{i})=\mathcal{N}(\mu_{\phi_{b_{i}}}(\mathcal{X}_{i}),\text{diag}(\exp(\delta_{\phi_{b_{i}}}(\mathcal{X}_{i})))), \end{array} $$
(20)

where the subscripts of μ and δ indicate the parameters in our user inference network corresponding to ui, ai and bi, respectively. Similarly, for the j-th item:

$$ \begin{array}{@{}rcl@{}} &&q_{\lambda}(\boldsymbol{v}_{j}|\mathcal{Y}_{j})=\mathcal{N}(\mu_{\lambda_{v_{j}}}(\mathcal{Y}_{j}),\text{diag}(\exp(\delta_{\lambda_{v_{j}}}(\mathcal{Y}_{j})))), \end{array} $$
(21)
$$ \begin{array}{@{}rcl@{}} &&q_{\lambda}(\boldsymbol{c}_{j}|\mathcal{Y}_{j})=\mathcal{N}(\mu_{\lambda_{c_{j}}}(\mathcal{Y}_{j}),\text{diag}(\exp(\delta_{\lambda_{c_{j}}}(\mathcal{Y}_{j})))), \end{array} $$
(22)
$$ \begin{array}{@{}rcl@{}} &&q_{\lambda}(\boldsymbol{d}_{j}|\mathcal{Y}_{j})=\mathcal{N}(\mu_{\lambda_{d_{j}}}(\mathcal{Y}_{j}),\text{diag}(\exp(\delta_{\lambda_{d_{j}}}(\mathcal{Y}_{j})))), \end{array} $$
(23)

where the subscripts of μ and δ indicate the parameters in the item inference network corresponding to vj, cj and dj, respectively.
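As an illustration (our own sketch; the actual architecture in Fig. 2 may differ), a user inference network can take \(\mathcal {X}_{i}=(\boldsymbol {s}_{i}^{u},\boldsymbol {f}_{i})\) and emit the Gaussian parameters of (18)-(20); the item inference network is analogous.

```python
import torch
import torch.nn as nn

class UserInferenceNetwork(nn.Module):
    """q_phi(u_i | X_i), q_phi(a_i | X_i) and q_phi(b_i | X_i), all diagonal Gaussians."""
    def __init__(self, num_items, feat_dim, d_u, d_a, d_b, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(num_items + feat_dim, hidden), nn.Tanh())
        # one (mean, log-variance) head per latent variable
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden, 2 * dim)
            for name, dim in {"u": d_u, "a": d_a, "b": d_b}.items()})

    def forward(self, s_u, f):
        h = self.encoder(torch.cat([s_u, f], dim=-1))   # X_i = (s_i^u, f_i)
        stats = {}
        for name, head in self.heads.items():
            mu, log_var = head(h).chunk(2, dim=-1)
            stats[name] = (mu, log_var)                 # Eqs. (18)-(20)
        return stats
```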

Thus, the tractable standard evidence lower bound (ELBO) under the variational distribution \(q(\mathcal {Z}_{ij}|\mathcal {X}_{i}, \mathcal {Y}_{j}, R_{ij})\) can be computed as follows:

$$ \begin{array}{@{}rcl@{}} \mathcal{L}(q)&=&\mathbb{E}_{q}[\log p(\mathcal{O},\mathcal{Z}) -\log q(\mathcal{Z}|\mathcal{O})] \\ &=&\sum\limits_{i=1}^{M}\sum\limits_{j=1}^{N}(\mathcal{L}_{i}(q_{\phi})+\mathcal{L}_{j}(q_{\lambda})+\mathbb{E}_{q}[\log p(R_{ij}|\boldsymbol{u}_{i},\boldsymbol{v}_{j})]), \end{array} $$
(24)

where \(\mathcal {O}=(\boldsymbol {F},\boldsymbol {G},\boldsymbol {R})\) is the set of all observed variables, and qϕ and qλ denote the user and item terms in (17), respectively. For user i and item j, we have:

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{i}&&(q_{\phi})=\mathcal{L}(\phi,\alpha,\theta,\gamma;\mathcal{X}_{i})=\mathbb{E}_{q_{\phi}(\boldsymbol{a}_{i},\boldsymbol{u}_{i}|\mathcal{X}_{i})}[\log p_{\alpha}(\boldsymbol{s}_{i}^{u}|\boldsymbol{a}_{i},\boldsymbol{u}_{i})] \\ &&+\mathbb{E}_{q_{\phi}(\boldsymbol{b}_{i}|\mathcal{X}_{i})}[\log p_{\theta}(\boldsymbol{f}_{i}|\boldsymbol{b}_{i})]-\text{KL}(q_{\phi}(\boldsymbol{a}_{i}|\mathcal{X}_{i})||p(\boldsymbol{a}_{i})) \\ &&-\text{KL}(q_{\phi}(\boldsymbol{b}_{i}|\mathcal{X}_{i})||p(\boldsymbol{b}_{i}))-\omega_{1} \text{KL}(q_{\phi}(\boldsymbol{u}_{i}|\mathcal{X}_{i}) || p_{\gamma}(\boldsymbol{u}_{i}|\boldsymbol{f}_{i})), \end{array} $$
(25)
$$ \begin{array}{@{}rcl@{}} \vspace{-0.5em} \mathcal{L}_{j}&&(q_{\lambda})=\mathcal{L}(\lambda,\beta,\tau,\psi;\mathcal{Y}_{j})=\mathbb{E}_{q_{\lambda}(\boldsymbol{c}_{j},\boldsymbol{v}_{j}|\mathcal{Y}_{j})}[\log p_{\beta}(\boldsymbol{s}_{j}^{v}|\boldsymbol{c}_{j},\boldsymbol{v}_{j})] \\ &&+\mathbb{E}_{q_{\lambda}(\boldsymbol{d}_{j}|\mathcal{Y}_{j})}[\log p_{\tau}(\boldsymbol{g}_{j}|\boldsymbol{d}_{j})]-\text{KL}(q_{\lambda}(\boldsymbol{c}_{j}|\mathcal{Y}_{j})||p(\boldsymbol{c}_{j})) \\ &&-\text{KL}(q_{\lambda}(\boldsymbol{d}_{j}|\mathcal{Y}_{j})||p(\boldsymbol{d}_{j}))-\omega_{2} \text{KL}(q_{\lambda}(\boldsymbol{v}_{j}|\mathcal{Y}_{j}) || p_{\psi}(\boldsymbol{v}_{j}|\boldsymbol{g}_{j})), \end{array} $$
(26)

where, in the standard ELBO, the free parameters ω1 and ω2 equal 1, and \(\text {KL}(q_{\phi }(\boldsymbol {u}_{i}|\mathcal {X}_{i})\)||pγ(ui|fi)) is the Kullback-Leibler divergence between the approximate posterior \(q_{\phi }(\boldsymbol {u}_{i}|\mathcal {X}_{i})\) and the prior pγ(ui|fi). The variational distribution \(q_{\phi }(\boldsymbol {u}_{i}|\mathcal {X}_{i})\) approximates the true posterior \(p(\boldsymbol {u}_{i}|\mathcal {O})\) when (25) is maximized; similarly, \(q_{\lambda }(\boldsymbol {v}_{j}|\mathcal {Y}_{j})\) approximates the true posterior \(p(v_{j} | \mathcal {O})\) when (26) is maximized. As discussed for the VAE [19], maximizing the ELBO also maximizes the marginal distribution of the observed variables \(\mathcal {O}\).

Inspired by the β-VAE and previous work [23, 31], we use two free trade-off parameters ω1 and ω2 on the last terms of (25) and (26), respectively, to control the KL regularization in the ELBO instead of directly setting ω1 = ω2 = 1 (the other KL terms in (25) and (26) need no trade-off parameters since they do not involve the final latent factors ui and vj). Since the posteriors of all latent variables in \(\mathcal {Z}\) are assumed Gaussian, the KL terms in (25) and (26) have analytical forms. The expectation terms in (25) and (26), however, cannot be computed analytically. To handle this, we use the Monte Carlo method [20] to approximate the expectations by drawing samples from the posterior distributions of the latent variables in \(\mathcal {Z}\). Using the reparameterization trick [20], the ELBO for the user network is given by:

$$ \begin{array}{@{}rcl@{}} \mathcal{L}&&(\phi,\alpha,\theta,\gamma;\mathcal{X}_{i})\approx \frac{1}{K}\sum\limits_{k=1}^{K} (\log p_{\alpha}(\boldsymbol{s}_{i}^{u}|\boldsymbol{a}_{i}^{k},\boldsymbol{u}_{i}^{k})+\log p_{\theta}(\boldsymbol{f}_{i}|\boldsymbol{b}_{i}^{k})) \\ &&-\text{KL}(q_{\phi}(\boldsymbol{a}_{i}|\mathcal{X}_{i})||p(\boldsymbol{a}_{i}))-\text{KL}(q_{\phi}(\boldsymbol{b}_{i}|\mathcal{X}_{i})||p(\boldsymbol{b}_{i})) \\ &&-\omega_{1} \text{KL}(q_{\phi}(\boldsymbol{u}_{i}|\mathcal{X}_{i})|| p_{\gamma}(\boldsymbol{u}_{i}|\boldsymbol{f}_{i})), \end{array} $$
(27)

where K is the number of samples, \(\boldsymbol {a}_{i}^{k}=\boldsymbol {\mu }_{a}+\boldsymbol {\delta }_{a}\odot \boldsymbol {\epsilon }_{a}^{k}, \boldsymbol {b}_{i}^{k}=\boldsymbol {\mu }_{b}+\boldsymbol {\delta }_{b}\odot \boldsymbol {\epsilon }_{b}^{k}, \boldsymbol {u}_{i}^{k}=\boldsymbol {\mu }_{u}+\boldsymbol {\delta }_{u}\odot \boldsymbol {\epsilon }_{u}^{k}\), ⊙ denotes element-wise multiplication, and \(\boldsymbol {\epsilon }_{a}^{k}, \boldsymbol {\epsilon }_{b}^{k}, \boldsymbol {\epsilon }_{u}^{k}\) are samples drawn from the standard multivariate normal distribution \(\mathcal {N}(0,\boldsymbol {I})\). The superscript k denotes the k-th sample. The ELBO for the item network, \(\mathcal {L}(\lambda ,\beta ,\tau ,\psi ;\mathcal {Y}_{j})\), can be derived similarly and is therefore omitted here.
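The sketch below (our own, with a single Monte Carlo sample K = 1; the prior network, the decoders decode_s and decode_f, and the q_stats dictionary are hypothetical modules and outputs in the spirit of the earlier sketches) illustrates how (27) can be evaluated with the reparameterization trick.

```python
import torch

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

def gaussian_kl(mu_q, log_var_q, mu_p, log_var_p):
    """KL divergence between two diagonal Gaussians, summed over the latent dimension."""
    return 0.5 * torch.sum(
        log_var_p - log_var_q
        + (torch.exp(log_var_q) + (mu_q - mu_p) ** 2) / torch.exp(log_var_p)
        - 1.0, dim=-1)

def bernoulli_log_lik(x, probs, eps=1e-8):
    """Log-likelihood of x under a multivariate Bernoulli with the given probabilities."""
    return torch.sum(x * torch.log(probs + eps)
                     + (1 - x) * torch.log(1 - probs + eps), dim=-1)

def user_elbo(s_u, f, q_stats, prior_u, decode_s, decode_f, omega_1=0.05):
    """Single-sample (K = 1) estimate of the user ELBO in Eq. (27)."""
    (mu_u, lv_u), (mu_a, lv_a), (mu_b, lv_b) = q_stats["u"], q_stats["a"], q_stats["b"]
    u = reparameterize(mu_u, lv_u)                    # u_i^k
    a = reparameterize(mu_a, lv_a)                    # a_i^k
    b = reparameterize(mu_b, lv_b)                    # b_i^k
    rec_s = bernoulli_log_lik(s_u, decode_s(a, u))    # log p_alpha(s_i^u | a_i, u_i)
    rec_f = bernoulli_log_lik(f, decode_f(b))         # log p_theta(f_i | b_i)
    kl_a = gaussian_kl(mu_a, lv_a, torch.zeros_like(mu_a), torch.zeros_like(lv_a))
    kl_b = gaussian_kl(mu_b, lv_b, torch.zeros_like(mu_b), torch.zeros_like(lv_b))
    mu_p, lv_p = prior_u(f)                           # conditional prior p_gamma(u_i | f_i)
    kl_u = gaussian_kl(mu_u, lv_u, mu_p, lv_p)
    return rec_s + rec_f - kl_a - kl_b - omega_1 * kl_u
```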


5.2 Optimizing parameters

Minimizing the objective function is equivalent to maximizing the log-likelihood of the observed data. Based on \(\mathcal {L}(\phi ,\alpha ,\theta ,\gamma ;\mathcal {X}_{i})\) in (27) and \(\mathcal {L}(\lambda ,\beta ,\tau ,\psi ;\mathcal {Y}_{j})\), the objective function is:

$$ \begin{array}{@{}rcl@{}} \mathcal{L}&=&-\sum\limits_{i=1}^{M}\sum\limits_{j=1}^{N}(\mathcal{L}(\phi,\alpha,\theta,\gamma;\mathcal{X}_{i}) +\mathcal{L}(\lambda,\beta,\tau,\psi;\mathcal{Y}_{j}) \\ &&+\frac{C_{ij}}{2}\mathbb{E}_{q_{\phi}(\boldsymbol{u}_{i}|\mathcal{X}_{i})q_{\lambda}(\boldsymbol{v}_{j}|\mathcal{Y}_{j})}[(R_{ij}-\boldsymbol{u}_{i}^{\top} \boldsymbol{v}_{j})^{2}]), \end{array} $$
(28)

where the expectation term is given by:

$$ \begin{array}{@{}rcl@{}} && \mathbb{E}_{q_{\phi}(\boldsymbol{u}_{i}|\mathcal{X}_{i}) q_{\lambda}(\boldsymbol{v}_{j}|\mathcal{Y}_{j})} [(R_{ij}-\boldsymbol{u}_{i}^{\top} \boldsymbol{v}_{j})^{2}]=R_{ij}^{2}-2R_{ij}\mathbb{E}[\boldsymbol{u}_{i}]^{\top} \mathbb{E}[\boldsymbol{v}_{j}] \\ &&+\text{tr}((\mathbb{E}[\boldsymbol{v}_{j}]\mathbb{E}[\boldsymbol{v}_{j}]^{\top}+{\Sigma}_{v}){\Sigma}_{u})+\mathbb{E}[\boldsymbol{u}_{i}]^{\top}(\mathbb{E}[\boldsymbol{v}_{j}]\mathbb{E}[\boldsymbol{v}_{j}]^{\top} +{\Sigma}_{v})\mathbb{E}[\boldsymbol{u}_{i}], \end{array} $$
(29)

where tr(⋅) denotes the trace of a matrix. Since NVMF is a fully end-to-end neural network whose parameters are the weight matrices of the entire network, we can use the back-propagation algorithm to optimize the weights of the user network and the item network.
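For diagonal posterior covariances, the analytic expectation in (29) can be computed as in the following sketch (our own implementation; variable names are illustrative):

```python
import torch

def expected_squared_error(r_ij, mu_u, var_u, mu_v, var_v):
    """E_{q(u) q(v)}[(R_ij - u^T v)^2] for independent diagonal Gaussians, as in Eq. (29).

    mu_u, mu_v   : posterior means of u_i and v_j, shape (D,)
    var_u, var_v : diagonal posterior variances of u_i and v_j, shape (D,)
    """
    dot = mu_u @ mu_v                                            # E[u]^T E[v]
    second_moment = (dot ** 2
                     + torch.sum((mu_v ** 2 + var_v) * var_u)    # tr((E[v]E[v]^T + Sigma_v) Sigma_u)
                     + torch.sum(mu_u ** 2 * var_v))             # E[u]^T Sigma_v E[u]
    return r_ij ** 2 - 2.0 * r_ij * dot + second_moment

# This term enters the loss (28) weighted by C_ij / 2, with C_ij given by Eq. (12).
```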

5.3 Prediction of ratings

After training converges, we obtain the posterior distributions of ui and vj through the user and item inference networks, respectively, so the prediction of Rij can be made by:

$$ \mathbb{E}[R_{ij}|\mathcal{O}]=\mathbb{E}[\boldsymbol{u}_{i}|\mathcal{O}]^{\top} \mathbb{E}[\boldsymbol{v}_{j}|\mathcal{O}]. $$
(30)

Specifically, given a user, an item and the corresponding observed collaborative information and features, \(\boldsymbol {s}_{i}^{u}, \boldsymbol {f}_{i}, \boldsymbol {s}_{j}^{v}\) and gj, our NVMF predicts the rating of the user on the item as \(R_{ij} = \mathbb {E}[R_{ij}|\mathcal {O}]\), with \(\mathbb {E}[\boldsymbol {u}_{i}|\mathcal {O}] =\mu _{\phi _{u_{i}}}(\mathcal {X}_{i})\) (see (18)) and \(\mathbb {E}[\boldsymbol {v}_{j}|\mathcal {O}] = \mu _{\lambda _{v_{j}}}(\mathcal {Y}_{j})\) (see (21)), respectively. For ranking prediction, we rank the list of items by these predicted ratings.
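A minimal sketch of this prediction step follows (our own; user_net is the hypothetical inference network sketched earlier, and item_net is its assumed item-side counterpart returning a "v" entry):

```python
import torch

def predict_ratings(user_net, item_net, s_u, f, S_v, G_feat):
    """Predict one user's scores over all items via Eq. (30).

    s_u    : the user's collaborative information, shape (N,)
    f      : the user's features, shape (P,)
    S_v    : collaborative information of all items, shape (N, M)
    G_feat : features of all items, shape (N, Q)
    """
    with torch.no_grad():
        mu_u, _ = user_net(s_u.unsqueeze(0), f.unsqueeze(0))["u"]   # E[u_i | O]
        mu_v, _ = item_net(S_v, G_feat)["v"]                        # E[v_j | O] for all j
        scores = mu_v @ mu_u.squeeze(0)                             # inner products, shape (N,)
    return scores

def top_k(scores, k=10):
    """Rank items by predicted rating and return the indices of the top-k."""
    return torch.topk(scores, k).indices
```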

5.4 Discussion

The computational complexity of training NVMF in each iteration is O(2(N + P)L + PL + 2(M + Q)L + QL + JL2 + D2), where J is the total number of layers in the user and item networks and L is the average dimension of these layers; O(2(N + P)L), O(PL), O(2(M + Q)L), O(QL), O(JL2) and O(D2) are the complexities of the encoder input layer and decoder output layer in the user network, the input layer of the user prior network, the encoder input layer and decoder output layer in the item network, the input layer of the item prior network, all the latent layers in NVMF, and the matrix factorization, respectively. Thus, the complexity of the proposed NVMF is at the same level as that of previous recommendation methods [15, 16].

6 Performance evaluation

We first describe our experimental settings and then present the results of the performance evaluation.

6.1 Experimental setting

6.1.1 Dataset

We use three benchmark datasets that are commonly used to evaluate recommendation algorithms.

  • MovieLens-100K (ML-100K): Similar to [15, 18], we extract the features of users and movies provided by the dataset to construct our side-information matrices F and G, respectively. Each user’s features contain the user’s id, age, gender, occupation and zipcode; each movie’s features contain the movie’s title, release date and 18 movie genre categories.

  • MovieLens-1M (ML-1M): Similar to ML-100K, we obtain the side information of users and items from the dataset.

  • Bookcrossing: For this dataset, we also extract the features of users and books provided by the dataset and encode the user and book features into binary vectors of length 30 each. Since we evaluate our model on implicit feedback, following [15, 18] we interpret the three datasets above as implicit feedback.

6.1.2 Baselines and experimental settings

For implicit feedback, as demonstrated in [16, 22], hybrid collaborative filtering models that incorporate side information outperform methods without side information, so most of the baselines we choose are hybrid models. The baselines against which we compare our proposed method are listed as follows:

  1) CMF [32]: This is an MF model that jointly decomposes the user-item matrix R, the user side-information matrix F and the item side-information matrix G to obtain consistent latent factors of users and items.

  2) CDL [16]: This is a hierarchical Bayesian model that jointly learns a stacked denoising autoencoder and collaborative filtering.

  3) CVAE [22]: This is a Bayesian probabilistic model that unifies item features and collaborative filtering through stochastic deep learning.

  4) aSDAE [18]: This is a deep learning model that integrates side information into matrix factorization via an additional stacked denoising autoencoder.

  5) NeuMF [33]: This is a state-of-the-art collaborative filtering method for implicit feedback.

Since the implicit feedback matrix R ∈ {0,1}M×N, we set p𝜃(fi|bi) and pτ(gj|dj) as multivariate Bernoulli distributions. We set ω1, ω2 and the learning rate η to 0.05, 0.05 and 0.0001, respectively. The values of φ1, φ2 and λw are set to 1, 0.01 and 0.0001, respectively. The dimensions Ka, Kb, Kc and Kd are all set to 20. The inference and generative networks both use two-layer architectures, and the last layer of each generative network is a softmax layer. The user and item prior networks each have one layer. We set the dimension D of the user and item latent factors to 30. We use 80% of the ratings of each dataset for training, 10% for validation, and the remaining 10% for testing. We repeat this procedure five times and report the average performance.
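For reference, a hypothetical configuration collecting the hyperparameters stated above might look as follows (the dictionary and its key names are our own):

```python
# Hyperparameters reported in Section 6.1.2; key names are our own.
NVMF_CONFIG = {
    "omega_1": 0.05,          # KL trade-off for the user latent factor
    "omega_2": 0.05,          # KL trade-off for the item latent factor
    "learning_rate": 1e-4,    # eta
    "phi_1": 1.0,             # confidence for observed feedback (R_ij != 0)
    "phi_2": 0.01,            # confidence for missing feedback (R_ij == 0)
    "lambda_w": 1e-4,         # weight regularization
    "K_a": 20, "K_b": 20, "K_c": 20, "K_d": 20,   # auxiliary latent dimensions
    "latent_dim_D": 30,       # dimension of u_i and v_j
    "encoder_layers": 2, "decoder_layers": 2, "prior_layers": 1,
    "train_val_test_split": (0.8, 0.1, 0.1),
}
```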

6.1.3 Evaluation metrics

For evaluation, we use the Hit Ratio (HR) [34] and the Normalized Discounted Cumulative Gain (NDCG) [33] as our evaluation metrics. For each user, we rank the items by predicted rating and take the top-K:

$$ \begin{array}{@{}rcl@{}} &&\text{HR}@K=\frac{\#\text{hits}@K}{|\text{GT}|},~~~~~~\text{NDCG}@K=Z_{k}\sum\limits_{i=1}^{K}\frac{2^{\text{rel}_{i}}-1}{\log_{2}(i+1)}, \end{array} $$
(31)

where GT denotes the set of ground-truth items in the test list, reli is the graded relevance of the item at position i, and Zk is a normalization constant. For top-K recommendation, reli ∈ {0,1}. We report the average scores over all users in our experimental analysis.
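A small sketch of these metrics for a single user (our own implementation of (31), assuming binary relevance and normalizing by the ideal DCG):

```python
import numpy as np

def hr_at_k(ranked_items, ground_truth, k=10):
    """HR@K: fraction of ground-truth items that appear in the top-K list."""
    hits = len(set(ranked_items[:k]) & set(ground_truth))
    return hits / len(ground_truth)

def ndcg_at_k(ranked_items, ground_truth, k=10):
    """NDCG@K with binary relevance (rel_i in {0, 1})."""
    rel = [1.0 if item in ground_truth else 0.0 for item in ranked_items[:k]]
    dcg = sum((2 ** r - 1) / np.log2(i + 2) for i, r in enumerate(rel))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(ground_truth), k)))
    return dcg / ideal if ideal > 0 else 0.0

# example: hr_at_k([3, 7, 1], {7}, k=2) -> 1.0 ; ndcg_at_k([3, 7, 1], {7}, k=2) -> ~0.63
```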

6.2 Results and analysis

We now show the results of the performance comparisons between our proposed model NVMF and the baselines under different experimental settings.

6.2.1 Performance comparison with baselines

Table 2 shows the experimental results comparing NVMF with CMF, CDL, CVAE, aSDAE and NeuMF on the three datasets. As we can see, our proposed NVMF significantly outperforms the compared methods in all cases on the ML-100K, ML-1M and Bookcrossing datasets. Compared with the methods that have a deep learning structure, CMF achieves the worst performance, which demonstrates that deep learning structures can learn more subtle and complex representations than traditional MF. We also observe that although CDL and CVAE both have a deep learning structure, CVAE performs better than CDL; this is because CDL is based on a denoising autoencoder, which can be seen as point estimation, whereas CVAE is a fully deep probabilistic model that is harder to overfit. From Table 2, the strongest baseline in our experiments is aSDAE, which outperforms the other baselines: although aSDAE is not a probabilistic model, it incorporates the user's side information (user features) into matrix factorization, which CDL and CVAE do not. By incorporating both user and item features and applying a deep generative model, our NVMF outperforms all the baselines. Compared to the other datasets, our model yields the greatest improvement on Bookcrossing, which has the sparsest matrix among the three datasets. This shows that NVMF alleviates the sparsity problem on implicit feedback more effectively than aSDAE.

Table 2 Recommendation performance comparison between our NVMF and baselines

6.2.2 Performance comparison on different model configurations

Figure 3 shows the performance comparison for different latent dimensions D. We observe that a larger dimension generally leads to better performance and that our NVMF outperforms the other baselines across dimensions. NVMF achieves its best performance at D = 40 on the ML-100K and ML-1M datasets and at D = 50 on Bookcrossing. Figure 4 shows the contours of NDCG@5 for NVMF on the three datasets. When both parameters equal 1, i.e., ω1 = 1 and ω2 = 1, we directly optimize the standard ELBO, which degrades performance, as confirmed in Fig. 4b. When we decrease ω1 to 0.1 at fixed ω2, NVMF's performance improves (a small ω1 implies that more of the user's collaborative information is condensed into the user's latent factor). A similar observation holds when varying ω2 at fixed ω1. Moreover, there is a region of values of ω1 and ω2 (near (0.1, 0.1)) around which NVMF achieves its best NDCG@5. Altogether, Fig. 4 shows that treating ω1 and ω2 as trade-off parameters yields a significant improvement in recommendation performance.

Fig. 3

Performance comparison between NVMF and the baselines by varying latent dimension D on three datasets

Fig. 4

The NDCG@5 for NVMF by varying ω1 and ω2 on three datasets

6.3 Performance comparison on different sparsity conditions

To further evaluate our proposed NVMF under different sparsity conditions, we train each compared method with different percentages (30%, 50%, 80%) of the feedback on the ML-100K dataset. Note that we do not conduct this experiment on the ML-1M and Bookcrossing datasets, since they are already very sparse (95.8% and 99.9% sparsity, respectively). Table 3 shows the results, from which we can find: 1) our proposed model outperforms all baselines under all sparsity conditions, which demonstrates that it can effectively handle the data sparsity problem; 2) the models that incorporate both users' and items' side information (i.e., NVMF and aSDAE) outperform the others (i.e., CDL and CVAE), which again demonstrates that incorporating both users' and items' side information effectively handles data sparsity; 3) our NVMF outperforms the strong baselines (i.e., aSDAE and NeuMF) under the extreme sparsity condition (30%) by a large margin.

Table 3 Recommendation performance comparison with different percentages of training data on ML-100K

7 Conclusion

In this paper, we study the problem of learning subtle and complex latent factors from a feedback matrix with side information for collaborative filtering in recommendation systems. We propose a neural variational matrix factorization model, NVMF, a novel deep generative model that learns the latent factors of users and items. NVMF possesses the merits of both deep learning and probabilistic generative models: it can learn more subtle and complex representations than traditional probabilistic matrix factorization, and it is more robust than other deep neural network models such as the autoencoder (AE) and the denoising autoencoder (DAE). In addition, NVMF incorporates the features of users and items into matrix factorization through a novel generative process, which enables it to effectively handle the data sparsity and cold-start problems. We present a complete end-to-end network architecture so that back-propagation can be applied for efficient parameter estimation, together with a Stochastic Gradient Variational Bayes (SGVB) estimator to infer the latent factors and parameters of our model. We also derive a variational lower bound that provides a quality guarantee for the model. Experiments conducted on explicit and implicit (user-item rating) feedback have demonstrated our model's effectiveness in learning the latent factors and its superior recommendation accuracy over state-of-the-art methods.

The complete architecture for model generation and parameter inference makes the proposed model easy to extend to other types of data such as images and videos.