
1 Introduction

With the emergence of big data, recommender systems have become a core part of online services. The goal of a recommender system is to model user preferences by analyzing their historical data and to provide personalized recommendations. Collaborative filtering (CF) is an effective approach for recommender systems that aims to predict a user's rating of an item given the users' past rating history. Among various CF-based methods, matrix factorization (MF) is one of the most powerful approaches [2, 11].

MF-based algorithms represent user preferences and item attributes by latent feature vectors in a shared latent space. Typically, an MF-based algorithm finds the latent feature vectors of users and items by fitting the model with the observed ratings given in the user–item rating matrix [2, 11, 12, 15]. However, because each user can only rate a limited number of items, the rating matrix is usually extremely sparse. Therefore, the performance of rating prediction declines if an item has too few ratings; or in an extreme case, if an item has no prior ratings, the system cannot learn its latent feature vector, and thus cannot recommend it to any user. This problem is referred to as the cold-start problem.

To address this problem, a common approach is to exploit other information about users and items, known as side information or auxiliary information. There are various types of side information, depending on the item being recommended (e.g., the genres of movies or the text content of books). Such side information has been successfully combined with traditional CF-based algorithms to alleviate the cold-start problem. For instance, in [1, 14, 16], text content was exploited for article recommendations; music content was used for song recommendations [9]; and text content and category information were used for movie recommendations [10]. One limitation of these models is that such side information is not always available, or, in many cases, is available but insufficiently informative for describing items (e.g., some items are described by only a few keywords or very short texts).

This work focuses on exploiting another kind of feedback, known as implicit feedback (e.g., clicks, views, purchases), as the auxiliary data. The advantage of implicit feedback is that it is abundant and easily collected during users' interactions with the system, without requiring any additional effort from users.

Related Work. In [2], the authors proposed SVD++, which exploits implicit feedback to boost the performance of the original probabilistic matrix factorization (PMF) model [11]. However, in SVD++, the implicit feedback is a binary matrix that indicates “who rates what”, obtained by binarizing the rating data: an item has implicit feedback if and only if it has rating data. Therefore, this model cannot represent an item that has no ratings.

Co-rating [5] combines explicit and implicit feedback in a unified framework. In the model, the explicit feedback is normalized into the range [0, 1] and added to the implicit feedback matrix to form a single matrix. This matrix is then factorized to obtain the latent feature vectors of users and items. In this way, the feature vector of an item can be inferred even if the item has no rating data. The limitation of this method is that, after the final matrix is formed, implicit and explicit feedback can no longer be distinguished; therefore, the model cannot account for the uncertainty of the implicit feedback.

In [13], the authors proposed a method for combining implicit and explicit feedback using expectation maximization (EM). To predict ratings for an item for which rating data are not available, the rating is inferred from the ratings of its neighbors in terms of click data. However, the algorithm is an iterative EM procedure whose E-step is itself a matrix factorization; matrix factorization is therefore performed many times, making the method computationally expensive.

In [8], the authors proposed a probabilistic model that combines explicit and implicit feedback for making recommendations. In this model, the latent feature vector of an item for which rating data are not available can be learned directly from the implicit feedback data. The model is a combination of PMF [11] and an item embedding model that learns latent feature vectors for items from the implicit feedback. In detail, the model simultaneously factorizes the rating matrix and the positive point-wise mutual information (PPMI) matrix constructed from the click data. The item vectors are obtained from the factorization of the PPMI matrix and are then adjusted using the rating matrix. Although this model successfully incorporates implicit feedback when learning the latent feature vectors of items, it is prone to overfitting if the hyperparameters (i.e., the regularization parameters) are not tuned carefully. Hyperparameter tuning is usually very costly, especially when there are many hyperparameters or when the data are large.

Present Work. In this paper, we propose a fully hierarchical Bayesian treatment of the model proposed by Nguyen et al. [8]. Instead of finding a point estimate of the model parameters, our method infers their full posterior distribution to capture their uncertainty. The missing ratings are predicted by integrating out the latent feature vectors of users and items. To this end, we place Gaussian-inverse-Wishart priors on the mean vectors and covariance matrices of the latent feature vectors for the rating matrix and the PPMI matrix [8]. We develop a Markov chain Monte Carlo (MCMC)-based method for inferring the full posterior distribution.

2 Preliminary

Suppose we have N users and M items. For each user–item pair (u, i), there can be two types of feedback: explicit feedback (also known as rating data) and implicit feedback (also known as click data). The rating data are represented by a matrix \(\mathbf {R}\in \mathbb {R}^{N\times M}\), in which element \(r_{ui}\) is the rating of user u for item i; \(r_{ui}\) can be a real number or a binary value (e.g., like/dislike). The click data are represented by a binary matrix \(\mathbf {P}\in \{0,1\}^{N\times M}\), where \(p_{ui}=1\) indicates that user u has clicked item i at least once, and \(p_{ui}=0\) otherwise.

Generally, the rating matrix \(\mathbf {R}\) is extremely sparse with many missing values (i.e., \(r_{ui}\) is not observed). We are interested in predicting these missing ratings.

2.1 Item Embedding Model Based on Implicit Feedback Data

Word embedding techniques have shown their success in many natural language processing tasks [3, 6]. By viewing each item in a recommender system as a word, the same assumptions that underlie word embedding models can be applied to modeling items. In [8], the authors proposed an item embedding model based on implicit feedback data, i.e., a model that captures the relationships between items that are clicked by the same users.

In this item embedding model, item i is represented by two vectors: an item vector \(\mathbf {w}_i\) and a context vector \(\mathbf {z}_i\). The vectors have different roles: the item vector describes the distribution of the item, and the context vector describes the distribution of the co-occurrence of an item with other items in its context.

In [8], the authors proposed an item-embedding scheme in which the item vector and the context vector are obtained by factorizing the PPMI matrix corresponding to the click data:

$$\begin{aligned} PPMI(i,j)=\mathbf {w}_i^\top \mathbf {z}_j \end{aligned}$$
(1)

The PPMI matrix is obtained by replacing the negative values in the point-wise mutual information (PMI) matrix with zeros. Each element of the PMI matrix measures the association between two items based on their co-occurrence. Empirically, the PMI of items i and j can be estimated from the observed data:

$$\begin{aligned} PMI(i,j)=\log \frac{\#(i,j)|\mathcal {D}|}{\#(i)\#(j)}, \end{aligned}$$
(2)

where \(\#(i)\) and \(\#(j)\) are the numbers of times items i and j are clicked, respectively. \(\mathcal {D}\) is the set of item pairs that appear in the combined click history of all users. \(\#(i,j)\) is the number of users who clicked both i and j.
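To make the construction of Eqs. (1)–(2) concrete, the following minimal sketch (our own illustration, not code from [8]; it assumes the click data are given as a dense binary NumPy array, and here \(|\mathcal {D}|\) counts ordered co-clicked pairs, so the exact bookkeeping may differ from [8]):

```python
import numpy as np

def ppmi_matrix(P):
    """Build the PPMI matrix of Eqs. (1)-(2) from a binary click matrix P (N users x M items)."""
    P = P.astype(np.float64)
    co = P.T @ P                   # co[i, j] = #(i, j): users who clicked both i and j
    item = np.diag(co).copy()      # #(i): users who clicked item i
    np.fill_diagonal(co, 0.0)      # exclude self-pairs
    D = co.sum()                   # |D|: size of the set of co-clicked item pairs
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(co * D / np.outer(item, item))   # Eq. (2)
    pmi[~np.isfinite(pmi)] = 0.0   # pairs that never co-occur contribute nothing
    return np.maximum(pmi, 0.0)    # clip negative PMI values to obtain PPMI
```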

2.2 Probabilistic Model for Implicit and Explicit Feedback Data

After constructing the item embedding model from the implicit feedback, Nguyen et al. [8] proposed a model that combines implicit and explicit feedback in a unified framework (PIE). PIE is a combination of the item embedding model (described in Sect. 2.1) and matrix factorization of the rating data. In PIE, the latent feature vector \(\mathbf {y}_i\) of item i is obtained by adding a small deviation \(\mathbf {t}_i\) to the item vector \(\mathbf {w}_i\). The graphical model of PIE is shown in Fig. 1a.

The main drawback of this model is that parameter learning is a point estimation (the MAP estimate), which is prone to overfitting when the trained model is applied to unseen data. To avoid overfitting, the hyperparameters must be tuned carefully. One approach is grid search: we form a set of candidate hyperparameter configurations, train the model with each, and select the configuration that produces the best performance on a validation set. However, grid search is generally very costly, especially for large-scale data or when the number of hyperparameters is large.

A straightforward way to avoid hyperparameter tuning is to place priors on the hyperparameters and optimize the log-posterior over both the model parameters and the hyperparameters. In this way, the hyperparameters are learned from the data instead of being tuned manually. However, this solution does not significantly improve the generalization of the model because it is still a point estimation and cannot capture the uncertainty of the model parameters.

3 Proposed Method

To address the drawbacks described above, we propose a hierarchical, fully Bayesian model (HBFM) that can capture the uncertainty of the model parameters. Instead of approximating the posterior by its mode (the MAP estimate), we approximate the full posterior distributions of the model parameters.

3.1 The Model

We place Gaussian-inverse-Wishart priors on the mean vectors and covariance matrices of the latent feature vectors. The graphical model is shown in Fig. 1b.

Fig. 1.

Graphical models of PIE [8] and HBFM (this paper). Because of space limitations, we omit the bias terms.

We assume that \(r_{ui}\) and \(s_{ij}\) are Gaussian-distributed, where \(s_{ij}\) denotes element \((i,j)\) of the PPMI matrix:

$$\begin{aligned} p(r_{ui}|\mathbf {x}_u, \mathbf {y}_i, \theta _u, \rho _i, \sigma ^2_R)&= \mathcal {N}(r_{ui}|\mathbf {x}_u^\top \mathbf {y}_i+\eta _{ui}, \sigma ^2_R)\end{aligned}$$
(3)
$$\begin{aligned} p(s_{ij}|\mathbf {w}_i, \mathbf {z}_j, \sigma ^2_S)&= \mathcal {N}(s_{ij}|\mathbf {w}_i^\top \mathbf {z}_j, \sigma ^2_S), \end{aligned}$$
(4)

where \(\theta _u\) and \(\rho _i\) are the biases of user u and item i, respectively; \(\mathbf {y}_i = \mathbf {t}_i+\mathbf {w}_i\); \(\eta _{ui}=\mu + \theta _u + \rho _i\); and \(\mu \) is the global mean of the ratings.

The prior distributions of the latent feature vectors are assumed to be multivariate Gaussian distributions:

$$\begin{aligned} \begin{aligned} p(\mathbf {X}|\mathbf {\Theta }_X) = \prod _{u=1}^N \mathcal {N}(\mathbf {x}_u|\varvec{\mu }_X,\mathbf {\Sigma }_X), \quad p(\mathbf {T}|\mathbf {\Theta }_T) = \prod _{i=1}^M \mathcal {N}(\mathbf {t}_i|\varvec{\mu }_T,\mathbf {\Sigma }_T)\\ p(\mathbf {W}|\mathbf {\Theta }_W) = \prod _{i=1}^M \mathcal {N}(\mathbf {w}_i|\varvec{\mu }_W,\mathbf {\Sigma }_W), \quad p(\mathbf {Z}|\mathbf {\Theta }_Z) = \prod _{j=1}^M \mathcal {N}(\mathbf {z}_j|\varvec{\mu }_Z,\mathbf {\Sigma }_Z) \end{aligned} \end{aligned}$$
(5)

where \(\varvec{\mu }_X\), \(\varvec{\mu }_T\), \(\varvec{\mu }_W\), and \(\varvec{\mu }_Z\) are the mean vectors and \(\mathbf {\Sigma }_X\), \(\mathbf {\Sigma }_T\), \(\mathbf {\Sigma }_W\), and \(\mathbf {\Sigma }_Z\) are the covariance matrices of \(\mathbf {x}_u\), \(\mathbf {t}_i\), \(\mathbf {w}_i\), and \(\mathbf {z}_j\), respectively; \(\mathbf {\Theta }_X=\{\varvec{\mu }_X, \mathbf {\Sigma }_X\}\), \(\mathbf {\Theta }_T=\{\varvec{\mu }_T, \mathbf {\Sigma }_T\}\), \(\mathbf {\Theta }_W=\{\varvec{\mu }_W, \mathbf {\Sigma }_W\}\), and \(\mathbf {\Theta }_Z=\{\varvec{\mu }_Z, \mathbf {\Sigma }_Z\}\).

To model the uncertainty of the latent feature vectors, we do not fix the hyperparameters of these prior distributions. Instead, we further place Gaussian-inverse-Wishart priors on \(\mathbf {\Theta }_X\), \(\mathbf {\Theta }_T\), \(\mathbf {\Theta }_W\), and \(\mathbf {\Theta }_Z\):

$$\begin{aligned} p(\mathbf {\Theta }_X|\varvec{\varPhi }_{X_0}) = \mathcal {N}(\varvec{\mu }_X|\varvec{\mu }_{X_0},\mathbf {\Sigma }_{X}/\gamma _{X_0})\mathcal {W}^{-1}(\mathbf {\Sigma }_X|\mathcal {W}_{X_0}, \nu _{X_0}) \end{aligned}$$
(6)
$$\begin{aligned} p(\mathbf {\Theta }_T|\varvec{\varPhi }_{T_0}) = \mathcal {N}(\varvec{\mu }_T|\varvec{\mu }_{T_0},\mathbf {\Sigma }_{T}/\gamma _{T_0})\mathcal {W}^{-1}(\mathbf {\Sigma }_T|\mathcal {W}_{T_0}, \nu _{T_0}) \end{aligned}$$
(7)
$$\begin{aligned} p(\mathbf {\Theta }_W|\varvec{\varPhi }_{W_0}) = \mathcal {N}(\varvec{\mu }_W|\varvec{\mu }_{W_0},\mathbf {\Sigma }_{W}/\gamma _{W_0})\mathcal {W}^{-1}(\mathbf {\Sigma }_W|\mathcal {W}_{W_0}, \nu _{W_0}) \end{aligned}$$
(8)
$$\begin{aligned} p(\mathbf {\Theta }_Z|\varvec{\varPhi }_{Z_0}) = \mathcal {N}(\varvec{\mu }_Z|\varvec{\mu }_{Z_0},\mathbf {\Sigma }_{Z}/\gamma _{Z_0})\mathcal {W}^{-1}(\mathbf {\Sigma }_Z|\mathcal {W}_{Z_0}, \nu _{Z_0}), \end{aligned}$$
(9)

where: \(\varvec{\varPhi }_{X_0}=\{\varvec{\mu }_{X_0},\gamma _{X_0}, \mathcal {W}_{X_0}, \nu _{X_0}\}\), \(\varvec{\varPhi }_{T_0}=\{\varvec{\mu }_{T_0},\gamma _{T_0}, \mathcal {W}_{T_0}, \nu _{T_0}\}\), \(\varvec{\varPhi }_{W_0}=\{\varvec{\mu }_{W_0},\gamma _{W_0}, \mathcal {W}_{W_0}, \nu _{W_0}\}\), and \(\varvec{\varPhi }_{Z_0}=\{\varvec{\mu }_{Z_0},\gamma _{Z_0}, \mathcal {W}_{Z_0}, \nu _{Z_0}\}\).

Here, \(\mathcal {W}^{-1}\) is the inverse Wishart distribution with \(\nu _0\) degrees of freedom and a \(d\times d\) scaling matrix \(\mathcal {W}_0\):

$$\begin{aligned} \mathcal {W}^{-1}(\mathbf {\Sigma }|\mathcal {W}_0,\nu _0) = \frac{1}{C}|\mathbf {\Sigma }|^{-(\nu _0+d+1)/2}\exp \Big (-\frac{1}{2}\mathrm {Tr}(\mathcal {W}_0\mathbf {\Sigma }^{-1})\Big ), \end{aligned}$$
(10)

where C is a normalizing constant and \(\mathrm {Tr}(\cdot )\) is the trace of a matrix.
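As a quick numerical illustration (our own sketch; it assumes SciPy is available, whose scipy.stats.invwishart uses the same parameterization as Eq. (10)), a covariance matrix can be drawn from this prior as follows:

```python
import numpy as np
from scipy.stats import invwishart

d = 5                    # latent dimensionality
W0 = np.eye(d)           # scaling matrix W_0
nu0 = d + 2              # degrees of freedom nu_0 (nu_0 >= d)

# One draw of a d x d covariance matrix Sigma ~ W^{-1}(W_0, nu_0)
Sigma = invwishart(df=nu0, scale=W0).rvs(random_state=0)
assert np.allclose(Sigma, Sigma.T)  # draws are symmetric positive definite
```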

The Gaussian-inverse-Wishart prior is adopted because it is the conjugate prior of the multivariate Gaussian distribution. This choice allows the conditional distributions required for Gibbs sampling to be derived in closed form and sampled easily. Similarly, we place an inverse Gamma prior [17] on the variance \(\sigma ^2_R\):

$$\begin{aligned} p(\sigma ^2_R|\alpha _R, \beta _R) = IG(\sigma ^2_R|\alpha _R, \beta _R), \end{aligned}$$
(11)

where IG(.) is the inverse Gamma distribution [17]:

$$\begin{aligned} IG(x|\alpha , \beta )=\frac{\beta ^\alpha }{\varGamma (\alpha )}x^{-\alpha -1}\exp \Big (-\frac{\beta }{x}\Big ) \end{aligned}$$
(12)

Choosing the inverse Gamma distribution, the conjugate prior for the variance of a Gaussian distribution, makes sampling from the posterior straightforward. This distribution has also been shown to be effective for modeling the unknown variance of a Gaussian distribution [17].

We place Gaussian priors over the bias terms as follows.

$$\begin{aligned} p(\theta _u|\sigma ^2_\theta ) = \mathcal {N}(\theta _u|0, \sigma ^2_\theta ), \quad p(\rho _i|\sigma ^2_\rho ) = \mathcal {N}(\rho _i|0, \sigma ^2_\rho ), \end{aligned}$$
(13)

where \(\sigma ^2_\theta \) and \(\sigma ^2_\rho \) are given inverse Gamma priors [17]:

$$\begin{aligned} p(\sigma ^2_\theta |\alpha _\theta , \beta _\theta ) = IG(\sigma ^2_\theta |\alpha _\theta , \beta _\theta ), \quad p(\sigma ^2_\rho |\alpha _\rho , \beta _\rho ) = IG(\sigma ^2_\rho |\alpha _\rho , \beta _\rho ) \end{aligned}$$
(14)

We place an inverse Gamma prior [17] on the variance \(\sigma ^2_S\) of \(s_{ij}\):

$$\begin{aligned} p(\sigma ^2_S|\alpha _S, \beta _S) = IG(\sigma ^2_S|\alpha _S, \beta _S) \end{aligned}$$
(15)

3.2 Posterior Inference

Our goal is to find the posterior distribution of the model parameters. This posterior is analytically intractable, so we employ MCMC-based methods, which are widely used for approximating intractable distributions [7]. The key idea of these methods is to construct a Markov chain that converges to the posterior distribution of the model; each state of the chain is a set of model parameters, and the posterior is characterized by the samples drawn from the chain. In this paper, we use Gibbs sampling [7], a variant of MCMC that alternately samples each variable conditioned on the remaining variables.

Sampling \(\mathbf {x}_u\), \(\mathbf {t}_i\), \(\mathbf {w}_i\), and \(\mathbf {z}_j\). The conditional distribution over the user latent feature vector \(\mathbf {x}_u\), conditioned on the observed ratings, the latent feature vectors of the items, and the hyperparameters, is Gaussian:

$$\begin{aligned} \begin{aligned} p(\mathbf {x}_u|\mathbf {R},\mathbf {Y}, \varvec{\mu }_X, \varvec{\theta }, \varvec{\rho }, \mathbf {\Sigma }_X)&= \mathcal {N}(\mathbf {x}_u|\varvec{\mu }^*_{X_u},\mathbf {\Sigma }^*_{X_u})\\&\propto p(\mathbf {x}_u|\varvec{\mu }_X, \mathbf {\Sigma }_X)\prod _{i\in \mathcal {R}_u}\mathcal {N}(r_{ui}|\mathbf {x}_u^\top \mathbf {y}_i+\eta _{ui}, \sigma _R^2), \end{aligned} \end{aligned}$$
(16)

where \(\mathcal {R}_u\) denotes the set of items rated by user u, \(\eta _{ui}=\theta _u+\rho _i+\mu \), \(\varvec{\theta }=\{\theta _u\}_{u=1}^N\), \(\varvec{\rho }=\{\rho _i\}_{i=1}^M\), and

$$\begin{aligned} \mathbf {\Sigma }^*_{X_u}&=\Big (\mathbf {\Sigma }_{X}^{-1}+\frac{1}{\sigma ^2_R}\sum _{i\in \mathcal {R}_u}\mathbf {y}_i\mathbf {y}_i^\top \Big )^{-1}\end{aligned}$$
(17)
$$\begin{aligned} \varvec{\mu }^*_{X_u}&= \mathbf {\Sigma }^*_{X_u}\Big [\mathbf {\Sigma }_{X}^{-1}\varvec{\mu }_X+\frac{1}{\sigma ^2_R}\sum _{i\in \mathcal {R}_u}(r_{ui}-\eta _{ui})\mathbf {y}_i\Big ]. \end{aligned}$$
(18)

Similarly, we can obtain the posterior distributions of \(\mathbf {t}_i\), \(\mathbf {w}_i\), and \(\mathbf {z}_j\).
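A minimal sketch of this update (our own illustration with hypothetical helper names; the per-user quantities r_u, Y_u, and eta_u are assumed to be gathered beforehand) draws \(\mathbf {x}_u\) according to Eqs. (16)–(18):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_x_u(r_u, Y_u, eta_u, mu_X, Sigma_X, sigma2_R):
    """Draw x_u ~ N(mu*, Sigma*) following Eqs. (16)-(18).

    r_u   : observed ratings of user u (length |R_u|)
    Y_u   : latent vectors y_i of the items rated by u, stacked as rows (|R_u| x d)
    eta_u : bias terms eta_ui = mu + theta_u + rho_i for those items
    """
    precision = np.linalg.inv(Sigma_X) + (Y_u.T @ Y_u) / sigma2_R      # inverse of Eq. (17)
    Sigma_star = np.linalg.inv(precision)
    mu_star = Sigma_star @ (np.linalg.solve(Sigma_X, mu_X)
                            + (Y_u.T @ (r_u - eta_u)) / sigma2_R)      # Eq. (18)
    return rng.multivariate_normal(mu_star, Sigma_star)
```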

Sampling \(\varvec{\varTheta }_X=\{\varvec{\mu }_X, \varvec{\varSigma }_X\}\), \(\varvec{\varTheta }_T=\{\varvec{\mu }_T, \varvec{\varSigma }_T\}\), \(\varvec{\varTheta }_W=\{\varvec{\mu }_W, \varvec{\varSigma }_W\}\), and \(\varvec{\varTheta }_Z=\{\varvec{\mu }_Z, \varvec{\varSigma }_Z\}\). The posterior distribution over \(\varvec{\varTheta }_X=\{\varvec{\mu }_X, \varvec{\varSigma }_X\}\) conditioned on user latent feature vectors and \(\varvec{\varPhi }_{X_0}=\{\varvec{\mu }_{X_0}, \gamma _{X_0}, \mathcal {W}_{X_0}, \nu _{X_0}\}\) is a Gaussian inverse Wishart distribution:

$$\begin{aligned} p(\varvec{\mu }_X, \varvec{\varSigma }_X|\mathbf {X}, \varvec{\varPhi }_{X_0})&= \mathcal {N}(\varvec{\mu }_X|\varvec{\mu }^*_{X_0}, \varvec{\varSigma }_{X}/\gamma ^*_{X_0})\mathcal {W}^{-1}(\varvec{\varSigma }_X|\mathcal {W}^*_{X_0}, \nu ^*_{X_0})\end{aligned}$$
(19)
$$\begin{aligned}&\propto p(\mathbf {X}|\varvec{\mu }_X, \varvec{\varSigma }_X)p(\varvec{\mu }_X, \varvec{\varSigma }_X|\varvec{\varPhi }_{X_0}), \end{aligned}$$
(20)

where:

$$\begin{aligned} \varvec{\mu }^*_{X_0}&= \frac{\gamma _{X_0}\varvec{\mu }_{X_0}+N\bar{\mathbf {x}}}{\gamma _{X_0}+N}, \quad \gamma ^*_{X_0} = \gamma _{X_0} + N, \nu ^*_{X_0} = \nu _{X_0} + N \end{aligned}$$
(21)
$$\begin{aligned} \mathcal {W}^*_{X_0}&= \mathcal {W}_{X_0}+N\bar{\mathbf {S}}+\frac{\gamma _{X_0}N}{\gamma _{X_0}+N}(\varvec{\mu }_{X_0}-\bar{\mathbf {x}})(\varvec{\mu }_{X_0}-\bar{\mathbf {x}})^\top \end{aligned}$$
(22)
$$\begin{aligned} \bar{\mathbf {x}}&= \frac{1}{N}\sum _{u=1}^{N}\mathbf {x}_u, \quad \bar{\mathbf {S}}=\frac{1}{N}\sum _{u=1}^{N}(\mathbf {x}_u-\bar{\mathbf {x}})(\mathbf {x}_u-\bar{\mathbf {x}})^\top \end{aligned}$$
(23)

The posterior distributions over \(\varvec{\varTheta }_T\), \(\varvec{\varTheta }_W\), and \(\varvec{\varTheta }_Z\) can be obtained similarly and take exactly the same form.
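A sketch of this update (our own illustration; it assumes SciPy for the inverse-Wishart draw) that, as noted above, applies unchanged to \(\varvec{\varTheta }_T\), \(\varvec{\varTheta }_W\), and \(\varvec{\varTheta }_Z\):

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)

def sample_theta_X(X, mu0, gamma0, W0, nu0):
    """Draw (mu_X, Sigma_X) from the Gaussian-inverse-Wishart posterior, Eqs. (19)-(23)."""
    N, d = X.shape
    x_bar = X.mean(axis=0)
    S_bar = (X - x_bar).T @ (X - x_bar) / N                  # Eq. (23)
    gamma_n, nu_n = gamma0 + N, nu0 + N                      # Eq. (21)
    mu_n = (gamma0 * mu0 + N * x_bar) / gamma_n              # Eq. (21)
    W_n = (W0 + N * S_bar
           + gamma0 * N / gamma_n * np.outer(mu0 - x_bar, mu0 - x_bar))  # Eq. (22)
    Sigma_X = invwishart(df=nu_n, scale=W_n).rvs(random_state=rng)
    mu_X = rng.multivariate_normal(mu_n, Sigma_X / gamma_n)  # Eq. (19)
    return mu_X, Sigma_X
```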

Sampling bias terms \(\theta _u\) and \(\rho _i\). The posterior distribution over the user bias term \(\theta _u\) is Gaussian:

$$\begin{aligned} \begin{aligned} p(\theta _u|\mathbf {R,X,Y},\varvec{\rho },\sigma ^2_R,\sigma ^2_\theta )&= \mathcal {N}(\theta _u|\xi ^*_u,(\sigma ^*_{\theta _u})^2)\\&\propto p(\mathbf {R}|\mathbf {X,Y},\varvec{\theta },\varvec{\rho }, \sigma ^2_R)p(\theta _u|\sigma ^2_\theta ), \end{aligned} \end{aligned}$$
(24)

where:

$$\begin{aligned} (\sigma ^*_{\theta _u})^2 = \Big (\frac{1}{\sigma ^2_\theta }+\frac{|\mathcal {R}_u|}{\sigma ^2_R}\Big )^{-1}, \quad \xi ^*_u = \Big (\frac{\sigma ^*_{\theta _u}}{\sigma _R}\Big )^2\sum _{i\in \mathcal {R}_u}\Big [r_{ui}-(\mu +\rho _i+\mathbf {x}_u^\top \mathbf {y}_i)\Big ] \end{aligned}$$
(25)

The posterior distribution over \(\rho _i\) can be obtained using the same form.
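A sketch of the user-bias update (our own illustration; pred_u is a hypothetical vector holding \(\mu +\rho _i+\mathbf {x}_u^\top \mathbf {y}_i\) for the items rated by u), following Eqs. (24)–(25):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_theta_u(r_u, pred_u, sigma2_theta, sigma2_R):
    """Draw theta_u ~ N(xi*, (sigma*)^2) following Eqs. (24)-(25)."""
    var = 1.0 / (1.0 / sigma2_theta + len(r_u) / sigma2_R)   # (sigma*_theta_u)^2
    mean = var / sigma2_R * np.sum(r_u - pred_u)             # xi*_u
    return rng.normal(mean, np.sqrt(var))
```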

Sampling \(\sigma ^2_R\) and \(\sigma ^2_S\). The posterior distribution over \(\sigma ^2_R\), conditioned on the rating data, the user latent factor matrix \(\mathbf {X}\), the item latent factor matrix \(\mathbf {Y}\), and the bias vectors \(\varvec{\theta }\) and \(\varvec{\rho }\), is given as:

$$\begin{aligned} \begin{aligned} p(\sigma ^2_R|\mathbf {R}, \mathbf {X}, \mathbf {Y}, \varvec{\theta }, \varvec{\rho }, \alpha _R, \beta _R)&= IG(\sigma ^2_R|\alpha ^*_R, \beta ^*_R)\\&\propto p(\mathbf {R}|\mathbf {X}, \mathbf {Y}, \varvec{\theta }, \varvec{\rho }, \sigma ^2_R)p(\sigma ^2_R|\alpha _R, \beta _R) \end{aligned} \end{aligned}$$
(26)

where:

$$\begin{aligned} \alpha ^*_R = \alpha _R + \frac{|\mathcal {R}|}{2}, \quad \beta ^*_R = \beta _R + \frac{1}{2}\sum _{(u,i)\in \mathcal {R}}\Big [r_{ui}-(\mathbf {x}_u^\top \mathbf {y}_i+\eta _{ui})\Big ]^2 \end{aligned}$$
(27)

The conditional distribution over \(\sigma ^{2}_{S}\) can be obtained using the same form.
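A sketch of this variance update (our own illustration; it assumes SciPy and that the residuals \(r_{ui}-(\mathbf {x}_u^\top \mathbf {y}_i+\eta _{ui})\) over the observed ratings are precomputed), following Eqs. (26)–(27):

```python
import numpy as np
from scipy.stats import invgamma

def sample_sigma2_R(residuals, alpha_R, beta_R, rng):
    """Draw sigma^2_R ~ IG(alpha*, beta*) following Eqs. (26)-(27)."""
    alpha_n = alpha_R + len(residuals) / 2.0       # alpha* = alpha_R + |R|/2
    beta_n = beta_R + 0.5 * np.sum(np.square(residuals))
    return invgamma(a=alpha_n, scale=beta_n).rvs(random_state=rng)
```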

Sampling \(\sigma ^2_\theta \) and \(\sigma ^2_\rho \). The conditional distribution over \(\sigma ^2_\theta \), conditioned on the bias terms of the users, is an inverse Gamma distribution:

$$\begin{aligned} \begin{aligned} p(\sigma ^2_\theta |\varvec{\theta }, \alpha _\theta , \beta _\theta )&= IG(\sigma ^2_\theta |\alpha ^*_\theta , \beta ^*_\theta ) \quad \propto \quad p(\varvec{\theta }|\sigma ^2_\theta )p(\sigma ^2_\theta |\alpha _\theta ,\beta _\theta ), \end{aligned} \end{aligned}$$
(28)

where:

$$\begin{aligned} \alpha ^*_\theta&= \alpha _\theta + \frac{N}{2}, \quad \beta ^*_\theta = \beta _\theta + \frac{1}{2}\sum _{u=1}^N\theta ^2_u \end{aligned}$$
(29)

The conditional distribution over \(\sigma ^2_\rho \) conditioned on the bias terms of items can be obtained using the same form.

Computational Complexity. From the formulas for posterior distribution sampling, we can observe that the most expensive computations lie in sampling the latent feature vectors (\(\mathbf {x}_u\), \(\mathbf {t}_i\), \(\mathbf {w}_i\), and \(\mathbf {z}_j\)), which requires computing matrix inverses. It is easy to show that, in each iteration, the complexity of sampling the latent feature vectors of N users (matrix \(\mathbf {X}\)) is \(\mathcal {O}(d^2|\mathcal {R}| + d^3N)\). Similarly, the complexities of sampling matrices \(\mathbf {T}\), \(\mathbf {W}\), and \(\mathbf {Z}\) are \(\mathcal {O}(d^2|\mathcal {R}| + d^3M)\), \(\mathcal {O}(d^2|\mathcal {S}| + d^3M)\), and \(\mathcal {O}(d^2|\mathcal {S}| + d^3M)\), respectively, where \(|\mathcal {R}|\) and \(|\mathcal {S}|\) are the numbers of observed ratings and observed clicks, respectively. However, note that the posterior distribution of \(\mathbf {x}_u\) does not depend on the other users; therefore, the sampling of matrix \(\mathbf {X}\) can be performed efficiently in parallel, as the sketch below illustrates. Sampling \(\mathbf {T}\), \(\mathbf {W}\), and \(\mathbf {Z}\) can be sped up in the same way.
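A sketch of this parallelization (our own illustration; joblib is one convenient option, not something prescribed by the paper, and sample_x_u refers to the sketch given after Eq. (18)):

```python
import numpy as np
from joblib import Parallel, delayed

def sample_X_parallel(per_user_data, mu_X, Sigma_X, sigma2_R):
    """Sample all user vectors independently; per_user_data[u] = (r_u, Y_u, eta_u)."""
    rows = Parallel(n_jobs=-1)(
        delayed(sample_x_u)(r_u, Y_u, eta_u, mu_X, Sigma_X, sigma2_R)
        for (r_u, Y_u, eta_u) in per_user_data)
    return np.vstack(rows)   # the new sample of X, one row per user
```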

3.3 Rating Prediction

The posterior predictive distribution of the unseen rating value \(\hat{r}_{ui}\) of item i by user u is obtained by integrating out the model parameters and hyperparameters:

$$\begin{aligned} \begin{aligned} p(\hat{r}_{ui}|\mathcal {O})&= \int \dots \int p(\hat{r}_{ui}|\varvec{\varOmega })p(\varvec{\varOmega }|\mathcal {O})\,d\varvec{\varOmega }, \end{aligned} \end{aligned}$$
(30)

where \(\mathcal {O}\) is the observed data and \(\varvec{\varOmega }\) is the set of all parameters.

The above posterior predictive distribution is analytically intractable, so we approximate it by sampling the parameters using the Gibbs sampler described in Sect. 3.2. The predictive distribution is approximated as follows:

$$\begin{aligned} \begin{aligned} p(\hat{r}_{ui}|\mathcal {O})&\approx \frac{1}{K}\sum _{k=1}^{K}p(\hat{r}_{ui}|\mathbf {x}^{(k)}_u, \mathbf {y}^{(k)}_i, \theta ^{(k)}_u, \rho ^{(k)}_i, (\sigma ^2_R)^{(k)})\\&= \frac{1}{K}\sum _{k=1}^{K}\mathcal {N}\Big (\hat{r}_{ui}|\eta ^{(k)}_{ui}+{\mathbf {x}^{(k)}_u}^\top \mathbf {y}^{(k)}_i,(\sigma ^2_R)^{(k)}\Big ), \end{aligned} \end{aligned}$$
(31)

where K is the number of samples taken from the posterior distribution, \((.)^{(k)}\) is the kth sample, and \(\eta ^{(k)}_{ui}=\mu +\theta ^{(k)}_u+\rho ^{(k)}_i\).

We consider two rating prediction tasks: (i) in-matrix prediction: predict the rating by user u of item i, where i has not been rated by u but has been rated by at least one other user (i.e., i appears at least once in the training set of the rating data); and (ii) out-matrix prediction: predict the rating by user u of item i, where i has not been rated by any user (i.e., i does not appear in the training set of the rating data).

In Eq. 31, \(\mathbf {y}^{(k)}_i=\mathbf {w}^{(k)}_i+\mathbf {t}^{(k)}_i\) for the in-matrix prediction task; \(\mathbf {y}^{(k)}_i=\mathbf {w}^{(k)}_i\) and \(\eta ^{(k)}_{ui}=\mu +\theta ^{(k)}_u\) for the out-matrix prediction task.
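Equation (31) gives the predictive density; its mean, which serves as the point prediction, can be approximated by averaging the per-sample means. A sketch (our own illustration; the per-sample parameters are assumed to be stored in a list of dicts with hypothetical keys) that also switches \(\mathbf {y}^{(k)}_i\) and \(\eta ^{(k)}_{ui}\) between the two tasks:

```python
import numpy as np

def predict_rating(samples, u, i, out_matrix=False):
    """Approximate E[r_ui | O] by averaging predictive means over K Gibbs samples (Eq. 31).

    samples: list of dicts with keys 'X', 'W', 'T', 'theta', 'rho', 'mu',
    one dict per retained Gibbs sample (hypothetical storage layout).
    """
    preds = []
    for s in samples:
        if out_matrix:                    # item i has no ratings: y_i = w_i, no item bias
            y_i = s['W'][i]
            eta = s['mu'] + s['theta'][u]
        else:                             # in-matrix prediction: y_i = w_i + t_i
            y_i = s['W'][i] + s['T'][i]
            eta = s['mu'] + s['theta'][u] + s['rho'][i]
        preds.append(eta + s['X'][u] @ y_i)
    return float(np.mean(preds))
```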

4 Empirical Study

4.1 Datasets

Data Description. We used three public datasets from different domains and of varying sizes. (1) MovieLens 1M (ML-1m): a dataset of user–movie ratings collected from MovieLens, an online film service. It contains 1 million ratings in the range 1–5 of 4000 movies by 6000 users, and is available at GroupLens. (2) MovieLens 20M (ML-20m): another dataset of user–movie ratings collected from MovieLens. It contains 20 million ratings in the range 1–5 of 27,000 movies by 138,000 users, and is also available at GroupLens. (3) Bookcrossing: a dataset collected in August and September 2004 from the Book-Crossing website. It contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit/implicit) of 271,379 books. We removed users and items that had no explicit feedback.

The MovieLens datasets contain only rating data, so we employed a preprocessing phase to obtain click data: we binarized the original rating data and interpreted the result as clicks. Furthermore, because rating data are typically only a small part of the click data, we randomly sampled the original ratings at different percentages, assuming that only these ratings were available (a sketch of this preprocessing is given after Table 1). Details of the datasets after preprocessing are shown in Table 1.

Table 1. Datasets obtained by selecting ratings from the original ratings of MovieLens datasets with different percentages
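A sketch of this preprocessing (our own reconstruction under the stated assumptions; the function and variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

def make_click_and_rating_data(R, keep_frac):
    """Binarize a full rating matrix into click data, then keep a random
    fraction of the ratings as the observed explicit feedback."""
    P = (R > 0).astype(np.int8)            # click data: "who rated what"
    observed = np.argwhere(R > 0)          # indices of all original ratings
    keep = rng.choice(len(observed), size=int(keep_frac * len(observed)),
                      replace=False)
    R_obs = np.zeros_like(R)
    rows, cols = observed[keep].T
    R_obs[rows, cols] = R[rows, cols]      # retained explicit feedback
    return P, R_obs
```

For example, make_click_and_rating_data(R, 0.1) would produce an ML1-10-style setting in which only 10% of the original ratings are kept as explicit feedback.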

4.2 Experimental Protocol

We used the click data and 80% of the rating data to train the model; the remaining 20% of the rating data was used as the test data to evaluate the model. In evaluating the in-matrix prediction task, when splitting data, we made sure that every item in the test set appeared at least once in the training set. In evaluating the out-matrix prediction task, we made sure that none of the items in the test set appeared in the training set (to ensure that none of the items in the test set had any previous ratings).

Evaluation Metric. We used Root Mean Square Error (RMSE) as the metric to measure the performance of the models. RMSE measures the deviation between the rating predicted by the model and the true rating (given by the test set), and is defined as follows.

$$\begin{aligned} RMSE=\sqrt{\frac{1}{|Test|}\sum _{(u,i)\in Test}(r_{ui}-\hat{r}_{ui})^2}, \end{aligned}$$
(32)

where |Test| is the size of the test set.
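For completeness, a one-line implementation of Eq. (32) (our own sketch):

```python
import numpy as np

def rmse(r_true, r_pred):
    """Root mean square error of Eq. (32) over the test set."""
    r_true, r_pred = np.asarray(r_true, float), np.asarray(r_pred, float)
    return float(np.sqrt(np.mean((r_true - r_pred) ** 2)))
```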

Competing Methods. For the in-matrix prediction task, we compared our method with the following baseline methods:

1. PMF [11]: a state-of-the-art method for rating prediction

2. BPMF [12]: the Bayesian treatment of PMF [11]

3. NMF (non-negative matrix factorization) [4]: a matrix factorization method that requires the components of the user and item factors to be non-negative

4. PIE [8]: the model described in Sect. 2.2

5. SVD++ [2]: a factor model that exploits both explicit and implicit feedback for rating prediction

For the out-matrix prediction task, we compared our proposed method with PIE [8], which is described in Sect. 2.2.

Parameter Settings. We varied the dimension of the latent space (\(d=20, 30, 50, 100\)) to study the performance of the models with respect to the dimensionality of the latent feature vectors.

For PMF, NMF and SVD++, we used grid search to find the optimal values of the hyperparameters that produced the best performance on a validation set. For the PIE model [8], we fixed \(\lambda =1\) and used grid search to find the optimal values of the remaining parameters that gave good performance on the validation set. For BPMF [12], hyperparameters were set following the original paper.

Regarding our proposed method, HBFM, for simplicity we set the hyperparameters as follows: \(\mathcal {W}_{\mathcal {F}_0}=\mathbf {I}_d\), \(\nu _{\mathcal {F}_0}=d\), \(\gamma _{\mathcal {F}_0}=1\), and \(\varvec{\mu }_{\mathcal {F}_0}=\mathbf {0}\) for \(\mathcal {F}\in \{X, T, W, Z\}\). We adopted uninformative priors for the noise variances; therefore, we set the hyperparameters of the inverse Gamma distributions as \(\alpha _R=\alpha _S=\alpha _\theta =\alpha _\rho =0\) and \(\beta _R=\beta _S=\beta _\theta =\beta _\rho =0\). For the Gibbs sampling process, we discarded the first 1000 samples as “burn-in”; the following 100 samples were used to approximate the posterior distributions.

4.3 Results

We report the RMSEs on the test datasets for the in-matrix and out-matrix prediction tasks in Tables 2 and 3, respectively. We can see that HBFM outperformed the competing methods for all values of d.

Table 2. Test RMSEs for different numbers of latent features
Table 3. Test RMSEs for the out-matrix prediction task

For small values of d (e.g., \(d=20, 50\)), PIE and HBFM perform better than the other methods, indicating the effectiveness of exploiting click data to boost the performance of rating prediction. When d exceeds 150, the test RMSEs of PMF, NMF, SVD++, and PIE tend to increase, whereas those of BPMF and HBFM continue to decrease. This is because, as d increases, the number of parameters grows and the models become more complex. PMF, NMF, SVD++, and PIE do not manage this complexity well and therefore tend to overfit. By contrast, BPMF and HBFM, which manage model complexity well, continue to improve their test RMSEs. This shows that a fully Bayesian model that captures the uncertainty of the model parameters is an effective approach for avoiding overfitting.

Impact of the Sparsity of the Dataset on the Methods. We studied the effectiveness of the proposed method for datasets with different levels of sparsity by training models with the ML1-10, ML1-20, ML1-50, ML20-10, ML20-20 and ML20-50 datasets. The test RMSEs are shown in Table 4.

Table 4. Test RMSEs for datasets with different levels of sparsity. The dimensionality of feature vectors is fixed: \(d=20\)

We can observe that denser rating data improved the test RMSE values for all methods. This is reasonable: when more rating data are available for training, the predictions are more accurate. When the data are extremely sparse (e.g., ML1-10 or ML20-10), although managing model complexity for sparse data is challenging, PIE and HBFM perform better than the other methods because they compensate for the sparsity of the rating data with the click data. For all settings, HBFM outperforms the competing methods. These results clearly show the effectiveness of exploiting click data and of managing model complexity on sparse datasets.

Performance for Different Segmentations of Users. We further tested the effectiveness of our method on different segments of users. We divided users into three segments based on the number of items they had rated and compared the performance of the methods for each group. These segments are: (i) low: users who provided fewer than 20 ratings; (ii) medium: users who provided at least 20 but fewer than 50 ratings; and (iii) high: users who provided 50 or more ratings.

Fig. 2.

Test RMSEs for different segmentations of users

The test RMSEs in Fig. 2 show that our method (HBFM) outperforms all competing methods for all user segments on all three datasets. From the results, we can also see that all methods perform better when more explicit feedback is available. This is reasonable because explicit feedback is much more reliable than implicit feedback for inferring users' preferences.

5 Discussion and Future Work

In this paper, we have proposed HBFM, a fully Bayesian model that combines explicit and implicit feedback to address the cold-start problem in collaborative filtering. The model is a Bayesian treatment of the PIE model [8], in which priors are placed on the hyperparameters, such as the covariance matrices of the latent feature vectors and the variance of the rating data. We developed a Gibbs sampling-based method to approximate the posterior distributions over the latent feature vectors of users and items. The experiments show that HBFM provides good control over model capacity and can be applied to models with large numbers of parameters and very sparse data.

Several future directions are possible. One is to make the model more flexible by developing a nonparametric algorithm that can efficiently determine the appropriate dimensionality of the latent feature vectors instead of tuning it empirically. Another direction is to generalize the model to accommodate different types of explicit feedback. In the present model, we assumed that the rating data were random variables with Gaussian distributions; this may not work well when the feedback is binary (e.g., like/dislike, purchase/not purchase), in which case a Bernoulli model may be more suitable.