
1 Introduction

With the emergence of big data, recommender systems have become a core part of online services. The goal of a recommender system is to model user preferences by analyzing their historical data and to provide personalized recommendations. Collaborative filtering (CF) is an effective approach for recommender systems that aims to predict a user's rating of an item given the users' past rating history. Among various CF-based methods, matrix factorization (MF) is one of the most powerful approaches [2, 11].

MF-based algorithms represent user preferences and item attributes by latent feature vectors in a shared latent space. Typically, an MF-based algorithm finds the latent feature vectors of users and items by fitting the model with the observed ratings given in the user–item rating matrix [2, 11, 12, 15]. However, because each user can only rate a limited number of items, the rating matrix is usually extremely sparse. Therefore, the performance of rating prediction declines if an item has too few ratings; or in an extreme case, if an item has no prior ratings, the system cannot learn its latent feature vector, and thus cannot recommend it to any user. This problem is referred to as the cold-start problem.

To address this problem, a common approach is to exploit other information about users and items, known as side information or auxiliary information. There are various types of side information, depending on the item being recommended (e.g., the genres of movies or the text content of books). Such side information has been successfully combined with traditional CF-based algorithms to alleviate the cold-start problem. For instance, in [1, 14, 16], text content was exploited for article recommendations; music content was used for song recommendations [9]; and text content and category information were used for movie recommendations [10]. One limitation of these models is that such side information is not always available, or, in many cases, is available but insufficiently informative for describing items (e.g., some items are described by only a few keywords or very short texts).

This work focuses on exploiting another kind of feedback, known as implicit feedback (e.g., clicks, views, purchases), as the auxiliary data. The advantage of implicit feedback is that it is abundant and easily collected during users' interactions with the system, without requiring any additional effort from users.

Related Work. In [2], the authors proposed SVD++, which exploits implicit feedback to boost the performance of the original probabilistic matrix factorization (PMF) model [11]. However, in SVD++, the implicit feedback is a binary matrix that indicates “who rates what”, obtained by binarizing the rating data: an item has implicit feedback if and only if it has rating data. Therefore, this model cannot represent an item that has no ratings.

Co-rating [5] combines explicit and implicit feedback in a unified framework. In the model, the explicit feedback is normalized into the range [0, 1] and added to the implicit feedback matrix to form a single matrix. This matrix is then factorized to obtain the latent feature vectors of users and items. In this way, the feature vector of an item can be inferred even if the item has no rating data. The limitation of this method is that, after the final matrix is formed, implicit and explicit feedback can no longer be distinguished; therefore, the model cannot account for the uncertainty of the implicit feedback.

In [13], the authors proposed a method for combining implicit and explicit feedback using expectation maximization (EM). To predict ratings for an item for which rating data are not available, the rating is inferred from the ratings of its neighbors in terms of click data. However, the algorithm is an iterative EM procedure whose E-step is itself a matrix factorization; matrix factorization is therefore performed many times, making the method computationally expensive.

In [8], the authors proposed a probabilistic model that combines explicit and implicit feedback for making recommendations. In this model, the latent feature vector of an item for which rating data are not available can be learned directly from the implicit feedback data. The model is a combination of PMF [11] and an item embedding model that learns latent feature vectors for items from the implicit feedback. In detail, the model simultaneously factorizes the rating matrix and the positive point-wise mutual information (PPMI) matrix constructed from the click data. The item vectors are obtained from the factorization of the PPMI matrix and are then adjusted using the rating matrix. Although this model successfully incorporates implicit feedback when learning the latent feature vectors of items, it is prone to overfitting if the hyperparameters (i.e., the regularization parameters) are not tuned carefully. Hyperparameter tuning is usually very costly, especially when there are many hyperparameters or when the data are large.

Present Work. In this paper, we propose a fully hierarchical Bayesian treatment of the model proposed by Nguyen et al. [8]. Instead of finding a point estimate of the model parameters, our method infers their full posterior distribution to capture their uncertainty. The missing ratings are predicted by integrating out the latent feature vectors of users and items. To this end, we place Gaussian-inverse-Wishart priors on the mean vectors and covariance matrices of the latent feature vectors for the rating matrix and the PPMI matrix [8]. We develop a Markov chain Monte Carlo (MCMC)-based method for inferring the full posterior distribution.

2 Preliminary

Suppose we have N users and M items. For each user–item pair (u, i), there can be two types of feedback: explicit feedback (also known as rating data) and implicit feedback (also known as click data). The rating data are represented by a matrix \(\mathbf {R}\in \mathbb {R}^{N\times M}\), in which element \(r_{ui}\) is the rating of user u for item i; \(r_{ui}\) can be a real number or a binary value (e.g., like/dislike). The click data are represented by a binary matrix \(\mathbf {P}\in \{0,1\}^{N\times M}\), where \(p_{ui}=1\) indicates that user u has clicked item i at least once, and \(p_{ui}=0\) otherwise.

Generally, the rating matrix \(\mathbf {R}\) is extremely sparse with many missing values (i.e., \(r_{ui}\) is not observed). We are interested in predicting these missing ratings.

2.1 Item Embedding Model Based on Implicit Feedback Data

Word embedding techniques have shown their success in many natural language processing tasks [3, 6]. By viewing each item in a recommender system as a word, the same assumptions that underlie word embedding models can be applied to modeling items. In [8], the authors proposed an item embedding model based on implicit feedback data, i.e., a model that captures the relationships between items that are clicked by the same users.

In this item embedding model, item i is represented by two vectors: an item vector \(\mathbf {w}_i\) and a context vector \(\mathbf {z}_i\). The vectors have different roles: the item vector describes the distribution of the item, and the context vector describes the distribution of the co-occurrence of an item with other items in its context.

In [8], the authors proposed an item-embedding scheme in which the item vector and the context vector are obtained by factorizing the PPMI matrix corresponding to the click data:

$$\begin{aligned} PPMI(i,j)=\mathbf {w}_i^\top \mathbf {z}_j \end{aligned}$$
(1)

The PPMI matrix is obtained by replacing the negative values in the point-wise mutual information (PMI) matrix with zeros. Each element of the PMI matrix measures the association between two items based on their co-occurrence. Empirically, the PMI of items i and j can be estimated from the observed data:

$$\begin{aligned} PMI(i,j)=\log \frac{\#(i,j)|\mathcal {D}|}{\#(i)\#(j)}, \end{aligned}$$
(2)

where \(\#(i)\) and \(\#(j)\) are the numbers of times items i and j are clicked, respectively. \(\mathcal {D}\) is the set of item pairs that appear in the combined click history of all users. \(\#(i,j)\) is the number of users who clicked both i and j.
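To make the construction of Eqs. (1)–(2) concrete, the following minimal sketch (our own illustration, not code from [8]; it assumes the click data are given as a dense binary NumPy array, and here \(|\mathcal {D}|\) counts ordered co-clicked pairs, so the exact bookkeeping may differ from [8]):

```python
import numpy as np

def ppmi_matrix(P):
    """Build the PPMI matrix of Eqs. (1)-(2) from a binary click matrix P (N users x M items)."""
    P = P.astype(np.float64)
    co = P.T @ P                   # co[i, j] = #(i, j): users who clicked both i and j
    item = np.diag(co).copy()      # #(i): users who clicked item i
    np.fill_diagonal(co, 0.0)      # exclude self-pairs
    D = co.sum()                   # |D|: size of the set of co-clicked item pairs
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(co * D / np.outer(item, item))   # Eq. (2)
    pmi[~np.isfinite(pmi)] = 0.0   # pairs that never co-occur contribute nothing
    return np.maximum(pmi, 0.0)    # clip negative PMI values to obtain PPMI
```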

2.2 Probabilistic Model for Implicit and Explicit Feedback Data

After constructing the item embedding model from the implicit feedback, Nguyen et al. [8] proposed a model that combines implicit and explicit feedback in a unified framework (PIE). PIE is a combination of the item embedding model (described in Sect. 2.1) and matrix factorization of the rating data. In PIE, the latent feature vector \(\mathbf {y}_i\) of item i is obtained by adding a small deviation \(\mathbf {t}_i\) to the item vector \(\mathbf {w}_i\). The graphical model of PIE is shown in Fig. 1a.

The main drawback of this model is that parameter learning is a point estimation (the MAP estimate), which is prone to overfitting when the trained model is applied to unseen data. To avoid overfitting, the hyperparameters must be tuned carefully. One approach is grid search: we form a set of candidate hyperparameter configurations, train the model with each, and select the configuration that produces the best performance on a validation set. However, grid search is generally very costly, especially for large-scale data or when the number of hyperparameters is large.

A straightforward way to avoid hyperparameter tuning is to place priors on the hyperparameters and optimize the log-posterior over both the model parameters and the hyperparameters. In this way, the hyperparameters are learned from the data instead of being tuned manually. However, this solution does not significantly improve the generalization of the model because it is still a point estimation and cannot capture the uncertainty of the model parameters.

3 Proposed Method

To address the drawbacks described above, we propose a hierarchical, fully Bayesian model (HBFM) that can capture the uncertainty of the model parameters. Instead of approximating the posterior by its mode (the MAP estimate), we approximate the full posterior distributions of the model parameters.

3.1 The Model

We place Gaussian-inverse-Wishart priors on the mean vectors and covariance matrices of the latent feature vectors. The graphical model is shown in Fig. 1b.

Fig. 1.

Graphical models of PIE [8] and HBFM (this paper). Because of space limitations, we omit the bias terms.

We assume that \(r_{ui}\) and \(s_{ij}\) are Gaussian-distributed, where \(s_{ij}\) denotes element \((i,j)\) of the PPMI matrix:

$$\begin{aligned} p(r_{ui}|\mathbf {x}_u, \mathbf {y}_i, \theta _u, \rho _i, \sigma ^2_R)&= \mathcal {N}(r_{ui}|\mathbf {x}_u^\top \mathbf {y}_i+\eta _{ui}, \sigma ^2_R)\end{aligned}$$
(3)
$$\begin{aligned} p(s_{ij}|\mathbf {w}_i, \mathbf {z}_j, \sigma ^2_S)&= \mathcal {N}(s_{ij}|\mathbf {w}_i^\top \mathbf {z}_j, \sigma ^2_S), \end{aligned}$$
(4)

where \(\theta _u\) and \(\rho _i\) are the biases of user u and item i, respectively; \(\mathbf {y}_i = \mathbf {t}_i+\mathbf {w}_i\); \(\eta _{ui}=\mu + \theta _u + \rho _i\); and \(\mu \) is the global mean of the ratings.

The prior distributions of the latent feature vectors are assumed to be multivariate Gaussian distributions:

$$\begin{aligned} \begin{aligned} p(\mathbf {X}|\mathbf {\Theta }_X) = \prod _{u=1}^N \mathcal {N}(\mathbf {x}_u|\varvec{\mu }_X,\mathbf {\Sigma }_X), \quad p(\mathbf {T}|\mathbf {\Theta }_T) = \prod _{i=1}^M \mathcal {N}(\mathbf {t}_i|\varvec{\mu }_T,\mathbf {\Sigma }_T)\\ p(\mathbf {W}|\mathbf {\Theta }_W) = \prod _{i=1}^M \mathcal {N}(\mathbf {w}_i|\varvec{\mu }_W,\mathbf {\Sigma }_W), \quad p(\mathbf {Z}|\mathbf {\Theta }_Z) = \prod _{j=1}^M \mathcal {N}(\mathbf {z}_j|\varvec{\mu }_Z,\mathbf {\Sigma }_Z) \end{aligned} \end{aligned}$$
(5)

where \(\varvec{\mu }_X\), \(\varvec{\mu }_T\), \(\varvec{\mu }_W\), and \(\varvec{\mu }_Z\) are the mean vectors and \(\mathbf {\Sigma }_X\), \(\mathbf {\Sigma }_T\), \(\mathbf {\Sigma }_W\), and \(\mathbf {\Sigma }_Z\) are the covariance matrices of \(\mathbf {x}_u\), \(\mathbf {t}_i\), \(\mathbf {w}_i\), and \(\mathbf {z}_j\), respectively; \(\mathbf {\Theta }_X=\{\varvec{\mu }_X, \mathbf {\Sigma }_X\}\), \(\mathbf {\Theta }_T=\{\varvec{\mu }_T, \mathbf {\Sigma }_T\}\), \(\mathbf {\Theta }_W=\{\varvec{\mu }_W, \mathbf {\Sigma }_W\}\), and \(\mathbf {\Theta }_Z=\{\varvec{\mu }_Z, \mathbf {\Sigma }_Z\}\).

To model the uncertainty of the latent feature vectors, we do not fix the hyperparameters of these prior distributions. Instead, we further place Gaussian-inverse-Wishart priors on \(\mathbf {\Theta }_X\), \(\mathbf {\Theta }_T\), \(\mathbf {\Theta }_W\), and \(\mathbf {\Theta }_Z\):

$$\begin{aligned} p(\mathbf {\Theta }_X|\varvec{\varPhi }_{X_0}) = \mathcal {N}(\varvec{\mu }_X|\varvec{\mu }_{X_0},\mathbf {\Sigma }_{X}/\gamma _{X_0})\mathcal {W}^{-1}(\mathbf {\Sigma }_X|\mathcal {W}_{X_0}, \nu _{X_0}) \end{aligned}$$
(6)
$$\begin{aligned} p(\mathbf {\Theta }_T|\varvec{\varPhi }_{T_0}) = \mathcal {N}(\varvec{\mu }_T|\varvec{\mu }_{T_0},\mathbf {\Sigma }_{T}/\gamma _{T_0})\mathcal {W}^{-1}(\mathbf {\Sigma }_T|\mathcal {W}_{T_0}, \nu _{T_0}) \end{aligned}$$
(7)
$$\begin{aligned} p(\mathbf {\Theta }_W|\varvec{\varPhi }_{W_0}) = \mathcal {N}(\varvec{\mu }_W|\varvec{\mu }_{W_0},\mathbf {\Sigma }_{W}/\gamma _{W_0})\mathcal {W}^{-1}(\mathbf {\Sigma }_W|\mathcal {W}_{W_0}, \nu _{W_0}) \end{aligned}$$
(8)
$$\begin{aligned} p(\mathbf {\Theta }_Z|\varvec{\varPhi }_{Z_0}) = \mathcal {N}(\varvec{\mu }_Z|\varvec{\mu }_{Z_0},\mathbf {\Sigma }_{Z}/\gamma _{Z_0})\mathcal {W}^{-1}(\mathbf {\Sigma }_Z|\mathcal {W}_{Z_0}, \nu _{Z_0}), \end{aligned}$$
(9)

where: \(\varvec{\varPhi }_{X_0}=\{\varvec{\mu }_{X_0},\gamma _{X_0}, \mathcal {W}_{X_0}, \nu _{X_0}\}\), \(\varvec{\varPhi }_{T_0}=\{\varvec{\mu }_{T_0},\gamma _{T_0}, \mathcal {W}_{T_0}, \nu _{T_0}\}\), \(\varvec{\varPhi }_{W_0}=\{\varvec{\mu }_{W_0},\gamma _{W_0}, \mathcal {W}_{W_0}, \nu _{W_0}\}\), and \(\varvec{\varPhi }_{Z_0}=\{\varvec{\mu }_{Z_0},\gamma _{Z_0}, \mathcal {W}_{Z_0}, \nu _{Z_0}\}\).

Here, \(\mathcal {W}^{-1}\) is the inverse Wishart distribution with \(\nu _0\) degrees of freedom and a \(d\times d\) scaling matrix \(\mathcal {W}_0\):

$$\begin{aligned} \mathcal {W}^{-1}(\mathbf {\Sigma }|\mathcal {W}_0,\nu _0) = \frac{1}{C}|\mathbf {\Sigma }|^{-(\nu _0+d+1)/2}\exp \Big (-\frac{1}{2}\mathrm {Tr}(\mathcal {W}_0\mathbf {\Sigma }^{-1})\Big ), \end{aligned}$$
(10)

where C is a normalizing constant and \(\mathrm {Tr}(\cdot )\) is the trace of a matrix.
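As a quick numerical illustration (our own sketch; it assumes SciPy is available, whose scipy.stats.invwishart uses the same parameterization as Eq. (10)), a covariance matrix can be drawn from this prior as follows:

```python
import numpy as np
from scipy.stats import invwishart

d = 5                    # latent dimensionality
W0 = np.eye(d)           # scaling matrix W_0
nu0 = d + 2              # degrees of freedom nu_0 (nu_0 >= d)

# One draw of a d x d covariance matrix Sigma ~ W^{-1}(W_0, nu_0)
Sigma = invwishart(df=nu0, scale=W0).rvs(random_state=0)
assert np.allclose(Sigma, Sigma.T)  # draws are symmetric positive definite
```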

The Gaussian-inverse-Wishart prior is adopted because it is the conjugate prior of the multivariate Gaussian distribution. This choice allows the conditional distributions required for Gibbs sampling to be derived in closed form and sampled easily. Similarly, we place an inverse Gamma prior [17] on the variance \(\sigma ^2_R\):

$$\begin{aligned} p(\sigma ^2_R|\alpha _R, \beta _R) = IG(\sigma ^2_R|\alpha _R, \beta _R), \end{aligned}$$
(11)

where IG(.) is the inverse Gamma distribution [17]:

$$\begin{aligned} IG(x|\alpha , \beta )=\frac{\beta ^\alpha }{\varGamma (\alpha )}x^{-\alpha -1}\exp \Big (-\frac{\beta }{x}\Big ) \end{aligned}$$
(12)

Choosing the inverse Gamma distribution, the conjugate prior for the variance of a Gaussian distribution, makes sampling from the posterior straightforward. This distribution has also been shown to be effective for modeling the unknown variance of a Gaussian distribution [17].

We place Gaussian priors over the bias terms as follows.

$$\begin{aligned} p(\theta _u|\sigma ^2_\theta ) = \mathcal {N}(\theta _u|0, \sigma ^2_\theta ), \quad p(\rho _i|\sigma ^2_\rho ) = \mathcal {N}(\rho _i|0, \sigma ^2_\rho ), \end{aligned}$$
(13)

where \(\sigma ^2_\theta \) and \(\sigma ^2_\rho \) are given inverse Gamma priors [17]:

$$\begin{aligned} p(\sigma ^2_\theta |\alpha _\theta , \beta _\theta ) = IG(\sigma ^2_\theta |\alpha _\theta , \beta _\theta ), \quad p(\sigma ^2_\rho |\alpha _\rho , \beta _\rho ) = IG(\sigma ^2_\rho |\alpha _\rho , \beta _\rho ) \end{aligned}$$
(14)

We place an inverse Gamma prior [17] on the variance \(\sigma ^2_S\) of \(s_{ij}\):

$$\begin{aligned} p(\sigma ^2_S|\alpha _S, \beta _S) = IG(\sigma ^2_S|\alpha _S, \beta _S) \end{aligned}$$
(15)

3.2 Posterior Inference

Our goal is to find the posterior distribution of the model parameters. This posterior is analytically intractable, so we employ MCMC-based methods, which are widely used for approximating intractable distributions [7]. The key idea of these methods is to construct a Markov chain that converges to the posterior distribution of the model; each state of the chain is a set of model parameters, and the posterior is characterized by the samples drawn from the chain. In this paper, we use Gibbs sampling [7], a variant of MCMC that alternately samples each variable conditioned on the remaining variables.

Sampling \(\mathbf {x}_u\), \(\mathbf {t}_i\), \(\mathbf {w}_i\), and \(\mathbf {z}_j\). The conditional distribution over the user latent feature vector \(\mathbf {x}_u\), conditioned on the observed ratings, the latent feature vectors of the items, and the hyperparameters, is Gaussian:

$$\begin{aligned} \begin{aligned} p(\mathbf {x}_u|\mathbf {R},\mathbf {Y}, \varvec{\mu }_X, \varvec{\theta }, \varvec{\rho }, \mathbf {\Sigma }_X)&= \mathcal {N}(\mathbf {x}_u|\varvec{\mu }^*_{X_u},\mathbf {\Sigma }^*_{X_u})\\&\propto p(\mathbf {x}_u|\varvec{\mu }_X, \mathbf {\Sigma }_X)\prod _{i\in \mathcal {R}_u}\mathcal {N}(r_{ui}|\mathbf {x}_u^\top \mathbf {y}_i+\eta _{ui}, \sigma _R^2), \end{aligned} \end{aligned}$$
(16)

where \(\mathcal {R}_u\) denotes the set of items rated by user u, \(\eta _{ui}=\theta _u+\rho _i+\mu \), \(\varvec{\theta }=\{\theta _u\}_{u=1}^N\), \(\varvec{\rho }=\{\rho _i\}_{i=1}^M\), and

$$\begin{aligned} \mathbf {\Sigma }^*_{X_u}&=\Big (\mathbf {\Sigma }_{X}^{-1}+\frac{1}{\sigma ^2_R}\sum _{i\in \mathcal {R}_u}\mathbf {y}_i\mathbf {y}_i^\top \Big )^{-1}\end{aligned}$$
(17)
$$\begin{aligned} \varvec{\mu }^*_{X_u}&= \mathbf {\Sigma }^*_{X_u}\Big [\mathbf {\Sigma }_{X}^{-1}\varvec{\mu }_X+\frac{1}{\sigma ^2_R}\sum _{i\in \mathcal {R}_u}(r_{ui}-\eta _{ui})\mathbf {y}_i\Big ]. \end{aligned}$$
(18)

Similarly, we can obtain the posterior distributions of \(\mathbf {t}_i\), \(\mathbf {w}_i\), and \(\mathbf {z}_j\).
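A minimal sketch of this update (our own illustration with hypothetical helper names; the per-user quantities r_u, Y_u, and eta_u are assumed to be gathered beforehand) draws \(\mathbf {x}_u\) according to Eqs. (16)–(18):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_x_u(r_u, Y_u, eta_u, mu_X, Sigma_X, sigma2_R):
    """Draw x_u ~ N(mu*, Sigma*) following Eqs. (16)-(18).

    r_u   : observed ratings of user u (length |R_u|)
    Y_u   : latent vectors y_i of the items rated by u, stacked as rows (|R_u| x d)
    eta_u : bias terms eta_ui = mu + theta_u + rho_i for those items
    """
    precision = np.linalg.inv(Sigma_X) + (Y_u.T @ Y_u) / sigma2_R      # inverse of Eq. (17)
    Sigma_star = np.linalg.inv(precision)
    mu_star = Sigma_star @ (np.linalg.solve(Sigma_X, mu_X)
                            + (Y_u.T @ (r_u - eta_u)) / sigma2_R)      # Eq. (18)
    return rng.multivariate_normal(mu_star, Sigma_star)
```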

Sampling \(\varvec{\varTheta }_X=\{\varvec{\mu }_X, \varvec{\varSigma }_X\}\), \(\varvec{\varTheta }_T=\{\varvec{\mu }_T, \varvec{\varSigma }_T\}\), \(\varvec{\varTheta }_W=\{\varvec{\mu }_W, \varvec{\varSigma }_W\}\), and \(\varvec{\varTheta }_Z=\{\varvec{\mu }_Z, \varvec{\varSigma }_Z\}\). The posterior distribution over \(\varvec{\varTheta }_X=\{\varvec{\mu }_X, \varvec{\varSigma }_X\}\) conditioned on user latent feature vectors and \(\varvec{\varPhi }_{X_0}=\{\varvec{\mu }_{X_0}, \gamma _{X_0}, \mathcal {W}_{X_0}, \nu _{X_0}\}\) is a Gaussian inverse Wishart distribution:

$$\begin{aligned} p(\varvec{\mu }_X, \varvec{\varSigma }_X|\mathbf {X}, \varvec{\varPhi }_{X_0})&= \mathcal {N}(\varvec{\mu }_X|\varvec{\mu }^*_{X_0}, \varvec{\varSigma }_{X}/\gamma ^*_{X_0})\mathcal {W}^{-1}(\varvec{\varSigma }_X|\mathcal {W}^*_{X_0}, \nu ^*_{X_0})\end{aligned}$$
(19)
$$\begin{aligned}&\propto p(\mathbf {X}|\varvec{\mu }_X, \varvec{\varSigma }_X)p(\varvec{\mu }_X, \varvec{\varSigma }_X|\varvec{\varPhi }_{X_0}), \end{aligned}$$
(20)

where:

$$\begin{aligned} \varvec{\mu }^*_{X_0}&= \frac{\gamma _{X_0}\varvec{\mu }_{X_0}+N\bar{\mathbf {x}}}{\gamma _{X_0}+N}, \quad \gamma ^*_{X_0} = \gamma _{X_0} + N, \nu ^*_{X_0} = \nu _{X_0} + N \end{aligned}$$
(21)
$$\begin{aligned} \mathcal {W}^*_{X_0}&= \mathcal {W}_{X_0}+N\bar{\mathbf {S}}+\frac{\gamma _{X_0}N}{\gamma _{X_0}+N}(\varvec{\mu }_{X_0}-\bar{\mathbf {x}})(\varvec{\mu }_{X_0}-\bar{\mathbf {x}})^\top \end{aligned}$$
(22)
$$\begin{aligned} \bar{\mathbf {x}}&= \frac{1}{N}\sum _{u=1}^{N}\mathbf {x}_u, \quad \bar{\mathbf {S}}=\frac{1}{N}\sum _{u=1}^{N}(\mathbf {x}_u-\bar{\mathbf {x}})(\mathbf {x}_u-\bar{\mathbf {x}})^\top \end{aligned}$$
(23)

The posterior distributions over \(\varvec{\varTheta }_T\), \(\varvec{\varTheta }_W\), and \(\varvec{\varTheta }_Z\) can be obtained similarly and take exactly the same form.
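A sketch of this update (our own illustration; it assumes SciPy for the inverse-Wishart draw) that, as noted above, applies unchanged to \(\varvec{\varTheta }_T\), \(\varvec{\varTheta }_W\), and \(\varvec{\varTheta }_Z\):

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)

def sample_theta_X(X, mu0, gamma0, W0, nu0):
    """Draw (mu_X, Sigma_X) from the Gaussian-inverse-Wishart posterior, Eqs. (19)-(23)."""
    N, d = X.shape
    x_bar = X.mean(axis=0)
    S_bar = (X - x_bar).T @ (X - x_bar) / N                  # Eq. (23)
    gamma_n, nu_n = gamma0 + N, nu0 + N                      # Eq. (21)
    mu_n = (gamma0 * mu0 + N * x_bar) / gamma_n              # Eq. (21)
    W_n = (W0 + N * S_bar
           + gamma0 * N / gamma_n * np.outer(mu0 - x_bar, mu0 - x_bar))  # Eq. (22)
    Sigma_X = invwishart(df=nu_n, scale=W_n).rvs(random_state=rng)
    mu_X = rng.multivariate_normal(mu_n, Sigma_X / gamma_n)  # Eq. (19)
    return mu_X, Sigma_X
```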

Sampling bias terms \(\theta _u\) and \(\rho _i\). The posterior distribution over the user bias term \(\theta _u\) is Gaussian:

$$\begin{aligned} \begin{aligned} p(\theta _u|\mathbf {R,X,Y},\varvec{\rho },\sigma ^2_R,\sigma ^2_\theta )&= \mathcal {N}(\theta _u|\xi ^*_u,(\sigma ^*_{\theta _u})^2)\\&\propto p(\mathbf {R}|\mathbf {X,Y},\varvec{\theta },\varvec{\rho }, \sigma ^2_R)p(\theta _u|\sigma ^2_\theta ), \end{aligned} \end{aligned}$$
(24)

where:

$$\begin{aligned} (\sigma ^*_{\theta _u})^2 = \Big (\frac{1}{\sigma ^2_\theta }+\frac{|\mathcal {R}_u|}{\sigma ^2_R}\Big )^{-1}, \quad \xi ^*_u = \Big (\frac{\sigma ^*_{\theta _u}}{\sigma _R}\Big )^2\sum _{i\in \mathcal {R}_u}\Big [r_{ui}-(\mu +\rho _i+\mathbf {x}_u^\top \mathbf {y}_i)\Big ] \end{aligned}$$
(25)

The posterior distribution over \(\rho _i\) can be obtained using the same form.
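A sketch of the user-bias update (our own illustration; pred_u is a hypothetical vector holding \(\mu +\rho _i+\mathbf {x}_u^\top \mathbf {y}_i\) for the items rated by u), following Eqs. (24)–(25):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_theta_u(r_u, pred_u, sigma2_theta, sigma2_R):
    """Draw theta_u ~ N(xi*, (sigma*)^2) following Eqs. (24)-(25)."""
    var = 1.0 / (1.0 / sigma2_theta + len(r_u) / sigma2_R)   # (sigma*_theta_u)^2
    mean = var / sigma2_R * np.sum(r_u - pred_u)             # xi*_u
    return rng.normal(mean, np.sqrt(var))
```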

Sampling \(\sigma ^2_R\) and \(\sigma ^2_S\). The posterior distribution over \(\sigma ^2_R\), conditioned on the rating data, the user latent factor matrix \(\mathbf {X}\), the item latent factor matrix \(\mathbf {Y}\), and the bias vectors \(\varvec{\theta }\) and \(\varvec{\rho }\), is given as:

$$\begin{aligned} \begin{aligned} p(\sigma ^2_R|\mathbf {R}, \mathbf {X}, \mathbf {Y}, \varvec{\theta }, \varvec{\rho }, \alpha _R, \beta _R)&= IG(\sigma ^2_R|\alpha ^*_R, \beta ^*_R)\\&\propto p(\mathbf {R}|\mathbf {X}, \mathbf {Y}, \varvec{\theta }, \varvec{\rho }, \sigma ^2_R)p(\sigma ^2_R|\alpha _R, \beta _R) \end{aligned} \end{aligned}$$
(26)

where:

$$\begin{aligned} \alpha ^*_R = \alpha _R + \frac{|\mathcal {R}|}{2}, \quad \beta ^*_R = \beta _R + \frac{1}{2}\sum _{(u,i)\in \mathcal {R}}\Big [r_{ui}-(\mathbf {x}_u^\top \mathbf {y}_i+\eta _{ui})\Big ]^2 \end{aligned}$$
(27)

The conditional distribution over \(\sigma ^{2}_{S}\) can be obtained using the same form.
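A sketch of this variance update (our own illustration; it assumes SciPy and that the residuals \(r_{ui}-(\mathbf {x}_u^\top \mathbf {y}_i+\eta _{ui})\) over the observed ratings are precomputed), following Eqs. (26)–(27):

```python
import numpy as np
from scipy.stats import invgamma

def sample_sigma2_R(residuals, alpha_R, beta_R, rng):
    """Draw sigma^2_R ~ IG(alpha*, beta*) following Eqs. (26)-(27)."""
    alpha_n = alpha_R + len(residuals) / 2.0       # alpha* = alpha_R + |R|/2
    beta_n = beta_R + 0.5 * np.sum(np.square(residuals))
    return invgamma(a=alpha_n, scale=beta_n).rvs(random_state=rng)
```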

Sampling \(\sigma ^2_\theta \) and \(\sigma ^2_\rho \). The conditional distribution over \(\sigma ^2_\theta \), conditioned on the bias terms of the users, is an inverse Gamma distribution:

$$\begin{aligned} \begin{aligned} p(\sigma ^2_\theta |\varvec{\theta }, \alpha _\theta , \beta _\theta )&= IG(\sigma ^2_\theta |\alpha ^*_\theta , \beta ^*_\theta ) \quad \propto \quad p(\varvec{\theta }|\sigma ^2_\theta )p(\sigma ^2_\theta |\alpha _\theta ,\beta _\theta ), \end{aligned} \end{aligned}$$
(28)

where:

$$\begin{aligned} \alpha ^*_\theta&= \alpha _\theta + \frac{N}{2}, \quad \beta ^*_\theta = \beta _\theta + \frac{1}{2}\sum _{u=1}^N\theta ^2_u \end{aligned}$$
(29)

The conditional distribution over \(\sigma ^2_\rho \) conditioned on the bias terms of items can be obtained using the same form.

Computational Complexity. From the formulas for posterior distribution sampling, we can observe that the most expensive computations lie in sampling the latent feature vectors (\(\mathbf {x}_u\), \(\mathbf {t}_i\), \(\mathbf {w}_i\), and \(\mathbf {z}_j\)), which requires computing matrix inverses. It is easy to show that, in each iteration, the complexity of sampling the latent feature vectors of N users (matrix \(\mathbf {X}\)) is \(\mathcal {O}(d^2|\mathcal {R}| + d^3N)\). Similarly, the complexities of sampling matrices \(\mathbf {T}\), \(\mathbf {W}\), and \(\mathbf {Z}\) are \(\mathcal {O}(d^2|\mathcal {R}| + d^3M)\), \(\mathcal {O}(d^2|\mathcal {S}| + d^3M)\), and \(\mathcal {O}(d^2|\mathcal {S}| + d^3M)\), respectively, where \(|\mathcal {R}|\) and \(|\mathcal {S}|\) are the numbers of observed ratings and observed clicks, respectively. However, note that the posterior distribution of \(\mathbf {x}_u\) does not depend on the other users; therefore, the sampling of matrix \(\mathbf {X}\) can be performed efficiently in parallel, as the sketch below illustrates. Sampling \(\mathbf {T}\), \(\mathbf {W}\), and \(\mathbf {Z}\) can be sped up in the same way.
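A sketch of this parallelization (our own illustration; joblib is one convenient option, not something prescribed by the paper, and sample_x_u refers to the sketch given after Eq. (18)):

```python
import numpy as np
from joblib import Parallel, delayed

def sample_X_parallel(per_user_data, mu_X, Sigma_X, sigma2_R):
    """Sample all user vectors independently; per_user_data[u] = (r_u, Y_u, eta_u)."""
    rows = Parallel(n_jobs=-1)(
        delayed(sample_x_u)(r_u, Y_u, eta_u, mu_X, Sigma_X, sigma2_R)
        for (r_u, Y_u, eta_u) in per_user_data)
    return np.vstack(rows)   # the new sample of X, one row per user
```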

3.3 Rating Prediction

The posterior predictive distribution of the unseen rating value \(\hat{r}_{ui}\) of item i by user u is obtained by integrating out the model parameters and hyperparameters:

$$\begin{aligned} \begin{aligned} p(\hat{r}_{ui}|\mathcal {O})&= \int \dots \int p(\hat{r}_{ui}|\varvec{\varOmega })p(\varvec{\varOmega }|\mathcal {O})\,d\varvec{\varOmega }, \end{aligned} \end{aligned}$$
(30)

where \(\mathcal {O}\) is the observed data and \(\varvec{\varOmega }\) is the set of all parameters.

The above posterior predictive distribution is analytically intractable, so we approximate it by sampling the parameters using the Gibbs sampler described in Sect. 3.2. The predictive distribution is approximated as follows:

$$\begin{aligned} \begin{aligned} p(\hat{r}_{ui}|\mathcal {O})&\approx \frac{1}{K}\sum _{k=1}^{K}p(\hat{r}_{ui}|\mathbf {x}^{(k)}_u, \mathbf {y}^{(k)}_i, \theta ^{(k)}_u, \rho ^{(k)}_i, (\sigma ^2_R)^{(k)})\\&= \frac{1}{K}\sum _{k=1}^{K}\mathcal {N}\Big (\hat{r}_{ui}|\eta ^{(k)}_{ui}+{\mathbf {x}^{(k)}_u}^\top \mathbf {y}^{(k)}_i,(\sigma ^2_R)^{(k)}\Big ), \end{aligned} \end{aligned}$$
(31)

where K is the number of samples taken from the posterior distribution, \((.)^{(k)}\) is the kth sample, and \(\eta ^{(k)}_{ui}=\mu +\theta ^{(k)}_u+\rho ^{(k)}_i\).

We consider two rating prediction tasks: (i) in-matrix prediction: predict the rating by user u of item i, where i has not been rated by u but has been rated by at least one other user (i.e., i appears at least once in the training set of the rating data); and (ii) out-matrix prediction: predict the rating by user u of item i, where i has not been rated by any user (i.e., i does not appear in the training set of the rating data).

In Eq. 31, \(\mathbf {y}^{(k)}_i=\mathbf {w}^{(k)}_i+\mathbf {t}^{(k)}_i\) for the in-matrix prediction task; \(\mathbf {y}^{(k)}_i=\mathbf {w}^{(k)}_i\) and \(\eta ^{(k)}_{ui}=\mu +\theta ^{(k)}_u\) for the out-matrix prediction task.
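Equation (31) gives the predictive density; its mean, which serves as the point prediction, can be approximated by averaging the per-sample means. A sketch (our own illustration; the per-sample parameters are assumed to be stored in a list of dicts with hypothetical keys) that also switches \(\mathbf {y}^{(k)}_i\) and \(\eta ^{(k)}_{ui}\) between the two tasks:

```python
import numpy as np

def predict_rating(samples, u, i, out_matrix=False):
    """Approximate E[r_ui | O] by averaging predictive means over K Gibbs samples (Eq. 31).

    samples: list of dicts with keys 'X', 'W', 'T', 'theta', 'rho', 'mu',
    one dict per retained Gibbs sample (hypothetical storage layout).
    """
    preds = []
    for s in samples:
        if out_matrix:                    # item i has no ratings: y_i = w_i, no item bias
            y_i = s['W'][i]
            eta = s['mu'] + s['theta'][u]
        else:                             # in-matrix prediction: y_i = w_i + t_i
            y_i = s['W'][i] + s['T'][i]
            eta = s['mu'] + s['theta'][u] + s['rho'][i]
        preds.append(eta + s['X'][u] @ y_i)
    return float(np.mean(preds))
```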

4 Empirical Study

4.1 Datasets

Data Description. We used three public datasets from different domains and of varying sizes. (1) MovieLens 1M (ML-1m): a dataset of user–movie ratings collected from MovieLens, an online film service. It contains 1 million ratings in the range 1–5 of 4000 movies by 6000 users, and is available at GroupLens. (2) MovieLens 20M (ML-20m): another dataset of user–movie ratings collected from MovieLens. It contains 20 million ratings in the range 1–5 of 27,000 movies by 138,000 users, and is also available at GroupLens. (3) Bookcrossing: a dataset collected in August and September 2004 from the Book-Crossing website. It contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit/implicit) of 271,379 books. We removed users and items that had no explicit feedback.

The MovieLens datasets contain only rating data, so we employed a preprocessing phase to obtain click data: we binarized the original rating data and interpreted the result as clicks. Furthermore, because rating data are typically only a small part of the click data, we randomly sampled the original ratings at different percentages, assuming that only these ratings were available (a sketch of this preprocessing is given after Table 1). Details of the datasets after preprocessing are shown in Table 1.

Table 1. Datasets obtained by selecting ratings from the original ratings of MovieLens datasets with different percentages
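A sketch of this preprocessing (our own reconstruction under the stated assumptions; the function and variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

def make_click_and_rating_data(R, keep_frac):
    """Binarize a full rating matrix into click data, then keep a random
    fraction of the ratings as the observed explicit feedback."""
    P = (R > 0).astype(np.int8)            # click data: "who rated what"
    observed = np.argwhere(R > 0)          # indices of all original ratings
    keep = rng.choice(len(observed), size=int(keep_frac * len(observed)),
                      replace=False)
    R_obs = np.zeros_like(R)
    rows, cols = observed[keep].T
    R_obs[rows, cols] = R[rows, cols]      # retained explicit feedback
    return P, R_obs
```

For example, make_click_and_rating_data(R, 0.1) would produce an ML1-10-style setting in which only 10% of the original ratings are kept as explicit feedback.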

4.2 Experimental Protocol

We used the click data and 80% of the rating data to train the model; the remaining 20% of the rating data was used as the test data to evaluate the model. In evaluating the in-matrix prediction task, when splitting data, we made sure that every item in the test set appeared at least once in the training set. In evaluating the out-matrix prediction task, we made sure that none of the items in the test set appeared in the training set (to ensure that none of the items in the test set had any previous ratings).

Evaluation Metric. We used Root Mean Square Error (RMSE) as the metric to measure the performance of the models. RMSE measures the deviation between the rating predicted by the model and the true rating (given by the test set), and is defined as follows.

$$\begin{aligned} RMSE=\sqrt{\frac{1}{|Test|}\sum _{(u,i)\in Test}(r_{ui}-\hat{r}_{ui})^2}, \end{aligned}$$
(32)

where |Test| is the size of the test set.
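For completeness, a one-line implementation of Eq. (32) (our own sketch):

```python
import numpy as np

def rmse(r_true, r_pred):
    """Root mean square error of Eq. (32) over the test set."""
    r_true, r_pred = np.asarray(r_true, float), np.asarray(r_pred, float)
    return float(np.sqrt(np.mean((r_true - r_pred) ** 2)))
```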

Competing Methods. For the in-matrix prediction task, we compared our method with the following baseline methods:

1. PMF [11]: a state-of-the-art method for rating prediction

2. BPMF [12]: the Bayesian treatment of PMF [11]

3. NMF (non-negative matrix factorization) [4]: a matrix factorization method that requires the components of the user and item factors to be non-negative

4. PIE [8]: the model described in Sect. 2.2

5. SVD++ [2]: a factor model that exploits both explicit and implicit feedback for rating prediction

For the out-matrix prediction task, we compared our proposed method with PIE [8], which is described in Sect. 2.2.

Parameter Settings. We varied the dimension of the latent space (\(d=20, 30, 50, 100\)) to study the performance of the models with respect to the dimensionality of the latent feature vectors.

For PMF, NMF and SVD++, we used grid search to find the optimal values of the hyperparameters that produced the best performance on a validation set. For the PIE model [8], we fixed \(\lambda =1\) and used grid search to find the optimal values of the remaining parameters that gave good performance on the validation set. For BPMF [12], hyperparameters were set following the original paper.

Regarding our proposed method, HBFM, for simplicity we set the hyperparameters as follows: \(\mathcal {W}_{\mathcal {F}_0}=\mathbf {I}_d\), \(\nu _{\mathcal {F}_0}=d\), \(\gamma _{\mathcal {F}_0}=1\), and \(\varvec{\mu }_{\mathcal {F}_0}=\mathbf {0}\) for \(\mathcal {F}\in \{X, T, W, Z\}\). We adopted uninformative priors for the noise variances; therefore, we set the hyperparameters of the inverse Gamma distributions as \(\alpha _R=\alpha _S=\alpha _\theta =\alpha _\rho =0\) and \(\beta _R=\beta _S=\beta _\theta =\beta _\rho =0\). For the Gibbs sampling process, we discarded the first 1000 samples as “burn-in”; the following 100 samples were used to approximate the posterior distributions.

4.3 Results

We report the RMSEs on the test datasets for the in-matrix and out-matrix prediction tasks in Tables 2 and 3, respectively. We can see that HBFM outperformed the competing methods for all values of d.

Table 2. Test RMSEs for different numbers of latent features
Table 3. Test RMSEs for the out-matrix prediction task

For small values of d (e.g., \(d=20, 50\)), PIE and HBFM perform better than the other methods, indicating the effectiveness of exploiting click data to boost the performance of rating prediction. When d exceeds 150, the test RMSEs of PMF, NMF, SVD++, and PIE tend to increase, whereas those of BPMF and HBFM continue to decrease. This is because, as d increases, the number of parameters grows and the models become more complex. PMF, NMF, SVD++, and PIE do not manage this complexity well and therefore tend to overfit. By contrast, BPMF and HBFM, which manage model complexity well, continue to improve their test RMSEs. This shows that a fully Bayesian model that captures the uncertainty of the model parameters is an effective approach for avoiding overfitting.

Impact of the Sparsity of the Dataset on the Methods. We studied the effectiveness of the proposed method for datasets with different levels of sparsity by training models with the ML1-10, ML1-20, ML1-50, ML20-10, ML20-20 and ML20-50 datasets. The test RMSEs are shown in Table 4.

Table 4. Test RMSEs for datasets with different levels of sparsity. The dimensionality of feature vectors is fixed: \(d=20\)

We can observe that denser rating data improved the test RMSE values for all methods. This is reasonable: when more rating data are available for training, the predictions are more accurate. When the data are extremely sparse (e.g., ML1-10 or ML20-10), although managing model complexity for sparse data is challenging, PIE and HBFM perform better than the other methods because they compensate for the sparsity of the rating data with the click data. For all settings, HBFM outperforms the competing methods. These results clearly show the effectiveness of exploiting click data and of managing model complexity on sparse datasets.

Performance for Different Segmentations of Users. We further tested the effectiveness of our method on different segments of users. We divided users into three segments based on the number of items they had rated and compared the performance of the methods for each group. These segments are: (i) low: users who provided fewer than 20 ratings; (ii) medium: users who provided at least 20 but fewer than 50 ratings; and (iii) high: users who provided 50 or more ratings.

Fig. 2.

Test RMSEs for different segmentations of users

The test RMSEs in Fig. 2 show that our method (HBFM) outperforms all competing methods for all user segments on all three datasets. From the results, we can also see that all methods perform better when more explicit feedback is available. This is reasonable because explicit feedback is much more reliable than implicit feedback for inferring users' preferences.

5 Discussion and Future Work

In this paper, we have proposed HBFM, a fully Bayesian model that combines explicit and implicit feedback to address the cold-start problem in collaborative filtering. The model is a Bayesian treatment of the PIE model [8], in which priors are placed on the hyperparameters, such as the covariance matrices of the latent feature vectors and the variance of the rating data. We developed a Gibbs sampling-based method to approximate the posterior distributions over the latent feature vectors of users and items. The experiments show that HBFM provides good control over model capacity and can be applied to models with large numbers of parameters and very sparse data.

Several future directions are possible. One is to make the model more flexible by developing a nonparametric algorithm that can efficiently determine the appropriate dimensionality of the latent feature vectors instead of tuning it empirically. Another direction is to generalize the model to accommodate different types of explicit feedback. In the present model, we assumed that the rating data were random variables with Gaussian distributions; this may not work well when the feedback is binary (e.g., like/dislike, purchase/not purchase), in which case a Bernoulli model may be more suitable.