Keywords

1 Introduction

Recommender systems are pervasive in various web and e-commerce applications. The traditional approaches to personalized recommendation, such as neighborhood models and latent factor models, rely on the users’ explicit or implicit feedbacks on the items. Through training models on the historical data, the recommender systems can predict the unknown ratings or preferences of users on items. However, there are abundant link structures between users in social media or other web applications that can potentially sway user’s own opinion, because people may choose to change their own preferences to match others’ responses. Many recent studies have shown that it is beneficial to take those social structures into account when building a recommender system [6, 8, 10, 18], particularly in the setting when users have diverse interests or characteristics. Most social recommendation models rely on explicit feedback [14, 26]. The social structure is exploited to regularize the user preference vectors when modeling the user-item interactions [17]. Recently, several social recommendation models have been developed for implicit feedback data, such as SBPR [31] and SPMC [2]. SBPR uses an observation that users tend to assign higher ranks to items that their friends prefer. The impact of different friends on a user is treated as uniform. However, as noted in [2] that different friends could have a different amount of impact on a user. In SPMC, the impact of friends on a user is modeled as the closeness between them. However, the impact of friends on a user is static with respect to items. In this paper, we further assume that when predicting the preference of a user u towards different items, the impact of friends on u also depends on the items. For example, when user u considers to choose a Romance movie, she would be more likely influenced by some friends. In another time, she might be influenced by other friends when choosing a Thriller movie.

Based on this assumption, we propose a new social recommendation model, MAS (Multi-head Attentive Social recommendation) that explicitly exploits social impact of friends on users rating behaviors. Specifically, we first embed users and items into low-dimensional space through utilizing the user-item interaction. Then we enhance the user representations with a weighted sum of social representations of neighbors, which are learned through explicitly modeling the social relationship. The social representations are used as context information in multi-head attention network, which can stably estimate different influence of neighbors for users with respect to different items. We finally employ Bayesian Personalized Ranking framework to estimate the model’s parameters. The advantages of our new approach over the existing ones are confirmed by extensive experiments on three real-world datasets. Our contributions can be summarized as follows:

  1. 1.

    We design a multi-head attention network to learn the diverse social impact of different friends on a user with respect to each item.

  2. 2.

    We propose a new social recommendation model which incorporates the user-item interaction and social structure into a unified model. The learned preference representation and social representation are combined to represent users.

  3. 3.

    We conduct extensive experiments on three real-world datasets, which include performance comparisons with state-of-the-art methods. The experimental results demonstrate the effectiveness of our proposed approach, even for cold-start users.

2 Related Work

In recent years, the social link between users is used to improve the prediction accuracy of the recommender system [6, 18]. In [10], the authors propose Trustwalker which is based on a random walk model combining the trust-based and the collaborative filtering approach for recommendation. In [16], the authors propose to fuse the users’ tastes and their trusted friends’ favors together. Based on this intuition, they propose social trust ensemble to represent the formulation of the social trust restrictions on the recommender systems. Social regularization has been introduced in [17] to constrain matrix factorization objective functions. In [30], a circle-based recommendation system is developed through inferring category-specific social trust circles from available rating data combined with social network data. The out-performance of using explicit social information is also demonstrated in the experiments of [15].

In [13], the authors propose to combine contextual information and social information to improve quality of recommendations. The matrix factorization is applied on the sub-matrix which is partitioned based on various context, and the social information is incorporated into a social regularization term. In [22], local and global social relations are exploited for recommendation. In their framework, the user preferences of two socially connected users are correlated locally and the global user reputation score is used to weight the importance of their ratings. Motivated by the heuristic that individuals will affect each other during the process of reviewing, a truster and trustee model is proposed in [29] to map users into the same latent feature spaces but with different implications that can explicitly describe the feedback how users affect or follow the opinions of others. They synthesized the two models to one fusing model simultaneously fitting available ratings and trust ties. TrustSVD proposed in [8] extends SVD++ [11] with social trust information. The trust matrix is decomposed into trust-feature matrix and trustee-feature matrix. However, the above models are developed for explicit feedback data.

A few recent works exploit tie strength or user dependencies for building social recommendation systems. The effects of distinguishing strong and weak ties in social recommendation are studied in [27]. They use Jaccard similarity to approximate the tie strength and incorporate the distinction of strong and weak ties into the Bayesian Personalized Ranking method. In their following work [26], the personalized preference of strong and weak ties are learned and incorporated into probabilistic matrix factorization for social recommendation. Assuming users’ latent features follow a matrix variate normal distribution, the learnt positive and negative dependencies between users are incorporated into the probabilistic relational matrix factorization model proposed in [14]. In SREPS [12], the rating data, consumption data and social structure are incorporated to learn latent vectors in the essential preference space. In SERec model [25], the social information is used as social regularization and social boosting. Chaney et al. [3] propose social Poisson factorization, a Bayesian model that incorporates social network information.

Attention network has been successfully applied to tackle problems such as machine translation [1], and recommendation systems [28]. The advantages of recommendation models based on attention mechanism over traditional weighted contribution perspective [3] are demonstrated in recent work such as ACF [4] and NAIS [9]. The authors of [20] propose Attentive Recurrent Social Recommendation model that users’ temporal complex dynamic interests and the static interests are mixed to model users preferences over time. The temporal social influence over time is learned through a temporal attention network. However, the performance of Static Attentive Social Recurrent Recommendation is inferior to ARSR and our model, as demonstrated in experiments of [20] and ours.

3 Proposed Method

In this section, we begin by setting up the notations before delving our proposed method MAS (Multi-head Attentive Social Recommendation). First we present the component modeling user-item interaction. We then design and elaborate a social attention network which is used to learn the social impact. Furthermore, we also extend our basic model with multi-head attention mechanism. Finally, the MAS model is developed after the introduction of the component modeling social structure.

3.1 Notations and Problem Formulation

We denote the user set as \(U = \{u_1,u_2,\cdots ,u_n\}\) and the item set as \(I = \{i_1,i_2,\cdots ,i_m\}\). In the recommendation systems that exploit social structure, we are given a social network \(G = (U,E)\), where E is the set of social links between users, in addition to a rating matrix \(R = [x_{ui}]_{n \times m}\). In this paper, we consider social recommendation with implicit feedback data where the user-item interaction \(x_{ui} \in \{0,1\}\) denotes the feedback of user u on item i. We denote \(R_u\) and \(N_u\) as the positive and negative feedback of user u, respectively. The user-user interactions describe the social connections between users, such as truster/trustee relations or friendships. We denote the friends of user u by F(u).

Our goal is to recommend a list of items that a user may interested in by exploiting both the user-item interactions and social relationship between users.

3.2 Modeling User Feedback

As we aim to recommend top-N items to users, we employ Bayesian Personalized Ranking (BPR) as our basic learning model due to its effectiveness in exploiting the user-item implicit feedback. The basic assumption of individual pair-wise preference over two items used in BPR can be represented as follows:

$$\begin{aligned} i >_u j, i \in R_u, j \in N_u, \end{aligned}$$

where the partial ranking relationship \(i >_u j\) means that a user u is likely prefer a positive item \(i \in R_u\) to a negative item \(j \in N_u\) [19]. Given the training set \(D_S = \{(u,i,j): u \in U, i \in R_u, j \in N_u\}\), the generic optimization criterion of BPR (BPR-OPT) is defined as follows:

$$\begin{aligned} L_I = \sum _{(u,i,j) \in D_S} \ln \sigma (\hat{x}_{ui} - \hat{x}_{uj}) - \lambda _{\varTheta } ||\varTheta ||^2, \end{aligned}$$
(1)

where \(\sigma \) is the sigmoid function \(\sigma (z) = \frac{1}{1 + e^{-z}}\), and \(\lambda _{\varTheta }\) are model specific regularization parameters. The preference of the user u towards the item i, \(\hat{x}_{ui}\) can be estimated through a ranking model.

In this paper, we adopt the BPR-MF framework [19] where the user-item interaction \(\hat{x}_{ui}\) is estimated as follows:

$$\begin{aligned} \hat{x}_{ui} = q_i^Tp_u \end{aligned}$$
(2)

where \(p_u \in \mathbb {R}^d\) and \(q_i \in \mathbb {R}^d\) are latent preference vector of user u and item i, respectively.

Inspired by the work of node or social embedding [21], we introduce a social representation \(c_u \in \mathbb {R}^d\) to the user u. Then we could enhance the representation of user u as follows:

$$\begin{aligned} p_u + \sum _{v \in F(u)}\alpha _{uv} c_v \end{aligned}$$
(3)

where \(\alpha _{uv}\) measures the closeness between user u and v, or social impact of user v on user u.

Fig. 1.
figure 1

An illustration of one head social attention network

It is essentially the inherent user interest \(p_u\) adjusted by a weighted sum of her friends’ social preference vectors. The weight \(\alpha _{uv}\) could be set as \(\alpha _{uv} = 1/|F(u)|\) or \(\alpha _{uv} = 1/\sqrt{|F(u)}|\). However, the amount of impact of different friends on user u is overlooked in this simplest setting. Another choice of \(\alpha _{uv}\) is the closeness between user u and user v as proposed in SMPC [2]. However, as mentioned in Introduction section, our rating or buying behavior might be influenced by different friends on different items. Based on this assumption, we propose the following estimator \(\hat{x}_{ui}\):

$$\begin{aligned} \hat{x}_{ui} = q_i^T \left( p_u + \sum _{v \in F(u)}\alpha _{ui}(v) c_v \right) , \end{aligned}$$
(4)

where \(\alpha _{ui}(v)\) denotes the amount of impact of user v on user u for item i. It can be seen that this formulation subsumes previous ones. We design an social attention network to learn the weight \(\alpha _{ui}(v)\) in the next section.

3.3 Multi-head Social Attention

The goal of the social attention is to select friends who have large amount impact on user when predicting the behavior of a user on different items. We can derive a normalized score with a softmax function if we follow traditional settings of neural attention network [1]. Figure 1 illustrates the architecture of designed social attention network. Given the user’s preference vector \(p_u\), the friend’s social representation vector \(c_v\) and the item representation vector \(q_i\), we use a two-layer network to learn the attention weight \(\alpha _{ui}(v)\) as follows:

$$\begin{aligned} o_{ui}(v)&= p_u^TW_{p}c_v + q_i^TW_{q}c_v \\ \alpha _{ui}(v)&= softmax(o_{ui}(v)) \nonumber \end{aligned}$$
(5)
$$\begin{aligned}&=\frac{\exp (o_{ui}(v))}{\sum _{v' \in F(u)} \exp (o_{ui}(v'))} \end{aligned}$$
(6)

where \(W_p \in \mathbb {R}^{d \times d_a}\) and \(W_q \in \mathbb {R}^{d \times d_a}\) are parameter matrices. The first term \(p_u^TW_{p}c_v\) can be interpreted as the weighted social similarity between user u and user v, and the second term \(q_i^TW_{q}c_v\) can be interpreted as the preference of friend u to item i. Both terms are combined into a score \(o_{ui}(v)\) through the attention network and the final representation score \(o_{ui}(v)\) is computed by a nonlinear activation function \(\phi (\cdot )\). Empirically, we found that the rectified linear unit (ReLU) \(\phi (x)=max(0,x)\) works best due to its non-saturating gradient [7].

To capture more abundant information about the social representation and stabilize the learning process of attention network, we extend our model with multi-head attention mechanism [23]. To make sure the multi-head attention mechanism could learn different representations, we transform the input feature \(c_v\) through different learnable liner transformation as follows:

$$\begin{aligned} c_v^k=c_vW_c^k \end{aligned}$$
(7)

The multi-head attention network has multi-parameters \(W^k\) and \(c^k\). These independent attention mechanisms execute the transformation and output the following weight \(\alpha _{ui}^k(v))\):

$$\begin{aligned} o_{ui}^k(v)&= \phi (p_u^TW_{p}^kc_v^k + q_i^TW_{q}^kc_v^k) \end{aligned}$$
(8)
$$\begin{aligned} \alpha _{ui}^k(v)&= softmax(o_{ui}^k(v)) \end{aligned}$$
(9)

After that, we employ averaging strategy to combine the multi-head attention outputs as follows:

$$\begin{aligned} \alpha _{ui}(v) =\frac{1}{K}\sum _{k=1}^K\alpha _{ui}^k(v) \end{aligned}$$
(10)

3.4 Modeling Social Structure

In this section, we introduce the component modeling social structure which bridges the latent preference \(p_u\) of user u and social embedding \(c_u\). We achieve this goal through another instantiation of BPR. Given the social network or graph \(G = (U, E)\), the social proximity between a friend v and user u is estimated as \(\hat{y}_{uv} = p_u^Tc_v\). Borrowing the partial ranking assumption of BPR, we assume that \(\hat{y}_{uv} \ge \hat{y}_{uv_n}\) for \(v \in F(u)\) and \(v_n \in U \setminus F(u)\). That means the social proximity between neighbors should be larger than strangers. This yields the following objective function:

$$\begin{aligned} \mathcal {L}_G = \sum _{(u,v,v_n) \in D_G} \ln \sigma (\hat{y}_{uvv_n}) \end{aligned}$$
(11)

where \(\hat{y}_{uvv_n} = \hat{y}_{uv} - \hat{y}_{uv_n}\) and \(D_G = \{(u,v,v_n): u \in U, v \in F(u), v_n \in U \setminus F(u)\}\). From the perspective of recommender system, if two users are strongly connected, it is very likely that they have similar preference and the friend should have more impact on him.

3.5 The Unified MAS Model

In this section, we aim to develop a unified MAS model. By incorporating the objective functions for modeling the user feedback and social structure respectively as Eqs. (1) and (11), we arrive at the unified objective function for our MAS model as follows:

$$\begin{aligned} \mathcal {L}_{MAS}= & {} \mathcal {L}_I + \gamma \mathcal {L}_G - \lambda \mathcal {L}_r \end{aligned}$$
(12)

where \(\gamma \) and \(\lambda \) are parameters to trade-off the contribution of corresponding terms, and \(\mathcal {L}_r\) is the regularization term to avoid overfitting which is defined as follows:

$$\begin{aligned} \mathcal {L}_r =\frac{1}{2}(\sum _{u} ||p_u||^2 | + \sum _{v} ||c_v||^2 + \sum _{i} ||q_i||^2+||W^k||). \end{aligned}$$

The model parameters \(\{p_u,c_v,q_i,w_j\}\) are estimated by maximizing the above regularized objective function through stochastic gradient ascent. \(W^k\) is a collection of attention network parameters \(W_c^k\), \(W_p^k\) and \(W_q^k\).

Learning Details. We describe some learning details which are essential to re-implement our method.

Sampling Methods. Since we use the pairwise learning method BPR to optimize model parameters, the sampled training instances have great impact on the recommendation performance. Therefore, we adopt two sampling strategies for the interaction component and the social component. For sampling an instance in the interaction component, we uniformly sample a negative item j from \(N_u\) to build a training triple (uij) for user u and positive item i. For sampling an instance in the social component which is also a triple \((u,v,v_n)\), we use the negative sampling to draw the negative user \(v_n\) according to the noise distribution \(P_n(u)\). We set \(P_n(u) \propto d_{u}^{3/4}\) as used in [21], where \(d_u\) is the out-degree of user u.

Pre-training. In our model, we integrate the attention network to better model user preference, but this could lead to hard fitting because of the non-linearity of neural network and sensitivity to initialization. On the other hand, we only consider user’s neighbor relations during modeling social structure. In reality, users occur in similar contexts also tend to have similar preference. However, training complex social model and user preference simultaneously converges slowly and traps to local minimums. Besides, our aim is to train the recommendation model rather than to devise sophisticated model for social embedding. Therefore, we consider involving the pre-trained social embedding as prior knowledge. LINE [21] shows good performance for capture social structures for social embedding. We adopt the social embedding learned by LINE to initialize that of MAS which greatly facilitates the learning of the attention network and achieves better performance.

Training Process. We summarize the training process in Algorithm 1. To be specific, lines 1–3 initialize all embedding vectors and attention parameters; lines 6 performs uniform sampling; line 7 employs SGA to learn the embeddings and parameters in the interaction component. lines 10 performs negative sampling; line 11 employs SGA to learn the embeddings in the social component.

figure a

4 Experiments

In this section, we evaluate the effectiveness of the proposed MAS framework using three real-world datasets.

4.1 Experimental Settings

Datasets. Three publicly available datasets are used in our experiments, whose statistics are shown in Table 1.

  • CiaoFootnote 1. This dataset contains ratings on DVDs and social network which is crawled from a DVD community website in December, 2013.

  • EpinionsFootnote 2. Epinions is an online consumer review website. The dataset is collected from this website and also contains ratings and social network.

  • FoursquareFootnote 3. This dataset is collected from the popular location-based social network Foursquare with the ratings of venues and social relations.

All three datasets consist of both ratings and directed social links. We removed those items that have less than five ratings. Since feedbacks in these datasets are all explicit ratings from 1 to 5, we treat the items rated higher than 2 as positive feedback Following [27], we randomly choose 70% of each user’s positive items as the training set, 20% are held out as the validation set and the rest are used as the test set. The best parameters are determined according to the performance on validation set. Since a few top-ranked items can only be noticed by users [5], we use the top-K metrics like Precision, Recall, MAP, and NCDG to measure recommendation performance.

Table 1. Statistics of the datasets.

Baselines. To validate the performance of our approach, we compare with the following baselines which aim to recommend with implicit feedback data.

  • BPR. This is a recommendation method tailored to implicit feedback data combined with Matrix Factorization model [19].

  • SBPR. This is a state-of-the-art recommendation model that benefits from modeling social relations. Introduced by [31], the model is based on the assumption that users and their social circles should have similar tastes/preferences towards items.

  • SPFFootnote 4. This is a probabilistic model that incorporates social network information into Poisson factorization [3].

  • SARSE. Static Attentive Social Recurrent Recommendation is one of variants of Attentive Recurrent Social Recommendation model recently proposed in [20]. In SARSE, only the social information is leveraged to learn users’ stationary interests.

For convenience, we denote the model with averaging social embeddings as “ABPR”. In this paper, we focus on leveraging the social information to improve the recommendation performance. To be fair, we present the comparison results with SARSE, the static variant of DARSE.

Parameter Settings. For all models, we randomly initialize the parameters with Gaussian distribution, where the mean and standard deviation are 0 and 0.001, respectively. All models are trained with stochastic gradient descent with a batch size of 1024 at each iteration and the learning rate is 0.001 according to the turning process. We set the regularization parameter \(\lambda \) to 0.01. The parameters of our attention network are initialized with Glorot normal initializer. The latent factor number of attention network is 16 and the number of multi-head is 3. The weight of social component \(\gamma \) is 0.8. All these hyper-parameters are tuned on the validation set.

4.2 Performance Comparison

We first compare the performance of our models with other methods under different metrics. Then we investigate the performance comparison on cold-start users. Finally, we investigate the performance over users with different degrees.

Overall Comparison. Table 2 shows the performance comparison of four metrics with respect to the different recommendation methods on three datasets, where the embedding size includes 16 and 32. We have the following findings:

  • As shown in Table 2, our proposed model MAS achieves competitive results than the other methods. The relative improvements over the best baselines are 24.1%, 8.7% and 3.3% in terms of Recall@10 on three datasets, respectively. This may due to that MAS could capture the different impact of user’s friends effectively. In addition, the performance of MAS is better than SARSE which also adopts the attention mechanism, possibly because MAS assigns more appropriate social weights which is related to each item through the multi-head attention network. The improvements against SPF also verify the benefits of multi-head mechanism.

  • The proposed ABPR is a simplified version of MAS which only aggregates the friends’ embeddings averagely. Even so, ABPR still outperforms SBPR. In particular, in terms of Recall@10, the relative improvement over SBPR is about 6% on average. While SBPR indirectly exploits the social connections, it only utilizes ratings information of user’s friends. ABPR incorporates the interaction component and the social component together effectively which leads to better performance.

  • All models taking social connections into account perform better than BPR on all datasets, which indicates that integrating social information can effectively improve performance when this information could be collected. We also conduct experiments when the embedding size sets to 32. Our model still outperforms all baselines on all datasets in terms of different evaluation metrics. Meanwhile, the performance of all models improves, when the latent embedding size increases from 16 to 32. Figure 2 shows the performance on three datasets in terms of Recall@K when varying K.

Table 2. Overall performance comparison with different dimension: \(d = 16\) and \(d = 32\). (The best scores are in bold.)

Comparison on Cold-Start Users. We further investigate the performance of different methods on cold-start users who consume less than 5 items. As shown in Table 3, our approach MAS significantly outperforms competitors which confirms the advantages of two components’ incorporation and attention network again. The results indicate that MAS is effective for not only all users, but also cold-start users.

Comparison in Degrees. We then analyze the performance comparison on users with different degrees. Here, the degree of a user is the number of his friends. After removing isolated users, we divide the users into four groups (namely, 1–5, 6–15, 16–25 and 25+) in terms of their degrees. Then, we compare the models in terms of Recall@100. As can be seen from Fig. 3, our approach MAS performs the best on different groups in most cases. However, when the degree is quite large, MAS loses on Epinions and Foursquare. By examining the datasets, we find that there are only dozens of users in these groups which may causes the instability. However, MAS is more stable than ABPR on three datasets, which indicates the incorporation of attention mechanism brings more superiority.

Effect of Pre-training. To enhance our model, we also use a popular embedding method LINE [21] to initialize the social embedding which is able to capture the first-order and second-order proximity. From the Table 4, we can find that the performance of MAS with pre-trained social embeddings is improved significantly.

4.3 Attention Analysis

We also investigate the multi-head attention weights to show that our model is more explainable. To demonstrate this, we investigate the multi-head attention weights of a sampled user on target item shown in Table 5. Since ABPR aggregates the social embeddings averagely, it uniformly assigns weights to the four friends. In this case, the target item #885 and #5020 are sampled from the test set and should be ranked higher. From the Table 5, we can see that MAS predicts a larger score on the target items successfully. As a comparison, ABPR predicts a relatively smaller score. We have the following findings: on the one hand, the attention weights can represent the diverse impact of friends. MAS assigns a lower weight on the friend #779 and a larger weight on the other friends to the item #885. On the other hand, the attention weights of friends may vary when the user consumes different items. MAS assigns a lower weight on the friend #3250 to the item #5020 which is different from the item #885. The reason may because that the preferred items of friends and the relationship between them have considerable influence on the user’s decision.

Fig. 2.
figure 2

Performance under different values of K (measured by Recall@100).

Table 3. Cold-start performance comparison (embedding size 16).
Fig. 3.
figure 3

Performance comparison on users with different degrees (measured by Recall@100).

Table 4. Performance of MAS methods with (w/) and without (w/o) pre-trained social embeddings (embedding size 16).
Table 5. Attention weights of a sampled user on target item #885 and #5020 on Ciao. The user has four trustees which are shown in column 1 to 4, and the last column represents the prediction score.

5 Conclusion and Future Work

In this paper, we propose a new social recommendation model, MAS, to exploit the implicit feedback data and social structure. We design a social attention network to learn the impact of users’ friends when predicting users’ preference on different items. The experiments on three real-world datasets demonstrate the effectiveness of our approach. In the future, we plan to extend our social recommendation model through utilizing graph attention networks proposed in [24].