
1 Introduction

Recommender systems, which help users sift through a large amount of information and provide personalized recommendation services, play an indispensable role in today's web services. Among recommendation models, collaborative filtering (CF) is one of the most widely used algorithms; it leverages users' historical interactions to infer their preferences. However, CF suffers from the cold-start problem, as some users may have few interactions.

With the rapid development of web services, various kinds of side information have become available for recommender systems. Such information forms the so-called heterogeneous information networks (HINs) [1] and can be used to alleviate the cold-start problem. Recently, some efforts use metapaths [8] or graph neural networks (GNNs) [1, 8] to learn the embeddings of users and items on HINs, since both metapaths and GNNs are capable of capturing high-order semantic relations. As cold-start users and items may have many more high-order neighbors than first-order ones, aggregating these neighbors can help learn better embeddings for them.

However, most existing models exploit the rich side information in a supervised manner [1, 8], where the supervision signals are still user-item interactions. As cold-start users and items have few interactions, their embeddings are not fully trained during the training process. Thus, the side information is not fully exploited, especially for cold-start users and items with rich side information. Besides, user-item interactions merely describe the direct interaction relation between users and items, while the various kinds of side information describe many other first-order and high-order relations that reflect different aspects of users and items. Therefore, the interactions can help learn the direct interaction relation better, but may introduce noise when guiding the learning of other relations.

To tackle the above problems, a feasible solution is to design a pre-training task specifically for assisting the aggregation of rich side information. However, most existing pre-training models are not designed for the HIN-based recommendation scenario [6, 8], in which the first-order neighbors of a user directly reflect one part of the user's preference, while the high-order neighbors imply another part. The two kinds of neighbors describe the preference from two perspectives and together form a more complete picture. Therefore, a key challenge is how to jointly consider them in the pre-training task.

In this paper, we propose a novel pre-training model named MHGP to exploit the rich side information in a HIN for cold-start recommendation. We first encode users and items in both first-order and high-order structure views with GNNs and three different attention mechanisms. Then, we collect users and items that are connected with each other by multiple metapaths as positive samples and leverage contrastive learning to make the first-order structure-view embeddings of positive samples similar, while also aligning their embeddings in the high-order structure view. Once the pre-training process converges, the pre-trained embeddings are fine-tuned with the recommendation model.

We conduct comparative experiments on three real-world datasets. The results demonstrate that our pre-training model can improve the performance of recommendation models in the cold-start scenario and outperforms several state-of-the-art pre-training GNN models.

2 Related Work

2.1 Pre-training GNNs

Recently, pre-training GNNs, which aims to improve the performance of GNNs, has attracted plenty of attention. The pre-training task can be performed with contrastive learning, as in DGI [7], DMGI [5] and GraphCL [10]. Other works perform the pre-training task in different ways, such as L2P-GNN [4], which adopts a meta-learning approach.

A few works aim to improve the recommendation task. The work in [2] simulates the cold-start scenario and takes embedding reconstruction as the pre-training task. SGL [9] performs graph data augmentation for contrastive learning, which can be implemented in a pre-training manner. Overall, these models cannot fully exploit the various types of nodes and relations in a HIN for pre-training to enhance the recommendation task.

2.2 Cold-Start Recommendation

In recent years, studies on the cold-start problem have mainly focused on two directions. One is how to leverage side information to learn better embeddings of users and items, such as DisenHAN [8] and HGRec [6]. The other is how to exploit the underlying patterns in the interactions. Most studies adopt GNNs to mine the high-order collaborative information behind the user-item bipartite graph, such as LightGCN [3]. However, these models exploit the high-order information in a supervised manner, so the embeddings of cold-start users and items are rarely trained as they have very few interactions.

Fig. 1. The MHGP framework.

3 The Proposed MHGP Model

In this section, we introduce our mixed-order heterogeneous graph pre-training (MHGP) model to pre-train the embeddings of users and items. The overall architecture is illustrated in Fig. 1.

3.1 First-order Structure View Encoding

As our purpose is to pre-train the embeddings of users and items for recommendation, we do not consider the embeddings of other node types. Therefore, in the HIN-based recommendation scenario, a user's first-order neighbors can be users or items, while an item's first-order neighbors can only be users.

Item’s First-Order Structure Encoding. Different users who have interacted with the same item may contribute differently to that item's representation. Therefore, we apply a node-level attention mechanism to encode the first-order structures of items:

$$\begin{aligned} \textbf{h}^F_i&=\sum _{u\in \mathcal {N}_i}\boldsymbol{\alpha }_{i,u}\textbf{h}_u,\end{aligned}$$
(1)
$$\begin{aligned} \boldsymbol{\alpha }_{i,u}&=\frac{\exp (LeakyReLU(\textbf{c}_n^\top [\textbf{h}_i\,||\,\textbf{h}_u]))}{\sum _{u'\in \mathcal {N}_i}\exp (LeakyReLU(\textbf{c}_n^\top [\textbf{h}_i\,||\,\textbf{h}_{u'}]))}, \end{aligned}$$
(2)

where \(\textbf{h}_u\in \mathbb {R}^d\) is the embedding of user u. \(\mathcal {N}_i\) is the first-order neighbor set of item i. \(\textbf{c}_n\in \mathbb {R}^{2d \times 1}\) is the attention vector, and \(\Vert \) denotes the concatenation operation.
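To make the computation concrete, here is a minimal PyTorch sketch of Eqs. (1)-(2); the function name and the assumption that neighbor embeddings are passed as a dense tensor are illustrative, not part of the model specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64                                   # embedding size
c_n = nn.Parameter(torch.randn(2 * d))   # attention vector c_n

def encode_item_first_order(h_i, h_neighbors):
    """h_i: (d,) item embedding; h_neighbors: (n, d) embeddings of users in N_i."""
    # score each (item, user) pair with LeakyReLU(c_n^T [h_i || h_u])
    pairs = torch.cat([h_i.expand_as(h_neighbors), h_neighbors], dim=-1)  # (n, 2d)
    scores = F.leaky_relu(pairs @ c_n)                                    # (n,)
    alpha = torch.softmax(scores, dim=0)                                  # Eq. (2)
    return (alpha.unsqueeze(-1) * h_neighbors).sum(dim=0)                 # Eq. (1)
```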

User’s First-Order Structure Encoding. For a user, the two types of first-order neighbors (items and users) contribute differently to his/her preference. Therefore, we design a hierarchical attention mechanism consisting of node-level attention and type-level attention to fully capture the influence of users' first-order neighbors:

$$\begin{aligned} \textbf{h}^F_u&=\boldsymbol{\beta }_1\sum _{i\in \mathcal {N}^I_u}\boldsymbol{\alpha }_{u,i}\textbf{h}_i+\boldsymbol{\beta }_2\sum _{v\in \mathcal {N}^U_u}\boldsymbol{\alpha }_{u,v}\textbf{h}_v, \end{aligned}$$
(3)

where \(\textbf{h}_i,\textbf{h}_v\in \mathbb {R}^d\) are the embeddings of item i and user v, respectively. \(\mathcal {N}^I_u\) and \(\mathcal {N}^U_u\) denote user u’s first-order neighbor sets of items and users, respectively. \(\boldsymbol{\alpha }_{u,i}\) and \(\boldsymbol{\alpha }_{u,v}\) are the node-level attention values, calculated similarly to \(\boldsymbol{\alpha }_{i,u}\). \(\boldsymbol{\beta }_1\) and \(\boldsymbol{\beta }_2\) are the type-level attention values:

$$\begin{aligned} \boldsymbol{\beta }_i&=\frac{\exp (\textbf{w}_i)}{\sum \limits _{j\in \left\{ 1,2\right\} } \exp (\textbf{w}_j)},\end{aligned}$$
(4)
$$\begin{aligned} \textbf{w}_i&=\frac{1}{|\mathcal {U}|}\sum \limits _{u\in \mathcal {U}}\textbf{c}_t^\top \tanh (\textbf{W}^F \textbf{h}^i_u+\textbf{b}^F), \end{aligned}$$
(5)

where \(\textbf{c}_t\in \mathbb {R}^d\) is the attention vector, \(\textbf{W}^F\in \mathbb {R}^{d \times d}\) and \(\textbf{b}^F\in \mathbb {R}^{d \times 1}\) are the learnable parameters.
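Below is a minimal sketch of the type-level attention in Eqs. (4)-(5), assuming PyTorch and that the node-level aggregations of the two neighbor types have already been computed for all users; all names are illustrative.

```python
import torch
import torch.nn as nn

d = 64
W_F = nn.Linear(d, d)               # implements W^F h + b^F
c_t = nn.Parameter(torch.randn(d))  # attention vector c_t

def type_level_weights(h_item_agg, h_user_agg):
    """h_item_agg, h_user_agg: (|U|, d) node-level aggregations per neighbor type."""
    w = torch.stack([
        (torch.tanh(W_F(h_item_agg)) @ c_t).mean(),   # Eq. (5), type 1 (item neighbors)
        (torch.tanh(W_F(h_user_agg)) @ c_t).mean(),   # Eq. (5), type 2 (user neighbors)
    ])
    return torch.softmax(w, dim=0)                    # Eq. (4): beta_1, beta_2

# Eq. (3): h_F_u = beta[0] * h_item_agg[u] + beta[1] * h_user_agg[u]
```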

3.2 High-order Structure View Encoding

In a HIN, we can obtain the high-order neighbors by exploiting the rich metapath-based neighbors [1]. As each metapath carries a specific semantic relation, different kinds of metapath-based neighbors imply different preference characteristics.

Metapath-Based Neighbor Generation. In a HIN, the whole graph can be described by several adjacency matrices, including the user-item interaction matrix \(\textbf{Y}\), each describing one kind of first-order relation. Thus, we can obtain the metapath-based neighbors by multiplying these adjacency matrices, e.g., \(\textbf{Y}\textbf{Y}^T\) for the metapath “user-item-user”. Afterwards, we set all nonzero values to 1 to form the final adjacency matrix.
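As an illustration, the following sketch builds the “user-item-user” adjacency matrix with SciPy sparse matrices, following the multiplication and binarization described above; the function name is illustrative.

```python
import numpy as np
import scipy.sparse as sp

def metapath_adjacency(Y: sp.csr_matrix) -> sp.csr_matrix:
    """Y: user-item interaction matrix. Returns the 0/1 adjacency of 'user-item-user'."""
    A = (Y @ Y.T).tocsr()            # nonzero where two users share at least one item
    A.data = np.ones_like(A.data)    # set all nonzero values to 1
    return A
```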

Metapath-Based Neighbor Aggregation. Assume that there are M metapaths \(\left\{ \varPhi _1, \varPhi _2, ..., \varPhi _M\right\} \) and that their corresponding adjacency matrices have been obtained. For each metapath \(\varPhi _m\), we use a GCN to aggregate the corresponding neighbors and obtain \(\textbf{h}^{\varPhi _m}_u\). Then, we apply a semantic-level attention mechanism to fuse the embeddings of all metapaths starting from user u:

$$\begin{aligned} \textbf{h}^H_u&=\sum _{m=1}^M\boldsymbol{\beta }_{\varPhi _m}\textbf{h}^{\varPhi _m}_u, \end{aligned}$$
(6)

where \(\boldsymbol{\beta }_{\varPhi _m}\) is the semantic-level attention value, calculated similarly to \(\boldsymbol{\beta }_1\) and \(\boldsymbol{\beta }_2\). The fused metapath-based embeddings of items are calculated in the same way and denoted by \(\textbf{h}^H_i\).
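A minimal sketch of the per-metapath aggregation and the fusion of Eq. (6) is given below; the mean-style one-layer GCN propagation is an assumption made for illustration, and dense adjacency tensors are used only for brevity.

```python
import torch

def aggregate_metapaths(A_list, h, beta):
    """A_list: list of M (n, n) 0/1 adjacency tensors; h: (n, d) node embeddings;
    beta: (M,) semantic-level attention values (computed as in Eqs. (4)-(5))."""
    per_path = []
    for A in A_list:
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        per_path.append((A @ h) / deg)            # mean-style one-layer GCN (assumption)
    H = torch.stack(per_path)                     # (M, n, d)
    return (beta.view(-1, 1, 1) * H).sum(dim=0)   # Eq. (6)
```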

3.3 Pre-training with Contrastive Learning

In recommendation scenarios there are always some users sharing similar preferences. Therefore, the embeddings of these users should be similar within the first-order structure view and within the high-order structure view, respectively. The same applies to items. We treat such users and items as positive samples and leverage contrastive learning to force the two kinds of embeddings of positive nodes to be consistent.

We first count how many kinds of metapaths connect each pair of nodes i and j, denoted by connectivity(i, j). For each node i, we select all the nodes j with \(connectivity(i, j) > 0\) and sort them in descending order of connectivity to form \(\mathcal {S}_i\). As \(\mathcal {S}_i\) can be very large and nodes with lower connectivity(i, j) values may introduce noise, we set a threshold \(T_\mathcal {S}\): if \(|\mathcal {S}_i|>T_\mathcal {S}\), we keep only the top-\(T_\mathcal {S}\) nodes as the positive nodes of node i.
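The selection procedure can be sketched as follows with NumPy arrays; excluding a node from its own positive set is an assumption made for illustration.

```python
import numpy as np

def select_positives(metapath_adjs, T_S):
    """metapath_adjs: list of (n, n) 0/1 arrays, one per metapath kind.
    Returns, for each node, up to T_S positive nodes ranked by connectivity."""
    connectivity = np.sum(metapath_adjs, axis=0)           # kinds of metapaths per node pair
    np.fill_diagonal(connectivity, 0)                      # exclude the node itself (assumption)
    positives = []
    for i in range(connectivity.shape[0]):
        candidates = np.nonzero(connectivity[i])[0]
        order = np.argsort(-connectivity[i, candidates])   # descending by connectivity
        positives.append(candidates[order][:T_S])
    return positives
```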

After obtaining the embeddings of the first-order and high-order structure views, we feed them into a feed-forward neural network to project them into the same semantic space. Then, the final loss is calculated as follows:

$$\begin{aligned} \mathcal {L}=\lambda \mathcal {L}_u+(1-\lambda )\mathcal {L}_i, \end{aligned}$$
(7)

where \(\mathcal {L}_u\) and \(\mathcal {L}_i\) denote the losses from user side and item side, respectively. \(\lambda \) is a learnable parameter to adaptively balance the importance of the two sides. The calculation of \(\mathcal {L}_u\) is given as follows:

$$\begin{aligned} \mathcal {L}_u&=\lambda _u\mathcal {L}^F_u+(1-\lambda _u)\mathcal {L}^H_u,\end{aligned}$$
(8)
$$\begin{aligned} \mathcal {L}^F_u&=\frac{1}{|\mathcal {U}|}\sum _{u\in \mathcal {U}}-\log \frac{\sum _{v\in \mathcal {S}_u} \exp (sim(\textbf{h}^F_u, \textbf{h}^F_v)/\tau )}{\sum _{w\in \mathcal {U}}\exp (sim(\textbf{h}^F_u, \textbf{h}^F_w)/\tau )},\end{aligned}$$
(9)
$$\begin{aligned} \mathcal {L}^H_u&=\frac{1}{|\mathcal {U}|}\sum _{u\in \mathcal {U}}-\log \frac{\sum _{v\in \mathcal {S}_u} \exp (sim(\textbf{h}^H_u, \textbf{h}^H_v)/\tau )}{\sum _{w\in \mathcal {U}}\exp (sim(\textbf{h}^H_u, \textbf{h}^H_w)/\tau )}, \end{aligned}$$
(10)

where \(sim(\cdot )\) denotes the cosine similarity. \(\tau \) denotes the temperature hyperparameter. \(\lambda _u\) is a learnable parameter to adaptively balance the importance of the two kinds of embeddings of users. The calculation of \(\mathcal {L}_i\) is similar to \(\mathcal {L}_u\).
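A minimal PyTorch sketch of the view-specific losses in Eqs. (9)-(10) is shown below; it assumes the embeddings have already been projected into the same semantic space and that every node has at least one positive sample.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h, pos_mask, tau=0.5):
    """h: (n, d) projected embeddings of one view; pos_mask: (n, n) bool matrix,
    True where the column node is a positive sample of the row node."""
    z = F.normalize(h, dim=-1)
    sim = torch.exp(z @ z.t() / tau)               # exp(cosine similarity / tau)
    pos = (sim * pos_mask).sum(dim=1)              # numerator: sum over S_u
    denom = sim.sum(dim=1)                         # denominator: sum over all nodes
    return -torch.log((pos + 1e-12) / denom).mean()
```

The user-side loss of Eq. (8) then combines the two view-specific losses with the learnable weight \(\lambda _u\), and Eq. (7) combines the user and item sides with \(\lambda \).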

3.4 Fine-Tuning with Recommendation Models

Many existing GNN-based recommendation models initialize the embeddings of users and items randomly, which may lead to local optima during training and further affect the performance of recommendation. To alleviate this problem, we use the pre-trained embeddings to initialize the recommendation model. The embeddings are further fine-tuned with the recommendation model under the supervision of interactions.
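The initialization step can be sketched as follows; the embedding attribute names of the recommendation model are illustrative and depend on the concrete implementation.

```python
import torch

def init_with_pretrained(rec_model, user_emb, item_emb):
    """Copy the pre-trained embeddings into the recommendation model's embedding tables."""
    with torch.no_grad():
        rec_model.user_embedding.weight.copy_(user_emb)
        rec_model.item_embedding.weight.copy_(item_emb)
    # The model is then trained as usual under the supervision of interactions.
```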

Table 1. Statistics of the datasets.

4 Experiments and Results

4.1 Experiment Settings

We conduct the experiments on three real-world datasets: Last.FM, Ciao and Douban Movie. All three datasets contain relatively few interactions and rich side information. The statistics of the datasets are summarized in Table 1.

We choose LightGCN [3] as the base recommendation model and choose three pre-training models DGI [7], DMGI [5] and SGL [9] for comparison.

For each dataset, we randomly choose \(x\%\) of the interactions as the training set and split the remaining interactions evenly into the validation set and the testing set. To simulate different cold-start environments, we set x to 20 and 40, respectively.
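For illustration, this split can be sketched as follows with pandas; the random seed and the DataFrame layout are assumptions.

```python
import pandas as pd

def split_interactions(df: pd.DataFrame, x: int, seed: int = 0):
    """Randomly keep x% of interactions for training and split the rest evenly."""
    train = df.sample(frac=x / 100, random_state=seed)
    rest = df.drop(train.index).sample(frac=1.0, random_state=seed)
    half = len(rest) // 2
    return train, rest.iloc[:half], rest.iloc[half:]   # train / validation / test
```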

All the pre-training models are trained from scratch, with an early stopping patience of 20 epochs. We tune the learning rate in \(\left\{ 0.01, 0.001, 0.0001\right\} \). For our MHGP, the number of GNN layers is set to 1 for both encoders. For LightGCN, the number of GNN layers is 2 and the embedding size is fixed to 64. We tune the other hyperparameters according to the original papers.

Table 2. Performance of top-20 recommendation with LightGCN as the base model.
Table 3. Ablation study results with \(20\%\) interactions as the training set. P for Precision@20, R for Recall@20 and N for NDCG@20.

4.2 Overall Performance Comparison

The overall performance is shown in Table 2. We can see that our pre-training model MHGP can consistently improve the performance of LightGCN, which demonstrates the effectiveness of MHGP for cold-start recommendation. In addition, the relative improvement increases as the training data decreases. This indicates that when the user-item interactions are sparse, our model can learn better embeddings of users and items by reasonably exploiting the rich side information. Besides, in most cases, our proposed model outperforms other state-of-the-art pre-training models. This indicates that MHGP is more suitable for the HIN-based recommendation task. By contrasting the first-order and high-order structures of the positive samples, MHGP can effectively capture the inherent structure information in a HIN and further benefit the recommendation task.

4.3 Ablation Study

We design two variants MHGP\(_f\) and MHGP\(_h\) to perform the ablation study. MHGP\(_f\) only considers the first-order neighbors while MHGP\(_h\) only considers the high-order neighbors. We compare them with MHGP and the results are given in Table 3. We can see that MHGP always achieves the best performance, indicating the necessity of jointly considering the two kinds of neighbors. Furthermore, all of them can improve the performance of LightGCN, which demonstrates the effectiveness of aggregating each kind of neighbor in pre-training. We also observe that MHGP\(_h\) performs better than MHGP\(_f\) on the Ciao and Douban Movie datasets. However, on the Last.FM dataset, the performance of MHGP\(_f\) is better. This is reasonable since the interactions are sparser and the side information is richer on Ciao and Douban Movie than on Last.FM.

5 Conclusion and Future Work

In this paper, we introduce a novel pre-training model MHGP to exploit the rich information in a HIN for enhancing cold-start recommendation. MHGP uses contrastive learning to force the embeddings of the first-order and high-order structures of positive nodes to be similar, and can thus learn better embeddings of users and items. Experiments show that MHGP outperforms other state-of-the-art pre-training GNN models. In future work, we will explore whether MHGP can benefit other recommendation scenarios such as sequential recommendation.