1 Introduction

Graph Convolutional Networks (GCNs), which can exploit the topological information of a user-item bipartite graph, have become widely used in recommender systems in recent years [1,2,3,4,5,6]. Depending on whether ego embeddings are combined with the aggregated neighbor embeddings, GCNs can be divided into two types: without self-propagation [3, 7] and with self-propagation [2, 8, 9].

The GCNs without self-propagation update user and item embeddings by aggregating the embeddings of their neighbors only. Taking user u as an example, the updated embedding of user u is \(\sum \nolimits _{i\in {{{\mathcal {N}}}_{u}}}{\frac{1}{{{p}_{u,i}}}{{\varvec{e}}_{i}}}\), where \(p_{u,i}\) is a normalization constant with several common choices: \(\sqrt{\vert {{{\mathcal {N}}}_{u}}\vert \vert {{{\mathcal {N}}}_{i}}\vert }\) (symmetric normalization), \(\vert {{{\mathcal {N}}}_{u}}\vert \) (left normalization), \(\vert {{{\mathcal {N}}}_{i}}\vert \) (right normalization) and other variants. \({{{\mathcal {N}}}_{u}}\) and \({{{\mathcal {N}}}_{i}}\) are the neighbor sets of user u and item i respectively, and \(\varvec{e}_{i}\) is the embedding of item i. The GCNs with self-propagation combine ego embeddings with neighbor embeddings in the embedding propagation stage. For instance, NGCF [9] and LR-GCCF [2] add self-loops to the propagation matrix, while PinSage [4] concatenates ego embeddings with neighborhood embeddings and applies feature transformation and non-linearity to update embeddings.

Fig. 1

A toy example of the target users’ embedding update process in item recommendation, where user and item embeddings are described by a set of numerical features. (a) is the aggregation process without self-propagation and (b) is the aggregation process with self-propagation. In this example, we use left normalization as the coefficient of neighbor embeddings and choose summation as the way of combining ego embeddings and neighborhood embeddings in (b)

These two types of GCNs both have limitations: (1) The GCNs without self-propagation aggregate only neighbor embeddings while ignoring ego embeddings. This embedding propagation mechanism discards ego embeddings, leading to the loss of nodes’ inherent information. As shown in Figure 1(a), user \(u_1\)’s income is at a relatively high level. Suppose that \(u_1\) has purchased some inexpensive daily necessities such as a pen (\(i_1\)) and a notebook (\(i_2\)). According to the embedding propagation rule, the value of \(u_1\)’s income feature is updated to 0.2 after performing the GCN. As a result, \(u_1\)’s income feature is diminished to a low level, which contradicts her inherent attributes. Under this circumstance, relatively expensive items receive low priority for \(u_1\) in recommender systems, so the needs of the target user cannot be met. (2) Although existing GCNs with self-propagation combine ego embeddings with neighbor embeddings, they treat all nodes in a uniform manner. These GCNs neglect the individuality and differences of users and items. Taking Figure 1(b) as an example, \(u_2\) and \(u_3\) have both read two history books (\(i_3\) and \(i_4\)). As the age feature shows, there is a large age gap between \(u_2\) and \(u_3\), and \(u_3\) is much older than \(u_2\). Suppose that \(u_3\) reads history books out of interest while \(u_2\) only reads them to finish homework for history lessons. However, the relative gap between \(u_2\) and \(u_3\) narrows after embedding propagation with uniform self-propagation, which obliterates the personalized differences between users to some extent. In this case, recommender systems will tend to recommend items with similar styles to \(u_2\) and \(u_3\), although they have different demands because of the age gap.

To tackle the aforementioned issues, we propose a novel GCN model, the Adaptive Self-propagation Graph Convolutional Network (ASP-GCN). Considering the importance of original features and the personalized properties and attributes of users and items, it is necessary to combine ego embeddings and neighborhood embeddings individually in the embedding propagation process. A simple method is to use a weight matrix and update it iteratively. However, the values of such a weight matrix are independent of the ego and neighborhood embeddings, which reduces model interpretability and recommendation performance. A better solution is a probability vector generation module that produces aggregation proportions according to the ego and neighborhood embeddings, and Gumbel-Softmax is a trick that can achieve this goal. Hence, ASP-GCN uses the Gumbel-Softmax trick to generate categorical distributions between two types of embeddings: neighborhood embeddings and hybrid embeddings, where hybrid embeddings consist of ego and neighborhood embeddings. After obtaining the categorical distributions, ASP-GCN proportionally aggregates neighborhood embeddings and ego embeddings in an adaptive way. This propagation rule simultaneously retains nodes’ inherent information and captures their distinctive characteristics. Note that ASP-GCN is essentially different from methods focusing on residual connections [10, 11]: ASP-GCN captures fine-grained user properties and item attributes layer by layer, while residual-connection methods only handle this in a monolithic or coarse-grained way. ASP-GCN uses the embeddings from the last layer for prediction because each layer proportionally retains embeddings from the previous layer, which makes layer combination unnecessary. Moreover, we optimize the Bayesian Personalized Ranking (BPR) loss function [12] with a similarity term. Besides the role of BPR, the similarity term draws the embeddings of connected nodes close to each other and pushes those of disconnected nodes far from each other, because connected nodes better reflect each other.

To summarize, the main contributions of our paper are as follows:

  • We study existing GCN-based recommendation models and empirically divide them into two categories: GCNs without self-propagation and GCNs with self-propagation. We hold the view that both of them have corresponding limitations, and we conduct a pilot experiment to support this view.

  • We propose a novel GCN-based model, Adaptive Self-propagation Graph Convolutional Network (ASP-GCN). ASP-GCN updates node embeddings by aggregating ego embeddings and neighborhood embeddings proportionally according to the categorical distribution estimated by Gumbel-Softmax.

  • Extensive experiments conducted on three publicly available datasets show that ASP-GCN outperforms several state-of-the-art approaches, which verifies the effectiveness of ASP-GCN.

2 Related work

In this section, we introduce the related work of two relevant technologies used in our work: Graph Convolutional Networks (GCNs) for recommendation and Gumbel-Softmax used to generate categorical distributions.

2.1 GCNs for recommendation

Graph Convolutional Networks (GCNs) generalize traditional Convolutional Neural Networks (CNNs) from Euclidean space to the graph domain. GCNs have been widely used in node classification [13,14,15], link prediction [16,17,18,19], traffic flow forecasting [20,21,22,23] and other fields due to their remarkable capacity for learning graph representations. In recommender systems, interactions between users and items can also be represented as graph structures, which has driven the adoption of GCNs in this field.

According to the domain that the convolution operations are applied on, existing GCN-based approaches can be divided into two categories: spectral GCNs [24,25,26,27] and spatial GCNs [3, 9, 28, 29]. Spectral GCNs perform convolution operations in the spectral domain to refine eigenvectors. They are computationally expensive because of costly operations such as Laplacian eigen-decomposition [27] and Chebyshev polynomials [24]. To tackle this problem, spatial GCNs have been proposed, which refine node (user and item) embeddings by aggregating neighbor embeddings. GC-MC [1] applies a graph convolutional network to recommender systems to exploit information from the user-item interaction graph structure. However, it stacks only one convolution layer, which means that only first-order connectivities can be captured. This is insufficient to model the high-order similarities of users and items, leading to the loss of useful information. To solve this problem, recent studies focus on stacking multiple convolution layers to exploit high-hop similarities of users and items from the user-item interaction graph [2, 3, 9, 30, 31]. NGCF [9] and PinSage [4] use multi-layer graph structures to capture high-order collaborative filtering signals to update user and item embeddings. Despite their success, the feature transformation and non-linearity operations involved in the convolution process increase time and space consumption. To tackle this problem, LR-GCCF [2] removes the non-linearity operation. Going a step further, LightGCN [3] removes feature transformation and non-linearity simultaneously, since these two operations are not only costly but also make the model difficult to train.

Since ASP-GCN belongs to the domain of spatial GCNs, we compare ASP-GCN with GC-MC, PinSage, NGCF, LR-GCCF and LightGCN from the following perspectives: (1) whether it uses feature transformation; (2) whether it uses non-linearity; (3) whether it uses a residual network structure to make predictions; (4) whether it has self-propagation. The comparison details are shown in Table 1.

Table 1 The comparison between ASP-GCN and other state-of-the-art GCN-based methods

2.2 Gumbel-Softmax

The Gumbel-max trick [32,33,34,35] provides a way to draw a one-hot vector from a categorical distribution, which can be used to select features. However, the argmax operation involved is non-differentiable, so gradients cannot flow through Gumbel-max. For this reason, its application in neural networks is limited. To this end, Gumbel-Softmax [36, 37] was proposed to handle the non-differentiability problem. The Gumbel-Softmax distribution is a continuous distribution over the simplex that approximates the categorical distribution via the reparameterization trick. It has been widely used to learn optimal categorical distributions. One such application is neural architecture search (NAS) [38,39,40,41,42], where Gumbel-Softmax generates distributions over several pre-defined operations so as to optimize the neural architecture. Another example is graph sparsification [43, 44], in which Gumbel-Softmax is utilized to decide whether an edge is preserved in the sparsified graph; this reduces the time and space complexity of graph-based methods and improves model robustness. Moreover, Kong et al. [45] investigate whether linear or non-linear propagation is better for recommender systems, and utilize Gumbel-Softmax to generate categorical distributions between the two propagation modes of nodes: linear and non-linear.

3 Motivation

To demonstrate the motivation of our research, we conduct a pilot experiment to verify the rationale behind ASP-GCN. Specifically, we design two variants based on LightGCN and compare the performances of the three methods.

  • LightGCN, which is a typical GCN-based method without self-propagation in the embedding propagation stage. The convolution operation (a.k.a., propagation rule) is: \(\sum \limits _{i\in {{{\mathcal {N}}}_{u}}}{\frac{1}{\sqrt{\vert {{{\mathcal {N}}}_{u}}\vert \vert {{{\mathcal {N}}}_{i}}\vert }}{{\varvec{e}}_{i}}}\) (taking user u for example), where \({{\mathcal {N}}}_{u}\) and \({{\mathcal {N}}}_{i}\) are the neighbor sets of user u and item i; \(\varvec{e}_i\) is the embedding of item i.

  • LightGCN-s, which adds uniform self-propagation in the embedding propagation stage. The convolution operation (a.k.a., propagation rule) is: \(mean({{\varvec{e}}_{u}},\sum \limits _{i\in {{{\mathcal {N}}}_{u}}}{\frac{1}{\sqrt{\vert {{{\mathcal {N}}}_{u}}\vert \vert {{{\mathcal {N}}}_{i}}\vert }}{{\varvec{e}}_{i}}})\). Because of the self-propagation mechanism, we only use the user and item embeddings from the last convolution layer to estimate user preferences over items.

  • LightGCN-ws, which adds weighted self-propagation in the embedding propagation stage. The convolution operation (a.k.a., propagation rule) is: \({{w}_{1}}{{\varvec{e}}_{u}}+{{w}_{2}}\sum \limits _{i\in {{{\mathcal {N}}}_{u}}}{\frac{1}{\sqrt{\vert {{{\mathcal {N}}}_{u}}\vert \vert {{{\mathcal {N}}}_{i}}\vert }}{{\varvec{e}}_{i}}}\), where \(w_1\) and \(w_2\) are trainable weights processed by a softmax function. Here, we also use the user and item embeddings from the last convolution layer to estimate user preferences over items. A minimal code sketch of the three propagation rules is given right after this list.
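
To make the distinction among these three propagation rules concrete, the sketch below writes each of them in matrix form. It is a minimal PyTorch-style illustration under our own assumptions: `A_hat` denotes a precomputed sparse symmetric-normalized adjacency matrix and `E` the stacked user/item embedding matrix; these names are illustrative and not taken from the original implementations.

```python
import torch

def propagate_lightgcn(E, A_hat):
    # LightGCN: aggregate neighbor embeddings only, no self-propagation.
    return torch.sparse.mm(A_hat, E)

def propagate_lightgcn_s(E, A_hat):
    # LightGCN-s: uniform self-propagation, the element-wise mean of
    # ego embeddings and aggregated neighbor embeddings.
    return 0.5 * (E + torch.sparse.mm(A_hat, E))

def propagate_lightgcn_ws(E, A_hat, w):
    # LightGCN-ws: weighted self-propagation; w is a trainable tensor of
    # shape (2,) whose softmax gives the mixing weights w1 and w2.
    w1, w2 = torch.softmax(w, dim=0)
    return w1 * E + w2 * torch.sparse.mm(A_hat, E)
```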

For these three methods, we maintain the same value for each hyperparameter. Specifically, the embedding size is fixed to 16 and the number of layers is set to 3. Besides, other hyperparameters (e.g., learning rate, regularization coefficient, dropout ratio) are all set to the same values. We report the performances of these three methods on Gowalla and Yelp datasets in Figure 2.

Fig. 2

Training curves (testing Recall, NDCG and training loss) of LightGCN and its two variants on Gowalla and Yelp datasets

As shown in Figure 2, LightGCN-s has a similar tendency to LightGCN in terms of training loss, but achieves higher testing Recall and NDCG values on both the Gowalla and Yelp datasets. This indicates that self-propagation has a positive effect in the embedding propagation stage and can improve recommendation performance to some extent. Further, LightGCN-ws consistently and significantly outperforms LightGCN and LightGCN-s in terms of testing Recall and NDCG on both datasets. Besides, as can be seen, LightGCN-ws greatly accelerates the convergence of the GCN. From this evidence, we infer that both GCNs without self-propagation and GCNs with uniform self-propagation are insufficient to model user preferences and provide satisfying recommendations. The propagation rule with weighted self-propagation is an effective and efficient alternative to the aforementioned two kinds of GCNs.

Although effective, LightGCN-ws resorts to a simple weighted way of aggregating neighbor embeddings and ego embeddings: it uses trainable weights to determine the proportion between them. By doing so, it can mitigate the problems of both GCNs without self-propagation and GCNs with uniform self-propagation. However, the weights are independent of the user and item embeddings, which leads to high randomness in the learned proportions in practice and limits the recommendation performance to some extent. Therefore, it is necessary to take user and item embeddings into account and estimate the proportion between neighbor embeddings and ego embeddings in an adaptive way. This inspires the design of ASP-GCN.

Fig. 3

The overall workflow of ASP-GCN. The upper part is the neighborhood embeddings generation process. The lower part is the hybrid embeddings generation process. The middle part is the adaptive aggregation process, in which neighborhood embeddings and hybrid embeddings are aggregated proportionally

4 Methodology

In this section, we first present the overall workflow of ASP-GCN. Then, we illustrate how the Gumbel-Softmax trick is used to generate categorical distributions between two types of embeddings: neighborhood embeddings and hybrid embeddings. Next, we introduce the multi-layer convolution operations of ASP-GCN, which are used to update user and item embeddings. Finally, we present the optimized BPR loss function used to refine the model parameters. The overall workflow of ASP-GCN is shown in Figure 3.

4.1 Categorical distribution generation module

In some existing GCN-based methods, only neighbor embeddings are involved in the embedding aggregation procedure, while ego embeddings are ignored. For example, the graph convolution operation (a.k.a., propagation rule) in LightGCN is defined as (taking user u for example):

$$\begin{aligned} {{[\varvec{e}_{u}^{(1)}]}_{N}}=\sum \limits _{i\in {{{\mathcal {N}}}_{u}}}{\frac{1}{\sqrt{\vert {{{\mathcal {N}}}_{u}}\vert \vert {{{\mathcal {N}}}_{i}}\vert }}\varvec{e}_{i}^{(0)}} \end{aligned}$$
(1)

“N” denotes that only neighbor embeddings are used in the update stage. \(\varvec{e}_{i}^{(0)}\) is the initial embedding of item i and \(\varvec{e}_{u}^{(1)}\) is the updated embedding of user u. \({{{\mathcal {N}}}_{u}}\) and \({{{\mathcal {N}}}_{i}}\) are the neighbor sets of user u and item i.

This kind of GCN discards ego embeddings while aggregating neighbor embeddings, leading to the loss of inherent information of users and items. Considering the importance of original features, some other GCN-based methods combine ego embeddings when updating the embedding of the target node. For instance, NGCF [9] adds self-loops to the propagation matrix and PinSage [4] concatenates ego embeddings after neighborhood aggregation. Although ego embeddings are involved in these methods, they handle the two types of embeddings in a uniform way. Specifically, LR-GCCF’s propagation rule for user u is \((\frac{1}{\vert {{{\mathcal {N}}}_{u}}\vert }{{\varvec{e}}_{u}}+\sum \limits _{i\in {{{\mathcal {N}}}_{u}}}{\frac{1}{\vert {{{\mathcal {N}}}_{u}}\vert \vert {{{\mathcal {N}}}_{i}}\vert }{{\varvec{e}}_{i}}})\varvec{W}\) [2], where \(\varvec{W}\) is the trainable feature transformation matrix. The proportions between ego embeddings and neighbor embeddings are determined by the neighbor numbers of the target node and its neighbor nodes. In other words, once the graph structure is fixed, the proportions between ego embeddings and neighbor embeddings are fixed in the embedding aggregation stage. PinSage updates the embedding of the target node by concatenating the ego embedding and the aggregated neighbor embeddings: \(\sigma (({{\varvec{e}}_{u}}\vert \vert {{\varvec{e}}_{{{{\mathcal {N}}}_{u}}}})\varvec{W})\), so that ego embeddings and neighborhood embeddings share the same rule in the embedding propagation process. These methods treat ego embeddings and neighbor embeddings uniformly, neglecting the distinct roles that ego and neighbor embeddings play in the embedding update of each node. To tackle the aforementioned problems, we design another kind of convolution operation that combines ego embeddings and neighborhood embeddings:

$$\begin{aligned} {{[\varvec{e}_{u}^{(1)}]}_{H}}=\varvec{e}_{u}^{(0)}+\sum \limits _{i\in {{{\mathcal {N}}}_{u}}}{\frac{1}{\sqrt{\vert {{{\mathcal {N}}}_{u}}\vert \vert {{{\mathcal {N}}}_{i}}\vert }}\varvec{e}_{i}^{(0)}} \end{aligned}$$
(2)

“H” denotes that both neighbor embeddings and ego embeddings are used in the update stage. After obtaining the two kinds of update signals for user u shown in Eqs. 1 and 2, it is necessary to estimate the proportion of each of them in the embedding aggregation process. Once the corresponding proportions of each kind of update signal are obtained for each user and item, the ego embeddings and neighbor embeddings can be aggregated proportionally, which mitigates the problems illustrated before. To achieve this goal, one simple method is to assign two weights to each user and item to represent the aggregation proportion of each kind of update signal in the embedding update process; in other words, the model initializes a weight matrix and refines it to learn the proportions. However, this is not optimal because the weights are independent of the ego and neighbor embeddings. It brings large randomness to the model, which reduces model interpretability and further limits recommendation performance. The Gumbel-Softmax trick is a good solution to this problem, as it can generate a categorical distribution conditioned on the provided ego and neighbor embeddings. Hence, in ASP-GCN, the Gumbel-Softmax trick is used to adaptively generate a categorical distribution between \({{[\varvec{e}_{u}^{(l)}]}_{N}}\) and \({{[\varvec{e}_{u}^{(l)}]}_{H}}\):

$$\begin{aligned} {{\alpha }_{u,k}}=\frac{\exp ((\log {{\pi }_{u,k}}+{{g}_{u,k}})/\tau )}{\sum \nolimits _{j=1}^{\vert {{\pi }_{u}}\vert }{\exp ((\log {{\pi }_{u,j}}+{{g}_{u,j}})/\tau )}} \end{aligned}$$
(3)

where \({\alpha }_{u,k}\) is the k-th element of user u’s categorical distribution. \(g_{u,1}\), \(\cdots \), \(g_{u,\vert {{\pi }_{u}}\vert }\) are independent and identically distributed (i.i.d.) samples drawn from the Gumbel(0,1) distribution, which can be sampled via inverse transform sampling as \({{g}_{u,k}}=-\log (-\log (a)),a\sim \text {Uniform}(0,1)\). \(\tau \) is the temperature factor of Gumbel-Softmax. \({\pi }_{u}\) is a two-dimensional vector, in which \({\pi }_{u,1}\) and \({\pi }_{u,2}\) are the class probabilities of \([\varvec{e}_{u}^{(1)}]_{N}\) and \([\varvec{e}_{u}^{(1)}]_{H}\) respectively. \({\pi }_{u,1}\) and \({\pi }_{u,2}\) are produced by a multi-layer perceptron (MLP):

$$\begin{aligned} \left\{ \begin{aligned}&{{\pi }_{u,1}}={{\sigma }_{2}}({{\sigma }_{1}}({{[\varvec{e}_{u}^{(1)}]}_{N}}{{\varvec{W}}_{1}}+{{\varvec{b}}_{1}}){{\varvec{W}}_{2}}+{{\varvec{b}}_{2}}) \\&{{\pi }_{u,2}}={{\sigma }_{2}}({{\sigma }_{1}}({{[\varvec{e}_{u}^{(1)}]}_{H}}{{\varvec{W}}_{1}}+{{\varvec{b}}_{1}}){{\varvec{W}}_{2}}+{{\varvec{b}}_{2}}) \\ \end{aligned} \right. \end{aligned}$$
(4)

where \({{\varvec{W}}_{1}}\in {{\mathbb {R}}^{d\times {d}'}},{{\varvec{W}}_{2}}\in {{\mathbb {R}}^{{d}'\times 1}}\) and \({{\varvec{b}}_{1}}\in {{\mathbb {R}}^{1\times {d}'}},{{\varvec{b}}_{2}}\in {{\mathbb {R}}^{1\times 1}}\) are trainable weight matrices and biases. \({{\sigma }_{1}}(\cdot )\) and \({{\sigma }_{2}}(\cdot )\) are the SELU [46] and Sigmoid functions respectively.

It is worth noting that we use SELU as the activation function of the first MLP layer. In the second layer, we use Sigmoid instead of SELU so that the MLP outputs a probability, since the output of SELU is not a probability. SELU has been shown to perform better than tanh, ReLU and LeakyReLU for two reasons: (a) it achieves internal normalization, which converges faster than external normalization; (b) it avoids the problems of gradient vanishing and gradient explosion. SELU is defined as:

$$\begin{aligned} SELU\left( x \right) =\gamma \left\{ \begin{array}{ll} x &{} if\ x>0 \\ \eta ({{e}^{x}}-1) &{} if\ x\le 0 \end{array}\right. , \end{aligned}$$
(5)

where \(\gamma \) and \(\eta \) are hyperparameters, predefined as 1.05070 and 1.67326 respectively.
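
As a concrete illustration of Eqs. 3–5, the following sketch implements the categorical distribution generation module for a batch of nodes. It is a minimal PyTorch sketch under our own assumptions: `e_N` and `e_H` are the neighborhood and hybrid embeddings, and `W1`, `b1`, `W2`, `b2` are the MLP parameters with the shapes defined above; the names and structure are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def class_probability(e, W1, b1, W2, b2):
    # Two-layer MLP of Eq. 4: SELU in the hidden layer, Sigmoid at the
    # output so that pi is a probability in (0, 1).
    h = F.selu(e @ W1 + b1)            # (n, d) -> (n, d')
    return torch.sigmoid(h @ W2 + b2)  # (n, d') -> (n, 1)

def categorical_alpha(e_N, e_H, W1, b1, W2, b2, tau=1.0, eps=1e-20):
    # Eq. 3: Gumbel-Softmax over the two class probabilities.
    pi = torch.cat([class_probability(e_N, W1, b1, W2, b2),
                    class_probability(e_H, W1, b1, W2, b2)], dim=1)  # (n, 2)
    # Gumbel(0, 1) noise via inverse transform sampling: -log(-log(U)).
    g = -torch.log(-torch.log(torch.rand_like(pi) + eps) + eps)
    alpha = F.softmax((torch.log(pi + eps) + g) / tau, dim=1)
    return alpha  # alpha[:, 0] weights e_N, alpha[:, 1] weights e_H
```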

4.2 Adaptive self-propagation

Intuitively, the items a user has interacted with can directly reflect the user’s preference; analogously, the users who have consumed an item can be used to describe the item’s features. According to this natural principle, it is necessary to aggregate neighbor embeddings to update the embeddings of target nodes (shown in Eq. 1). Meanwhile, considering the importance of the original features of users and items, we design another propagation rule that considers ego embeddings and neighbor embeddings simultaneously (shown in Eq. 2). Besides, the Gumbel-Softmax trick is utilized to generate categorical distributions between neighborhood embeddings and hybrid embeddings. After obtaining the categorical distributions generated by Gumbel-Softmax, embeddings are propagated by performing these two types of propagation rules proportionally. In this way, a node’s embedding is propagated to the node itself, which we call self-propagation.

We first introduce the first-order embedding propagation of ASP-GCN that aggregates embeddings of the first-order neighbors, and then generalize it to multiple convolution layers.

First-order propagation. The basic idea of ASP-GCN is to aggregate ego embeddings and neighbor embeddings proportionally. Specifically, we perform the propagation rules as shown in Eqs. 1 and 2 proportionally according to the categorical distributions estimated by the categorical distribution generation module. So, the first-order propagation rule (taking user u for example) in ASP-GCN is:

$$\begin{aligned} \begin{aligned} \varvec{e}_{u}^{(1)}&=\alpha _{u,1}^{(1)}{{[\varvec{e}_{u}^{(1)}]}_{N}}+\alpha _{u,2}^{(1)}{{[\varvec{e}_{u}^{(1)}]}_{H}} \\&=\alpha _{u,1}^{(1)}\sum \limits _{i\in {{{\mathcal {N}}}_{u}}}{\frac{1}{\sqrt{\vert {{{\mathcal {N}}}_{u}}\vert \vert {{{\mathcal {N}}}_{i}}\vert }}\varvec{e}_{i}^{(0)}}+\alpha _{u,2}^{(1)}(\varvec{e}_{u}^{(0)}+\sum \limits _{i\in {{{\mathcal {N}}}_{u}}}{\frac{1}{\sqrt{\vert {{{\mathcal {N}}}_{u}}\vert \vert {{{\mathcal {N}}}_{i}}\vert }}\varvec{e}_{i}^{(0)}}) \\&=\alpha _{u,2}^{(1)}\varvec{e}_{u}^{(0)}+(\alpha _{u,1}^{(1)}+\alpha _{u,2}^{(1)})\sum \limits _{i\in {{{\mathcal {N}}}_{u}}}{\frac{1}{\sqrt{\vert {{{\mathcal {N}}}_{u}}\vert \vert {{{\mathcal {N}}}_{i}}\vert }}\varvec{e}_{i}^{(0)}} \\&=\alpha _{u,2}^{(1)}\varvec{e}_{u}^{(0)}+\sum \limits _{i\in {{{\mathcal {N}}}_{u}}}{\frac{1}{\sqrt{\vert {{{\mathcal {N}}}_{u}}\vert \vert {{{\mathcal {N}}}_{i}}\vert }}\varvec{e}_{i}^{(0)},} \end{aligned} \end{aligned}$$
(6)

where \(\varvec{e}_{u}^{(0)}\) and \(\varvec{e}_{i}^{(0)}\) are the initial embeddings of user u and item i; \(\varvec{e}_{u}^{(1)}\) is user u’s updated embedding from the first layer. \([\alpha _{u,1}^{(1)},\alpha _{u,2}^{(1)}]\) is the categorical distribution of user u in the first layer. Note that the categorical distribution generation module is applied in each layer for each node, because the embedding propagation in different convolution layers involves different nodes, making it inappropriate to run the module only once for all layers. Applying the categorical distribution generation module in each layer for each node helps capture finer-grained user preferences and item attributes.

The above describes the embedding update process of a single node. To show the holistic embedding update process and facilitate implementation, we provide the matrix form of the embedding propagation of ASP-GCN:

$$\begin{aligned} {{\varvec{E}}^{(1)}}=\varvec{\alpha }_{1}^{(1)}\hat{\varvec{A}}{{\varvec{E}}^{(0)}}+\varvec{\alpha }_{2}^{(1)}({{\varvec{E}}^{(0)}}+\hat{\varvec{A}}{{\varvec{E}}^{(0)}})=\varvec{\alpha }_{2}^{(1)}{{\varvec{E}}^{(0)}}+\hat{\varvec{A}}{{\varvec{E}}^{(0)}}, \end{aligned}$$
(7)

where \({{\varvec{E}}^{(0)}}\in {{\mathbb {R}}^{(\vert {\mathcal {U}}\vert +\vert {\mathcal {I}}\vert )\times d}}\) and \({{\varvec{E}}^{(1)}}\in {{\mathbb {R}}^{(\vert {\mathcal {U}}\vert +\vert {\mathcal {I}}\vert )\times d}}\) are the initial embedding matrix and the updated embedding matrix of all users and items in the first layer respectively, and d is the embedding size. \([\varvec{\alpha } _{1}^{(1)}\in {{\mathbb {R}}^{(\vert {\mathcal {U}}\vert +\vert {\mathcal {I}}\vert )\times 1}}\), \(\varvec{\alpha } _{2}^{(1)}\in {{\mathbb {R}}^{(\vert {\mathcal {U}}\vert +\vert {\mathcal {I}}\vert )\times 1}}]\) are the categorical distributions for all users and items in the first layer. \(\hat{\varvec{A}}={{\varvec{D}}^{-\frac{1}{2}}}\varvec{A}{{\varvec{D}}^{-\frac{1}{2}}}\) is the symmetric normalized adjacency matrix, where \(\varvec{D}\) is the diagonal degree matrix and \(\varvec{A}\in {{\mathbb {R}}^{(\vert {\mathcal {U}}\vert +\vert {\mathcal {I}}\vert )\times (\vert {\mathcal {U}}\vert +\vert {\mathcal {I}}\vert )}}\) is the adjacency matrix. A single-layer graph convolutional network updates user and item embeddings by aggregating the embeddings of first-order neighbors. It captures the direct connectivities of users and items to model users’ preferences and items’ attributes, which can be seen as the first-order similarity.

High-order propagation. However, aggregating the embeddings of first-hop neighbors is insufficient to capture the complex collaborative information implied in the user-item interaction graph. To exploit the high-order similarity of users and items, we stack multiple convolution layers to aggregate the embeddings of high-order neighbors. The recurrence formula is:

$$\begin{aligned} {{\varvec{E}}^{(l)}}=\varvec{\alpha } _{2}^{(l)}{{\varvec{E}}^{(l-1)}}+\hat{\varvec{A}}{{\varvec{E}}^{(l-1)}},l=1,2,\cdots ,L, \end{aligned}$$
(8)

where \(\varvec{E}^{(l)}\) and \(\varvec{E}^{(l-1)}\) are the embedding matrices of all users and items in layers l and \((l-1)\) respectively. By implementing the multi-layer matrix-form propagation rule, we can update all user and item embeddings in each convolution layer in an efficient way.
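
Putting Eqs. 1, 2 and 8 together, a minimal sketch of the multi-layer adaptive self-propagation in matrix form could look as follows. Here `A_hat` is the sparse symmetric-normalized adjacency matrix and `categorical_alpha` is a callable that maps the neighborhood and hybrid embeddings to per-node distributions (for example, the module sketched in Section 4.1 with its MLP parameters bound); both names are our own illustrative assumptions.

```python
import torch

def asp_gcn_propagate(E0, A_hat, categorical_alpha, L=3):
    # E0: (|U|+|I|, d) initial embedding matrix of all users and items.
    E = E0
    for _ in range(L):
        neigh = torch.sparse.mm(A_hat, E)         # neighborhood signal (Eq. 1)
        hybrid = E + neigh                        # hybrid signal (Eq. 2)
        alpha = categorical_alpha(neigh, hybrid)  # (n, 2) per-node distribution
        # Eq. 8: E^(l) = alpha_2^(l) * E^(l-1) + A_hat @ E^(l-1)
        E = alpha[:, 1:2] * E + neigh
    return E  # last-layer embeddings are used directly for prediction (Eq. 9)
```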

4.3 Prediction

After L layers of embedding propagation, user and item embeddings from each layer are obtained. Most existing GCN-based methods adopt a residual connection to get the final embeddings of users and items. Taking user u for example, NGCF concatenates the embeddings from each layer: \(\varvec{e}_{u}^{(0)}\Vert \cdots \Vert \varvec{e}_{u}^{(L)}\); LightGCN uses the average of the embeddings from each layer as the final embeddings: \(\frac{1}{L+1}\sum \nolimits _{l=0}^{L}{\varvec{e}_{u}^{(l)}}\). Different from these methods, ASP-GCN regards the embeddings from the last layer as the final embeddings, because the embeddings from the previous layer are proportionally retained in the current layer, making layer combination unnecessary. If ASP-GCN also adopted a residual connection, the ego embeddings would occupy too large a proportion of the final embeddings, which would deteriorate the representation ability of user and item embeddings. Therefore, ASP-GCN regards the user and item embeddings from the last layer as the final embeddings. Taking user u and item i for example, the final embeddings are:

$$\begin{aligned} \varvec{e}_{u}^{*}=\varvec{e}_{u}^{(L)},\varvec{e}_{i}^{*}=\varvec{e}_{i}^{(L)} \end{aligned}$$
(9)

After obtaining the final embeddings, we perform an inner product operation to estimate the preference score of user u over item i:

$$\begin{aligned} {{\hat{y}}_{u,i}}=\varvec{e}_{u}^{*}{(\varvec{e}_{i}^{*})}^{\text {T}} \end{aligned}$$
(10)

In this work, we only use the inner product to compute users’ preference scores over target items. The inner product operation is simple yet efficient. Other operations, such as a neural network [47], could also be used in place of the inner product. We leave this to future work since it is not the focus of our work.

4.4 Model optimization

To refine model parameters, we employ the pairwise BPR loss [12], which has been widely used in recommender systems. The BPR loss function is written as:

$$\begin{aligned} {{{\mathcal {L}}}_{\text {BPR}}}=-\sum \limits _{(u,{{i}^{+}},{{i}^{-}})\in {\mathcal {O}}}{\ln \sigma ({{{\hat{y}}}_{u,{{i}^{+}}}}-{{{\hat{y}}}_{u,{{i}^{-}}}})}+\lambda \Vert \Theta \Vert _{2}^{2}, \end{aligned}$$
(11)

where \({\mathcal {O}}=\{(u,{{i}^{+}},{{i}^{-}})\vert (u,{i}^{+})\in {{\varvec{R}}^{+}},(u,{{i}^{-}})\in {{\varvec{R}}^{-}}\}\) is the pairwise training data, \(\varvec{R}^+\) denotes the positive user-item pair set (observed interactions) and \(\varvec{R}^-\) denotes the negative user-item pair set (unobserved interactions). \(\sigma (\cdot )\) is the sigmoid function and \(\Theta \) denotes all model parameters. \({{\hat{y}}_{u,{{i}^{+}}}}\) and \({{\hat{y}}_{u,{{i}^{-}}}}\) represent the preference scores of user u over items \(i^+\) and \(i^-\) respectively. \(\lambda \) is the coefficient controlling the strength of the regularization term. The BPR loss assumes that positive items (namely the observed items) reflect user preference better than negative items (namely the unobserved items). Hence, it encourages the model to assign higher predicted values to positive items than to negative items. The regularization term is used to avoid overfitting.

In addition to the regularization term, we add a similarity term. The similarity term makes the embeddings of connected nodes close to each other and those of disconnected nodes far from each other, because connected nodes better reflect each other. The similarity loss function is as follows:

$$\begin{aligned} {{{\mathcal {L}}}_{S}}=-\sum \limits _{(u,{{i}^{+}},{{i}^{-}})\in {\mathcal {O}}}{(s(\varvec{e}_{u}^{*},\varvec{e}_{{{i}^{+}}}^{*})-s(\varvec{e}_{u}^{*},\varvec{e}_{{{i}^{-}}}^{*}))} \end{aligned}$$
(12)

where \(s(\cdot , \cdot )\) is the cosine similarity function, \(\varvec{e}_{u}^{*}\) is the final embedding of user u, and \(\varvec{e}_{{{i}^{+}}}^{*}\) and \(\varvec{e}_{{{i}^{-}}}^{*}\) are the final embeddings of items \(i^+\) and \(i^-\) respectively. The goal of the similarity term is to enlarge the gap between the similarity of positive interactions and that of negative interactions. If we used \(s(\varvec{e}_{u}^{*},\varvec{e}_{{{i}^{+}}}^{*})-s(\varvec{e}_{u}^{*},\varvec{e}_{{{i}^{-}}}^{*})\) directly as the similarity term, the gradient would push the model weights in the reverse direction, narrowing the similarity gap between positive and negative interactions. Considering this, it is natural to negate the similarity term. Finally, we combine the BPR loss and the similarity loss in a multi-task learning manner:

$$\begin{aligned} {\mathcal {L}}={{{\mathcal {L}}}_{\text {BPR}}}+\beta {{{\mathcal {L}}}_{\text {S}}} \end{aligned}$$
(13)

where \(\beta \) is the coefficient controlling the strength of the similarity term. By introducing the similarity term into the loss function, the overall loss guides the model to refine its parameters so as to cater to the two tasks simultaneously. We believe this kind of combination has a certain degree of universality: the similarity term can be combined not only with the BPR loss but also with many other loss functions, such as the binary cross-entropy loss [47] and the square loss [48, 49], with small adjustments. Besides, this practice can also be applied in many other fields, including node classification and link prediction.
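
For completeness, the sketch below combines Eqs. 10–13 into a single loss computation for a batch of (u, i+, i-) triples. It is a minimal sketch under our own naming: `params` stands for the set of regularized model parameters, and `lam` and `beta` correspond to \(\lambda \) and \(\beta \); none of these names come from the original implementation.

```python
import torch
import torch.nn.functional as F

def asp_gcn_loss(e_u, e_pos, e_neg, params, lam=1e-4, beta=0.05):
    # e_u, e_pos, e_neg: final (last-layer) embeddings of the users,
    # positive items and negative items in the batch.
    y_pos = (e_u * e_pos).sum(dim=1)   # inner-product scores (Eq. 10)
    y_neg = (e_u * e_neg).sum(dim=1)
    # Pairwise BPR loss with L2 regularization (Eq. 11).
    l_bpr = -torch.log(torch.sigmoid(y_pos - y_neg)).sum()
    l_reg = lam * sum(p.pow(2).sum() for p in params)
    # Negated similarity term (Eq. 12): widen the cosine-similarity gap
    # between observed and unobserved pairs.
    l_sim = -(F.cosine_similarity(e_u, e_pos, dim=1)
              - F.cosine_similarity(e_u, e_neg, dim=1)).sum()
    return l_bpr + l_reg + beta * l_sim  # multi-task combination (Eq. 13)
```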

4.5 Time complexity analysis

Assume that the number of convolution layers is L and that the dimensions of the MLP weights are \({{\mathbb {R}}^{d\times {d}'}}\) and \({{\mathbb {R}}^{{d}'\times 1}}\). The main operations of ASP-GCN are matrix multiplication and categorical distribution generation. The time complexity of categorical distribution generation is \(O(\sum \nolimits _{l=0}^{L}{(\vert {\mathcal {U}}\vert +\vert {\mathcal {I}}\vert )(d+1){d}'})\), where \(\vert {\mathcal {U}}\vert +\vert {\mathcal {I}}\vert \) is the number of all nodes (users and items), and d and \(d'\) are the dimensions of the node embeddings and of the first hidden layer of the MLP. The time complexity of matrix multiplication is \(O(\sum \nolimits _{l=0}^{L}{\vert {{R}^{+}}\vert d})\), where \(\vert R^+\vert \) denotes the number of nonzero entries in the adjacency matrix. Therefore, the overall time complexity of ASP-GCN is \(O(\sum \nolimits _{l=0}^{L}{(\vert {{R}^{+}}\vert d+(\vert {\mathcal {U}}\vert +\vert {\mathcal {I}}\vert )(d+1){d}')})\).

4.6 Model analysis

Relation with NGCF. NGCF [9] is a neural graph collaborative filtering model that exploits the user-item graph structure by propagating embeddings on it. NGCF adds self-loops to the adjacency matrix (i.e., \(\varvec{A}_{vv}=1\), where v is the index of a user or an item), so that the ego embedding can be propagated to the node itself during embedding propagation. However, this is insufficient because the ego embedding occupies only a very small proportion of the embedding update. Meanwhile, once the graph structure is fixed, the proportions of ego embeddings and neighbor embeddings are fixed during embedding propagation. ASP-GCN adopts an adaptive self-propagation mechanism, in which ego and neighbor embeddings are aggregated proportionally according to the categorical distributions generated by the Gumbel-Softmax trick. By doing so, ASP-GCN mitigates the problem of NGCF that the individual differences of user properties and item attributes are not fully captured.

Relation with LightGCN. LightGCN [3] removes the feature transformation matrix and the non-linear activation function from the embedding propagation process, considering these two operations burdensome and even an obstacle to training. However, ego embeddings are not involved in the embedding propagation process of LightGCN, resulting in the loss of the inherent information of users and items. Although LightGCN adopts a residual connection mechanism to combine the embeddings from each layer, the connection is layer-wise, which means that differences between nodes are not distinguished. ASP-GCN generates categorical distributions for each user or item to update its embedding, paying special attention to the distinctions between different users or items.

Relation with DGCF. DGCF [50] is a disentangled graph collaborative filtering model that pays attention to finer-grained user intents. However, DGCF is essentially similar to LightGCN, since both are GCNs without self-propagation. Although DGCF yields disentangled representations for users and items, its operations are time-consuming, which limits its generality. ASP-GCN is essentially different from DGCF: DGCF learns finer-grained user intents by dividing user and item embeddings into several intent-aware chunks whose update processes are independent, whereas ASP-GCN generates individual categorical distributions for users and items to guide the embedding propagation process, which does not add much time or space consumption and therefore generalizes better.

5 Experimental analysis

In this section, we conduct extensive experiments on three benchmark datasets to evaluate ASP-GCN and answer the following research questions:

  • RQ1: Does ASP-GCN have better performance compared with present state-of-the-art methods?

  • RQ2: Is the adaptive self-propagation mechanism of ASP-GCN effective?

  • RQ3: How does ASP-GCN perform under different aggregation mechanisms?

  • RQ4: Why do we regard the embeddings from the last layer as the final embeddings to conduct prediction?

  • RQ5: How do different hyperparameters (including the temperature factor \(\tau \), the coefficient \(\beta \) of the similarity term and the model depth L) affect the results of ASP-GCN?

5.1 Datasets and evaluation metrics

We use three publicly available datasets, Gowalla, Yelp and Movielens-100K, to conduct our experiments; their statistics are shown in Table 2.

Table 2 Statistics of datasets

To evaluate Top-N recommendation, we select Recall@N, Precision@N, F1@N and NDCG@N as the evaluation metrics. The calculation formulas are as follows:

$$\begin{aligned} \text {Recall@}N=\frac{1}{\vert {\mathcal {U}}\vert }\sum \limits _{u\in {\mathcal {U}}}{\frac{\vert {{S}_{u,N}}\cap {{T}_{u}}\vert }{\vert {{T}_{u}}\vert }}, \end{aligned}$$
(14)
$$\begin{aligned} \text {Precision@}N=\frac{1}{\vert {\mathcal {U}}\vert }\sum \limits _{u\in {\mathcal {U}}}{\frac{\vert {{S}_{u,N}}\cap {{T}_{u}}\vert }{N}}, \end{aligned}$$
(15)
$$\begin{aligned} \text {F1@}N=2\times \frac{\text {Recall@}N\times \text {Precision@}N}{\text {Recall@}N+\text {Precision@}N} \end{aligned}$$
(16)
$$\begin{aligned} \text {NDCG@}N=\frac{1}{\vert {\mathcal {U}}\vert }\sum \limits _{u\in {\mathcal {U}}}{\sum \limits _{p=1}^{N}{\frac{I({{S}_{u,N}}(p)\cap {{T}_{u}})}{{{Z}_{u}}\cdot {{\log }_{2}}(p+1)}}}, \end{aligned}$$
(17)

where \(S_{u,N}\) is the set of top N items recommended to user u and \(T_{u}\) is the ground-truth item set of user u in the testing set. \(S_{u,N}(p)\) denotes the p-th item in \(S_{u,N}\). \(I(\cdot )\) is an indicator function that equals 1 if the set is not empty and 0 otherwise. \(Z_u\) is the ideal discounted cumulative gain, so that a perfect recommendation list obtains NDCG\(_u\)=1.
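
To make the metric definitions concrete, here is a minimal per-user sketch of Eqs. 14–17; `recommended` is the ranked list returned for a user and `ground_truth` is \(T_u\), and corpus-level values are obtained by averaging over all users. The function name and arguments are illustrative, not taken from our implementation.

```python
import math

def metrics_at_n(recommended, ground_truth, n):
    # recommended: ranked list of item ids; ground_truth: set of held-out items (T_u).
    hits = [1 if item in ground_truth else 0 for item in recommended[:n]]
    recall = sum(hits) / max(len(ground_truth), 1)        # Eq. 14
    precision = sum(hits) / n                             # Eq. 15
    f1 = (2 * recall * precision / (recall + precision)   # Eq. 16
          if recall + precision > 0 else 0.0)
    # Eq. 17: position p is 1-indexed in the formula, hence log2(p + 1).
    dcg = sum(h / math.log2(p + 1) for p, h in enumerate(hits, start=1))
    idcg = sum(1.0 / math.log2(p + 1)
               for p in range(1, min(len(ground_truth), n) + 1))  # ideal DCG (Z_u)
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return recall, precision, f1, ndcg
```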

5.2 Baselines

  • BPR-MF [12]: BPR-MF is a matrix factorization model optimized by Bayesian Personalized Ranking (BPR) loss, which exploits the user-item interactions to compute the loss and refine model parameters.

  • GMF [47]: GMF is a generalized matrix factorization recommendation model, which uses a linear kernel to calculate users’ preference over items.

  • MLP [47]: MLP is a Multi-Layer Perceptron based model, which uses a non-linear kernel to estimate users’ preference over items.

  • NCF [47]: NCF is the combination of GMF and MLP, which can simultaneously capture the linear and non-linear features of users and items.

  • GC-MC [1]: GC-MC is a GCN-based model, which stacks only one convolution layer to update node embeddings by aggregating the embeddings of first-order neighbors.

  • PinSage [4]: PinSage is the implementation of GraphSAGE in a large web-scale recommender system.

  • NGCF [9]: NGCF is a GCN-based collaborative filtering model that stacks multiple convolution layers to update user and item embeddings. Compared with conventional GCNs, it additionally encodes the interactions between the target node and its neighbor nodes.

  • LR-GCCF [2]: LR-GCCF is a linear GCN-based collaborative filtering model whose embedding propagation is linear. It keeps feature transformation but removes the non-linearity in the aggregation stage.

  • LightGCN [3]: LightGCN is a simplified and powerful GCN model that removes feature transformation and non-linearity simultaneously. It improves recommendation performance while reducing memory and time consumption.

  • DGCF [50]: DGCF is a disentangled graph collaborative filtering method that pays special attention to user-item relationships at the finer granularity of user intents. It disentangles user intents and yields disentangled representations so as to improve the robustness and interpretability of the recommendation model.

  • SGL [51]: SGL applies self-supervised learning to the user-item interaction graph with the aim of improving the accuracy and robustness of GCNs for recommendation. We select Edge Dropout (SGL-ED) and Random Walk (SGL-RW) as the competitive models since they generally show better performance.

  • CIGCN [52]: CIGCN is an embedding disentanglement model that designs a channel-independent graph convolutional network to disentangle user and item embeddings. It assigns different importance to different embedding dimensions so as to update each dimension independently. Note that we do not use item-item relations since there are no item-item relation data in the datasets we use.

Table 3 The comparison of overall performance with baseline methods on Gowalla dataset

5.3 Experimental environment and parameter settings

Experimental environment: ASP-GCN is implemented using the PyTorch framework and accelerated by an NVIDIA 2080Ti GPU.

Parameter settings: For the sake of fairness, the embedding sizes of all methods are fixed to 64. For all multi-layer GCN-based methods, we search the layer number in {1, 2, 3, 4, 5} for the best performance and the size of each layer is set to 64. The batch size is selected from {256, 512, 1024, 2048} and the learning rate is searched in {\(10^{-5}\), \(10^{-4}\), \(10^{-3}\), \(10^{-2}\), \(10^{-1}\), 1} to tune the convergence speed. The temperature factor \(\tau \) is selected from {1, 5, 10, 50, 100}. Moreover, the coefficients of the similarity term and the regularization term are searched in {0.01, 0.05, 1} and {\(10^{-5}\), \(10^{-4}\), \(10^{-3}\), \(10^{-2}\), \(10^{-1}\)} respectively. Meanwhile, we use a dropout strategy for all models and search the dropout ratio in {0.1, 0.2, ..., 0.8}. For every method, one negative sample is selected to match each positive sample. Xavier [53] and Adam [54] are used to initialize and optimize the model parameters of ASP-GCN respectively.
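
The search grids above can be summarized in a single configuration dictionary for tuning scripts; the values are taken directly from the settings described in this subsection, while the key names are our own illustrative choices.

```python
# Hyperparameter search space (values from the parameter settings above;
# key names are illustrative).
search_space = {
    "embedding_size": [64],
    "num_layers": [1, 2, 3, 4, 5],
    "batch_size": [256, 512, 1024, 2048],
    "learning_rate": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    "temperature_tau": [1, 5, 10, 50, 100],
    "similarity_coeff_beta": [0.01, 0.05, 1],
    "reg_coeff_lambda": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "dropout_ratio": [round(0.1 * k, 1) for k in range(1, 9)],  # 0.1 ... 0.8
    "num_negative_samples": 1,
}
```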

Table 4 The comparison of overall performance with baseline methods on Yelp dataset
Table 5 The comparison of overall performance with baseline methods on Movielens-100K dataset

5.4 Performance comparison (RQ1)

From the results shown in Tables 3, 4 and 5 (the boldfaced values indicate the best performance in each column, and the underlined values indicate the best performance among baselines), we have the following observations:

  • ASP-GCN achieves the best recommendation performance over the other baselines on all three datasets, which indicates the effectiveness of the proposed ASP-GCN. By generating categorical distributions between neighborhood embeddings and hybrid embeddings and aggregating these two kinds of update signals according to the generated distributions, ASP-GCN helps the GCN model retain nodes’ (users’ and items’) inherent information and capture users’ personalized needs.

  • BPR consistently performs poorly on the Gowalla and Yelp datasets. It samples user-item interactions to calculate the loss and then refines user and item embeddings using back propagation. This indicates that simply performing an inner product between user and item embeddings is insufficient.

  • GCN-based methods (GC-MC, PinSage, NGCF, LR-GCCF, LightGCN, DGCF, CIGCN) perform better than BPR, GMF, MLP and NCF, because they use graph convolutional networks to exploit high-order connectivities of users and items. In particular, GC-MC performs worse than multi-layer methods such as LR-GCCF and LightGCN. This is because GC-MC stacks only one convolution layer, which can only exploit first-order connectivities and fails to capture the high-order similarity of users and items.

  • Compared with NGCF and LR-GCCF, LightGCN consistently shows better performance. This suggests that non-linear activation and feature transformation are two burdensome operations for recommendation tasks: they not only raise time and memory consumption, but also make the model difficult to train.

  • SGL outperforms LightGCN since it performs self-supervised learning on the user-item interaction graph. SGL mitigates two problems existing in GCN-based recommendation models: the long-tail problem and noisy interactions. By doing so, SGL improves the robustness of GCN-based recommendation models.

5.5 Ablation experiments

5.5.1 Effectiveness of adaptive self-propagation mechanism (RQ2)

To investigate whether the adaptive self-propagation mechanism of ASP-GCN helps improve recommendation performance, we remove the similarity term from the loss function (denoted by ASP-GCN-s). ASP-GCN-s is thus a pure adaptive self-propagation based GCN when compared with other GCN models. From Figure 4, we find that ASP-GCN-s outperforms all nine comparative methods, which verifies that the adaptive self-propagation mechanism designed in this paper is effective. Concretely, ASP-GCN utilizes the Gumbel-Softmax trick to generate the categorical distribution between neighborhood embeddings and hybrid embeddings, and aggregates these embeddings proportionally according to the generated categorical distributions. This embedding propagation mechanism captures both ego and neighbor features, which further improves the recommendation performance.

Fig. 4

Performance comparison of ASP-GCN-s (i.e., the variant of ASP-GCN that removes the similarity term of the loss function) and other recommendation models

5.5.2 Impact of different aggregation mechanisms (RQ3)

In ASP-GCN, we employ proportional summation between neighborhood embeddings and hybrid embeddings in the embedding aggregation stage. To study its rationality, we design three variants of ASP-GCN: ASP-GCN-max (proportional max pooling between the two types of embeddings), ASP-GCN-concat (proportionally concatenating the two types of embeddings) and ASP-GCN-mean (proportional mean pooling between the two types of embeddings). From the results shown in Table 6 (the boldfaced values indicate the best performances), we observe that ASP-GCN achieves the best performance in general. On the Yelp dataset, ASP-GCN-mean performs better than the others, but we found that ASP-GCN-mean is hard to train and time-consuming. This is particularly obvious on the Movielens-100K dataset, where it not only raises the time cost but also deteriorates the overall performance.

Table 6 Performance of 3-layer ASP-GCN with different choices of aggregation mechanisms in graph convolution

5.5.3 Impact of layer combination (RQ4)

ASP-GCN regards the embeddings from the last layer as the final embeddings to predict users’ preference scores over items. To show why, we design a variant \(\text {ASP-GCN}_{\text {all-layer}}\) that sums the embeddings from each layer as the final embeddings of users and items. As shown in Figure 5, ASP-GCN consistently outperforms \(\text {ASP-GCN}_{\text {all-layer}}\) on the Gowalla, Yelp and Movielens-100K datasets, which can be explained by ASP-GCN’s adaptive self-propagation mechanism. In each layer, ASP-GCN aggregates a certain proportion of hybrid embeddings, which consist of ego embeddings and neighbor embeddings. This means that a certain proportion of the embeddings from the previous layer is retained. Hence, it is unnecessary for ASP-GCN to perform layer combination, which would adversely affect the recommendation performance.

Fig. 5

Results of ASP-GCN and the variant that sums embeddings from each layer under different number of layers on Gowalla and Yelp datasets

5.6 Hyper-parameter analysis (RQ5)

5.6.1 Impact of temperature factor and coefficient of similarity term

The temperature factor \(\tau \) and the coefficient \(\beta \) of similarity term are two crucial hyper-parameters of ASP-GCN. Performances of ASP-GCN under different \(\tau \) and \(\beta \) settings are shown in Figure 6.

\(\tau \) influences the categorical distribution generation process. For low temperatures, the categorical distributions approximate one-hot distributions; for high temperatures, they gradually become uniform distributions. These two extremes lead to two problems: (1) the first causes an imbalance between ego embeddings and neighbor embeddings; (2) the second stifles the differences between nodes. Hence, as shown in Figure 6, Recall@20 and Precision@20 follow a pattern of first increasing and then decreasing as the temperature increases.

\(\beta \) influences the model training process. As shown in Figure 6, Recall@20 first increases and then decreases as \(\beta \) rises. As the coefficient of the similarity term increases, the embeddings of connected nodes become closer and those of disconnected nodes move further apart iteratively. This is beneficial because connected nodes can better reflect the features of target nodes, which proves the effectiveness of our optimized loss function. However, as \(\beta \) increases to a relatively high level, Recall@20 shows a downward trend, because a larger proportion of the similarity term weakens the role of the main loss (BPR), causing ASP-GCN to deviate from its main goal.

Fig. 6

Performance of 3-layer ASP-GCN under different temperature factors and coefficients of similarity term

Table 7 The performance of ASP-GCN under different number of layers

5.6.2 Impact of model depth

To investigate how model depth affects the performance of ASP-GCN, we search the number of convolution layers in the range {1, 2, ..., 7}. As summarized in Table 7 (the boldfaced values indicate the best performances), we have the following observations:

  • ASP-GCN benefits from the multi-layer graph structure. In particular, the recommendation performance improves consistently on Movielens-100K when the layer number increases from 1 to 4, on Gowalla when it increases from 1 to 5 and on Yelp when it increases from 1 to 6. This can be attributed to the exploitation of high-order collaborative signals. By aggregating the embeddings of high-hop neighbors, high-order collaborative signals can be captured to model the high-order similarity of users and items. Therefore, increasing the number of convolution layers enhances the model’s representational ability and further improves the recommendation performance.

  • When the model depth increases from 5 to 7 on Gowalla, from 6 to 7 on Yelp and from 4 to 7 on Movielens-100K, the performance deteriorates. This is caused by the over-smoothing problem, which is the reason why GCNs cannot obtain satisfying performance when the model depth keeps increasing. Deepening the model causes an exponential increase in the number of nodes involved in the embedding update of a target node, so these nodes smooth each other during embedding propagation. This leaves the embeddings barely distinguishable and further limits the recommendation performance.

6 Conclusion

In this work, we hold the view that neither discarding ego embeddings nor combining ego embeddings in a uniform way during embedding propagation is effective for updating node embeddings (user and item embeddings), and we conduct a pilot experiment to verify this observation. Considering this issue, we propose an Adaptive Self-propagation Graph Convolutional Network (ASP-GCN) to proportionally aggregate neighborhood embeddings and hybrid embeddings composed of ego embeddings and neighborhood embeddings. Specifically, we resort to the Gumbel-Softmax trick to generate categorical distributions between the aforementioned two types of embeddings; the generated categorical distributions are used as the weights of each type of embedding in the representation propagation process. In the back propagation stage, we optimize the BPR loss with a similarity term that forces the embeddings of connected nodes to be close to each other and those of disconnected nodes to be far from each other. Finally, comprehensive experiments on three publicly available datasets are conducted to demonstrate the effectiveness and efficiency of ASP-GCN.