1 Introduction

Recommender systems have been widely used in various fields and can effectively alleviate the issue of information overload. Traditional recommender systems usually utilize only a single type of user-item interaction data [1] (e.g., purchase behavior in e-commerce), but such data suffer from sparsity and cold-start issues. Therefore, more and more works have begun to incorporate multiple auxiliary behaviors to enhance the modeling of user preferences, and the multi-behavior approach is also more practical. Figure 1 shows an example of multi-behavior recommendation on an e-commerce platform. As the variety and quantity of items grow, the data provided by the single target behavior (purchase) is limited, so introducing auxiliary behaviors such as page-view, favorite, and add-to-cart to infer user preferences has attracted much attention.

Fig. 1

An example of heterogeneity in the e-commerce scenario: multiple types of implicit feedback provide richer information than the single target behavior

In recent years, how to take advantage of auxiliary behaviors has become a trending topic, and many novel methods with excellent performance have been proposed. Summarizing existing methods [2, 3] for multi-behavior or multi-relation recommendation: Neural Multi-Task Recommendation (NMTR) [4] builds a model with a multi-task learning framework and shared embedding layers, using neural collaborative filtering [1] as the score prediction function. Efficient Heterogeneous Collaborative Filtering (EHCF) [5] holds that there are transitive relationships among the various behaviors and designs a new non-sampling transfer learning scheme to improve recommendation performance. Multiplex Graph Neural Networks (MGNN) [6] uses a multiplex network structure and graph convolutional networks to learn shared and behavior-specific embeddings for nodes. Graph Heterogeneous Multi-Relational Recommendation (GHCF) [7] uncovers the relationships between heterogeneous user-item interactions and embeds both nodes and relations to exploit the high-order information in the heterogeneous graph. Despite their effectiveness, these methods account for neither the preference strength of neighboring nodes nor the high-order relationships among nodes under a multi-behavior message-passing architecture.

Motivated by the above observations on previous multi-behavior recommendation works, we identify two major issues in Graph Neural Networks (GNNs). The first issue is that GNNs' propagation weights come from conventional aggregation methods, in which the weight in most methods depends only on the neighboring nodes or the size of the neighbor set. This ignores the intensity of the target node's preference for its neighbors. For instance, Figure 1 shows an example of behavior heterogeneity in which multiple types of feedback are incorporated to enhance recommendation. User u1 purchases a smart watch and a mug in Figure 1. A smart watch reveals more about a user's preferences than a mug, yet if the two items' neighbor sets are of similar size, conventional aggregation assigns the user node's two links similar weights. In fact, the link to the smart watch should gain more weight than the link to the mug, so we should take the user's preference intensity for neighboring nodes into account. In short, multi-behavior recommendation should consider the preference strength of neighboring nodes rather than deriving node weights from conventional aggregation methods.

The second issue of GNNs is that traditional recommender systems do not consider high-order relationships between nodes (users and items) when incorporating multiplex behaviors under a message-passing architecture. The effectiveness of previous multi-behavior methods relies on sufficient user-item interactions to learn good embeddings. We believe that exploring high-order user-item relationships can better capture the connectivity between users and items under sparse relational data. As shown in Figure 1, purchase records are few. Traditional recommendation methods focus on the relationship between a target node and its immediate neighbors to learn better embeddings. Therefore, we try to alleviate the cold-start problem by introducing auxiliary behaviors and the high-order relations among heterogeneous user-item interactions.

To be more specific, we build a unified heterogeneous graph containing two types of nodes (users and items) and edges for the different user-item interactions under different behaviors. Firstly, to capture node preference intensity, we propose leveraging the attention mechanism to distinguish neighboring nodes, assigning weights according to the importance of different items during embedding propagation, and recursively propagating embeddings from the neighboring nodes to update each node's embedding. Secondly, to exploit the behavioral relationships among heterogeneous user-item interactions, we incorporate the relations between users and items into the heterogeneous graph and utilize high-hop signals between nodes (users and items) to build a unified multi-behavior prediction model. By injecting user preference intensity into the user-item interactions of the heterogeneous graph, we leverage collaborative signals from high-order neighbors to learn better node embeddings in graph neural networks for enhanced recommendation.

The main contributions of this paper are as follows:

To capture users' preference intensity for different items, where traditional rule-based calculations of the propagation weight fall short, we exploit an attention embedding propagation layer [8], which recursively propagates embeddings from a node's neighbors to update its embedding. The propagation weight is learned according to the importance of different items, so different contributions can be obtained from different neighboring nodes.

To capture the high-order relationships propagated between users and items under the multi-behavior message-passing architecture, which previous works lacked by not explicitly modeling heterogeneous user-item interactions in the high-hop graph structure, we utilize relation-aware propagation layers that incorporate relation embeddings into node embeddings for high-order propagation over the high-hop graph structure of heterogeneous user-item interactions. This fully utilizes node information by exploring high-hop heterogeneous connections, which helps capture high-order relations in node embedding learning.

We conduct extensive experiments on two real-world datasets, Beibei and Taobao. The experimental results show that the NAH model effectively improves recommendation performance compared with state-of-the-art baselines, and is especially useful for cold-start users.

2 Problem scope

In this paper, we further study how introducing multiple auxiliary behaviors in heterogeneous graphs can alleviate the data sparsity and cold-start problems of recommender systems that rely on massive amounts of implicit feedback. Since implicit feedback lacks explicit preference information, we focus on how to extract users' different preference intensities to provide more practical research for multi-behavior recommendation tasks.

Therefore, our main research goal is to incorporate multiple auxiliary behaviors into the recommender system and to adaptively learn the weights of a user's neighbors while performing high-order propagation to learn better node representations. On the one hand, we design graph neural network components tailored to heterogeneous interaction data to fully extract user preference intensity. On the other hand, we fuse entity nodes and behaviors during propagation to explore their high-order relationships.

Based on this research on multi-behavior recommender systems, two challenges exist in our problem: 1) how to capture and calculate user preference intensity, and 2) how to perform high-order propagation that incorporates user preference intensity to obtain semantic information from different hops. Motivated by these challenges, we propose a novel method, NAH, which leverages an attention propagation layer to capture user preference intensity and employs a composition method that incorporates relation embeddings into node embeddings for high-order propagation in graph neural networks for multi-behavior recommendation.

3 Preliminaries

In the e-commerce recommendation scenario, users exhibit different types of behaviors, which recommender systems can exploit to improve recommendation performance. As noted in the introduction, GNNs [9,10,11] perform well in recommendation owing to their powerful capability to learn from graph-structured data, and single-behavior recommendation [1, 12, 13] may not perform as well as multi-behavior recommendation. To sum up, multi-behavior recommendation is studied in two categories. The first category utilizes multi-behavior data to improve the sampling strategy for positive and negative samples or to optimize the loss function. In previous work, multi-channel Bayesian personalized ranking (MC-BPR) [14] employs an extended sampling method in which different types of implicit feedback reflect different strengths of user preference. Specifically, EHCF [5] designs an efficient heterogeneous collaborative filtering model that captures the interaction behaviors between users and items, establishes fine-grained user-item relationships, and effectively learns model parameters from whole data to further improve performance. The second category leverages multi-behavior data to learn better user and item embeddings. Graph neural networks for social recommendation (GraphRec) [15] jointly captures interactions and opinions in the user-item graph and coherently models their heterogeneous strengths. Intra- and inter-heterogeneity recommendation (ARGO) [16] explores a graph-based message-passing architecture that models interaction heterogeneity with relational aggregation networks and recursively propagates the embeddings of adjacent nodes on the user-item graph. In our work, we design a recommendation model that leverages multiple behavior types, taking into account the user's preference intensity for different neighboring items and exploring high-order information between nodes (users and items).


Assume that there are two types of entities, U and V, denoting the sets of users and items, respectively, and let M and N denote the numbers of users and items, where u denotes a user and v denotes an item. The interaction matrices for the multiple behavior types are denoted as {R^(1), R^(2), ..., R^(K)}, where K is the number of behaviors, {R^(1), R^(2), ..., R^(K-1)} are the auxiliary behaviors, and R^(K) is the target behavior. R^(k) records whether a user has interacted with an item under behavior k, so each entry \(R^{(k)}_{uv}\) takes the value 1 or 0:

$$ R^{(k)}_{uv}= \left\{\begin{array}{ll} 1, & \text{if } u \text{ has interacted with } v \text{ under behavior } k;\\ 0, & \text{otherwise.} \end{array}\right. $$
(1)

In the multi-behavior recommender system, the K-th behavior is selected as the target behavior to be optimized; in the e-commerce scenario the target behavior is usually purchase, while the auxiliary behaviors include click, favorite, share, add-to-cart, etc. Given a target user u, the multi-behavior recommendation task can be formulated as follows. Input: user-item interaction data based on multiple types of behaviors {R^(1), R^(2), ..., R^(K)}. Output: under the target behavior, the items that user u has not interacted with, ranked according to the estimated probability \(\hat {R}_{(K)uv}\), with the top-N recommended.
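To make the formulation concrete, here is a minimal Python sketch (ours, not part of the model) that builds the binary interaction matrices {R^(1), ..., R^(K)} from toy (user, item) pairs; the behavior names and data are illustrative assumptions.

```python
# Minimal sketch of the problem setup: one binary interaction matrix per behavior.
import numpy as np

M, N = 4, 5                                            # numbers of users and items
behaviors = ["page-view", "add-to-cart", "purchase"]   # R^(K) = purchase is the target
interactions = {                                       # toy (u, v) pairs, purely illustrative
    "page-view":   [(0, 1), (0, 2), (1, 0), (2, 3), (3, 4)],
    "add-to-cart": [(0, 2), (2, 3)],
    "purchase":    [(0, 2), (3, 4)],
}

R = {k: np.zeros((M, N), dtype=np.int8) for k in behaviors}
for k, pairs in interactions.items():
    for u, v in pairs:
        R[k][u, v] = 1              # R^(k)_{uv} = 1 iff u interacted with v under k

print(R["purchase"])                # target-behavior matrix used for top-N ranking
```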

4 Methodology

In this section, the details of our proposed NAH model are described; its architecture is shown in Figure 2. Our model has three important components: 1) a sharing embedding layer that generates initial features for the user, item, and behavior embeddings; 2) message aggregation layers that aggregate feature information from adjacent nodes to learn user preference intensity and extract user-item interaction information for high-hop propagation based on multi-behavior data; 3) a joint prediction layer that fuses the embeddings from each layer and predicts the likelihood that the user will interact with an item under the target behavior.

Fig. 2

The illustration of the NAH model architecture. Given user u, item i, and relation S(r) as input: 1) attention aggregation layers aggregate the neighboring node messages and relation embeddings; 2) multi-order embedding aggregates high-order relations by performing the composition operation on adjacent node v under its relation r; 3) the joint prediction layer obtains the user, item, and user-item interaction behavior embeddings after propagating L layers

4.1 Unified heterogeneous graph

As discussed above, the multi-behavior recommendation task utilizes various auxiliary behaviors to make recommendations to the target user under the target behavior. We therefore build a unified heterogeneous graph to model the research problem. For an undirected graph G = (V, E, R), the node set V contains user nodes u ∈ U and item nodes v ∈ V, the edges in E represent the different user-item interactions under different behaviors, and the relation set R contains all behavior types. For example, if user u1 adds item v1 to the cart under auxiliary behavior r2, then there exists an edge in graph G, denoted as \(R^{r_{2}}_{u_{1},v_{1}} =1\). In general, graph G is used to propagate and update node embeddings, during which neighborhood messages are aggregated based on the behavior-aware interaction information between users and items. In our method, we assign different weights during neighboring-node embedding propagation, since user-item interactions depend on user preference, which justifies the necessity of capturing user preference intensity in the multi-behavior task. To make better use of the multi-behavior data, the user-item propagation architecture explores high-hop connectivity that incorporates user preference intensity to capture more accurate behavioral collaborative signals for multi-behavior recommendation.
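As an illustration, the following sketch (our own, with hypothetical toy data) materializes G as a typed edge list plus behavior-aware neighbor lists, the form consumed by the propagation layers described below.

```python
# Minimal sketch of the unified heterogeneous graph G = (V, E, R):
# users and items as nodes, one edge per interaction, typed by its behavior.
from collections import defaultdict

interactions = {                       # toy data; behavior names are assumptions
    "page-view":   [("u1", "v2"), ("u2", "v1")],
    "add-to-cart": [("u1", "v1")],     # e.g. encodes R^{r2}_{u1,v1} = 1
    "purchase":    [("u1", "v3")],
}

edges = [(u, v, r) for r, pairs in interactions.items() for u, v in pairs]
neighbors = defaultdict(list)          # behavior-aware neighbor lists
for u, v, r in edges:
    neighbors[u].append((v, r))        # item neighbor of user u under behavior r
    neighbors[v].append((u, r))        # undirected graph: also store the reverse
print(neighbors["u1"])
```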

4.2 Sharing embedding layers

Similar to existing multi-behavior methods [4, 7, 17], we apply one-hot encoding to the input user and item IDs. Let \(p^{(0)}_{u}\in {\mathcal {R}^{d}}\) and \(q^{(0)}_{v}\in {\mathcal {R}^{d}}\) represent the features of user u and item v, respectively, where d is the embedding dimension. Let \(\textbf {P}=\{p^{(0)}_{u_{1}},p^{(0)}_{u_{2}},...,p^{(0)}_{u_{M}}\}\) and \(\textbf {Q}=\{q^{(0)}_{v_{1}},q^{(0)}_{v_{2}},...,q^{(0)}_{v_{N}}\}\) represent the embedding matrices of users and items, respectively; the ID embedding layer can then be defined as a fully connected layer.

$$ p_{u}^{(0)} = \textbf{P}\cdot {X^{T}_{u}},\\ q_{v}^{(0)} = \textbf{Q}\cdot {Y^{T}_{v}} $$
(2)

where P and Q have sizes M × d and N × d, and \(p^{(0)}_{u}\) and \(q^{(0)}_{v}\) are the u-th and v-th row vectors of P and Q, respectively. \({X^{T}_{u}}\) and \({Y^{T}_{v}}\) denote the one-hot feature vectors of user u and item v. Note that the embedding matrices P and Q, as the initial features of user and item nodes, serve as the input features of each node in our framework. For the different behavior types, the initial feature vector of a behavior is defined as follows:

$$ s_{r}^{(0)} = \textbf{S}\cdot {Z^{T}_{r}} $$
(3)

where S has size \(K\times {d^{\prime }}\), K is the number of behavior types, \(d^{\prime }\) is the relation embedding dimension, and \({Z^{T}_{r}}\) is the one-hot vector of behavior r. Behavior embeddings are likewise generated by an ID embedding layer, which projects the behaviors into the user-item vector space.
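Since multiplying an embedding matrix by a one-hot vector reduces to a row lookup, the shared embedding layers of (2)-(3) can be sketched with plain embedding tables, as below. This is our hedged reconstruction, not the authors' code; we assume d' = d for simplicity, and all variable names are ours.

```python
# Sketch of the sharing embedding layers (Eqs. 2-3) as ID lookups.
import torch
import torch.nn as nn

M, N, K, d = 4, 5, 3, 64          # users, items, behavior types, embedding size
P = nn.Embedding(M, d)            # user embedding matrix P
Q = nn.Embedding(N, d)            # item embedding matrix Q
S = nn.Embedding(K, d)            # behavior (relation) embedding matrix S, with d' = d

u = torch.tensor([0])             # user ID, standing in for the one-hot X_u
v = torch.tensor([2])
r = torch.tensor([1])
p_u0, q_v0, s_r0 = P(u), Q(v), S(r)   # p_u^(0), q_v^(0), s_r^(0)
```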

4.3 Message aggregation layers

After obtaining the initial embeddings from the sharing embedding layer, the next step is to aggregate neighboring node messages and update the target node embedding under the different behavior types. In our framework, we aggregate neighboring nodes with an embedding aggregation mechanism to capture user preference intensity, and update user and item embeddings by high-order, behavior-type-aware propagation for recommendation.

4.3.1 Embedding aggregation

Our main idea is to model the user's preference intensity for each item differently by combining two key factors, relation embeddings and the aggregation of messages from the user's neighboring nodes, according to behavior type under the multi-behavior message-passing architecture. For each target node, the information of neighboring nodes is fused into its embedding by propagation, which reinforces embedding learning, since the contribution of each item to user preference differs. In our task, to make better use of neighboring-node information, we leverage the attention mechanism to measure the importance between the target node and its neighbors and thereby learn structural information.

In our message-passing architecture, we leverage the attention mechanism to calculate the importance weights between a user node u and its neighboring nodes v. In particular, we transform the embeddings of node u and each neighboring node v through a one-layer neural network, and then employ a similarity function to calculate the similarity between the node and its neighbors, formally:

$$ \alpha^{\prime}_{{u_{i}}{v_{j}}}=a(\textbf{W}_{1} p_{u_{i}},\textbf{W}_{2} q_{v_{j}}) $$
(4)

where ui denotes the target user and vj denotes a neighboring node of ui. a(⋅) is a similarity function, implemented in this paper as a one-layer neural network, that measures the similarity between ui and vj; \(\alpha ^{\prime }_{{u_{i}}{v_{j}}}\) represents the importance of item vj to user ui. In addition, W1 and W2 denote trainable transformation matrices.

To make the attention coefficients easy to compare across nodes, we apply the softmax function to normalize over all adjacent nodes vj of the target node ui. We leverage the broadcast mechanism to obtain the attention matrix and normalize it per target node with the softmax function as follows:

$$ \alpha_{{u_{i}}{v_{j}}} = \frac{exp(\alpha^{\prime}_{{u_{i}}{v_{j}}})}{{\sum}_{v_{j}\in{\mathcal{N}_{u}}}exp(\alpha^{\prime}_{{u_{i}}{v_{j}}})} $$
(5)

where \(\alpha ^{\prime }_{{u_{i}}{v_{j}}}\) denotes the intermediate value fed into the softmax function to generate the importance weight \(\alpha _{{u_{i}}{v_{j}}}\). Expanding (4) and (5), the embedding aggregation layer computes:

$$ \alpha_{{u_{i}}{v_{j}}} =\frac{exp(\sigma(a[\textbf{W}_{1}p_{u_{i}}||\textbf{W}_{2}q_{v_{j}}]))}{{\sum}_{v_{k}\in{\mathcal{N}_{u_{i}}}}exp(\sigma(a[\textbf{W}_{1}p_{u_{i}}||\textbf{W}_{2}q_{v_{k}}]))} $$
(6)

where || denotes the concatenation operation and σ(⋅) is the Leaky_ReLU [18] nonlinear activation function. This yields the importance weight \(\alpha _{{u_{i}}{v_{j}}}\) between user ui and item vj under behavior type k. The embedding aggregation process is then:

$$ p^{(l)}_{u}=\sigma\left( \sum\limits_{v_{j}\in{\mathcal{N}_{u}}}\alpha_{{u_{i}}{v_{j}}}\textbf{W}^{(l-1)}q^{(l-1)}_{v}\right) $$
(7)

where σ(⋅) is Leaky_ReLU, and pu and qv are the input user and item embeddings of the neural network layer, respectively. Item embeddings are updated by the same aggregation and propagation process. In our embedding generation process, the different embeddings are aggregated with the weights \(\alpha _{u_{i},v_{j}}\) and a transformation parameter W ∈ R^{d × d} over the latent dimensions. Under each behavior type k, the message passing for the target user node ui from its adjacent item nodes proceeds as in (7).
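The following self-contained sketch is a GAT-style reconstruction of (4)-(7) for a single user (it is not the authors' released code): it scores each neighboring item with the one-layer network a(·), softmax-normalizes the scores, and aggregates the transformed neighbor embeddings.

```python
# Attention aggregation for one user node, following Eqs. (4)-(7).
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64
W1 = nn.Linear(d, d, bias=False)           # trainable transformation W1
W2 = nn.Linear(d, d, bias=False)           # trainable transformation W2
a = nn.Linear(2 * d, 1, bias=False)        # one-layer scoring network a(.)
W = nn.Linear(d, d, bias=False)            # propagation transformation W^(l-1)

p_u = torch.randn(d)                       # user embedding p_u^(l-1)
q_neigh = torch.randn(3, d)                # embeddings of u's neighboring items N_u

h_u = W1(p_u).expand(q_neigh.size(0), -1)  # broadcast the user side over neighbors
h_v = W2(q_neigh)
scores = F.leaky_relu(a(torch.cat([h_u, h_v], dim=-1))).squeeze(-1)  # Eq. (6), pre-softmax
alpha = torch.softmax(scores, dim=0)       # importance weights alpha_{u_i v_j}, Eq. (5)
p_u_next = F.leaky_relu(alpha @ W(q_neigh))   # Eq. (7): weighted aggregation
```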

4.3.2 High-order propagation

After aggregating the type-specific behavior embeddings between nodes (users and items), we model high-order relations by performing composition operations on an adjacent node v under its relation r. Given the generated user-item interaction graph structure, we learn high-order relations in the multi-behavior framework on graph G by stacking multiple information propagation layers. Following the weights of the user's behaviors, we employ a weighted sum as the combination operator.

In the high-order propagation process, the embedding of a user node (or item node) is updated by accumulating the incoming messages from all heterogeneously interacting items (users). Inspired by the entity-relation composition operations used in knowledge graph embedding approaches [19], the message-passing equation of our model is defined as:

$$ \begin{array}{@{}rcl@{}} q^{(l)}_{v} = \sigma \left( \sum\limits_{(u,r)\in\mathcal{N}_{(v)}}\frac{1}{\sqrt{| \mathcal{N}_{u}||\mathcal{N}_{v}|}}\textbf{W}_{nn}^{(l)}\left( p^{(l-1)}_{u}\odot s^{(l-1)}_{r}\right)\right) \\ p^{(l)}_{u} = \sigma \left( \sum\limits_{(v,r)\in\mathcal{N}_{(u)}}\frac{1}{\sqrt{|\mathcal{N}_{u}||\mathcal{N}_{v}|}}\textbf{W}_{nn}^{(l)}\left( q^{(l-1)}_{v}\odot s^{(l-1)}_{r}\right)\right) \end{array} $$
(8)

where \(\mathcal {N}_{u}\) and \(\mathcal {N}_{v}\) denote the neighbor sets of users and items, respectively; \(\textbf {W}_{nn}^{(l)}\) are the graph neural network parameters of the model; and σ is the Leaky_ReLU activation function. \(\frac {1}{\sqrt {|\mathcal {N}_{u}||\mathcal {N}_{v}|}}\) is the symmetric normalization term, which prevents the scale of the embeddings from growing with repeated graph convolution operations. ⊙ denotes the element-wise product of vectors, and \(q^{(l-1)}_{v}\odot s^{(l-1)}_{r}\) incorporates the relation embeddings into the message-passing formulation.

After the node embeddings defined in (8) are updated, the relation embeddings are also updated as follows:

$$ s^{(l)}_{r}=\textbf{W}^{(l)}_{r}s^{(l-1)}_{r} $$
(9)

where \(\textbf {W}^{(l)}_{r}\) is a relational weight parameter that projects all relations, like nodes, into the same embedding space and makes them available to the next layer.
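Below is a minimal reconstruction of one relation-aware propagation step of (8)-(9). For readability it uses dense 0/1 interaction tensors and pools node degrees over all behaviors; a real implementation would use sparse adjacency matrices. It shows the user-side update; the item-side update is symmetric.

```python
# One high-order propagation step following Eqs. (8)-(9) (dense sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

M, N, K, d = 4, 5, 3, 64
P = torch.randn(M, d)                       # p_u^(l-1) for all users
Q = torch.randn(N, d)                       # q_v^(l-1) for all items
S = torch.randn(K, d)                       # s_r^(l-1) for all behaviors
R = torch.randint(0, 2, (K, M, N)).float()  # binary interaction tensor per behavior
W_nn = nn.Linear(d, d, bias=False)          # W_nn^(l)
W_r = nn.Linear(d, d, bias=False)           # W_r^(l)

deg_u = R.sum(dim=(0, 2)).clamp(min=1)      # |N_u|, pooled over behaviors for simplicity
deg_v = R.sum(dim=(0, 1)).clamp(min=1)      # |N_v|
norm = 1.0 / torch.sqrt(deg_u[:, None] * deg_v[None, :])   # symmetric normalization

P_next = torch.zeros_like(P)
for r in range(K):                          # accumulate messages behavior by behavior
    msg = Q * S[r]                          # composition q_v^(l-1) ⊙ s_r^(l-1)
    P_next += (R[r] * norm) @ W_nn(msg)     # Eq. (8), user-side update
P_next = F.leaky_relu(P_next)
S_next = W_r(S)                             # Eq. (9): relation embedding update
```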

4.4 Joint prediction layer

In the joint prediction layer, after propagating L layers, we have the multi-layer representations {\(p^{(0)}_{u},...,p^{(L)}_{u}\)} for the user, {\(q^{(0)}_{v},...,q^{(L)}_{v}\)} for the item, and {\(s^{(0)}_{r},...,s^{(L)}_{r}\)} for the user-item interaction behavior. Since the embeddings from different layers carry the information of neighbors of different orders, we combine them to obtain the final representations:

$$ \begin{array}{@{}rcl@{}} p_{u}&=& \sum\limits_{l=0}^{L}\alpha_{l}p^{(l)}_{u},\\ q_{v}&=& \sum\limits_{l=0}^{L}\alpha_{l}q^{(l)}_{v},\\ s_{r}&=& \sum\limits_{l=0}^{L}\alpha_{l}s^{(l)}_{r} \end{array} $$
(10)

where αl is a hyper-parameter that represents the importance of the l-th layer embedding. Inspired by the layer combination used to obtain final representations in LightGCN [12], we select the uniform weight \(\frac {1}{L+1}\). In this way, the final embedding carries information from different layers, which not only enriches the semantic information but also captures the effect of graph convolution with self-connections.

To predict the likelihood of a user's multiple behaviors towards an item, the learned representation of each behavior is fed into a separate prediction layer. Specifically, \({s^{k}_{r}}\) denotes the learned embedding of the k-th behavior, and the probability that user u interacts with item v under the k-th behavior is estimated as:

$$ \hat{R}_{(r)uv}=p_{u}\cdot diag({s^{k}_{r}})\cdot q_{v}=\sum\limits_{i=1}^{d} p_{ui}s_{r_{k}i}q_{vi} $$
(11)

where \(diag({s^{k}_{r}})\) represents the diagonal matrix whose diagonal elements correspond to \({s^{k}_{r}}\), and d represents the embedding size.
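A short sketch of the joint prediction layer: uniform layer combination with weight 1/(L+1) as in (10), followed by the behavior-aware scoring of (11). The random layer embeddings stand in for the actual outputs of the propagation layers.

```python
# Layer combination (Eq. 10) and behavior-aware prediction (Eq. 11).
import torch

d, L = 64, 3
p_layers = [torch.randn(d) for _ in range(L + 1)]   # {p_u^(0), ..., p_u^(L)}
q_layers = [torch.randn(d) for _ in range(L + 1)]
s_layers = [torch.randn(d) for _ in range(L + 1)]

alpha = 1.0 / (L + 1)                               # uniform combination weight
p_u = alpha * torch.stack(p_layers).sum(0)          # Eq. (10)
q_v = alpha * torch.stack(q_layers).sum(0)
s_r = alpha * torch.stack(s_layers).sum(0)

r_hat = (p_u * s_r * q_v).sum()   # Eq. (11): p_u . diag(s_r) . q_v = sum_i p_ui s_ri q_vi
```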

4.5 Multi-task learning

To learn the parameters, our main consideration is that, given the massive amount of heterogeneous implicit feedback and its sparsity, sampling-based learning covers only a limited number of observed interactions while the large number of samples without observed interactions is ignored in the recommendation task. We therefore adopt the recent high-efficiency non-sampling learning method [5] to optimize our model. To better learn the model parameters, we introduce a weighted regression with squared loss [20] to compute the loss for a single behavior matrix:

$$ \mathcal{L}_{r}=\sum\limits_{u\in{U}}\sum\limits_{v\in{V_{u}^{+}{\cup}V_{u}^{-}}}\lambda^{r}_{uv}(R_{(r)uv}-\hat{R}_{(r)uv})^{2} $$
(12)

where \(\lambda ^{r}_{uv}\) is the weight of entry R(r)uv, and \(V_{u}^{+}\) and \(V_{u}^{-}\) denote the sets of positive and negative items for target user u, respectively.

Suppose \({\mathscr{L}}^{P}_{r}\) is the loss over the positive data and \({\mathscr{L}}^{A}_{r}\) is the loss over all data. Based on the high-efficiency non-sampling learning method [5], \({\mathscr{L}}_{r}\) is the sum of these two terms, with the constant loss over unlabeled data eliminated. Thus, the loss for a single behavior matrix is \({\mathscr{L}}_{r}={\mathscr{L}}^{P}_{r}+{\mathscr{L}}^{A}_{r}\), where

$$ \begin{array}{@{}rcl@{}} \mathcal{L}^{P}_{r} &=& \sum\limits_{u\in{U}}\sum\limits_{v\in{V_{u}^{+}}}\left((\lambda^{r+}_{uv}-\lambda^{r-}_{uv})\hat{R}^{2}_{(r)uv}-2\lambda^{r+}_{uv}\hat{R}_{(r)uv}\right) \\ \mathcal{L}^{A}_{r} &=& \sum\limits_{i=1}^{d}\sum\limits_{j=1}^{d}\left((s_{ki}s_{kj})\left(\sum\limits_{u\in{U}}p_{ui}p_{uj}\right)\left(\sum\limits_{v\in{V}}\lambda^{r-}_{uv}q_{vi}q_{vj}\right)\right) \end{array} $$
(13)

Finally, following the Multi-Task Learning (MTL) paradigm, models for different but related tasks are jointly trained, and the loss function to be minimized is:

$$ \mathcal{L}= \sum\limits_{r=1}^{R}\gamma_{r}\mathcal{L}_{r}+\mu\left\|\theta\right\|^{2}_{2} $$
(14)

where γr is a hyper-parameter that controls the effect of the r-th behavior on joint training and is set differently for different datasets, and R is the number of behavior types. We also enforce \({\sum }_{r=1}^{R}\gamma _{r}=1\) to ease the tuning of hyper-parameters.

To optimize the objective function, we use mini-batch Adam [21] as the optimizer, which self-adaptively updates the learning rate during training and alleviates the difficulty of choosing an appropriate learning rate. In our model, we also employ dropout, an effective means to prevent overfitting [22] in neural networks.
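The sketch below is our reconstruction of the single-behavior non-sampling loss of (13) in the EHCF style: a correction term over the observed positives plus a whole-data term computed in O(d^2) via the decoupling trick. lam_pos and lam_neg stand for λ^{r+} and a uniform λ^{r-}; the toy positive pairs are assumptions.

```python
# Non-sampling loss for one behavior r, following Eq. (13).
import torch

M, N, d = 4, 5, 64
P = torch.randn(M, d, requires_grad=True)    # final user embeddings p_u
Q = torch.randn(N, d, requires_grad=True)    # final item embeddings q_v
s = torch.randn(d, requires_grad=True)       # behavior embedding s_r
lam_pos, lam_neg = 1.0, 0.1                  # positive / uniform negative entry weights
pos = [(0, 2), (3, 4)]                       # observed (u, v) pairs under behavior r

# L^P_r: correction term over the positive entries only
u_idx = torch.tensor([u for u, _ in pos])
v_idx = torch.tensor([v for _, v in pos])
r_hat = (P[u_idx] * s * Q[v_idx]).sum(-1)    # predictions on the positives
loss_p = ((lam_pos - lam_neg) * r_hat ** 2 - 2 * lam_pos * r_hat).sum()

# L^A_r: whole-data term, decoupled into three d x d Gram-style matrices
Pg = P.t() @ P                               # sum_u p_ui p_uj
Qg = lam_neg * (Q.t() @ Q)                   # sum_v lam^- q_vi q_vj (uniform weight)
Sg = torch.outer(s, s)                       # s_i s_j
loss_a = (Sg * Pg * Qg).sum()

loss_r = loss_p + loss_a                     # L_r = L^P_r + L^A_r
loss_r.backward()                            # gradients flow to P, Q, s
```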

5 Experiments

In this section, we conduct experiments on two real-world datasets for the multi-behavior recommendation task to validate the effectiveness of our proposed NAH model against state-of-the-art baselines. We describe the datasets and their distributions, an ablation study on auxiliary behaviors, and the hyper-parameter settings.

5.1 Experimental settings

5.1.1 Datasets

We evaluate the performance of the model on two real-world datasets collected from the Taobao and Beibei platforms; their statistics are displayed in Table 1. Beibei: the Beibei dataset [4] is collected from the largest e-commerce platform for maternal and infant specialty products in China. It includes several types of user-item interaction data, and we take three behaviors (page-view, add-to-cart, and purchase) to study their influence on recommendation performance in our experiments. Taobao: the Taobao dataset [23] is collected from the most popular online e-commerce platform in China. The two datasets share the same three user behaviors, but their data distributions are completely different: Beibei has fewer user and item entities but more auxiliary- and target-behavior records, whereas Taobao has more user and item entities but fewer behavior records.

Table 1 Statistical details of the evaluation datasets

For a fair comparison with the state-of-the-art baselines, the datasets were preprocessed consistently with previous works [5]: both datasets are split according to the number of target-behavior records, and users and items with fewer than 5 target-behavior records are excluded.
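A hedged sketch of this preprocessing as an iterative 5-core filter on the target-behavior records; the DataFrame column names and file path are hypothetical.

```python
# Iteratively drop users/items with fewer than k target-behavior records.
import pandas as pd

def k_core(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    while True:
        u_cnt = df.groupby("user")["item"].transform("count")
        i_cnt = df.groupby("item")["user"].transform("count")
        kept = df[(u_cnt >= k) & (i_cnt >= k)]
        if len(kept) == len(df):             # fixed point: all counts >= k
            return kept
        df = kept                            # removals may push other counts below k

# purchases = pd.read_csv("purchase_records.csv")   # hypothetical file
# purchases = k_core(purchases, k=5)
```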

5.1.2 Evaluation metrics

To evaluate model performance, we employ the widely used leave-one-out technique [4, 5] with Hit Ratio (HR) [26] and Normalized Discounted Cumulative Gain (NDCG) [27]. HR is a recall-oriented indicator that measures whether the test item appears in the top-N list, while NDCG is a ranking metric that emphasizes the position of the item in the top-N list. For each user, we rank all items unlabeled in the training set, which yields more convincing results than ranking only a randomly sampled subset of negative items [28, 29].
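Under leave-one-out, each user has a single held-out positive, so HR@N and NDCG@N reduce to simple functions of that item's rank, as in this small sketch (our reconstruction of the standard definitions).

```python
# HR@N and NDCG@N with one held-out relevant item per user.
import numpy as np

def hr_ndcg_at_n(rank: int, n: int):
    """rank: 0-based position of the held-out item in the user's ranked list."""
    hit = 1.0 if rank < n else 0.0
    ndcg = 1.0 / np.log2(rank + 2) if rank < n else 0.0   # DCG of a single hit, IDCG = 1
    return hit, ndcg

ranks = [0, 3, 12]                   # illustrative test-item ranks for three users
hrs, ndcgs = zip(*(hr_ndcg_at_n(r, 10) for r in ranks))
print(np.mean(hrs), np.mean(ndcgs))  # HR@10 and NDCG@10 averaged over users
```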

5.1.3 Baselines

To verify the effectiveness of our method, we selected several state-of-the-art models for comparison, classified into two categories: single-behavior models and multi-behavior models. Single-behavior models:

BPR :

[24]: a widely used learning framework that optimizes a pairwise loss for item recommendation.

NCF :

[1]: a state-of-the-art learning framework that introduces neural networks to learn the user-item interaction function for item ranking.

LightGCN :

[12]: a simplified GCN model that keeps only the most essential components and improves recommendation performance.

Multi-behavior models:

CMF :

[25]: a popular technique that jointly factorizes multiple matrices, decomposing the rating matrix R together with the side-information matrices of users and items, to boost the overall factorization quality.

MC-BPR :

[14]: it extends BPR to heterogeneous data with an adaptive negative sampling rule that uses behavior-level information to sample negative examples.

NMTR :

[4]: it models relationships among user behaviors in a cascaded way and uses multi-task learning to learn from users' multi-behavior data.

EHCF :

[5]: it applies an efficient non-sampling (whole-data based) learning strategy to multi-behavior recommendation for the first time and achieves very significant gains in both training time and model performance.

GHCF :

[7]: a state-of-the-art method that reveals latent relationships among heterogeneous user-item interactions and exploits a relation-aware GCN propagation layer to acquire high-hop signals.

5.1.4 Parameters settings

For the parameters, we initialize the latent factors with dimension 64 and set the batch size to 256. We further explore the optimal parameters on the validation set and evaluate the model on the test set. For the baselines, we mainly follow the parameters and tuning strategies provided by the original papers. In training, we optimize our model with the mini-batch Adagrad [30] optimizer, set the learning rate to 0.001, and use Xavier initialization [31] for the parameters. Besides, we set the message dropout ratio ρ to 0.2 and the node dropout ratio to 0.1 to prevent overfitting. The regularization coefficient is set to 10 for Beibei and 0.01 for Taobao. We also utilize early stopping to avoid overfitting: training stops if recall@10 on the validation set does not increase within 50 epochs. For the other hyper-parameters, the uniform negative entry weight is set to 0.1 for Beibei and 0.01 for Taobao, and the multi-task learning coefficients are set to γ1 = 1/6, γ2 = 4/6, γ3 = 1/6. For the sampling-based baselines, the negative sampling ratio is set to 4; for the non-sampling baselines, the negative weights for Beibei and Taobao are set to 0.01 and 0.1, respectively, which experiments show performs well.
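For reference, the settings above can be collected into a single config dict (values copied from the text; the key names are our own):

```python
# Hyper-parameter settings of NAH as reported above.
config = {
    "embedding_dim": 64,
    "batch_size": 256,
    "optimizer": "mini-batch Adagrad",
    "learning_rate": 0.001,
    "initialization": "Xavier",
    "message_dropout": 0.2,
    "node_dropout": 0.1,
    "l2_reg": {"Beibei": 10, "Taobao": 0.01},
    "neg_weight": {"Beibei": 0.1, "Taobao": 0.01},   # uniform negative entry weight
    "gamma": (1/6, 4/6, 1/6),                        # multi-task loss coefficients
    "early_stop_patience": 50,                       # epochs without recall@10 gain
}
```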

5.2 Overall performance

In this subsection, we compare our model with several state-of-the-art baselines. The performance of all models on the two datasets is shown in Tables 2 and 3. Note that we follow the parameters and tuning strategies provided by the original models for a fair comparison. To investigate top-N performance, we set N to [10, 50, 100, 200] in our experiments. From the experimental results, the following can be observed:

Table 2 NAH recommendation performance on Beibei comparing with start-of-the-art baselines
Table 3 NAH recommendation performance on Taobao comparing with start-of-the-art baselines

Model effectiveness

From the tables, we find that on the HR and NDCG evaluation metrics our NAH substantially outperforms the state-of-the-art baselines on both datasets. For instance, the average improvement of our model over the strongest baseline is 2.96% and 1.51% on the Beibei dataset, and 2.87% and 1.46% on the Taobao dataset, for HR and NDCG respectively, which confirms the effectiveness of our model. These results also demonstrate the benefit of incorporating the information of a node's neighbors into its embedding.

User preference intensity

Among the single-behavior baselines, the models based on graph neural networks (e.g., LightGCN) achieve better performance in most cases. This illustrates the rationality of encoding collaborative signals from the user-item graph through embedding propagation between adjacent nodes. However, LightGCN and NMTR fail to learn target-node representations well from neighboring nodes, which demonstrates that capturing user preference intensity for neighboring nodes is very important for recommendation performance, as it enhances representation learning.

Multi-behavior model

Compared with BPR, NCF, and LightGCN, we observe that incorporating multi-behavior information into prediction generally outperforms methods using only a single behavior, which illustrates the importance of heterogeneous data for recommendation performance. For instance, the improvement of our model over the state-of-the-art single-behavior method is 98.5% and 71.5% for HR@100 on the Beibei and Taobao datasets, respectively. This shows that incorporating multiple behavior types into the embedding function of recommender systems plays a positive role.

Non-sampling learning strategy

The non-sampling methods (EHCF, GHCF, ARGO, NAH) generally outperform the sampling-based methods (NCF, LightGCN, MC-BPR, NMTR). EHCF demonstrates the effectiveness of the whole-data based learning strategy, which is well suited to learning from heterogeneous behavior data.

6 Discussions

In this section, we discuss the impact of auxiliary behaviors, the impact of data sparsity, hyper-parameter sensitivity, and possible limitations of our model.

6.1 Impact of auxiliary behaviors

To understand the effectiveness of multiple auxiliary behaviors, we explore their impact on model performance on both datasets by considering different combinations of behavior types:

NAH-P: the model variant using only purchase behavior.

NAH-PV: the model variant including purchase and page-view.

NAH-PC: the model variant including purchase and add-to-cart.

Tables 4 and 5 show the model performance for the different behavior combinations. From the results, both page-view and add-to-cart auxiliary behaviors lead to better recommendation performance, and NAH improves further when all three behavior types are used simultaneously. This demonstrates the effectiveness of modeling auxiliary behaviors for user preference. Besides, we make two observations. On the one hand, carting behavior has a greater effect on recommendation on the Beibei dataset than on the Taobao dataset, which may be due to the different amounts of auxiliary behavior data in the two datasets. On the other hand, incorporating auxiliary behaviors into our model significantly improves recommendation performance. This shows the impact of implicit user feedback on recommender systems, which also makes research on multi-behavior recommendation meaningful.

Table 4 Performance of variants of NAH model on Beibei
Table 5 Performance of variants of NAH model on Taobao

6.2 Impact of data sparsity

Data sparsity is a major challenge, since records of the target behavior are scarce and new users have no behavior records at all, which renders most recommendation models ineffective. We address it by introducing auxiliary behaviors [32] and high-order user-item relationships. As Tables 2 and 3 show, the average improvement of our model is superior to the state-of-the-art baselines under optimal parameters: the improvement over the state-of-the-art single-behavior method is 139.7%, and over the multi-behavior method 4.47%, for HR@50 on the Beibei dataset, illustrating the strength of the NAH model. As Figure 3 shows, NAH outperforms the other methods, including the state-of-the-art multi-behavior models NMTR and GHCF. In particular, our method performs well for users with fewer than 16 purchase records, which verifies its ability to alleviate data sparsity. Moreover, the performance on the Taobao and Beibei datasets differs slightly, which we attribute to their different data distributions. For example, the Taobao dataset is sparser than the Beibei dataset and has fewer user-item interactions; there is far more page-view data than add-to-cart data because page views are easier to collect; and the two datasets differ greatly in the number of purchase records in each sparsity split. The results demonstrate the effectiveness of NAH in alleviating the data sparsity issue with auxiliary behaviors, since NAH learns the target behavior and the auxiliary behaviors in a principled way.

Fig. 3

Impact of the number of purchase records

6.3 Hyper-parameter study

To evaluate how hyper-parameters affect the performance of NAH, we investigate two important ones: the loss coefficients γk and the number of layers L. Since our model is a multi-task model, we test the three loss coefficients γ1, γ2, and γ3 in the multi-task loss function; in addition, we analyze the influence of the number of layers L on performance.

Firstly, we examine the influence of different γk on the Beibei and Taobao datasets. We tune the three loss coefficients in [0, 1/6, 2/6, 3/6, 4/6, 5/6, 1]. As γ1 + γ2 + γ3 = 1, once γ1 and γ2 are given, γ3 is determined. When γ1 = 0 and γ2 = 0, the model uses only purchase behavior, like single-behavior recommendation, and performs badly. For both datasets, the setting (1/6, 4/6, 1/6) achieves the best performance.

Next, we test the influence of the depth of our model on the Beibei and Taobao datasets to check the effectiveness of multiple embedding propagation layers. In particular, we vary the depth in the range [1, 2, 3, 4, 5]; the results are shown in Figure 4, where L = 1 means the model has one first-order embedding propagation layer, and so on. In the figure, the y-axis denotes HR@100 under different layer numbers. We observe that increasing the model depth from 1 to 4 significantly improves performance on both the Beibei and Taobao datasets, which demonstrates that our model can capture high-order relationships in the multi-behavior recommendation scenario.

Fig. 4

Impact of model depth

7 Conclusion

In this paper, we investigate multi-behavior recommendation that considers user preference intensity and high-hop propagation based on heterogeneous user feedback. To fully model user preference intensity between users and items under different behavior types, we propose a novel graph-based approach, NAH, which leverages the attention mechanism to weight different neighbors and explores high-order propagation on a heterogeneous graph. Extensive experimental results demonstrate the state-of-the-art performance of NAH on two real-world datasets, and further ablation studies verify the effectiveness of employing different types of auxiliary behaviors and of alleviating data sparsity.

Although NAH is specifically designed to extract user preference intensity and explore high-order relationships in multi-behavior recommendation, the experiments show that the depth of the embedding propagation layers still has certain limitations. In the future, we will further explore how to address overfitting and data noise. We are also interested in special designs on heterogeneous graphs, such as adaptively learning behavior importance and semantics. In addition, we intend to extend the NAH model to other heterogeneous graph recommendation scenarios.