1 Introduction

In the current information era, a huge amount of information leads to the critical issue of information overload. Recommender systems (RSs) have been proposed to solve this challenge through information filtering, and thus can effectively provide suggestions for users’ choices on a large number of items [7, 35]. Various approaches have been developed for RSs during the past few years, including collaborative filtering-based approaches, content-based approaches, and hybrid approaches [17, 23]. These approaches mainly focus on learning the long-term static preferences of users for recommendations [25,26,27], which have two major drawbacks: (1) long-term and static preferences of users are sometimes unavailable, and (2) the short-term and dynamic preference of users are usually ignored [28].

Session-based recommender systems (SBRSs) have emerged in recent years, which can alleviate these two drawbacks. SBRSs learn a user’s short-term and dynamic preferences based on the items interacted by the user within sessions. In general, there are four basic types of SBRSs, including (1) Markov chain based SBRSs, (2) factorization based SBRSs, (3) recurrent neural network (RNN) based SBRSs, and (4) graph neural network (GNN) based SBRSs. Markov chain based SBRSs [10, 16] and factorization based SBRSs [2, 12] recommend the next item based on the previous one, which can capture the first-order dependencies (i.e., the dependencies between any two adjacent items within a session). RNN-based SBRSs [5, 6, 13] can further capture the high-order dependencies (i.e., the cascaded dependencies between multiple items within a session) across multiple items by modelling each session as a sequence for next-item recommendations. To further capture the complex transitions over items within each session, GNN-based SBRSs [20, 21, 31] have been proposed. GNN-based SBRSs firstly map each session to a subgraph separately, and then takes the subgraph as the input of a GNN to further capture the dependencies over nodes (i.e., items) in the subgraph for next-item recommendations.

However, most of the existing GNN-based SBRSs have two shortcomings: (1) they often utilize the limited session information only for next-item recommendations, while ignoring other important side information including item attributes and user IDs, which can greatly complement the limited session information; and (2) they usually model the intra-session dependencies (i.e., the dependencies between items within one session) and ignoring the complex inter-session dependencies (i.e., the dependencies between items across multiple sessions). Both shortcomings greatly reduce the performance of next-item recommendations. Accordingly, to address these shortcomings and further improve the performance of SBRSs, there are two challenges still need to be addressed:

  • CH1: How to effectively make full use of the available side information including item attribute information and user information for improving the accuracy of next-item recommendations?

  • CH2: How to effectively extract both intra- and inter-session dependencies to further improve next-item recommendations?

To address the aforementioned two challenges, in this paper, we propose a novel hypergraph learning framework. Specifically, to address CH1, we introduce item attribute information and user information simultaneously to enrich the information for next-item recommendations. To effectively incorporate such side information, we construct a hypergraph to connect the items, item attribute values and users together by integrating the session-based item-item graph, the item-item attribute value graph and user-session graph into a unified hypergraph. In such a case, items are connected not only by sessions, but also by their shared attribute values and the joint users (see Figure 1). A hypergraph is a generalization of graphs in which the edges are arbitrary non-empty subsets of the vertex set. To address CH2, we propose a hypergraph learning framework to learn both the intra-session dependencies (i.e., the dependencies between items within the same session) and inter-session dependencies (i.e., the dependencies between items from different sessions) with a focus on learning inter-session dependencies. For intra-session dependencies, we take each subgraph built on each session into gated graph neural networks to learn the dependencies between items within sessions. For inter-session dependencies, we devise three types of aggregation operations to learn the inter-session dependencies to absorb information from other sessions to benefit the next-item recommendation in the current session. Each aggregation is based on one type of hyperedges in the hypergraph.

Fig. 1
figure 1

An example of a Hypergraph constructed on user transaction information and item attribute information. The hyperedges connecting user and session sets (i.e., the sessions purchased by one user) indicate the user-session relations, the arrowed edges connecting two items indicate the sequential relations between items from a given session, and the dotted edges connecting item and item attribute values mean the item-attribute relations

The first aggregation operation, called user aggregation, is utilized to learn the dependencies between items that are purchased by the same users. Here the hyperedges connecting each user and all the sessions from this user serve as the bridges to connecting items from different sessions from the same user. The second aggregation operation, called item aggregation, is utilized to learn the dependencies between items from different sessions. The hyperedge connecting a given item and all of its co-occurred items (i.e., those items that concurred with the given item in the same sessions) forms the bridge to connect items from different sessions. The third aggregation operation, called attribute aggregation, is utilized to learn the dependencies between items which have shared attribute values. Here the hyperedge connecting each item and all of its attribute values serve as the bridge to connect different items sharing attribute values.

In addition, we devise session aggregation to learn a given user’s preference towards items. Specifically, session aggregation aggregates all the historical sessions of a user to represent her/his preference. Session aggregation is based on hyperedges connecting each users and all of her/his historical sessions.

The main contributions of our framework are summarized as follows:

  • We propose a hypergraph to encode not only the session information, but also the user-session information and the item-attribute information. In this way, all the items and users in a given data set are comprehensively connected together in a unified way. This greatly enriches the information for SBRSs.

  • We propose a hypergraph learning framework to effectively learn not only the intra-session dependencies, but also the widely existed inter-session dependencies for more accurate next-item recommendations. In addition, each user’s preference towards items has also been well captured by the devised session aggregation to further improve the recommendation accuracy.

  • The extensive experiments have been conducted on two real-world datasets, and the experimental results show that our HL framework outperforms the state-of-the-art methods.

The rest of this paper is organized as follows. we provide a comprehensive discussions on related work in Section 2. Section 3 presents the problem statement. We present our proposed hypergraph learning (HL) framework in Section 4. Datasets, evaluation metrics, comparison methods are introduced in Section 5. Finally, we conclude this paper in Section 6.

2 Related work

In this section, we review several related studies on SBRSs and hypergraph learning, where we divided SBRSs into two categories: (1) conventional SBRSs, including Markov chain based SBRSs and factorization based SBRSs, and (2) deep learning based SBRSs, including RNN-based SBRSs and GNN-based SBRSs.

2.1 Conventional SBRSs

Markov chain based SBRSs and factorization based SBRSs are two typical types of conventional SBRSs. Markov chains based SBRSs mainly employ Markov chains to model the users’ behaviour. Specifically, they learn the transition probabilities to capture the first-order dependencies between two adjacent items, and then recommend the next item based on this dependencies [2]. However, due to the property of Markov chains, this method only captures the first-order dependencies between two adjacent items while ignoring the high-order dependencies across multiple items. Factorization based SBRSs factorize the item transition matrix into latent representations of users and items, which are subsequently used to predict the next-item in which a given user may be interested [1]. However, because the users’ preference is sometimes unavailable, factorization based SBRSs do not always work. Further, Rendle et al. [16] firstly proposed Factorized Personalized Markov Chains (FPMC) model for next item recommendation, which factorizes the user-item matrix to learn both the general taste of a user and transitions over items by Markov chains method. However, as the combination of Markov chains based SBRSs and factorization based SBRSs, FPMC still cannot capture higher-order dependencies.

2.2 Deep learning based SBRSs

To capture the high-order dependencies, Hidasi et al. [6] proposed an RNN-based approach for SBRSs, called GRU4Rec, which employs gated recurrent units (GRUs) as the basic cells of an RNN to model the rigid order dependencies among items in each session. Hidasi et al. [5] further introduced a novel ranking loss function to improve the performance of GRU4Rec. However, RNN-based SBRSs usually rely on the rigid order assumption among items within a session, leading to two drawbacks: (1) the rigid order is sometimes meaningless, since a user’s interactions with items in a session may not always follow a rigid order; and (2) the complex transitions between distant items within a session cannot be effectively captured with a single-way RNN.

To capture the complex transitions over items within the entire session, Wu et al. [20] proposed a GNN-based method for session-based recommendation, called SR-GNN. In SR-GNN, first, a directed graph is built for each session, and then GNN are employed on the graph to capture the complex transitions over items within the session. Xu et al. [31] further proposed a graph contextualized self-attention network (GC-SAN), which introduces a self-attention mechanism into GNN-based SBRSs to learn more attentive and informative session representations for more accurate next-item recommendations. Wu et al. [21] further proposed a personalizing graph neural networks with an attention mechanism (A-PGNN), which employes a personalizing graph neural network to capture complex transitions in users’ session sequence and employes a dot-product attention mechanism to explicitly model the effect of historical sessions on the current session. Yu et al. [32] proposed a target attentive graph neural network (TAGNN) to activate each user’s different interests with respect to varied target items for next item recommendations. Qiu et al. [18] proposed a full graph neural network (FGNN) to learn the inherent order of item transition pattern and proposed a weighted graph attention layer (WGAT) network to learn different weights for different neighbors, which aims at improving the accuracy of next-item recommendations. However, all the GNN-based methods mentioned above mainly capture the local dependencies within one session and neglect the cross-session global dependencies.

A few studies tried to model the global dependencies across multiple sessions for accurate session-based recommendations. Wang et al. [30] proposed a global context enhanced graph neural network (GCE-GNN) to learn the session level latent representations of items within one session, and learn the global level latent representations of items over all sessions. However, there are two shortcomings of GCE-GNN: (1) it learns the latent representations for items by modeling pairwise item-transitions, while ignoring the more complex relationships of item-transitions (e.g., list-wise item-transitions); and (2) it fails to explicitly capture the global dependencies based on item attribute information, leading to information loss. To alleviate these shortcomings of SBRSs, Wang et al. [29] proposed a heterogeneous mixed graph learning (HMGL) model to learn an informative representation for each item over the heterogeneous graph by effectively modelling both local and global dependencies for next-item recommendations. However, this work does not consider the significant role of users in connecting the different but relevant sessions as well as items. Such a mission the user information leads to information loss and subsequently deteriorates recommendation accuracy.

Most of the existing SBRS methods [21, 31, 32] mentioned above usually capture intra-session dependencies and omit to consider the inter-session dependencies. This omission leads to both information loss and low recommendation accuracy. In this paper, we propose an HL framework based on a previous study [29]. In addition to capturing intra-session dependencies, our proposed HL framework can further capture the inter-session dependencies by taking the items occurring in different sessions, users shared by different sessions, and attribute values shared by the items from different sessions as the bridges to connect multiple sessions. Therefore, our proposed framework can capture the more complex dependencies, and as a result get better recommendation results.

2.3 Hypergraph learning

Traditional graph learning methods focus on pairwise relationships between objects [11, 24]. To capture more complex relationships, Zhou et al. [34] generalized the spectral clustering graphs into hypergraphs, based on which they further developed an algorithm for hypergraph embedding and transductive classification. Huang et al. [8] employed the hypergraph learning in video object segmentation, where they formulated the task of extracting prominent objects into hypergraph cut. To learn the weights of hyperedges, Gao et al. [4] introduced a l2 regularizer. With the development of deep learning, Feng et al. [3] firstly proposed a hypergraph neural network (HGNN) framework, where they employed hypergraph convolution operation to learn representation of high-order data correlation in a hypergraph structure. Jiang et al. [9] further separated a dynamic hypergraph neural network framework (DHGNN) into dynamic hypergraph construction (DHG) and hypegraph convolution (HGC) operations to capture both the inherent and dynamic hypergraph structures.

3 Problem statement

Given a user-transaction dataset, \(\mathcal {U} = \{u_{1}, u_{2}, ..., u_{|u|}\}\) is the user set consisting of all unique users, |u| is the number of users in the dataset. \(\mathcal {V}^{I} = \{v_{1}, v_{2}, ..., v_{|\mathcal {V}^{I}|}\}\) is the item set consisting of all the unique items, \(|\mathcal {V}^{I}|\) is the number of items. Let \(\mathcal {D} = \{S_{1}, S_{2}, ..., S_{|u|}\}\) be the set of all user-transactions in this dataset. Each user-transaction Si = {si,1,si,2,...,si,|s|} is the set of sessions that are associated with a user ui. Each session \({s_{i,j}} = \{v_{i,j,1}, v_{i,j,2}, ...,\\ v_{i,j,|s_{i,j}|}\}\) consists of all the unique items sequentially interacted by the corresponding user ui in her/his jth session. \(\mathcal {V}^{A} = \{a_{1}, a_{2}, ..., a_{|\mathcal {V}^{A}|}\}\) is the set of all item attribute values in the dataset. \(\mathcal {A}=\{A_{1}, A_{2},...,A_{|\mathcal {V}^{I}|} \}\) is the set of attributes for all items, and \(A_{h}=\{a_{h,1}, a_{h,2},...,a_{h,|A_{h}|} \}\) the set of attribute values associated with item vh. The task of our work is to recommend the next item based on the purchased items in the current session. Formally, given the session si,j of user ui, take vl (vlsi,j) as the target item, \(C_{v_{l}}^{s_{i}}=\{v_{1},v_{2},...,v_{l-1}\}\) is the corresponding session context consisting of all the items that have occurred prior to item vl in session si,j, while the corresponding item attribute value set \(C_{v_{l}}^{a} =\{\mathcal {A}_{v_{1}}, \mathcal {A}_{v_{2}},...,\mathcal {A}_{v_{l-1}}\}\) forms the corresponding attribute context. Hence, given a context set \(C_{v_{l}}=[C_{v_{l}}^{s_{i}}, C_{v_{l}}^{a}]\), the task of our work is to recommend the lth item vl in session si,j.

4 Hypergraph learning framework

In this section, we present our proposed hypergraph learning (HL) framework. In particular, first, the HL framework constructs a hypergraph to connect the items, users and item attribute values by integrating the session-based item-item graph, the item-item attribute value graph, and user-session graph into a unified hypergraph. Then, the hypergraph learning model is built to learn the latent representations of users and items for next-item recommendations from the hypergraph. Accordingly, as shown in Figure 2, the HL framework consists of three components: the hypergraph construction module, the hypergraph learning module, and the next-item prediction module.

Fig. 2
figure 2

The workflow of our proposed HL framework. First, we integrate session information, user information and item attribute information into a hypergraph. Then, the hypergraph learning module learns the intra- and inter-session dependencies between items. Specifically, three node level aggregations are devised to learn three types of inter-session dependencies, while the node level updating aims to learn the intra-session dependencies and graph pooling learns a unified representation for each session context. Finally, we predict the next item by taking the session context representation as the input

4.1 Preliminaries

Graphs can only indicate pairwise relationships. In contrast, hypergraphs preserve the multi-wise relationships, and thus, are a natural choice for the modeling complex relationships. We define a hypergraph as follows:

Definition 1 (Hypergraph)

A hypergraph \(\mathcal {{G}} = \{\mathcal {V}, \mathcal {E} \}\) on a finite set of vertices (or nodes) \(\mathcal {V}= \{v_{i}: i \in \lVert n \rVert \}\), with a family of hyperedges \(\mathcal {E}= \{e_{j}: j \in \lVert p \rVert \}\), where each hyperedge is a non-empty subset of \(\mathcal {V}\), such that \(\cup _{j \in \lVert p \rVert }e_{j}= \mathcal {V} \).

In a hypergraph, a hyperedge links one or more vertices.

4.2 Hypergraph construction

A traditional graph can only represent the pairwise relationships between any two nodes. However, in the real world, there will be much more complex relationships like list-wise relationships than pairwise relationships. Hypergraphs can preserve these list-wise relationships. For better recommending the next item, we construct the user sessions and item-attribute values into a unified hypergraph. Firstly, we model the users, items and item attribute values as three types of nodes in the graph. Formally, \(\mathcal {V} = \mathcal {U} \cup \mathcal {V}^{\mathcal {I}} \cup \mathcal {V}^{A}\) be the node set in the graph, where \(\mathcal {U}\), \(\mathcal {V}^{I}\) and \(\mathcal {V}^{A}\) correspond to the node set of users, items and item attribute values respectively.

Secondly, we model user-session relations, the item-item relations, and the item-attribute value relations as three types of hyperedges in the hypergraph. Formally, \(\mathcal {E} = \mathcal {E}^{\mathcal {U}} \cup \mathcal {E}^{\mathcal {I}} \cup \mathcal {E}^{\mathcal {A}}\) constitute the hyperedge set. The hyperedge \(e_{u} \in \mathcal {E}^{U}\) represents the relation that session \({s_{u}} = \{v_{u,1}, v_{u,2}, ..., v_{u,|s_{u}|}\}\) purchased by user u. In other words, the several items \(v_{u,1}, v_{u,2}, ..., v_{u,|s_{u}|}\) contents of items purchased within one session su by user u are linked together by this hyperedge eu. Hyperedge \(e_{i,j}^{I} \in \mathcal {E}^{I}\) represents the item vj occurs after item vi within one session. The hyperedge \({e_{i}^{A}} \in \mathcal {E}^{A}\) represents that item vi has the attribute value set \(A_{i}= \{a_{i,1}, a_{i,2},...,a_{i,|v_{i}|} \}\), and the item vi and its corresponding attribute values \(a_{i,1}, a_{i,2},...,a_{i,|v_{i}|}\) are linked together by this hyperedge.

As a result, the unified hypergraph connect the users, items and item attribute values by integrating user-session graph, session-based item-item graph and item-item attribute graph that are built on the aforementioned three types of hyperedges respectively together. Such a complex hypergraph is powerful to model the complex high-order relations among different types of objects (i.e., users, items and item attribute values).

4.3 Hypergraph learning

HL mainly has three components: (1) node level aggregation, (2) node level updating and (3) graph level learning. In the node level aggregation part, there are three types of nodes in the hypergraph, including user nodes, item nodes and item attribute nodes, and we mainly learn the latent representations for user nodes and items nodes in the hypergraph. For user nodes, we learn the latent representations of users via session aggregation, which is utilized to aggregate information from a user’s historical sessions and the current sessions to learn his/her preference representation. For item nodes, we learn the latent representations based on three different types of hyperedges. Therefore, three aggregation operations are introduced to respectively process the hypergraph. After we aggregate information for user nodes and item nodes, we update the latent representations of items and users by graph gated neural netwoks (GGNN) based on the constructed hypergraph. Finally, we learn a graph level latent representation for each session sub-hypergrap using graph pooling. We discuss each model component in detail below.

4.3.1 Node level aggregation in hypergraph

In this part, we aim to aggregate information from various hyperedges for user nodes \({u} \in \mathcal {U}\) and item nodes \({v} \in \mathcal {V}^{I}\) in hypergraph. Firstly, we map each user node \({u} \in \mathcal {U}\) and item node \({v} \in \mathcal {V}^{I}\) into a unified low-dimension latent space to obtain the initial representation \(\mathbf {u}, \mathbf {v} \in \mathbb {R}^{d}\), where d denotes the dimension of the representation. Then, for each item node \({v} \in \mathcal {V}^{I}\), we employ three aggregation operations to learn three types of latent representations \(\mathbf {v}_{i}^{U}, \mathbf {v}_{i}^{I}, \mathbf {v}_{i}^{A}\) by absorbing the information from three types of neighborhood nodes of v respectively in \(\mathcal {G}\). Each type of latent representation can be seen as a sub-representation of the item from a particular perspective. Similarly, for each user node \({u} \in \mathcal {U}\), we employ one aggregation operation to update its latent representation by learning the user preference from her/his historical and current sessions.

The first aggregation operation, denoted as user aggregation, is utilized to learn the user-based latent representation \(\mathbf {v}_{i}^{U} \in \mathbb {R}^{d}\) of item vi. The aim of user aggregation is to aggregate information for items from the joint users. We mathematically represent this aggregation:

$$ \mathbf{v}_{i}^{U} = Aggre_{user}\{{\mathbf{v}_{j}, \forall v_{j} \in N_{user}({v}_{i})}\}, $$
(1)

Nuser(vi) is the user-based neighborhood set of item node vi, which consists of item nodes having been interacted by the same user who interacted with item node vi. We specify the user aggregation as follows:

$$ Aggre_{user}(\mathbf{v}_{i}) = \mathbf{A}^{u}_{i}[\mathbf{v}_{i,1}, \mathbf{v}_{i,2},...,{\mathbf{v}_{i,|N_{user}({v}_{i})|}}]^{T}, $$
(2)

where \([\mathbf {v}_{i,1}, \mathbf {v}_{i,2},...,{\mathbf {v}_{i,|N_{user}({v}_{i})|}}]\) is the list of latent representations of item nodes \({v}_{i,1}, {v}_{i,2},...,{{v}_{i,|N_{user}({v}_{i})|}} \in N_{user}({v}_{i})\). The weight matrix \(\mathbf {A}^{u}_{i}\) shows the importance of each item node in Nuser(vi) for item node vi, which is calculated as follows:

$$ \mathbf{A}^{u}_{i,j} = \frac{\left\|\mathcal{V}_{i,j}^{e_{u}}\right\|_{2}}{\left\|\mathcal{V}_{i}^{e_{u}}\right\|_{2}+\left\|\mathcal{V}_{j}^{e_{u}}\right\|_{2}}\qquad $$
(3)

where \(\mathcal {V}_{i,j}^{e_{u}}= \{u | (v_{i},v_{j}, u) \in e_{u}\}\) consists of the user nodes u that have hyperedges eu with item node vi and vj simultaneously. \(\mathcal {V}_{i}^{e_{u}}= \{u | (v_{i}, u) \in e_{u}\}\) consists of the user nodes u that have hyperedges eu with item node vi. \(\left \| \cdot \right \|_{2}\) is the L2 norm.

The second aggregation operation, denoted as item aggregation, is utilized to learn the inter-session-based latent representation \(\mathbf {v}_{i}^{I} \in \mathbb {R}^{d}\) of item vi. The aim of item aggregation is to aggregate information from other items which occurred in the same session as item vi.

$$ \mathbf{v}_{i}^{I} = Aggre_{sess}\{{\mathbf{v}_{j}, \forall v_{j} \in {N}_{sess}({v}_{i})}\}, $$
(4)

Nsess(vi) is the session-based neighborhood set of item node vi, which contains all the item nodes in the hypergraph that occur in the same session with item node vi. The item aggregation is specified as:

$$ Aggre_{items} (\mathbf{v}_{i})= \mathbf{{A}}_{i}^{s} [\mathbf{v}_{i,1}, \mathbf{v}_{i,2},...,{\mathbf{v}_{i,|sess|}}]^{T}, $$
(5)

where [vi,1,vi,2,...,vi,|sess|] is the list of latent representations of item nodes vi,1,vi,2,...,vi,|sess|Nsess(vi). The weight matrix \(\mathbf {A}_{i}^{s}\) shows the importance of each item node in Nsess(vi) for item node vi, which is calculated as below:

$$ \mathbf{{A}}_{i}^{s} = \frac{\left\|\mathcal{V}_{i,j}^{e^{I}}\right\|_{2}}{\left\|\mathcal{V}_{i}^{e^{I}}\right\|_{2}+\left\|\mathcal{V}_{j}^{e^{I}}\right\|_{2}}\qquad $$
(6)

where \(\mathcal {V}_{i,j}^{e^{I}}= \{sess | (v_{i},v_{j}, sess) \in e^{I}\) consists of the sessions sess that have hyperedges eI with item node vi and vj simultaneously. \(\mathcal {V}_{i}^{e^{I}}= \{sess | (v_{i}, sess) \in e^{I}\}\) consists of the sessions sess that have hyperedges eI with item node vi.

The third aggregation operation, denoted as attribute aggregation, is utilized to learn the attribute-based latent representation \(\mathbf {v}_{i}^{A} \in \mathbb {R}^{d}\) of item vi. The aim of attribute aggregation is to aggregate information from items, which have shared attribute values with a given item vi. To mathematically represent this aggregation, we use the following function:

$$ \mathbf{v}_{i}^{A} = Aggre_{attri}\{{\mathbf{v}_{j}, \forall v_{j} \in {N}_{attri}({v}_{i})})\}, $$
(7)

where Nattri(vi) is the attribute-based neighborhood set of item node vi, which consists of item nodes sharing the same attribute values with item node vi. Aggreattri is the item aggregation function based on attribute information. The aggregation function is specified below:

$$ Aggre_{attri}(\mathbf{v}_{i}) = \mathbf{A}_{i}^{A} [\mathbf{v}_{i,1}, \mathbf{v}_{i,2},...,{\mathbf{v}_{i,|{N}_{attri}({v}_{i})|}}]^{T}, $$
(8)

where \([\mathbf {v}_{i,1}, \mathbf {v}_{i,2},...,{\mathbf {v}_{i,{i,|{N}_{attri}({v}_{i})|}}}]\) is the list of latent representations of item nodes \({v}_{i,1}, {v}_{i,2},...,{{v}_{i,{i,|{N}_{attri}({v}_{i})|}}} \in N_{attri}(v_{i})\). The weight matrix \(\mathbf {A}_{i}^{A}\) shows the importance of each item node in Nattri(vi) for item node vi, which is calculated below:

$$ {A}_{i}^{A} = \frac{\left\|\mathcal{V}_{i,j}^{e^{A}}\right\|_{2}}{\left\|\mathcal{V}_{i}^{e^{A}}\right\|_{2}+\left\|\mathcal{V}_{j}^{e^{A}}\right\|_{2}}\qquad $$
(9)

where \(\mathcal {V}_{i,j}^{e^{A}}= \{a | (v_{i},v_{j}, a) \in e^{A}\) consists of the attribute values a that have hyperedges eA with item node vi and vj simultaneously. \(\mathcal {V}_{i}^{e^{A}}= \{a | (v_{i}, a) \in e^{A}\}\) consists of the attribute values a that have hyperedges eA with item node vi.

4.3.2 Node level updating in hypergraph

Once the three types of latent representations of a given item vi are learned, they are combined together to form the final latent representation vi of item node vi via a standard MLP. The \(\mathbf {v}_{i}^{U}, \mathbf {v}_{i}^{I}\), and \(\mathbf {v}_{i}^{A}\) are concatenated before feeding into MLP. Formally, the latent representation is defined as

$$ \mathbf{v}_{i}= \sigma(\textbf{W} \cdot Concat[\mathbf{v}_{i}^{U}, \mathbf{v}_{i}^{I}, \mathbf{v}_{i}^{A}] + b), $$
(10)

where the activation function σ is specified as the activation function. Then, for each item node vi, we employ a gate graph neural network (GGNN) to iteratively update its representation vi by aggregating the information from other item nodes within the given session. Therefore, the intra-session dependencies are learned and encoded into the item representations. We first aggregate information \(\mathbf {a}_{s,i}^{t}\) from the intra-session-based neighborhoods of vs,i via the connection matrix Ms:

$$ \mathbf{a}_{s,i}^{t} = \mathbf{M}^{s}_{i,:} [\mathbf{v}_{s,1}^{t-1}, \mathbf{v}_{s,2}^{t-1},...,{\mathbf{v}_{s,|s|}^{t-1}}]^{T}\mathbf{H} + \mathbf{b}, $$
(11)

where \(\mathbf {H} \in \mathbb {R}^{d\times d}\) is the weight matrix, b is the bias, \([\mathbf {v}_{s,1}^{t-1}, \mathbf {v}_{s,2}^{t-1},...,{\mathbf {v}_{s,|s|}^{t-1}}]\) is the list of latent representations of item nodes vs,1,vs,2,...,vs,|s| in the intra-session-based neighborhood set of vs,i at (t − 1)th iteration. The connection matrix \(\mathbf {M}^{s} = [\mathbf {M}^{s,I},\mathbf {M}^{s,O}] \in \mathbb {R}^{|s|\times 2|s|}\) is the concatenation of Ms,I and Ms,O, \(\mathbf {M}^{s}_{i,:}\) denotes the ith row of Ms corresponding to vi. The matrices Ms,I and Ms,O indicate the connections between two item nodes within session s, which is specified as below:

$$ \mathbf{{M}}^{s,I}_{i,j} = \frac{|e_{i,j}^{I}|}{|D^{I}({v_{i}})|}\qquad \mathbf{{M}}^{s,O}_{i,j} = \frac{|e_{j,i}^{I}|}{|D^{O}({v_{i}})|}\qquad $$
(12)

where \(e_{i,j}^{I}\) is the set of heyperedges eI, which connect from item node vi to item node vj, DI(vi) and DO(vi) represent the indegree and outdegree of item node vi respectively. In this way, the communications between item nodes within one session indicated by the frequencies of hyperedges are captured. We give an example of a user session s : v1v2v3v4v2v1, specially, in our dataset, we regard the order of the appearance of items in one session as the shopping order in one transaction. The corresponding connection matrix Ms,i in Figure 3.

Fig. 3
figure 3

An example of a directed subgraph based on session s and the corresponding connection matrix \(\mathbf {M}^{s}_{i}\)

We input the aggregated information \(\mathbf {a}_{s,i}^{t}\) within one session and the latent representation \(\mathbf {v}_{s,i}^{{t-1}}\) of vi to update the latent representation \(\widetilde {\mathbf {v}}_{s,i}^{t}\) in the tth iteration by the equations below:

$$ \begin{array}{@{}rcl@{}} \mathbf{z}_{s,i}^{t} &=& \sigma(\mathbf{W}_{z}\mathbf{{a}}_{s,i}^{t} +\mathbf{U}_{z}\mathbf{v}_{s,i}^{t-1}), \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} \mathbf{r}_{s,i}^{t} &=& \sigma(\mathbf{W}_{r}\mathbf{{a}}_{s,i}^{t} +\mathbf{U}_{r}\mathbf{v}_{s,i}^{t-1}), \end{array} $$
(14)
$$ \begin{array}{@{}rcl@{}} \widetilde{\mathbf{v}}_{s,i}^{t} &=& tanh(\mathbf{W}_{o}\mathbf{{a}}_{s,i}^{t} +\mathbf{U}_{o}({r}_{s,i}^{t} \odot {\mathbf{v}_{s,i}^{t}} ) ), \end{array} $$
(15)
$$ \begin{array}{@{}rcl@{}} \mathbf{v}_{s,i}^{t} &=& (1-\mathbf{z}_{s,i}^{t})\odot {\mathbf{v}_{s,i}^{t-1}} + \mathbf{z}_{s,i}^{t}\odot {\widetilde{\mathbf{v}}_{s,i}^{t}}, \end{array} $$
(16)

where \(\mathbf {r}_{s,i}^{t}\) represents the reset gate vector and \(\mathbf {z}_{s,i}^{t}\) represents the update gate vector, which are determined by the aggregated intra-session information \(\mathbf {{a}}_{s,i}^{t}\) in the tth iteration. \(\mathbf {W}_{z},\mathbf {W}_{r}, \mathbf {W}_{o} \in \mathbb {R}^{{2d} \times {d}}\) and \(\mathbf {U}_{z}, \mathbf {U}_{r}, \mathbf {U}_{o} \in \mathbb {R}^{{d} \times {d}}\) are learn-able weight matrices, activation function σ(⋅) is specified as the sigmoid, the ⊙ is the element-wise multiplication operation. The latent representation \(\mathbf {v}_{s,i}^{t}\) of vs,i in the tth iteration is decided by the update gate vector \(\mathbf {z}_{s,i}^{t}\), the reset gate vector \(\mathbf {r}_{s,i}^{t}\), the latent representation \(\mathbf {v}_{s,i}^{t-1}\) of vs,i in the (t − 1)th iteration and aggregated intra-session information \(\mathbf {{a}}_{s,i}^{t}\).

After we obtain the latent representation of each item node, firstly, we devise session aggregation to aggregate the information from historical sessions and current session of user ui, which is specified as:

$$ Aggre^{u}(\mathbf{u}_{i}) = \mathbf{A}_{i}^{u} [\mathbf{v}_{i,1}, \mathbf{v}_{i,2},...,{\mathbf{v}_{i,|{N}({u_{i}})|}}]^{T}, $$
(17)

where N(ui) is the neighborhood set of user node ui, which consists of all item nodes in both historical and current sessions of user ui, \([\mathbf {v}_{i,1}, \mathbf {v}_{i,2},...,{\mathbf {v}_{i,{|{N}({u_{i}})|}}}]\) is the list of latent representations of item nodes \({v}_{i,1}, {v}_{i,2},...,{{v}_{i,{|{N}({u_{i}})|}}} \in {N}({u}_{i})\). The weight matrix \(\mathbf {A}_{i}^{A}\) shows the importance of each item node in N(ui) for user node ui, which is calculated using:

$$ {A}_{i}^{u} = \frac{\left\|\mathcal{V}_{i,j}^{e_{u}}\right\|_{2}}{\left\|\mathcal{V}_{i,j}\right\|_{2}}\qquad $$
(18)

where \(\mathcal {V}_{i,j}^{e_{u}}= \{v_{j}|(v_{i}, v_{j}, u) \in e_{u} \}\) is the number of item nodes vj that have connected with item node vi and user node u with hyperedges eu. Then, we feed the aggregated information \(Aggre^{u}(\mathbf {u}_{i})\) and the initial latent representation ui into a MLP, and obtain the final latent representation \(\mathbf {u}^{Final}_{i}\) of user node ui below:

$$ \mathbf{u}^{Final}_{i} = \sigma(\mathbf{W} \cdot [Aggre^{u}(\mathbf{u}_{i}), \mathbf{u}_{i}]), $$
(19)

where the activation function σ is specified as the sigmoid function, and W is the learnable weight matrix.

4.3.3 Graph learning in hypergraph

Once the final representations \(\mathbf {v}_{1}, \mathbf {v}_{2},..., {\mathbf {v}_{|\mathcal {V}^{I}|}}\) of all items and \(\mathbf {u}^{Final}_{1}, \mathbf {u}^{Final}_{2},...,\\ {\mathbf {u}^{Final}_{|\mathcal {U}|}}\) of all users are learned, we can learn the latent representation for each user session. This task is often referred to as graph pooling, i.e., pooling together the latent representations of all the nodes to learn the latent representation of the entire graph [22]. Specifically, given a user u’s session context C = {vc,1,vc,2,...,vc,|C|}, the representation C of C is obtained below:

$$ \begin{array}{@{}rcl@{}} \textbf{C}_{0} &=& \textbf{u}^{Fianl}, \end{array} $$
(20)
$$ \begin{array}{@{}rcl@{}} \textbf{C}_{t}^{*} &=& LSTM (\textbf{C}_{t-1}), \end{array} $$
(21)

Firstly, we set the latent representation of user u as the initialization vector C0 and then use LSTM to update it iteratively. \(\textbf {C}_{t}^{*} \in \mathbb {R}^{d}\) represent the query vector, which is used to calculate the attention vector ec,i,t for item node vc,i.

$$ \begin{array}{@{}rcl@{}} e_{c,i,t} &=& f(\textbf{v}_{c,i},\textbf{C}_{t}^{*}), \end{array} $$
(22)
$$ \begin{array}{@{}rcl@{}} a_{c,i,t} &=& \frac{exp(e_{c,i,t})}{{\sum}_{c,j} exp(e_{c,j,t})}\qquad \end{array} $$
(23)

The attention vector ec,i,t of item node vc,i is determined by the query vector \(\textbf {C}_{t}^{*}\) and the latent representation of item node vc,i, f is an attention function. Then, we normalize the attention vector ec,i,t to compute the attention probability score ac,i,t.

$$ \begin{array}{@{}rcl@{}} r_{t} &=& \underset{c,i}{\sum} a_{c,i,t}\textbf{v}_{c,i}, \end{array} $$
(24)
$$ \begin{array}{@{}rcl@{}} \textbf{C}_{t} &=& Contact[\mathbf{C}_{t}^{*}, r_{t}], \end{array} $$
(25)

Then, in Equation (24), a weighted sum of the latent representations of all item nodes in the session context is computed based on the attention probability score. Finally, we output Ct as the latent representation for user session context that is used for next-item recommendations.

4.4 Prediction and optimization

After we get the context representation C, we compute the probability \(\hat {y_{i}}\) for target item vi by input C and the latent representation vi of candidate item vi. Firstly, we calculate the inner product of the context representation C and the latent representations vi of the candidate item, and then we employ the softmax function to calculate the probability \(\hat {y_{i}}\) for candidate item vi:

$$ \hat{y_{i}} = softmax((\mathbf{v}_{i})^{T}\mathbf{C}). $$
(26)

For each user session, we employ cross-emtropy as the loss function to learn the parameters of our HL framework and predict the next item. The loss function is specified as:

$$ {Loss}= -\sum\limits_{i=1}^{m}y_{i} log(\hat{y_{i}}) + (1-y_{i}) log(1-\hat{y_{i}}), $$
(27)

for each candidate item vi, yi is the label it, When the candidate item is the true target item, the value of yi is 1, otherwise its 0. Finally, we use the Back-Propagation Through Time (BPTT) algorithm to train the proposed model.

5 Experiments

In this section, we first describe the datasets, evaluation metrics, and comparison methods and parameter settings used in the experiments. Then, we evaluate the recommendation performance of our proposed HL framework by comparing it with other representative methods. Finally, we make detailed analysis of our results.

5.1 Experiment setup

5.1.1 Data preparation

We evaluate the proposed method on two real-world datasets, i.e., Tmall and Dunnhumby:

  • Tmall:Footnote 1 involves more than 130,000 sessions accumulated on Tmall.com/ Taobao.com and the app Alipay (i.e., the Chinese version of Amazon) within six month. This dataset contains user information, session information, item information and item attribute information.

  • Dunnhumby:Footnote 2 contains household level transactions over two years from a group of 2,500 households who are frequent shoppers at a retailer. It provides all of each household’s purchases, and the category of the items.

For fair comparison, following [25], we firstly filter out items appearing less than 5 times and users appearing less than 3 times in both datasets. Then, we generate a session of a given user by putting all the items purchased in one transaction together. Following, we filter out the user sessions containing less than one item since at least two items should be used to build an information context and one item as the target item. Finally, for each user sessions, we split them into three sets: a training set, a test set and a validation set, specially, we randomly select 70% of each user sessions as the training set, 20% of each user sessions as the test set, and the remaining sessions as validation set. The statistics of two datasets are shown in Table 1.

Table 1 Statistics of the experimental datasets

5.1.2 Evaluation metrics

Following metrics are used to evaluate all the compared methods, which are also widely used in other related works [29, 32]:

  • Rec@k (Recall) is widely used as a measure of predictive accuracy. It represents the proportion of correctly recommended items amongst the top-k recommended items. Here we choose k ∈ {10, 20}.

  • Mrr@k (Mean Reciprocal Rank) is the average of reciprocal ranks of the correctly-recommended items. The reciprocal rank is set to 0 when the rank exceeds 20. The Mrr measure considers the ranking order of recommended items, where a large Mrr value indicates that correctly recommended items are ranked in the top positions of the recommendation list. Here we choose k ∈ {10,20}.

  • NDCG@k (Normalized Discounted Cumulative Gain) is a standard ranking metric. In the context of session based recommendation, it is formulated as:

    $$ NDCG@k =\frac{DCG@k}{IDCG@k} $$

    where DCG@k is the discounted cumulative gain and IDCG@k is the ideal discounted cumulative gain. They can be computed as below:

    $$ DCG@k =\sum\limits_{i=1}^{k} \frac{rel_{i}}{log_{2}(i+1)}\qquad IDCG@k =\sum\limits_{i=1}^{k} \frac{1}{log_{2}(i+1)}\qquad $$

    where reli is the graded relevance of the result at position k, reli ∈ {0, 1}. Here we choose k ∈ {5, 20}.

5.1.3 Comparison methods

In our experiments, we compare the performance of HL with the following ten baselines, which are divided into three classes of SBRSs: (1) conventional SBRSs, including POP, BPR-MF, FPMC, (2) deep learning based SBRSs, including iGRU4Rec-BPR, STAMP, NextItNet, ATEM, SR-GNN, and (3) the state-of-the-art SBRSs, HMGL.

  • POP is a simple baseline that always recommends the most popular item in the training dataset.

  • Item-KNN recommends items similar to the previously purchased items in the session, where similarity is defined as the cosine similarity between the vector of sessions [19].

  • BPR-MF is a learning method based on stochastic gradient descent with bootstrap sampling, and is optimized for personalized ranking using a Bayesian analysis [15].

  • FPMC is a Markov chains and factorization based method for next item recommendations, which factorizes the user-item transition matrix between two adjacent items within one session to learn both the general taste of a user and transitions over items [16].

  • iGRU4Rec-BPR is a RNN-based approach for SBRSs, which employ gated recurrent units (GRUs) to model the rigid order dependencies between items within one session, and introduce a novel ranking loss function to improve the performance [5].

  • STAMP is an attention-based approach for next item recommendations, which employs attention layer to capture the users’ general interest from the previous click and employs a self-attention to capture the current interest from the last click [14].

  • ATEM is an attention based transaction embedding model for session context embedding to weight each observed item in a transaction without assuming order [25].

  • NextItNet is a CNN-based approach for next item recommendations, which employs a convolutional generative network to learn the high-level representation from both short- and long-rang item dependencies [33].

  • SR-GNN is a GNN-based approach for next item recommendations, which employs a gated graph neural networks to capture the complex dependencies within one session, and then generate the latent representations by attention mechanism [20].

  • HMGL is a GNN-based SBRSs, firstly constructs a heterogeneous graph to integrate session information and item attribute information, and then learns a unified representation for each item by simultaneously modelling the local and global dependencies for next-item recommendations [29].

5.1.4 Parameter settings

For a fair comparison, we initialize all the compared approaches with parameter settings in their papers and optimize them for best performance. For our HL framework, all the parameters are initialized using a Gaussian distribution with a mean of 0 and a standard deviation of 0.1. The sizes of the latent representations of item nodes and user nodes are set to 128. We use the Adam optimizer with the initial learning rate 0.001. The batch size for mini-batch optimization is set to 128, and we run 30 epoches to train our HL model for best performance.

5.2 Evaluation of the recommendation accuracy

In this section, we conduct the extensive experiments on two dataset, which aim to answer the following research questions: how does our proposed HL framework perform compared with the state-of-the-art SBRSs?

We compare the performance of our HL model with the nine representative approaches in terms of Rec and Mrr and show the results in Table 2. As for the result of conventional SBRSs, the first two methods PoP and Item-KNN, due to the simple intuition based on the frequency of appearance, the performances is relatively poor. The performance of BPR-MF is better than PoP and Item-KNN, because it factorizes the user-item transition matrix. However, the matrix are extremely sparse and easily suffer from data sparse problem. For instance, in Tmall dataset, each session only contains an average of 4.98 items of over 24,766 ones, leading to the case that each row of the constructed matrix contains less than five items and all other entries are empty. Compared with BPR-MF, FPMC is a Markov Chains and factorization machines based methods, which can takes the advantages of them, thus the performance improved. In a word, both BPR-MF and FPMC predict the target item based on its previous one only while ignoring all other items bought in the same session, which lost some information.

Table 2 Rec and Mrr values of all compared methods on two datasets

As for the deep learning based SBRSs, including iGRU4Rec, STAMP, ATEM, NextItNet, SR-GNN, they achieved better performance than conventional SBRSs. For example, iGRU4Rec employs a GRUs, which can capture the rigid order dependencies between items within one session. Thus it achieved better performance. STAMP employs attention mechanism to attentively learn the long-term dependencies within sessions and thus performs better. ATEM also employed an attention mechanism to build an attentive context embedding over the session context without any order assumption, which can capture more flexible and complex dependencies within each session. Thus, it achieved better performance than STAMP. NextItNet does not perform well on Tmall dataset but perform well on Dunnhumby dataset, this maybe caused by the different data characteristics, e.g., the average session length. SR-GNN employs a graph neural network, which can capture the complex transitions among items within one session and thus the performance have improved. However, it captures the intra-session dependencies from the limited session information only and fails to explicitly capture the inter-session dependencies by utilizing available side information, leading to information loss.

As for the performance of the state-of-the-art methods HMGL in SBRSs, which can capture both local dependencies (i.e., the dependencies between items within a session) and global dependencies (i.e., the dependencies across multiple sessions). Thus, the performance of HMGL is better than the methods above. However, this work ignores the significant role of users in connecting the different but relevant sessions as well as items. In contrast, our HL framework can explicitly capturing both the intra-session dependencies and the inter-session dependencies via effectively utilizing both session information and side information. we get the best performance of our HL framework on two datasets. As show in Table 2, our proposed HL framework outperforms all the other baseline methods. The improvements of HL over the best-performing method HMGL, range from 1.53% to 3.04% with an average of 2.25% in terms of Rec, and range from 2.03% to 4.13% with an average of 3.24% in terms of Mrr. The NDCG (cf. Figure 4) on two datasets also shows HL leads all the baseline methods.

Fig. 4
figure 4

Our proposed HL achieves the highest NDCG values on both datasets

The simplest method PoP recommends the next item solely based on the occurrence frequency of items, which is problematic in session-based recommendation scenarios. As for Item-KNN, they recommend similar items with previously purchased items in the session are recommended, which lead to the recommendation of the same item repeatedly, thereby reducing the recommendation accuracy. When the conventional SBRSs such as BPR-MF and FPMC are considered, they learn the probabilities to capture first-order dependencies, and can not capture the high-order dependencies. As for the deep-learning based SBRSs, including iGRU4Rec-BPR, STAMP, NextItNet, SR-GNN, HMGL owing to the powerful of deep neural networks, these methods outperform the conventional methods. A RNN-based approach iGRU4Rec-BPR employs GRUs to capture the long-term dependencies in each session, while STAMP improves the recommendation accuracy with an attention layer. ATEM achieved better performance due to the non-order assumption. NextItNet is a CNN-based approach for next item recommendations, which employs a convolutional generative network to learn the high-level representation from both short- and long-range item dependencies. Those methods usually model the rigid order within one session, while SR-GNN further capture the item dependencies within one session with graph neural networks. HMGL incorporates the session information with item attribute information to further improve the recommendation accuracy. However, it neglects the significant role of users. Our framework HL incorporates the session information and item attribute information and user information together, and captures the intra and inter-session dependencies to further improve the recommendation accuracy.

6 Conclusions

In this paper, we have proposed a hypergraph learning (HL) framework for next-item recommendations. We introduced side information including item attribute information and user information simultaneously to enrich the information for next-item recommendations. We have constructed a hypergraph to connect the items, item attribute values and users to effectively integrate both the session information and side information together in preparation for leaning informative item representations. Then, we have built a hypergraph leaning model to learn the representation of each item while modeling both the intra- and inter-session dependencies from the constructed hypergraph for more accurate next-item recommendations.

Current works mainly focus on single behaviours (e.g., purchasing), however, in real situations, there are usually multi-behaviour (e.g., clicking, adding to cart and purchasing) in each session. Hence, in order to capture item dependencies among multi-behaviours and further recommend the next item and the corresponding behaviours associated with it, we plan to extend the HL framework to learn more complex dependencies between items embedded in sessions consisting of other types of behaviours, e.g., clicking, adding to cart. Consequently, more informative item representations can be learned for recommending next item and its associated behaviours accurately.