1 Introduction

Recommender systems aim to predict the items a user might be interested in based on her previous interactions. They have recently become very important, especially in e-commerce, due to the availability of a large pool of items a user can select from, and they play a crucial role for both consumers and business owners. Traditional recommender systems employ Matrix Factorization (MF) [1, 2] methods to learn a low-rank user representation from the ratings of previous interactions. By using this representation, the recommender system then predicts the ratings of other items the user may be interested in.

However, on most online platforms, users do not explicitly rate items. Rather, implicit feedback, such as clicks, must be relied on for recommendation. Moreover, since user interests are dynamic, traditional MF methods cannot capture such changes. To utilize implicit feedback and capture user interest drifts, researchers have focused on sequential recommender systems. A particular case of sequential recommendation, called session-based recommendation, has gained a lot of attention recently. In session-based recommendation, sessions cannot be linked to a particular user, which may be warranted due to privacy concerns.

To effectively recommend relevant items in session-based recommendation, three important criteria have to be considered: the short-term preference, the long-term preference and session co-occurrence patterns. Consider the example in Fig. 1. In Session 1, the user watched the “Fast and Furious” movies in serial order from the fifth movie to the seventh. Based on this watch history, it would be a good idea to recommend the next (eighth) movie. In this instance, the last watched (clicked) item is important for recommendation, and it is captured by the short-term preference. In Session 2, however, the watch history includes both animation and action movies. The user may not really be interested in the animation movies but watched them with her kids. In this case, recommending only another animation movie may not be the best decision; recommending both action and animation movies is better. Here the short-term preference is insufficient, but the long-term preference addresses this issue by capturing the overall session interest. In Session 3, the user is obviously interested in action movies; however, using the knowledge from Session 1 that users watch “movie 4” after “movie 3”, it may be relevant to recommend “movie 4”. The session co-occurrence pattern captures such inter-session interactions for improved recommendation.

Fig. 1 A toy example of the relevance of the short-term preference, the long-term preference and session co-occurrence patterns in session-based recommendation

The short-term preference is represented by the most recent interactions. Markov chain (MC) models have been shown to be successful on this task [3, 4]. FPMC [5] assumes independence between interactions and models a first-order MC for sequential recommendation. Older interactions are also important to fully understand a session's long-term preference due to drift in user interest. Here MC models fail because of the independence assumption and the poor scalability of higher-order MC models. Recurrent Neural Network (RNN) models are a great alternative to MC models for modeling longer sequences and have become the state of the art in session-based recommendation [6,7,8,9]. NARM [10], for example, models both the user's short-term and long-term preferences using a GRU, with the last hidden state as the short-term user preference. An attention mechanism is then used to learn the user's long-term preference.

Since sessions cannot be tied to particular users, item co-occurrence patterns can elicit behavioral patterns across different sessions. Most existing session-based recommendation models consider only the current session for recommendation. However, user behavior can be influenced by others; as the old adage goes, “birds of a feather flock together”. Studies have shown that recommender systems are subject to conformity bias [11, 12]: users are influenced by the actions of others. Researchers [13, 14] have leveraged this trend to improve recommendation. However, in session-based recommendation, user information or social information is not readily available.

To this end, we propose IC-GAR (Item Co-occurrence Graph Augmented Session-based Recommendation), a model that efficiently combines the three important criteria of the session-based recommendation problem. To model the short-term and long-term user preferences, we use a GRU. The last hidden state of the GRU represents the user's short-term interest. We then use an attention mechanism over all hidden states of the GRU to capture the user's long-term interest. To model item co-occurrence, we first construct a weighted undirected graph containing all the training sessions. Each weighted edge of the graph represents the frequency of transitions between two items. By using a variant of the Graph Convolutional Network (GCN) [15], we learn an item representation that is aware of the various transition patterns between that item and all the other items. We then aggregate the item representations from the GCN for each session to learn the session co-occurrence representation. Given the short-term, long-term and session co-occurrence representations, we employ a trilinear decomposition to recommend the most relevant items.

In summary, our main contributions are as follows:

  • We propose a model that considers three factors, i.e., the short-term, long-term and session co-occurrence representations, for session-based recommendation.

  • A novel IC-GAR model is proposed that accounts for user interest dynamics and item co-occurrence patterns in an end-to-end neural network.

  • A graph representation is proposed to learn the session co-occurrence representation in all training sessions.

  • We conduct extensive experiments on three datasets to demonstrate the effectiveness of IC-GAR. The proposed IC-GAR model significantly outperforms the state-of-the-art models in terms of MRR and Precision.

The rest of the paper is organized as follows: Sect. 2 presents related works, while Sect. 3 gives a detailed description of the IC-GAR model. The experimental results are in Sect. 4, the discussion is in Sect. 5 and the conclusion is in Sect. 6.

2 Related works

Recommender systems have evolved over time, with two main branches emerging: general recommendation and sequential recommendation. General recommender systems do not consider the temporal nature of user interest, while sequential recommender systems are built with the dynamic nature of user interest in mind. General recommender systems can be categorized into collaborative filtering, content-based and hybrid methods [16, 17]. Collaborative filtering generates recommendations for a user by exploring the preferences of other related users. Content-based methods generate recommendations by exploring the similarity between items previously consumed by the user. Hybrid methods combine the benefits of both collaborative filtering and content-based methods. Recently, fuzzy tools have been developed for improving general recommendation [18].

Session-based recommendation is a special type of sequential recommendation where user information is not available and sessions are short. This section presents the literature most relevant to our work. Related works on session-based recommendation are presented in Sect. 2.1, while graph-based recommendation systems are discussed in Sect. 2.2.

2.1 Session-based recommendation

Session-based recommendation is a sub-task of recommender systems in which, given the historical sequential interactions, the next item is predicted. It additionally assumes that sessions cannot be tied to a particular user (anonymous user sessions). Traditional collaborative filtering models cannot be used for session-based recommendation because they do not consider the sequence of interactions. Hence, MC-based models have been used extensively [3, 5, 19, 20]. These models predict the next action in a sequence using the last action (or last few actions) and assume that actions in a sequence are independent. Zimdars et al. [19] used an MC to extract sequential patterns for session-based recommendation. Shani et al. [3] improved the maximum likelihood estimate of MC transition graphs by using heuristic methods for sequential recommendation. FPMC [5] generalized MC and MF to learn sequential patterns and long-term user preference. However, MC-based models suffer from the independence assumption and an unmanageable state space when considering long sequences.

RNNs address these limitations of MC-based models. RNNs can efficiently learn longer sequences and have recently shown superior performance in tasks such as machine translation [21, 22], image captioning [23, 24] and conversation systems [25]. RNNs have also shown superior performance in sequential recommendation tasks such as next-location [26, 27], next-click [28,29,30] and next-basket [31, 32] recommendation. Hidasi et al. [6] were the first to propose using an RNN for session-based recommendation, with parallel mini-batches and pair-wise ranking. Tan et al. [33] improved the model by using data augmentation, privileged information and a point-wise ranking loss. These models and others [34,35,36] consider only the last hidden state (local user preference) of the RNN for recommendation. To better model a user's dynamic interest, NARM was proposed to learn both the global and the local user preference [10]. Other models [37,38,39] have leveraged these two preferences and achieved improved performance. STAMP [37] uses a memory network for the local and global user preferences with a trilinear decomposition. LSAMN [40] proposed a hierarchical attention mechanism to balance global user interest and sequential behavior. HLN [39] introduces a hierarchical leap network to skip preference-unrelated items. In addition to the global and local user preferences, CSRM [9] uses a memory network to incorporate neighborhood sessions.

2.2 Graph neural network based recommendation systems

Graph neural networks (GNNs) are deep learning methods for graph-structured data [41]. They learn powerful representations by using a message-passing technique between the nodes [15]. The main technique of GNNs is to iteratively aggregate the features of the neighboring nodes with the features of the current node into a powerful node representation. GNNs have achieved great success in tasks such as node classification [42,43,44], protein structure modeling [45, 46] and physical systems [47, 48]. Naturally, the recommendation task can be represented as a bipartite graph of user-item interactions, and several GNN models have been proposed on bipartite graphs [49,50,51,52,53]. Berg et al. [49] used a graph auto-encoder to learn the node embeddings of a user-item bipartite graph. Ying et al. [50] improved the scalability of GNNs in recommender systems by using random walks for feature aggregation. IG-MC [52] constructs a one-hop subgraph for each user-item pair to learn an inductive matrix completion method. Other models combine the user-item bipartite graph with additional side-information graphs, such as social networks [54,55,56] and knowledge graphs [57,58,59]. Wu et al. [55] captured the heterogeneous information from the social and user-item graphs to model social influence in recommendation. KGAT [59] learns higher-order relationships in a collaborative knowledge graph. Recently, GNNs have been applied to sequence data for recommendation [60,61,62,63,64]. SR-GNN [60] employs a gated graph neural network and an attention mechanism with a bilinear decoder for recommendation. A-PGNN [62] is a personalized recommendation model that captures complex item transitions in a user-specific fashion. DHCN [65] replaces the directed graph used in SR-GNN with a hypergraph and proposes self-supervised learning for improved performance. GCE-GNN [66] uses epsilon neighbors to augment the long-term user preference of SR-GNN, while neglecting the short-term user preference. GAG [67] considers dynamic rather than static sessions and proposes a GNN with a Wasserstein reservoir for streaming session-based recommendation.

In this paper, our proposed model differs from existing models in three respects: (1) we augment an RNN-based session-based recommendation model with an item co-occurrence graph, which has not been considered before; (2) unlike existing GNN session-based recommendation models that construct two graphs (incoming and outgoing) for each session, we construct one undirected graph over all sessions in the training set; (3) we consider three sources of information, i.e., the global preference, the local preference and item co-occurrence patterns, for recommendation.

3 IC-GAR model

In this section, we present a detailed description of the proposed IC-GAR model. First, we give an overview of the model in Sect. 3.1. We then present the details of each of the three modules of IC-GAR, the Encoder Module, the Session Co-occurrence Module and the Prediction Module, in Sects. 3.2, 3.3 and 3.4, respectively. For convenience, Table 1 shows the meaning of the symbols used in the paper.

Table 1 Meaning of symbols

3.1 Overview of IC-GAR

Let \(V=\{{v}_{1},{v}_{2}, \dots ,{v}_{n}\}\) be the set of all items and \(s=[{v}_{s,1},{v}_{s,2},\dots ,{v}_{s,m-1}]\) be the ordered list of items clicked in session \(s\), where \({v}_{s,i}\in V\). Session-based recommendation aims to predict the next item \({v}_{s,m}\) that will be clicked in session \(s\). The output of IC-GAR is a vector of probability scores \(\widehat{{\varvec{y}}}\) over all candidate items, and the top-k items with the highest probabilities are recommended as the potential next clicks.

The IC-GAR model is composed of three modules, the Encoder, Session Co-occurrence and Prediction Modules, as shown in Fig. 2. The Encoder Module learns both the local and the global preference. To model the local preference, we use the last hidden state of a GRU. For the global preference, we apply an attention mechanism over all hidden states of the GRU. The key contribution of the IC-GAR model is the session co-occurrence representation. To model session co-occurrence, we first construct a weighted undirected graph of the transitions over all items in the training set, where each weighted edge represents the frequency of transitions between two items. The graph is of order \({V}^{2}\), which slows computation for a large item catalog. By using a variant of GCN, we learn a lower-dimensional representation of the item co-occurrence graph of order \(Vd\), where \(d\ll V\). This learned representation incorporates the higher-order transition patterns between the items. By using a permutation-invariant aggregation function, we form the session co-occurrence representation from the representations of all the items in the current session. The session co-occurrence thus captures nonlinear transition patterns between the current session and all the items in the training set. Note that we only use the sessions in the training set to construct the graph. The Prediction Module makes inferences by first learning the final session representation using an efficient trilinear decomposition. Finally, the probability of each candidate next item is computed by multiplying the final session representation with the candidate item embedding and applying a softmax. A detailed description of each component of the model is presented next.

Fig. 2 Overall schematic architecture of the IC-GAR model. For each session, the encoder module learns the global and the local preference, \({\mathbf{s}}_{g}\) and \({\mathbf{s}}_{l}\). The session co-occurrence module first learns item representations using a GCN; for the current session, the corresponding embeddings are aggregated into the session co-occurrence representation \({\mathbf{s}}_{s}\). The prediction module decomposes the three representations and predicts the ranking probabilities

3.2 Encoder module

The encoder module learns both the global and the local session preference. The local preference represents the current interest of the session, while the global preference represents the changes in interest over the current session. To learn these preferences, we use a GRU, which has been shown in [33] to outperform LSTM [68] and the vanilla RNN in session-based recommendation. The GRU mitigates the vanishing gradient problem of the vanilla RNN by using reset and update gates. The hidden state \({\mathbf{h}}_{t}\) of the GRU is a linear interpolation between the previous hidden state \({\mathbf{h}}_{t-1}\) and a candidate state \({\widehat{\mathbf{h}}}_{t}\). It is given by:

$${\mathbf{h}}_{t} = \left( {1 - z_{t} } \right){\mathbf{h}}_{t - 1} + z_{t} {\hat{\mathbf{h}}}_{t} ,$$
(1)

where \({z}_{t}\) is the update gate and is computed as:

$$z_{t} = \sigma \left( {{\mathbf{W}}_{z} {\mathbf{x}}_{t} + {\mathbf{U}}_{z} {\mathbf{h}}_{t - 1} } \right),$$
(2)

while \({\mathbf{x}}_{t}\) is the input at timestamp \(t\). The candidate state can be computed as:

$${\hat{\mathbf{h}}}_{t} = {\text{tanh}}\left( {{\mathbf{Wx}}_{t} + {\mathbf{U}}\left( {r_{t} \odot {\mathbf{h}}_{t - 1} } \right)} \right),$$
(3)

where \({r}_{t}\) is the reset gate and is computed as:

$$r_{t} = \sigma \left( {{\mathbf{W}}_{r} {\mathbf{x}}_{t} + {\mathbf{U}}_{r} {\mathbf{h}}_{t - 1} } \right).$$
(4)

\({\mathbf{W}}_{z},{{\mathbf{U}}_{z},\mathbf{W}}_{r},{\mathbf{U}}_{r},\mathbf{W}, \mathrm{and}\, \mathbf{U}\) are weight matrices of the update gate, reset gate and candidate state, respectively. The final hidden state \({\mathbf{h}}_{n}\) of the GRU represents the current interest of the session. Hence, we represent the local preference, \({\mathbf{s}}_{l}\) as:

$${\mathbf{s}}_{l} = {\mathbf{h}}_{n} .$$
(5)
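
To make the encoder concrete, the following is a minimal NumPy sketch of Eqs. (1)–(5). It is an illustration under our own naming, not the authors' TensorFlow implementation; biases are omitted to match the equations above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step, Eqs. (1)-(4); p holds W_z, U_z, W_r, U_r, W, U."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)          # update gate, Eq. (2)
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)          # reset gate, Eq. (4)
    h_cand = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev))   # candidate state, Eq. (3)
    return (1.0 - z_t) * h_prev + z_t * h_cand                 # hidden state, Eq. (1)

def encode_session(item_embs, p, d):
    """Run the GRU over a session's item embeddings (in click order).

    Returns all hidden states (used by the attention below) and the
    last one, which is the local preference s_l of Eq. (5).
    """
    h, states = np.zeros(d), []
    for x_t in item_embs:
        h = gru_step(x_t, h, p)
        states.append(h)
    return np.stack(states), h   # (n, d) hidden states, s_l
```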

The global preference aims to capture the changes in interest over the current session. However, some item clicks in the session may not truly reflect the user's interest, or may not contribute to the current interest. To effectively model the user's dynamic interest over the session, we use an attention mechanism conditioned on the last clicked item. The global preference of each session, \({\mathbf{s}}_{g}\), is computed by:

$${\mathbf{s}}_{g} = \mathop \sum \limits_{i = 1}^{n} \alpha_{i} {\mathbf{h}}_{i} .$$
(6)

\({\mathbf{h}}_{i}\) is the hidden state of the GRU at timestamp \(i\) and \({\alpha }_{i}\) is the attention weight at \(i,\) computed as:

$$\alpha_{i} = {\mathbf{q}}^{{\text{T}}} \sigma \left( {{\mathbf{W}}_{1} {\mathbf{h}}_{n} + {\mathbf{W}}_{2} {\mathbf{h}}_{i} + b} \right).$$
(7)

\({\mathbf{W}}_{1},{\mathbf{W}}_{2}\, \mathrm{and}\, \mathbf{q}\) are learnable parameters that control the attention weights, and \(\sigma\) is the sigmoid activation function, \(\sigma \left(x\right)=1/\left(1+\mathrm{exp}\left(-x\right)\right)\). The encoder module thus converts the current session into two representations, \({\mathbf{s}}_{l}\) and \({\mathbf{s}}_{g}\), the local and the global session preference, respectively.
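
A matching NumPy sketch of Eqs. (6)–(7) is shown below. `hidden_states` is the output of the GRU sketch above; as in the text, the attention is conditioned on the last hidden state, and no softmax normalization is applied, following Eq. (7). The bias shape is our assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_preference(hidden_states, W1, W2, q, b):
    """Attention over all GRU states, Eqs. (6)-(7).

    hidden_states: (n, d) array of h_1..h_n; W1, W2: (d, d); q: (d,);
    b: scalar (or (d,)) bias. Returns the global preference s_g.
    """
    h_last = hidden_states[-1]                    # state the attention conditions on
    # row i of the sigmoid argument is (W1 h_last + W2 h_i + b)^T
    alpha = sigmoid(h_last @ W1.T + hidden_states @ W2.T + b) @ q   # Eq. (7)
    return alpha @ hidden_states                  # Eq. (6): sum_i alpha_i h_i
```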

3.3 Session co-occurrence module

The session co-occurrence module learns the transition patterns between each item in the current session and all the other items in the training sessions. It improves the recommendation performance by injecting similarity in transition patterns. The module is composed of three stages: (1) item co-occurrence graph construction; (2) learning a lower-dimensional representation; (3) aggregation of the learned item co-occurrence embeddings. We discuss each stage in turn.

3.3.1 Item co-occurrence graph construction

We construct a weighted undirected graph to represent the item co-occurrence patterns. The graph \(\mathcal{G}= \left(\mathcal{V},\mathcal{E}\right)\) is constructed such that each node corresponds to an item \(v\in \mathcal{V}\). A weighted undirected edge \({(v}_{i-1},{v}_{i})\in \mathcal{E}\) exists if \({v}_{i-1}\) is clicked before or after \({v}_{i}\), and the weights indicate the frequency of transitions between each pair of items in the training set. A weighted undirected adjacency matrix \(\mathbf{A}\) is then obtained for the graph. We embed each item \(v\in \mathcal{V}\) into a unified embedding space and then use a GCN to learn the higher-order transitions between the items.
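
As an illustration, the sketch below builds \(\mathbf{A}\) from raw training sessions. The exact counting convention (here, each consecutive click adds one to both symmetric entries) is our assumption; the text only states that weights count transition frequencies in either direction.

```python
import numpy as np

def build_cooccurrence_adjacency(train_sessions, n_items):
    """Weighted undirected adjacency A of Sect. 3.3.1.

    An edge (v_{i-1}, v_i) is created whenever the two items are
    clicked consecutively in a training session; its weight counts the
    frequency of that transition in either direction. Items are
    assumed re-indexed to 0..n_items-1.
    """
    A = np.zeros((n_items, n_items), dtype=np.float32)
    for session in train_sessions:
        for prev, curr in zip(session, session[1:]):
            if prev != curr:           # self-loops come from A + I later
                A[prev, curr] += 1.0   # symmetric update: the graph
                A[curr, prev] += 1.0   # is undirected
    return A

# Toy example: three sessions over a catalogue of 5 items
A = build_cooccurrence_adjacency([[0, 1, 2], [1, 2, 3], [2, 1, 4]], n_items=5)
```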

3.3.2 Higher-order transition representation learning

GCN is a graph neural network based on the message-passing technique. The GCN model proposed in [15] updates the representation at each layer by message construction and aggregation. The update at layer \(l\) is given by:

$${\mathbf{M}}_{l} = {\text{ReLU}}\left( {\tilde{\mathbf{D}}}^{ - 1/2} {\tilde{\mathbf{A}}}{\tilde{\mathbf{D}}}^{ - 1/2} {\mathbf{M}}_{l - 1} {\mathbf{W}}_{l - 1} \right)$$
(8)

where \({\tilde{\mathbf{A}}}={\mathbf{A}}+{\mathbf{I}}\), \({\tilde{\mathbf{D}}}\) is the diagonal degree matrix of \({\tilde{\mathbf{A}}}\), \({\mathbf{W}}_{l-1}\) is the weight matrix at layer \(l-1\) and \({\mathbf{M}}_{l-1}\) is the representation at layer \(l-1\). \({\mathbf{M}}_{0}\) is given by:

$${\mathbf{M}}_{0} = {\mathbf{X}}$$
(9)

where \(\mathbf{X}\) is the initial embedding of the items; in our case it is given by \(\mathbf{V}\).

However, the GCN model in [15] was proposed for the node classification task. To make it suitable for our task, we make the following modifications. First, our update at layer \(l\) is given by:

$${\mathbf{M}}_{l} = {\text{LeakyReLU}}\left( {{\tilde{\mathbf{D}}}^{ - 1} {\tilde{\mathbf{A}}\mathbf{M}}_{l - 1} {\mathbf{W}}_{l - 1} } \right)$$
(10)

We found that these modifications improve the stability of the model. Second, similar to [51], the final embedding is obtained by concatenating the output of each layer. Although [43] proposed alternatives such as max pooling and sum pooling, concatenation can outperform these methods. The final embedding \({\mathbf{m}}^{*}\) of each item in the co-occurrence graph is thus given by

$${\mathbf{m}}^{*} = {\mathbf{m}}^{0} \left\| \ldots \right\|{\mathbf{m}}^{l}$$
(11)

where \(\parallel\) is the concatenation operation. The number of propagation layers controls the range of the propagation and enriches the final embedding. We also employ node and message dropout in the propagation layers to improve robustness: node dropout drops nodes with probability \({p}_{1}\), while message dropout drops connections between nodes with probability \({p}_{2}\).
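
The sketch below illustrates the modified propagation rule of Eq. (10) and the layer concatenation of Eq. (11) in NumPy. The LeakyReLU slope and the exact placement of the node/message dropout are our assumptions; the paper only states the dropout rates (Sect. 4.4).

```python
import numpy as np

def gcn_item_embeddings(A, X, layer_weights, rng=None, p_node=0.4, p_msg=0.2):
    """Higher-order transition learning, Eqs. (10)-(11).

    A: (n, n) weighted undirected adjacency; X: (n, d) initial item
    embeddings M_0; layer_weights: list of (d, d) matrices W_0..W_{L-1}.
    Pass an np.random.Generator as rng to enable node/message dropout
    during training; leave it None at inference.
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n, dtype=A.dtype)      # A~ = A + I
    d_inv = 1.0 / A_tilde.sum(axis=1)           # D~^{-1}; degrees >= 1 thanks to I
    P = d_inv[:, None] * A_tilde                # D~^{-1} A~
    M, outputs = X, [X]                         # m^0
    for W in layer_weights:
        P_l = P
        if rng is not None:
            P_l = P_l * (rng.random(n) >= p_node)[None, :]   # node dropout, prob p_1
            P_l = P_l * (rng.random(P.shape) >= p_msg)       # message dropout, prob p_2
        Z = P_l @ M @ W
        M = np.where(Z > 0.0, Z, 0.2 * Z)       # LeakyReLU (slope 0.2 assumed)
        outputs.append(M)                       # m^l
    return np.concatenate(outputs, axis=1)      # Eq. (11): m* = m^0 || ... || m^L
```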

3.3.3 Aggregation

To obtain the session co-occurrence representation, we aggregate the individual item embeddings of each session. Assuming a session is given by \(s= [{v}_{1}, {v}_{2}, {v}_{3}]\), we obtain the session co-occurrence representation as:

$${\mathbf{s}}_{s} = \mathop \sum \limits_{i = 1}^{3} {\mathbf{m}}_{i}^{*}$$
(12)

where \({\mathbf{m}}_{i}^{*}\) is the final representation of item \({v}_{i}\in s\) from the GCN.
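
In code, the sum aggregation of Eq. (12) is a single reduction over the rows of the GCN output; any permutation-invariant alternative (the mean and max pooling compared in Sect. 4.7.2) would be a drop-in replacement here.

```python
def session_cooccurrence(m_star, session_items):
    """Eq. (12): sum the GCN embeddings m* of the items in the current
    session. The sum is permutation-invariant, so the click order
    within the session does not change s_s."""
    return m_star[session_items].sum(axis=0)
```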

3.4 Prediction module

The final prediction consists of two stages: first, obtaining the final session representation from the local preference, the global preference and the session co-occurrence; second, obtaining the probabilities of all the candidate items for recommendation. To efficiently obtain the final session representation, we employ a trilinear decomposition given by \(\langle a,b,c\rangle = a \odot \left(b\oplus c\right)\). Specifically, the final session embedding \({\mathbf{s}}_{f}\) is given by:

$${\mathbf{s}}_{f} = {\mathbf{s}}_{l} \odot \left( {{\mathbf{s}}_{g} \oplus {\mathbf{s}}_{s} } \right)$$
(13)

where \(\odot\) denotes the Hadamard product and \(\oplus\) denotes the element-wise addition. The two representations \({\mathbf{s}}_{g} \,\mathrm{and} \,{\mathbf{s}}_{s}\) are conditioned on \({\mathbf{s}}_{l}\) to amplify the current user interest for recommendation.

With the embedding \({\mathbf{s}}_{f}\) of each session obtained, the score \(\widehat{\mathbf{z}}\) of the candidate items can be computed as:

$${\hat{\mathbf{z}}} = {\mathbf{s}}_{f}^T {\mathbf{v}}$$
(14)

where \(\mathbf{v}\) is the initial embedding of all the candidate items. A softmax function is then applied to obtain the output probabilities \(\widehat{\mathbf{y}}\) of the candidate items:

$${\hat{\mathbf{y}}} = {\text{softmax}}\left( {{\hat{\mathbf{z}}}} \right)$$
(15)

For each session, we use cross-entropy loss function between the predicted click and the ground truth. The cross-entropy loss function is defined as:

$${\mathcal{L}}\left( {{\hat{\mathbf{y}}}} \right) = - \mathop \sum \limits_{i = 1}^{n} \left[ {\mathbf{y}}_{i} {\text{log}}\left( {{\hat{\mathbf{y}}}_{i} } \right) + \left( {1 - {\mathbf{y}}_{i} } \right){\text{log}}\left( {1 - {\hat{\mathbf{y}}}_{i} } \right) \right]$$
(16)

where \(\mathbf{y}\) is the ground truth represented by one-hot encoding. We use Back-Propagation Through Time (BPTT) to train the IC-GAR model. Similar to [10, 33], we truncate the back-propagation at 19 timestamps.
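
Putting the three representations together, the sketch below follows Eqs. (13)–(16). One detail the paper leaves implicit: after the layer concatenation of Eq. (11), \({\mathbf{s}}_{s}\) has a larger dimension than \({\mathbf{s}}_{l}\) and \({\mathbf{s}}_{g}\), so we assume it has been projected back to \(d\) dimensions (e.g., by a learned linear map) before the decomposition; that projection is our assumption.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(s_l, s_g, s_s, V):
    """Eqs. (13)-(15). s_l, s_g, s_s: (d,) vectors (s_s assumed already
    projected to d dims); V: (n, d) initial candidate item embeddings."""
    s_f = s_l * (s_g + s_s)     # Eq. (13): Hadamard product with the element-wise sum
    z_hat = V @ s_f             # Eq. (14): one score per candidate item
    return softmax(z_hat)       # Eq. (15): recommendation probabilities y_hat

def cross_entropy(y_hat, target_idx, eps=1e-12):
    """Eq. (16), with the one-hot ground truth given by its index."""
    y = np.zeros_like(y_hat)
    y[target_idx] = 1.0
    return -np.sum(y * np.log(y_hat + eps) + (1.0 - y) * np.log(1.0 - y_hat + eps))
```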

4 Experimental results and performance analysis

In this section, we first describe the datasets, the state-of-the-art baseline models and the evaluation metrics used for performance evaluation. We then answer the following research questions:

  • RQ1 Does the proposed IC-GAR model achieve the state-of-the-art performance?

  • RQ2 What is the effect of the item co-occurrence graph on the performance of IC-GAR?

  • RQ3 How well does IC-GAR perform with different embedding sizes, aggregation methods and graph types?

4.1 Dataset

To evaluate the performance of the IC-GAR model, we used two popular transactional datasets, namely RetailRocket and Yoochoose. The RetailRocket dataset contains six months of personalized transactions from an e-commerce site. Yoochoose was published in the RecSys Challenge 2015 and consists of click streams from an e-commerce site. Similar to [10, 33, 37, 60], we use the most recent 1/64 and 1/4 fractions of the Yoochoose dataset in our evaluations.

In order to filter noisy data, we remove sessions with fewer than 2 items and items appearing fewer than 5 times in both datasets. After filtering, 37,484 items with 7,966,257 sessions remained in the Yoochoose dataset, while the RetailRocket dataset contains 46,874 items with 710,856 clicks. A summary of the datasets is given in Table 2. Following [6, 9, 69], we set the data of the last day as the test data and the remaining data as training data for the Yoochoose 1/64 and Yoochoose 1/4 fractions. For the RetailRocket dataset, similar to [61], we set the data of the last week as the test data and the remaining data for training.

Table 2 Statistics of datasets used for evaluation

4.2 Evaluation metrics

We used two accuracy metrics to evaluate the performance of all the models, Precision (\(P@k\)) and Mean Reciprocal Rank (\(MRR@k\)), similar to previous works [9, 10, 37, 60]. Both metrics evaluate the accuracy of the recommended top-k list; MRR@k additionally penalizes the ranking order of the recommended list.

\({\varvec{P}}@{\varvec{k}}\): Mathematically, \(P@k\) can be defined as:

$$P@k = \frac{{n_{hit} }}{N},$$
(17)

where \({n}_{\mathrm{hit}}\) is the number of correctly recommended items within the top-\(k\) positions, and \(N\) is the total number of test cases. It measures the proportion of test items that are correctly recommended within the top-\(k\) positions of the ranking list.

\({\varvec{M}}{\varvec{R}}{\varvec{R}}@{\varvec{k}}\): \(MRR@k\) can be defined as:

$${\text{MRR}}@k = \frac{1}{N}\mathop \sum \limits_{t \in T} \frac{1}{{{\text{Rank}}\left( t \right)}},$$
(18)

where \(t\) is a test case in the test set \(T\) and \(\mathrm{Rank}\left( t \right)\) is the position of its target item in the ranking list. \(\mathrm{MRR}@k\) is set to zero if the rank of \(t\) is above \(k\). It is the average of the reciprocal ranks of correctly recommended items. It is a better metric for evaluating the accuracy of recommender systems, since the aim is to put the most relevant items at the top of the recommended list. We evaluate \(P@k\) and \(\mathrm{MRR}@k\) for \(k=5, 10\), since users are more likely to select items that appear at the top of the recommended list compared to items with lower rankings [70, 71].
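
As a reference implementation, the sketch below gives one reading of Eqs. (17)–(18), where each test session contributes exactly one target item.

```python
def precision_mrr_at_k(rankings, targets, k=10):
    """P@k and MRR@k over a test set, Eqs. (17)-(18).

    rankings: per test session, item ids sorted by predicted score in
    descending order; targets: the ground-truth next item per session.
    """
    hits, recip_rank = 0, 0.0
    for ranking, target in zip(rankings, targets):
        top_k = list(ranking[:k])
        if target in top_k:
            hits += 1
            recip_rank += 1.0 / (top_k.index(target) + 1)  # 1-based rank
        # a target outside the top-k contributes zero to MRR@k
    n = len(targets)
    return hits / n, recip_rank / n

# Example: the first target is ranked 1st, the second misses the top-2
p_at_2, mrr_at_2 = precision_mrr_at_k([[3, 7, 1], [5, 2, 9]], [3, 9], k=2)  # (0.5, 0.5)
```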

4.3 Baselines

We evaluate the performance of the IC-GAR model against the following representative state-of-the-art baselines and closely related works. We use the hyperparameters reported in the original papers for models evaluated on the same datasets and tune the hyperparameters for the other datasets.

BPR-MF [72] cannot be directly used for session-based recommendation because it does not consider the sequence of interactions. To use MF for session-based recommendation, the latent representations of the items within a session can be used to represent the session.

FPMC [5] is an MC-based model for sequential recommendation. It is a state-of-the-art method for next-basket recommendation.

GRU4Rec [6] first introduced RNNs for session-based recommendation. It uses a GRU with pair-wise ranking and parallel mini-batches to speed up training.

NARM [10] is an encoder-decoder model for session-based recommendation. It uses a GRU to learn both the local and the global preference within each session.

STAMP [37] is an attention memory priority model that uses an MLP to capture the long-term and the short-term user interest within the current session.

SR-GNN [60] uses a GNN to inject the higher-order transitions between the items in each session and learns the global and the local preference for session-based recommendation.

CSRM [9] uses inner and outer memory networks for session-based recommendation. The inner memory network learns a user's interest from the current session, and the outer memory network uses a similarity function to learn a user's interest from neighboring sessions.

GCE-GNN [66] uses epsilon neighbors to augment the long-term user preference of SR-GNN, while neglecting the short-term user preference.

4.4 Parameter settings

All the weight matrices and the embeddings were initialized using a Gaussian distribution with mean 0 and standard deviation 0.1, and zero initialization was used for all the biases. A mini-batch size of 512 was used and the number of epochs was set to 10. Grid search was used on all the datasets for hyperparameter selection based on the MRR@10 score on the validation set. The hyperparameters in the grid search include: learning rate η in {0.01, 0.05, 0.001, 0.005, 0.0001}, learning rate decay λ in {0.1, 0.3, 0.5, 0.7} and embedding dimension d in {50, 100, 150, 200}. Based on the average performance, we used the following hyperparameter settings on the test data: {η = 0.001, λ = 0.1, d = 100}. We set the number of GNN layers to 2, with the message dropout in each layer set to 20% and the node dropout set to 40% to overcome overfitting. IC-GAR was implemented using TensorFlow, and our implementation will be made available for reproducibility.

4.5 Performance comparison

To evaluate the performance of the proposed IC-GAR model, we start with comparing the performance against the state-of-the-art models. We further compare the training time of the proposed model with other RNN-based state-of-the-art models.

4.5.1 Overall performance

Table 3 shows the performance comparison, with the best performance shown in bold face. The following observations can be made:

  • BPR-MF shows the worst performance on all three datasets, which indicates that traditional MF methods are not sufficient for modeling users' dynamic preferences. FPMC is a first-order MC sequential model that only considers the last item for recommendation. FPMC outperforms BPR-MF on all three datasets, demonstrating the necessity of modeling users' sequential patterns for performance enhancement.

  • GRU4Rec is an RNN-based model that is able to model longer sequences for recommendation. It outperforms both FPMC and BPR-MF on all the datasets, demonstrating the necessity of modeling longer sequences. However, it only uses the last hidden state of the GRU for recommendation.

  • NARM and STAMP both outperform GRU4Rec on all three datasets, which demonstrates the necessity of learning both a user's local and global preference for recommendation. STAMP slightly outperforms NARM on the RetailRocket dataset, while NARM outperforms STAMP on both Yoochoose datasets. This might be a result of the nature of the datasets. It also shows that RNN-based models are sufficient for session-based recommendation in most settings.

  • CSRM is an RNN-based model that performs better than NARM and STAMP on all three datasets. In addition to the local and the global preference, CSRM utilizes neighboring sessions for improved recommendation.

  • SR-GNN and GCE-GNN are GNN-based models that also perform better than NARM and STAMP on all three datasets. In addition to the local and global preference, SR-GNN utilizes the transition interactions between the items in the same session to improve the recommendation performance. GCE-GNN, on the other hand, utilizes item-level information from epsilon neighbors to augment the global preference; it does not consider the local preference as the other models do. Compared with CSRM, the two GNN-based models (SR-GNN and GCE-GNN) perform better on the Yoochoose 1/64 and Yoochoose 1/4 datasets. However, on the RetailRocket dataset, CSRM and GCE-GNN outperform SR-GNN. This shows the significance of utilizing the additional information from neighboring sessions in session-based recommendation.

  • IC-GAR significantly outperforms all the baseline models on MRR@5, MRR@10 and P@5. In particular, on the Yoochoose 1/64 dataset, IC-GAR outperforms the best baseline by 17.9, 15.4 and 5.9% on MRR@5, MRR@10 and P@5, respectively. On Yoochoose 1/4, IC-GAR performs better than the best baseline by 11.6, 9.7 and 4.2% on MRR@5, MRR@10 and P@5, respectively. On RetailRocket, IC-GAR outperforms all the baselines, beating the best one by 21.1, 19.6, 13.1 and 7.1% on MRR@5, MRR@10, P@5 and P@10, respectively. Of particular importance is the performance of IC-GAR in terms of MRR@5 and MRR@10: it outperforms the best baseline by 9.7–21.1% on all datasets. This clearly shows that considering item co-occurrence patterns can significantly improve the ordering of the recommended list. IC-GAR performs slightly worse than the best baseline on P@10 on the Yoochoose 1/64 and 1/4 datasets. This may be because IC-GAR constructs only one graph for all sessions, so some local patterns may not be fully exploited. However, GNN models are slow to train, especially when the size and number of graphs are large, and the training time is reduced because only one graph is used for all the sessions.

Table 3 Overall performance comparison with the state-of-the-art models (values are in percentages)

4.5.2 Performance w.r.t. session length

The performance of session-based recommendation models may differ as the length of sessions increases or decreases. We therefore compare the performance of SR-GNN, CSRM, GCE-GNN and IC-GAR on short and long sessions on the Yoochoose 1/64 and RetailRocket datasets with P@10 and MRR@10. Similar to SR-GNN, we divide sessions into “short” and “long” based on the average session length: on both datasets, sessions with length greater than 5 are used as “long” sessions, while the rest are used as “short” sessions. Table 4 shows the results. It can be seen that across all models, the performance drops significantly on “long” sessions. GCE-GNN significantly outperforms the other models on the Yoochoose 1/64 dataset on the P@10 metric, which may be attributed to the epsilon neighborhood that GCE-GNN considers. The performance of SR-GNN on the RetailRocket dataset for “long” sessions is of particular note: there is a massive drop in performance, which can be attributed to the maximum session length of the RetailRocket dataset. This suggests that SR-GNN may not be a suitable model as the session length drastically increases. For “short” sessions, in contrast, performance improves across all models. This shows that session-based recommendation models were designed with short sessions in mind, and that as the session length increases, other factors need to be considered to improve performance.

Table 4 Performance w.r.t. session length

4.5.3 Performance w.r.t. training time

We compare the training time of IC-GAR with the best-performing baseline models, namely SR-GNN, GCE-GNN and CSRM. The comparison is motivated by the slow training of GNN models as the size and number of the graphs increase. Figure 3 shows the average training time per epoch on all three datasets on the same GPU server. It can be seen that SR-GNN and GCE-GNN take, on average, twice the time required to train CSRM per epoch. This time will increase significantly with session length in SR-GNN and GCE-GNN, due to the size of the outgoing and incoming adjacency matrices that the models construct for each session. On average, IC-GAR takes less training time per epoch than CSRM despite using a GNN. This can be attributed to the fact that IC-GAR constructs only one graph for the whole dataset, and that the size of this graph does not depend on the session length but rather on the number of items in the catalog.

Fig. 3 Training time per epoch (best viewed in colour)

4.6 Effect of item co-occurrence graph

IC-GAR distinguishes itself from other RNN-based models for session-based recommendation by constructing an item co-occurrence graph and learning on it with a GNN. Here, we investigate the relevance of the item co-occurrence graph for session-based recommendation. Table 5 shows the performance of IC-GAR with and without the item co-occurrence graph; we name the model without the item co-occurrence graph SRB, while the model with it remains IC-GAR. It can be seen that on all three datasets, using the item co-occurrence graph significantly improves the performance: on average, there is an improvement of at least 15.7, 8.5 and 36% on the Yoochoose 1/64, Yoochoose 1/4 and RetailRocket datasets, respectively. This shows that learning co-occurrence patterns can significantly improve performance in session-based recommendation. Although the effect of the item co-occurrence graph is most pronounced on RetailRocket, it improves the performance on all datasets.

Table 5 Effect of item co-occurrence graph in IC-GAR (values are in percentages)

4.7 Ablation study

As various components play different roles in the performance of IC-GAR, we investigate the relevance of the different choices in the architecture. First, we study the effect of the embedding size of the GRU and GCN. We then study the effect of different aggregation methods. Finally, we study the effect of the graph type used in the GCN.

4.7.1 Effect of embedding size

For fair comparison, we used the same embedding size as the other baseline models (embedding size = 100) for the overall performance in Table 3. In this section, we show the effect of different embedding sizes on the performance of IC-GAR. Table 6 shows the performance as the embedding size varies from 50 to 200 on all three datasets; we used the same embedding size for the GCN, the GRU and all the weights. It can be seen that on all datasets, the performance deteriorates when the embedding size is 50, whereas the performance is fairly similar for dimensions of 100, 150 and 200. This shows that once the embedding size is sufficient, the performance is insensitive to larger embedding sizes. However, as the embedding size increases, the training time and the model size increase correspondingly. Hence, an embedding size of 100 was an optimal selection.

Table 6 Effect of embedding size on the performance of all three datasets (values are in percentages)

4.7.2 Effect of aggregation

Different permutation-invariant aggregation methods, such as concatenation, max pooling and mean pooling, can be used to obtain the output of the GCN. Table 7 shows the effect of concatenation, max pooling and mean pooling on the performance of IC-GAR on all three datasets. It can be seen that concatenation outperforms the other aggregation methods across all metrics, which suggests that concatenation contributes to the success of IC-GAR. We further compare the performance of these aggregation methods as the number of epochs increases from 1 to 10 on P@10 and MRR@10 across all datasets. Figure 4 shows that across all the datasets, concatenation outperforms both mean pooling and max pooling. However, the performance varies at lower epochs: on the Yoochoose 1/4 and RetailRocket datasets, mean pooling outperforms the other methods in the first two epochs, but concatenation stabilizes to a higher accuracy as the number of epochs increases (Fig. 4).

Table 7 Effect of different GCN aggregation methods
Fig. 4 Effect of GCN aggregation method

4.7.3 Effect of graph type

Previous studies on GNN-based session-based recommendation, such as [60, 61, 73], used directed graphs and modeled both the incoming and outgoing adjacency matrices. However, these models apply the GNN to each session individually. Inspired by STAMP [37], which showed that the order of interactions may not be important on online transactional datasets such as Yoochoose, we use an undirected graph in the IC-GAR model, which avoids the computational complexity introduced by using both the incoming and outgoing adjacency matrices. To show the effect of this decision, Table 8 compares the performance of the IC-GAR model with the undirected graph and with a directed graph (having both the incoming and outgoing graphs). Although the undirected and the directed graph achieve close performance, the undirected graph reduces the computational complexity and ensures a training time comparable to non-GNN-based models.

Table 8 Effect of using directed and undirected graph on performance of IC-GAR

5 Discussion

In this section, we discuss our results in light of the research questions posed in Sect. 4.

5.1 Does the proposed IC-GAR model achieve the state-of-the-art performance?

We conducted experiments on two publicly available datasets with two accuracy metrics to determine the performance of IC-GAR against other state-of-the-art models. Table 3 shows that IC-GAR achieves state-of-the-art performance against RNN-based models like CSRM and GNN-based models like GCE-GNN. However, as the value of k increases, the performance of IC-GAR deteriorates on the Yoochoose dataset, while on the RetailRocket dataset IC-GAR outperforms the competition across all metrics. The results suggest that the performance of models differs from one dataset to another, and that for industrial applications, the bias and nature of the dataset need to be considered before selecting any model. We also compared the performance of IC-GAR for different session lengths on the Yoochoose 1/64 and RetailRocket datasets. The results showed a similar trend in performance as when the whole datasets were used; however, the performance of SR-GNN particularly deteriorates on “long” sessions of the RetailRocket dataset, suggesting that SR-GNN may not be a good model as the session length drastically increases. Finally, we compared the training times of SR-GNN, CSRM, GCE-GNN and IC-GAR on the whole sessions. The results suggest that CSRM and IC-GAR have similar time complexity, while the time complexity of SR-GNN and GCE-GNN is more than double that of the other models. The overall results suggest that IC-GAR is an efficient model that can outperform other state-of-the-art models on relevant datasets.

5.2 What is the effect of the item co-occurrence graph on the performance of IC-GAR?

IC-GAR combines the local preference, the global preference and an item co-occurrence graph for improved performance. We compared the performance with and without the item co-occurrence graph, and the results suggest that the graph can significantly improve performance. The results are in line with the findings of CSRM, where session-level collaborative information was used to improve a similar baseline; our model instead uses item-level collaborative information for improved performance.

5.3 How well does IC-GAR perform with different embedding size, the aggregation methods and the graph type?

We also studied the effect of some key components of the IC-GAR model. The results suggest that, with a small embedding size, IC-GAR does not reach its full performance, but after a sufficient embedding size is reached, increasing it further does not significantly improve performance; rather, it increases the complexity of the model and slows down training and inference. The results also showed that the performance of the aggregation methods varies across datasets, but in our experiments the compared methods (concatenation, mean pooling and max pooling) achieve relatively similar performance. It may be relevant to also compare the aggregation methods in terms of time complexity. Finally, we compared two different graph construction methods (undirected and combined directed). The experimental results suggest that there is no significant difference between these methods in terms of performance; however, using both the incoming and outgoing adjacency matrices can increase the computational complexity significantly.

6 Conclusion

In this paper, we proposed IC-GAR, a novel session-based recommendation model that uses a trilinear decomposition to build the session representation from the global preference, the local preference and the session co-occurrence. The session co-occurrence representation aggregates the higher-order transition patterns of all the items in the training sessions, while the global and the local preferences model user interest in the current session. Experimental results showed that IC-GAR achieves state-of-the-art performance for session-based recommendation by exploiting item co-occurrence patterns.