
1 Introduction

In the Internet era, recommender systems have found their way into various business applications, such as e-commerce, online advertising, and social media. Recently, sequential recommendation has emerged as the mainstream approach for next-item recommendation. Learning a user's latent intentions from his/her temporally ordered interactions lies at the core of sequential recommendation. In real-world scenarios, a user normally exhibits multiple intentions in his/her historical interactions. To this end, some very recent studies [4, 7, 10] have started to explore a user's multiple latent intentions in different ways.

While these studies have confirmed that modeling a user’s multiple intentions is a rewarding research direction, we argue that they still suffer from two major limitations. First, they largely neglect the dynamic evolution of individual intentions. While previous studies emphasize the extraction of multiple intentions from user interaction sequences, they overlook the benefits of modeling the dynamic evolution of each individual intention, which is essential for next-item recommendation. Second, modeling a user’s intention to interact with an item as a weighted sum of multiple intentions is counter-intuitive. While a user exhibits multiple intentions in his/her historical interaction sequence, the interaction with a particular item is usually driven by a single intention.

In this paper, we propose a novel Multi-Intention Sequential Recommender (MISRec) to address these two limitations. We first design a multi-intention extraction module to extract multiple intentions from user interaction sequences. Next, we propose a multi-intention evolution module, consisting of an intention-aware remapping layer and an intention-aware evolution layer. The intention-aware remapping layer incorporates position information and recommendation time intervals to generate multiple intention-aware sequences, where each sequence corresponds to a learned intention. The intention-aware evolution layer is used to learn the dynamic evolution of each intention-aware sequence. Finally, we produce next-item recommendations by explicitly projecting a candidate item into multiple intention subspaces and determining its relevance to each intention. Empowered by Gumbel-softmax, we devise a multi-intention aggregation module to adaptively determine whether each intention is relevant to the target item or not. We perform a comprehensive experimental study on three public benchmark datasets and demonstrate that MISRec consistently outperforms representative state-of-the-art competitors.

2 Related Work

Sequential recommendation has emerged as a prominent paradigm for next-item recommendation. GRU4Rec [1] is the first to employ gated recurrent units (GRUs) to extract sequential patterns from user interaction sequences. Caser [8] adopts convolutional neural networks (CNNs) as the backbone network to learn sequential patterns as local features of recent items. NARM [6] uses an attention mechanism to capture more flexible sequential patterns from user interaction sequences. SASRec [2] leverages self-attention to adaptively weigh interacted items. All the above works assume that a user has only a monolithic intention and thus a single embedding representation, which does not reflect reality well, leaving much room for further improvement. As such, some recent works have started to explore how to better model users with multiple intentions. MCPRN [10] designs a dynamic purpose routing network to capture different user intentions. SINE [7] activates sparse user intentions from a given concept pool and then aggregates the intentions for next-item recommendation.

3 Proposed Method

3.1 Problem Setting

Let \(\mathcal {U}=\{u_1, u_2, \cdots , u_{|\mathcal {U}|}\}\) and \(\mathcal {I}=\{i_1, i_2, \cdots , i_{|\mathcal {I}|}\}\) be the set of all users and the set of all items, respectively. Given a sequence of user u's historically interacted items \(S^u = \left( s_1^u, s_2^u, \cdots , s_{l}^u\right) \) with \(s_i^u \in \mathcal {I}\) and the corresponding time sequence \(T^u = \left( t_1^u, t_2^u, \cdots , t_{l}^u\right) \) with \(t_1^u \le t_2^u \le \cdots \le t_l^u\), the goal of sequential recommendation is to predict the item with which user u is most likely to interact next. In addition, the recommendation time t is important for prediction. We transform the interaction time sequence \(T^u\) into a new time interval sequence \(Tiv^u = \left( tiv_1^u, tiv_2^u, \cdots , tiv_{l}^u\right) \), where \(tiv_i^u = \min (t - t_i^u, \tau )\) with \(\tau \) being a hyperparameter controlling the maximum time interval.
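To make this transformation concrete, the following minimal sketch (in NumPy; the function and variable names are illustrative, not from the paper) computes the clipped time-interval sequence for a given recommendation time:

```python
# A minimal sketch of the time-interval transformation tiv_i = min(t - t_i, tau).
# The names to_time_intervals, timestamps, and rec_time are illustrative.
import numpy as np

def to_time_intervals(timestamps: np.ndarray, rec_time: float, tau: float) -> np.ndarray:
    """Clip each interval t - t_i to the maximum time interval tau."""
    return np.minimum(rec_time - timestamps, tau)

# Example: three interactions, recommendation time t = 1000, tau = 512.
print(to_time_intervals(np.array([100.0, 800.0, 990.0]), rec_time=1000.0, tau=512.0))
# -> [512. 200.  10.]
```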

3.2 Embedding Layer

Following previous works, we first transform user u's interaction sequence \((s^u_1, s^u_2, \cdots , s^u_{l})\) into a fixed-length sequence \((s^u_1, s^u_2, \cdots , s^u_n)\), where n denotes the maximum length that our model handles (truncating longer sequences to the most recent n items and padding shorter ones, following common practice). In the embedding layer, we create an item embedding matrix \(\textbf{E}_i \in \mathbb {R}^{|\mathcal {I}|\times d}\) based on the one-hot encodings of item IDs, where d is the dimension of embedding vectors. Then we retrieve the interaction sequence embedding matrix \(\textbf{E}_{S^u} = \begin{bmatrix} e_{s^u_1}, e_{s^u_2}, \cdots , e_{s^u_n}\end{bmatrix} \in \mathbb {R}^{n \times d}\), where \(e_{s^u_i}\) is the embedding of item \(s^u_i\) in \(\textbf{E}_i\). We also establish two embedding matrices, \(\textbf{E}_{P} = \begin{bmatrix} e_{p_1}, e_{p_2}, \cdots , e_{p_n} \end{bmatrix} \in \mathbb {R}^{n \times d}\) for absolute positions and \(\textbf{E}_{{Tiv}^u} =\begin{bmatrix} e_{{tiv}^u_1}, e_{{tiv}^u_2}, \cdots , e_{{tiv}^u_n}\end{bmatrix} \in \mathbb {R}^{n \times d}\) for relative time intervals.
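The sketch below illustrates one possible implementation of this layer in PyTorch. It assumes the clipped time intervals are integer-valued so they can index an embedding table, and that index 0 is reserved for padding; both are implementation choices not spelled out in the text.

```python
# A hedged sketch of the embedding layer; padding conventions are assumptions.
import torch
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    def __init__(self, num_items: int, max_len: int, tau: int, d: int):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, d, padding_idx=0)  # E_i (0 = padding)
        self.pos_emb = nn.Embedding(max_len, d)    # E_P: absolute positions
        self.tiv_emb = nn.Embedding(tau + 1, d)    # E_Tiv: clipped time intervals

    def forward(self, item_ids, tivs):
        # item_ids, tivs: (batch, n) integer tensors
        positions = torch.arange(item_ids.size(1), device=item_ids.device).unsqueeze(0)
        return self.item_emb(item_ids), self.pos_emb(positions), self.tiv_emb(tivs)
```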

3.3 Multi-Intention Extraction Module

To capture multiple intentions behind a user's historical interaction sequence, we propose a multi-intention extraction module based on the multi-head attention mechanism. Specifically, we map the embedding matrix of a user's interaction sequence \(\textbf{E}_{S^u}\) into different latent subspaces using multiple heads, where each head represents one intention of the user. Let \(\gamma \) be the number of heads and thus the number of intentions. We generate the kth intention \(m_k^u\) via

$$\begin{aligned} m_k^u = head_k^u\textbf{W}_t, \end{aligned}$$
(1)
$$\begin{aligned} head_k^u = Attention(\textbf{E}_{S^u} \textbf{W}_k^Q, \textbf{E}_{S^u} \textbf{W}_k^K, \textbf{E}_{S^u} \textbf{W}_k^V), \end{aligned}$$
(2)

where \(head_k^u \in \mathbb {R}^{1 \times \frac{d}{\gamma }}\) is the output of the kth head of the multi-head attention layer. Note that, to match the dimension of item embeddings, a transformation matrix \(\textbf{W}_t \in \mathbb {R}^{\frac{d}{\gamma } \times d}\) is introduced to transform \(head_k^u\) from \(\mathbb {R}^{1 \times \frac{d}{\gamma }}\) to \(\mathbb {R}^{1 \times d}\). \(Attention(\cdot )\) is an attention function, and \(\textbf{W}_k^Q\), \(\textbf{W}_k^K\), and \(\textbf{W}_k^V \in \mathbb {R}^{d \times \frac{d}{\gamma }}\) are the trainable transformation matrices of the kth head's query, key, and value, respectively. Inspired by previous works [9], we adopt the scaled dot-product as the attention function:

$$\begin{aligned} Attention(\textbf{Q,K,V}) = softmax(\frac{\textbf{QK}^\top }{\sqrt{d}})\textbf{V}. \end{aligned}$$
(3)

After the multi-intention extraction module, we obtain a user u’s \(\gamma \) intentions, denoted by \((m_1^u, m_2^u, \cdots , m_{\gamma }^u)\).
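A sketch of this module is given below. The paper states that each head output is \(1 \times \frac{d}{\gamma }\), although attention over an n-item sequence naturally yields \(n \times \frac{d}{\gamma }\); the sketch assumes the representation at the last position is kept, which is one plausible reading, not a confirmed detail.

```python
# A sketch of the multi-intention extraction module (Eqs. 1-3). Keeping the
# last position of each head's output is an assumption, not stated in the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiIntentionExtraction(nn.Module):
    def __init__(self, d: int, gamma: int):
        super().__init__()
        assert d % gamma == 0
        self.gamma, self.dk = gamma, d // gamma
        self.W_q = nn.Linear(d, d, bias=False)        # packs all W_k^Q
        self.W_k = nn.Linear(d, d, bias=False)        # packs all W_k^K
        self.W_v = nn.Linear(d, d, bias=False)        # packs all W_k^V
        self.W_t = nn.Linear(self.dk, d, bias=False)  # W_t in Eq. 1, shared across heads

    def forward(self, E_s):                            # E_s: (batch, n, d)
        b, n, d = E_s.shape
        split = lambda x: x.view(b, n, self.gamma, self.dk).transpose(1, 2)
        Q, K, V = split(self.W_q(E_s)), split(self.W_k(E_s)), split(self.W_v(E_s))
        attn = F.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)  # Eq. 3
        heads = attn @ V                               # (batch, gamma, n, dk)
        return self.W_t(heads[:, :, -1, :])            # (batch, gamma, d) intentions
```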

3.4 Multi-Intention Evolution Module

With the extracted multiple intentions from the multi-intention extraction module, we next capture the dynamic evolution of each intention via a multi-intention evolution module, which consists of an intention-aware remapping layer and an intention-aware evolution layer.

Intention-Aware Remapping Layer. Simply capturing sequential patterns over the learned intentions provides no guarantee of precisely modeling the dynamic evolution of user intentions [5]. Therefore, we first design an intention-aware remapping layer to explicitly inject sequential and temporal information into intention-aware interaction sequences. In particular, we devise an extended scaled dot-product attention mechanism, where the learned intentions play the role of query vectors, and the keys and values are derived from the interaction sequence enriched with positional and temporal information:

$$\begin{aligned} \textbf{Key} = \textbf{E}_{S^u} \textbf{W}_S^K + \textbf{E}_P\textbf{W}_P^K + \textbf{E}_{{Tiv}^u} \textbf{W}_T^K, \quad \textbf{Value} = \textbf{E}_{S^u} \textbf{W}_S^V + \textbf{E}_P \textbf{W}_P^V + \textbf{E}_{{Tiv}^u} \textbf{W}_T^V, \end{aligned}$$
(4)

where \(\textbf{E}_{S^u}\), \(\textbf{E}_P\), \(\textbf{E}_{{Tiv}^u} \in \mathbb {R}^{n \times d}\) are the embedding matrices of the interaction sequence, position sequence and time interval sequence, respectively. \(\textbf{W}^K\) and \(\textbf{W}^V \in \mathbb {R}^{d \times d}\) are the trainable matrices for keys and values, where the subscripts S, P and T indicate the matrices for the interaction sequence, position sequence and time interval sequence, respectively. Then we compute a new intention-aware interaction sequence \(\textbf{S}_k^u=(s^u_{k1}, s^u_{k2}, \cdots , s^u_{kn})\) via

$$\begin{aligned} \textbf{S}_k^u = softmax \left( \frac{(m_k^u \textbf{W}^Q_{S_k}) \textbf{Key}^\top }{\sqrt{d}}\right) \textbf{Value}, \end{aligned}$$
(5)

where \(\textbf{W}^Q_{S_k} \in \mathbb {R}^{d \times d}\) is the trainable matrix for intention \(m^u_k\).
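The following sketch implements one reading of Eqs. 4 and 5. Since a single query vector \(m_k^u\) would produce a \(1 \times d\) output under standard attention, while \(\textbf{S}_k^u\) is described as an n-step sequence, we assume the attention weights rescale each value vector so that the result remains an n-step sequence the evolution layer can consume; this is an interpretation, not a confirmed detail.

```python
# A sketch of the intention-aware remapping layer (Eqs. 4-5). Reading Eq. 5 as
# reweighting each value vector (so S_k stays an n-step sequence) is an
# assumption; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntentionAwareRemapping(nn.Module):
    def __init__(self, d: int, gamma: int):
        super().__init__()
        self.d = d
        # Key/value projections for item, position, and time-interval embeddings.
        self.WK_s, self.WK_p, self.WK_t = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.WV_s, self.WV_p, self.WV_t = (nn.Linear(d, d, bias=False) for _ in range(3))
        # One query projection per intention (W^Q_{S_k} in Eq. 5).
        self.WQ = nn.ModuleList(nn.Linear(d, d, bias=False) for _ in range(gamma))

    def forward(self, m, E_s, E_p, E_t):
        # m: (batch, gamma, d) intentions; E_s, E_p, E_t: (batch, n, d)
        K = self.WK_s(E_s) + self.WK_p(E_p) + self.WK_t(E_t)   # keys, Eq. 4
        V = self.WV_s(E_s) + self.WV_p(E_p) + self.WV_t(E_t)   # values, Eq. 4
        seqs = []
        for k, WQ_k in enumerate(self.WQ):
            q = WQ_k(m[:, k, :]).unsqueeze(1)                  # (batch, 1, d)
            w = F.softmax(q @ K.transpose(1, 2) / self.d ** 0.5, dim=-1)  # (batch, 1, n)
            seqs.append(w.transpose(1, 2) * V)                 # reweighted sequence
        return torch.stack(seqs, dim=1)                        # (batch, gamma, n, d)
```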

Intention-Aware Evolution Layer. To capture the dynamic evolution of each intention, we employ gated recurrent units (GRUs) to model the dependencies between interacted items under each individual intention. Specifically, the input to the GRU for the kth intention is the kth intention-aware interaction sequence \(\textbf{S}^u_k\). We utilize the last hidden state \(h_{k}^u\) of the GRU to represent the user u under the kth intention. We further adopt a point-wise feedforward network (FFN) to endow the model with non-linearity and consider interactions between different latent dimensions:

$$\begin{aligned}&m_k^u = h_k^u + Dropout(FFN(LayerNorm(h_k^u))), \end{aligned}$$
(6)
$$\begin{aligned}&LayerNorm(x)= \alpha \odot \frac{x - \mu }{\sqrt{\sigma ^2 + \epsilon }} + \beta , \end{aligned}$$
(7)
$$\begin{aligned}&FFN(x)=ReLU(x \textbf{W}_{1} + b_{1}) \textbf{W}_{2} + b_{2}, \end{aligned}$$
(8)

where \(\textbf{W}_{1}, \textbf{W}_{2} \in \mathbb {R}^{d \times d}\) are learnable matrices, and \(b_{1}\), \(b_{2}\) are d-dimensional bias vectors. \(\mu \) and \(\sigma ^2\) are the mean and variance of x, and \(\alpha \) and \(\beta \) are the learned scaling factor and bias term, respectively. We apply layer normalization to the input \(h^u_k\) before feeding it into the FFN, apply dropout to the FFN's output, and add the input \(h^u_k\) to the final output.
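A minimal sketch of this layer, assuming a single-layer GRU with hidden size d and a dropout rate left as a hyperparameter (neither is specified in the text):

```python
# A sketch of the intention-aware evolution layer (Eqs. 6-8): a GRU summarizes
# each intention-aware sequence; a pre-LayerNorm FFN with dropout and a
# residual connection refines the last hidden state.
import torch
import torch.nn as nn

class IntentionAwareEvolution(nn.Module):
    def __init__(self, d: int, dropout: float = 0.2):
        super().__init__()
        self.gru = nn.GRU(d, d, batch_first=True)
        self.norm = nn.LayerNorm(d)                                            # Eq. 7
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))  # Eq. 8
        self.drop = nn.Dropout(dropout)

    def forward(self, S_k):                            # S_k: (batch, n, d)
        _, h = self.gru(S_k)                           # h: (1, batch, d), last hidden state
        h = h.squeeze(0)
        return h + self.drop(self.ffn(self.norm(h)))   # Eq. 6
```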

3.5 Multi-Intention Aggregation Module

Intuitively, a user's interaction with an item is usually driven by a single intention. Directly combining the multiple intention representations into the final intention representation is thus counter-intuitive and cannot maximize the benefits of extracting multiple intentions. To address this issue, we adopt the Gumbel-softmax to adaptively determine whether an intention is relevant to the candidate item or not. Specifically, we first explicitly project the candidate item's embedding \(e_{n+1}\) into the different intention subspaces (see Eq. 9), and then calculate the relevance between each intention representation and the candidate item's embedding in each intention subspace via the inner product (see Eq. 10). Unlike previous methods that use the softmax to aggregate the multiple intention representations, we adopt the Gumbel-softmax to identify the most relevant intention (see Eqs. 11 and 12). Finally, we obtain the final representation \(m^u\) of user u at the finer granularity of intentions.

$$\begin{aligned}&e_{n+1}^k=e_{n+1} \textbf{W}^k, \end{aligned}$$
(9)
$$\begin{aligned}&r_{n+1}^k=e_{n+1}^k {m_k^u}^\top ,\end{aligned}$$
(10)
$$\begin{aligned}&a_k =\frac{\exp ((\log (r_{n+1}^k) + g_k) / \tau )}{\sum _{j=1}^{\gamma } \exp ((\log (r_{n+1}^j) + g_j) / \tau )},\end{aligned}$$
(11)
$$\begin{aligned}&m^u = \sum _{k=1}^{\gamma } a_k \, m_k^u, \end{aligned}$$
(12)

where \(g_k\) are i.i.d. samples drawn from the Gumbel(0, 1) distribution, and \(\tau \) here denotes the temperature of the Gumbel-softmax (distinct from the maximum time interval \(\tau \) in the problem setting).
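The sketch below uses PyTorch's built-in F.gumbel_softmax, which treats its input as logits; passing the relevance scores directly matches Eq. 11 up to the \(\log \) term, an implementation simplification. With hard=True, a one-hot intention is selected in the forward pass while gradients flow through the straight-through estimator.

```python
# A sketch of the multi-intention aggregation module (Eqs. 9-12). Relevance
# scores are passed to F.gumbel_softmax as logits (i.e., without the log term
# of Eq. 11), an implementation simplification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiIntentionAggregation(nn.Module):
    def __init__(self, d: int, gamma: int, temperature: float = 0.1):
        super().__init__()
        # One projection matrix W^k per intention subspace (Eq. 9).
        self.W_proj = nn.Parameter(torch.randn(gamma, d, d) * 0.02)
        self.temperature = temperature

    def forward(self, m, e_cand):
        # m: (batch, gamma, d) intentions; e_cand: (batch, d) candidate embedding
        e_k = torch.einsum('bd,kde->bke', e_cand, self.W_proj)    # Eq. 9
        r = (e_k * m).sum(-1)                                     # Eq. 10
        a = F.gumbel_softmax(r, tau=self.temperature, hard=True)  # Eq. 11
        return (a.unsqueeze(-1) * m).sum(dim=1)                   # Eq. 12 -> (batch, d)
```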

3.6 Model Training

After obtaining the final representation \(m^u\) of user u, the prediction score is calculated as the inner product of \(m^u\) and the candidate item's embedding \(e_{i}\):

$$\begin{aligned} r_{u, i} = e_{i} {m^u}^\top . \end{aligned}$$
(13)

We use the pairwise Bayesian personalized ranking (BPR) loss to optimize the model parameters. Specifically, it encourages the predicted scores of a user’s historical items to be higher than those of unobserved items:

$$\begin{aligned} \mathcal {L}_{BPR} = \sum _{(u,i,j)\in \mathcal {O}} -\ln \sigma (r_{u,i}-r_{u,j}) + \lambda {\Vert {\boldsymbol{\varTheta }} \Vert _2^2}, \end{aligned}$$
(14)

where \(\mathcal {O}=\{ (u,i,j)|(u,i) \in \mathcal {O}^{+}, (u,j) \in \mathcal {O}^{-} \}\) denotes the training dataset consisting of the observed interactions \(\mathcal {O}^{+}\) and sampled unobserved items \(\mathcal {O}^{-}\), \(\sigma (\cdot )\) is the sigmoid activation function, \(\boldsymbol{\varTheta }\) is the set of embedding matrices, and \(\lambda \) is the \(L_2\) regularization parameter.
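A minimal sketch of this objective follows; using the optimizer's weight decay in place of the explicit \(L_2\) term is a common implementation choice, not something the text prescribes.

```python
# A minimal sketch of the BPR objective (Eq. 14); the optimizer's weight decay
# stands in for the explicit L2 regularization term.
import torch
import torch.nn.functional as F

def bpr_loss(m_u: torch.Tensor, e_pos: torch.Tensor, e_neg: torch.Tensor) -> torch.Tensor:
    """m_u: (batch, d) user representations; e_pos, e_neg: (batch, d) item embeddings."""
    r_pos = (m_u * e_pos).sum(-1)   # r_{u,i}, Eq. 13
    r_neg = (m_u * e_neg).sum(-1)   # r_{u,j}
    return -F.logsigmoid(r_pos - r_neg).mean()

# e.g., optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
```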

4 Experiments

4.1 Experimental Setup

Datasets and Evaluation Metrics. We evaluate our framework on three public benchmark datasets that are widely used in the literature. The Amazon-Review datasets contain product reviews from the online shopping platform Amazon, and we use two representative subsets, Grocery and Gourmet Food (referred to as Grocery) and Beauty. The MovieLens datasets contain collections of movie ratings from the MovieLens website, and we use MovieLens-1M (referred to as ML1M) in our experiments. Following previous works [2, 5], we filter out cold-start users and items with fewer than 5 interactions and sort each user's interactions by timestamp. Following the standard leave-one-out protocol, we use the most recent item for testing, the second most recent item for validation, and the remaining items for training. We evaluate our framework with two widely adopted ranking metrics, Hit Ratio@N (HR@N) and Normalized Discounted Cumulative Gain@N (NDCG@N).
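For concreteness, the snippet below sketches how the two metrics can be computed under this leave-one-out protocol, assuming ranks holds the 1-based rank of each user's single held-out test item; with one relevant item per user, NDCG@N reduces to \(1/\log _2(rank+1)\) for hits within the top N.

```python
# A sketch of HR@N and NDCG@N under leave-one-out evaluation. ranks holds the
# 1-based rank of each user's single held-out test item (an assumption about
# how candidate items were scored and sorted).
import numpy as np

def hr_at_n(ranks: np.ndarray, n: int) -> float:
    return float(np.mean(ranks <= n))                 # fraction of hits in top N

def ndcg_at_n(ranks: np.ndarray, n: int) -> float:
    # With one relevant item per user, DCG reduces to 1 / log2(rank + 1).
    return float(np.mean(np.where(ranks <= n, 1.0 / np.log2(ranks + 1), 0.0)))

ranks = np.array([1, 3, 12, 50])                      # illustrative ranks for four users
print(hr_at_n(ranks, 10), ndcg_at_n(ranks, 10))       # -> 0.5 0.375
```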

Baselines. To demonstrate the effectiveness of our solution, we compare it with a wide range of representative sequential recommenders, including four single-intention-aware methods (GRU4Rec [1], NARM [6], Caser [8], and TiSASRec [5]) and a multi-intention-aware method, SINE [7].

Implementation Details. Following the settings of previous methods, the embedding size is fixed to 64. We optimize our method with Adam [3], setting the learning rates for Grocery, Beauty, and ML1M to \(10^{-4}\), \(10^{-3}\), and \(10^{-4}\), respectively, and the mini-batch size to 256 for all three datasets. The maximum length of interaction sequences for Grocery, Beauty, and ML1M is set to 10, 20, and 50, respectively. The maximum time interval is set to 512 seconds for all three datasets. The temperature parameter \(\tau \) of the Gumbel-softmax is set to 0.1. To mitigate overfitting, we use \(L_2\) regularization with coefficients of \(10^{-5}\) for Grocery and ML1M and \(10^{-4}\) for Beauty.
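For reference, the hyperparameters stated above can be collected into a single configuration sketch (the dictionary key names are illustrative):

```python
# Hyperparameters as stated in the text; key names are illustrative.
CONFIG = {
    "embedding_dim": 64,
    "optimizer": "Adam",
    "batch_size": 256,
    "learning_rate": {"Grocery": 1e-4, "Beauty": 1e-3, "ML1M": 1e-4},
    "max_seq_len": {"Grocery": 10, "Beauty": 20, "ML1M": 50},
    "max_time_interval_sec": 512,
    "gumbel_temperature": 0.1,
    "l2_reg": {"Grocery": 1e-5, "Beauty": 1e-4, "ML1M": 1e-5},
}
```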

Table 1. Performance of different models on the three datasets. All the numbers in the table are percentages with % omitted.

4.2 Main Results

Overall Comparison. We report the overall comparison in Table 1, where the best results are boldfaced and the second-best and third-best results are underlined. We can draw a few interesting observations: (1) TiSASRec achieves the best performance among the single-intention-aware methods, indicating the effectiveness of the self-attention mechanism and temporal information in capturing sequential patterns. However, without considering multiple user intentions, these methods cannot accurately identify a user's true intention, leading to sub-optimal recommendations. (2) As a multi-intention-aware method, SINE generally performs better than most single-intention-aware methods, which shows that explicitly exploring multiple user intentions is a rewarding direction. However, SINE still performs worse than TiSASRec. We attribute this to its neglect of the dynamic evolution of individual intentions and its improper aggregation of multiple intentions. (3) By addressing these two issues, MISRec maximizes the benefits of extracting multiple intentions and consistently yields the best performance on all datasets, which well justifies our motivation.

Table 2. Performance of different variants of MISRec. The results of HR@20 and NDCG@20 are omitted due to space limitations.

Ablation Study. To investigate the contributions of different components to the final performance, we conduct an ablation study comparing different variants of our MISRec model on the three datasets. The variants include: (1) w/o PE, which removes positional embeddings in the multi-intention evolution module; (2) w/o TIE, which removes time interval embeddings in the multi-intention evolution module; and (3) w/o GS, which replaces the Gumbel-softmax with the softmax in the multi-intention aggregation module. Table 2 shows the performance of all variants and the full MISRec model on the three datasets. Comparing the variants, we observe that both positional embeddings and time interval embeddings lead to performance improvements, which demonstrates the significance of explicitly modeling the dynamic evolution of different intentions. Furthermore, identifying the most relevant intention rather than aggregating multiple intentions consistently improves the performance by a significant margin, which justifies our motivation.

5 Conclusion

In this paper, we proposed a novel Multi-Intention Sequential Recommender (MISRec) to address the limitations of existing works that leverage users' multiple intentions for better next-item recommendations. We made two major contributions. First, we designed a multi-intention evolution module that effectively models the evolution of each individual intention. Second, we proposed to explicitly identify the most relevant intention rather than aggregate multiple intentions to maximize the benefits of extracting multiple intentions. A comprehensive experimental study on three public benchmark datasets demonstrates the superiority of the MISRec model over representative state-of-the-art competitors.