1 Introduction

The sequence recommendation (SR) [1, 2] task treats the interactions between users and items as a dynamic sequence, captures users’ interests over time by modeling sequential dependencies, and predicts future user interactions. Sequence recommendation serves many aspects of daily life, such as e-commerce [3, 4], news media [5, 6], video and music [7], and social networks [8].

Hence, a variety of SR models, including both shallow and deep models, have been proposed to improve the performance of sequential recommendation. Specifically, Recurrent Neural Networks built on Gated Recurrent Units (GRU) have been employed to model long- and short-term point-wise sequential dependencies over user-item interactions for next-item recommendation [9, 10]. Convolutional Neural Network (CNN) [11], self-attention [12, 13], and Graph Neural Network [14, 15] models have been incorporated into sequential recommendation systems to capture more complex sequential dependencies and further improve performance.

When modeling user interests with the above structures, it is common to represent a user’s interests as a single low-dimensional embedding, which contradicts the fact that each user may have multiple interests in reality [16]. Some studies [3, 17, 18, 19] propose capturing multiple user interest vectors from different aspects instead of a single vector. These methods explicitly generate diverse interest representations from users’ behavior sequences, breaking the representation bottleneck of a single generic user embedding. Although these solutions have achieved significant performance improvements, they do not take into account the differences between the multiple interests. In the worst case, all interest capsules have the same meaning and cannot reflect the diversity of user interests. Recent work has further improved the modeling of multiple user interests through routing regularization [20]. However, as shown in Figure 1, we computed the similarity between the multiple interest vectors generated by REMI [20] and by our HPCL4SR model for 128 randomly selected users of the Amazon-Clothing dataset. We found that although REMI [20] can represent multiple interests of a user, the resulting interest representations are highly similar to one another, whereas HPCL4SR significantly reduces the similarity between a user’s interests. In other words, our model better represents distinct user interests and increases the diversity between them.

Fig. 1

Similarity between the multiple interests of users, computed over 128 randomly selected users. The horizontal axis represents the range of similarity values, and the vertical axis represents the percentage of similarity values in the corresponding interval

Concretely, we propose a novel multi-interest sequence recommendation framework named HPCL4SR. HPCL4SR models users’ high-level preference interests by constructing a global graph of categories and feeds them as positive examples into contrastive learning to optimize the multiple interest representations of users. Specifically, based on the item sequences of all users, the categories of items are used to construct a global graph for high-level preference interest modeling. In this process, to alleviate the imbalance in category interactions, attention weights are used to reconstruct the adjacency matrix, and multi-hop aggregation is performed over categories that are not directly connected, which reduces the sparsity of the interaction matrix and strengthens the correlation between categories. For the user’s item sequence, the context information of the sequence is obtained by encoding item positions and fusing attention weights. Capsule networks are then used to learn the item sequence information for low-level preference interest modeling. To further enhance the item representations, the model parameters are additionally optimized using the category information of items as supervised labels. We then integrate the high-level and low-level preference interests to generate multiple interest features for each user. Contrastive learning maximizes the similarity between related samples and minimizes the similarity between unrelated samples. This paper draws on this idea to distinguish the multiple interests of a user, but does not treat the interests as completely unrelated, so that the hidden correlation between interests is preserved. Therefore, for each low-level preference interest we use the corresponding feature fused with the high-level preference interest as the positive example, and the other preference interests as negative examples, to learn the differences between user interests.

To summarize, the contributions of this paper are as follows:

  • We propose a novel multi-interest sequence recommendation framework (HPCL4SR), which addresses the inability of existing methods to represent the differences among users’ multiple interests.

  • We construct a global graph based on the category information of items to model the user’s high-level preference interest, and use it as the positive example in contrastive learning, which proves effective in our experiments.

  • We conduct extensive experiments on three real-world datasets to verify the effectiveness of HPCL4SR. Further analysis demonstrates that the proposed method models the diversity of users’ multiple interests more reasonably.

2 Related work

This paper uses category information as positive examples in contrastive learning to address the lack of difference among multiple interests in sequence recommendation. Therefore, this section briefly reviews representative work relevant to ours on Multi-Interest Sequence Recommendation, Large Language Models for Recommendation, and Contrastive Learning.

2.1 Multi-interest sequence recommendation

In practical scenarios, users’ historical behavior exhibits complex interaction patterns, and modeling interests as a single vector is not sufficient to accurately reflect users’ true multiple preference interests. Therefore, sequential recommendation models based on multiple interests have become more important and practical.

MIND [17] proposes a multi-interest extractor layer based on the capsule routing mechanism, which is applicable to clustering historical behaviors and extracting diverse interests. SDM [21] uses a multi-head attention mechanism when encoding behavior sequences to capture multiple interests of the user. SIN [22] adaptively infers user interaction interests from a large interest pool and outputs multiple interest embeddings, and then uses the attention weights of items to generate the interest embeddings that best match user characteristics. ComiRec [18] uses multi-interest routing based on multi-head attention to capture multiple interests of users and introduces controllable factors to achieve diverse recommendations. PIMI [23] models the periodic features of temporal information between user behaviors and the interactive features between sequence items, respectively, and uses their representations to describe users’ multiple interests. DuoRec [24] designs a contrastive regularization to reshape the distribution of sequence representations and selects sequences with the same target item as hard positive samples, alleviating the problem of representation degeneration in multi-interest sequence recommendation tasks. UMI [25] argues that the interests of a user are not only reflected in their historical behavior but also inherently regulated by profile information. Therefore, user profiles are introduced as a source of multi-interest features. REMI [20] first mitigates the problem of easy negatives with an ideal interest-aware hard negative sampling distribution and an approximation method that achieves this goal at a negligible computational cost. REMI also incorporates a novel routing regularization to avoid routing collapse and further improve the modeling capacity of multi-interest models.

2.2 Large language models for recommendation

Recent years have witnessed the wide adoption of large language models (LLMs) in different fields, especially natural language processing and computer vision. Such a trend can also be observed in recommendation systems (RS). However, due to the huge number of items in real-world systems, traditional RS usually adopts a two-stage paradigm consisting of a matching stage, which extracts a small subset of items from the extensive corpus with lightweight models to keep computational costs low, and a ranking stage, which utilizes more sophisticated models to rerank the retrieved items. Advanced recommendation algorithms are therefore applied not to all items but only to a few hundred of them [26]. Consequently, existing LLM-based methods (e.g., ChatGPT) [27, 28, 29] focus on the ranking stage. We focus on improving the effectiveness of the matching stage, which serves as a crucial foundation for recommendation systems. To our knowledge, there is currently no research on applying LLMs in the matching stage. Nevertheless, in the experimental section, we attempt to analyze the performance of ChatGPT (ChatGPT 3.5-Turbo-1106 & ChatGPT 4-Turbo) in the matching stage.

2.3 Contrastive learning

Contrastive learning has been widely applied in the field of computer vision. Methods such as CPC [30, 31] and DIM [32] feed encodings of the same image at different scales as positive samples, while MoCo [33], SimSiam [34], CaCo [35] and other methods use multiple augmentations of an image as positive samples for contrastive learning. In the field of text information processing, some studies use different data-transforming methods or strategies, such as dropout and masking, to change the parameters and structure of the encoder and improve the model’s sentence representation ability [36, 37, 38]. In sequence recommendation systems, contrastive learning is mainly introduced to address sparse user-item interactions and noise. Researchers improve recommendation performance by designing auxiliary tasks or loss functions [39]. CBiT [40] combines the cloze task mask and the dropout mask to generate high-quality positive samples and perform multi-pair contrastive learning. ICLRec [39] models user intentions through clustering of item sequences, maximizing the agreement between a view of the sequence and its corresponding intentions to improve recommendation performance. CL4SRec [41] employs contrastive learning to learn consistent perception enhancement representations from sequential pattern encoding and global collaborative relationship modeling.

Although researchers have tried to describe users’ various interests in different ways, they rarely consider the diversity of those interests. In the worst case, all interest capsules have the same meaning, or all items activate the same interest capsule, which makes it difficult to express multiple interests. ComiRec [18] uses controllable factors to recommend diverse user interests, but the paper also mentions that increasing diversity can decrease recall. REMI [20] observes that the interests tend to over-focus on single items in the behavior sequence, which impacts the expressiveness of multi-interest representations. It introduces a variance regularizer on the routing weights to eliminate sparsity and effectively address the problem. MIRACLE [19] forces interest capsules to be orthogonal, which explicitly provides each user with K unrelated interests. However, such K interests can cause unnecessary item recommendations, which goes against the common sense that there may be implicit correlations between interests. Therefore, it is necessary and meaningful for multi-interest sequence recommendation to preserve implicit correlation information while ensuring the difference between interests. We attempt to solve this problem through contrastive learning, which distinguishes interests through self-supervised learning of data features while maintaining the correlation information between the interest representations.

3 Problem formulation

Assume U denotes a set of users, X denotes a set of items, and C is a set of categories. Each item \(x_{i}\) has its corresponding category \(c_{i}\). Given a user \(u \in U\), we have his/her chronological item interaction sequence \(S_{x}^{u}=\left\{ x_{1}^{u}, x_{2}^{u}, \ldots , x_{N}^{u}\right\} \) and a corresponding category interaction sequence \(S_{c}^{u}=\left\{ c_{1}^{u}, c_{2}^{u}, \ldots , c_{N}^{u}\right\} \), where \(x_{t}^{u} \in X\) and \(c_{t}^{u} \in C\) represent the item and its category that user u interacted with at time step t, respectively, and N is the maximum sequence length. The candidate matching stage in RS aims to efficiently retrieve a subset of items the user is likely to interact with from the huge item corpus X.

Fig. 2

An Overview of Multi-interest Sequential Recommendation (HPCL4SR) framework

Fig. 3

Sparsity analysis on Amazon-Clothing and Tmall-Buy datasets. The horizontal axis represents 1000 randomly selected items in the dataset, and the vertical axis represents the number of times items interact in the dataset

4 Method

In this section, we propose the High-level Preferences as positive examples in Contrastive Learning for multi-interest Sequence Recommendation framework (HPCL4SR), as shown in Figure 2. There are three parts: high-level preference interest extraction module, low-level preference interest extraction module, and multi-interest contrastive learning module.

4.1 High-level preference interest extraction module

Experiments in numerous applications of contrastive learning have shown that selecting good positive and negative samples is the key to its effectiveness. In sequence recommendation tasks, most models construct contrastive samples with operations such as sequence pruning, dropout, and masking. As shown in Figure 3, the user’s historical interactions are extremely sparse, so such operations cannot fully represent the user’s interests and may even introduce errors. Many studies have shown that incorporating side information (user profile, category, brand, description, price, position, rating, etc.) into sequence recommendation captures user preference information better [25, 42, 43]. In real scenarios, item category information is the easiest to obtain and is a high-level conceptual representation of the item. Therefore, in this paper, we use item category information to build contrastive samples. However, even though the number of categories is much smaller than the number of items, the interaction between categories is still sparse. The proposed method therefore does not directly take the category sequence corresponding to the item sequence as the user’s high-level preference, but instead models the user’s high-level preference interest by constructing a category global graph, which captures more correlation information about the user’s preference interests.

For a user’s category sequence \(S_{c}^{u}= \{c_{1}^{u}, c_{2}^{u}, \ldots , c_{N}^{u}\}\), we count the number of interactions between \(c_{i}^{u}\) and \(c_{j}^{u}\). A category global graph (\(A_{1}\)) is then constructed from the historical category sequences \(\{S_{c}^{1}, S_{c}^{2}, \ldots , S_{c}^{|U|}\}\) of all users, where the initial weight of the edge between two nodes is the total number of interactions \(a_{i j}\) between the two categories.

However, such a category global graph (\(A_{1}\)) still has two obvious problems: (1) By analyzing the interaction frequency of categories, it was found that due to the significant difference in the number of items contained in different categories, there is an imbalance in the interaction between categories, which will lead to recommendation results biased towards items in popular categories. (2) The method of constructing a graph through sequential interaction only considers the relationship between adjacent item categories, while ignoring the interaction between non directly adjacent categories. In fact, certain categories that are not adjacent often appear together in the user’s sequence.

In order to alleviate the imbalance of category interaction and avoid the impact of popular items on recommendation results, the adjacency matrix is redefined as follows:

$$\begin{aligned} A_{2}(i,j)=\frac{a_{i j}}{\sqrt{|a_{i}+1||a_{j}+1|}} \end{aligned}$$
(1)

where \(a_{i j}\) denotes the number of interactions between category \(c_{i}\) and category \(c_{j}\), \(a_{i}\) is the number of interactions of category \(c_{i}\) with all other categories, and \(a_{j}\) is defined analogously.
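As an illustration, the construction of \(A_{1}\) and the normalization in (1) can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names `build_category_graph` and `normalize_adjacency` are hypothetical, an interaction is counted between adjacent categories in a sequence, the graph is treated as undirected, and self-transitions are skipped.

```python
import numpy as np

def build_category_graph(category_sequences, num_categories):
    """Count adjacent-category co-occurrences over all user sequences (A_1)."""
    a = np.zeros((num_categories, num_categories))
    for seq in category_sequences:
        for c_prev, c_next in zip(seq[:-1], seq[1:]):
            if c_prev != c_next:          # assumption: ignore self-transitions
                a[c_prev, c_next] += 1
                a[c_next, c_prev] += 1    # assumption: undirected graph
    return a

def normalize_adjacency(a):
    """Eq. (1): A_2(i, j) = a_ij / sqrt((a_i + 1) * (a_j + 1))."""
    totals = a.sum(axis=1)                # a_i: interactions of category i with others
    scale = np.sqrt(totals + 1.0)
    return a / np.outer(scale, scale)

# Toy category sequences of three users over four categories.
sequences = [[0, 1, 2], [0, 1, 1, 3], [0, 1]]
A1 = build_category_graph(sequences, num_categories=4)
A2 = normalize_adjacency(A1)
```

In this toy example the frequent pair (0, 1) is damped more strongly than the rare pair (1, 2), which is the intended effect of the normalization.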

To learn the correlation between non-adjacent categories, we adapt the Multi-hop Attention Diffusion [44] method to aggregate information further. The attention score of multi-hop neighbors is calculated by:

$$\begin{aligned} A=\sum _{i=0}^{\infty } \theta _{i} A_{2}^{i} \end{aligned}$$
(2)

where \(\sum _{i=0}^{\infty } \theta _{i}=1\) with \(\theta _{i}>0\), \(\theta _{i}\) is the attention decay factor satisfying \(\theta _{i} > \theta _{i+1}\), and i is the power of the adjacency matrix \(A_{2}\), which corresponds to the length of the diffusion relation path in the graph.
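In practice the infinite sum in (2) has to be truncated. The following sketch shows one way to do so. The geometric decay \(\theta _{i} \propto \alpha (1-\alpha )^{i}\), the value of `alpha`, the number of hops, and the renormalization of the truncated weights are all assumptions made for illustration and are not stated in the text above.

```python
import numpy as np

def attention_diffusion(a2, alpha=0.15, hops=3):
    """Truncated Eq. (2): A = sum_{i=0}^{hops} theta_i * A_2^i."""
    theta = alpha * (1.0 - alpha) ** np.arange(hops + 1)  # positive and decreasing
    theta = theta / theta.sum()       # renormalise so the truncated weights sum to 1
    power = np.eye(a2.shape[0])       # A_2^0
    diffused = np.zeros_like(a2, dtype=float)
    for weight in theta:
        diffused += weight * power
        power = power @ a2            # next power of A_2
    return diffused, theta

# Path graph 0 - 1 - 2: categories 0 and 2 are not directly connected.
A2 = np.array([[0.0, 0.5, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 0.5, 0.0]])
A, theta = attention_diffusion(A2)
```

After diffusion, the entry for the non-adjacent pair (0, 2) becomes positive because the two categories are linked by a two-hop path.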

Assume \(H^{(0)} \in {R}^{|C| \times d}\) denotes the initial embedding matrix of the category, and d represents the dimension of the node embedding. We use GCN to aggregate the features of neighbors as a new representation of the target node, and introduce residual connections in this process. The message-passing process is as follows:

$$\begin{aligned} {H}^{(l)}= W(A{H}^{(l-1)})+ H^{(l-1)} \end{aligned}$$
(3)

where \(l = 1, \ldots , L\) indexes the GCN layers, L is the number of layers, and W is a trainable parameter matrix.

The final graph representation \(\hat{H} \in {R}^{|C| \times d}\) is obtained by:

$$\begin{aligned} \hat{H} = \frac{1}{L} \sum _{l=0}^{L} {H}^{(l)} \end{aligned}$$
(4)
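The message passing in (3) and the layer aggregation in (4) can be sketched as below. This is an illustrative sketch, not the original implementation: the function name is hypothetical, one weight matrix per layer is assumed, the weights are random stand-ins for trained parameters, and (4) is read as a plain (positively signed) sum over layers scaled by \(1/L\).

```python
import numpy as np

def gcn_layer_average(a, h0, weights):
    """Apply Eq. (3) once per weight matrix, then aggregate layers as in Eq. (4)."""
    layers = [h0]                      # H^(0)
    h = h0
    for w in weights:
        h = (a @ h) @ w + h            # aggregate neighbours, transform, add residual
        layers.append(h)
    return np.sum(layers, axis=0) / len(weights)   # (1 / L) * sum_{l=0}^{L} H^(l)

rng = np.random.default_rng(0)
num_categories, d = 5, 4
A = rng.random((num_categories, num_categories))
A = (A + A.T) / 2                                  # symmetric stand-in for the diffused graph
H0 = rng.normal(size=(num_categories, d))          # initial category embeddings
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(2)]   # L = 2 layers
H_hat = gcn_layer_average(A, H0, Ws)               # shape (|C|, d)
```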

Based on the category information of user historical interactions, category node embedding representation Hg is selected from the graph:

$$\begin{aligned} Hg= \text {select} (\hat{H})[n, :], \ n=1, \ldots , N \end{aligned}$$
(5)

where, \(Hg \in {R}^{N \times d}\), N is the sequence length of user interaction.

Finally, the user’s high-level preference interest vector \(Q_{u}\) is calculated as follows:

$$\begin{aligned} Q_{u}=WHg \end{aligned}$$
(6)

where, \(Q_{u} \in \mathbb {R}^{K \times d}\), K is the number of preference interests.
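The selection in (5) and the projection in (6) amount to a row lookup followed by a matrix product. The sketch below assumes that the rows are looked up by the category ids of the user’s sequence and that W has shape \(K \times N\), which is the shape required for \(Q_{u} \in \mathbb {R}^{K \times d}\). The function name and the toy values are hypothetical.

```python
import numpy as np

def high_level_interests(h_hat, category_ids, w):
    """Eq. (5): look up category rows; Eq. (6): project N rows to K interests."""
    hg = h_hat[category_ids]      # (N, d) embeddings of the user's categories
    return w @ hg                 # (K, N) @ (N, d) -> (K, d)

H_hat = np.arange(12, dtype=float).reshape(4, 3)    # |C| = 4 categories, d = 3
user_categories = [2, 0, 2]                         # N = 3 interactions
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5]])                     # K = 2 interests
Q_u = high_level_interests(H_hat, user_categories, W)
```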

4.2 Low-level preference interest extraction module

In sequence recommendation, positional information can explicitly reflect contextual information between items. Therefore, in this paper, the attention mechanism is first used to encode sequence information:

$$\begin{aligned} X_{i} = E_{i}^{emb} + E_{i}^{pos} \end{aligned}$$
(7)

where \(E_{i}^{emb}\) and \(E_{i}^{pos}\) are the embedding of the i-th item and its positional embedding, respectively, and \(X_{i}\) is the item embedding representation with sequence position information.

$$\begin{aligned} \alpha _{i j} = \frac{\exp (X_{i} X_{j}^{T})}{\sum _{n=1}^N \exp (X_{i} X_{n}^{T})} \end{aligned}$$
(8)

where \(\alpha _{i j}\) is the attention weight of item j to item i. We then use a neural network to make each item in the sequence perceive the entire contextual information.

$$\begin{aligned} X_{i} = W (X_{i} + {\sum _{j=1}^N \alpha _{i j} X_{j}}) \end{aligned}$$
(9)
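Equations (7) to (9) can be sketched together as follows. The function names are hypothetical, W is assumed to be a \(d \times d\) matrix applied to each row, and the random embeddings stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_sequence(item_emb, pos_emb, w):
    """Eq. (7)-(9): add positions, attend over the sequence, transform."""
    x = item_emb + pos_emb                    # Eq. (7)
    alpha = softmax(x @ x.T, axis=1)          # Eq. (8): each row sums to 1
    context = alpha @ x                       # sum_j alpha_ij * X_j
    return (x + context) @ w, alpha           # Eq. (9)

rng = np.random.default_rng(0)
N, d = 5, 8
E_emb = rng.normal(size=(N, d))               # item embeddings
E_pos = rng.normal(size=(N, d))               # positional embeddings
W = rng.normal(scale=0.1, size=(d, d))
X_ctx, alpha = encode_sequence(E_emb, E_pos, W)
```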

In multi-interest recommendation tasks, the effectiveness of the Capsule Network [45] has been verified, so we directly adopt it to extract the user’s low-level preference interests \(P_{u} \in \mathbb {R}^{K \times d}\).

$$\begin{aligned} P_{u} = CapsNet([X_{1}, X_{2}, \ldots , X_{N}]) \end{aligned}$$
(10)
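The routing details of CapsNet in (10) are not given here. For orientation only, the following sketch shows generic dynamic routing with a squash nonlinearity. The per-capsule transformation matrices, the number of routing iterations, and the zero-initialized routing logits are assumptions and may differ from the variant actually used in HPCL4SR.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink each capsule vector so that its norm lies in [0, 1)."""
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(x, transform, iterations=3):
    """Route N item vectors x (N, d) into K interest capsules (K, d)."""
    u_hat = np.einsum("nd,kde->kne", x, transform)   # per-capsule item predictions
    logits = np.zeros(u_hat.shape[:2])               # routing logits, shape (K, N)
    for _ in range(iterations):
        shifted = logits - logits.max(axis=0, keepdims=True)
        coupling = np.exp(shifted) / np.exp(shifted).sum(axis=0, keepdims=True)
        capsules = squash(np.einsum("kn,kne->ke", coupling, u_hat))
        logits = logits + np.einsum("kne,ke->kn", u_hat, capsules)  # agreement update
    return capsules

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                  # N = 6 items, d = 8
T = rng.normal(scale=0.1, size=(4, 8, 8))    # K = 4 stand-in transformation matrices
P_u = dynamic_routing(X, T)                  # shape (K, d)
```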

In addition, in order to enable the network model to learn the differences between items, category can be used as a good supervised label. Specifically, we use the first layer information of the Capsule Network as the feature \(Z=\{z_{1}, z_{2}, \ldots , z_{N}\}\) of the item, and use the fully connected layer as the classifier. The output result \(\hat{Z} \in \mathbb {R}^{N \times d}\) can be represented as follows:

$$\begin{aligned} \hat{Z} = softmax (W Z + b) \end{aligned}$$
(11)

Category \(S_{c} = \{c_{1}, c_{2}, \ldots , c_{N}\}\) is used as a label, and the cross entropy loss function calculates the loss of the classifier:

$$\begin{aligned} L_{class}= -\sum _{n=1}^{N}\left( c_{n} \log \hat{z}_{n}+\left( 1-c_{n}\right) \log \left( 1-\hat{z}_{n}\right) \right) \end{aligned}$$
(12)
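The classifier in (11) and the loss in (12) can be sketched as below. The sketch assumes that the classifier outputs one probability per category and that the labels \(c_{n}\) are one-hot encoded, so that (12) is evaluated elementwise. The function names and the clipping constant are hypothetical.

```python
import numpy as np

def category_classifier(z, w, b):
    """Eq. (11): fully connected layer followed by a softmax over categories."""
    logits = z @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def classification_loss(probs, labels, eps=1e-12):
    """Eq. (12) with one-hot labels, summed over items and categories."""
    one_hot = np.eye(probs.shape[1])[labels]
    p = np.clip(probs, eps, 1.0 - eps)        # avoid log(0)
    return -np.sum(one_hot * np.log(p) + (1.0 - one_hot) * np.log(1.0 - p))

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 8))                   # N = 5 item features, d = 8
W = rng.normal(scale=0.1, size=(8, 3))        # 3 categories
b = np.zeros(3)
labels = np.array([0, 2, 1, 1, 0])            # category id of each item
Z_hat = category_classifier(Z, W, b)
L_class = classification_loss(Z_hat, labels)
```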

4.3 Multi-interest contrastive learning module

Differentiating between interests is the key to multi-interest sequence recommendation. Existing methods rarely consider the differences between interest capsules, so in extreme cases all interest capsules of a user sequence have the same meaning. This paper uses a contrastive learning approach to distinguish between interests while preserving their implicit correlation information. The sequence information is adaptively fused to represent the true interests of the user.

We assume that the high-level preference interests of the category sequence (\(Q_{u}\)) are consistent with the low-level preference interests of the item sequence (\(P_{u}\)). We use fully connected layers to adaptively fuse them, and obtain the final multi-interest representation (\(M_{u}\)) of the user. Finally, \(P_{u}\) and \(M_{u}\) are selected as two views for contrastive learning.

$$\begin{aligned} M_{u} = W \left( cat[Q_{u},P_{u}]\right) \end{aligned}$$
(13)

where \(cat[\cdot , \cdot ]\) denotes concatenation and \(M_{u} \in \mathbb {R}^{K \times d}\).
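A minimal sketch of the fusion in (13) is given below. It assumes that \(Q_{u}\) and \(P_{u}\) are concatenated along the feature axis and mapped back to d dimensions by a single fully connected layer. The function name, the bias term, and the toy weights are hypothetical.

```python
import numpy as np

def fuse_interests(q_u, p_u, w, b):
    """Eq. (13): concatenate along features, then apply a linear layer."""
    stacked = np.concatenate([q_u, p_u], axis=1)   # (K, 2d)
    return stacked @ w + b                          # (K, d)

Q_u = np.array([[1.0, 2.0], [3.0, 4.0]])            # K = 2, d = 2
P_u = np.array([[5.0, 6.0], [7.0, 8.0]])
W = np.vstack([0.5 * np.eye(2), 0.5 * np.eye(2)])   # toy weights: average the two inputs
M_u = fuse_interests(Q_u, P_u, W, np.zeros(2))
```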

Most existing contrastive learning methods are based on InfoNCE:

$$\begin{aligned} L_{cl}= - \log \frac{e^{ \left( \textbf{h}_i \cdot \textbf{h}_{i^*} / \tau \right) }}{e^{\left( \textbf{h}_i \cdot \textbf{h}_{i^*} / \tau \right) } + \sum _{j \ne i^*} e^{ \left( \textbf{h}_i \cdot \textbf{h}_{j} / \tau \right) }} \end{aligned}$$
(14)

where \(\tau \) is a temperature hyperparameter, (\(\textbf{h}_i\) , \(\textbf{h}_{i^*}\)) is a positive pair, and (\(\textbf{h}_i\) , \(\textbf{h}_{j \ne i^*}\)) is a negative pair.

However, due to the lack of a decision margin, a small perturbation around the decision boundary may lead to an incorrect decision. To overcome this problem, inspired by ArcFace [46], we propose a new training objective for multi-interest contrastive learning by adding an additive angular margin m between the positive pair \(\textbf{e}_{i}\) and \(\textbf{e}_{i^*}\). Therefore, (14) can be rewritten as follows:

$$\begin{aligned} L_{cl}=-\log \frac{e^{\cos \left( \theta _{i, i^*}+m\right) / \tau }}{e^{\cos \left( \theta _{i, i^*}+m\right) / \tau }+\sum _{j \ne i^*} e^{\cos \left( \theta _{i, j}\right) / \tau }} \end{aligned}$$
(15)

where m is the additive angular margin and the angle \(\theta _{i, j}\) is defined as

$$\begin{aligned} \theta _{i, j}=\arccos \left( \frac{e_{i}^{\top } e_{j}}{\left\| e_{i}\right\| *\left\| e_{j}\right\| }\right) \end{aligned}$$
(16)
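For a single anchor, (15) and (16) can be sketched as follows. The function names are hypothetical, and the margin is passed in radians under the assumption that the reported values of m are expressed in degrees.

```python
import numpy as np

def angle(a, b):
    """Eq. (16): angle between two vectors."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))   # clip guards against rounding errors

def margin_info_nce(anchor, positive, negatives, margin, tau):
    """Eq. (15): InfoNCE with an additive angular margin on the positive pair."""
    pos = np.exp(np.cos(angle(anchor, positive) + margin) / tau)
    neg = sum(np.exp(np.cos(angle(anchor, n)) / tau) for n in negatives)
    return -np.log(pos / (pos + neg))

anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.1])
negatives = [np.array([0.0, 1.0])]
loss_plain = margin_info_nce(anchor, positive, negatives, margin=0.0, tau=0.05)
loss_margin = margin_info_nce(anchor, positive, negatives,
                              margin=np.deg2rad(10.0), tau=0.05)
```

With the same inputs, the margin lowers the positive similarity and therefore raises the loss, so the positive pair has to be pulled closer than without a margin to reach the same loss value.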

To some extent, more negative samples lead to better performance in contrastive learning. In this paper, we set {\(i \in M_{u}^i\), \( i^* \in P_{u}^i\)} or {\( i \in P_{u}^i\), \(i^* \in M_{u}^i\)}, \({j \in \{P_{u} \cup M_{u}\}}\). In this way, for a sequence, any i has (\(2K-2\)) negative samples, and the contrastive loss function for multi-interest sequence recommendation is:

$$\begin{aligned} L_{mulCL}= - \sum _{i} \log \frac{e^{\cos \left( \theta _{i, i^*}+m\right) / \tau }}{\sum _{j \ne i} e^{\cos \left( \theta _{i, j}\right) / \tau }} \end{aligned}$$
(17)
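The pairing of the two views can be sketched as below. Each of the 2K vectors in \(P_{u}\) and \(M_{u}\) serves as an anchor, its positive is the vector with the same interest index in the other view, and the remaining \(2K-2\) vectors are negatives. Note one assumption: the denominator here follows the form of (15), that is, the margin-adjusted positive term plus the \(2K-2\) negatives, whereas (17) as printed sums over all \(j \ne i\). The function name and the margin in radians are also hypothetical choices.

```python
import numpy as np

def multi_interest_contrastive_loss(p_u, m_u, margin, tau):
    """Sum of margin-based contrastive losses over all 2K interest vectors."""
    k = p_u.shape[0]
    e = np.vstack([p_u, m_u])                         # 2K vectors from both views
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    theta = np.arccos(np.clip(e @ e.T, -1.0, 1.0))    # pairwise angles
    loss = 0.0
    for i in range(2 * k):
        i_star = (i + k) % (2 * k)                    # same interest index, other view
        neg_idx = [j for j in range(2 * k) if j not in (i, i_star)]   # 2K - 2 negatives
        pos = np.exp(np.cos(theta[i, i_star] + margin) / tau)
        neg = np.exp(np.cos(theta[i, neg_idx]) / tau).sum()
        loss += -np.log(pos / (pos + neg))
    return loss

margin, tau = np.deg2rad(10.0), 0.05
separated = np.eye(4)                                 # four mutually orthogonal interests
collapsed = np.ones((4, 4))                           # four identical interests
loss_separated = multi_interest_contrastive_loss(separated, separated, margin, tau)
loss_collapsed = multi_interest_contrastive_loss(collapsed, collapsed, margin, tau)
```

In this toy comparison the collapsed interests receive a much larger loss than the well-separated ones, which is the behavior the module relies on to keep interests distinct.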

4.4 Model training

For the given target item embedding y, we use an argmax operator to obtain the interest that is the most related to the target item through (18):

$$\begin{aligned} m_{u}=M_{u}\left[ :, {\text {argmax}}\left( M_{u}^{\top } y\right) \right] \end{aligned}$$
(18)

The loss function between the predicted results of the model and the given target is:

$$\begin{aligned} L_{rec}= - \log \frac{\exp (m_{u} y^T)}{\sum _{j \in X^{'}} \exp (m_{u} y_j^T)} \end{aligned}$$
(19)

where \(X^{'}\) is the set of items obtained through the sampled softmax objective [47].
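The interest selection in (18) and the loss in (19) can be sketched as follows. The sketch stores \(M_{u}\) with one interest per row (\(K \times d\)), so the selection indexes a row rather than a column, and it assumes that the candidate set \(X^{'}\) consists of the target item plus the sampled items. The function names are hypothetical.

```python
import numpy as np

def select_interest(m_u, y):
    """Eq. (18): pick the interest vector with the highest score for target y."""
    scores = m_u @ y                  # one score per interest, shape (K,)
    return m_u[np.argmax(scores)]

def sampled_softmax_loss(m_sel, target_emb, sampled_embs):
    """Eq. (19): negative log-probability of the target among the candidates."""
    candidates = np.vstack([target_emb, sampled_embs])   # target is row 0
    logits = candidates @ m_sel
    logits = logits - logits.max()                       # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

M_u = np.array([[1.0, 0.0],
                [0.0, 1.0]])          # K = 2 interests, d = 2
y = np.array([0.0, 2.0])              # target item embedding
sampled = np.zeros((2, 2))            # two sampled item embeddings
m_sel = select_interest(M_u, y)
L_rec = sampled_softmax_loss(m_sel, y, sampled)
```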

The joint loss is defined as a linear combination of these three losses:

$$\begin{aligned} L= L_{rec} + \lambda _1 L_{class} + \lambda _2 L_{mulCL} \end{aligned}$$
(20)

where \(\lambda _1\) and \(\lambda _2\) are the hyperparameters to control the impact of different losses.

5 Experiments

5.1 Experimental settings

Dataset We consider three real-world e-commerce datasets. The specific statistics are shown in Table 1.

  • Amazon-Clothing (Footnote 1). The Amazon Review Dataset is a classic dataset commonly used in recommender systems, which records product reviews. We use the Clothing, Shoes and Jewelry subset in our experiments.

  • Tmall-Buy (Footnote 2). The Tmall dataset is collected by Tmall.com, an online shopping website. It contains users’ shopping history over about six months. We retain users’ purchase behaviors as a subset for experiments.

  • Tafeng (Footnote 3). The Tafeng dataset collects user transaction behavior data from November 2000 to February 2001. The dataset covers everything from food and office supplies to furniture.

Baselines We compare our model with some sequential recommendation methods.

  • GRU4Rec [48]. GRU4Rec is a representative recommendation model that first introduces recurrent neural networks into sequence recommendation.

  • MIND [17]. MIND is one of the first frameworks to model users’ multiple interests based on dynamic routing algorithms.

  • ComiRec [18]. ComiRec is a representative baseline for the multi-interest recommendation. It uses two methods to represent user interests: attention mechanism and dynamic routing.

  • PIMI [23]. Considering the limitations of ComiRec, PIMI introduces the study of periodicity and interactivity of item sequences, capturing both global and local item features.

  • REMI [20]. REMI consists of an Interest-aware Hard Negative mining strategy and a Routing Regularization method to solve the issues of increased easy negatives and routing collapse during the training process.

Evaluation Metrics We use three common accuracy metrics for performance evaluation: Recall, Normalized Discounted Cumulative Gain (NDCG), and Hit Rate (HR). The metrics are computed on the top 20 or top 50 recommended candidates (e.g., Recall@20). For all three metrics, higher scores indicate better recommendation performance.
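The exact formulas of the three metrics are not spelled out here. The sketch below uses common per-user definitions with binary relevance, which is an assumption, and the function names are hypothetical. Scores would then be averaged over test users.

```python
import math

def recall_at_k(ranked, truth, k):
    """Fraction of the ground-truth items that appear in the top-k list."""
    hits = sum(1 for item in ranked[:k] if item in truth)
    return hits / len(truth)

def hit_rate_at_k(ranked, truth, k):
    """1 if at least one ground-truth item appears in the top-k list, else 0."""
    return 1.0 if any(item in truth for item in ranked[:k]) else 0.0

def ndcg_at_k(ranked, truth, k):
    """DCG of the top-k list divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked[:k]) if item in truth)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(truth))))
    return dcg / ideal if ideal > 0 else 0.0

ranked_items = [5, 3, 9, 1]      # recommended item ids, best first
ground_truth = {3, 7}            # items the user actually interacted with
recall = recall_at_k(ranked_items, ground_truth, k=3)
hit = hit_rate_at_k(ranked_items, ground_truth, k=3)
ndcg = ndcg_at_k(ranked_items, ground_truth, k=3)
```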

Implementation Details For each dataset, we partition all users into training, validation, and test sets with a ratio of 8:1:1. The maximum sequence length is set to 30 for the Amazon-Clothing and Tafeng datasets and to 20 for the Tmall-Buy dataset. User sequences longer than the maximum length are truncated, and shorter sequences are padded with 0. We filter out users and items with fewer than 12 interactions to guarantee the length of recent sequences. Unless otherwise noted, all parameters are set as follows: following [20], the learning rate is 0.001, the mini-batch size is 128, the embedding size is 64, the interest number is K = 4, and Adam is used as the optimizer. We analyze the effects of the other hyperparameters in detail in Section 5.4 and ultimately set their values to \(\lambda _1\) = 0.1, \(\lambda _2\) = 1, \(\tau \) = 0.05, and m = 10.
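The truncation and padding step can be sketched as follows. Keeping the most recent interactions and padding on the right are assumptions, since the text only states that long sequences are truncated and short ones are filled with 0. The function name is hypothetical.

```python
def pad_or_truncate(sequence, max_len, pad_id=0):
    """Return a list of exactly max_len ids."""
    recent = sequence[-max_len:]                        # assumption: keep the latest items
    return recent + [pad_id] * (max_len - len(recent))  # assumption: pad on the right

short_seq = pad_or_truncate([1, 2, 3], max_len=5)
long_seq = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_len=5)
```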

Table 1 Statistics of the three datasets

5.2 Performance evaluation

To demonstrate the recommendation performance of our model HPCL4SR, we compare it with other multi-interest models. The experimental results of three datasets are presented in Table 2. We have the following observations.

Table 2 Model comparison results on three benchmark datasets (%)

First, although the three datasets have different characteristics, HPCL4SR consistently yields the best performance, indicating the robustness of our model. By modeling users’ diverse interests through contrastive learning, even on the Tafeng dataset with limited user interests, we can still make full use of the limited information to capture user interests and make optimal recommendations. This demonstrates the effectiveness of a finer-grained characterization of the user contained in a user sequence. Overall, we model the high-level and low-level preference interests of user sequences and distinguish the feature representations of interests through contrastive learning, which characterizes users more finely. Considering the meaning behind different user behaviors, that is, the actual use of each interacted item, we mine the information in the user behavior sequence to understand users’ interests from the perspective of item usage, which helps to model users more accurately.

Next, judging from the performance of the sequential recommendation models, PIMI, UMI, and HPCL4SR outperform most other models, such as ComiRec and MIND, on the three datasets, indicating that adding side information is beneficial for user modeling.

Finally, on the Tafeng dataset with a small number of items (10,176), the model that uses a single vector to model user interests (GRU4Rec) outperforms simple multi-vector models (MIND and ComiRec). However, its performance is still worse than that of the PIMI model, which uses time information, and our proposed HPCL4SR model, which uses category information. Multi-interest models are better than the single-interest model on the Amazon-Clothing and Tmall-Buy datasets, which have many users and items. In addition, the results of the REMI model indicate that selecting high-quality negative samples can yield notable gains, but screening negative samples incurs a significant time cost. Overall, modeling users’ interests from different aspects is better than using only one vector to model users’ overall interests, because multi-interest models can provide users with more mixed recommendation results, thereby improving the accuracy of recommendations.

In addition, to better demonstrate the effectiveness of our method, we supplement the experimental results of CL4SRec [41] and DuoRec [24] on the sequence recommendation task. As shown in Table 3, the experiment demonstrates the effectiveness of using contrastive learning methods and also shows that modeling users with multiple interests in sequence recommendation tasks can improve recommendation performance.

Table 3 Model comparison results on the Amazon-Clothing dataset with purely contrastive learning-based sequence recommendation methods (%)

5.3 Ablation study

In this section, we use the Amazon-Clothing dataset to analyze the effectiveness of the components of HPCL4SR. First, we denote the variant of HPCL4SR that uses only item information as base, and the variant that additionally uses category as the supervised signal as base(w \(L_{class}\)). Then, after constructing a global graph based on categories to obtain the user’s high-level preference interest information, we try three methods of integrating the high-level preference interests (\(Q_ {u} \)) and low-level preference interests (\(P_ {u} \)): addition, multiplication, and adaptive fusion, denoted as HPCL4SR(w ’+’), HPCL4SR(w ’*’), and HPCL4SR, respectively. Finally, we analyze the contribution of the differences between interests to multi-interest sequence recommendation. We not only replace the multi-interest contrastive learning module with Capsule Regularization [19], denoted as HPCL4SR(w CR), but also analyze the impact of removing the additive angular margin m, denoted as HPCL4SR(w/o m). The experimental results on the Amazon-Clothing dataset are shown in Table 4. The table shows that category, as a supervisory signal, improves performance to some extent by optimizing the representation of items. When combined with \(Q_{u}\), the performance improvement is significant, especially with adaptive fusion (the MLP-based method). The experiment confirms that the difference between interests is the main factor affecting the performance of sequence recommendation, and the contrastive learning method shows greater advantages than Capsule Regularization because it distinguishes the differences between interests while preserving the correlation information between them.

5.4 Hyper-parameter study

\(\lambda _1\) and \(\lambda _2\) are the weighting hyperparameters of the joint loss function during training, and they directly affect the optimization of the model parameters. We select \(\lambda _1 \in \{0.01, 0.05, 0.1, 0.5, 1\}\) and \(\lambda _2 \in \{0.01, 0.1, 1, 5, 10\}\), and conduct experiments on three datasets using NDCG@50 as the evaluation metric. As shown on the left side of Figure 4, the best performance is achieved when \(\lambda _1=0.1\). This matches our intuition: using category as the supervisory signal for items is effective, but an excessive weight can overshadow the recommendation loss and reduce model performance. As shown on the right side of Figure 4, increasing the weight \(\lambda _2\) of the contrastive loss helps distinguish the differences between multiple interests, but if \(\lambda _2\) is too large, it can likewise mask the recommendation loss and weaken the model’s ability on the recommendation task. Therefore, we adopt the moderate value \(\lambda _2=1\).
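The masking effect described above can be made concrete with a small sketch. It assumes the joint loss is a weighted sum of a recommendation loss, a category classification loss, and a contrastive loss; this additive form, the function names, and the loss values are illustrative assumptions, not results from the experiments.

```python
# Search grids taken from the text.
LAMBDA_1_GRID = [0.01, 0.05, 0.1, 0.5, 1]
LAMBDA_2_GRID = [0.01, 0.1, 1, 5, 10]

def joint_loss(rec_loss, class_loss, cl_loss, lambda_1=0.1, lambda_2=1.0):
    """Assumed form: L = L_rec + lambda_1 * L_class + lambda_2 * L_cl."""
    return rec_loss + lambda_1 * class_loss + lambda_2 * cl_loss

def recommendation_share(rec_loss, class_loss, cl_loss, lambda_1, lambda_2):
    """Fraction of the joint loss contributed by the recommendation term."""
    total = joint_loss(rec_loss, class_loss, cl_loss, lambda_1, lambda_2)
    return rec_loss / total

# Made-up per-batch loss values, used only to show the trend.
rec, cls, cl = 2.0, 1.0, 0.5
for lambda_2 in LAMBDA_2_GRID:
    share = recommendation_share(rec, cls, cl, lambda_1=0.1, lambda_2=lambda_2)
    print(f"lambda_2={lambda_2:<5} recommendation share={share:.3f}")
```

As \(\lambda _2\) grows, the recommendation term accounts for a shrinking share of the total, so its gradients have less influence on the update, which is consistent with the drop in NDCG@50 reported for large weights.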

Table 4 Ablation study on Amazon-Clothing dataset
Fig. 4 Study on balance parameter \(\lambda _1\) and \(\lambda _2\). We show NDCG@50 on three datasets

The temperature \(\tau \) and the angular margin m in the multi-interest contrastive learning module affect its effectiveness. For \(\tau \), we vary \(\tau \) from 0.01 to 0.1 in steps of 0.01. The results are shown on the left side of Figure 5. On the Amazon-Clothing and Tmall-Buy datasets, the performance is best when \(\tau =0.05\), and on the Tafeng dataset, the result is best when \(\tau =0.03\) (however, the performance difference between it and \(\tau =0.05\) is small). Taking all factors into consideration, we choose \(\tau =0.05\) for all our experiments. For m, as shown on the right side of Figure 5, we select \(m \in \{0, 5, 10, 15, 20\}\). Although the performance is best on the Tmall-Buy dataset when \(m=15\), it is best on the other two datasets when \(m=10\). Therefore, we set \(m=10\) in the experiments.
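To illustrate how \(\tau \) and m interact, the following sketch implements a generic InfoNCE-style loss with an additive angular margin on the positive pair. It is an assumption-laden illustration rather than the module used in HPCL4SR: the function names are hypothetical, m is assumed to be given in degrees, and the choice of anchor, positive, and negative vectors is left abstract because this section does not define it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def margin_info_nce(anchor, positive, negatives, tau=0.05, margin_deg=10.0):
    """InfoNCE loss with an additive angular margin on the positive pair.

    The positive logit is cos(theta + m) / tau, where theta is the angle
    between anchor and positive; negative logits are cos(theta_neg) / tau.
    Assumes theta + m stays below 180 degrees.
    """
    cos_pos = max(-1.0, min(1.0, cosine(anchor, positive)))  # guard acos domain
    theta = math.acos(cos_pos)
    pos_logit = math.cos(theta + math.radians(margin_deg)) / tau
    logits = [pos_logit] + [cosine(anchor, n) / tau for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - pos_logit  # equals -log softmax(positive)

# Toy 2-D vectors (illustrative values only).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
for m in [0, 5, 10, 15, 20]:
    print(m, margin_info_nce(anchor, positive, negatives, tau=0.5, margin_deg=m))
```

In this sketch, a larger margin makes the positive pair harder to satisfy and so raises the loss for the same vectors, while a smaller \(\tau \) sharpens the softmax over the logits. This offers one way to read why moderate values of both parameters work best in Figure 5.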

5.5 Case study

We analyze the proposed model’s effectiveness in solving the multi-interest recommendation problem by examining its recommendation results. Because the Amazon-Clothing dataset contains detailed information such as items and item categories, while the Tmall-Buy dataset provides only numeric IDs, we present the recommendation results of both datasets at the item level.

Fig. 5 Study on balance parameter \(\tau \) and m. We show NDCG@50 on three datasets

Fig. 6 A case study on Amazon-Clothing dataset

Figure 6 shows the recommendation results of the proposed HPCL4SR model for one user behavior sequence. The figure shows that the user is mainly interested in baby boy suits and boy socks, but the PIMI model only recommends items related to men and women and fails to learn the interests of the two demand sides, baby boys and boys. HPCL4SR models both the high-level and low-level preference interests of users and diversifies the interests, so it recommends the “baby suit” and “socks” that the user wants. Moreover, in the list of recommended items given by the HPCL4SR model, the “baby suit” item that the user actually interacts with ranks higher; that is, the ranking quality of the recommendation list is higher. In addition, the HPCL4SR model not only learns the interests of baby boys but also captures the preferences of boys, and at the same time learns other categories such as socks and shoes.

Table 5 shows the Top-20 recommendation results of the PIMI and HPCL4SR models for the randomly selected user behavior sequence numbered 68079 in the Tmall-Buy dataset. As can be seen from the table, the HPCL4SR model correctly predicts two items that the user interacts with (their IDs are shown in bold). Compared with the PIMI model, item ID 31744 ranks higher in the recommendation list given by the HPCL4SR model. In this case, therefore, the HPCL4SR model outperforms the PIMI model.
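The link between “ranks higher” and ranking quality can be made concrete with NDCG@K, the metric used throughout the experiments. The sketch below uses the standard binary-relevance definition, which may differ in detail from the evaluation protocol actually used. The function name and the item identifiers are hypothetical and are not taken from Table 5.

```python
import math

def ndcg_at_k(ranked_items, relevant_items, k=50):
    """Binary-relevance NDCG@k for one user.

    ranked_items: recommended item IDs, best first.
    relevant_items: item IDs the user actually interacted with.
    """
    relevant = set(relevant_items)
    # A hit at 0-based position pos contributes 1 / log2(pos + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    # Ideal ranking places all relevant items at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Two hypothetical Top-3 lists containing the same target item "item_a".
list_high = ["item_a", "item_b", "item_c"]  # target ranked first
list_low = ["item_b", "item_c", "item_a"]   # target ranked third
print(ndcg_at_k(list_high, ["item_a"], k=3))
print(ndcg_at_k(list_low, ["item_a"], k=3))
```

Both lists contain the target item, so a hit-based metric would score them equally. NDCG discounts the hit by its position, which is why a model that places the interacted item nearer the top is credited with higher ranking quality.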

Table 5 A case study on Tmall-Buy dataset
Table 6 The result of recommendation compared with LLMs

5.6 vs. LLMs

To compare the recommendation ability of HPCL4SR with that of large language models in the matching stage, we use two prompting methods and have ChatGPT (ChatGPT 3.5-Turbo-1106 & ChatGPT 4-Turbo) provide recommended items based on user interaction information. The first method inputs the user’s historical interaction information and prompts ChatGPT to generate 50 items of interest to the user. The second method inputs the user’s interaction history together with 50 candidate items, consisting of the user’s real next item and randomly selected items, and prompts ChatGPT to select from these candidates the items that the user may be interested in. The experimental details and results are shown in Table 6. The performance of ChatGPT 4-Turbo is better than that of ChatGPT 3.5-Turbo-1106, but the results of both are far lower than those of our model and even of many existing models. This indicates that current general-purpose LLMs still cannot be applied well to specific task domains [26, 49]. In addition, although T5-small [50] has far fewer parameters than ChatGPT, we found that fine-tuning this language model can further improve recommendation performance; however, it still cannot match the model we carefully designed for sequence recommendation. Nevertheless, the expressive power of large models will be enlightening for our future work.
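The two prompting methods can be sketched as follows. The prompt wording, the function names, the fixed random seed, and the exclusion of already-seen items from the random candidates are all assumptions made for illustration. The exact prompts used in the experiments are not given in this section, and no API call is shown.

```python
import random

def build_generation_prompt(history, n_items=50):
    """Method 1: ask the model to generate n_items items from the history alone."""
    lines = [
        "A user interacted with the following items, in order:",
        ", ".join(history),
        f"Recommend {n_items} items this user may be interested in next.",
    ]
    return "\n".join(lines)

def build_selection_prompt(history, next_item, item_pool, n_candidates=50, seed=0):
    """Method 2: ask the model to choose from a fixed candidate list.

    The candidates are the real next item plus (n_candidates - 1) randomly
    sampled items. Skipping items already in the history is an assumption.
    """
    rng = random.Random(seed)
    seen = set(history)
    eligible = [i for i in item_pool if i != next_item and i not in seen]
    candidates = rng.sample(eligible, n_candidates - 1) + [next_item]
    rng.shuffle(candidates)  # hide the position of the real next item
    lines = [
        "A user interacted with the following items, in order:",
        ", ".join(history),
        "Candidate items:",
        ", ".join(candidates),
        "From the candidates only, select the items this user may be interested in.",
    ]
    return "\n".join(lines), candidates

# Hypothetical item identifiers, for illustration only.
pool = [f"item_{i}" for i in range(200)]
history = [f"item_{i}" for i in range(5)]
prompt, candidates = build_selection_prompt(history, "item_5", pool)
print(build_generation_prompt(history))
print(prompt)
```

In the second setting, the real next item is guaranteed to be among the candidates, so it isolates the model’s ranking ability from its ability to name valid catalogue items, which the first setting also requires.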

6 Conclusion

In this paper, we propose a novel framework named HPCL4SR for multi-interest sequence recommendation. To represent multiple user interests, HPCL4SR uses contrastive learning to differentiate the interests while preserving their correlation information, which is more in line with user behavior in real scenarios. We verify the effectiveness of the proposed method through experiments on three datasets. Additionally, we compare the recommendation ability of our approach in a task-specific domain with that of LLMs (ChatGPT 3.5-Turbo-1106 & ChatGPT 4-Turbo), further showing the superiority of HPCL4SR in multi-interest sequential recommendation. In future work, we plan to explore enhancing the interpretability of recommendation tasks based on multi-interest recommendation models.