
1 Introduction

Sequential recommendation predicts items of potential interest based on a user's historical behavior. In the internet age, the amount of user behavior data and the number of available items have grown exponentially [1]. Deep neural networks learn item representations from large amounts of data, and many classic models have emerged. For example, Caser [11] employs a convolutional neural network (CNN) as its backbone network, and GRU4Rec [4] uses a recurrent neural network (RNN). In particular, the Transformer [12] architecture shines in sequential recommendation, as in SASRec [5] and BERT4Rec [9].

However, due to the sparseness of sequence data, deep neural networks cannot learn accurate item representations. The emergence of contrastive learning [6] alleviates the sparsity of sequence data to a certain extent. CL4Rec [15] augments data through random crop, mask, and reorder operations. DuoRec [8] uses a Dropout-based approach to enhance sequence representations at the model level and mines positive and negative samples from sequences with similar target items. However, because of the noise in the sequence data, the augmented data is still disturbed by the noise in the original sequence.

Contrastive learning methods, however, do not solve the problem of noise in the sequence. Noise has always been a major difficulty in representation learning, and sequential recommendation is no exception [13, 14]. For example, in real online shopping, a user's accidental click may not reflect the user's real intention. Augmented data generated by randomly cropping, masking, and reordering the original sequence may lack robustness because of such noisy data, and poor-quality data augmentation can harm model training. Furthermore, most current methods derive the user's intent from the user's original sequence [3, 7, 10]. It is tempting to treat the user's recent behavior as the user's intention or query vector, but this may be inaccurate because the user's intention changes over time.

Based on the above observation, we propose a Noise-augmented Contrastive Learning for Sequential Recommendation (NCL4Rec) to address the noise problem in sequential recommendation. In our method, we use noise probabilities to guide the data augmentation process and mitigate the impact of noise in the original sequence. We introduce supervised noise recognition during training instead of relying on the original sequence, thereby eliminating the influence of noise in the original data. The noise probability is dynamically updated online after a certain number of training epochs. During training, we calculate the noise probability and design positive and negative sample augmentations based on it. Positive samples are generated by processing items with low noise probability, while negative samples are generated by processing items with high noise probability. Additionally, we design positive and negative loss functions to minimize the distance between positive samples and maximize the distance between positive and negative samples.

Our contributions:

  • We propose a Noise-augmented Contrastive Learning for Sequential Recommendation (NCL4Rec), which addresses noise issues and data sparsity by unifying sequential recommendation and self-supervised contrastive learning methods.

  • We propose novel noise-guided positive and negative data augmentations that exploit the relevance of items to the user's intent to better discriminate noisy data, and we design a noise contrastive loss function to better distinguish noisy items from normal items.

  • We conduct extensive experiments on three benchmark datasets, and our method consistently outperforms currently existing state-of-the-art models, with performance gains ranging from 3.37% to 7.10%.

2 Problem Formulation

Formally, let \(S_u = (s_u^1, s_u^2, \dots , s_u^n)\) be a sequence of items, and let \(s_u^{n+1}\) be the next item in the sequence to be predicted. We define the problem of sequence recommendation as follows:

Given a set of training sequences \(D = \{(S_u, s_u^{n+1})\}\), where each training sequence \(S_u\) consists of n items and \(s_u^{n+1}\) is the corresponding next item, the goal is to learn a function f that maps a user's historical sequence \(S_u\) to the next item \(s_u^{n+1}\). More formally, we seek a function f such that:

$$\begin{aligned} s_u^{n+1} = f(S_u) \end{aligned}$$
(1)

where f is learned from the training set D. The learned function f can then be used to make predictions on new, unseen sequences.
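As a concrete illustration of this interface, the following minimal Python sketch (our own toy example; the item IDs and the trivial stand-in predictor are hypothetical, not from the paper) shows the input/output contract implied by Eq. (1).

```python
from typing import Callable, List

# Toy illustration of the problem setup (hypothetical item IDs and predictor).
Sequence = List[int]                              # S_u: item IDs in interaction order
NextItemPredictor = Callable[[Sequence], int]     # f: S_u -> s_u^{n+1}

def toy_predictor(seq: Sequence) -> int:
    """Trivial stand-in for the learned f: just repeat the last item."""
    return seq[-1]

S_u = [12, 7, 33, 7, 91]                          # one user's historical sequence
predicted_next = toy_predictor(S_u)               # a (hypothetical) s_u^{n+1}
print(predicted_next)
```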

3 Methods

The emphasis of this paper is on effective data augmentation, and there is no detailed description of the sequence encoding model. Instead, we use the backbone network that is commonly used in contrastive learning-based sequential recommendation models. It’s important to note that the purpose of contrastive learning methods is to address the problem of sparse training data and help us obtain a more effective encoding model.

In this section, we describe in detail our proposed Noise-augmented Contrastive Learning for Sequential Recommendation (NCL4Rec). The framework of our method is shown in Fig. 1. Our method consists of four parts: (1) the generation of sequence item noise probabilities; (2) data augmentation guided by noise probabilities; (3) the user representation encoding model; and (4) the noise contrastive loss function.

Fig. 1. Framework of NCL4Rec.

3.1 The Generation of Sequence Item Noise Probabilities

User behavior sequences often contain a large amount of noisy data, where noise refers to items that do not conform to the user's intention. Most current methods augment item data based only on the original sequence. However, the user's sequential behavior shifts toward the user's next item, so we use the user's target item to calculate the noise probability of each item. This effectively removes interference within the original sequence, captures the user's intention more accurately, and allows the shift of user intention to be learned better. We define the user sequence as \(S_u = \{s^1_u, s^2_u, s^3_u,...,s^n_u\}\), where n is the sequence length, and \(s_u^{n+1}\) is the user's next interacted item, which is also the supervision signal in our training.

First, the sequence passes through the embedding layer,

$$\begin{aligned} Z_u = Embedding(S_u) \end{aligned}$$
(2)
$$\begin{aligned} Z_u = {z^1_u, z^2_u, z^3_u...z^n_u} \end{aligned}$$
(3)

where \(z_u^i\) is the embedding space representation of the i-th item of user u. We calculate the similarity between the target item and each sequence item through a soft attention mechanism and derive the noise probability of the item from it.

$$\begin{aligned} prob(z^i_u) = 1 - \frac{\exp (cor_i)}{\sum _{j=1}^n \exp (cor_j)} \end{aligned}$$
(4)

where \(cor_i = sim(z_u^i, z^{n+1}_u)\) and sim is the correlation measure; in this paper, we use cosine similarity.

From the above method, we obtain the noise probability of each item of user u, \(Prob(Z_u) = \{prob(z^1_u),prob(z^2_u),prob(z^3_u),...,prob(z^n_u)\}\). Unlike previous methods, we use the supervision signal to directly calculate the noise probability, because the noise probability is only needed on the training set and the supervision signal is not used at test time.
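The following sketch illustrates how the noise probabilities of Eqs. (2)-(4) could be computed; the embedding table, tensor names, and sizes are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

n_items, dim = 1000, 64
emb = torch.nn.Embedding(n_items, dim)      # item-ID embedding table (assumed)

seq_ids = torch.tensor([5, 42, 17, 8])      # S_u (length n)
target_id = torch.tensor(99)                # s_u^{n+1}, the supervision signal

Z_u = emb(seq_ids)                          # Eqs. (2)-(3): z_u^1 ... z_u^n
z_target = emb(target_id)                   # embedding of the next item

# cor_i: cosine similarity between each sequence item and the target item
cor = F.cosine_similarity(Z_u, z_target.unsqueeze(0), dim=-1)

# Eq. (4): noise probability = 1 - softmax over the correlations
prob = 1.0 - torch.softmax(cor, dim=0)      # Prob(Z_u), one value per item
```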

Noise Update Strategy for Sequential Items. Because our noise probabilities are calculated from the embedding representations of items, after a certain number of training epochs they may become inaccurate and need to be updated. The update interval t is a hyperparameter, and every t epochs we recompute the noise probabilities of the items. For example, if the total number of training epochs N is 50 and t is 20, we recompute the noise probabilities at the 20th and 40th epochs.
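A minimal sketch of this update schedule, assuming the paper's example of N = 50 and t = 20; the recompute step is a placeholder for re-evaluating Eq. (4) with the current embeddings.

```python
# Sketch of the noise-probability update schedule.
def recompute_noise_probabilities():
    pass  # placeholder: re-evaluate Eq. (4) with the current item embeddings

N, t = 50, 20                        # total training epochs and update interval
for epoch in range(1, N + 1):
    if epoch % t == 0:               # with N=50, t=20 this fires at epochs 20 and 40
        recompute_noise_probabilities()
    # ... run one training epoch here ...
```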

3.2 Data Augmentation Based on Noise Probability

According to the noise probabilities of the items in the sequence calculated in the previous section, we perform the corresponding data augmentation. In this section, we design five sequence data augmentation methods. We apply positive and negative data augmentation to the crop and mask operations of CL4Rec according to the noise probability. The reorder operation does not change the elements of the sequence, so we only apply positive augmentation to it. A code sketch of these operations is given after the list.

  • Crop or Mask for Noise Reduction. In order to reduce the noise in the user behavior sequence, we select the k items with the highest noise probability and crop or mask them, so that the remaining items in the sequence are more similar to the user's intention, where k is determined by the crop or mask coefficient \(\alpha \), \(k=\alpha |Z_u|, 0<\alpha <1\).

    $$\begin{aligned} Z_u^{crop+} = [\hat{v}_1,\hat{v}_2,...,\hat{v}_{|Z_u|}] \end{aligned}$$
    (5)
    $$\begin{aligned} \hat{v}_i = \left\{ \begin{array}{ll} z_{u}^i, & prob(z_u^i) < Prob(Z_u).sort()[k]\\ \emptyset \ \text {or} \ [mask], & prob(z_{u}^i) \ge Prob(Z_u).sort()[k] \end{array}\right. \end{aligned}$$
    (6)
  • Crop or Mask for Noise Augmentation. In order to amplify the noise in the user behavior sequence, we select the k items with the lowest noise probability and crop or mask them, so that the remaining items in the sequence contradict the user's intention as much as possible, where k is determined by the crop or mask coefficient \(\beta \), \(k= \beta |Z_u|, 0<\beta <1\). The formula is analogous to Eq. (6).

  • Reorder for Noise Reduction. In order to minimize the impact of noisy items on the sequence's expression of user intent, we select the k items with the highest noise probability and randomly reorder them, where k is determined by the reorder coefficient \(\gamma \), \(k = \gamma |Z_u|, 0 < \gamma < 1\).
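Below is a minimal sketch of the noise-guided crop, mask, and reorder operations described above; the function names, tensor layout, and handling of the mask token are our own illustrative choices, not the authors' code.

```python
import torch

def crop_or_mask(Z_u, prob, coeff, reduce_noise=True, mask_token=None):
    """Sketch of the noise-guided crop/mask operations.

    reduce_noise=True  -> crop/mask the k items with the HIGHEST noise probability.
    reduce_noise=False -> crop/mask the k items with the LOWEST noise probability.
    mask_token=None    -> crop (remove items); otherwise replace them with mask_token.
    """
    n = Z_u.size(0)
    k = max(1, int(coeff * n))                        # k = alpha|Z_u| or beta|Z_u|
    order = torch.argsort(prob, descending=reduce_noise)
    selected = order[:k]                              # indices to crop or mask
    if mask_token is None:                            # crop: drop the selected items
        keep = torch.ones(n, dtype=torch.bool)
        keep[selected] = False
        return Z_u[keep]
    out = Z_u.clone()                                 # mask: overwrite the selected items
    out[selected] = mask_token
    return out

def reorder_noise_reduction(Z_u, prob, gamma):
    """Randomly reorder the k items with the highest noise probability."""
    n = Z_u.size(0)
    k = max(1, int(gamma * n))
    idx = torch.argsort(prob, descending=True)[:k]    # noisiest positions
    perm = idx[torch.randperm(k)]                     # shuffle them among themselves
    out = Z_u.clone()
    out[idx] = Z_u[perm]
    return out
```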

3.3 Sequence Encoder

The Transformer has good encoding ability for sequence data and can capture the internal relationships within a sequence through the self-attention mechanism. It is widely used as the backbone network for sequential recommendation. Other sequence encoders are also valid, such as those used in GRU4Rec, Caser, and BERT4Rec.

$$\begin{aligned} \hat{Z}_{u} = TransformerEncoder(Z_{u}) \end{aligned}$$
(7)

We follow the common approach of sequential recommendation models and use the last item representation \(z_u\) as the representation of the whole sequence.

$$\begin{aligned} z_u = \hat{Z_u}[-1] \end{aligned}$$
(8)
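A minimal sketch of Eqs. (7)-(8) using PyTorch's standard TransformerEncoder as the backbone; the layer sizes below are illustrative and not the paper's configuration.

```python
import torch
import torch.nn as nn

dim, n_heads, n_layers = 64, 2, 2                 # illustrative sizes
encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

Z_u = torch.randn(1, 50, dim)     # one embedded user sequence: [batch, length, dim]
Z_u_hat = encoder(Z_u)            # Eq. (7)
z_u = Z_u_hat[:, -1, :]           # Eq. (8): last position as the sequence representation
```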

3.4 Noise Contrastive Loss

In our data augmentation method, we differ from CL4Rec and DuoRec in that we introduce a unique negative data augmentation, which matches the idea of contrastive learning: maximizing the difference between positive and negative samples.

Traditional Sequential Recommendation Loss Function. In this paper we adopt cross-entropy [2] as our supervised learning loss function.

$$\begin{aligned} \mathcal {L}_{seq}\left( s_{u}\right) =-\log \frac{\exp \left( {\text {sim}}\left( z_{u}, z^{n+1}_u\right) \right) }{\sum _{i =1}^{||V||} \exp \left( {\text {sim}}\left( z_{u}, z^{v_i} \right) \right) } \end{aligned}$$
(9)

where \(z_u\) is the representation of the user sequence, \(z^{n+1}_u\) is the representation of the next item, \(z^{v_i}\) is the embedding of candidate item \(v_i\), and ||V|| is the size of the item set.
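The following sketch shows one way to compute Eq. (9); using the dot product as sim(·,·) is a common implementation choice and an assumption on our part, as are the variable names.

```python
import torch
import torch.nn.functional as F

def seq_loss(z_u, target_id, item_emb):
    """Sketch of Eq. (9): cross-entropy over all candidate items.

    z_u:       [dim]        sequence representation
    target_id: 0-d LongTensor, index of s_u^{n+1}
    item_emb:  [|V|, dim]   embeddings of every candidate item
    """
    logits = item_emb @ z_u                                   # sim(z_u, z^{v_i}) for all items
    return F.cross_entropy(logits.unsqueeze(0), target_id.unsqueeze(0))
```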

Positive Contrastive Loss Function. We use a contrastive loss function [6] to determine whether two positive samples come from the same user history sequence. We minimize the distance between positive samples obtained from different augmentations of the same sequence, and maximize the difference between different sequences.

$$\begin{aligned} \mathcal {L}_{cl}^+\left( s_{u}\right) =-\log \frac{\exp \left( {\text {sim}}\left( z_{u}^{a_{i}}, z_{u}^{a_{j}}\right) / \tau \right) }{\exp \left( {\text {sim}}\left( z_{u}^{a_{i}}, z_{u}^{a_{j}}\right) /{\tau }\right) +\sum _{s^{-} \in S^{-}} \exp \left( {\text {sim}}\left( z_{u}^{a_{i}}, z^{s^{-}}\right) /{\tau }\right) } \end{aligned}$$
(10)

where \(z_u^{a_i}, z_u^{a_j}\) are the representations of the user sequence obtained from two noise-reduction augmentations, and \(S^-\) is the set of negative samples, i.e., samples augmented from other sequences in the same batch. \(z^{s^{-}}\) is the representation of a negative sample, and \(\tau \) is the temperature coefficient.
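A sketch of Eq. (10) with in-batch negatives; the exact construction of \(S^-\) from the batch, the use of cosine similarity, and the variable names are our own implementation assumptions.

```python
import torch
import torch.nn.functional as F

def positive_cl_loss(z_a, z_b, tau=0.5):
    """Sketch of Eq. (10).

    z_a, z_b: [batch, dim] representations of two noise-reduced augmentations;
              row i of z_a and z_b come from the same user sequence. The other
              rows in the batch act as the negative set S^-.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau               # sim(z_u^{a_i}, .) / tau
    labels = torch.arange(z_a.size(0))         # the positive pair is the diagonal
    return F.cross_entropy(logits, labels)
```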

Negative Contrastive Loss Function. Our negative samples are the samples generated by the noise augmentations. Our goal is to pull noise-augmented samples close to each other and push them far away from noise-reduced samples.

$$\begin{aligned} \mathcal {L}_{c l}^-\left( s_{u}\right) =-\frac{1}{|A^{-}|} \sum _{s_{u^{\prime }}^{a} \in A^{-}} \log \frac{\exp \left( {\text {sim}}\left( z_{u}^{a-}, z_{u^{\prime }}^{a}\right) /{\tau }\right) }{\exp \left( {\text {sim}}\left( z_{u}^{a-}, z_{u^{\prime }}^{a}\right) /{\tau }\right) + \sum _{s \in A^{+} } \exp \left( {\text {sim}}\left( z_{u}^{a-}, z\right) /{\tau }\right) } \end{aligned}$$
(11)

where \(A^-\) is the set of samples generated by noise augmentation and \(A^+\) is the set of samples generated by noise reduction. \(z_u^{a-}\) is the representation of the user sequence obtained from a noise-augmentation method, \(z_{u^{\prime }}^{a}\) is a sample from noise augmentation, and z is a sample from noise reduction.
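A sketch of Eq. (11); the tensor layout, normalization, and variable names are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def negative_cl_loss(z_neg_anchor, z_neg_others, z_pos_set, tau=0.5):
    """Sketch of Eq. (11).

    z_neg_anchor: [dim]     one noise-augmented representation z_u^{a-}
    z_neg_others: [M, dim]  other noise-augmented samples (the set A^-)
    z_pos_set:    [K, dim]  noise-reduced samples (the set A^+)
    Noise-augmented samples are pulled together; noise-reduced samples appear
    only in the denominator, so the anchor is pushed away from them.
    """
    z_neg_anchor = F.normalize(z_neg_anchor, dim=-1)
    z_neg_others = F.normalize(z_neg_others, dim=-1)
    z_pos_set = F.normalize(z_pos_set, dim=-1)

    sim_neg = z_neg_others @ z_neg_anchor / tau        # [M] numerator terms
    sim_pos = z_pos_set @ z_neg_anchor / tau           # [K] repelled terms

    denom = sim_neg.exp() + sim_pos.exp().sum()        # per-pair denominator
    return -(sim_neg - denom.log()).mean()             # average over A^-
```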

Joint Training. Finally, NCL4Rec is trained jointly with the cross-entropy loss, the positive contrastive loss, and the negative contrastive loss.

$$\begin{aligned} \mathcal {L}_{NCL4Rec}=\mathcal {L}_{seq}+\lambda _{cl^{+}} \mathcal {L}_{cl}^{+} +\lambda _{cl^{-}} \mathcal {L}_{cl}^{-} \end{aligned}$$
(12)

where \(\lambda _{cl^{+}}\) is the coefficient of positive loss function and \(\lambda _{cl^{-}}\) is the coefficient of negative loss function.
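A minimal sketch of the joint objective in Eq. (12), with placeholder loss values and illustrative coefficient settings; the three loss terms stand in for those computed above.

```python
import torch

# Placeholder values standing in for L_seq, L_cl^+, and L_cl^- computed above.
loss_seq = torch.tensor(1.2, requires_grad=True)
loss_cl_pos = torch.tensor(0.8, requires_grad=True)
loss_cl_neg = torch.tensor(0.5, requires_grad=True)

lambda_pos, lambda_neg = 0.1, 0.1                    # illustrative coefficients
loss_total = loss_seq + lambda_pos * loss_cl_pos + lambda_neg * loss_cl_neg
loss_total.backward()                                # single joint optimization step
```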

4 Experiment

To evaluate our method, we mainly focus on the following research questions.

  • Q1: How does our NCL4Rec perform compared to other sequential recommendation models?

  • Q2: How does our NCL4Rec compare to other models in terms of representation learning?

4.1 Setup

Dataset. We use widely adopted sequential recommendation datasets, namely Amazon and MovieLens.

Baselines. The following methods are used for comparison:

  • Sequential recommendation models: we use the RNN-based GRU4Rec [4], the CNN-based Caser [11], and the Transformer-based SASRec [5].

  • Contrastive learning models for sequential recommendation: we use CL4Rec [15] and DuoRec [8].

Metrics. We use top-K Hit Ratio (HR@K) and top-K Normalized Discounted Cumulative Gain (NDCG@K), with \(K \in \{5, 10\}\).
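For reference, a minimal sketch of how HR@K and NDCG@K reduce for a single test sequence with one ground-truth next item; the helper name is our own.

```python
import math

def hr_and_ndcg_at_k(rank, k):
    """HR@K and NDCG@K for one test sequence.

    rank: 1-based position of the ground-truth next item in the ranked list.
    With a single relevant item, HR@K is 1 if it appears in the top K, and
    NDCG@K reduces to 1 / log2(rank + 1) in that case.
    """
    hit = 1.0 if rank <= k else 0.0
    ndcg = 1.0 / math.log2(rank + 1) if rank <= k else 0.0
    return hit, ndcg

# e.g. the true item ranked 3rd counts as a hit for K=5 and yields NDCG@5 = 0.5
print(hr_and_ndcg_at_k(rank=3, k=5))
```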

Table 1. Overall performance. (The best results are bolded and the suboptimal ones are underlined. The last column reports the percentage improvement of our results over the best baseline.)

4.2 Overall Performance (Q1)

In general, NCL4Rec performs the best on all metrics and datasets. On ML-100K, it outperforms other algorithms by a significant margin in HR@5, HR@10, and NDCG@10, and achieves the highest NDCG@5. Similarly, on Beauty and Sports, NCL4Rec consistently achieves the best performance across all metrics, with improvements of at least 3.47%. SASRec and DuoRec also show competitive performance on all datasets. SASRec performs well in HR@5 and NDCG@5 on ML-100K, while DuoRec excels in HR@10 and NDCG@10, and both perform well on Beauty and Sports. Caser and CL4Rec, however, exhibit relatively suboptimal performance compared to the other algorithms across all datasets: Caser consistently performs poorly on all metrics, while CL4Rec ranks low in HR@5, HR@10, and NDCG@10 on ML-100K, Beauty, and Sports.

Overall, these results indicate that NCL4Rec is a promising recommendation algorithm that achieves superior performance across multiple datasets and evaluation metrics (Table 1).

4.3 Study of Ablation

Fig. 2. Performance comparison of DuoRec, NCL4Rec w/o \( \mathcal {L}_{cl}^- \), and NCL4Rec on HR@10 and NDCG@10.

To verify the effectiveness of our proposed method, we test the performance of NCL4Rec with different loss functions on the three datasets and include DuoRec as a reference. Figure 2 shows the results: when only the positive loss function is used, our method already outperforms DuoRec in terms of HR@10 on all three datasets, and with the full loss function it improves further. However, on ML-100K, when only the positive contrastive loss is used, our method's performance is slightly lower than DuoRec's.

Fig. 3. Item embeddings on the ML-100K dataset.

4.4 Discussion About Item Representation (Q2)

Representation learning is always the focus of deep recommendation systems, since the embedded representations of items directly determine the performance of recommendation models. Figure 3 shows the item embedding representations learned on the ML-100K dataset by four methods: SASRec, CL4Rec, DuoRec, and NCL4Rec, all of which use the Transformer as the backbone network. The embedded representations of SASRec are very clustered, followed by CL4Rec. DuoRec uses a contrastive regularization method to enhance the uniformity of the sequence representation distribution, which is a clear improvement over CL4Rec. Our NCL4Rec constructs positive and negative data augmentations that make it easier to distinguish noisy items from normal items in the sequence. As a result, NCL4Rec makes the embedded representations of items more uniform and more discriminative, further improving the representation quality.

5 Conclusion

In this paper, we investigate how to address the inherently noisy data present in sequence data to improve recommendation performance. We introduce supervisory signals to identify noise in raw sequence data and then design positive and negative augmentations. By pulling positive samples closer together and widening the distance between positive and negative samples, we learn better item representations. Experiments demonstrate that NCL4Rec outperforms state-of-the-art sequential recommendation models on multiple datasets. In future research, we will explore more accurate noise identification methods, so that the inherent noise in sequences can be better identified and the generated samples have better representation capabilities.