1 Introduction

Personalized recommendation is one of the most popular recommendation paradigms at present; it tailors the recommended content to each user according to the user's unique preferences. Personalized recommendation algorithms mainly include collaborative filtering-based, content-based, and sequential recommendation algorithms. Collaborative filtering-based recommendation algorithms [16] recommend items according to some similarity measure (between users or between items) computed from the behavior of groups of users. Content-based recommendation algorithms [7] only utilize the basic information of a user (e.g., gender, age) and the user-item interactions to forecast the user's preferences, without taking the information of other users into account. Sequential recommendation algorithms [2, 19] attempt to predict the user's next item by exploiting the user's historical behavior sequence. Sequential recommendation is particularly important in movie recommendation, since it models the historically watched movies as a dynamic sequence and can thus uncover hidden relationships between movies.

Most current sequential recommendation algorithms order movies by interaction timestamps. However, these algorithms only model the time order, ignoring the temporal information hidden in the timestamps themselves. For example, Markov chain models [9, 21, 22] assume that the next movie is related to the previous movies, which only considers the temporal order without considering time itself. Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) [21] model time series through hidden states. However, both CNN and RNN compress temporal information into fixed hidden vectors, thus ignoring the temporal relationships between individual movies. The recently emerged self-attention mechanism can allocate different weights to pieces of information according to their importance [1], but self-attention by itself does not take the order of the sequence into account.

In this paper, we propose a Self-Attention Sequential Recommendation algorithm based on Movie Genre Time Interval (SSR-MGTI). Specifically, we add absolute position information to the multi-head self-attention mechanism to provide the sequential position of each movie. We then use the time intervals between movies of the same genre in the user-movie interaction sequence to represent temporal information and model them to predict the next movie. In order to improve the model's fitting ability and highlight the importance of local preferences, we further add a CNN layer to improve the model's prediction ability. Our contributions are summarized as follows:

  • We model the time interval information between movies of the same genre in the user-movie interaction sequence.

  • We use the multi-head self-attention mechanism to learn from the same-genre time intervals, adding the absolute position information of each movie.

  • We add a CNN to improve the stability and generalization ability of the model structure and to capture the local information of user-movie interaction sequences.

  • We carry out extensive experiments on the MovieLens and Amazon datasets, which show that our algorithm outperforms state-of-the-art algorithms.

2 Related Work

During the past decades, numerous sequential recommendation algorithms have been proposed in the recommendation systems area. In general, existing methods can be categorized into three groups: general sequential recommendation methods, deep learning-based sequential recommendation methods, and self-attention-based sequential recommendation methods.

General sequential recommendation algorithms include sequential pattern mining and Markov chain models. Yap et al. [23] proposed a recommendation framework based on personalized sequential pattern mining, which effectively learns important knowledge from user sequences. The FPMC model [15] proposed by Rendle et al. combines matrix factorization with a Markov chain model and adapts the Bayesian personalized ranking framework to sequential basket recommendation. However, Markov chain models can only capture the local information of a sequence, ignoring its global information.

Among deep learning-based sequential recommendation methods, RNN and CNN are the most commonly used architectures. RNN is inherently capable of processing sequence data. In order to solve the long-term dependency problem of the basic RNN, two variants were developed, namely the Long Short-Term Memory network (LSTM) [12, 17] and the Gated Recurrent Unit (GRU) [5]. Duan et al. [6] proposed a new LSTM-based architecture to address the fact that RNNs ignore collective dependencies due to the monotonous temporal relationship between items. The model adds a "Q-K-V" triplet to the recurrent unit to enhance the memory ability of LSTM, and proposes a "recovery gate" to solve the memory loss caused by the "forget gate". However, RNN is mainly suitable for long-term sequences. CNN can treat a sequence as a one-dimensional space and extract features by convolution over local subsequences. Tang et al. [18] proposed a convolutional sequence embedding model (Caser) that embeds recent sequence items into an "image" in time and latent space, on which convolution filters extract local features. However, CNN is only good at capturing short-term patterns and is not suitable for long-term sequences.

In recent years, the self-attention mechanism has attracted great attention in the fields of natural language processing and computer vision. Chiang et al. [4] proposed a stacked attention network model, which stacks contextual item attention modules with multi-head attention modules and improves recommendation performance by using additional time information to model contextual items. Kang et al. [13] proposed a self-attention-based sequence model (SASRec), which balances its behavior between sparse and dense datasets. Li et al. [14] proposed time interval-aware self-attention sequential recommendation (TiSASRec), which models the time intervals in user interaction sequences and uses this time interval information to predict the next item.

Although all the above methods use timestamps to order time series, they seldom exploit the temporal information carried by the timestamps themselves, and they do not take into account the time interval characteristics within the same genre. However, users watch movies of different genres, and the time intervals between movies of the same genre can better reflect their interests. In this paper, we model the same-genre time interval information of movies as the relationship between movies, and add absolute position information to the multi-head self-attention mechanism.

3 Problem Description of Movie Genre Time Interval

Since our recommendation algorithm is based on the movie genre time interval, we first give the definition of the movie genre time interval in this section, followed by the problem description.

Movie Genre Time Interval (MGTI) refers to the length of time between movies of the same genre in the user-movie interaction sequence. This interval can reflect the user's recent preference for that genre: the smaller the time interval between two movies of the same genre, the more the user currently likes that genre of movies.

The MGTI can be modeled as follows. Let \(U = \left\{ {u_{i} |1 \le i \le N} \right\}\), \(V = \left\{ {v_{j} |1 \le j \le M} \right\}\) and \(G = \left\{ {g_{k} |1 \le k \le H} \right\}\) represent the user set, the movie set and the genre set, respectively. Each movie in the movie set has a corresponding timestamp, which can be represented by \(T = \left\{ {t_{q} |1 \le q \le M} \right\}\). For a user ui, the user-movie interaction sequence can be denoted by \(S_{i} = \left( {s_{i1}^{{g_{1} }} ,s_{i2}^{{g_{2} }} , \ldots ,s_{ij}^{{g_{k} }} , \ldots ,s_{iM}^{{g_{H} }} } \right)\), \(i \in [1,N],j \in [1,M],k \in [1,H]\). MGTI can be denoted by \(r_{cd}^{{u_{i} }} = s_{id}^{{g_{a} }} - s_{ic}^{{g_{b} }}\), where ga and gb are genre sets with \(g_{a} \cap g_{b} \ne \emptyset\). The absolute position sequence refers to the positions of the movies in the user-movie interaction sequence, defined as \(P = (1,2, \ldots ,M)\). At time t, the model predicts the next movie based on the previous t − 1 movies and \(r_{cd}^{{u_{i} }}\). The input of our model is a user-movie interaction sequence (Si), the absolute positions of the movies in the user-movie interaction sequence (P) and the genre time interval matrix of the user-movie interaction sequence (Ri), which is defined as below.

$$ R^{i} = \left[ {\begin{array}{*{20}c} {r_{11}^{i} } & {r_{12}^{i} } & \cdots & {r_{1,n - 1}^{i} } & {r_{1n}^{i} } \\ {r_{21}^{i} } & {r_{22}^{i} } & \cdots & {r_{2,n - 1}^{i} } & {r_{2n}^{i} } \\ {r_{31}^{i} } & {r_{32}^{i} } & \cdots & {r_{3,n - 1}^{i} } & {r_{3n}^{i} } \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ {r_{n1}^{i} } & {r_{n2}^{i} } & \cdots & {r_{n,n - 1}^{i} } & {r_{nn}^{i} } \\ \end{array} } \right] $$
(1)
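For concreteness, the sketch below shows one way to build Ri for a single user from raw timestamps and genre sets. The paper does not give an implementation, so the function name, the clipping threshold, and the use of the maximum interval for genre-disjoint movie pairs are our assumptions (clipping intervals to a maximum value is standard practice, cf. TiSASRec [14]).

```python
import numpy as np

def genre_time_interval_matrix(timestamps, genres, max_interval):
    """Build the genre time interval matrix R^i for one user (sketch).

    timestamps:   list of n interaction timestamps (ascending).
    genres:       list of n sets, the genre set of each watched movie.
    max_interval: clipping threshold for intervals (a hyperparameter).

    Pairs of movies that share no genre are assigned max_interval here;
    the paper does not specify this case, so it is an assumption.
    """
    n = len(timestamps)
    R = np.full((n, n), max_interval, dtype=np.int64)
    for c in range(n):
        for d in range(n):
            if genres[c] & genres[d]:  # g_a ∩ g_b ≠ ∅
                interval = abs(timestamps[d] - timestamps[c])
                R[c, d] = min(interval, max_interval)
    return R

# Toy usage: three movies, the first and third sharing the "Action" genre.
R = genre_time_interval_matrix(
    timestamps=[0, 3, 10],
    genres=[{"Action"}, {"Comedy"}, {"Action", "Drama"}],
    max_interval=64,
)
```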

4 Multi-head Self-attention Mechanism Based on Movie Genre Time Interval

The overall framework of the model is shown in Fig. 1. The model includes three parts. 1) Embedding Layer: this layer vectorizes the user-movie interaction sequence (Si), the genre time interval matrix (Ri) and the absolute positions of the movies in the sequence (P), embedding them in a low-dimensional space. 2) Multi-Head Self-Attention Mechanism Layer: this layer focuses on more relevant movies (i.e., those with shorter genre time intervals) and gives them more weight; by assigning different weights to movies, the recommendation results become more personalized. 3) Convolutional Neural Network Layer: this layer introduces non-linearity into the model and captures local information of user-movie interaction sequences, thereby improving the fitting ability of the model as well as the stability and generalization ability of the model structure.

Fig. 1. Overall framework of the multi-head self-attention mechanism based on movie genre time interval.

4.1 Embedding Layer

We use an embedding layer to map the user-movie interaction sequence (Si) to a lower-dimensional space. The embedding layer uses the movie ID (the serial number of the movie in the dataset) as a numerical index into an embedding matrix \(E_{S} \in {\mathbf{R}}^{{{\mathbf{c}} \times {\mathbf{d}}}}\), where c is the dictionary size, d is the latent dimension, and S denotes the user-movie interaction sequence. The matrix ES maps each movie ID to a fixed-size vector, so we obtain the mapped low-dimensional matrix \(O_{S} \in {\mathbf{R}}^{{{\mathbf{n}} \times {\mathbf{d}}}}\), where n is the maximum length of the sequence.

Similar to the user-movie interaction sequence, we use an embedding layer to map the absolute positions (P) to a lower-dimensional space. The difference is that we use two separate embedding matrices to generate the keys and values of the multi-head self-attention mechanism, without requiring additional linear transformations. Because the absolute position is a number, we use it as the numerical index into the embedding matrices \(E_{P}^{K} \in {\mathbf{R}}^{{{\mathbf{c}} \times {\mathbf{d}}}} \;{\text{and}}\;E_{P}^{V} \in {\mathbf{R}}^{{{\mathbf{c}} \times {\mathbf{d}}}}\), which map the absolute positions to fixed-size vectors; here P, K and V denote the absolute position and the keys and values of the multi-head self-attention mechanism. Therefore, we obtain the mapped low-dimensional matrices \(O_{P}^{K} \in {\mathbf{R}}^{{{\mathbf{n}} \times {\mathbf{d}}}} \;{\text{and}}\;O_{P}^{V} \in {\mathbf{R}}^{{{\mathbf{n}} \times {\mathbf{d}}}}\).

Likewise, we use the entries of the genre time interval matrix (Ri) as numerical indexes into the embedding matrices \(E_{R}^{K} \in {\mathbf{R}}^{{{\mathbf{c}} \times {\mathbf{d}}}} \;{\text{and}}\;E_{R}^{V} \in {\mathbf{R}}^{{{\mathbf{c}} \times {\mathbf{d}}}}\), which map the genre time intervals to fixed-size vectors; here R denotes the genre time interval matrix. Therefore, we obtain the mapped low-dimensional matrices \(O_{R}^{K} \in {\mathbf{R}}^{{{\mathbf{n}} \times {\mathbf{n}} \times {\mathbf{d}}}} {\text{ and }}O_{R}^{V} \in {\mathbf{R}}^{{{\mathbf{n}} \times {\mathbf{n}} \times {\mathbf{d}}}}\), one d-dimensional vector per entry of Ri.
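A minimal PyTorch sketch of these five embedding tables follows; the class and argument names are ours, and the handling of padding via index 0 is an assumption.

```python
import torch
import torch.nn as nn

class MGTIEmbedding(nn.Module):
    """Embedding layer sketch: maps movie IDs (E_S), absolute positions
    (E_P^K, E_P^V) and genre time intervals (E_R^K, E_R^V) to d-dim vectors."""
    def __init__(self, num_movies, max_len, max_interval, d):
        super().__init__()
        self.movie_emb = nn.Embedding(num_movies + 1, d, padding_idx=0)  # E_S
        self.pos_emb_k = nn.Embedding(max_len, d)           # E_P^K
        self.pos_emb_v = nn.Embedding(max_len, d)           # E_P^V
        self.int_emb_k = nn.Embedding(max_interval + 1, d)  # E_R^K
        self.int_emb_v = nn.Embedding(max_interval + 1, d)  # E_R^V

    def forward(self, seq, intervals):
        # seq: (batch, n) movie IDs; intervals: (batch, n, n) clipped R^i
        pos = torch.arange(seq.size(1), device=seq.device)
        O_S = self.movie_emb(seq)                                # (batch, n, d)
        O_P_K, O_P_V = self.pos_emb_k(pos), self.pos_emb_v(pos)  # (n, d)
        O_R_K = self.int_emb_k(intervals)                        # (batch, n, n, d)
        O_R_V = self.int_emb_v(intervals)                        # (batch, n, n, d)
        return O_S, O_P_K, O_P_V, O_R_K, O_R_V
```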

4.2 Multi-head Self-attention Mechanism Layer

The multi-head self-attention mechanism gives different weights to movies according to their importance in the time sequence, and it works as follows. Firstly, we calculate the attention weight αij by the following softmax function:

$$ \alpha_{ij} = \frac{{e^{{v^{ij} }} }}{{\sum\nolimits_{k = 1}^{n} {e^{{v^{ik} }} } }}, $$
(2)

where vij is calculated using the low-dimensional matrices \(\left( {O_{S} ,O_{R}^{K} {\text{ and }}O_{P}^{K} } \right)\):

$$ v^{ij} = \frac{{\left( {O_{S} W^{Q} } \right)_{i} \left( {\left( {O_{S} W^{K} } \right)_{j} + \left( {O_{R}^{K} } \right)_{ij} + \left( {O_{P}^{K} } \right)_{j} } \right)^{ \top } }}{\sqrt d }, $$
(3)

where \(W^{Q} \in {\mathbf{R}}^{{{\text{d}} \times {\text{d}}}} \;{\text{and}}\;W^{K} \in {\mathbf{R}}^{{{\text{d}} \times {\mathbf{d}}}}\) are learned projection matrices (implemented as fully connected layers) for the queries and keys, respectively. d is the dimension of the hidden layer, and the factor \(\sqrt d\) avoids excessively large inputs to the softmax.

Secondly, according to the attention weights αij, we calculate the final result of the multi-head self-attention mechanism, i.e., the weighted sum of the values:

$$ Z_{i} = \sum\limits_{j = 1}^{n} {\alpha_{ij} } \left( {\left( {O_{S} W^{V} } \right)_{j} + \left( {O_{R}^{V} } \right)_{ij} + \left( {O_{P}^{V} } \right)_{j} } \right), $$
(4)

where \(W^{V} \in {\mathbf{R}}^{{{\mathbf{d}} \times {\mathbf{d}}}}\) is the learned projection matrix (implemented as a fully connected layer) for the values.
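The following single-head PyTorch sketch implements Eqs. (2)–(4); the full model uses multiple such heads, and the causal mask that prevents attending to future movies is omitted for brevity (both points are our reading of the standard setup, not spelled out in the equations above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGTIAttention(nn.Module):
    """Single-head sketch of Eqs. (2)-(4) with interval- and
    position-aware keys and values (names are ours)."""
    def __init__(self, d):
        super().__init__()
        self.W_Q = nn.Linear(d, d, bias=False)
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)
        self.d = d

    def forward(self, O_S, O_P_K, O_P_V, O_R_K, O_R_V):
        # O_S: (b, n, d); O_P_*: (n, d); O_R_*: (b, n, n, d)
        Q = self.W_Q(O_S)                 # queries, (b, n, d)
        K = self.W_K(O_S) + O_P_K         # keys with position added, (b, n, d)
        # Eq. (3): v^{ij} = Q_i · (K_j + (O_R^K)_{ij})^T / sqrt(d)
        v = (torch.einsum('bid,bjd->bij', Q, K)
             + torch.einsum('bid,bijd->bij', Q, O_R_K)) / self.d ** 0.5
        alpha = F.softmax(v, dim=-1)      # Eq. (2), (b, n, n)
        V = self.W_V(O_S) + O_P_V         # values with position added, (b, n, d)
        # Eq. (4): Z_i = sum_j alpha_ij (V_j + (O_R^V)_{ij})
        Z = (torch.einsum('bij,bjd->bid', alpha, V)
             + torch.einsum('bij,bijd->bid', alpha, O_R_V))
        return Z
```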

4.3 Convolutional Neural Network Layer

In order to improve the model's fitting ability and highlight the importance of local preferences, we add a CNN layer to improve the model's prediction ability.

Firstly, we use a 1D convolutional layer for feature extraction:

$$ F_{i}^{1} = W^{1} Z_{i} + b^{1} $$
(5)

Secondly, the ReLU activation function is applied after the first convolutional layer:

$$ F_{i}^{2} = ReLU\left( {F_{i}^{1} } \right) $$
(6)

Finally, we use another 1D convolutional layer:

$$ F_{i}^{3} = F_{i}^{2} W^{2} + b^{2} $$
(7)

where W1 ∈ Rd×d and W2 ∈ Rd×d are the parameter matrices of the first and second convolutional layers, respectively, b1 ∈ Rd and b2 ∈ Rd are the bias terms, and F1, F2 and F3 are the outputs of the three steps.
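Eqs. (5)–(7) can be realized with two 1D convolutions, as in the sketch below. Since W1 and W2 are d × d, kernel size 1 matches the equations exactly; a larger kernel would widen the local receptive field. The class name and the kernel-size choice are ours.

```python
import torch.nn as nn

class ConvFFN(nn.Module):
    """Sketch of the two-layer 1D convolutional block, Eqs. (5)-(7)."""
    def __init__(self, d, kernel_size=1):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(d, d, kernel_size, padding=pad)  # W^1, b^1
        self.conv2 = nn.Conv1d(d, d, kernel_size, padding=pad)  # W^2, b^2
        self.relu = nn.ReLU()

    def forward(self, Z):
        # Z: (batch, n, d); Conv1d expects (batch, channels=d, n)
        F1 = self.conv1(Z.transpose(1, 2))   # Eq. (5)
        F2 = self.relu(F1)                   # Eq. (6)
        F3 = self.conv2(F2)                  # Eq. (7)
        return F3.transpose(1, 2)            # back to (batch, n, d)
```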

5 Model Prediction

5.1 Prediction Layer

In the multi-head self-attention mechanism layer and the convolutional neural network layer, increasing the number of layers can lead to overfitting, vanishing gradients and long training times. We therefore use layer normalization and dropout regularization to alleviate these problems:

$$ y = x + Dropout(g(LayerNorm(x))) $$
(8)

where g is the multi-head self-attention mechanism layer or the convolutional neural network layer, x is the input and y is the output. We apply layer normalization to the input (x), then apply dropout to the output of the sublayer g, and finally add the input (x) back to the result as a residual connection.
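A minimal sketch of this residual wrapper (Eq. (8)); the class name is ours, and the same wrapper is applied around both the attention layer and the CNN layer.

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual wrapper of Eq. (8): y = x + Dropout(g(LayerNorm(x)))."""
    def __init__(self, d, dropout_rate):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x, sublayer):
        # sublayer: a callable, e.g. the attention module or the ConvFFN
        return x + self.dropout(sublayer(self.norm(x)))
```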

5.2 Loss Function

The binary cross-entropy loss function is commonly used in recommendation systems; it measures the predictive accuracy of the model by the difference between the real labels and the predicted labels. It allows the model to converge quickly and to be updated incrementally without retraining the entire model. We therefore adopt the binary cross-entropy loss function as follows:

$$ - \sum\limits_{{S^{u} \in S}} {\sum\limits_{t \in [1,2, \ldots ,n]} {\left[ {\log \left( {\sigma \left( {r_{{o_{t} ,t}} } \right)} \right) + \log \left( {1 - \sigma \left( {r_{{o^{\prime}_{t} ,t}} } \right)} \right)} \right]} } + \lambda \left\| \varTheta \right\|_{F}^{2} $$
(9)

where \(r_{o_t,t}\) is the predicted score of the positive (ground-truth) movie at step t, \(r_{o^{\prime}_t,t}\) is the score of the negatively sampled movie, Θ = \(\left\{ {O_{S} ,O_{P}^{K} ,O_{P}^{V} ,O_{R}^{K} ,O_{R}^{V} } \right\}\) is the set of low-dimensional embedding matrices, ∥·∥F is the Frobenius norm, and λ is the regularization coefficient.
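A sketch of Eq. (9) in PyTorch, assuming one sampled negative per time step; the masking of padded positions is omitted, and the epsilon term is ours for numerical stability.

```python
import torch

def bce_ranking_loss(pos_logits, neg_logits, params, lam):
    """Sketch of Eq. (9): binary cross-entropy over one positive and one
    sampled negative per step, plus Frobenius-norm regularization.
    `params` is the set of embedding matrices Theta (names are ours)."""
    eps = 1e-8
    loss = -(torch.log(torch.sigmoid(pos_logits) + eps)
             + torch.log(1 - torch.sigmoid(neg_logits) + eps)).sum()
    reg = sum(p.pow(2).sum() for p in params)  # ||Theta||_F^2
    return loss + lam * reg
```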

6 Experimental Evaluation

In this section, we evaluate SSR-MGTI through extensive experiments. Firstly, we present the experimental setup, datasets and evaluation metrics in Sect. 6.1. Then, the results of SSR-MGTI and 7 recommendation baselines (GRU4Rec+ [11], NCF [10], Caser [18], SASRec [13], TiSASRec [14], LSPM [3], SSE-PT [20]) on the MovieLens and Amazon datasets are presented in Sect. 6.2. Finally, we show comparison results against SASRec, TiSASRec and SSE-PT under 3 different hyperparameter settings.

6.1 Experimental Configuration

Experimental Setup.

All the following experiments were performed on an NVIDIA RTX 3090 Ti GPU, and the code was implemented in PyTorch. The dropout rate is 0.2 for the MovieLens dataset and 0.8 for the Amazon dataset.

Datasets.

We evaluate our method on two datasets from two platforms; their statistics after preprocessing are shown in Table 1. MovieLens is the dense dataset, with more actions per user over fewer users and movies. Amazon is the sparse dataset, with fewer actions per user and per movie.

  • MovieLens: This dataset is often used in recommendation system competitions. We use the MovieLens-1M version, with 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 users who joined MovieLens in 2000.

  • Amazon: This dataset records users' reviews on the Amazon website. It is a classic recommendation system dataset, and Amazon keeps updating it. We use the Video_Games subset from the 2014 release (Amazon Video_Games).

Table 1. Dataset statistics (after preprocessing)

Evaluation Metrics.

We use two common top-N metrics to evaluate the performance of our method: Hit Rate@10 and NDCG@10 [8, 10]. Hit Rate@10 is mainly concerned with whether a movie the user likes appears in the recommendations, emphasizing the "accuracy" of prediction. NDCG@10 is more concerned with the "order", emphasizing whether the recommended movie appears at a higher position in the recommendation list.
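Under the common evaluation protocol where each test user's single ground-truth movie is ranked against a set of candidate items (our assumption; the protocol is not detailed above), the two metrics reduce to the following per-user computation:

```python
import numpy as np

def hit_rate_and_ndcg_at_10(rank):
    """Per-user HR@10 and NDCG@10, given the 1-based rank of the
    ground-truth movie among the candidate items."""
    hit = 1.0 if rank <= 10 else 0.0
    ndcg = 1.0 / np.log2(rank + 1) if rank <= 10 else 0.0
    return hit, ndcg

# Averaged over all test users, e.g.:
ranks = [1, 3, 12, 7]
hrs, ndcgs = zip(*(hit_rate_and_ndcg_at_10(r) for r in ranks))
print(np.mean(hrs), np.mean(ndcgs))
```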

6.2 Results and Analysis

Results on Different Recommendation Methods.

We study the performance of our proposed model SSR-MGTI with all baselines on two real-world datasets. Table 2 shows the experimental results of all the methods. It can be observed that:

  (1) SSR-MGTI always achieves the best performance regardless of dataset and evaluation metric, gaining 20.13% Hit Rate and 41.06% NDCG improvements on average over the other methods.

  (2) SSR-MGTI yields a significant improvement in the NDCG metric on both the sparse and dense datasets, up to 25.65% on the dense dataset and up to 56.46% on the sparse dataset.

Table 2. Performance of different recommendation methods. The best performance in each row is boldfaced (higher is better), and the second best method in each row is underlined. Improvements are shown in the last column.

Ablation Study.

We conduct experiments by removing the position, CNN, dropout and layernorm components separately to demonstrate the role of each component of our model. Table 3 shows the performance on the two datasets with the best hyperparameter settings.

Table 3. Ablation analysis (NDCG@10) on two datasets. Performance better than the default version is boldfaced. '↓' indicates a performance drop.

For the MovieLens dataset, removing any of the added components clearly degrades recommendation performance, especially the dropout component. The position, CNN and layernorm components each improve the model to some extent. Adding the absolute position of the user-movie interaction sequence complements the relative position given by the genre time interval, so the connections between movies are captured better. Adding the CNN component not only converts the linear model into a non-linear one, but also extracts short-term preferences. Removing the layernorm component reduces the generalization ability of the model and may also cause vanishing or exploding gradients, so its metric is lower than that of the default model. Removing the dropout component causes the recommendation metric to drop sharply, which shows that dropout greatly affects the recommendation performance of the model. The model's output also shows the evaluation metric fluctuating around 0.6500, which indicates that overfitting is not severe on the dense dataset.

For the Amazon dataset, removing the dropout and layernorm components causes a severe performance drop: removing dropout leads to overfitting (the evaluation metric gradually decreases), and removing layernorm seriously hurts model performance on sparse datasets. In contrast, removing the position and CNN components is beneficial on sparse datasets. Since sparse data contains fewer interactions, the absolute position plays little role and there is little difference between short-term and long-term features, so adding position and CNN mainly introduces noise into the data, and removing them improves the recommendation performance of the model.

Comparison of 3 Different Hyperparameter Settings.

We compare 3 hyperparameters (i.e., the maximum genre time interval, the maximum sequence length n and the number of attention heads) across the SASRec, TiSASRec, SSE-PT and SSR-MGTI models; the maximum movie genre time interval is only compared with TiSASRec, since the other baselines do not model time intervals.

(1) Influence of the maximum genre time interval.

The maximum genre time interval is an upper bound on the genre time interval: computed intervals larger than this value are replaced by it during training, so it has a great impact on recommendation performance. Figure 2 shows the effect of the maximum time interval for TiSASRec and SSR-MGTI on the two datasets. From Fig. 2, we can see that SSR-MGTI always achieves better performance than TiSASRec on both datasets, gaining 7.41% Hit Rate and 22.16% NDCG improvements on average. This is because our method adds movie genre features to the time interval, which enables the model to accurately capture the temporal information between movie genres in user-movie interaction sequences.

Fig. 2. Effect of maximum genre time interval on ranking performance.

(2) Influence of the maximum sequence length n.

The maximum sequence length n refers to the length of the user-movie interaction sequence, which determines how much data can be fed into the model for training. From the results, we can see that SSR-MGTI gains 7% Hit Rate and 18.99% NDCG improvements on average on the MovieLens dataset. As shown in Fig. 3 (a) and (b), the recommendation performance of the TiSASRec, SSE-PT and SSR-MGTI models increases as n increases, while SASRec rises first and then declines. This may be because SASRec uses fewer features, so as n increases the sequences are padded with a large number of zeros, which degrades recommendation performance. For the Amazon dataset, SSR-MGTI gains 8.95% Hit Rate and 28.04% NDCG improvements on average. As shown in Fig. 3 (c) and (d), the recommendation performance of all four models decreases slightly as n increases.

Fig. 3. Effect of maximum sequence length n on ranking performance.

(3) Influence of the number of heads of multi-head self-attention.

Since the number of heads of multi-head self-attention enables the network to capture the user's interests from multiple aspects, it has a great impact on recommendation performance. As shown in Fig. 4 (a) and (b), on the MovieLens dataset the recommendation performance of the SASRec and SSE-PT models increases significantly with the number of heads, while the performance of the TiSASRec and SSR-MGTI models first increases and then decreases. This is because too many heads generate a large number of parameters, causing overfitting. On the MovieLens dataset, SSR-MGTI gains 8.21% Hit Rate and 21.74% NDCG improvements on average. On the Amazon dataset (Fig. 4 (c) and (d)), SSR-MGTI gains 7.37% Hit Rate and 25.12% NDCG improvements on average.

On the Amazon dataset, the performance of SSE-PT improves slightly, while the performance of SSR-MGTI, TiSASRec and SASRec decreases as the number of heads increases. This is because the dataset is relatively sparse, so there is less hidden information for additional heads to capture.

Fig. 4. Effect of the number of heads of multi-head self-attention on ranking performance.

7 Conclusion

In this paper, we proposed a Self-Attention Sequential Recommendation algorithm based on Movie Genre Time Interval (SSR-MGTI). Firstly, we gave the definition of the Movie Genre Time Interval (MGTI), on which a multi-head self-attention mechanism is built. Then, we added the absolute position of each movie in the user-movie interaction sequence to compensate for the order-insensitivity of the multi-head self-attention mechanism. In addition, we used a convolutional neural network to introduce non-linearity into the model and to extract local information from user-movie interaction sequences. The experimental results show that our proposed recommendation scheme achieves 20.13% and 41.06% improvements in HR@10 and NDCG@10, respectively, over other state-of-the-art schemes on the dense (MovieLens) and sparse (Amazon) datasets.