
1 Introduction

As living standards improve, watching movies has become an everyday way for people to relax. In movie recommendation, the traditional collaborative filtering algorithm relies mainly on users’ ratings of movies. Other characteristics of users and movies, such as a user’s personal attributes, a movie’s category attributes, and the relationship between a user’s viewing history and movies, are mostly left underused. As a result, such algorithms cannot characterize user behavior preferences at a fine grain.

Deep learning has shown great potential for learning effective representations. It can learn features automatically, for example by learning feature representations from text content and then extracting useful information from them. However, many existing models and methods still fail to extract the key information from the data.

Deep learning models have their own limitations. Recurrent models such as RNN [1] are bound by sequential computation: they cannot be parallelized across time steps, which leads to speed and efficiency problems on large data sets. CNN [2], by contrast, parallelizes easily and captures some global structural information, but it is strong at feature detection and weak at feature understanding.

To overcome these difficulties, this paper proposes a user preference learning model that integrates the attention mechanism. It has the following advantages: (1) It uses deep learning models to learn deep semantic information automatically, without relying on any manually designed feature extraction rules. (2) It introduces a multi-head attention mechanism, which adaptively combines context information and extracts text features from it, thereby improving the accuracy of key information acquisition. (3) It incorporates the movie’s profile information so that features can be extracted from multiple dimensions at a deeper level.

The rest of this paper is organized as follows. Related work is described in Sect. 2. Section 3 details the user preference learning model based on the attention mechanism. Section 4 evaluates the model through experiments. Section 5 summarizes the conclusions and the outlook for future work.

2 Related Work

In current research on recommendation algorithms, collaborative filtering is among the most widely studied and widely adopted approaches, but traditional collaborative filtering inevitably suffers from the cold start and data sparsity problems [3]. By comparison, deep learning-based recommendation systems have received less attention. Deep neural network models are now increasingly used in a variety of tasks, including text summarization, information retrieval, and relation recognition [4, 5, 32].

In deep learning research, Socher et al. [1, 6] used recurrent neural networks to learn the logical relationships between sentences and applied them to sentiment analysis and sentence-level semantic relation recognition. Kim [2] used word vectors trained by word2vec to map a sentence into a two-dimensional feature matrix and used a convolutional neural network to extract the features of the sentence. Tang et al. [7] achieved document-level sentiment classification based on the compositional principle of text semantics, combining convolutional and recurrent neural networks. Xie et al. [8] proposed the DKRL model, which represents entity vectors through entity descriptions and encodes the semantics of those descriptions with a continuous bag-of-words model and a convolutional neural network; however, this model does not screen the entity information.

With the continued development of deep learning, using attention mechanisms to optimize neural networks [9, 11] has become a hot topic. Bahdanau et al. [10] were the first to apply the attention mechanism to NLP, jointly learning to translate and align on the machine translation task, and achieved good performance. Later, Luong et al. [12] proposed global and local attention mechanisms, which extended the ways attention can be calculated. Xu et al. [13] proposed a joint representation model based on BLSTM that uses the attention mechanism to select the relevant information in the entity description and a gate mechanism to control the weight of the structural entity information. This method improves performance significantly, but the hidden states of the BLSTM must be generated in order, so training cannot be parallelized and efficiency suffers.

Although deep learning captures features well, different models have different limitations. We propose a user preference learning model that integrates the attention mechanism. It not only screens the entity information but also captures the feature information. It can build long-distance dependencies between sentences, which removes the limitation that distance places on features, and it can be parallelized. Experiments show that our method performs better than previous methods.

3 Model Building

This section proposes a user preference learning model that integrates the attention mechanism. The model consists mainly of a CNN-BLSTM feature extraction module, a multi-head attention module, and a recommendation table generation module. A schematic diagram of the model is shown in Fig. 1. First, the user’s features are extracted by a multi-layer neural network. Then the movie feature matrix is extracted by CNN-BLSTM, the rating matrix is reduced in dimension, and both are placed in the embedding layer. Next, the user features and the movie features processed by the attention mechanism are further processed in the fully connected layer. Finally, the resulting vectors are used to calculate similarity, which completes the recommendation task.

Fig. 1. User preference learning model architecture diagram with attention mechanism.

3.1 Questions Raised

In movie recommendation, the traditional collaborative filtering algorithm takes users’ ratings of movies as the basis for recommendation [14, 15]. It builds a preference model from each user’s history and uses it to find users who have seen the same or similar movies. Movies from those users’ histories are then recommended to the target user, forming that user’s recommendation list.

Traditional recommendation algorithms often use only the rating data [16]. Other feature data of users and movies, such as a user’s age, occupation, or gender and a movie’s introduction, type, or label [17], is not fully used. The result is a lack of fine-grained description of user behavior preferences, whereas in practice a more accurate analysis of these preferences would greatly improve recommendation accuracy.

Deep learning can learn nonlinear, multi-level abstract feature representations, and the features it acquires are typically dense and low-dimensional [18], which traditional collaborative filtering models cannot provide. We capture key information by feeding text data into a convolutional neural network and extracting features [19]. At the same time, we use the attention mechanism to learn the hidden relationship between user and movie features automatically and combine it with the relationships among movie ratings. This reflects the behavior preferences of different users more effectively and thus improves the accuracy of the recommendation system.

3.2 Feature Extraction

The user preference learning model that integrates the attention mechanism not only captures the key information of users and movies, but also considers the interdependence between users and movies, so that the effective information shared between them can be separated out and mined. To further capture the information that relates users and movies, this paper also introduces a variant of the attention mechanism, the multi-head attention mechanism, and integrates it with the CNN-BLSTM model to solve this problem more effectively.

Feature Extraction Based on CNN-BLSTM.

We treat the brief introduction of a movie as a sequence of sentences: the CNN-BLSTM model divides the introduction of a single movie into multiple sentences. Compared with feeding the entire text as one input, this lets the convolutional layer extract a feature representation for each sentence in turn. The LSTM [17, 18] then integrates these sentence features in order to build the feature representation of the whole text, so that the key information of the text is captured more accurately. Figure 2 shows the structure of the CNN-BLSTM model.

The first layer is the input layer. The matrix \( M_{i}^{d} = \left[ {m_{1} \ldots m_{d} } \right] \) represents the input sequence of a movie, and the matrix \( U_{i}^{d} = \left[ {u_{1} \ldots u_{d} } \right] \) represents the user’s input. The movie input sequence is used as the input of the CNN-BLSTM model, while the user’s input matrix is passed directly to the convolutional neural network to extract features.

Fig. 2. CNN-BLSTM model diagram.

The second layer is the convolutional layer. Drawing on the structure of the CNN model proposed by Kim [2, 19, 31], this layer uses multiple convolution filters to extract multiple sets of local feature maps. Given the input matrix \( {\text{S}}_{ij} \), for each row vector m in the matrix, a convolution operation is performed using a filter with a window size of k. The result of the convolution can be expressed as:

$$ y_{i} = { \mathcal{F}}\left( {W_{c} \; \otimes \;{\text{m}}_{i:i + k - 1} + b_{c} } \right) $$
(1)

where \( y_{i} \) is the \( i \)th element of the feature map, \( {\mathcal{F}} \) is a nonlinear activation function, \( W_{c} \) is the coefficient matrix, \( b_{c} \) is the bias term, and \( {\text{m}}_{i:i + k - 1} \) is the local word window composed of k words. Sliding the word window from \( {\text{m}}_{1:k} \) to \( {\text{m}}_{n - k + 1:n} \) over a sequence of n words yields n − k + 1 values, which form the feature map:

$$ {\text{c}} = \left\{ {y_{1} , \ldots ,y_{n - k + 1} } \right\} $$
(2)
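As a rough illustration of Eqs. (1) and (2), the sketch below slides one filter over a toy sequence of word vectors. The function name `conv_feature_map`, the choice of tanh for the activation \( {\mathcal{F}} \), and all numbers are assumptions made for this example; the paper does not specify them.

```python
import math

def conv_feature_map(words, W, b, activation=math.tanh):
    """Slide a k-word window over `words` and return the feature map of Eq. (2).

    words: list of n word vectors (each a list of d floats)
    W:     filter weights, a list of k rows with d floats each
    b:     scalar bias
    The activation defaults to tanh, which is an assumption of this sketch.
    """
    k, n = len(W), len(words)
    feature_map = []
    for i in range(n - k + 1):                 # one window m_{i:i+k-1} per step
        window = words[i:i + k]
        total = sum(w * x
                    for w_row, x_row in zip(W, window)
                    for w, x in zip(w_row, x_row))
        feature_map.append(activation(total + b))
    return feature_map

# Toy input: n = 5 words with d = 2 dimensions, filter window k = 3.
words = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.1], [-0.2, 0.4], [0.3, 0.3]]
W = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]
c = conv_feature_map(words, W, b=0.05)
print(len(c))  # number of windows, n - k + 1
```

With n = 5 and k = 3 the window can be placed at three positions, so the feature map has three elements.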

The third layer is the sampling layer. It extracts the most representative feature of each feature map to obtain a feature representation at the sentence level. The feature map is sampled with the max-over-time pooling method proposed by Collobert [20], which yields the feature value \( {\hat{\text{c}}} = {\text{max}}\left\{ c \right\} \). Each filter type in the convolutional layer is defined by its window size and extracts one kind of local feature. We use h types of filters and extract m feature maps with each type, so that context of different widths is taken into account. After the max pooling operation of the sampling layer is applied to the feature maps of all filter types, we obtain the feature vector s of length \( {\text{h}} \times {\text{m}} \) as the output:

$$ {\text{s}} = \left\{ {{\hat{\text{c}}}_{{1,h_{1} }} ,\, \ldots ,\;{\hat{\text{c}}}_{{m,h_{1} }} ,\, \ldots ,\,{\hat{\text{c}}}_{{1,h_{k} }} ,\, \ldots ,\,{\hat{\text{c}}}_{{l,h_{k} }} ,\, \ldots } \right\} $$
(3)

where \( {\hat{\text{c}}}_{{l,h_{k} }} \) is the \( l \)th feature value \( \left( {1 \le l \le m} \right) \) produced by the filter of the \( k \)th type \( \left( {1 \le k \le h} \right) \).
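The pooling step of Eq. (3) can be sketched as follows. The feature maps here are made-up numbers for two hypothetical filter types (window sizes 3 and 4) with two feature maps each; only the max-over-time operation and the concatenation order follow the text.

```python
def max_over_time(feature_maps):
    """Keep the largest activation of each feature map (c_hat = max{c})."""
    return [max(c) for c in feature_maps]

# Hypothetical feature maps computed on a 6-word clause:
# h = 2 filter types, m = 2 feature maps per type.
maps_by_filter_type = {
    3: [[0.1, 0.7, -0.2, 0.4], [0.0, -0.1, 0.3, 0.2]],   # length 6 - 3 + 1
    4: [[0.5, 0.2, 0.1], [-0.3, -0.6, -0.1]],             # length 6 - 4 + 1
}

# Concatenate the pooled values type by type to form s of length h * m.
s = []
for window_size in sorted(maps_by_filter_type):
    s.extend(max_over_time(maps_by_filter_type[window_size]))

print(s)  # [0.7, 0.3, 0.5, -0.1]
```

Because pooling keeps one value per feature map, the length of s depends only on h and m, not on the length of the clause.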

The fourth layer is the LSTM layer. Its main purpose is to control how much historical information each LSTM unit retains and how much of the current input it memorizes [18, 21, 22, 30], so that important features are kept and unimportant features are discarded. This layer consists of the input gate, the forget gate, the cell state and the output gate.

$$ i_{t} = \upsigma\left( {W_{i} s_{t} + U_{i} h_{t - 1} + b_{i} } \right) $$
(4)
$$ f_{t} = \upsigma\left( {W_{f} s_{t} + U_{f} h_{t - 1} + b_{f} } \right) $$
(5)
$$ g_{t} = {\text{tanh}}\left( {W_{g} s_{t} + U_{g} h_{t - 1} + b_{g} } \right) $$
(6)
$$ o_{t} = \upsigma\left( {W_{o} s_{t} + U_{o} h_{t - 1} + b_{o} } \right) $$
(7)

Here \( s_{t} \) is the current input, \( h_{t - 1} \) is the previous hidden state, and \( W \), \( U \) and \( b \) denote the input weights, recurrent weights and bias of each gate. The new cell state combines the previous cell state with the new information generated from the current input and the previous hidden state: \( c_{t} = i_{t} g_{t} + f_{t} c_{t - 1} \) (the initial memory cell \( c_{0} \) is set to 0). Finally, the current hidden state is obtained by multiplying the output gate by the tanh of the current cell state:

$$ h_{t} = o_{t} { \tanh }\left( {c_{t} } \right) $$
(8)
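The gate equations (4) to (8) can be traced step by step with a small sketch. It assumes the standard LSTM formulation with separate input weights W and recurrent weights U per gate; the helper names (`affine`, `lstm_step`, `toy_gate`) and the toy weights are hypothetical and chosen only so the arithmetic is easy to follow. A bidirectional layer would run the same step over the reversed sequence as well and concatenate the two hidden states.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, s, U, h, b):
    """Return W s + U h + b for list-of-lists matrices and list vectors."""
    return [sum(w * x for w, x in zip(W_row, s))
            + sum(u * x for u, x in zip(U_row, h))
            + b_j
            for W_row, U_row, b_j in zip(W, U, b)]

def lstm_step(s_t, h_prev, c_prev, params):
    """One LSTM step: gates (4)-(7), cell update, hidden state (8)."""
    gates = {}
    for name in ("i", "f", "g", "o"):
        W, U, b = params[name]
        pre = affine(W, s_t, U, h_prev, b)
        squash = math.tanh if name == "g" else sigmoid
        gates[name] = [squash(v) for v in pre]
    # c_t = i_t * g_t + f_t * c_{t-1}, elementwise
    c_t = [i * g + f * c for i, g, f, c in
           zip(gates["i"], gates["g"], gates["f"], c_prev)]
    # h_t = o_t * tanh(c_t), elementwise
    h_t = [o * math.tanh(c) for o, c in zip(gates["o"], c_t)]
    return h_t, c_t

def toy_gate(scale):
    """Hypothetical 2x2 weights for one gate (illustration only)."""
    W = [[scale, 0.0], [0.0, scale]]
    U = [[0.1, 0.0], [0.0, 0.1]]
    b = [0.0, 0.0]
    return W, U, b

params = {name: toy_gate(scale) for name, scale in
          [("i", 0.5), ("f", 0.3), ("g", 0.8), ("o", 0.4)]}

h, c = [0.0, 0.0], [0.0, 0.0]            # c_0 = 0 as stated in the text
for s_t in [[1.0, -1.0], [0.5, 0.2], [-0.3, 0.9]]:
    h, c = lstm_step(s_t, h, c, params)
print(h, c)
```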

Therefore, for a long text consisting of a sequence of K clauses, the clause feature vectors \( h_{1} , h_{2} , \ldots ,h_{K} \) are obtained by the CNN and the bidirectional LSTM.

Multi-head Attention.

As deep learning research has advanced, the attention mechanism has been widely applied to various deep learning-based natural language processing tasks [23]. The method rests on the following assumption: different user relationships, ratings, and movies differ in relevance, so important relationships or ratings are assigned more attention while the others receive less [24, 25]. The key is how to distribute attention autonomously, without relying on external information [26].

Given a \( Query \) in the \( Target \), the weight coefficient of each \( Value \) is obtained by calculating the correlation between the \( Query \) and the \( Key \) of each Key-Value pair in the \( Source \). The Values are then summed with these weights to give the Attention value:

$$ {\text{Attention}}\left( {Query,Source} \right) = \sum\nolimits_{i = 1}^{l} {Similarity\left( {Query,Key_{i} } \right) \cdot Value_{i} } $$
(9)
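A minimal sketch of Eq. (9) is given below. The paper leaves the Similarity function open; this example assumes a dot product followed by softmax normalization, which is one common choice, and the function name `attention` and the toy vectors are illustrative only.

```python
import math

def attention(query, keys, values):
    """Eq. (9): weight each value by the similarity of its key to the query.

    Similarity is assumed to be softmax-normalized dot product.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]      # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
output, weights = attention(query, keys, values)
print(weights)  # the first and third keys match the query and share the largest weight
```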

To better learn global dependency information within the text, we use the multi-head attention mechanism proposed by Vaswani et al. [27]. Multi-head attention computes several scaled dot-product attentions in parallel rather than a single one. The outputs of the independent attention units are then concatenated and passed through a linear layer that converts them to an output of the desired dimension.

Fig. 3. Multi-head attention model.

As shown in Fig. 3, we apply n linear mappings to the Q, K and V matrices, learning different linear projection matrices of sizes \( d_{n} \times d_{q} \), \( d_{n} \times d_{k} \) and \( d_{n} \times d_{v} \). Scaled dot-product attention is then executed separately on each set of projected matrices. Self-attention is the special case \( X = Query = Key = Value \), in which attention is used to find the interdependence within a single sequence. It can therefore counter the weakening of dependencies that occurs when the text is too long. Finally, the \( d_{n} \times d_{k} \) output matrices of the heads are concatenated and projected once more to give the final value.

We map Q, K, and V to lower-dimensional subspaces and then perform the attention calculation in each subspace. The lower dimension reduces the amount of computation, facilitates parallelization, and lets different heads capture features from different representation subspaces at different positions. Compared with directly using the dot product of the linearly mapped Q and K as the weighting coefficients for the weighted sum of V, scaling keeps the dot products from growing too large, which would otherwise push the softmax into regions where the gradient is extremely small. We scale the dot product by \( \frac{1}{{\sqrt {d_{k} } }} \) to get:

$$ head_{i} = {\text{Attention}}\left( {QW_{i}^{Q} ,KW_{i}^{K} ,VW_{i}^{V} } \right) = softmax\left( {\frac{{QW_{i}^{Q} \left( {KW_{i}^{K} } \right)^{T} }}{{\sqrt {d_{k} } }}} \right)VW_{i}^{V} $$
(10)

Finally, concatenating the results of all heads, we get the output feature value \( {\text{s}} \):

$$ {\text{s}} = MultiHead\left( {Q,K,V} \right) = Concat\left( {head_{1} , \ldots ,head_{n} } \right)W^{O} $$
(11)
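Equations (10) and (11) can be followed in a small pure-Python sketch of multi-head self-attention. The helper names, the two hypothetical heads (each simply selecting half of the input dimensions) and the identity output projection are assumptions for illustration; in the actual model all projection matrices would be learned.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(M):
    out = []
    for row in M:
        top = max(row)
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in Eq. (10)."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[x / math.sqrt(d_k) for x in row] for row in matmul(Q, K_T)]
    return matmul(softmax_rows(scores), V)

def multi_head(Q, K, V, heads, W_O):
    """Eq. (11). heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    outputs = [scaled_dot_product_attention(matmul(Q, W_Q), matmul(K, W_K), matmul(V, W_V))
               for W_Q, W_K, W_V in heads]
    # Concatenate the head outputs row by row, then apply the output projection.
    concat = [sum((head[i] for head in outputs), []) for i in range(len(Q))]
    return matmul(concat, W_O)

# Self-attention on 3 positions with 4 features: X = Query = Key = Value.
X = [[1.0, 0.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, 0.5],
     [1.0, 1.0, 0.5, 0.5]]
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]    # hypothetical head 1
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # hypothetical head 2
heads = [(first_half,) * 3, (second_half,) * 3]
W_O = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

s = multi_head(X, X, X, heads, W_O)
print(len(s), len(s[0]))  # 3 4
```

Each head works in a 2-dimensional subspace, so the concatenated output has the same width as the input before the final projection.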

3.3 Recommendation Table Generation

This section describes the process of generating the top-K recommendation table. We extract the user feature representation through the convolutional neural network, which gives the user’s latent feature set \( S_{u} \). We then use CNN-BLSTM together with the multi-head self-attention mechanism to calculate the latent feature set of the movies \( S_{m} \). Next, the movie rating matrix is reduced in dimension using the PCA method [28] to obtain the movie rating set \( S_{r} \).

First, we calculate the similarity \( Similarity_{1} \) between the user preference features and each unrated item, and select the top K items as the candidate list \( C_{k1} \):

$$ Similarity_{1} = \frac{{\left( {S_{u}^{i} } \right)^{T} S_{m}^{i} }}{{\left\| {S_{u}^{i} } \right\| \cdot \left\| {S_{m}^{i} } \right\|}} $$
(12)

Then, the similarity \( Similarity_{2} \) between each unrated item and the items the user prefers is calculated in turn, and the top K items are again selected as the candidate list \( C_{k2} \):

$$ Similarity_{2} = \left( {\frac{{\left( {S_{m}^{i} } \right)^{T} S_{m}^{j} }}{{\left\| {S_{m}^{i} } \right\| \cdot \left\| {S_{m}^{j} } \right\|}}} \right) + \left( {\frac{{\left( {S_{r}^{i} } \right)^{T} S_{r}^{j} }}{{\left\| {S_{r}^{i} } \right\| \cdot \left\| {S_{r}^{j} } \right\|}}} \right) $$
(13)

Here \( S_{r}^{i} \) and \( S_{m}^{i} \) together constitute the latent representation of item \( i \), and \( S_{r}^{j} \) and \( S_{m}^{j} \) together constitute the latent representation of item \( j \).

The candidate lists \( C_{k1} \) and \( C_{k2} \) are merged into the recommendation list \( C_{k} \). For each item in it, the user relevance \( Similarity \) is calculated from the user preference similarity \( Similarity_{1} \) and the average similarity \( \overline{{ Similarity_{2} }} \) to the user’s preferred items. Assuming that the user has \( q \) preferred items, the average is:

$$ \overline{{ Similarity_{2} }} = \frac{1}{q}\sum\nolimits_{i = 1}^{q} {Similarity_{2} } $$
(14)

The user relevance is then:

$$ Similarity = \varepsilon \cdot \left( {\overline{{ Similarity_{2} }} } \right) + \left( {1 - \varepsilon } \right) \cdot Similarity_{1} $$
(15)

where ε is an adjustable parameter. Finally, the items are ranked by \( Similarity \) and the top K are recommended to the user.
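The scoring in Eqs. (12) to (15) can be sketched as below. For brevity the sketch scores every candidate directly instead of first building and merging \( C_{k1} \) and \( C_{k2} \), and it assumes the user has at least one preferred item (q ≥ 1). The function names, ε = 0.5 and the toy vectors are assumptions of this example.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def relevance(user_vec, liked, candidate, eps=0.5):
    """User relevance of one unrated candidate movie.

    liked is a list of (S_m, S_r) pairs for the user's preferred movies and
    candidate is one (S_m, S_r) pair: content features and reduced rating features.
    """
    sim1 = cosine(user_vec, candidate[0])                        # Eq. (12)
    sim2 = [cosine(m, candidate[0]) + cosine(r, candidate[1])    # Eq. (13)
            for m, r in liked]
    mean_sim2 = sum(sim2) / len(sim2)                            # Eq. (14)
    return eps * mean_sim2 + (1 - eps) * sim1                    # Eq. (15)

def recommend(user_vec, liked, candidates, k, eps=0.5):
    """Rank candidates by relevance and return the top k movie ids."""
    scored = {movie: relevance(user_vec, liked, feats, eps)
              for movie, feats in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

user_vec = [1.0, 0.0]
liked = [([1.0, 0.0], [1.0, 0.0])]
candidates = {
    "A": ([1.0, 0.0], [1.0, 0.0]),
    "B": ([0.0, 1.0], [0.0, 1.0]),
    "C": ([1.0, 1.0], [1.0, 0.0]),
}
print(recommend(user_vec, liked, candidates, k=2))  # ['A', 'C']
```

Movie A matches both the user vector and the liked movie exactly, so it scores 0.5 · 2 + 0.5 · 1 = 1.5, while movie B is orthogonal to both and scores 0.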

4 Experiments and Analysis

4.1 Dataset and Preprocessing

This paper uses MovieLens as the experimental data set and augments it. MovieLens comes in several versions, of which we mainly use the 1M and 10M data sets. We join the movies.dat table of the 1M data set with the tags.dat table of the 10M data set on the movie ID field, which adds a new tag field to movies.dat in the 1M data set. We also add a movie profile field to movies.dat so that features can be extracted from more dimensions of the movie collection.

Since the lengths and types of the raw fields are inconsistent, the data must be preprocessed before the fields are fed into the embedding layer.

users.dat table:

  • gender field: Convert ‘F’ and ‘M’ to 0 and 1.

  • age field: Map each age group (one group per ten-year range) to a consecutive integer.

movies.dat table:

  • type field: Build a dictionary from type text to numbers. Because a movie can belong to multiple types, the type field of each movie is converted into a list of numbers.

  • label field: Same as the type field; the descriptions in the label are converted into a list of numbers.

  • introduction field: Word vectors are learned without supervision using word2vec [29] and stored in the vocabulary. The word is the basic unit of a sentence, and each word is represented by its corresponding word vector.
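The gender, age and type encodings listed above might look like the following sketch. It assumes the `::`-separated layout of the MovieLens 1M files (UserID::Gender::Age::Occupation::Zip-code for users and MovieID::Title::Genres with `|`-separated genres for movies); the function names and the in-line sample rows are illustrative, and the label and introduction fields are omitted.

```python
def encode_users(lines):
    """Encode gender as 0/1 and age groups as consecutive integers."""
    rows = [line.strip().split("::") for line in lines]
    age_codes = sorted({int(r[2]) for r in rows})
    age_index = {code: i for i, code in enumerate(age_codes)}
    gender_index = {"F": 0, "M": 1}
    return [{"user_id": int(r[0]),
             "gender": gender_index[r[1]],
             "age": age_index[int(r[2])]} for r in rows]

def encode_types(lines):
    """Build a text-to-number dictionary and turn each type field into a list of numbers."""
    vocab, encoded = {}, {}
    for line in lines:
        movie_id, _title, types = line.strip().split("::")
        # setdefault assigns the next free id the first time a type is seen
        encoded[int(movie_id)] = [vocab.setdefault(t, len(vocab))
                                  for t in types.split("|")]
    return encoded, vocab

# Sample rows in the assumed layout (for illustration only).
user_lines = ["1::F::1::10::48067", "2::M::56::16::70072", "3::M::25::15::55117"]
movie_lines = ["1::Toy Story (1995)::Animation|Children's|Comedy",
               "2::Jumanji (1995)::Adventure|Children's|Fantasy"]

users = encode_users(user_lines)
movie_types, type_vocab = encode_types(movie_lines)
print(users[1])        # {'user_id': 2, 'gender': 1, 'age': 2}
print(movie_types[2])  # [3, 1, 4]
```

Building the age index from the codes that actually occur keeps the mapping consecutive even if some age groups are absent from the data.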

4.2 Experimental Parameters

When training the model, we use 80% of the MovieLens data set as the training set and the remaining 20% as the test set. The learning rate is set to 0.0001. In the training phase, the filter window sizes of the convolutional neural network are set to 3, 4, and 5. To prevent over-fitting, dropout with a rate of 0.4 is added to the neural network of the text processing module. The number of training epochs is set to 10 and the mini-batch size is set to 128.

4.3 Experimental Results and Analysis

The experiment uses F1, MAP and NDCG to evaluate six models: the user-based collaborative filtering algorithm (UBCF), the item-based collaborative filtering algorithm (IBCF), CNN-AT, bidirectional LSTM, the deep learning model without multi-head attention, and the deep learning model with the multi-head attention mechanism. The results of the comparison are shown in Table 1.
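For reference, the three metrics can be computed for a single user's ranked list as in the sketch below. The paper does not give its exact definitions, so these are common binary-relevance formulations and may differ in detail from the ones used in the experiments; MAP is the mean of the average precision over all users. The item ids are made up.

```python
import math

def f1_at_k(ranked, relevant, k):
    hits = len(set(ranked[:k]) & relevant)
    if hits == 0:
        return 0.0
    precision, recall = hits / k, hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

def average_precision(ranked, relevant, k):
    """Average of precision@rank over the ranks where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / min(len(relevant), k) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    """DCG with binary gains divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["A", "B", "C", "D"]        # recommended items, best first
relevant = {"A", "C", "E"}           # items the user actually liked
f1 = f1_at_k(ranked, relevant, k=4)
ap = average_precision(ranked, relevant, k=4)
ndcg = ndcg_at_k(ranked, relevant, k=4)
print(round(f1, 4), round(ap, 4), round(ndcg, 4))
```

In this toy case two of the four recommended items are relevant, giving a precision of 1/2 and a recall of 2/3, so F1 is 4/7.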

Table 1. Recommendation performance comparison

As can be seen from Table 1, the overall performance of UBCF and IBCF is poor, mainly because they lack an understanding of the features. The CNN-AT and BLSTM models perform relatively well, and both handle entity information well. The CNN-BLSTM model combines the advantages of the two, so it is slightly better than the general deep learning models. Finally, the CNN-BLSTM model with the multi-head attention mechanism outperforms all of the former models. We also compared different values of K and found that recommending more items improves the recommendation effect, but once the number of recommended items reaches a certain level the improvement becomes relatively small.

5 Conclusion and Expectation

In this paper, we propose a user preference learning model that incorporates attention mechanisms to recommend movies to users. We combine a convolutional neural network, BLSTM and the attention mechanism to process user data and movie data and extract features from them, and finally combine these features with movie ratings to generate the recommendation table. We also use the multi-head attention mechanism to capture, in parallel, features from different subspaces at different positions, without being limited by dependency distance. A comprehensive experimental study demonstrates the effectiveness of the proposed model on movie recommendation tasks.

However, our model only extracts information from users and from some dimensions of the movie; it does not capture features from further dimensions, such as movie posters or movie plots. In the future we will therefore consider improving the model by combining different features and different models to obtain more effective information from more dimensions and thus understand user preferences better, providing users with higher quality and more precise recommendations. We will also try to apply the model to recommendation in other areas.