1 Introduction

Over the past decades, with the rapid growth of information, recommendation systems [1] have played a key role in numerous domains, such as news [2], e-commerce [3,4,5], and online advertising [6, 7]. Inspired by the successful applications of deep learning in computer vision [8] and natural language processing [9, 10], deep neural network-based approaches have also been extended to the field of recommendation systems.

Deep neural network-based models have been proposed to learn feature interactions; representative models include Wide & Deep, PNN, and DIN. Wide & Deep [11] combines a linear model and a nonlinear model to learn both low-order and high-order feature interactions, but its linear part still relies on manual feature engineering, which limits model performance. PNN [12] introduces a product layer to better capture feature interactions. Since the user behavior sequence is important for mining user interests, models such as DIN [13], DIEN [14], and DMR [15] obtain the user representation from the user’s historical behavior to reflect user interests. However, these models have two shortcomings. On the one hand, when learning the user representation, they pay little attention to the relationship between target items in the user behavior sequence. On the other hand, when using context information to reflect the variations of user interests, the attention units in these models can hardly express the diversity of user preferences. As a result, these models fail to capture the real interests of the user, which in turn makes the CTR prediction results inaccurate.

To extract the user’s real interests more accurately, a new model named interest extraction method based on multi-head attention mechanism (IEN) is proposed in this paper. Specifically, we design an interest extraction module which consists of two sub-modules: the item representation module (IRM) and the context–item interaction module (CIM). In the IRM, the relationship between target items in the user behavior sequence is learned with the multi-head attention mechanism, which helps to obtain refined item representations. Then, the user representation is gained by integrating the refined item representations and position information. Finally, the correlation between the user and the target item is calculated by the inner product. In the CIM, a multi-head attention mechanism is used to learn the feature interaction between the context and the target item to further capture the user’s interests.

The main contributions of this paper are summarized as follows. We propose a new model named interest extraction method based on multi-head attention mechanism (IEN) to capture user interests. On the one hand, the item representation module (IRM) is introduced to learn the relationship between target items in the user behavior sequence and acquire refined item representations. On the other hand, the context–item interaction module (CIM) is designed to capture the feature interaction between the context and the target item by utilizing the multi-head attention mechanism.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 details our model structure. We perform experimental validation and analysis in Sect. 4. Finally, in Sect. 5, we summarize our work and point out the direction of future work.

2 Related work

Early CTR prediction models are based on LR and its variants [16]. LR is a linear model and lacks the ability to learn complex feature interactions, which makes its feature representation weak. To overcome this drawback, the FM [17] model is proposed, which can learn second-order feature interactions. Meanwhile, FFM [18] and FWFM [19] are proposed based on FM. FFM introduces the concept of “field” into feature interactions, while FWFM learns different feature interactions by the inner product of embedding vectors and field weights. However, these methods cannot fully exploit data features that can be grouped by certain rules in real scenarios. Since clustering [20,21,22,23] and classification [24] can divide the features into groups, they are applied to solve this problem in CTR prediction. In addition, NFM [25] combines FM and neural networks to improve model performance. Learning second-order feature interactions can improve the performance of the model, but some redundant feature interactions may introduce noise. Hence, AFM [26] is proposed to learn feature interactions with an attention mechanism.

Recently, recommendation models based on deep neural networks (DNN) have received much attention and achieved remarkable results. Among them, Wide & Deep [11] combines linear and nonlinear deep models to learn low-order and high-order feature interactions, but its linear part still relies on manual feature engineering, which results in inferior model performance. To address this problem, DeepFM [27] combines the power of DNN and FM for feature representation. In addition, DCN [28] applies a novel cross-network to automatically learn high-order features. PNN [12] is proposed to better capture high-order feature interactions by the product layer. Besides, xDeepFM [29] is proposed to learn feature interactions by a compressed interaction network. In general, the above deep models improve recommendation performance by capturing low-order or high-order features.

Because the user’s historical behavior contains the items viewed by the user, it is a crucial source for capturing user interests. DIN [13] is proposed to learn user interests through activation units, but it rarely considers the changing trend of interests. Thus, DIEN [14] learns the dependencies between sequential behaviors by designing an auxiliary loss and AUGRU; it not only extracts user interests but also captures their temporal evolution. Meanwhile, BST [30] utilizes a transformer to capture the sequential nature neglected by DIN. DSIN [31] learns user interests by using session information from the user behavior sequence. DMR [15] is proposed to obtain the relevance between the user and the target item through a user-to-item network and an item-to-item network. DMIN [32] learns potential multiple user interests from the user behavior sequence. MIAN [33] extracts feature interactions between multi-field features by utilizing a multi-interaction layer and a global interaction module.

As noted in the literature [34], the above methods focus on learning user representations from the user behavior sequence but cannot fully learn the contextual information. In addition, we agree with the literature [35] that learning refined item representations is important for learning user representations. Therefore, we propose a model called interest extraction method based on multi-head attention mechanism (IEN) in this paper. It can better learn the context information and the refined representations of items, so user interests can be captured more precisely, which helps to improve the accuracy of CTR prediction.

Fig. 1 Interest extraction method model

3 Interest extraction method based on multi-head attention mechanism

Since the user behavior sequence is a list of items visited by the user, the user behavior can reflect the user’s interests, and extracting a good user representation from the user behavior sequence is beneficial for acquiring user interests. However, relying only on the user behavior sequence may lead to outdated recommendations. Valuable features in the context information (e.g., review time, ratings) are meaningful for deriving the user’s current interests. So, learning the feature interaction between the context and the target item is advantageous for obtaining user interests.

The multi-head attention mechanism [36] can learn the relationship between different features. Thus, the relationship between target items in the user behavior sequence and feature interaction between the context and the target item can be learned by the multi-head attention mechanism. Therefore, a new model called interest extraction method based on multi-head attention mechanism (IEN) is designed in this section. The framework of IEN is shown in Fig. 1, and we design an interest extraction module, which is composed of two sub-modules: the item representation module (IRM) and the context–item interaction module (CIM).

In the IRM, to capture user representation, we integrate position information with refined item representations learned by a multi-head attention mechanism. After that, the relevance between the user and the target item is derived by the inner product. In the CIM, the feature interaction between the context and the target item is learned via the multi-head attention mechanism.

3.1 Item representation module

Four types of features are used in our model: User Profile, Target Item, User Behavior Sequence, and Context. User Profile \((x_p)\) contains features related to the user, such as user ID, consumption level, and gender. Target Item \((x_t)\) contains item ID, category ID, etc. The item IDs can be expressed as an embedding matrix \(V=[v_1;v_2;\ldots ;v_K]\in R^{K\times d_v}\), where K is the total number of items, \(v_j\) is the embedding of the jth item, and \(d_v\) is the embedding dimension. The User Behavior Sequence contains multiple items and can be denoted as \(x_b=[e_1;e_2;\ldots ;e_T]\in R^{T\times d_e}\), where \(e_t\) is the embedding of the tth item, \(d_e\) is the embedding dimension, and T is the length of the user behavior sequence. Context \((x_c)\) includes the time, the method of matching, the corresponding match score, and other valuable information. As the user behavior sequence carries sequential information, position embeddings \([p_1;p_2;\ldots ;p_T]\in R^{T\times d_p}\) are introduced to capture it, where \(d_p\) is the dimension of each position embedding.
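For concreteness, the following is a minimal PyTorch sketch of the embedding setup described above; the sizes K, T, \(d_v\), and \(d_p\) are illustrative placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

K, T, d_v, d_p = 10000, 10, 64, 64               # illustrative sizes
item_emb = nn.Embedding(K, d_v)                  # embedding matrix V = [v_1; ...; v_K]
pos_emb = nn.Embedding(T, d_p)                   # position embeddings [p_1; ...; p_T]

behavior_ids = torch.randint(0, K, (32, T))      # a batch of behavior sequences
x_b = item_emb(behavior_ids)                     # (32, T, d_v), i.e., [e_1; ...; e_T]
p = pos_emb(torch.arange(T)).expand(32, T, d_p)  # positions broadcast over the batch
```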

It is known that user’s historical behavior contains some items visited by the user, so user interests denoted by the user representation can be extracted from user behaviors. Thus, to better extract user interests, we require a good method to capture the user representation from the user behavior sequence.

The item representation module (IRM) is proposed to learn the user interests implied in the user behavior sequence. In the IRM, the refined representation of each item is acquired via a multi-head attention mechanism. After that, the user representation is obtained by integrating the refined item representations and position information. Lastly, the inner product is used to get the correlation between the user and the target item, which denotes the user’s interest in the target item. The input of the multi-head attention mechanism used in this paper consists of the query (Q), key (K), and value (V). The specific calculation equations are as follows:

$$\begin{aligned} \text{Attention}(Q,K,V)&=\text{softmax}\left(\frac{Q K^\textrm{T}}{\sqrt{d}}\right)V \end{aligned}$$
(1)
$$\begin{aligned} \text{head}_h&=\text{Attention}(x_b W_h^Q,x_b W_h^K,x_b W_h^V) \nonumber \\ &=\text{softmax}\left(\frac{x_b W_h^Q \cdot (x_b W_h^K)^\textrm{T}}{\sqrt{d_k}}\right)\cdot x_b W_h^V \end{aligned}$$
(2)
$$\begin{aligned} M&=\text{MultiHead}(x_b)=\text{Concat}(\text{head}_1,\text{head}_2,\ldots ,\text{head}_H)W^O \end{aligned}$$
(3)

where \(W_h^Q,W_h^K,W_h^V\in R^{d \times d}\) are the weight matrices, d and \(d_k\) are the scaling factors, \(W^O\) is the linear projection matrix, and H is the number of heads. The outputs of all heads are concatenated to get the high-order item representations, denoted as \(M=(m_1;m_2;\ldots ;m_t;\ldots ;m_T)^\textrm{T}\), where \(m_t\) is the refined representation of the tth item.
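As a concrete reference, the following PyTorch sketch implements Eqs. (1)–(3) over the behavior sequence \(x_b\). Stacking the per-head projections \(W_h^Q\), \(W_h^K\), \(W_h^V\) into single linear layers is a common implementation choice we assume here, not something the text prescribes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d: int, num_heads: int):
        super().__init__()
        assert d % num_heads == 0
        self.h, self.d_k = num_heads, d // num_heads
        self.w_q = nn.Linear(d, d, bias=False)  # stacks W_h^Q for all heads
        self.w_k = nn.Linear(d, d, bias=False)  # stacks W_h^K
        self.w_v = nn.Linear(d, d, bias=False)  # stacks W_h^V
        self.w_o = nn.Linear(d, d, bias=False)  # W^O in Eq. (3)

    def forward(self, x_b: torch.Tensor) -> torch.Tensor:
        B, T, _ = x_b.shape
        def split(t):  # (B, T, d) -> (B, H, T, d_k)
            return t.view(B, T, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x_b)), split(self.w_k(x_b)), split(self.w_v(x_b))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5  # Eqs. (1)-(2)
        heads = F.softmax(scores, dim=-1) @ v               # (B, H, T, d_k)
        m = heads.transpose(1, 2).reshape(B, T, self.h * self.d_k)
        return self.w_o(m)                                  # M in Eq. (3)
```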

In addition, this paper uses an auxiliary loss [14] to supervise the learning of the refined representation of each item. The auxiliary loss uses the \((t+1)\)th item \(e_{t+1}^i\) to supervise the learning of the tth item representation. The user’s real next behavior is used as the positive sample, while the negative sample \(\hat{e}_{t+1}^i\) is sampled from the set of items that were not clicked. The auxiliary loss is formulated as,

$$\begin{aligned} L_\text{aux}=-\frac{1}{N}\sum _{i=1}^N\sum _t\left(\log \sigma (m_t^i,e_{t+1}^i)+\log \left(1-\sigma (m_t^i,\hat{e}_{t+1}^i)\right)\right) \end{aligned}$$
(4)

where N is the number of training samples and, following [14], \(\sigma (\cdot ,\cdot )\) denotes the sigmoid of the inner product of its two input vectors.
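Under that reading of \(\sigma \), Eq. (4) can be sketched as follows; the shift-by-one pairing and the tensor shapes are implementation assumptions.

```python
import torch.nn.functional as F

def auxiliary_loss(m, e, e_neg):
    # m: (B, T, d) refined item representations; e: (B, T, d) clicked-item
    # embeddings; e_neg: (B, T, d) sampled non-clicked items. Each m_t is
    # paired with the (t+1)th item, hence the shift by one position.
    pos_logit = (m[:, :-1] * e[:, 1:]).sum(-1)      # <m_t, e_{t+1}>
    neg_logit = (m[:, :-1] * e_neg[:, 1:]).sum(-1)  # <m_t, e_hat_{t+1}>
    # log(1 - sigmoid(x)) == logsigmoid(-x), which is numerically stabler
    loss = -(F.logsigmoid(pos_logit) + F.logsigmoid(-neg_logit))
    return loss.sum(dim=1).mean()                   # sum over t, mean over samples
```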

The user representation can be learned by integrating the refined item representation \(m_t\) and position information \(p_t\). The calculation formulas are expressed as,

$$\begin{aligned} \alpha _t&=\frac{\exp (\tanh (W_p p_t+W_m m_t+b))}{\sum \nolimits _{t'=1}^T \exp (\tanh (W_p p_{t'}+W_m m_{t'}+b))} \end{aligned}$$
(5)
$$\begin{aligned} u&=\sum _{t=1}^T \alpha _t m_t=\sum _{t=1}^T h_t \end{aligned}$$
(6)

where \(p_t\in R^{d_p}\) is the tth position embedding, and \(W_p\in R^{d_h \times d_p}\), \(W_m\in R^{d_h \times d_m}\), \(b\in R^{d_h}\) are learnable parameters. \(\alpha _t\) is the normalized weight of the tth item, and \(h_t=\alpha _t m_t\). The feature vector of the user behavior sequence is thus mapped into a fixed-length feature vector \(u\in R^{d_v}\) by weighted sum pooling.

Finally, the representation \(v\in R^{d_v}\) of the target item is looked up from the embedding matrix V. After that, the relevance between the user and the target item is gained by the inner product: \(r=u^\textrm{T} v\).
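A hedged sketch of Eqs. (5)–(6) and the relevance score follows. Note that the tanh activation in Eq. (5) is a \(d_h\)-dimensional vector; we assume, as is common, a learned projection that maps it to a scalar score before normalization (the layer `z` below is that assumption).

```python
import torch
import torch.nn as nn

class PositionAttentionPooling(nn.Module):
    def __init__(self, d_p: int, d_m: int, d_h: int):
        super().__init__()
        self.w_p = nn.Linear(d_p, d_h, bias=False)  # W_p
        self.w_m = nn.Linear(d_m, d_h, bias=True)   # W_m with bias b
        self.z = nn.Linear(d_h, 1, bias=False)      # assumed scalar projection

    def forward(self, p, m, v):
        # p: (B, T, d_p) position embeddings; m: (B, T, d_m) refined items;
        # v: (B, d_m) target-item embedding looked up from V
        scores = self.z(torch.tanh(self.w_p(p) + self.w_m(m)))  # (B, T, 1)
        alpha = torch.softmax(scores, dim=1)                    # Eq. (5)
        u = (alpha * m).sum(dim=1)                              # Eq. (6)
        r = (u * v).sum(dim=-1, keepdim=True)                   # r = u^T v
        return u, r
```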

3.2 Context–item interaction module

The valuable temporal features in the context information are critical for deriving the user’s current interests. However, the IRM focuses on learning the relationship between the user behavior sequence and the target item; it does not sufficiently learn the context information and thus lacks learning of the user’s current interests, which may lead to outdated recommendations. Therefore, we propose the context–item interaction module (CIM). In the CIM, we learn the feature interaction between the context and the target item by applying the multi-head attention mechanism, and in this way the current interests of the user are gained. The specific steps of this module are as follows. Firstly, the context representation \(x_c\) is concatenated with the target item representation \(x_t\) to obtain \(Z=\text{Concat}(x_t,x_c)\). Secondly, the feature interaction (R) between the context and the target item is learned using Eqs. (7) and (8).

$$\begin{aligned} \text{head}^{\prime }_h&=\text{Attention}(Z W_h^{\prime Q},Z W_h^{\prime K},Z W_h^{\prime V})\nonumber \\ &=\text{softmax}\left( \frac{Z W_h^{\prime Q} \cdot (Z W_h^{\prime K})^\textrm{T}}{\sqrt{d_k}}\right) \cdot Z W_h^{\prime V} \end{aligned}$$
(7)
$$\begin{aligned} R&=\text{MultiHead}(Z)=\text{Concat}(\text{head}^{\prime }_1,\text{head}^{\prime }_2,\ldots ,\text{head}^{\prime }_H)W^{\prime O} \end{aligned}$$
(8)
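Since Eqs. (7)–(8) reuse the same multi-head attention as the IRM, the CIM can be sketched by feeding Z through the `MultiHeadSelfAttention` class from the Sect. 3.1 sketch; the field counts and dimensions below are illustrative.

```python
import torch

x_t = torch.randn(32, 2, 64)  # target-item fields (e.g., item ID, category ID)
x_c = torch.randn(32, 3, 64)  # context fields (e.g., time, match method, score)
Z = torch.cat([x_t, x_c], dim=1)        # Z = Concat(x_t, x_c), shape (32, 5, 64)

cim_attn = MultiHeadSelfAttention(d=64, num_heads=4)  # class from Sect. 3.1
R = cim_attn(Z)                         # feature interaction R of Eq. (8)
```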

3.3 The overall structure of interest extraction method

The interest extraction method based on multi-head attention mechanism (IEN) mainly consists of the embedding layer, the interest extraction module, and the multi-layer perceptron. The specific structure is shown in Fig. 1. Among them, the interest extraction module mainly includes IRM and CIM.

Most features can be encoded as high-dimensional one-hot vectors. To begin with, the embedding layer transforms the one-hot vectors into low-dimensional dense features. After that, in the interest extraction module, the IRM and CIM capture the user’s interests in the target item. Finally, the relevance between the user and the target item, the feature interaction, the user profile, the context information, the user behavior sequence, and the target item are concatenated together and fed into the MLP for the final CTR prediction. Since CTR prediction is a binary classification task, the widely used cross-entropy loss function is chosen; it uses the label of the target item to supervise the whole prediction. The cross-entropy loss function is defined as,

$$\begin{aligned} L_\text{target}=-\frac{1}{N}\sum _{(x,y)\in D}\left(y \log f(x)+(1-y) \log (1-f(x))\right) \end{aligned}$$
(9)

where \(x=[x_p,x_t,x_b,x_c]\), D is the training set with N samples in total, \(y\in \left\{ 0,1\right\} \) indicates whether the user clicked the target item, and f(x) is the prediction output of the MLP.

In this paper, we use the cross-entropy loss and the auxiliary loss to supervise the overall prediction and the extraction of refined item representations, respectively. So, the final loss function is defined as,

$$\begin{aligned} L_\text {final}=L_\text {target}+\beta L_\text {aux} \end{aligned}$$
(10)

where \(\beta \) is a hyperparameter that balances the losses of the two parts.
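Putting Eqs. (9) and (10) together, a minimal sketch of the training objective is given below; taking raw MLP logits and computing f(x) inside the loss is a numerical-stability choice we assume, not something the text prescribes.

```python
import torch.nn.functional as F

def final_loss(logits, labels, l_aux, beta=1.0):
    # Eq. (9): cross-entropy on the click label; f(x) = sigmoid(logits)
    l_target = F.binary_cross_entropy_with_logits(logits, labels)
    # Eq. (10): add the auxiliary loss weighted by the hyperparameter beta
    return l_target + beta * l_aux
```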

Table 1 Datasets used in this paper

4 Experiments

In this section, we conduct comparison experiments on four Amazon datasets between the proposed IEN and seven popular existing algorithms, and the experimental results are analyzed and evaluated. In addition, we validate the effectiveness of each part of the IEN model with ablation experiments.

4.1 Datasets

Four real-world datasets derived from the Amazon datasets are used to evaluate the model’s performance. All Amazon datasets contain abundant user behavior, user profile, and context information. We select the Electronics, Beauty, CDs & Vinyl, and Book datasets, which have already been extensively used in the CTR prediction task. The training and testing sets are obtained by randomly sampling 80% and 20% of each original dataset, respectively. Table 1 lists the statistics of the four datasets.
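For concreteness, the 80/20 random split can be sketched as follows; the toy DataFrame and its column names are hypothetical placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

interactions = pd.DataFrame({"user_id": [1, 1, 2, 2, 3],
                             "item_id": [10, 11, 10, 12, 13],
                             "label":   [1, 0, 1, 0, 1]})
train_df, test_df = train_test_split(interactions, test_size=0.2, random_state=42)
```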

4.2 The comparison models

To evaluate the performance of the IEN model, we compare it with the most popular methods based on deep neural network frameworks. The methods include DNN, Wide & Deep, PNN, DIN, DIEN, DMR, and DMIN. Among them, DIN, DIEN, DMR, and DMIN acquire user interests from the user behavior sequence.

DNN [37]: DNN is a standard deep neural network, which consists of the embedding layer and an MLP. It is also a prototype for other DNN-based models for the CTR prediction task.

Wide & Deep [11]: This method combines linear and nonlinear models to learn feature interactions and further capture user interests.

PNN [12]: This method introduces the product layer to learn high-order feature interactions.

DIN [13]: DIN uses attention units to learn user interests by combining attention mechanisms with DNN.

DIEN [14]: In this method, the disadvantages of DIN are solved by applying the auxiliary loss and AUGRU to get the evolving process of user interests.

DMR [15]: This method learns the relevance between the user and the target item by designing a user-to-item network and an item-to-item network.

DMIN [32]: DMIN utilizes the behavior refiner layer and the multiple interest extraction layer to learn the multiple interests of the user.

Fig. 2 Experimental results for different numbers of heads in the attention mechanism

4.3 Experimental setups

In this subsection, the parameters in our model are set the same as the ones in the references [14, 15]. The learning rate is set to 0.001, the batch size is set to 256, and the weight of the auxiliary loss is set to 1.

To check the effect of the number of heads in the multi-head attention mechanism on the model’s performance, we conducted an experiment whose results are displayed in Fig. 2. It is interesting to notice that the model performs best when H is 4 on all datasets. So, H in IEN is set to 4, and the setting of H in the comparison algorithms follows the corresponding literature.

Fig. 3 Results of the effect of user behavior sequence length on model’s performance

Furthermore, since the user behavior sequence contains valuable information, choosing an appropriate sequence length is critical to the model’s performance. We compare IEN with models that make predictions based on the user behavior sequence, namely DIN, DIEN, DMR, and DMIN. Figure 3 shows the impact of different user behavior sequence lengths on the model’s performance. It is clear that IEN performs best when the sequence length is set to 10 on the Electronics, Beauty, and CDs & Vinyl datasets, and when it is set to 20 on the Book dataset. In addition, DIN and DIEN show their best performance on all four datasets with the sequence length set to 10. DMR fares best when the sequence length is 10 on the Electronics, Beauty, and CDs & Vinyl datasets and 20 on the Book dataset. DMIN performs best with the sequence length set to 10 on the first two datasets and 20 on the last two.

Finally, we use Adam [38] as the optimizer of IEN. Besides, we use the area under the ROC curve (AUC) [39] and DNN-based RelaImpr (RI) [40] as evaluation metrics. The experiments are repeated five times, and the average results are recorded.
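For reference, RelaImpr [40] measures the relative improvement of a model over the base model (here DNN), treating an AUC of 0.5 as random guessing; we state the standard definition under that assumption:

$$\begin{aligned} \text{RI}=\left(\frac{\text{AUC}(\text{measured model})-0.5}{\text{AUC}(\text{base model})-0.5}-1\right)\times 100\% \end{aligned}$$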

Fig. 4 Training curves on the four datasets

Table 2 AUC experimental results on four datasets
Table 3 RI based on DNN for all models on each dataset

4.4 Experimental results

In this subsection, to demonstrate the validity of IEN, we analyze the experimental results of IEN and the comparison algorithms (DNN [37], Wide & Deep [11], PNN [12], DIN [13], DIEN [14], DMR [15], DMIN [32]) on the four datasets Electronics, Beauty, CDs & Vinyl, and Book, where DNN is used as the baseline. Figure 4 presents the training process of all algorithms on the four datasets, where the red line represents IEN. Taking a close look at the results, the IEN model clearly exceeds the other algorithms on all four datasets. Besides, our method performs best on the Book dataset.

The comparison results on the four datasets are given in Table 2, where the best results are indicated in blue. It is observed that IEN beats the other models on all metrics on the four datasets. Concretely, the AUC values of IEN are 0.7701, 0.7241, 0.8090, and 0.8471 on the four datasets. As Table 3 shows, compared to DNN, the AUC of IEN increases by 12.87\(\%\), 9.48\(\%\), 8.31\(\%\), and 23.17\(\%\) on the four datasets, respectively. Meanwhile, compared with the suboptimal model, the AUC of IEN improves by 2.12\(\%\), 1.22\(\%\), 2.15\(\%\), and 4.08\(\%\), respectively. All in all, IEN is considerably more effective than the other algorithms.

The reasons for the good performance of our approach can be attributed to the following aspects. On the one hand, the refined representation of each item is obtained by learning the relationship between target items in the user behavior sequence. It is beneficial to gain a better user representation to reflect user interests. In addition, learning the feature interaction between the context and the target item by multi-head attention mechanism is beneficial to capture the current interests of the user.

In contrast, although models such as DIN, DIEN, DMR, and DMIN all acquire user interests from the user behavior sequence, each has limitations. DIN fails to capture the evolving process of user interests because it ignores position information. DIEN pays little attention to the relevance between the user and the target item. Finally, DMR and DMIN ignore the effect of context information on user interests. Thus, these models are inferior to IEN.

Besides, DNN, Wide & Deep, and PNN all improve on feature interactions, but they focus on feature interactions without modeling the user behavior sequence, and thus ignore the user interests contained in it. In particular, DNN does not learn feature interactions sufficiently and therefore performs worst among the comparison algorithms. In addition, Wide & Deep does not perform well because its “wide” part relies on manually designed feature interactions. Additionally, PNN focuses on learning high-order features and ignores the valuable information in the original features.

Table 4 AUC results of the ablation experiment on four datasets
Table 5 IEN-based RI in ablation experiments

4.5 Ablation experiments

In this subsection, to check the effectiveness of each component in the IEN, we conduct ablation experiments; the results are summarized in Tables 4 and 5. As expected, they illustrate the validity of each part and demonstrate that each module improves the performance of CTR prediction to a certain degree.

Firstly, to verify the significance of the relevance, we compare IEN with IEN w/o UI, which removes the correlation between the user and the target item. The AUC value of IEN w/o UI on the Electronics dataset is 0.7591, a 4.07\(\%\) decrease compared to IEN, and its performance is also worse on all the other datasets. It is worth mentioning that the drop is largest on the Book dataset. This is because the correlation between the user and the item reflects the user’s preference for the item.

Secondly, to confirm that the refined representation of each item is essential in IEN, we compare IEN with IEN w/o BI, which deletes the refined representation of each item. On the Electronics dataset, the AUC value of IEN w/o BI is 0.7598, which is 3.81\(\%\) lower than that of IEN. This validates the necessity of learning the high-order item representations by applying a multi-head attention mechanism.

Finally, to check the validity of the feature interaction between the context and the target item, IEN is compared against IEN w/o IC, which removes this feature interaction. IEN w/o IC has an AUC value of 0.7616 on the Electronics dataset, a decrease of 3.15\(\%\) compared to IEN, and its performance is also inferior to IEN on the other datasets. Therefore, learning the feature interaction between the context and the target item is proven to be reasonable.

In summary, removing the correlation between the user and the target item has the largest impact on the performance of CTR prediction, removing the learning of the refined item representations has the second largest impact, and deleting the feature interaction between the context and the target item has the least impact. Nonetheless, each part contributes to the performance of CTR prediction.

5 Conclusion

In this paper, we propose a new model called interest extraction method based on multi-head attention mechanism (IEN). We design an interest extraction module, which consists of the item representation module (IRM) and the context–item interaction module (CIM). In the IRM, the relationship between the items in the user behavior sequence is learned to obtain the refined representation of each item, which helps to acquire a good user representation reflecting user interests. In the CIM, the feature interaction between the context and the target item is learned by a multi-head attention mechanism. Furthermore, experimental results on four public datasets demonstrate that the proposed IEN improves the performance of CTR prediction, and the ablation experiments further illustrate the effectiveness and necessity of each part of the IEN. Time interval information is also useful for reflecting the variations of user interests, but this model does not take it into account. Therefore, how to better exploit time interval information will be the task of our future work.