1 Introduction

Click-through rate (CTR) prediction is a well-known recommendation task that aims to predict the probability of a user clicking on recommended items and ads [26, 30]. Since CTR prediction directly influences the revenue of advertising platforms and the satisfaction of users, it has become an actively investigated topic in both industry and academia for recommender system research [5, 12]. Thus, designing an effective and efficient model that improves the accuracy of CTR prediction has received much attention.

The key challenge in CTR prediction is how to perform feature engineering effectively. Among traditional methods, Logistic Regression (LR) [1] and Factorization Machine (FM) [24] are two popular models for feature learning. LR is a linear model and encodes features through a linear combination. FM utilizes factorized parameters to model second-order feature interactions. Many variants of FM have been proposed to improve the performance of CTR prediction [12, 22, 34]. However, the limitation of these methods is that they cannot capture high-order feature interactions. Recently, deep learning based models have been proposed for CTR prediction, as they show great improvements over traditional methods in recommender systems. The Factorization machine supported Neural Network (FNN) [40] and the Factorization-Machine based neural network (DeepFM) [6] combine FM and a multilayer perceptron (MLP) through different aggregations. Additionally, some studies focus on modelling different feature interactions. The Attentional Factorization Machine (AFM) [34] introduces the attention mechanism [29] into second-order feature interactions to automatically learn their weights. The Feature importance and bilinear feature interaction network (FiBiNET) [10] dynamically learns feature importance and fine-grained second-order feature interactions. These methods achieve remarkable improvements and verify that both low-order (first- and second-order) and high-order feature interactions are important for feature learning.

Several works are devoted to automatic feature engineering with deep neural networks (DNNs) and find that the quality of low- and high-order feature interactions directly influences the performance of CTR prediction. However, no existing work jointly focuses on capturing informative feature interactions at both low and high orders. First, it is widely accepted that different features have different importance for a target task. For example, the feature occupation is more important than the feature age when predicting a user's income. Second, not all pair-wise feature interactions are equally useful for prediction. For example, the interaction between occupation and home address is more useful than the interaction between age and gender when predicting a user's income. Third, high-order feature interactions are built from low-order feature interactions, so the generated feature space can be huge and require extremely heavy computation. Feature interactions at different layers carry different semantic information for feature learning, and less useful interactions should be assigned lower weights since their contributions are limited, which can significantly reduce the computation. Furthermore, the data involved in CTR prediction are typically categorical and very sparse [28], and existing methods usually use an inner product or a Hadamard product to model second-order feature interactions. However, this may limit feature learning and hurt the overall performance, since the inner product and Hadamard product are too simple to effectively capture feature interactions on sparse datasets. Although FiBiNET [10] designs three kinds of bilinear functions for second-order feature interactions and gains some improvement, it merely considers the projection from the i-th feature to the j-th feature while ignoring the projection from the j-th feature to the i-th feature when building pair-wise feature interactions. We argue that this incomplete projection introduces bias and crucially limits the performance.

To address the above problems, we propose an effective framework called the Hierarchical Attention and Feature Projection neural network (HAFP) to fully exploit relevant information from different orders of feature interactions and to extract more fine-grained second-order feature interactions from a comprehensive projection. Specifically, motivated by the fact that the importance of different features differs greatly for the final task, HAFP first retrieves salient features through the designed attentive global-local contexts module. Next, HAFP computes a weighted score for each second-order feature interaction according to its contribution, which constitutes the second attention level. Finally, the high-order feature interactions of each layer are selectively aggregated for feature learning, so that a more meaningful and informative feature representation can be learnt. Furthermore, since the quality of second-order feature interactions directly influences the quality of high-order feature interactions, a projective bilinear function is designed to learn more fine-grained feature interactions.

Compared with existing methods, the proposed HAFP is able not only to encode more informative features and feature interactions but also to capture more comprehensive interaction information, so that the feature representation built by our model carries meaningful information and has good explainability. We conduct experiments on two public datasets, Criteo and Avazu. The results indicate that HAFP achieves accurate CTR prediction and outperforms the state-of-the-art methods on the two datasets in terms of AUC and Logloss. This work provides a new, feasible way to improve the accuracy of CTR prediction. Besides, our proposed HAFP can be directly used in practice and helps to increase the revenue of advertising platforms.

The contributions of our HAFP model can be summarized as follows.

  • The proposed hierarchical attention mechanism fully exploits relevant contexts for feature learning, and the weights of new features can be trained in the same way, which improves the extensibility of our model and its consistency with practice. To the best of our knowledge, we are the first to jointly capture relevant information from both low- and high-order feature interactions for CTR prediction in an end-to-end manner.

  • Inspired by the success of bilinear-interaction layer in [10], we introduce a projective bilinear function which employs an inner product to form a co-projection matrix and a Hadamard product to generate the interaction embedding. It dynamically learns feature interactions in a more fine-grained way.

  • An attentive global-local contexts module is designed to adaptively select meaningful features, which can simultaneously emphasize common information that distributes more globally and highlight characterized information that distributes more locally.

  • We conduct comprehensive experiments on the two public datasets. The results show that our model improves the CTR prediction performance by 0.3% and 0.2% in terms of AUC, respectively, compared to the best-performing baseline.

The rest of this paper is structured as follows. Section 2 discusses some related work of CTR prediction. Section 3 presents the HAFP model in detail. We conduct comprehensive experiments and present the experimental setups with the corresponding results in Section 4. Finally, we conclude the paper and point out some future work in Section 5.

2 Related work

CTR prediction is usually studied as a binary classification task, and accurate feature engineering is helpful for improving the performance of the CTR prediction task [15, 32]. To improve the prediction performance, some models pay attention to feature interactions [17, 36, 37], while other models [8, 27, 31, 35, 38] argue that behavior sequences can benefit model learning and improve performance. Since our work focuses on modelling information from features, we briefly review traditional methods and deep learning based methods that are related to feature interactions.

2.1 Traditional methods

Among traditional methods, LR is the foundation of many popular models and is widely used in both industry and academia for CTR prediction; the weights corresponding to the features are interpreted as their importance or their influence on the click rate. However, LR is a linear model and lacks the ability to build sophisticated feature interactions. FM [24] is another well-known model for CTR prediction. It projects sparse features into low-dimensional dense vectors and builds second-order feature interactions by using an inner product on the dense vectors (see the sketch below). Therefore, FM based methods [12, 34] handle data sparsity better than LR based methods. Afterwards, several variants of FM were proposed to improve the performance of the final task. The Field-aware Factorization Machine (FFM) [12] introduces field information into the FM model. AFM [34] extends FM by adding an attention mechanism to capture the importance of feature interactions, and it offers good interpretability. However, these traditional methods only have the capability to model low-order feature interactions and cannot model high-order feature interactions. In addition, a linear combination of feature interactions limits their performance on the final task.
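To make the FM interaction term concrete, the following minimal NumPy sketch computes the second-order part of FM for a single sample using the standard O(kn) reformulation; shapes, factor values, and the toy input are purely illustrative and not taken from any implementation discussed above.

```python
import numpy as np

def fm_second_order(x, V):
    """Second-order FM term for one sample.

    x: (n,) feature vector (mostly zeros for sparse data).
    V: (n, k) factor matrix; row i is the latent vector of feature i.
    Uses the identity  sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [ (sum_i v_{if} x_i)^2 - sum_i v_{if}^2 x_i^2 ].
    """
    linear_sq = (x @ V) ** 2          # (k,)
    sq_linear = (x ** 2) @ (V ** 2)   # (k,)
    return 0.5 * np.sum(linear_sq - sq_linear)

# toy usage
x = np.array([1.0, 0.0, 1.0])
V = np.random.randn(3, 4)
print(fm_second_order(x, V))
```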

2.2 Deep learning based methods

In deep learning based methods, DNNs are introduced into CTR prediction since they can effectively capture high-order feature interactions for better performance [20, 25]. FNN uses pre-trained embeddings from FM and then models high-order feature interactions via an MLP. The Product-based neural network (PNN) [23] uses an inner product and an outer product for feature embedding instead of FM. Compared to previous methods that rely on shallow structures, FNN and PNN obtain better performance. However, their limitation is that they focus less on low-order feature interactions, which is insufficient for accurate feature learning. To jointly encode low- and high-order feature interactions, Wide&Deep [4] and Deep&Cross [33] integrate a wide/cross part and a deep part to separately build low- and high-order feature interactions. DeepFM [6] introduces FM into the wide part of the Wide&Deep model and feeds raw features to the deep part. The Deep Field Relation Neural Network (DFRNN) [42] uses a 3-dimensional relation tensor to model the feature interactions. The performance of these works verifies that jointly modelling low-order and high-order feature interactions is beneficial for extracting comprehensive and representative information. However, such works cannot learn effective interactions since the contributions of different feature interactions to the CTR prediction result may differ. To alleviate this problem, the Interpretable CTR prediction model with Hierarchical Attention Mechanism (InterHAt) [16] considers the interaction order and builds second-order feature interactions through a multi-head self-attention based transformer on raw feature embeddings. To further automatically learn indispensable feature interactions, the High-order Attentive Factorization Machine (HoAFM) [28] introduces a bit-wise attention mechanism to determine the different importance of low- and high-order feature interactions. The Multi-order interactive features aware Factorization Machine (MoFM) [37] integrates three different types of prediction models to effectively capture low-order and high-order interactive features. The Attentive Capsule Network (ACN) [13] uses transformers to automatically learn meaningful feature interactions. Cai et al. [2] propose an effective CTR prediction method called CAN, which explicitly exploits the benefits of the attention mechanism and DNNs in modelling low-order and high-order feature interactions. Besides, since the residual module has been shown to retrieve powerful and discriminant representations [21], the work in [18] introduces the residual network into the layer that learns high-order feature interactions, forming a ResNet-CTR structure that can explore complex feature interactions at different layers. Compared to the earlier works, these approaches that consider the importance of each feature interaction achieve better performance. However, they still extract the feature representation from the raw data directly and ignore the impact of different features on CTR prediction. Thus, there is room for improvement because unnecessary features are modelled without considering their importance. Recently, since not all features are equally useful for modelling feature interactions and a better feature representation makes learning feature interactions easier, FiBiNET [10] constructs the embedding vectors of multi-field features and feature interactions through a SENET layer and a bilinear function.
In addition, Yang et al. [39] focus on improving the feature representation and propose an embedding method called operation-aware embedding, which can learn different representations for each feature under different operations. Jiang et al. [11] divide the data into different groups according to their important characteristics. However, these methods enumerate all feature interactions equally for feature learning, which always requires large memory. In addition, useless feature interactions can introduce unnecessary noise and negatively impact the prediction accuracy.

In summary, the key limitation of existing approaches for CTR prediction, which exploit salient features, meaningful second-order feature interactions, or dominant high-order feature interactions in feature engineering, is that they generally have difficulty in effectively building an accurate and representative feature representation. To improve the prediction performance, it is useful to jointly consider the different contributions of features and feature interactions. Here, we introduce a hierarchical attention mechanism to learn informative features and feature interactions at both low and high orders, which provides an interpretable view of the prediction results. Further, we design a projective bilinear function to effectively learn more fine-grained and comprehensive second-order feature interactions, which enriches the information available for modelling high-order feature interactions and further improves the prediction accuracy.

3 The proposed algorithm

We aim to automatically learn the relevant low- and high-order feature interactions in an end-to-end manner. As a result, we propose a Hierarchical Attention and Feature Projection neural network (HAFP) for CTR prediction.

In this section, we mainly describe the framework of HAFP. As shown in Fig. 1, HAFP has three main components: the embedding layer, the feature learning layer, and the prediction layer. First, the embedding layer converts each raw feature into a dense low-dimensional vector. Second, in order to derive a meaningful and representative feature representation, we employ a feature learning layer to encode features and feature interactions based on the output of the embedding layer. The feature learning layer consists of three parts: a salient feature encoder, a meaningful second-order interaction encoder, and a dominant high-order interaction encoder. The salient feature encoder transforms the dense low-dimensional feature vectors into salient feature embeddings with the help of an attentive global-local contexts module. This process pays more attention to feature importance, dynamically placing higher weights on important features and decreasing the weights of uninformative features, and thus boosts the reliability of the feature embedding. The meaningful second-order interaction encoder models the interaction information of the salient feature embeddings with the help of a projective bilinear function and a self-attention mechanism. This process not only fully explores the information of second-order feature interactions but also considers their importance for the target task; the encoder is thus capable of building more fine-grained second-order feature interactions and enhancing interaction discriminability. The dominant high-order interaction encoder aggregates feature interactions from different layers by using an attention mechanism. It distinguishes dominant layers from irrelevant ones and assigns a weight to each layer dynamically, which yields a more representative and efficient feature representation. The prediction layer takes the output of the feature learning layer to compute the prediction score, which represents the probability of the user clicking on the recommended product. The following sections introduce each part in detail.

Fig. 1 The framework of the Hierarchical Attention and Feature Projection neural network (HAFP)

3.1 Embedding layer

In the task of CTR prediction, data is always aggregated from different fields and usually contains categorical and numerical features, which cannot be directly used for numerical computation. Table 1 shows an example of real-world multi-field data used for CTR prediction. To represent these kinds of features, they are often converted into high-dimensional sparse vectors by one-hot encoding. However, since the embedding generated by a one-hot encoder is sparse and hard to process, a lookup-table approach is applied to transform each raw feature into a corresponding dense low-dimensional vector and form a field embedding vector, as sketched below. Finally, we use E = [e1,e2,...,em] to denote the field embedding vector, where m denotes the number of fields, \(\boldsymbol{e}_{i} \in R^{d}\) denotes the embedding of the i-th field feature, and d is the embedding size.

Table 1 An example of multi-field data for CTR prediction. Each column is a field. Gender and Occupation are categorical features, and Age is a numerical feature
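The lookup can be illustrated with a small NumPy sketch. The field vocabularies, sizes, and encoded indices below are purely hypothetical, and numerical fields are assumed to have been bucketized or otherwise mapped to indices beforehand; this is an illustration of the lookup-table view, not our implementation.

```python
import numpy as np

# One lookup table per field; vocabulary sizes and embedding size are illustrative.
d = 8                                   # embedding size
vocab_sizes = [1000, 50, 300]           # hypothetical vocabulary size of each field
tables = [np.random.randn(v, d) * 0.01 for v in vocab_sizes]

def embed(sample_ids):
    """sample_ids: one integer index per field (after encoding/bucketizing).
    Returns E = [e_1, ..., e_m] with shape (m, d)."""
    return np.stack([tables[i][idx] for i, idx in enumerate(sample_ids)])

E = embed([17, 3, 42])   # hypothetical encoded sample
print(E.shape)           # (3, 8)
```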

3.2 Feature learning layer

3.2.1 Salient feature encoder

The Squeeze-and-Excitation network (SENET) [9] is efficient for learning feature importance in CTR prediction, as it can effectively learn the relationship between each feature and the global context. However, we argue that it focuses on the common information and ignores the characterized information of each feature. Inspired by the success of attentional feature fusion [14] in computer vision, we design an attentive global-local contexts module in the salient feature encoder, which simultaneously takes the global and local contexts into consideration to build the salient feature embedding. It therefore consists of two sub-parts that separately retrieve the influence of the global context and of the local context. The framework of the attentive global-local contexts module is shown in Fig. 2.

Fig. 2 The attentive global-local contexts module

Attentive global-local contexts module. Given the field embedding vector E = [e1,e2,...,em], the feature embedding learned from the global context requires global context information. Therefore, we apply mean pooling on each feature embedding ei to calculate the global information ai, forming a global weight vector A = [a1,a2,...,am]. Then, we learn the weight of each feature embedding from the global weight vector by using the widely used dimensionality-reduction and dimensionality-increase method. Finally, a global feature embedding Vg is built from the field embedding vector and the weight vector by a reweighting step. The detailed calculations of these steps are as follows.

$$ {\boldsymbol{a}_{i}} = \frac{1}{d}\sum\limits_{j = 1}^{d} {{\boldsymbol{e}_{i}^{j}}} $$
(1)
$$ {\textbf{\textit{G}} = [{\textbf{\textit{g}}_{1}},{\textbf{\textit{g}}_{2}},...,{\textbf{\textit{g}}_{i}},...,{\textbf{\textit{g}}_{m}}] = {\sigma_{1}}({\textbf{\textit{W}}_{g1}}{\sigma_{2}}({\textbf{\textit{W}}_{g2}}\textbf{\textit{A}}))} $$
(2)
$$ \begin{array}{@{}rcl@{}} \boldsymbol{V}_{g} &=& [{\boldsymbol{v}_{g1}},{\boldsymbol{v}_{g2}},...,{\boldsymbol{v}_{gi}},...,{\boldsymbol{v}_{gm}}]\\ &=& [{\boldsymbol{g}_{1}} \cdot {\boldsymbol{e}_{1}},{\boldsymbol{g}_{2}} \cdot {\boldsymbol{e}_{2}},...,{\boldsymbol{g}_{i}} \cdot {\boldsymbol{e}_{i}},...{\boldsymbol{g}_{m}} \cdot {\boldsymbol{e}_{m}}] \end{array} $$
(3)

where \(\boldsymbol{e}_{i}^{j}\) denotes the j-th value of the embedding of the i-th field feature. G is the global gate, and gi denotes the global gate of the i-th field feature. \(\boldsymbol{W}_{g1} \in {R^{m \times \frac {m}{r}}}\) and \(\boldsymbol{W}_{g2} \in {R^{\frac {m}{r} \times m}}\) are learning parameters, in which r is a scaling factor that controls the degree of dimensionality reduction and increase when computing the weight vector. σ1 and σ2 are nonlinear activation functions. vgi denotes the i-th global feature embedding.
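A simplified NumPy sketch of Eqs. (1)-(3) is given below. It is an illustration rather than our released implementation; following the implementation details in Section 4.2.4, ReLU is assumed for both activation functions, and the toy shapes and random parameters are only for demonstration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def global_context_gate(E, W_g1, W_g2):
    """Eqs. (1)-(3): squeeze each field embedding to a scalar (mean pooling),
    pass the m-dimensional summary through a reduce-then-restore MLP to get
    one gate per field, and reweight the field embeddings.

    E: (m, d) field embeddings; W_g2: (m//r, m); W_g1: (m, m//r).
    """
    A = E.mean(axis=1)                  # Eq. (1): global summary, shape (m,)
    G = relu(W_g1 @ relu(W_g2 @ A))     # Eq. (2): one gate per field, shape (m,)
    return G[:, None] * E               # Eq. (3): V_g, shape (m, d)

m, d, r = 6, 8, 3
E = np.random.randn(m, d)
V_g = global_context_gate(E, np.random.randn(m, m // r), np.random.randn(m // r, m))
print(V_g.shape)   # (6, 8)
```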

Additionally, in order to capture the characterized information of each feature when modelling feature importance, we also take the local context into consideration. Specifically, given the field embedding vector E = [e1,e2,...,em], we directly apply the dimensionality-reduction and dimensionality-increase mechanism to the field embedding to compute the contribution of each individual feature to the target task.

$$ {\boldsymbol{L} = [{\boldsymbol{l}_{1}},{\boldsymbol{l}_{2}},...,{\boldsymbol{l}_{i}},...,{\boldsymbol{l}_{m}}] = {\sigma_{1}}({\boldsymbol{W}_{l1}}{\sigma_{2}}({\boldsymbol{W}_{l2}}\boldsymbol{E}))} $$
(4)

where L is the local gate, and li denotes the local gate of the i-th field feature. \(\boldsymbol{W}_{l1} \in {R^{m \times \frac {m}{r}}}\) and \(\boldsymbol{W}_{l2} \in {R^{\frac {m}{r} \times m}}\) play the same role as \(\boldsymbol{W}_{g1}\) and \(\boldsymbol{W}_{g2}\): they are used for dimensionality reduction and increase. It is noteworthy that L has the same shape as the input field embedding vector E, which preserves and highlights the subtle details of the local information. Then, we form a local feature embedding Vl by assigning a local weight to the field embedding vector.

$$ {\textbf{\textit{V}}_{l} = [{\textbf{\textit{v}}_{l1}},{\textbf{\textit{v}}_{l2}},...,{\textbf{\textit{v}}_{li}},...,{\textbf{\textit{v}}_{lm}}] = [{\textbf{\textit{l}}_{1}} \cdot {\textbf{\textit{e}}_{1}},{\textbf{\textit{l}}_{2}} \cdot {\textbf{\textit{e}}_{2}},...,{\textbf{\textit{l}}_{i}}\cdot {\textbf{\textit{e}}_{i}},...,{\textbf{\textit{l}}_{m}} \cdot {\textbf{\textit{e}}_{m}}]} $$
(5)

where vli denotes the i-th element of the local feature embedding Vl.

Given the global feature embedding Vg and the local feature embedding Vl, the salient feature embedding V can be obtained as follows:

$$ {\boldsymbol{V} = [{\boldsymbol{v}_{1}},{\boldsymbol{v}_{2}},...,{\boldsymbol{v}_{i}},...,{\boldsymbol{v}_{m}}] = {(\textbf{\textit{E}} \otimes \sigma ({\textbf{\textit{V}}_{l}} \oplus {\textbf{\textit{V}}_{g}}))}\oplus \textbf{\textit{E}}} $$
(6)

where vi denotes the i-th salient feature embedding, ⊗ and ⊕ denote element-wise multiplication and addition, and σ is a nonlinear activation function. Comparing Eq. (2) with Eq. (4), we can observe that the global context emphasizes common information that is distributed more globally, whereas the local context highlights characterized information that is distributed more locally. Thus, with the help of the attentive global-local contexts module, the salient feature encoder comprehensively emphasizes features that are distributed both globally and locally, and the weight of each feature is dynamically adjusted according to its contribution.
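The local gate and the fusion step can be sketched in the same style. This is again only an illustration of Eqs. (4)-(6) with random toy parameters: σ in Eq. (6) is assumed to be a sigmoid so that the fused global/local signal acts as a gate, which is an assumption rather than a detail stated in the equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def local_context_gate(E, W_l1, W_l2):
    """Eqs. (4)-(5): reduce-then-restore along the field axis; the gate L keeps
    the same (m, d) shape as E, preserving per-dimension local detail."""
    L = relu(W_l1 @ relu(W_l2 @ E))     # (m, d)
    return L * E                        # V_l: element-wise reweighting

def salient_features(E, V_g, V_l):
    """Eq. (6): gate the fused global/local signals and add a residual to E."""
    return E * sigmoid(V_g + V_l) + E

m, d, r = 6, 8, 3
E = np.random.randn(m, d)
V_g = np.random.randn(m, d)        # stands in for the output of Eqs. (1)-(3)
V_l = local_context_gate(E, np.random.randn(m, m // r), np.random.randn(m // r, m))
V = salient_features(E, V_g, V_l)
print(V.shape)                      # (6, 8)
```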

3.2.2 Meaningful second-order interaction encoder

The meaningful second-order interaction encoder models the second-order feature interactions in a precise and effective way. An inner product and a Hadamard product are commonly used in existing works for modelling second-order feature interactions. However, they are too simple to effectively calculate the feature interactions on sparse datasets [10]. To alleviate this limitation, FiBiNET proposes a field-interaction type for modelling second-order feature interactions by integrating an inner product and a Hadamard product, and it achieves good performance. However, we argue that it does not fully consider the relationships between pair-wise features. Specifically, the field-interaction type projects the i-th feature onto the j-th feature through an inner product and then models the interaction via a Hadamard product, which ignores the mapping relation from the j-th feature to the i-th feature. Therefore, we propose a more fine-grained approach called the projective bilinear function, which captures the mapping relations between two features through two inner products and obtains the interaction relations on the mapped features via the Hadamard product. Compared to the widely used inner product and Hadamard product, the projective bilinear function can encode more informative and inherent relations between different features. In addition, it facilitates the following encoder in learning meaningful information. The structure of the projective bilinear function is shown in Fig. 3. Taking the i-th salient feature embedding vi and the j-th salient feature embedding vj as an example, their feature interaction \(\boldsymbol{p}^{\prime}_{ij}\) is calculated by:

$$ {\boldsymbol{p}^{\prime}_{ij}} = ({\boldsymbol{v}_{i}} \cdot {\boldsymbol{W}_{pi}}) \odot ({\boldsymbol{v}_{j}} \cdot {\boldsymbol{W}_{pj}}) $$
(7)

where ⊙ is the element-wise product of vectors, and \(\boldsymbol{W}_{pi} \in R^{d \times d}\) and \(\boldsymbol{W}_{pj} \in R^{d \times d}\) are learning parameters. The ranges of i and j are 1 ≤ i ≤ m and i < j ≤ m. Compared to the field-interaction type, the projective bilinear function has a stronger expressive ability in modelling second-order feature interactions and forms a more fine-grained interaction.
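A minimal sketch of Eq. (7) for a single feature pair is shown below. Whether the projection matrices are shared per feature or defined per pair is an implementation choice; here each feature is given its own d × d matrix purely for illustration.

```python
import numpy as np

def projective_bilinear(v_i, v_j, W_pi, W_pj):
    """Eq. (7): project both features with their own d x d matrices, then take
    the Hadamard (element-wise) product of the two projections."""
    return (v_i @ W_pi) * (v_j @ W_pj)    # shape (d,)

d = 8
v_i, v_j = np.random.randn(d), np.random.randn(d)
W_pi, W_pj = np.random.randn(d, d), np.random.randn(d, d)
p_ij = projective_bilinear(v_i, v_j, W_pi, W_pj)
print(p_ij.shape)   # (8,)
```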

Fig. 3 The projective bilinear function for feature interaction

Generally, not all feature interactions are relevant to the final task. Irrelevant feature interactions are considered noise and may deteriorate the generalization performance of the model. Therefore, we introduce an attention mechanism that computes the corresponding attention score with an MLP, whose input is the feature interaction vector. Formally, the attention score \(\alpha_{ij}^{p}\) and the weighted feature interaction pij are defined as:

$$ {\alpha_{ij}^{p} = \frac{{\exp (\boldsymbol{h}_{p}^{T}\text{ReLU}(\boldsymbol{W}_{ij}^{p}{\boldsymbol{p}^{\prime}_{ij}} + b_{p}))}}{{\sum\limits_{i,j} {\exp (\boldsymbol{h}_{p}^{T}\text{ReLU}(\boldsymbol{W}_{ij}^{p}{\boldsymbol{p}^{\prime}_{ij}} + b_{p}))} }}} $$
(8)
$$ {{\textbf{\textit{p}}_{ij}} = \alpha_{ij}^{p}{\textbf{\textit{p}}^{\prime}_{ij}}} $$
(9)

where \(\boldsymbol{W}_{ij}^{p} \in {R^{d \times d}}\) and \(\boldsymbol{h}_{p}, \boldsymbol{b}_{p} \in R^{d}\) are learning parameters, and \(\boldsymbol{h}_{p}^{T}\) is the transpose of hp. Inspired by the success of Densely connected convolutional Networks (DenseNet) [19], and since the residual network [41] can provide more comprehensive deep features and alleviate the vanishing-gradient problem, we also apply our proposed second-order feature interaction method to the field embedding vector to strengthen and enrich the interactions. Besides, we compute an attention score for each of these feature interactions with an MLP to distinguish its importance. Taking ei and ej as an example, the weighted field feature interaction qij is computed as follows:

$$ {{\textbf{\textit{q}}^{\prime}_{ij}} = ({\boldsymbol{e}_{i}} \cdot {\boldsymbol{W}_{qi}}) \odot ({\boldsymbol{e}_{j}} \cdot {\boldsymbol{W}_{qj}})} $$
(10)
$$ {\alpha_{ij}^{q} = \frac{{\exp (\boldsymbol{h}_{q}^{T}\text{ReLU}(\boldsymbol{W}_{ij}^{q}{\boldsymbol{q}^{\prime}_{ij}} + b_{q}))}}{{\sum\limits_{i,j} {\exp (\boldsymbol{h}_{q}^{T}\text{ReLU}(\boldsymbol{W}_{ij}^{q}{\boldsymbol{q}^{\prime}_{ij}} + b_{q}))} }}} $$
(11)
$$ {{\boldsymbol{q}_{ij}} = \alpha_{ij}^{q}{\boldsymbol{q}^{\prime}_{ij}}} $$
(12)

where \(\boldsymbol{q}^{\prime}_{ij}\) is the feature interaction of the i-th feature ei and the j-th feature ej, and \(\alpha_{ij}^{q}\) is the attention score of ei and ej. \(\boldsymbol{W}_{qi}, \boldsymbol{W}_{qj}, \boldsymbol{W}_{ij}^{q} \in {R^{d \times d}}\) and \(\boldsymbol{h}_{q}, \boldsymbol{b}_{q} \in R^{d}\) are learning parameters, and \(\boldsymbol{h}_{q}^{T}\) is the transpose of hq. Finally, we employ concatenation and fully-connected layers to learn the comprehensive second-order feature interactions hs and endow them with richer expressive ability.

$$ \begin{array}{@{}rcl@{}} {\boldsymbol{h}_{\textit{s}}} &=& [\boldsymbol{h}^{1},\boldsymbol{h}^{2},...,\boldsymbol{h}^{i},...,\boldsymbol{h}^{n}]\\ &=& FC(Concat({\boldsymbol{p}_{1}},{\boldsymbol{p}_{2}},...,{\boldsymbol{p}_{i}},...,{\boldsymbol{p}_{n}},{\boldsymbol{q}_{1}},{\boldsymbol{q}_{2}},...,{\boldsymbol{q}_{i}},...,{\boldsymbol{q}_{n}}){\boldsymbol{W}_{fc}})\\ \end{array} $$
(13)

where hi is the i-th embedding of hs, pi and qi are the weighted interaction vectors, FC(⋅) denotes fully-connected layers, Wfc is a linear transformation matrix, and n is the number of feature interactions, equal to \(\frac {{m(m - 1)}}{2}\). Figure 4 shows the processing of the meaningful second-order interaction encoder.
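The following sketch illustrates Eqs. (8)-(9) and (13) on randomly initialized tensors. For brevity it shares one attention MLP across all pairs (the formulas allow a pair-specific \(\boldsymbol{W}_{ij}\)) and collapses Eq. (13) into a single linear map; it is an assumption-laden illustration, not the exact implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def attentive_interactions(P, W, h, b):
    """Eqs. (8)-(9): softmax attention over the n pairwise interactions.
    P: (n, d) stacked interactions p'_ij; W: (d, d); h, b: (d,)."""
    scores = np.exp(relu(P @ W + b) @ h)       # unnormalized attention, (n,)
    alpha = scores / scores.sum()              # softmax over all pairs
    return P * alpha[:, None]                  # weighted interactions, (n, d)

def second_order_output(P_salient, Q_field, W_fc):
    """Eq. (13), simplified: concatenate both interaction sets, then one linear map."""
    concat = np.concatenate([P_salient, Q_field], axis=0).reshape(-1)
    return concat @ W_fc

m, d = 6, 8
n = m * (m - 1) // 2
P = attentive_interactions(np.random.randn(n, d), np.random.randn(d, d),
                           np.random.randn(d), np.random.randn(d))
Q = attentive_interactions(np.random.randn(n, d), np.random.randn(d, d),
                           np.random.randn(d), np.random.randn(d))
h_s = second_order_output(P, Q, np.random.randn(2 * n * d, 64))
print(h_s.shape)   # (64,)
```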

Fig. 4 The meaningful second-order interaction encoder

3.2.3 Dominant high-order interaction encoder

Deep learning networks with several fully-connected layers are widely used to extract high-order feature interactions, and a hierarchical structure has the capability to build more representative and efficient features. Normally, existing works merely take the output of the last layer as a dense real-value feature vector to make predictions. However, we argue that these methods ignore the relations among layers. Intuitively, features from different layers carry different information for feature learning. Thus, we introduce an attention mechanism over the layers to distinguish dominant layers from irrelevant ones and assign a weight to each layer dynamically, which improves the feature extraction and is less costly.

The structure of the dominant high-order interaction encoder is shown in Fig. 5.

Fig. 5 The dominant high-order interaction encoder

Firstly, the comprehensive second-order feature interactions hs are fed into a feed-forward neural network in which the hidden layers have the same number of units. This is denoted as:

$$ {{\boldsymbol{H}_{1}} = \sigma ({\boldsymbol{W}_{0}}{\boldsymbol{H}_{0}} + {b_{0}})} $$
(14)
$$ {{\textbf{\textit{H}}_{L}} = \sigma ({\textbf{\textit{W}}_{L - 1}}{\textbf{\textit{H}}_{L - 1}} + {b_{L - 1}})} $$
(15)

where W0 and b0 are the weight matrix and bias vector of the 0-th layer, and WL−1 and bL−1 are the weight matrix and bias vector of the (L−1)-th layer. H0 = hs, L denotes the number of hidden layers, and σ is the activation function. H1 is the output of the first layer and HL is the output of the L-th layer. Thus, compared to the 0-th layer, the output of the L-th layer contains more comprehensive information. Then, in order to build more detailed and useful information, we apply an attention mechanism to aggregate these high-order features into a dense real-value feature vector h.

$$ {{\alpha_{k}} = \frac{{\exp (\boldsymbol{h}_{k}^{T}\text{ReLU}({\boldsymbol{W}_{k}}{\boldsymbol{H}_{k}} + b_{k}))}}{{\sum\limits_{k^{\prime} = 0}^{L} {\exp (\boldsymbol{h}_{k^{\prime}}^{T}\text{ReLU}({\boldsymbol{W}_{k^{\prime}}}{\boldsymbol{H}_{k^{\prime}}} + b_{k^{\prime}}))} }}} $$
(16)
$$ {\textbf{\textit{h}} = [{\alpha_{0}}{\textbf{\textit{H}}_{0}},{\alpha_{1}}{\textbf{\textit{H}}_{1}},...,{\alpha_{L}}{\textbf{\textit{H}}_{L}}]} $$
(17)

where αk is the weight of the k-th layer, which represents the different contributions of the hierarchical layers. Hk is the output of the k-th layer, and Wk, hk, and bk are learning parameters. Through Eqs. (16) and (17), the hierarchical features are not aggregated equally for feature learning.
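Putting Eqs. (14)-(17) together, a compact sketch of the dominant high-order interaction encoder is given below; the layer width, depth, and random parameters are illustrative only, and the attention MLP is written per layer as in Eq. (16).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dominant_high_order(h_s, weights, biases, attn_W, attn_h, attn_b):
    """Eqs. (14)-(17): stack L hidden layers of equal width on top of h_s,
    score every layer output (including H_0 = h_s) with an attention MLP,
    and concatenate the weighted outputs into h."""
    H = [h_s]                                          # H_0
    for W, b in zip(weights, biases):                  # Eqs. (14)-(15)
        H.append(relu(W @ H[-1] + b))
    scores = np.array([np.exp(hk @ relu(Wk @ Hk + bk))
                       for Hk, Wk, hk, bk in zip(H, attn_W, attn_h, attn_b)])
    alpha = scores / scores.sum()                      # Eq. (16)
    return np.concatenate([a * Hk for a, Hk in zip(alpha, H)])   # Eq. (17)

u, L = 64, 4                                   # hidden width and depth (toy values)
h_s = np.random.randn(u)
weights = [np.random.randn(u, u) for _ in range(L)]
biases = [np.random.randn(u) for _ in range(L)]
attn_W = [np.random.randn(u, u) for _ in range(L + 1)]
attn_h = [np.random.randn(u) for _ in range(L + 1)]
attn_b = [np.random.randn(u) for _ in range(L + 1)]
h = dominant_high_order(h_s, weights, biases, attn_W, attn_h, attn_b)
print(h.shape)    # ((L + 1) * u,) = (320,)
```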

3.3 Prediction layer

Finally, h is fed into the sigmoid function for CTR prediction. It is formulated as:

$$ {\hat y = \sigma (\boldsymbol{W}_{L+1}{\boldsymbol{h}} + b_{L+1})} $$
(18)

where WL+1 and bL+1 are the model weight and the bias respectively, and \({\hat y \in (0,1)}\) is the predicted value. Furthermore, we use the widely adopted cross-entropy loss function to train our proposed HAFP model:

$$ {loss = - \sum\limits_{j \in N} {({y_{j}}\log ({{\hat y}_{j}}) + (1 - {y_{j}})\log (1 - {{\hat y}_{j}}))} + \lambda {{\left\| \theta \right\|}_{2}}} $$
(19)

where N denotes the set of training samples and 𝜃 is the parameter set of the model. yj is the ground truth of the j-th instance. Additionally, we introduce L2 regularization weighted by λ to prevent overfitting, and we use the Adam gradient descent optimizer to minimize Eq. (19).
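Eqs. (18)-(19) can be sketched as follows. The sketch averages the cross-entropy over samples and penalizes the squared L2 norm, which are common implementation conventions and may differ slightly from the exact training code; all parameter values are toy placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(h, W_out, b_out):
    """Eq. (18): single logistic output unit on the aggregated vector h."""
    return sigmoid(W_out @ h + b_out)

def hafp_loss(y_true, y_pred, params, lam=1e-5, eps=1e-7):
    """Eq. (19): binary cross-entropy plus an L2 penalty on the parameters."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    ce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    l2 = lam * sum(np.sum(p ** 2) for p in params)
    return ce + l2

h = np.random.randn(320)
W_out, b_out = np.random.randn(320), 0.0
y_hat = predict(h, W_out, b_out)
print(hafp_loss(np.array([1.0]), np.array([y_hat]), [W_out]))
```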

4 Experiments and analysis

4.1 Research questions

We conduct experiments which aim to answer the following research questions:

  1. (RQ1)

    What is the performance of HAFP in CTR prediction? Does it outperform the state-of-the-art models in terms of Logloss and AUC? (See Section 4.3)

  2. (RQ2)

    How well does HAFP perform with the hierarchical attention mechanism? (See Section 4.4)

  3. (RQ3)

    How well does HAFP perform with different types of bilinear interactions? (See Section 4.5)

  4. (RQ4)

    How well does HAFP perform with single context instead of global-local contexts in the salient feature encoder? (See Section 4.6)

Before implementing extensive experiments, we first present the experimental settings including datasets, evaluation metrics, baselines, and parameter settings.

4.2 Experiment settings

4.2.1 Datasets

We use two datasets that are commonly adopted in CTR prediction, Criteo and Avazu, to evaluate the effectiveness of the proposed model. The first dataset was released for the Display Advertising Challenge 2014. It contains 39 anonymous fields about displayed ads, consisting of 26 categorical fields and 13 continuous fields. The second dataset was released for the Click-Through Rate Prediction competition in 2014. Its data relates to users' click behaviors on displayed mobile ads, and it has 24 fields describing user/device features and ad attributes. In the experiments, we split each dataset into three parts: 80% for training, 10% for validation, and 10% for testing.

4.2.2 Evaluation metrics

To evaluate the performance of HAFP in CTR prediction, we adopt Area Under Curve (AUC) and Logloss as evaluation metrics, which have been widely used in related works. Please note that an improvement of 1% in AUC or Logloss brings a large increase in revenue for advertising platforms [3, 4].

AUC: AUC is the primary metric; it reflects the ranking performance between clicked and non-clicked instances. Its upper bound is 1, and a higher AUC indicates better performance.

Logloss: Logloss measures the overall likelihood of the test data. It is widely used in classification tasks, and a lower Logloss indicates better performance.
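Both metrics are available off the shelf, for example in scikit-learn; the snippet below shows how they would be computed on a handful of toy predictions (the labels and scores are purely illustrative).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

y_true = np.array([1, 0, 0, 1, 0])              # clicked / not clicked
y_prob = np.array([0.8, 0.3, 0.4, 0.6, 0.1])    # model outputs, e.g. from Eq. (18)

print("AUC:    ", roc_auc_score(y_true, y_prob))   # higher is better
print("Logloss:", log_loss(y_true, y_prob))        # lower is better
```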

4.2.3 Baselines

We compare our model HAFP with different baseline methods for CTR prediction. The baselines include:

LR: LR employs a linear combination of features to compute the CTR prediction.

FM [24]: FM models first-order features linearly and second-order feature interactions through inner products of factorized embeddings to compute the CTR prediction.

AFM [34]: AFM extends FM by introducing an attention mechanism that distinguishes the different weights of second-order feature interactions.

NFM [7]: The Neural Factorization Machine (NFM) builds feature interactions via a bi-interaction pooling layer before the DNN.

DeepFM [6]: DeepFM extracts feature interactions by combining an FM part and a deep MLP part.

InterHAt [16]: InterHAt models feature interactions by using a multi-head transformer and hierarchical attention layers.

FiBiNET [10]: FiBiNET employs a squeeze-and-excitation network layer and a bilinear-interaction layer to explore salient features and feature interactions for CTR prediction.

4.2.4 Implementation details

We implement HAFP and the baselines with TensorFlow on a Tesla T4 GPU. In the embedding layer, the dimension of each feature is set to 8 for the Avazu dataset and 10 for the Criteo dataset. The scaling factor r in the salient feature encoder is set to 3, and the activation function in the attentive global-local contexts module is ReLU. In the dominant high-order interaction encoder, the hidden layer depth L is set to 4, and the activation functions in this encoder are ReLU. Additionally, to optimize the HAFP model, we use Adam to update the parameters in the training stage with a mini-batch size of 512 on the Criteo and Avazu datasets, and we set the learning rate to 0.0001 on both datasets. To ensure the reliability of the model performance, the reported results are averaged over 5 runs of the experiments. Moreover, the parameters of all baseline models follow the experimental settings reported in their original works for fair comparison.

4.3 Performance comparison

The results for CTR prediction of HAFP and the baselines on the Criteo and Avazu datasets are shown in Table 2. From the table, we can make the following observations:

Table 2 Performance comparison on the two datasets in terms of Logloss and AUC. Statistical significance between our proposed HAFP and the best baseline is tested at the p < 0.05 level

LR performs worse than the other methods, which shows the power of feature interactions in feature learning. AFM achieves a significantly better performance than FM, which demonstrates the benefit of an attention mechanism in learning the weights of feature interactions. Furthermore, LR and FM based methods perform worse than the methods that incorporate a deep learning network for CTR prediction, which demonstrates the effectiveness of non-linear transformations and deep neural networks in modelling feature interactions. InterHAt benefits from second-order feature interactions more than NFM and DeepFM and achieves better performance. Capturing feature importance is important for feature learning because different features contribute differently to the final task. InterHAt represents second-order feature interactions with a multi-head transformer, but it neglects feature importance in the feature embedding. FiBiNET performs better than InterHAt, mainly because its SENET structure captures more accurate embeddings of each feature.

Compared to InterHAt and FiBiNET, HAFP further builds feature learning on the hierarchical attention mechanism, which captures the real-value feature vector from both low- and high-order feature interactions simultaneously. HAFP takes full account of the relevant information and outperforms the baselines. Generally, HAFP obtains improvements of 0.3% and 0.2% in AUC on the two datasets over the best baseline, FiBiNET, respectively.

Furthermore, to verify whether the relative improvements of HAFP are statistically significant, we conduct a paired t-test, and the resulting p-values are reported in Table 2 with different markers. In this table, each p-value refers to the comparison between HAFP and the best baseline. The p-values in Table 2 are all less than 0.05, which validates that the improvements of HAFP are statistically significant.

4.4 Influence of hierarchical attention mechanism

To illustrate the influence of the hierarchical attention mechanism for CTR prediction, we compare the performance of HAFP and three variants of HAFP. FP-0 refers to HAFP without the salient feature encoder and without the attention mechanism in the meaningful second-order interaction encoder and the dominant high-order interaction encoder. FP-1 refers to HAFP without the attention mechanism in the meaningful second-order interaction encoder and the dominant high-order interaction encoder. FP-12 refers to HAFP without the attention mechanism in the dominant high-order interaction encoder. The results on Criteo and Avazu are summarized in Table 3.

Table 3 Performance comparison of HAFP with different encoders

First, although FP-0 is the worst method for CTR prediction among the variants of HAFP, it still achieves better performance than FiBiNET, which has a similar structure. Considering the difference between FP-0 and FiBiNET, this result shows that the projective bilinear function plays an important role in feature learning and verifies its effectiveness. Second, compared to FP-0, FP-1 obtains some improvements, which indicates the effect of considering feature importance in feature learning and verifies that building an accurate feature embedding can improve CTR prediction. Third, FP-12 achieves better performance than FP-0 and FP-1 in terms of Logloss and AUC on the two datasets. This confirms that using the attention mechanism in the second-order interaction encoder captures the relevant contexts for feature learning, and it also means that not all pair-wise feature interactions are equally useful for CTR prediction tasks. Finally, compared to the above three variants, HAFP achieves the best performance, which indicates that the hierarchical attention mechanism contributes to our model. Additionally, HAFP outperforms FP-0 by 0.2% in AUC on each of the two datasets. The results demonstrate that our hierarchical attention mechanism captures the informative features and feature interactions for CTR prediction, which verifies the rationality and feasibility of our contribution in this work. In summary, the performance comparison of HAFP with different encoders demonstrates that, with the designed strategy, a better feature representation can be achieved, which further helps CTR prediction.

4.5 Influence of bilinear function

To study the effectiveness of our projective bilinear function for learning second-order feature interactions, we conduct ablation experiments on its impact in HAFP. HAFI refers to the variant in which the field-interaction type is used to encode the feature interactions in the meaningful second-order interaction encoder, and HAT refers to the variant in which a multi-head transformer is used instead. As Fig. 6 shows, HAFI obtains better AUC and Logloss than HAT on the two datasets. Additionally, HAT is computationally heavier than HAFI since it has more parameters and requires more training time. Thus, a bilinear interaction function with an attention mechanism is beneficial and suitable for modelling second-order feature interactions. Furthermore, HAFP outperforms both HAFI and HAT on both datasets, which verifies the effectiveness of the projective bilinear function. The main reason is that it fully considers the mapping relations between two features, so the interaction results are more comprehensive.

Fig. 6 Performance comparison of HAFP with different bilinear functions

4.6 Influence of attentive global-local context module

To study the impact of our attentive global-local contexts module (AGLM), we construct two ablated modules, an attentive global module (AGM) and an attentive local module (ALM), keeping the other parts of HAFP the same. The only difference among these three modules is the contextual information used to build feature importance. The comparison results are shown in Fig. 7. It can be seen that:

  1. (1)

    HAFP with AGM performs slightly better than HAFP with ALM in Logloss, while the latter achieves better AUC on Avazu. Since the global context extracts common information and the local context retrieves characterized information, we argue that both global information and local information play an important role in building feature importance.

    Fig. 7 Performance comparison of HAFP with different contexts in the attentive global-local module

  2. (2)

    HAFP with AGLM achieves better performance than AGM and ALM in all settings. This suggests that focusing only on global or only on local information is too biased and weakens feature learning. In summary, integrating global and local information is vital for dynamically assigning feature weights and is an effective way to improve CTR prediction.

5 Conclusion

In this paper, we highlight the relevant information in different orders of feature interactions for CTR prediction and propose a novel Hierarchical Attention and Feature Projection (HAFP) neural network. There are two major parts in HAFP: 1) it employs a three-level attention mechanism to strengthen the weights of relevant features and decrease the weights of irrelevant ones; 2) it designs a projective bilinear function to learn more fine-grained second-order feature interactions. Compared to existing methods, our model fully utilizes the interactions of different field pairs and automatically selects dominant features and feature interactions for feature learning. We conduct extensive experiments on two public datasets. The results show that HAFP outperforms state-of-the-art baselines for CTR prediction, and the ablation experiments, which analyze the effects of the hierarchical attention, the bilinear function, and the attentive global-local contexts module, demonstrate the rationality and effectiveness of our contributions.

Although the prediction performance is improved by our proposed method, it models the feature interactions in an implicit way, and the unstructured combination of features inevitably limits the ability to model sophisticated interactions among different features in a sufficiently flexible fashion. This limitation opens up new research possibilities. In future work, we plan to build sophisticated interactions among different features in an explicit manner. Moreover, inspired by the power of graph neural networks (GNNs), we will attempt to extend our work with GNNs to further improve the performance.