
1 Introduction

Click-through rate (CTR) prediction is one of the fundamental tasks in recommendation systems. By predicting the probability that a user will click an item, a website or application can rank the candidate item list and push it to the target user, so as to provide a personalized recommendation service. Early CTR prediction models produced the CTR through Logistic Regression and relied on automatic feature engineering methods such as Factorization Machine (FM) [10] and Gradient Boosting Decision Tree (GBDT) [3] for business deployment. With the development of deep learning, neural-network-based CTR prediction models such as PNN [9] have gradually become the mainstream in real applications.

The Wide & Deep [2] CTR prediction model used parallel structures of different depths to account for both memorization and generalization. In subsequent studies, FNN [18], DeepCrossing [11], DeepFM [4], DCN [16], xDeepFM [7], DCN-V2 [17], EDCN [1], NFM [5] and other models adopted a parallel structure similar to Wide & Deep, analyzing a shared embedding through different modules. The generalization of such models therefore depends on the effectiveness of the parallel structures.

However, previous parallel-structure models lack a comprehensive analysis of feature interaction, which limits their generalization. At the end of a CTR model, the click-through rate is produced by a linear layer with an activation function. During the training phase, the activation values fed from the parallel branches into this layer tend to fall into the weak gradient interval. This phenomenon weakens the training of each parallel module, so the added model complexity does not translate into better generalization.

In this paper, we propose a new deep CTR model, named Parallel Heterogeneous Network (PHN). PHN includes three parallel feature interaction structures that analyze CTR features from different perspectives. To enhance the independent analysis ability of each parallel module, a Soft Selection Gating module is placed after the shared embedding to enhance the original embedding representation. We also add residual connections with trainable parameters to the model, which accumulate gradients during back propagation and thereby alleviate the weak gradient phenomenon.

The main contributions of this paper are as follows:

  • In order to strengthen the expressive ability of the CTR prediction model, this paper constructs three different feature interaction methods on top of the parallel structure: non-linear interaction, bit-wise interaction and vector-wise interaction.

  • Soft Selection Gating is constructed before the parallel structure; the original embedding features are enhanced by self-attention and a soft gating structure while the high-order cross characteristics are retained, which improves the model's ability to represent the data.

  • To alleviate the weak gradient phenomenon in the parallel model, residual links with trainable parameters are used in the parallel structure to reinforce the training process.

  • The effectiveness of Soft Selection Gating and the weak gradient phenomenon are visualized, and the effectiveness of PHN is verified through comparison experiments.

2 Proposed Method

The Parallel Heterogeneous Network (PHN) consists of two main structures. The first is a Soft Selection Gating (SSG) module based on self-attention that enhances the embedding features for the different structures; the second is the Heterogeneous Interaction Layer (HIL), which analyzes the enhanced features with different interaction methods and finally uses Logistic Regression to output the confidence of a sample. Figure 1(a) illustrates the overall structure and details of the model.

Fig. 1. The overview structure of our proposed PHN model, which consists of the Soft Selection Gating (SSG) module and the Heterogeneous Interaction Layer (HIL)

2.1 Heterogeneous Interaction Layer

In PHN, the parallel layers use three kinds of interaction layers to improve the interaction capability of the model: 1) the cross layer, the basic building block of DCN [16] and DCNV2 [17], which focuses on element-aware feature interaction; 2) the field interaction layer, the basic building block of FINT [19], which focuses on vector-aware field interaction; 3) the feed forward layer, which is used to fit non-polynomial information.

Cross Interaction Layer. Feature interaction is a key point in the study of most CTR prediction models. In previous work, DCN [16] and DCNV2 [17] proposed two explicit interaction methods, which realize high-order interaction by crossing multi-layer hidden features with the original features.

$$\begin{aligned} y_{dcnv2} = x_0 \odot (W \times x_i + b) + x_i \end{aligned}$$
(1)
$$\begin{aligned} y_{dcn} = x_0 \odot x_i^T * w + x_i + b \end{aligned}$$
(2)

where \(x_0\) is the input feature of the first cross layer, \(x_i\) is the output feature of the i-th cross layer, W and w denote a trainable parameter matrix and vector respectively, \(\odot \) is the Hadamard product and \(\times \) is matrix multiplication. In Eqs. 1 and 2, DCN and DCNV2 use different parameter forms to interact features, but both achieve element-aware feature interaction. PHN combines the formulas of DCN and DCNV2 to construct a bit-aware interaction module. As shown in Fig. 2, we use the parameter parts of the two cross models together with a bias to construct the cross layer in PHN.

Fig. 2. The calculation diagram of cross layers in PHN
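To make the combination concrete, the following PyTorch sketch shows one plausible way of fusing the DCN-style vector parameters and the DCNv2-style matrix parameters of Eqs. 1 and 2 into a single bit-wise cross layer. The class and parameter names are illustrative; the exact fusion used in PHN follows Fig. 2.

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """Sketch of a bit-wise (element-aware) cross layer combining DCN and DCNv2 terms."""

    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=True)      # DCNv2-style matrix parameters (W, b)
        self.w = nn.Parameter(torch.zeros(dim, 1))   # DCN-style vector parameters
        self.b = nn.Parameter(torch.zeros(dim))      # additional bias of the combined layer

    def forward(self, x0: torch.Tensor, xi: torch.Tensor) -> torch.Tensor:
        # x0, xi: (batch, dim); Hadamard products keep the interaction element-aware
        dcnv2_term = x0 * self.W(xi)                 # x0 ⊙ (W x_i + b), cf. Eq. 1
        dcn_term = x0 * (xi @ self.w)                # x0 scaled by (x_i · w), cf. Eq. 2
        return dcnv2_term + dcn_term + xi + self.b   # residual term x_i plus bias
```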

Field Interaction Layer. Besides element-aware interaction, vector-aware interaction is also a key part of the model construction. Field interaction is used in FiBiNet [6] and FINT [19], which apply a cross method to implement vector-aware interaction. PHN uses the field interaction layer of FINT [19] as a parallel branch to enhance the generalization of the whole network over feature crossing patterns.

Fig. 3. The calculation diagram of field interaction layers in PHN

As shown in Fig. 3, the field interaction layer uses a residual link with a trainable parameter vector, which multiplies the features of the different fields to screen the output of the upper layer. In the subsequent experiments, we discuss the residual link forms used in all parallel layers.
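A minimal sketch of such a field interaction layer is given below, assuming a FINT-style formulation in which each field embedding is crossed with a learned mixture of the original field embeddings, and a trainable per-field vector scales the residual branch. The parameter names are ours, not the authors'.

```python
import torch
import torch.nn as nn

class FieldInteractionLayer(nn.Module):
    """Sketch of a vector-aware field interaction layer with a trainable residual vector."""

    def __init__(self, num_fields: int, embed_dim: int):
        super().__init__()
        self.field_mix = nn.Parameter(torch.eye(num_fields))     # field-to-field mixing weights
        self.res_gate = nn.Parameter(torch.ones(num_fields, 1))  # trainable residual parameters

    def forward(self, x0: torch.Tensor, xi: torch.Tensor) -> torch.Tensor:
        # x0, xi: (batch, num_fields, embed_dim)
        mixed = torch.einsum('fg,bgd->bfd', self.field_mix, x0)  # weighted sum of original field vectors
        return xi * mixed + self.res_gate * xi                   # vector-wise cross + gated residual
```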

Feed Forward Layer. The third part of the parallel network is a Feed Forward Network (FFN). By alternating linear and non-linear transformations of the original features, the FFN complements the two cross modules above and improves the overall generalization ability of the network.

$$\begin{aligned} x_{i+1} = \sigma (\omega x_i+b) \end{aligned}$$
(3)

where \(\sigma \) is the activation function, which is LeakyReLU in PHN.
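Putting the three branches together, the sketch below wires the cross, field interaction and feed forward branches in parallel and merges them with a final logistic layer. It reuses the CrossLayer and FieldInteractionLayer sketches above; concatenating the branch outputs before the output layer is an assumption for illustration, and the exact merge used in PHN follows Fig. 1.

```python
import torch
import torch.nn as nn

class HeterogeneousInteractionLayer(nn.Module):
    """Sketch of the three parallel branches; depends on the CrossLayer and
    FieldInteractionLayer sketches defined above."""

    def __init__(self, num_fields: int, embed_dim: int, depth: int = 3):
        super().__init__()
        dim = num_fields * embed_dim
        self.cross = nn.ModuleList([CrossLayer(dim) for _ in range(depth)])
        self.field = nn.ModuleList([FieldInteractionLayer(num_fields, embed_dim) for _ in range(depth)])
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim), nn.LeakyReLU(),
            nn.Linear(dim, dim), nn.LeakyReLU(),
        )
        self.out = nn.Linear(3 * dim, 1)  # logistic regression over the merged branch outputs

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, num_fields, embed_dim), e.g. the SSG-enhanced embedding
        flat = emb.flatten(1)
        xc = flat
        for layer in self.cross:       # bit-wise cross branch
            xc = layer(flat, xc)
        xf = emb
        for layer in self.field:       # vector-wise field interaction branch
            xf = layer(emb, xf)
        xn = self.ffn(flat)            # non-polynomial feed forward branch
        logits = self.out(torch.cat([xc, xf.flatten(1), xn], dim=1))
        return torch.sigmoid(logits)   # click confidence
```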

2.2 Soft Selection Gating

The idea of running multiple structures in parallel raises a new question: do different structures require different input dense features? The traditional multi-head self-attention (MSA) mechanism [14] uses weights based on query and key vectors to aggregate information in a sequence, which makes it a natural method for processing feature information.

$$\begin{aligned} Q_E, \;K_E, \;V_E = W_QE, \;W_KE, \;W_VE \end{aligned}$$
(4)
$$\begin{aligned} MSA(Q_E,K_E,V_E) = Softmax(\frac{Q_EK_E^T}{\sqrt{d_k}})V_E \end{aligned}$$
(5)

As shown in Eqs. 4 and 5, MSA considers different field weights through the second-order intersection of the query vector \(Q_E\) and key vector \(K_E\). However, for feature interaction in CTR prediction, directly applying traditional MSA may over-focus on the feature activation values of second-order crosses and thereby lose the contribution of features in higher-order crosses. Inspired by FRNet [15], an information selection method named Soft Selection Gating (SSG) is applied after the shared embedding \(E\in R^n\) in PHN. This soft-gating information selection is designed to choose activations between the MSA result and the shared raw embedding.

$$\begin{aligned} E_{sg}=G_{sg}\odot E_{sa} + [I-G_{sg}]\odot E_{se} \end{aligned}$$
(6)

where \(G_{sg}\in R^n\) is the trainable gating vector, \(E_{sa}\in R^n\) is the shared self-attention embedding, \(E_{se}\in R^n\) is the shared embedding, and \(I\in R^n\) is a vector of ones. As Fig. 1(b) shows, SSG considers both the shared embedding and the self-attention embedding. Through the weighting parameters, the model can select either the raw feature or the MSA-enhanced feature for each parallel structure. Subsequent experiments discuss whether the weights of the self-attention and gating layers should be shared, and confirm the effectiveness of SSG.
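A minimal sketch of SSG following Eq. 6 is shown below. The use of PyTorch's built-in multi-head attention and the sigmoid squashing of the gate are assumptions; the paper only specifies that the gate is a trainable vector selecting between the two embeddings.

```python
import torch
import torch.nn as nn

class SoftSelectionGating(nn.Module):
    """Sketch of Eq. 6: E_sg = G_sg ⊙ E_sa + (I - G_sg) ⊙ E_se."""

    def __init__(self, num_fields: int, embed_dim: int, num_heads: int = 1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(num_fields, embed_dim))  # trainable gating vector G_sg

    def forward(self, e_se: torch.Tensor) -> torch.Tensor:
        # e_se: (batch, num_fields, embed_dim) shared raw embedding E_se
        e_sa, _ = self.attn(e_se, e_se, e_se)   # self-attention enhanced embedding E_sa
        g = torch.sigmoid(self.gate)            # keep the gate in (0, 1) -- an assumption
        return g * e_sa + (1.0 - g) * e_se      # soft selection between the two views
```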

2.3 Weak Gradient Problem

Based on the definition of the CTR prediction task, the key to improving the AUC value is increasing the confidence of positive samples and decreasing the confidence of negative samples, which makes the model more robust. In the last stage of a CTR prediction model, the click confidence is traditionally produced by a Sigmoid activation function.

Fig. 4. Gradients of different intervals of the Sigmoid function (Color figure online)

As Fig. 4 shows, the Sigmoid function can be divided into two intervals: the effective gradient interval (blue) with a normal gradient, and the weak gradient interval (red) with a gradient approaching zero. When the output values of the parallel branches are accumulated at the last linear layer, samples that would lie in the effective gradient interval of a single branch can easily be pushed into the weak gradient interval, which weakens the learning of each branch on valid samples. In this paper, this phenomenon is referred to as the weak gradient problem in parallel structures.
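A quick numerical illustration of the two intervals, with arbitrarily chosen pre-activation values, is given below: the Sigmoid derivative \(\sigma'(z) = \sigma(z)(1-\sigma(z))\) is close to zero once the accumulated output is pushed far from zero, so such samples receive almost no gradient.

```python
import torch

# Illustrative check of the two Sigmoid intervals described above; the z values are arbitrary examples.
z = torch.tensor([-8.0, -4.0, -1.0, 0.0, 1.0, 4.0, 8.0])
s = torch.sigmoid(z)
grad = s * (1.0 - s)  # derivative of the Sigmoid at each pre-activation value
for zi, gi in zip(z.tolist(), grad.tolist()):
    print(f"z = {zi:+.1f}  sigmoid'(z) = {gi:.5f}")
# Pre-activations near zero fall in the effective gradient interval; large |z| falls in the weak one.
```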

To mitigate this phenomenon and achieve effective training, PHN uses a residual link in each substructure, enhancing the gradient accumulation of each parallel structure during back propagation. To further accommodate this phenomenon, we also add gating parameters to the residual links and apply batch normalization in different modes in the final linear layer. We discuss this further in Sect. 3.

3 Experiments

In this section, we evaluate PHN on two benchmark data sets. We aim to answer the following research questions:

  • RQ1: Does the parallel-structure-based PHN perform better than previous CTR prediction models on different classical datasets?

  • RQ2: Under what circumstances can Soft Selection Gating reasonably enhance the feature representation?

  • RQ3: Is the weak gradient phenomenon actually caused by the parallel structure, and is it effectively alleviated by residual connections or batch normalization?

3.1 Datasets Description

To evaluate the effectiveness of the proposed model, two benchmark offline datasets are selected for the experiments: Criteo and Avazu. Both datasets are divided into training, validation and testing sets at a ratio of 8:1:1. Detailed information on the two benchmark datasets is shown in Table 1.

Table 1. Statistics of the benchmark datasets

For the CTR prediction task, two classical evaluation metrics, Logloss and AUC, are used to verify the generalization and robustness of the proposed model.
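For reference, both metrics can be computed with scikit-learn as in the short example below; the labels and predicted probabilities are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

# Toy labels and predicted click probabilities (illustrative values, not experimental data)
y_true = np.array([1, 0, 0, 1, 0, 1])
y_pred = np.array([0.81, 0.22, 0.47, 0.65, 0.09, 0.72])

print("AUC:    ", roc_auc_score(y_true, y_pred))  # ranking quality of the predicted CTRs
print("Logloss:", log_loss(y_true, y_pred))       # calibration of the predicted CTRs
```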

3.2 Compared Models

To verify the effectiveness of the proposed PHN model, we compare it with linear models (LR, FM [10], FwFM [8], FmFM [13]), deep models (DNN, W&D [2], DeepFM [4], xDeepFM [7], AutoInt [12]), and interaction models (DCN [16], DCNV2 [17], FiBiNet [6], FINT [19]) on the CTR task. All models and experiments are implemented on the Huawei FuxiCTR deep learning framework [20].

3.3 Performance Comparison (RQ1)

Effectiveness of PHN. To verify the validity of the model, we follow the original structure of each comparison model, fix the number of interaction layers of all models to three, and record the testing results of each model on the two benchmark datasets. The specific experimental results are reported in Table 2.

Table 2. Experiment result of different CTR prediction models on Criteo and Avazu

The experiments show that PHN achieves good performance in both the validation and testing experiments on the two large benchmark datasets.

Grid Search. Figure 5 shows the grid search results for PHN. As the number of cross layers increases, the robustness of the model also increases, which may benefit from the improved ability of deeper cross layers to express higher-order crosses. However, a larger number of layers also increases the model complexity, which slows down training. The AUC and Logloss values shown in Fig. 5 tend to stabilize when the number of cross layers reaches five.

3.4 Selection Information (RQ2)

Data Skew Visualization. Based on the trained PHN structure, we visualize the tensor amplification ratio after the SSG output. As shown in Fig. 6, the features fed to the three cross layers share some similarities, such as the high proportion of field 13 and the low proportion of field 39. At the same time, the feature scaling of the three parts differs in places, as in field 8 and field 24.

Selection Pattern. The SSG module is designed to enhance the representation of the embedding feature by selecting between the raw embedding and the self-attention embedding. Depending on the design, the selection patterns of the self-attention layer and gating layer in this module can be classified as public or private. To further validate the effectiveness of SSG, we conduct a comparison experiment. Different subscripts represent different selection patterns: embed (public embedding feature), sa (public self-attention), sg (public soft gating); subscripts with the prefix "P" mean that the PHN contains private layers for each parallel layer. Table 3 shows the results of the comparison experiment.

The experiments show that a single self-attention layer (public or private) cannot replace the embedding feature representation, but it can help enhance the feature when combined with soft selection gating, and the AUC value of the model increases by 0.123% on average. From a theoretical point of view, a feature without a high activation value as a first-order feature cannot simply be discarded, because it may show a high activation value in high-order interactions with other features.

3.5 Weak Gradient Phenomenon (RQ3)

Efficiency Analysis. To reduce the impact of the weak gradient phenomenon, PHN attempts to enhance the gradient flow during training through residual links (RL) or batch normalization (BN), thereby reducing the training pressure on each parallel part. In the experiments, we introduce gating parameters for RL and discuss whether BN should be applied independently to different parallel modules in the last linear layer.

Fig. 5. The grid search performance of PHN with different interaction layers on the two benchmark datasets

Fig. 6. The heatmap of the input feature scaling ratios of the different parallel layers

Table 3. Experiment result of different schemes of information selection module on Criteo and Avazu

Different model subscripts represent different structures in PHN: base (basic model), rl (plain RL), Prl (RL with trainable parameters), bn (BN), and pbn (private BN). Table 4 shows the results of the comparison experiments.

Table 4. Experiment result of different solutions to the weak gradient problem on Criteo and Avazu

The experiments show that, during back propagation, the weak gradient problem can be alleviated by the gradient accumulation provided by the RL. With trainable parameters, the overall network can better fit the data flow in the forward and backward passes and strengthen the fitting effect of the different parallel structures. However, the subsequent addition of BN is not very effective. This may be due to the uneven distribution of data flows in different parallel structures: forcing them into a unified normalization weakens the representation of the data. It also suggests, to some extent, that the BN layer should not be separated from the linear layer. The last two groups of experiments also show that the specificity of the fitted data streams is enhanced when the RL strengthens the different parallel structures, while BN has certain side effects. Therefore, in PHN, RL and BN are best realized independently within each parallel structure.

Visualization of Activation Value. A more robust model should output confidence values closer to the label values. Figure 7 shows the confidence curves of PHN over 200 samples after training one epoch under different configurations, in order to show how different structures behave during the training phase.

Fig. 7. Experiments on the last activation value of PHN (Color figure online)

The red line in Fig. 7 represents the PHN model without RL and BN, which serves as the benchmark. Figure 7(a) shows that a single interaction module produces higher confidence on negative samples and lower confidence on positive samples than PHN, which means that PHN is superior to a partial cross structure in sample resolution. Figure 7(b) shows that applying the sigmoid after summing the activation values of the parallel models is more robust than PHN itself, which means the fitting of a single PHN model on the dataset is weakened by the weak gradient phenomenon. Figure 7(c) shows that RL can enhance the confidence of negative samples but also reduces the confidence of positive samples, while RL with trainable parameters effectively improves the performance of the model on the PHN infrastructure.

4 Conclusion

In this paper, we described the parallel structure of current mainstream CTR models and the weak gradient phenomenon in such parallel structures, and introduced a parallel-structure model named Parallel Heterogeneous Network (PHN) in response to these phenomena. The PHN model uses a Soft Selection Gating (SSG) structure to heterogenize the features, and uses a feed forward network, cross interaction layers and field interaction layers to build the subsequent parallel part. The performance experiments show that PHN achieves state-of-the-art results on two large benchmark datasets, and they also explore the number of interaction layers of the model. The comparison experiments show that SSG can effectively improve the representation based on the shared embedding, and that residual links with trainable parameters can improve the representation ability of the model while maintaining the robustness of the results. Based on the overall experimental results, this work brings us one step closer to determining the optimal structure of PHN.