1 Introduction

Online advertising systems have attracted considerable attention in industry and academia due to their high yield. A good online advertising system can generate tens of billions in revenue yearly at companies such as Google and Alibaba. Click-through rate (CTR) prediction, which is defined as predicting the probability that a user clicks on a target item, is an essential task in online advertising systems. Hence, CTR prediction has become a popular topic in the data mining and machine learning fields, and numerous methods have been developed for it.

One popular approach for CTR prediction is to learn feature combinations. Factorization Machine (FM) [1], one of the most popular models, learns an embedding for each feature and predicts CTR by a linear combination of first- and second-order feature interactions. In recent years, deep learning has shown a powerful capability to learn high-order feature interactions, and deep methods for CTR prediction, such as Wide&Deep [2] and DeepFM [3], have been widely developed. Wide&Deep jointly trains logistic regression and deep neural networks to combine the strengths of memorization and generalization for prediction. DeepFM replaces the linear model in Wide&Deep with FM to learn second-order feature interactions, which also simplifies feature engineering. However, even though rich information can be learned through informative feature interactions, these methods focus only on user preference, because features usually carry prior information about whether a user is interested in a target item. The historical behaviors of users, which reflect not only user preferences but also their requirements [4], are not fully explored in these methods. Hence, behaviors are an important resource for improving CTR prediction.

With the increasing availability and quality of user behavior data, exploiting these data for prediction becomes increasingly necessary [5, 6]. Collaborative filtering (CF) approaches fully explore the information in behaviors by assuming that users with similar behaviors share similar preferences on items. Most CF models learn the embeddings of users and items from historical interactions. One prominent model is Matrix Factorization (MF), which models the user-item interaction as the inner product between the user and item embeddings, each represented as a dense vector. Many CTR prediction methods have been developed on the basis of MF. In [7], CTR prediction is regarded as a special matrix completion problem: the embeddings of users and items are learned with a logistic loss for binary data instead of the square loss in MF, and the probability is predicted via the inner product of the embeddings. The method in [8] proposed to use an attention mechanism to learn user embeddings, which are then incorporated into MF to capture the relevance between the user and the target item.

Although these methods have been successfully applied in advertising systems, they mainly attempt to predict items that match the historical preferences of users, while user requirements are learned ineffectively. Hence, discovering the logical behaviors within sequential behaviors is necessary for inferring user intention. Figure 1 illustrates such logical behaviors: when user A has bought a MacBook online, instead of browsing another laptop, he/she may prefer to click accessories such as AirPods. This scenario indicates that the current decision of a user does not heavily depend on its similarity to his/her previous behaviors. Meanwhile, the collaborative information is not fully explored in CF methods due to the lack of explicit connections, which may result in the matching of irrelevant items. For a good recommendation, the item recommended to a user should generally satisfy his/her preferences and requirements simultaneously. Hence, the collaborative capability via user-item interaction should also be further improved.

A graph-aware collaborative reasoning method for CTR prediction, termed GACR, is proposed in this paper to address the above challenges. This method learns deep collaborative embeddings based on the user-item interaction graph and performs logical reasoning by integrating logical operators into neural networks on top of the user and item embeddings. In the proposed method, the collaborative information is explicitly captured by message passing and message propagation between the nodes in the graph. The information is learned recursively along the paths in the graph to guarantee that deep collaborative information is endowed to the embeddings of users and items. The interaction information between users and items is then learned from the embeddings with an interaction layer. This interaction information is transformed into logical representations based on logic laws to capture the logical behaviors in the behavior sequence, and a logical neural network learns the logical behaviors from these representations. Therefore, the proposed method learns the collaborative information and the logical behaviors simultaneously in an end-to-end manner, and the preferences and requirements of users can be captured effectively for CTR prediction.

The main contributions of the paper can be summarized as follows:

  • The proposed GACR is the first work that attempts to learn the collaborative information and logical reasoning jointly in one network architecture for cognitive learning in CTR prediction.

  • This paper highlights that the preferences and the requirements are both essential for CTR prediction, and the proposed method can capture both of them effectively.

  • The results of extensive experiments on several real-world datasets demonstrate that our GACR outperforms several state-of-the-art models.

The rest of this paper is organized as follows. Section 2 reviews recent studies related to the current work. Section 3 introduces the methodology of the proposed method in detail. Extensive experiments are reported in Section 4, and Section 5 concludes the paper.

Figure 1: The illustration of logical behaviors. User A has bought a MacBook, an iPad and AirPods in his/her behavior sequence, while users B, C, D, E and F have each bought at least one of these three items. Many items that user A may need can be found from users B, C, D, E and F. To infer which one is the requirement of A, the items that user A has bought are connected with the logical operation AND (∧). With logical inference, the prediction of whether user A will click an item (such as a mouse) can be learned.

2 Related work

This section reviews related work on both graph-based recommendation and CTR prediction.

2.1 Graph-based recommendation

Graph is a data structure with rich information and has been applied in multiple areas [9]. ItemRank [10] designs a random-walk algorithm based on user-item interactions to extract user preferences from the user-item interaction graph. However, the algorithm is a model-based collaborative filtering (CF) method and lacks optimization capability. HOP-Rec [11] presents a unified and efficient framework that incorporates graph-based and embedding-based methods. RecWalk [12], which is also based on random walks, utilizes the spectral properties of Markov chains to explore the interaction graph effectively. Graph neural networks have proven efficient in various tasks [13]. On the basis of paths in the graph, RKGE [14] designs a recurrent network architecture to capture user preferences and learns additional informative embeddings for recommendation. MCRec [15] defines meta-paths and leverages a co-attention mechanism to improve the representations of both entities and meta-paths. Despite their promising performance, these path-based models are limited by their sensitivity to path selection. The heterogeneous information network (HIN) has recently become a hot topic in graph-based recommendation. NIRec [16] emphasizes the interactions in an HIN and proposes a meta-path-based module to learn interactive patterns. Since finding appropriate paths requires considerable experience and time, NGCF [17] focuses on the collaborative signal in the graph structure and utilizes a propagation layer to extract it, alleviating the aforementioned problems. Similarly, KGAT [18] combines recursive embedding propagation with an attention mechanism to extract high-order connectivity. In this paper, an innovative model equipped with embedding propagation and logical reasoning is introduced to learn the comprehensive interests of users.

2.2 CTR prediction

CTR prediction is a core technology in recommendation, search and advertising systems. Historically, logistic regression (LR) was widespread in CTR prediction. However, using LR to predict CTR requires a huge workload of manual feature engineering. Facebook [19] proposed combining gradient boosted decision trees (GBDT) with LR to explore feature combinations automatically and reduce the human cost of feature engineering. Despite its strength in automatic feature engineering, this approach did not consider the many high-dimensional sparse features in CTR prediction: decision trees struggle to handle highly sparse data and online scenarios.

FM [1] uses the inner product between latent vectors to learn the weight of each second-order feature combination and produces predictions via a linear aggregation of all first- and second-order features. However, some feature combinations have minimal influence on the target task. AFM [20] introduced an attention network to learn the influence of each second-order feature combination and enhance the expressiveness of the model. Similarly, FFM [21] introduced the concept of fields to discriminate the different importance of diverse feature crosses.

FNN [22] was proposed on the basis of FM. FNN adopts FM to pre-train the embeddings and then feeds them into a multilayer perceptron (MLP) to learn high-order feature correlations. This two-phase structure can be summarized as the Embedding&MLP paradigm. PNN [23] introduced a product layer to extract additional complex feature interactions. Instead, NFM [24] proposed a bi-interaction pooling layer to learn the feature interactions and enhance the expressiveness of the model.

Many recent studies [2, 3, 25] have shown that deep learning based CTR prediction models achieve remarkable effectiveness. Wide&Deep Learning (WDL) [2] was first proposed by Google and deployed in the Google Play application. WDL comprises a generalized linear module (the wide part) and an MLP (the deep part), combining the memorization of the wide part with the generalization of the deep part to model user behavior. However, this approach still requires manually designed feature crosses. DeepFM [3] introduced a deep neural network (DNN) on top of the FM model to improve its capability of information extraction, and Deep Crossing [26] applies a deep residual network to learn cross features. Since a DNN can only conduct implicit feature crosses, Deep&Cross [27] designed a Cross Net to replace FM so that the model can also learn bounded-degree feature crosses explicitly.

The attention mechanism learns a function that assigns large weights to closely correlated items. It was originally proposed for neural machine translation (NMT) [28] but has been widely used in diverse domains. For CTR prediction, DeepIntent [29] applies attention to the context, utilizes an RNN to model text, and then learns a global hidden vector to allocate the weights of the keys in each query. DIN [30] uses the attention mechanism to learn the representation of the historical behaviors of users. DIEN [31] adopts an interest extractor layer and an interest evolving layer to learn the representation of user behaviors and to capture the dynamic changes in user interests, respectively. DSIN [32] leverages multiple historical sessions in user behavior sequences with the attention mechanism to extract an accurate interest representation of each session. In DMR [8], the model calculates item-to-item similarities between user behaviors and the target item using the attention mechanism.

Figure 2: Illustration of the GACR architecture. GACR takes the embeddings of a user, his/her historical items and the target item as inputs, and outputs the probability that the user will click the target item.

3 Methodology

The proposed GACR model is comprehensively introduced in this section. First, an embedding layer initializes the embeddings of users and items with the lookup-table technique. A graph layer then propagates the collaborative information over the interaction graph to enrich the embeddings of users and items. Following the graph layer, an interaction layer learns the interaction information between a user and an item from their embeddings. A logical reasoning layer transforms the interaction information into logical representations to learn the logical behaviors in the sequential behaviors. Finally, a prediction layer predicts the probability that the user will click the target item. The architecture of GACR is shown in Figure 2.

3.1 Embedding layer

One-hot encoding is a popular technique for generating feature representations for discrete data and has been widely used in recommendation and CTR prediction. To learn the embeddings of users and items, the user ID, item ID and context features are usually transformed into sparse vectors with one-hot encoding, and each sparse vector is then mapped to a low-dimensional dense vector via the lookup-table technique. Without loss of generality, the IDs of users and items are transformed into dense vectors in this way, and these vectors are treated as the initialized embeddings of users and items.

Mathematically, \({e_{u}^{0}} \in \mathbb {R}^{d}\) and \({e_{i}^{0}} \in \mathbb {R}^{d}\) are defined as the initialized embeddings of user u and item i respectively, where d is the embedding dimension. All the embeddings of users, items and context features are organized as the embedding lookup tables \(\mathbf {E}_{\mathbf {u}} \in \mathbb {R}^{n_{u} \times d}\), \(\mathbf {E}_{\mathbf {i}}\in \mathbb {R}^{n_{i} \times d}\) and \(\mathbf {E}_{\mathbf {f}} \in \mathbb {R}^{n_{f} \times d}\), where nu, ni and nf denote the number of users, items and context features respectively. Thus, the embedding of each user, item and feature can be represented as follows:

$$\begin{array}{@{}rcl@{}} {e_{u}^{0}} &=& t_{u} \times \mathbf{E_{u}}\\ {e_{i}^{0}} &=& t_{i} \times \mathbf{E_{i}}\\ e_{f} &=& t_{f} \times \mathbf{E_{f}}\\ C_{f} &=& concat(e_{f}), \quad \forall f \in F \end{array}$$
(1)

where \(t_{u} \in \mathbb {R}^{n_{u}}\) and \(t_{i} \in \mathbb {R}^{n_{i}}\) represent the one-hot encodings of the user and the item respectively, \(t_{f} \in \mathbb {R}^{n_{f}}\) represents the one-hot encoding of one field feature, concat denotes the concatenation operation, and Cf is the concatenated vector of the context feature set F. The user-item interaction graph is constructed on the basis of these initialized embeddings to endow them with collaborative information.
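For concreteness, the lookup in (1) can be sketched as follows in PyTorch, where nn.Embedding implicitly performs the multiplication by a one-hot vector; the class and argument names are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    """Lookup-table embeddings for users, items and context features, as in (1)."""
    def __init__(self, n_users, n_items, n_feats, d):
        super().__init__()
        # nn.Embedding(t) is equivalent to t x E with t the one-hot encoding
        self.user_emb = nn.Embedding(n_users, d)
        self.item_emb = nn.Embedding(n_items, d)
        self.feat_emb = nn.Embedding(n_feats, d)

    def forward(self, user_ids, item_ids, feat_ids):
        e_u = self.user_emb(user_ids)    # (B, d): e_u^0
        e_i = self.item_emb(item_ids)    # (B, d): e_i^0
        e_f = self.feat_emb(feat_ids)    # (B, |F|, d): one e_f per field
        c_f = e_f.flatten(start_dim=1)   # C_f = concat(e_f) over fields: (B, |F|*d)
        return e_u, e_i, c_f
```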

Figure 3: Overview of the graph layer. (a) is a sample user-item interaction graph and (b) is the illustration of message propagation based on (a).

3.2 Graph layer

Following [17], the collaborative information between users and items is explicitly explored with the graph. How the graph layer endows the embeddings of users and items with this collaborative information is described below. The layer mainly consists of message passing and message propagation.

3.2.1 Message passing

Message passing is achieved via the paths in the graph. For example, in the user-item interaction graph shown in Figure 1, if user A has purchased item a and this item has also been purchased by user B, then a path A → a → B can be obtained. This indicates a potential similarity between A and B. If B has also purchased c, then the path A → a → B → c implies that A may be interested in c. The message passing between one user-item pair can then be represented as:

$$m_{h \rightarrow t} = \frac{1}{\sqrt{D_{h} D_{t}}} (W_{s} e_{h} + b_{s} + W_{i} (e_{h} \odot e_{t}) + b_{i})$$
(2)

where h and t in (2) represent the head and tail nodes of an interaction path respectively. \(m_{h \rightarrow t} \in \mathbb {R}^{d}\) represents the message passed from the head node to the tail node. Dh and Dt are the degrees of the head and tail nodes respectively, where the degree of a node is the number of nodes connected to it in the graph. \(W_{s}, W_{i} \in \mathbb {R}^{d \times d}\) and \(b_{s}, b_{i} \in \mathbb {R}^{d}\) are trainable parameters for core message extraction. Notably, active users and popular items usually connect with an excessive number of nodes and may propagate numerous messages to their neighbors, while the information from cold users or items may be missed, leading to biased learning. The factor \(1/ {\sqrt {D_{h} D_{t}}}\) in (2) serves as a decay factor to alleviate this problem. The element-wise product ⊙ in (2) makes the message passing depend on the affinity between nodes, following [17].
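A minimal PyTorch sketch of the message in (2) might look as follows; the two nn.Linear modules pack the parameter pairs (W_s, b_s) and (W_i, b_i), and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """One message m_{h->t} along an edge of the interaction graph, as in (2)."""
    def __init__(self, d):
        super().__init__()
        self.W_s = nn.Linear(d, d)  # W_s e_h + b_s
        self.W_i = nn.Linear(d, d)  # W_i (e_h * e_t) + b_i

    def forward(self, e_h, e_t, deg_h, deg_t):
        # e_h, e_t: (B, d) node embeddings; deg_h, deg_t: (B,) float node degrees
        decay = (deg_h * deg_t).rsqrt().unsqueeze(-1)  # 1/sqrt(D_h * D_t): (B, 1)
        affinity = self.W_i(e_h * e_t)                 # affinity-dependent term
        return decay * (self.W_s(e_h) + affinity)      # m_{h->t}: (B, d)
```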

3.2.2 Message propagation

A node t in the graph can receive various messages from its neighbor nodes. These messages are integrated recursively for collaborative information learning, and the embedding of node t at each iteration can be calculated as:

$$\begin{array}{@{}rcl@{}} {e^{l}_{t}} &=& \sigma_{1} ({W_{s}^{l}} e_{t}^{l-1} + {b_{s}^{l}} + \sum\limits_{h \in N_{t}} m_{h \rightarrow t}^{l})\\ m_{h \rightarrow t}^{l} &=& \frac{1}{\sqrt{D_{h} D_{t}}} ({W_{s}^{l}} e_{h}^{l-1} + {b_{s}^{l}} + {W_{i}^{l}} (e_{h}^{l-1} \odot e_{t}^{l-1}) + {b_{i}^{l}}) \end{array}$$
(3)

where \({e^{l}_{t}}\) represents the embedding of node t at the l-th iteration, and Nt represents the set of nodes connected to t in the graph. \({W_{s}^{l}}, {W_{i}^{l}} \in \mathbb {R}^{d \times d}\) and \({b_{s}^{l}}, {b_{i}^{l}} \in \mathbb {R}^{d}\) are trainable parameters at the l-th iteration. LeakyReLU [33] is chosen as the activation function σ1 in (3).

A simple example is shown in Figure 3 to illustrate the graph layer. Figure 3(a) shows a user-item interaction graph, and Figure 3(b) shows the message propagation based on this graph. Figure 3(a) contains, for instance, the path \(b \rightarrow B \rightarrow a \rightarrow A\). The collaborative information in this path is learned with three iterations in Figure 3(b) to obtain the embedding of user A, that is, \({e_{A}^{3}}\). In the recursive learning procedure, \({e_{A}^{3}}\) mainly depends on \({e}_{a}^{2}\) and \({e}_{c}^{2}\) from the second iteration, while \({e}_{a}^{2}\) contains the messages from \({e}_{B}^{1}\) and \({e}_{b}^{0}\).

With the message propagation in the graph, a sequence of embeddings from e0 to el can be obtained for each user and item. The embeddings at different iterations may contain collaborative information of different levels. Therefore, the final embedding of each node is calculated as:

$$e = concat(e^{0}, e^{1}, ... , e^{l})$$
(4)
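Putting (3) and (4) together, the whole graph layer can be sketched as below. This hypothetical version uses a dense normalized adjacency matrix, folds the decay factors of (2) into it, and approximates the bias handling inside the neighbor sum; a production implementation would use sparse operations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphLayer(nn.Module):
    """l rounds of message propagation, (3), then the concatenation (4)."""
    def __init__(self, d, n_layers=4):
        super().__init__()
        self.W_s = nn.ModuleList([nn.Linear(d, d) for _ in range(n_layers)])
        self.W_i = nn.ModuleList([nn.Linear(d, d) for _ in range(n_layers)])

    def forward(self, e0, adj, deg):
        # e0: (N, d) initial embeddings of all nodes;
        # adj: (N, N) symmetric user-item adjacency; deg: (N,) float degrees
        d_inv = deg.clamp(min=1.0).rsqrt()
        norm = adj * d_inv.unsqueeze(1) * d_inv.unsqueeze(0)  # decay of (2)
        embs, e = [e0], e0
        for W_s, W_i in zip(self.W_s, self.W_i):
            msgs = norm @ W_s(e) + W_i((norm @ e) * e)  # sum of m_{h->t}, (3)
            e = F.leaky_relu(W_s(e) + msgs)             # self term + messages
            embs.append(e)
        return torch.cat(embs, dim=1)                   # e = concat(e^0..e^l), (4)
```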

3.3 Interaction layer

The interaction layer is adopted on top of the collaborative embeddings of users and items to learn the interaction information between them. Specifically, suppose a user u has interacted with items i1,i2,...,in,itar, where n is the number of items that the user has clicked and itar is the latest item that the user has clicked. An MLP is used to obtain the interaction embedding between user u and each item that the user has clicked. The concatenation of the user embedding and the corresponding item embedding is learned as follows:

$$\begin{array}{@{}rcl@{}} {I^{u}_{i}} &=& concat(e_{u}, e_{i})\\ {E^{u}_{i}} &=& W_{2}\sigma_{2}(W_{1} {I^{u}_{i}} + b_{1}) + b_{2} \end{array}$$
(5)

where eu and ei are the embeddings of user u and item i respectively, and \({I^{u}_{i}}\) represents the interaction information between user u and item i. W1, b1, W2 and b2 are trainable parameters of the MLP, and ReLU is used as its activation function σ2.
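A sketch of the interaction layer in (5); the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class InteractionLayer(nn.Module):
    """Interaction embedding E_i^u for one user-item pair, following (5)."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_in, d_hidden),  # W_1, b_1
            nn.ReLU(),                      # sigma_2
            nn.Linear(d_hidden, d_out),     # W_2, b_2
        )

    def forward(self, e_u, e_i):
        I = torch.cat([e_u, e_i], dim=-1)   # I_i^u = concat(e_u, e_i)
        return self.mlp(I)                  # E_i^u
```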

3.4 Logical reasoning layer

Behaviors imply the potential and dynamic preferences of a user and can be treated as an important feature describing the user. On the basis of this feature, logical learning is used to learn the user requirements. A logical conditional statement can be transformed into a logical representation with only disjunction ∨ and negation ¬, that is, \(a \rightarrow b \Longleftrightarrow \neg a \vee b\). The prediction problem can thus be transformed from \((a \wedge b \wedge c)\rightarrow t\) into ¬a ∨ ¬b ∨ ¬c ∨ t.

Specifically, the logical reasoning layer comprises a recurrent neural module, based on the ideas of distributed representation and neural symbolic frameworks [34, 35], to learn the logical operation "OR", and a module to learn the logical operation "NOT". With the interaction information learned from the interaction layer, the logical representation for inference can be easily defined. For example, if user u has bought items i1,i2,...,in, then the logical representation for predicting whether u will click the item itar can be represented as:

$$\neg {E^{u}_{1}} \vee \neg {E^{u}_{2}} \vee {\cdots} \vee \neg {E^{u}_{n}} \vee E^{u}_{tar}$$
(6)

The operation “NOT” is represented as a two-layer MLP, and can be formulated as follows:

$$\begin{array}{@{}rcl@{}} \neg {E^{u}_{i}} &=& NOT({E^{u}_{i}})\\ NOT(x) &=& {W_{2}^{n}} \mathbf{ReLU}({W_{1}^{n}} x + {b_{1}^{n}}) + {b_{2}^{n}} \end{array}$$
(7)

where \({W_{1}^{n}}, {W_{2}^{n}}, {b_{1}^{n}}, {b_{2}^{n}}\) are trainable parameters of the NOT module, which simulates the "NOT" operation with an MLP. Therefore, the negated interaction embedding \(\neg {E^{u}_{i}}\) can be easily obtained. Unlike "NOT" in (7), the logical operation ∨ is binary and requires two inputs simultaneously, so the OR network is designed as a recurrent MLP network. Mathematically, this network is defined similarly to a Recurrent Neural Network (RNN):

$$\begin{array}{@{}rcl@{}} x_{t} &=& {E^{u}_{t}}\\ h_{t} &=& {W_{2}^{o}} \mathbf{ReLU}({W_{1}^{o}}(h_{t-1} + \neg x_{t}) + {b_{1}^{o}}) + {b_{2}^{o}} \end{array}$$
(8)

where \({W_{1}^{o}}, {W_{2}^{o}}, {b_{1}^{o}}, {b_{2}^{o}}\) are trainable parameters and ht represents the temporary state after t behaviors of a user. The initial hidden state h1 is set as \(\neg {E^{u}_{1}}\), and the negation of each interaction embedding is fed into the OR module recursively until the entire OR operation is learned. Thus, \(\neg {E^{u}_{1}} \vee \neg {E^{u}_{2}} \vee {\cdots } \vee \neg {E^{u}_{n}}\) is finally transformed into hn. The embedding of the logical behaviors in the behavior sequence can then be calculated by:

$$\mathbf{O} = {W_{2}^{o}} \mathbf{ReLU}({W_{1}^{o}}(h_{n} + E^{u}_{tar}) + {b_{1}^{o}}) + {b_{2}^{o}}$$
(9)

Notably, the input here is the embedding of the latest interaction \(E^{u}_{tar}\) instead of its negation \(\neg E^{u}_{tar}\), and O represents the embedding of the logical behaviors. Similar to [35], an extra regularizer is adopted so that the modules learn the corresponding logical functions.
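The reasoning over (6)-(9) can be sketched as the following recurrent module, which negates every historical interaction embedding, folds them into the hidden state with the OR network, and finally feeds in the un-negated target embedding; batching and masking of variable-length histories are omitted for brevity.

```python
import torch
import torch.nn as nn

def make_mlp(d):
    # Two-layer MLP shape shared by the NOT and OR modules, as in (7)-(8)
    return nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

class LogicalReasoningLayer(nn.Module):
    """Evaluates ¬E_1 ∨ ¬E_2 ∨ ... ∨ ¬E_n ∨ E_tar with neural modules."""
    def __init__(self, d):
        super().__init__()
        self.NOT = make_mlp(d)  # (7)
        self.OR = make_mlp(d)   # applied to h_{t-1} + input, (8)

    def forward(self, E_hist, E_tar):
        # E_hist: (n, d) interaction embeddings E_1^u..E_n^u; E_tar: (d,)
        h = self.NOT(E_hist[0])               # h_1 = ¬E_1^u
        for E_t in E_hist[1:]:
            h = self.OR(h + self.NOT(E_t))    # h_t = OR(h_{t-1}, ¬E_t^u)
        return self.OR(h + E_tar)             # O, (9): target is not negated
```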

3.5 Prediction layer

A considerable amount of collaborative information is endowed to the embeddings of users and items by the graph layer, and the interaction layer learns the interaction information effectively from these collaborative embeddings. Since the goal is to calculate the occurrence probability of the behavior, a constant vector \(\mathbf {T} \in \mathbb {R}^{d}\) is defined to represent the "True behavior", which indicates that the behavior occurs. The similarity between O and T directly implies the predicted probability, and cosine similarity is chosen in this paper to measure the proximity between the decision vector and the True vector. Additionally, since contextual information gives models stronger expressiveness [36], the embedding of context features Cf is incorporated and combined with the cosine similarity for prediction via an MLP. The cosine similarity between the two vectors and the prediction of CTR are calculated as:

$$\begin{array}{@{}rcl@{}} sim(\mathbf{O},\mathbf{T}) &=& \frac{\mathbf{O}^{\top}\mathbf{T}}{\left\|\mathbf{O}\right\|\left\|\mathbf{T}\right\|}\\ p &=& MLP(concat(sim(\mathbf{O}, \mathbf{T}), \mathbf{O}, \mathbf{T}, C_{f})) \end{array}$$
(10)

Notably, p is the predicted probability that the user clicks the target item, and MLP denotes a two-layer fully connected network with ReLU activation.
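A sketch of the prediction layer in (10). The "True behavior" vector T is registered as a fixed buffer, and a final sigmoid is added to map the MLP output to a probability; the sigmoid and the hidden size are assumptions, since (10) only writes p = MLP(·).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictionLayer(nn.Module):
    """Cosine similarity between O and T, combined with context features, (10)."""
    def __init__(self, d, d_ctx, d_hidden=64):
        super().__init__()
        # Constant "True behavior" anchor vector T (kept fixed, not trained)
        self.register_buffer("T", torch.randn(d))
        self.mlp = nn.Sequential(
            nn.Linear(1 + 2 * d + d_ctx, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, O, C_f):
        # O: (B, d) logical-behavior embedding; C_f: (B, d_ctx) context features
        T = self.T.expand_as(O)
        sim = F.cosine_similarity(O, T, dim=-1).unsqueeze(-1)  # sim(O, T)
        x = torch.cat([sim, O, T, C_f], dim=-1)                # concat(sim, O, T, C_f)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)          # p
```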

Table 1 Correspondence between logical laws and expressions applied in loss function

3.6 Learning algorithm

The loss function has two components. The first is the negative log-likelihood, which is widely applied in CTR models [30, 32]. It considers both the label of a sample and the corresponding prediction score: an ideal model should assign higher scores to positive samples and lower scores to negative samples. This loss is presented as:

$$L_{log} = -\frac{1}{N} \sum\limits_{(x,y)\in\mathbb{D}} (y \log p + (1-y)\log(1-p)).$$
(11)

where \(\mathbb {D}\) is the training dataset, N is its size, and x, y and p are the inputs, labels and predicted probabilities respectively.

Second, logical regularizers are introduced following [35] to guarantee that the NOT and OR modules behave according to propositional logic. The motivation is that modules with the capability of logical operation should satisfy the basic logical laws. Thus, the interactions are transformed into logical representations and the corresponding regularizer terms are added to the loss function. The logical laws and their corresponding logical expressions are listed in Table 1.

The overall loss combines the log-likelihood term and the logical regularizers, and can be represented as:

$${ L = L_{log} + \frac{\lambda_{1}}{\| E^{*}\|} {\sum\limits_{i=1}^{6}l_{i} + \lambda_{2}\left\|{\Theta}\right\|^{2}_{2}} }$$
(12)

where ∥E∗∥ represents the total number of interactions, and λ1 and λ2 are the penalty coefficients of the logical regularizers and the Frobenius norm respectively.
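The combined objective (12) can be sketched as follows, where logic_terms is assumed to hold the six regularizer values l_1,...,l_6 of Table 1 already evaluated on the batch's interaction embeddings; this is a sketch of the objective, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def total_loss(p, y, logic_terms, params, lam1, lam2, n_interactions):
    """Objective (12): log-loss (11) + logical regularizers + L2 penalty."""
    L_log = F.binary_cross_entropy(p, y)                  # (11)
    L_logic = lam1 / n_interactions * sum(logic_terms)    # sum of l_1..l_6, Table 1
    L_l2 = lam2 * sum(w.pow(2).sum() for w in params)     # squared Frobenius norm
    return L_log + L_logic + L_l2
```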

4 Experiments

Several popular real-world datasets are used in the experiments to demonstrate the effectiveness of the proposed method, which is compared against several state-of-the-art and baseline methods.

Table 2 Statistics of five Amazon datasets

4.1 Datasets

Five large-scale datasets collected by Amazon from their websites [37] are adopted, including Video Games, Digital Music, Cell Phones and Accessories, Toys and Games, and CDs and Vinyl. These datasets are all from the Amazon 5-core collection.

The rating scores in these datasets come from the candidate set {1, 2, 3, 4, 5}. To apply the data to the CTR prediction task, they are transformed into binary classification data by labeling samples with ratings of 4 and 5 as positive and the rest as negative; since CTR prediction is a binary problem indicating whether the user clicks the item, positive samples are labeled as 1. Meanwhile, each dataset is divided into training and testing data based on the timestamps. For each user, the latest behavior in his/her sequential behaviors is used for testing, and the latest behavior in the training sequence is used as the target item during training. For example, if the behaviors of a user are represented as B = [b1,b2,...,bn], then the latest behavior bn is used for testing while bn−1 is used as the target to train the model. A total of 10% of the users in the training set, together with their historical behaviors, are randomly selected as the validation set.
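The leave-one-out protocol above amounts to the following split of each user's time-ordered behavior list; the function name is illustrative.

```python
def split_by_time(behaviors):
    """Leave-one-out split of one user's time-ordered behaviors [b_1..b_n]:
    b_n is held out for testing and b_{n-1} is the training target."""
    assert len(behaviors) >= 3, "need at least one history item per target"
    train_hist, train_target = behaviors[:-2], behaviors[-2]
    test_hist, test_target = behaviors[:-1], behaviors[-1]
    return (train_hist, train_target), (test_hist, test_target)
```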

4.2 Experimental settings

The proposed GACR is compared with several popular CTR models, including graph-based CF models and logical reasoning models.

  • Wide&Deep [2]: The Wide & Deep Learning framework is widely used in modern industrial applications. It combines two parts: the wide part uses a linear model with the capability of memorization, and the deep part extracts non-linear correlations among features via deep neural networks.

  • DeepFM [3]: DeepFM is derived from Wide & Deep and utilizes an FM to replace the linear model for learning second-order feature interactions.

  • xDeepFM [25]: A combination of a deep neural network and a novel compressed interaction module that extracts additional implicit information behind features for CTR prediction.

  • NGCF [17]: A graph-based collaborative filtering model that exploits the graph structure by propagating embeddings and generates expressive representations with high-order connectivity.

  • NLR [35]: A cognition-focused recommendation model that integrates embedding learning and logical reasoning.

  • DIFM [38]: An FM-based model that combines self-attention and a deep neural network to learn both bit-wise and vector-wise feature interactions simultaneously.

The learning rate is set to 0.001, and Adam, which is widely used in deep learning models, is chosen as the optimizer. The batch size for training is 1024. For the proposed method, the weight of the logical regularizer λ1 is set to 0.4 (except 0.1 on Video Games), the penalty coefficient of the Frobenius norm regularizer λ2 is set to 1 × 10−5, and the embedding size is set to 16 (except 8 on Digital Music and 32 on Toys and Games). Moreover, the number of iterations in the graph layer is set to 4.

These models are assessed with two popular metrics, namely the Area Under the ROC Curve (AUC) and logloss, to show the superiority of the proposed method. AUC, which reflects ranking capability, is suitable for imbalanced data [39]; a high AUC indicates that more positive samples are ranked before negative samples. Logloss corresponds to the log-likelihood in (11), and AUC is defined as follows:

$$AUC=\frac{\sum I(f(x^{+}),f(x^{-}))}{M\times N}$$
(13)

where M and N indicate the total numbers of positive samples x+ and negative samples x− respectively, and f(x) is the prediction score of x. I(x,y) equals 0, 0.5 and 1 for x < y, x = y and x > y respectively.

In addition, the RelaImpr metric [40] is introduced to measure the relative improvement of the proposed model. RelaImpr is defined as follows:

$$RelaImpr = \left( \frac{AUC_{t} - 0.5}{AUC_{c} - 0.5} - 1\right) \times 100\%$$
(14)

where AUCt and AUCc represent the AUC achieved by the proposed model and the best result among the compared methods respectively.
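Both metrics can be computed directly from their definitions; the following sketch implements (13) with an O(M×N) pairwise comparison (acceptable for evaluation-sized data) and (14) on top of two AUC values.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """Pairwise AUC of (13): fraction of (positive, negative) score pairs
    ranked correctly, with ties counted as 0.5."""
    pos = np.asarray(scores_pos)[:, None]    # (M, 1)
    neg = np.asarray(scores_neg)[None, :]    # (1, N)
    I = (pos > neg) + 0.5 * (pos == neg)     # I(f(x+), f(x-)) over all pairs
    return I.mean()                          # sum / (M * N)

def rela_impr(auc_target, auc_compared):
    """RelaImpr of (14): improvement relative to a random model (AUC = 0.5)."""
    return ((auc_target - 0.5) / (auc_compared - 0.5) - 1.0) * 100.0
```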

Table 3 The AUC results of the comparison methods on five Amazon datasets

4.3 Results

The results of the methods are reported in Table 3; all results are averaged over five runs. Among the compared methods, WDL, DeepFM and xDeepFM attempt to extract high-order feature interactions and achieve the second-best AUC on most datasets, which shows that learning interaction information from embeddings is crucial.

NGCF and NLR are two state-of-the-art methods for graph learning and logical reasoning in recommendation respectively. Notably, their AUC performance is worse than that of the other feature-based models. This may be due to the excessive number of similar items, which makes it difficult for NGCF and NLR to identify the specific item. DIFM is a state-of-the-art method designed on the FM architecture; as shown in Table 3, it achieves worse performance in most cases. A possible explanation is that although DIFM utilizes a complex network to learn the weights of feature interactions, it needs more feature-interaction data for training.

Meanwhile, the proposed method achieves the best AUC on all datasets. In particular, the proposed model outperforms the second-best results, with RelaImpr reaching 11.05% and 6.2% on Digital Music and CDs and Vinyl respectively. This demonstrates that the proposed method can effectively capture the preferences and requirements of users and recommend more specific items. Hence, the combination of collaborative information and logical behaviors is crucial to designing ideal advertising systems, and the proposed method achieves this goal effectively.

Figure 4: Effect of the embedding dimension on (left) Video Games; (middle) Digital Music; (right) Cell Phones.

Figure 5: Effect of the number of iterations in the graph layer on (left) Video Games; (middle) Digital Music; (right) Cell Phones.

Figure 6: Effect of the weight λ1 of the logical regularizer on (left) Video Games; (middle) Digital Music; (right) Cell Phones.

4.4 Parameters analysis

The effects of the following parameters in GACR are studied in this section: (1) the dimension of the embedding; (2) the number of propagation iterations in the graph layer; (3) the weight of the logical regularizer λ1 in (12).

4.4.1 Effect of embedding dimension

The embedding dimension is varied over the candidate set {4, 8, 16, 32, 64} to study its influence, and the results on Video Games, Digital Music and Cell Phones are reported in Figure 4, where the red lines represent the AUC values and the blue lines represent the logloss results. The results indicate that the proposed method is sensitive to the embedding dimension: the performance decreases when the dimension exceeds 16 on Video Games, 8 on Digital Music and 32 on Cell Phones. On Video Games and Digital Music, the optimal dimension of 16 or 8 is small compared with former work. This is due to the message propagation mechanism in the graph layer, which learns collaborative information at different levels in the form of embedding vectors and concatenates them together according to (4); thus a small embedding dimension can still yield expressive embedding vectors. As the embedding dimension increases, overfitting leads to worse performance. Notably, on the Cell Phones dataset, which has the sparsest interactions according to Table 2, less collaborative information is available and the optimal dimension thus increases to 32.

4.4.2 Effect of propagation iterations l in the graph layer

The collaborative information in the graph layer is learned by message propagation on the graph, which proceeds recursively. Hence, the number of propagation iterations l is an important parameter of the proposed method. The results on Video Games, Digital Music and Cell Phones, obtained by varying l from 1 to 5, are presented in Figure 5. The figure shows that the results are sensitive to l: AUC and logloss improve as l increases, but the results degrade quickly on all datasets when l exceeds 4. This degradation may result from overfitting caused by too many iterations. The best performance is achieved on all datasets when l equals 4; hence, l can be set to 4 in real applications.

4.4.3 Effect of λ 1

The performance with different weights on three datasets, namely Video Games, Digital Music and Cell Phones, is presented to evaluate the importance of logical reasoning, and the results are reported in Figure 6. The figure reveals that the results are sensitive to the weight. The best performance is achieved on Video Games when λ1 equals 0.1, while the other two datasets achieve their best performance when λ1 equals 0.4. Compared with the setting without logical reasoning, the improvements are significant; hence, logical reasoning is essential for CTR prediction. Meanwhile, although the results begin to decline when λ1 exceeds 0.1 on Video Games, the decline is almost negligible. Hence, λ1 can be set to 0.4 in practical applications.

Figure 7: Visualization of the item embeddings in the Video Games dataset via t-SNE. The color of each point corresponds to its category.

4.5 Visualization of embedding

Five main categories of items in the Video Games dataset are selected to show that the embeddings are learned effectively, and several items from these categories are randomly sampled. Figure 7 shows the visualization of the item embedding vectors with t-SNE [41]. The figure indicates that, although not all items of the same category are close together, several items within the same category are always close to each other, possibly because several subcategories exist within one category. Hence, both the consistency and the differences among items are learned effectively. Moreover, across categories, the items in different categories also show significant differences. These results confirm that the proposed method learns cognitive embeddings for the items, which is favorable for logical reasoning.

5 Conclusion

A graph-aware collaborative reasoning method for CTR prediction is proposed in this paper. The proposed method uses the graph to propagate messages and endow the embeddings of users and items with deep collaborative information, so that user preferences can be effectively learned. A logical reasoning network is then adopted to learn the logical behaviors in the sequential behaviors from the interaction information, from which the intention of the user can be inferred. The proposed architecture thus learns user preferences and user requirements simultaneously. Extensive experiments on five large-scale datasets demonstrate that the proposed method outperforms state-of-the-art methods for CTR prediction.