1 Introduction

Accurate targeting in commercial recommender systems is of great importance, and Click-Through Rate (CTR) prediction plays a key role in it. CTR prediction aims to estimate the ratio of clicks to impressions of a recommended item for a user. Accordingly, we consider users to have negative preferences, rather than implicit feedback, on un-clicked items [25, 26]. In common display advertising systems, advertisers expect lower costs to achieve a higher return on investment. Ad exchange platforms usually trade with advertisers and publishers according to the generalized second price of the maximum effective Cost Per Mille (eCPM). If CTR is overestimated, advertisers waste campaign budgets on useless impressions; if CTR is underestimated, advertisers lose valuable impressions and campaigns may under-deliver. With the multi-billion dollar business of commercial recommendation today, CTR prediction has received growing interest from both academia and industry [3, 13, 21].

In web-scale commercial recommender systems, the inputs describing users’ characteristics come in two kinds of structures. The first kind is described by numerical or dense parameters, e.g., “Age_years=22, Height_cm=165”. Each such characteristic is formalized as a value associated with a numerical field, and these values are called dense features. The second kind is described by categorical or sparse parameters, e.g., “Gender=Female, Relationship=In love”. Each such characteristic is formalized as a one-hot encoding vector associated with a categorical field, and these vectors are called sparse features. Research shows that an important property of recommendation datasets for industrial use cases is the availability of both dense and sparse features [22]. Thus, the Criteo Kaggle dataset is usually regarded as representative of real production use cases. Moreover, the number of dense and sparse features in industrial use cases often ranges from hundreds to a thousand, with roughly a 50:50 split.

Data scientists usually spend much time on interactions of raw features to build better predictive models. Among these feature interactions, cross features, which have traditionally focused on products of sparse features, offer a promising way to enhance prediction performance [15]. Since useful cross features are mostly task-specific and difficult to identify a priori, the crucial challenge is to automatically extract sophisticated cross features hidden in high-dimensional data. Research on feature crossing, as a mainline of CTR prediction, has attracted widespread attention in recent years. Shallow models are simple, interpretable, and easy to scale, but limited in expressive ability. Deep learning, in contrast, has shown powerful expressive capabilities; nevertheless, deep neural networks (DNNs) require many more parameters than tensor factorization to approximate high-order cross features. Besides, almost all deep models leverage a multilayer perceptron (MLP) to learn high-order feature interactions; however, whether plain DNNs effectively represent the right functions of cross features remains an open question [10, 21].

In addition, most methods neglect cross features over dense features. There are three major patterns for handling dense features. First, dense features are discarded when crossing features, that is, they only participate in the linear part of the model [20]. Second, dense features are directly concatenated with the embeddings of sparse features, which can cause a serious feature dimensionality imbalance problem [21]. Third, dense features are converted into sparse features through bucketing, which introduces extra hyper-parameters and loses information carried by the dense features [12].

Based on all these observations, we propose a novel Extreme Cross Network (XCrossNet) to represent feature structure-oriented interactions. Modeling with XCrossNet consists of three stages. In the Feature Crossing stage, we separately propose a cross layer for crossing dense features and a product layer for crossing sparse features. In the Feature Concatenation stage, cross dense features and cross sparse features interact through a concatenate layer and a cross layer. Lastly, in the Feature Selection stage, we employ an MLP for capturing non-linear interactions and their relative importance. Experimental results on the Criteo Kaggle dataset demonstrate the superior performance of XCrossNet over state-of-the-art baselines.

2 Related Work

Studies on CTR prediction can be categorized into five classes, which are introduced below.

Generalized Linear Models. Logistic Regression (LR) models such as FTRL are widely used in CTR prediction for their simplicity and efficiency [9, 16]. Ling Yan et al. argue that LR cannot capture nonlinear feature interactions and propose Coupled Group Lasso (CGL) to address this [24]. LR models usually require considerable manual feature engineering. Gradient Boosting Decision Tree (GBDT) is a method to automatically perform feature engineering and search for interactions [4]; the transformed feature interactions can then be fed into LR. In practice, tree-based models are more suitable for dense features than for sparse features.

Quadratic Polynomial Mappings and Factorization Machines. Poly2 enumerates all pairwise feature interactions to avoid feature engineering, which works well on dense features [2]. For sparse features, Factorization Machine (FM) and its variants project each feature into a low-dimensional vector and model cross features by inner products [20]. SFM introduces the Laplace distribution to model the parameters and better fit sparse data with a higher ratio of zero elements [17]. FFM enables each feature to have multiple latent vectors to interact with features from different fields [8]. As FM and its variants can only model order-2nd cross features, Higher-Order FM (HOFM), an efficient algorithm for training arbitrary-order cross features, was proposed by introducing the ANOVA kernel [1]. As reported in [23], HOFM achieves marginal improvement over FM while using many more parameters, and only its low-order (usually less than 5) form can be practically used.

Implicit Deep Learning Models. As deep learning has shown promising representation capabilities, several models use deep learning to improve FM. Attention FM (AFM) enhances the importance of different order-2nd cross features via attention networks [23]. Neural FM (NFM) stacks deep neural networks on top of the output of the order-2nd cross features to model higher-order cross features [6]. FNN uses FM to pre-train low-order features and then feeds feature embeddings into an MLP [27]. In contrast, DSL uses an MLP to pre-train high-order non-linear features and then feeds them with basis features into an FM layer [7]. Moreover, CCPM uses convolutional layers to explore local-global dependencies of cross features [14]. IPNN (also known as PNN) feeds the interaction results of the FM layer and feature embeddings into an MLP [18]. PIN introduces a micro-network for each pair of fields to model pairwise cross features [19]. FGCNN combines a CNN and an MLP to generate new features for feature augmentation [11]. However, all these approaches learn high-order cross features in an implicit manner and therefore lack good model explainability.

Wide&Deep Based Models. Jianxun Lian et al. argue that implicit deep learning models focus more on high-order cross features but capture few low-order cross features [10]. To overcome this problem, a hybrid network structure, namely Wide&Deep, has been proposed, which combines a shallow component and a deep component with the purpose of learning both memorization and generalization [3]. The Wide&Deep framework revolutionized the development of CTR prediction and has attracted industry partners from the beginning. The first Wide&Deep model proposed by Google [3] combines a linear model (wide part) with a DNN, while the input of the wide part still relies on feature engineering. Later on, DeepFM uses an FM layer to replace the wide component. Deep&Cross [21] and xDeepFM [10] take outer products of features at the bit- and vector-wise level respectively. However, xDeepFM uses so many parameters that it becomes very challenging to identify important cross features in the huge combination space.

AutoML Based Models. Some pre-trained approaches use AutoML techniques to deal with cross features. AutoCross searches over subsets of candidate features to identify effective interactions [15]. This requires training the whole model to evaluate the selected feature interactions, and the candidate set is extremely large. AutoGroup treats the selection of high-order feature interactions as a structural optimization problem and solves it with Neural Architecture Search [12]. It achieves state-of-the-art performance on various datasets, but is too complex to be applied in industrial applications.

3 Extreme Cross Network (XCrossNet)

In this section, we will introduce the problem statement and describe the details of Extreme Cross Network (XCrossNet) in the following three steps: Feature Crossing, Feature Concatenation, and Feature Selection. The complete XCrossNet model is depicted in Fig. 1.

Fig. 1. The structure of XCrossNet.

3.1 Problem Statement

In web-scale commercial recommender systems, the inputs of users’ characteristics are in two kinds of structures. The first kind of structure is described by numerical or dense parameters, denoted as \(\varvec{D}\). The second kind of structure is described by categorical or sparse parameters, denoted as \(\varvec{S}\). Suppose that the dataset for training consists of n instances \(([\varvec{D}; \varvec{S}], y)\), where \(\varvec{D} = [D_1, D_2, \cdots , D_M]\) indicates dense features including M numerical fields, and \(\varvec{S} = [S_1, S_2, \cdots , S_N]\) indicates sparse features including N categorical fields, and \(y \in \{0, 1\}\) indicates the user’s click behaviors (\(y=1\) means the user clicked the item, and \(y=0\) otherwise). The task of CTR prediction is to build a prediction model \(\hat{y}=pCTR\_Model([\varvec{D}; \varvec{S}])\) to estimate the ratio of clicks to impressions of a given feature context.
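For concreteness, the following is a minimal sketch of how a single training instance might be represented under this formulation; the field names, values, and tensor layout are illustrative assumptions, not taken from the paper.

```python
import torch

# Hypothetical instance with M = 3 numerical fields and N = 2 categorical fields.
# Dense features D: one raw numerical value per numerical field.
D = torch.tensor([22.0, 165.0, 3.0])   # e.g., Age_years, Height_cm, plus a made-up third field

# Sparse features S: formally one-hot vectors, but in practice each categorical
# field is stored as the index of its active category and embedded later.
S = torch.tensor([1, 0])               # e.g., Gender=Female -> index 1, Relationship=In love -> index 0

y = torch.tensor(1.0)                  # click label: 1 = clicked, 0 = not clicked
```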

3.2 Feature Crossing

A cross feature is defined as a synthetic feature formed by multiplying (crossing) two features. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually. Based on the definition, cross features can be generalized to high-order cases. If we consider individual features as order-1st features, an order-kth cross feature is formed by multiplying k individual features.
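For instance (an illustration of the definition, not a formula from the paper), given individual dense features \(D_1\), \(D_2\), and \(D_3\), the product \(D_1 \cdot D_2\) is an order-2nd cross feature and \(D_1 \cdot D_2 \cdot D_3\) is an order-3rd cross feature.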

Cross Layers on Dense Features. First, we introduce a novel cross layer for crossing dense features (see Fig. 2). Cross layers have the following formula:

$$\begin{aligned} \begin{aligned} \varvec{C_1}&= \varvec{D} \cdot \varvec{D^\mathsf {T}} \cdot \varvec{W_{C,0}} + \varvec{b_{C,0}}, \quad \varvec{O^{C}_{1}} = [\varvec{D}; \varvec{C_1}], \\ \varvec{C_{l+1}}&= \varvec{D} \cdot \varvec{C_{l}^\mathsf {T}} \cdot \varvec{W_{C,l}} + \varvec{b_{C,l}}, \quad \varvec{O^{C}_{l+1}} = [\varvec{O^{C}_{l}}; \varvec{C_{l+1}}], \end{aligned} \end{aligned}$$
(1)

where \(\varvec{D} \in \mathbb {R}^M\) indicates the input dense features, and \(\varvec{C_l} \in \mathbb {R}^M\) is a column vector denoting the order-\((l+1)\)th cross features. Below we show that \(\varvec{C_l}\) expresses multivariate polynomials of degree \((l+1)\) after a weighted mapping. \(\varvec{W_{C,l}}, \varvec{b_{C,l}} \in \mathbb {R}^M\) are the weight and bias parameters respectively, and \(\varvec{O^{C}_{l}}, \varvec{O^{C}_{l+1}}\) denote the outputs of the l-th and the \((l+1)\)-th cross layers.
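The following PyTorch-style sketch illustrates one possible implementation of Eq. (1); the class and function names are ours and the initialization is an assumption, not the authors' code. The key observation is that \(\varvec{D} \cdot \varvec{C_l^\mathsf {T}} \cdot \varvec{W_{C,l}}\) collapses to the scalar \(\varvec{C_l^\mathsf {T}} \cdot \varvec{W_{C,l}}\) times the vector \(\varvec{D}\), so each cross layer costs only O(M) per instance.

```python
import torch
import torch.nn as nn

class DenseCrossLayer(nn.Module):
    """One cross layer from Eq. (1): C_{l+1} = D · C_l^T · W_{C,l} + b_{C,l}, all vectors in R^M."""
    def __init__(self, num_dense: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_dense) * 0.01)  # W_{C,l} in R^M (assumed init)
        self.bias = nn.Parameter(torch.zeros(num_dense))           # b_{C,l} in R^M

    def forward(self, dense: torch.Tensor, cross: torch.Tensor) -> torch.Tensor:
        # dense, cross: (batch, M). D · C_l^T · W_{C,l} equals the scalar (C_l^T · W_{C,l})
        # times the original dense vector D.
        scale = (cross * self.weight).sum(dim=1, keepdim=True)     # C_l^T · W_{C,l}
        return dense * scale + self.bias                           # C_{l+1}

def dense_cross_stack(dense: torch.Tensor, layers: nn.ModuleList) -> torch.Tensor:
    """Stack l cross layers and concatenate: O^C_l = [D; C_1; ...; C_l]."""
    outputs, cross = [dense], dense
    for layer in layers:
        cross = layer(dense, cross)
        outputs.append(cross)
    return torch.cat(outputs, dim=1)                               # O^C
```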

Fig. 2. The structure of cross layers.

We denote \(\varvec{\alpha } = [\alpha _{1}, \cdots , \alpha _{M}]\). If our proposed cross layer expresses every cross feature of order \((l+1)\), it should be able to approximate any multivariate polynomial of degree \((l+1)\), the class of which we denote as \(P_{l+1}(\varvec{D})\):

$$\begin{aligned} P_{l+1}(\varvec{D}) = \bigg \{ \sum _{\varvec{\alpha }} W_{\varvec{\alpha }} D_1^{\alpha _1} D_2^{\alpha _2} \cdots D_M^{\alpha _M} \, \bigg | \,|\varvec{\alpha }|=l+1 \bigg \}, \end{aligned}$$
(2)

where \(|\varvec{\alpha }|=\sum _{i=1}^{M}\alpha _{i}\). For simplicity, here we use \(\varvec{W^{i}} = [W^{i}_1, W^{i}_2, \cdots , W^{i}_M]\) to denote \(\varvec{W_{C,i}}\) with the layer index moved to a superscript. We study the coefficient \(\hat{W_{\varvec{\alpha }}}\) given by \(\varvec{C_l^{\mathsf {T}}} \cdot \varvec{W^{l}}\) from the cross layers, since it constitutes the output \(\varvec{O_{l+1}^{C}}\) of the \((l+1)\)-th cross layer. The following derivation omits bias terms. Then:

$$\begin{aligned} \begin{array}{l} \varvec{C_l^{\mathsf {T}}} \cdot \varvec{W^{l}} = \Big (\varvec{C_{l-1}^{\mathsf {T}}} \cdot \varvec{W^{l-1}} \Big ) \cdot \Big ( \varvec{D^\mathsf {T}} \cdot \varvec{W^{l}} \Big ) = \prod _{i=0}^{l} \varvec{D^{\mathsf {T}}} \cdot \varvec{W^{i}} \\ = \prod _{i=0}^{l} [D_1, D_2, \cdots , D_M] \cdot [W^{i}_1, W^{i}_2, \cdots , W^{i}_M]^{\mathsf {T}}. \end{array} \end{aligned}$$
(3)

Afterwards, let \(\varvec{I}\) denote a multi-index vector whose entries take orders in \(\{0, 1, \cdots , l\}\), and let \(I_j\) denote the order of field j. Then \(\varvec{C_l^{\mathsf {T}}} \cdot \varvec{W^{l}}\) from the cross layers yields the coefficient \(\hat{W_{\varvec{\alpha }}}\) as:

$$\begin{aligned} \hat{W_{\varvec{\alpha }}} = \sum _{k=1}^{M} \sum _{|\varvec{I}|=\alpha _k} \prod _{j=1}^{M} W_{j}^{I_j}. \end{aligned}$$
(4)

Since \(\varvec{C_l^{\mathsf {T}}} \cdot \varvec{W^{l}}\) approximates multivariate polynomials of degree \((l+1)\), the output \(\varvec{O^{C}_{l+1}}\) of the \((l+1)\)-th cross layer, which includes all cross features up to order \((l+1)\), can approximate polynomials in the following class:

$$\begin{aligned} P_{l+1}(\varvec{D}) = \bigg \{ \sum _{\varvec{\alpha }} W_{\varvec{\alpha }} D_1^{\alpha _1} D_2^{\alpha _2} \cdots D_M^{\alpha _M} \, \bigg | \,0 \le |\varvec{\alpha } | \le l+1 \bigg \}. \end{aligned}$$
(5)
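As a concrete check (our own worked illustration, not from the paper), take \(M=2\) and a single cross layer (\(l=1\)), ignoring biases:

$$\begin{aligned} \varvec{C_1^{\mathsf {T}}} \cdot \varvec{W^{1}} = \big ( \varvec{D^\mathsf {T}} \cdot \varvec{W^{0}} \big ) \big ( \varvec{D^\mathsf {T}} \cdot \varvec{W^{1}} \big ) = W^{0}_1 W^{1}_1 D_1^2 + \big ( W^{0}_1 W^{1}_2 + W^{0}_2 W^{1}_1 \big ) D_1 D_2 + W^{0}_2 W^{1}_2 D_2^2, \end{aligned}$$

which is a general homogeneous polynomial of degree 2 in \(D_1\) and \(D_2\); together with the order-1st block \(\varvec{D}\) retained in \(\varvec{O^{C}_{1}} = [\varvec{D}; \varvec{C_1}]\) and the constant terms contributed by the biases, a subsequent weighted mapping can express any polynomial in \(P_{2}(\varvec{D})\).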

Embedding and Product Layers on Sparse Features. Here we introduce the embedding layer and the product layer for crossing sparse features (see Fig. 3). As sparse features \(\varvec{S}\) are represented as one-hot encoding vectors in high-dimensional spaces, we employ an embedding layer to transform these one-hot vectors into dense vectors \(\varvec{E}\) as:

$$\begin{aligned} \begin{aligned} \varvec{E}&= [\varvec{E_1}, \cdots , \varvec{E_i}, \cdots , \varvec{E_N}], \\ \varvec{E_i}&= \mathrm{{embed}} (\varvec{S_i}), \, \big ( \varvec{E_i} \in \mathbb {R}^K, i=1, \cdots , N \big ) \end{aligned} \end{aligned}$$
(6)

where \(\varvec{S_i}\) indicates the input sparse feature of field i, K denotes the embedding size, and \(\varvec{E_i}\) denotes the feature embedding of field i.

Fig. 3. The structure of the embedding layer and the product layer.

Afterwards, we propose a product layer for crossing sparse features. First, we denote order-2nd cross sparse features as \(\varvec{P_2}\) and order-1st sparse features as \(\varvec{P_1}\); thus the output of the product layer is \(\varvec{O^P} = [\varvec{P_1}; \varvec{P_2}]\).

The cross feature of two sparse features from field i and field j equals the inner product of their embedding vectors, \(\langle \varvec{E_i}, \varvec{E_j} \rangle \). Intuitively, we expect cross features to be vectors, so we concatenate weighted sums of inner products to formulate order-2nd cross features as:

$$\begin{aligned} \varvec{P_2} = [P_2^1, \cdots , P_2^t, \cdots , P_2^T], \end{aligned}$$
(7)

where T is the size of the product layer, and \(\varvec{P_2}\) is a T-dimensional vector whose t-th dimension \(P_2^t\) denotes a weighted sum of inner products of pairs of sparse features. Thus, we have \(P_2^t = \sum _{i=1}^{N} \sum _{j=1}^{N} W^{2,t}_{i,j} \langle \varvec{E_i}, \varvec{E_j} \rangle \). To reduce the number of parameters, we assume that the weight factorizes as \(W^{2,t}_{i,j} = \varTheta _{i}^{t} \cdot \varTheta _{j}^{t}\), so \(P_2^t\) can be given as:

$$\begin{aligned} P_2^t= \sum _{i=1}^{N} \sum _{j=1}^{N} \varTheta _{i}^{t} \cdot \varTheta _{j}^{t} \langle \varvec{E_i}, \varvec{E_j} \rangle = \bigg \langle \sum _{i=1}^{N} \varTheta _{i}^{t} \cdot \varvec{E_i}, \sum _{j=1}^{N} \varTheta _{j}^{t} \cdot \varvec{E_j} \bigg \rangle . \end{aligned}$$
(8)

The feature vector of order-1st features has a similar formula as follows:

$$\begin{aligned} \varvec{P_1} = [P_1^1, \cdots , P_1^t, \cdots , P_1^T], \end{aligned}$$
(9)

where \(\varvec{P_1}\) is a T-dimensional vector whose t-th dimension \(P_1^t\) denotes a weighted sum of sparse features. The weighting can be expressed as an inner product \(\langle \varvec{W^{1,t}_{i}}, \varvec{E_i} \rangle \). Thus, we have \(P_1^t = \sum _{i=1}^{N} \langle \varvec{W^{1,t}_{i}}, \varvec{E_i} \rangle \).
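A possible PyTorch-style sketch of the embedding and product layers of Eqs. (6)-(9) follows; the module name, parameter layout, and initialization are our assumptions. The factorized form of Eq. (8) is what makes the order-2nd part cheap: each output unit needs one weighted sum of embeddings instead of all \(N^2\) pairwise inner products.

```python
import torch
import torch.nn as nn

class SparseProductLayer(nn.Module):
    """Embedding + product layer: O^P = [P_1; P_2] as in Eqs. (6)-(9)."""
    def __init__(self, field_sizes, embed_dim: int, product_size: int):
        super().__init__()
        # One embedding table per categorical field (Eq. (6)).
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, embed_dim) for card in field_sizes])
        num_fields = len(field_sizes)
        # W^{1,t}_i for the order-1st part (Eq. (9)): one K-dim weight per (output unit, field).
        self.w1 = nn.Parameter(torch.randn(product_size, num_fields, embed_dim) * 0.01)
        # Theta^t_i for the order-2nd part (Eq. (8)): one scalar per (output unit, field).
        self.theta = nn.Parameter(torch.randn(product_size, num_fields) * 0.01)

    def forward(self, sparse_idx: torch.Tensor) -> torch.Tensor:
        # sparse_idx: (batch, N) category indices, one per field.
        E = torch.stack([emb(sparse_idx[:, i]) for i, emb in enumerate(self.embeddings)],
                        dim=1)                                   # (batch, N, K)
        # P_1^t = sum_i <W^{1,t}_i, E_i>
        P1 = torch.einsum('bnk,tnk->bt', E, self.w1)             # (batch, T)
        # P_2^t = <sum_i Theta^t_i E_i, sum_j Theta^t_j E_j>  (Eq. (8))
        weighted = torch.einsum('bnk,tn->btk', E, self.theta)    # (batch, T, K)
        P2 = (weighted * weighted).sum(dim=-1)                   # (batch, T)
        return torch.cat([P1, P2], dim=1)                        # O^P = [P_1; P_2]
```

With this factorization, the order-2nd part costs O(NKT) per instance rather than O(N²KT).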

3.3 Feature Concatenation

In the Feature Concatenation stage, in order to learn feature interactions of different structures, cross dense features \(\varvec{O^C}\) and cross sparse features \(\varvec{O^P}\) are concatenated as a vector through a concatenate layer, then the concatenated feature vector is fed into a cross layer, which can be expressed as:

$$\begin{aligned} \begin{aligned} \varvec{X_0}&= [\varvec{O^C}; \varvec{O^P}], \\ \varvec{X_1}&= \varvec{X_0} \cdot \varvec{X_0^\mathsf {T}} \cdot \varvec{W_{X,0}} + \varvec{b_{X,0}}, \quad \varvec{H_0} = [\varvec{X_0}; \varvec{X_1}], \end{aligned} \end{aligned}$$
(10)

where \(\varvec{X_0}\) denotes the concatenated feature of cross dense features and cross sparse features, \(\varvec{X_1}\) denotes the cross features between two kinds of feature structures, \(\varvec{H_0}\) denotes the output from this cross layer, and \(\varvec{W_{X,0}}, \varvec{b_{X,0}}\) are the weight and bias parameters of this cross layer.

3.4 Feature Selection

In the Feature Selection stage, we employ an MLP to capture non-linear interactions and the relative importance of cross features. The deep layers and the output layer respectively have the following formulas:

$$\begin{aligned} \begin{aligned} \varvec{H_i}&= \mathrm{{ReLU}} ( \varvec{W_{H,i-1} } \cdot \varvec{H_{i-1}} + \varvec{ b_{H,i-1} } ), \\ O^G&= \mathrm{{Sigmoid}} (\varvec{W_{H,i}} \cdot \varvec{H_i} + \varvec{b_{H,i}} ), \end{aligned} \end{aligned}$$
(11)

where \(\varvec{H_i}, \varvec{H_{i-1}}\) are hidden layers, \(\mathrm{{ReLU}} (\cdot )\) and \(\mathrm{{Sigmoid}}(\cdot )\) are activation functions, \(\varvec{W_{H,i}}, \varvec{W_{H,i-1} }\) are weights, \(\varvec{b_{H,i}}, \varvec{ b_{H,i-1} }\) are biases, and \(O^G\) is the output result.
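Putting the three stages together, the following is a hedged end-to-end sketch that reuses the hypothetical `DenseCrossLayer`, `dense_cross_stack`, and `SparseProductLayer` from the earlier sketches. The default `product_size` is a placeholder of our own; the embedding size, hidden width, depth, and number of dense cross layers follow the settings reported in Sect. 4.1.

```python
import torch
import torch.nn as nn

class XCrossNetSketch(nn.Module):
    """Sketch of the Feature Crossing, Feature Concatenation, and Feature Selection stages."""
    def __init__(self, num_dense, field_sizes, embed_dim=20, product_size=64,
                 num_cross_layers=4, hidden_size=400, num_hidden=2):
        super().__init__()
        self.dense_cross = nn.ModuleList(
            [DenseCrossLayer(num_dense) for _ in range(num_cross_layers)])
        self.sparse_product = SparseProductLayer(field_sizes, embed_dim, product_size)
        concat_dim = num_dense * (num_cross_layers + 1) + 2 * product_size  # dim of X_0
        # Cross layer over the concatenated vector (Eq. (10)), same form as Eq. (1).
        self.concat_cross = DenseCrossLayer(concat_dim)
        # Feature Selection stage: plain MLP with ReLU and a sigmoid output (Eq. (11)).
        mlp, in_dim = [], 2 * concat_dim                                    # dim of H_0 = [X_0; X_1]
        for _ in range(num_hidden):
            mlp += [nn.Linear(in_dim, hidden_size), nn.ReLU()]
            in_dim = hidden_size
        mlp += [nn.Linear(in_dim, 1)]
        self.mlp = nn.Sequential(*mlp)

    def forward(self, dense, sparse_idx):
        OC = dense_cross_stack(dense, self.dense_cross)    # Feature Crossing (dense)
        OP = self.sparse_product(sparse_idx)               # Feature Crossing (sparse)
        X0 = torch.cat([OC, OP], dim=1)                    # Feature Concatenation
        X1 = self.concat_cross(X0, X0)                     # X_1 = X_0 · X_0^T · W_{X,0} + b_{X,0}
        H0 = torch.cat([X0, X1], dim=1)
        return torch.sigmoid(self.mlp(H0)).squeeze(-1)     # O^G
```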

For CTR prediction, the loss function is the Logloss as follows:

$$\begin{aligned} \varvec{\mathcal {L}} = - \frac{1}{n} \sum ^n_{i=1} \Big [ y_i \log (O^G_i) + (1-y_i) \log (1-O^G_i) \Big ], \end{aligned}$$
(12)

where n is the total number of training instances and \(O^G_i\) denotes the model output for the i-th instance. The optimization process is to minimize the following objective function:

$$\begin{aligned} \varvec{\mathcal {J}} = \varvec{\mathcal {L}} + \lambda ||\varvec{\varTheta }||, \end{aligned}$$
(13)

where \(\lambda \) denotes the regularization coefficient, and \(\varvec{\varTheta }\) denotes the set of learning parameters, including those of the cross layers, embedding layer, product layer, deep layers, and output layer.
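A minimal training-step sketch for Eqs. (12)-(13), assuming the `XCrossNetSketch` module above and applying the L2 term via Adam's weight decay (a common stand-in for explicit L2 regularization); the field counts and cardinalities are placeholders.

```python
import torch

model = XCrossNetSketch(num_dense=13, field_sizes=[1000] * 26)   # placeholder field setup
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)  # lambda = 0.0001
criterion = torch.nn.BCELoss()                                   # Logloss of Eq. (12)

def train_step(dense, sparse_idx, y):
    """One mini-batch update; batches of 4096 instances are used in Sect. 4.1."""
    optimizer.zero_grad()
    pred = model(dense, sparse_idx)    # O^G in (0, 1)
    loss = criterion(pred, y)          # L = -1/n sum [ y log O^G + (1-y) log(1-O^G) ]
    loss.backward()
    optimizer.step()
    return loss.item()
```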

4 Experiments

In this section, extensive experiments are conducted to answer the following research questions:

  • RQ1: How does XCrossNet perform compared with the state-of-the-art CTR prediction models?

  • RQ2: How does the feature dimensionality imbalance impact CTR prediction?

  • RQ3: How do hyper-parameter settings impact the performance of XCrossNet?

Table 1. Performance comparison of different CTR prediction models.

Fig. 4. Training time comparison of different CTR prediction models.

Fig. 5. Impact of feature dimensionality imbalance.

4.1 Experimental Setup

Dataset. Experiments are conducted on the Criteo Kaggle dataset, which comes from a world-famous demand-side platform. It contains one month of 45,840,617 ad click instances, with 13 integer feature fields and 26 categorical feature fields. We select 7 consecutive days of samples as the training set and the following day for evaluation.

Baselines. As aforementioned, we use the following highly related state-of-the-art models as baselines: LR [9], GBDT [4], FM [20], AFM [23], FFM [8], CCPM [14], Wide&Deep [3], Deep&Cross [21] and its shallow part Cross network, FNN [27], DeepFM [5], IPNN [18], PIN [19], xDeepFM [10] and its shallow part CIN, FGCNN [11], and AutoGroup [12].

Hyper-parameter Settings. For model optimization, we use Adam with a mini-batch size of 4096 and a learning rate of 0.001. We use L2 regularization with \(\lambda = 0.0001\) for all neural network models. For Wide&Deep, Deep&Cross, FNN, DeepFM, IPNN, PIN, xDeepFM, and XCrossNet, the number of neurons per deep layer is 400, and the depth of deep layers is set to 2. For our XCrossNet, the number of cross layers on dense features is set to l=4. In the main experiments, we set the embedding size for all models to a fixed value of 20.
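For reference, these settings can be collected in a single configuration object (a hypothetical consolidation for reproduction purposes, not the authors' code):

```python
# Settings reported in this subsection, gathered into one place.
config = {
    "optimizer": "Adam",
    "batch_size": 4096,
    "learning_rate": 0.001,
    "l2_lambda": 1e-4,
    "deep_layers": 2,
    "neurons_per_layer": 400,
    "dense_cross_layers": 4,   # l = 4
    "embedding_size": 20,
}
```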

4.2 Overall Performance (RQ1)

Table 1 summarizes the performance of all compared methods on the Criteo Kaggle dataset, while the training time on Tesla K80 GPUs is shown in Fig. 4 for comparison of efficiency. From the experimental results, we have the following key observations. Firstly, most neural network models outperform linear models (i.e., LR), tree-based models (i.e., GBDT), and FM variants (i.e., FM, FFM, AFM), which indicates that an MLP can learn non-linear feature interactions and endow better expressive ability. Meanwhile, comparing IPNN and PIN with FNN and Wide&Deep based models, we find that explicitly modeling low-order feature interactions can simplify the training of the MLP and boost performance. Secondly, XCrossNet achieves the best performance. Statistically, XCrossNet significantly outperforms the best baseline in terms of AUC and Logloss at the p-value \(<0.05\) level, which indicates that feature structure-oriented learning provides better predictive ability. Thirdly, from the training time comparison, we observe that XCrossNet is very efficient, especially compared to field-aware models, mainly because these models further allow each feature to learn several vectors, each associated with a field, which leads to huge parameter and time consumption.

Fig. 6. Impact of network hyper-parameters on AUC performance.

Fig. 7. Impact of network hyper-parameters on Logloss performance.

4.3 Feature Dimensionality Imbalance Study (RQ2)

In XCrossNet, we denote \(\left. { \frac{\dim (O^C)}{\dim (O^P) } } \bigg / { \frac{M}{N} }\right. \) as the balance index of the dimensions of dense and sparse features. Note that the dimension of cross dense features \(O^C\) equals \(M\cdot l\), increasing with the depth of the cross layers. For the Criteo Kaggle dataset, \(M=13\) and \(N=26\); we set the depth of cross layers from 1 to 8, and the corresponding dimensions of cross dense features range from 13 to 104. Experimental results are shown in Fig. 5 in terms of AUC. We observe that increasing the depth of cross layers helps XCrossNet achieve stable improvements in AUC, mainly because the higher dimensionality of cross dense features boosts the balance index, which results in relatively balanced impacts of dense and sparse features on prediction.

4.4 Hyper-parameter Study (RQ3)

We study the impact of the hyper-parameters of XCrossNet, including (1) embedding size; (2) number of deep layers; (3) activation function; (4) neurons per layer. Figures 6a and 7a show the impact of embedding size. We observe that model performance improves steadily as the embedding size increases from 4 to 20. Even with very small embedding sizes, XCrossNet still performs comparably to some popular Wide&Deep based models with large embedding sizes. Specifically, XCrossNet achieves AUC\(>0.800\) and Logloss\(<0.541\) with the embedding size set to 10, which is even better than DeepFM with the embedding size set to 20. Figures 6b and 7b show the impact of the number of deep layers. Model performance improves with the depth of the MLP at the beginning, but starts to degrade when the depth exceeds 3. As shown in Figs. 6c and 7c, ReLU is indeed more appropriate for hidden neurons of the deep layers than the other activation functions. As shown in Figs. 6d and 7d, model performance barely improves as the number of neurons per layer increases from 300 to 700. We consider 400 a more suitable setting to avoid overfitting.

5 Conclusion

Since previous work rarely attempts to individually learn representations for different feature structures, this paper presented a novel feature structure-oriented learning model, namely Extreme Cross Network (XCrossNet), for improving CTR prediction in recommender systems. An XCrossNet model starts with a Feature Crossing stage, followed by a Feature Concatenation stage and a Feature Selection stage. The main contribution of our approach is to represent dense and sparse feature interactions in an explicit and efficient way. Empirical studies verified the effectiveness of our model on the Criteo Kaggle dataset.