Keywords

1 Introduction

Large-scale Learning to Rank (LTR) is a central part of real-world information retrieval problems, such as web search and content recommendations. Given a query of web search, the search engine first retrieves relevant webpages from the database and sorts the webpages by the ranking scores predicted by LTR models. While traditional LTR models transform ranking into regression problems of various types, they usually fail to capture the structural information over the interactions between billions of queries and trillions of webpages. These interactions indeed characterize how all these queries and webpages connect each other in a large bipartite graph of the web. To provide a better search experience, it is inevitable to incorporate such structural information in LTR.

On the other hand, in recent years, Graph Neural Networks, such as Graph Convolutional Neural Networks (GCN) [18], have been used for user-item recommendations and demonstrated their unique advantages in modeling the problem as link prediction over bipartite graphs [10]. Similar to LTR based on query-webpage pairs, the user-item recommendation also needs to rank items (e.g., products or contents) subject to the given profile of every user. However, it is difficult to directly adopt GCNs for LTR tasks at web-scale, due to the sparsity and imbalanced issues as follows. First of all, as shown in Fig. 1, links are extremely sparse in the query-webpage bipartite graph extracted from LTR training datasets, as labeling query-webpage pairs by professional annotators is expensive and time-consuming (difficult to scale-up). Furthermore, from the webpages’ perspectives, edges between queries and webpages are severely imbalanced—it is quite common to link a query to dozens of webpages with both high and low relevant scores, while a webpage hardly links to any queries, especially to the queries that the webpage is less relevant or low-ranked, in the annotations. Apparently, either sparsity or imbalance issue would significantly downgrade the performance of GCN models [29]. Therefore, to tackle the above two challenges, there needs a non-trivial extension on the GCN-based model for handling LTR at web-scale.

In this work, we propose LtrGCN to tackle the above two issues and adopt GCNs for LTR tasks in a pipeline as follows. Given all query-webpage pairs in the LTR training dataset, LtrGCN first leverages two advanced sampling strategies to generate the Q-subgraph and W-subgraph for every query and webpage. Then, LtrGCN leverages GCNs to extract feature vectors from the Q-subgraph and W-subgraph as the representation of the query-webpage pair for ranking score prediction. The feature extraction and ranking score prediction are optimized in an end-to-end manner, so as to enable discriminative feature extraction while preserving structural information in the bipartite graph. As sparsity and imbalance issues are addressed, LtrGCN can work with the training datasets, where it is sufficient that only a small proportion of query-webpage pairs are labeled by experts. Furthermore, we conduct extensive experiments to evaluate LtrGCN using two real-world datasets (offline), and launch online experiments based on an A/B test at a large-scale search engine. The offline results show that LtrGCN could achieve the best performance on both datasets compared to baselines. As for the online evaluation, we deploy LtrGCN with realistic traffic at a large-scale search engine, where we still observe significant improvement. LtrGCN performs consistently in both offline and online experiments. Our main contributions are summarized as follows:

  • We study the problem of the extreme sparsity of links in query-webpage bipartite graphs caused by the expense of ranking score annotation and the imbalance between queries and webpages for web-scale search. To the best of our knowledge, this work is the first to investigate sparsity and imbalance in query-webpage bipartite graphs for large-scale industrial LTR tasks.

  • We propose LtrGCN consisting of three steps: (1) Q-subgraph Generation via Self-tuned Labeling that annotates all unlabeled query-webpage pairs and assigns every query webpages with ranking scores to generate Q-subgraphs, (2) W-subgraph Generation via Negative Sampling that find irrelevant queries for every webpage to construct W-subgraphs, (3) Learning to Rank based on GCN with Q-subgraphs and W-subgraphs that learns the representations of query-webpage pairs from Q-subgraphs and W-subgraphs and predicts ranking scores in an end-to-end manner.

  • We carry out extensive offline experiments on a public LTR dataset and a real-world dataset collected from a large-scale search engine. We also deploy LtrGCN at the search engine and implement a series of online evaluations. The experiment results show that, compared to the state-of-the-art in webpage ranking, LtrGCN could achieve the best performance on both offline datasets. Furthermore, LtrGCN obtains significant improvements in online evaluations under fair comparisons.

Fig. 1.
figure 1

Sparsity and Imbalance Issues in the Query-Webpage Bipartite Graph.

2 Methodology

2.1 Task Formulation

Given a set of search queries \(\mathcal {Q}=\{q_1,~q_2,\) \(\dots \}\) and all archived webpages \(\mathcal {D}=\{d_1,d_2,\dots \}\), for each query \(q_i\in \mathcal {Q}\), the search engine could retrieve a set of relevant webpages denoted as \(D_i=\{d^i_j\}^{|D_i|}_{j=1}\subset \mathcal {D}\). Through professional labeling, a set of ranking scores \(\boldsymbol{y}_i=\{y^i_j\}^{|D_i|}_{j=1}\) for \(q_i\) is established to characterize the relevance of the webpage \(d^i_j\in D_i\) to the search query \(q_i\). In this paper, we follow the settings in [26] and scale the relevant score from 0 to 4 to represent levels of relevance (i.e., \(\left\{ {\textbf {bad-0, fair-1, good-2, excellent-3, perfect-4}}\right\} \)). We denote a set of query-webpage pairs with ranking score annotations as triples \(\mathcal {S}=\{(q_1,D_1,\boldsymbol{y}_1),~(q_2,D_2,\boldsymbol{y}_2),(q_3,\) \(D_3,\boldsymbol{y}_3),\dots \}\). We aim to learn an LTR scoring function \(f:\mathcal {Q}\times \mathcal {D}\rightarrow [0,4]\), which could be approximated through minimizing the following ranking loss:

$$\begin{aligned} \mathcal {L}=\frac{1}{|\mathcal {S}|} \sum _{i=1}^{|\mathcal {S}|}\left( \frac{1}{|D_i|}\sum _{j=1}^{|D_i|} \ell (\boldsymbol{y}^i_j, f(q^i,d^i_j))\right) , \end{aligned}$$
(1)

where \(\ell \) represents the loss of the ranking prediction for query \(q_i\) with returned webpage \(d^i_j\) against the ground truth label \(\boldsymbol{y}^i_j\). Note that, LtrGCN is flexible with standard loss functions (i.e., pointwise, pairwise, and listwise). As annotators can barely label a small number of query-webpage pairs due to the limited budgets, the key problem of LTR is thus to incorporate the unlabeled query-webpage pairs denoted as set \(\mathcal {S}^{'}=\{(q'_1,D'_1),~(q'_2,D'_2),\dots \}\).

Fig. 2.
figure 2

The framework of LtrGCN consisting of three steps: (1) Q -subgraph Generation via Self-tuned Labeling, (2) W -subgraph Generation via Negative Sampling, and (3) GCN-based LTR with Q -subgraphs and W -subgraphs.

2.2 Overall Framework of LtrGCN

As illustrated in Fig. 2, LtrGCN consists of three steps: (1) Q-subgraph Generation via Self-tuned Labeling, (2) W-subgraph Generation via Negative Sampling, and (3) Learning to Rank based on GCN with Q-subgraphs and W-subgraphs. Specifically, in Step (1), LtrGCN first annotates all unlabeled query-webpage pairs with pseudo ranking scores and then assigns every query webpages with high ranking scores and also webpages with low scores to generate Q-subgraphs from the training set. Then, in Step (2), LtrGCN proposes a negative sampling strategy to find irrelevant queries for every webpage to construct W-subgraphs. Eventually, in Step (3), given Q-subgraphs and W-subgraphs for every high-ranked query-webpage pair, LtrGCN learns the representations of query-webpage pairs using a Light Graph Convolution Network (LightGCN) [12] and enables LTR in an end-to-end manner.

2.3 Q-subgraph Generation via Self-tuned Labeling

As mentioned above, to leverage GCN for ranking, there needs to feed the model with Q-subgraphs and W-subgraphs. Given query-webpage pairs that are sparsely annotated with ranking scores in the training set, LtrGCN adopts a labeling approach [23] that first annotates every unlabeled query-webpage pair with a pseudo ranking score and then assigns every query webpages with high ranking scores and also webpages with low scores, so as to generate Q-subgraphs from the training set at full-scale. Thus, there needs the learning to predict pseudo ranking scores with labeled/unlabeled samples in the training set.

LtrGCN first gets every possible query-webpage pair from query and webpage datasets as \((q_i, d_i^j)\) for \(\forall q_i\in \mathcal {Q}\) and \(\forall d_i^j\in D_i\subset \mathcal {D}\). For each query-webpage pair \((q_i, d_i^j)\), LtrGCN further extracts an m-dimensional feature vector \(\boldsymbol{x}_{i,j}\) representing the features of the \(j^{th}\) webpage under the \(i^{th}\) query. Then, the labeled and unlabeled sets of feature vectors can be presented as \(\mathcal {M}=\{(\boldsymbol{x}_{i,j},\boldsymbol{y}^i_j)|\forall (q_i,D_i,\boldsymbol{y})\in \mathcal {S}~\text {and}~\forall d_j^i\in D_i\}\) and \(\mathcal {M}^{'}=\{\boldsymbol{x}_{i,j}|\forall (q_i,D_i)\in \mathcal {S}^{'}\}\). Given the labeled feature set \(\mathcal {M}\) and the unlabeled feature set \(\mathcal {M}^{'}\), LtrGCN further takes a two-step strategy to accomplish the pseudo-label generation via multi-loss learning as follows.

First, LtrGCN trains an LTR model with the listwise loss function as:

$$\begin{aligned} \mathcal {L}_{List}=-\frac{1}{|\mathcal {M}|} \sum _{i=1}^{|\mathcal {M}|}\left( \frac{1}{|D_i|}\sum _{j=1}^{|D_i|} {\text {softmax}}(\boldsymbol{y}^i_j) \times \log \left( {\text {softmax}}f(q^i,d^i_j)\right) \right) . \end{aligned}$$
(2)

The listwise-based LTR model is denoted as \(\textrm{Rank}^{List}\). LtrGCN trains \(\textrm{Rank}^{List}\) using both \(\mathcal {M}\) and \(\mathcal {M}^{'}\) through self-training, where \(\textrm{Rank}^{List}\) is first trained using \(\mathcal {M}\) through supervised learning. Then, \(\textrm{Rank}^{List}\) predicts the ranking score for each feature vector in \(\mathcal {M}^{'}\) and pseudo-labels the feature vector with the prediction result. After that, LtrGCN combines \(\mathcal {M}\) with pseudo-labeled data \(\mathcal {M}^P\) and retrains \(\textrm{Rank}^{List}\) using the combined data \(\mathcal {M}^C\).

Given the \(\mathcal {M}^C\), \(\mathcal {M}\) and \(\mathcal {M}^{'}\), LtrGCN (1) trains an LTR model with the pointwise loss function as:

$$\begin{aligned} \mathcal {L}_{Point}=\frac{1}{|\mathcal {M}^C|} \sum _{i=1}^{|\mathcal {M}^C|}\left( \frac{1}{|D_i|}\sum _{j=1}^{|D_i|} |f(q^i,d^i_j)-\boldsymbol{y}^i_j|^{2}\right) . \end{aligned}$$
(3)

The pointwise-based LTR model is denoted as \(\textrm{Rank}^{Point}\). LtrGCN first trains \(\textrm{Rank}^{Point}\) with \(\mathcal {M}^C\) and predicts pseudo-labels for each feature vector in \(\mathcal {M}^{'}\) using trained \(\textrm{Rank}^{Point}\). Then, LtrGCN updates \(\mathcal {M}^P\) with the prediction results of \(\textrm{Rank}^{Point}\) and combines \(\mathcal {M}\) with \(\mathcal {M}^P\) to conduct \(\mathcal {M}^C\). LtrGCN further retrains \(\textrm{Rank}^{List}\) using \(\mathcal {M}^C\) and predicts ranking scores for each feature vector in \(\mathcal {M}^{'}\) using trained \(\textrm{Rank}^{List}\). Finally, LtrGCN updates \(\mathcal {M}^P\) with the prediction results of \(\textrm{R}^{List}\) and combines \(\mathcal {M}\) with \(\mathcal {M}^P\) to obtain \(\mathcal {M}^C\). LtrGCN repeats the above steps with \(\mathcal {T}\) rounds and returns \(\mathcal {M}^C\).

With pseudo ranking scores predicted for all unlabeled samples, LtrGCN builds a Q-subgraph for every query with the (pseudo) ranking scores greater than 2 (good). Specifically, to build the Q-subgraph, LtrGCN randomly picks up a webpage that the query-webpage is with the ranking score lower than 1 (fair), and forms the three items (i.e., the query, a highly-ranked webpage of the query, a low-ranked webpage of the query) into a Q-subgraph.

2.4 W-subgraph Generation via Negative Sampling

Though Q-subgraph Generation step could generate ranking scores for every query-webpage pair in the training dataset, it is still difficult to construct W-subgraphs using predicted scores at full-scale. While every query connects to the webpages with high/low pseudo ranking scores, a webpage usually only connects to one or very limited highly-relevant queries and the number of webpages is much larger than that of effective queries from a webpages’ perspective. Thus, there needs to find irrelevant queries for every webpage. To build W-subgraphs for a webpage, LtrGCN leverages a negative sampling strategy. Given a webpage, LtrGCN retrieves all query-webpage pairs, builds a W-subgraph for every query-webpage with the ranking scores higher than 2 (fair). Specifically, LtrGCN randomly picks up a query that does not connect to the webpage as the irrelevant query, then forms the three (i.e., the webpage, a query where the webpage is highly ranked, and an irrelevant query) into a W-subgraph. Specifically, for a query \(q^{i}\), LtrGCN randomly chooses the webpage from the other query to conduct the negative samples and assigns the relevant score as 0 or 1 to represent poor relevance. Through this negative sampling method, LtrGCN could build W-subgraphs for a webpage.

2.5 GCN-Based LTR with Q-subgraphs and W-subgraphs

Given Q-subgraphs and W-subgraphs for every high-ranked query-webpage pair, in this step, LtrGCN learns the representations of query-webpage pairs with a GCN and enables learning to rank (LTR) in an end-to-end manner.

In the initial step, given the Q-subgraph and W-subgraph, LtrGCN extracts the feature vector of each query and webpage. Specifically, the feature of query \(q_{i}\) and webpage \(d^{i}_{j}\) is denoted as \(\boldsymbol{z}_{q^{i}}^{(n=0)}\) and \(\boldsymbol{z}_{d^{i}_{j}}^{(n=0)}\), where n indicates the feature output from the \(n^{th}\) GCN layer. Next, the GCN-based encoder utilizes the query-webpage interaction graph to propagate the representations as:

$$\begin{aligned} \begin{aligned} \boldsymbol{z}_{q^{i}}^{(n+1)}=\sum _{{d^{i}_{j}} \in \mathcal {N}_{q^{i}}} \frac{1}{\sqrt{\left| \mathcal {N}_{q^{i}}\right| } \sqrt{\left| \mathcal {N}_{d_{j,i}}\right| }} \boldsymbol{z}_{d^{i}_{j}}^{(n)}, \\ \boldsymbol{z}_{d^{i}_{j}}^{(n+1)}=\sum _{q^{i} \in \mathcal {N}_{d^{i}_{j}}} \frac{1}{\sqrt{\left| \mathcal {N}_{q^{i}}\right| } \sqrt{\left| \mathcal {N}_{d_{j,i}}\right| }} \boldsymbol{z}_{q^{i}}^{(n)}, \end{aligned} \end{aligned}$$
(4)

where \(\mathcal {N}_{q^{i}}\) and \(\mathcal {N}_{d_{j,i}}\) represent the set of webpages that are relevant to query \(q^{i}\) and the set of queries that are relevant to webpage \(d_{j}^{i}\), respectively. Moreover, \(\frac{1}{\sqrt{\left| \mathcal {N}_{q^{i}}\right| } \sqrt{\left| \mathcal {N}_{d_{j,i}}\right| }}\) is the normalization term used to prevent the scale of representations from increasing as a result of graph convolution operations. After N layers graph convolution operations, LtrGCN combines the representations generated from each layer to conduct the final representation of query \(q^{i}\) and webpage \(d^{i}_{j}\) as follows:

$$\begin{aligned} \begin{aligned} \boldsymbol{z}_{q^{i}}=\sum _{n=0}^{N} \beta _{n} \boldsymbol{z}_{q^{i}}^{(n)};~ \boldsymbol{z}_{d^{i}_{j}}=\sum _{n=0}^{N} \beta _{n} \boldsymbol{z}_{d^{i}_{j}}^{(n)}, \end{aligned} \end{aligned}$$
(5)

where \(\beta _{n} \in [0,1]\) is a hyper-parameter to balance the weight of each layer representation. Then, LtrGCN combines \(\boldsymbol{z}_{q^{i}}\) and \(\boldsymbol{z}_{d^{i}_{j}}\) to conduct the learned query-webpage pair representation as \(\boldsymbol{z}_{i,j}\).

Given the learned vector \(\boldsymbol{z}_{i,j}\), LtrGCN adopts a Multi-Layer Perception (MLP)-based model with a fully-connected layer to calculate the predicted score \(\boldsymbol{s}_{i,j}\). The whole process can be formulated as: \(\boldsymbol{s}_{i,j} = f_{\theta }(\boldsymbol{z}_{i,j})\), where \(\theta \) is the set of discriminative parameters. Against the ground truth, LtrGCN leverages the discriminative loss function, which is defined as:

$$\begin{aligned} \mathcal {L}_{Disc}= \frac{1}{|\mathcal {Q}|}\sum _{i=1}^{|\mathcal {Q}|}\left( \frac{1}{|D_i|}\sum _{j=1}^{|D_i|} \ell _{LTR}\left( \boldsymbol{y}^i_j, f_{\theta }(\boldsymbol{z}_{i,j})\right) \right) , \end{aligned}$$
(6)

where \(\ell _{LTR}\) represents the standard LTR loss function (i.e., pointwise, pairwise and listwise).

3 Deployment of LtrGCN

In this section, we introduce the deployment settings of LtrGCN at a large-scale industrial search engine. As illustrated in Fig. 3, we present the overall workflow of the real-world deployment and the three-stage design of the search engine as follows: (1) Webpage Collection, (2) Webpage Storage and Indexing, and (3) Retrieval and Ranking.

Webpage Collection. To efficiently navigate the vast expanse of webpages available on the internet, the search engine employs high-performance crawlers known as Web Crawlers. These crawlers play a vital role in collecting and downloading webpages. The Web Crawler operates by systematically scanning a comprehensive list of links, and actively searching for new webpages and updates to existing ones. It selects and stores valid links containing the desired content, creating a downloading list. Utilizing real-time web traffic data, the Web Crawler initiates the downloading process, ensuring timely retrieval of information.

Fig. 3.
figure 3

The overview of the large-scale search engine with LtrGCN deployed.

Webpage Storage and Indexing. The search engine stores downloaded webpages in distributed archival storage systems and create efficient indices for high-performance search. These storage systems utilize elastic resources across multiple regional data centers, reducing storage costs. The indexing system balances indexing workloads and achieves superb I/O efficiency through novel key-value operations and in-memory computation. This combination allows search engines to effectively manage large volumes of web content and provide fast and accurate search results.

Retrieval and Ranking. Given a search query, the search engine first retrieves all relevant webpages from the dataset and sorts top-K relevant webpages. Specifically, the search engine adopts a pre-trained language model based semantic retrieval algorithms to enhance the conventional retrieval approach. Then, the search engine pairs each webpage with the query to conduct a query-webpage pair and uses LtrGCN to accomplish ranking tasks. To ensure LtrGCN can satisfy the rapid shift of internet interest, the search engine periodically picks up new queries and relevant webpages, hires people to annotate scores and re-trains LtrGCN with labeled data.

4 Experiments

In this section, we first detail experimental settings. Then, we introduce the results of the offline experiments. Finally, extensive online experiments further demonstrate the effectiveness of LtrGCN at a real-world search engine.

4.1 Experimental Settings

Datasets. To demonstrate the effectiveness of our proposed model, we present extensive experiments on a common-used public dataset, MSLR-Web30K [26] and a real-world dataset collected from a large-scale commercial web search engine. MSLR-Web30K contains about 30,000 queries and 3,771,125 query-webpage pairs. Each query-webpage pair is represented as a 136-dimensional real-valued feature vector associated with a relevance label with a scale from 0 (irrelevant) to 4 (perfectly relevant). In our experiments, we perform the five-fold cross-validation [26] and report the average results across five folds.

Collected Dataset contains 15,000 queries and over 770,000 query-webpage pairs from a large-scale industrial search engine. In our dataset, each query-webpage pair is also represented as a 120-dimensional real-valued feature vector associated with a relevant score. We randomly split the real-world dataset into a training set (9,000 queries), a validation set (3,000 queries), and a test set (3,000 queries). All features are standardized before being fed into the ranking models in our experiments.

Evaluation Metric. To evaluate the performance of LtrGCN, we utilize Normalized Discounted Cumulative Gain (NDCG) [15], which is widely adopted in LTR tasks. The NDCG score for the query could be computed as follows:

$$\begin{aligned} \textrm{NDCG}_{N}=\frac{1}{Z} \sum _{i=1}^{N} \frac{2^{y_{i}}-1}{\log _{2}(1+i)}, \end{aligned}$$
(7)

where Z is a normalization factor that is the ideal order of Discounted Cumulative Gain [14], and \(y_{i}\) is the ranking score of the \(i^{th}\) webpage. Additionally, the value of NDCG ranges between [0, 1], and a higher NDCG\(_{N}\) indicates a better LTR model. In our experiments, we consider the NDCG of the top 5 and 10 results (i.e., NDCG\(_{5}\) and NDCG\(_{10}\)) for research and business purposes.

Interleaving [9] is a widely used metric for evaluating the performance of an industrial search engine. In interleaved comparison, two results generated from different systems are delivered to users whose click-through actions would be attributed to the system that delivers the corresponding results.

Good vs. Same vs. Bad (\(\textrm{GSB}\)) [41] is an online pairwise metric evaluated by professional annotators. In manual comparison, two results produced by the new system and the legacy system are provided to human experts that are required to judge which result is better.

Loss Functions and Competitor Systems. To evaluate the effectiveness of our proposed model comprehensively, we adopt different state-of-the-art ranking loss functions as follows: Root Mean Square Error (RMSE) is a widely used pointwise loss. RankNet [6] and LambdaRank [5] are two popular pairwise losses for neural LTR tasks both in research and industry. More particular, LambdaRank multiplies actual gradients with the change in NDCG by swapping the rank positions of the two candidates. ListNet [7] and ListMLE [36] are two listwise losses, which calculate the probability of the ideal permutation based on the ground truth. ApproxNDCG [27] and NeuralNDCG [25] are two listwise loss functions that directly optimize the metric.

For offline experiments, we compare LtrGCN with the state-of-the-art ranking models to conduct comprehensive comparisons as follows: MLP is a commonly used ranking model. Context-Aware Ranker (CAR) [24] is a Transformer [31]-based ranking model. XGBoost [8] and LightGBM [17] are two tree-based ranking models with pairwise and listwise loss, respectively.

Considering the high expense of deploying ranking models and the prior experience, we only compare our proposed model with the aforementioned models without including more previous ranking models [2, 3, 16]. For online experiments, we only report the improvement between LtrGCN and the legacy system.

Table 1. Performance on MSLR-Web30k under various ratios of labeled data.
Table 2. Performance on Collected Dataset under various ratios of labeled data.

4.2 Offline Experimental Results

Comparative Results. Tables 1 and 2 illustrate offline experimental results of LtrGCN compared with baselines on MSLR-Web30K and Collected Dataset on NDCG\(_{5}\) and NDCG\(_{10}\). We use the name of each loss to present MLP with the loss and “+” to represent “ LtrGCN +”. Intuitively, we could observe that LtrGCN outperforms all baselines on two datasets. Specifically, LtrGCN+ApproxNDCG obtains the best performance on NDCG\(_{10}\) with 5% labeled data on MSLR-Web30K, which gains 1.88% improvements compared with the base model with ApproxNDCG. LtrGCN with NeuralNDCG achieves the best performance on MSLR-Web30K with the other settings. Moreover, LtrGCN with NeuralNDCG obtains the best performance against all competitors. Specifically, LtrGCN+ NeuralNDCG achieves the improvement with 1.11%, 1.17%, 1.38% and 1.43% that MLP with NeuralNDCG on NDCG\(_{5}\) on Collected Dataset. The performance of our model improves consistently with the label ratio increasing. We also compared LtrGCN with LightGCN [12], which cannot be trained at all with “Out of Memory” flagged, due to the sparsity issue.

Table 3. Ablation studies of LtrGCN+NeuralNDCG on NDCG\(_5\) under various ratios of labeled data on two datasets.
Table 4. Ablation studies of LtrGCN+ApproxNDCG on NDCG\(_{10}\) under various ratios of labeled data on two datasets.

Ablation Studies. In this study, we conduct a series of ablation studies to investigate the effectiveness of the three steps of LtrGCN. Specifically, LtrGCN w/o Q -subgraph Generation via Self-tuned Labeling (QGSL) is the model that replaces QGSL with a pointwise-based self-trained LightGBM to pseudo data. As for LtrGCN w/o W -subgraph Generation via Self-tuned Labeling, it fails to train the model due to the sparsity issue. LtrGCN w/o GCN-based LTR with Q -subgraphs and W -subgraphs (GLQW) is the proposed model that directly utilizes the MLP-based LTR model on the combined data. LtrGCN w/o MLP-based LTR Model (MLM) is the proposed model that utilizes an MLP model with two layers following GLQW.

As shown in Tables 3 and 4, we sample the ablation study results of LtrGCN with NeuralNDCG on NDCG\(_{5}\) and LtrGCN with ApproxNDCG on NDCG\(_{10}\) under four ratios of labeled data. Intuitively, we could observe that the three steps contribute to positive improvements for LtrGCN under all settings. Specifically, GLQW gains the improvement with 1.04%, 0.97%, 1.02% and 0.95% on NDCG\(_5\) for LtrGCN+NeuralNDCG on MSLR-Web30K. Similarly, QGSL improves the performance of LtrGCN with ApproxNDCG on NDCG\(_{10}\) with 1.36%, 1.37%, 1.41% and 1.43% on Collected Dataset. All results of ablation studies demonstrate the effectiveness of the three steps for LtrGCN.

4.3 Online Experimental Results

Interleaving and Manual Evaluation. Table 5 illustrates performance improvements on \(\varDelta _{A B}\) and \(\varDelta \textrm{GSB}\). We first find that LtrGCN trained under 20% labeled data achieves substantial improvements for the online system on two metrics, which demonstrates the practicability and effectiveness of our proposed model. Specifically, the proposed model achieves the most significant improvement with 0.26% and 3.00% on \(\varDelta _{A B}\) and \(\varDelta \textrm{GSB}\) for random queries, respectively. Also, we observe that the proposed model outperforms the legacy system for long-tail queries whose search frequencies are lower than 10 per week. Particularly, the largest advantages of \(\varDelta _{A B}\) and \(\varDelta \textrm{GSB}\) are 0.41% and 6.50%.

Online A/B Test. To further verify the effectiveness of LtrGCN, we conduct a series of online A/B test with real-world web traffic and compare it with the legacy system at a large-scale search engine. According to offline experimental results, we deploy the trained LtrGCN under four ratios of labeled data with 5% real-world web traffic, which contains millions of queries per day. The online A/B tests last for 7 days. Due to the page limit, we only report the performance of trained models under 15% and 20% labeled data. Figure 4 illustrates the comparison of LtrGCN with the legacy system on \(\varDelta \)NCDG\(_{5}\). LtrGCN could boost the performance compared with the online legacy system all day, which demonstrates that LtrGCN is practical for improving the performance of the large-scale search engine. Moreover, we could observe that the trained LtrGCN with NeuralNDCG under 15% and 20% labeled data achieves the most significant improvement with 0.64% and 0.60%. The improvement reveals the effectiveness of LtrGCN. Eventually, it could be observed that LtrGCN performs stably on all days. Online performance is consistent with offline experiment results.

Table 5. Performance improvements of online evaluation.
Fig. 4.
figure 4

Online comparative performance (\(\varDelta \)NCDG\(_{5}\)) of LtrGCN for 7 days (t-test with \(p <0.05\) over the baseline).

5 Related Work

Learning-to-rank (LTR) techniques generally pertain to machine learning methods that are utilized to solve ranking problems, which are crucial in various applications, such as search engine and recommendation system. Based on the loss function, LTR models could be divided into three types: pointwise [20], pairwise [5, 16] and listwise [7, 25, 27, 36]. The pointwise loss formulates the LTR problem into a regression task. The pairwise loss converts two documents into a document pair to treat LTR tasks as binary classification problems. The listwise loss treats the whole document list as a sample and directly optimizes the evaluation metrics [28], such as NDCG [15]. Recently, deep models have been widely employed in LTR tasks, achieved by minimizing various ranking loss functions in an end-to-end manner [4, 22, 32, 39]. However, deep techniques have led to the study of learning to rank using implicit user feedback, but biases cause unsatisfied performance, so unbiased learning to rank has been proposed to mitigate these biases [30, 34, 38]. In this work, we focus on solving practical LTR problems in the industrial scenario.

In recent years, the modeling graph structure is highlighted by the developed Graph Convolutional Networks (GCN). Existing GCN methods could be categorized into two families [35, 40]: spectral GCN and spatial GCN. Spectral GCN leverages graph spectral representations to define graph convolutions, such as SGCN [33], JK-Net [37] and MixHop [1]. Spatial GCN models suggest mini-batch graph training on spatially connected neighbours [42]. Many works have studied the problem of node representation and re-defined graph convolution in the spatial domain, such as GraphSage [11] and ASGCN [13]. It is important to note that several recent attempts offer comprehensive insights on GNNs [19, 21]. Moreover, some outstanding works pay more attention to avoiding the unnecessary complexity of GCN, such as SGCN [33] and LightGCN [12]. In this work, we leverage a GCN-based encoder to learn the representations of query-webpage pairs for the downstream LTR task.

6 Conclusion

In this work, we design, implement and deploy a GCN-based LTR model LtrGCN at a large-scale industrial search engine to address the problem of extreme sparsity of links in query-webpage bipartite graphs and imbalance between queries and webpages for web-scale search. Specifically, LtrGCN utilizes two advanced sampling strategies to generate the Q-subgraphs and W-subgraphs from all query-webpage pairs in the first two steps. Then, LtrGCN leverages GCNs to extract feature vectors from Q-subgraphs and W-subgraphs for LTR as the representation of the query-webpage pair or ranking score prediction. The feature extraction and ranking scores prediction are optimized in an end-to-end manner, so as to enable discriminative feature extraction while preserving structural information in the bipartite graph. To demonstrate the effectiveness of LtrGCN, we conduct extensive offline and online experiments compared with a large number of baseline methods. Offline experiment results show that LtrGCN could achieve significant performance compared with other competitors. Furthermore, LtrGCN significantly boosts the online ranking performance at the industrial search engine, which is consistent with offline results.