
1 Introduction

In the era of information explosion, it is increasingly critical to quickly and accurately find query-related information from large document collections. Representation learning based retrieval has markedly improved retrieval accuracy and reshaped this field, which has been studied for decades [25]. Built on deep matching models [4] and state-of-the-art pre-trained frameworks, semantic retrieval has thrived as a typical application of representation learning.

An extensive collection of works, especially the deep Siamese models [4, 6], has been proposed to tackle the semantic retrieval task [11]. DSSM [8], CLSM [23], and ARC-I [5] explore traditional neural networks, while Sentence-BERT [18], ColBERT [10], and TwinBERT [13] take a further step by employing pre-trained language models such as BERT [2]. All these works highlight the appeal of deep Siamese structures. In particular, pre-trained BERT can effectively capture the contextual semantics of the query or document through self-attention [9], which significantly enhances the accuracy of Siamese semantic retrieval models.

However, it is hard for deep Siamese models to directly infer the keywords in a query, since there is no interaction between the query and the document. Keywords have been proven to play a unique and important role in information retrieval applications [19]. To allow pre-computation over massive document collections, the deep Siamese structure uses independent query and document encoders, which do not interact until the final layer that computes the similarity. Unfortunately, without any context from documents, the query encoder has difficulty weighting the words in the query appropriately, and the document encoder has the symmetric problem without any query information. Semantic representations that fail to identify keywords therefore degrade the similarity computation and the overall matching process.

Many previous studies improve query representations with global statistics of context information, introducing pre-computed prior knowledge such as BM25 [19] or TF-IDF [20] scores. However, such statistics cannot take contextual semantics into account, so they reflect word weights imprecisely: a word may be significant in one sentence but not in another. Using pre-computed statistics as prior knowledge can therefore lead to sub-optimal, or even poor, decisions.

To remedy these limitations, we introduce a multi-task learning based keywords weighted Siamese model (MKSM) for semantic retrieval. We propose a novel keywords identification model joined with the Siamese retrieval model to explicitly model the weights of adaptable keywords and obtain better representations for retrieval. Specifically, we formulate keyword identification as a regression problem so that contextual semantics, rather than rule-based statistics, determine the weights. The keyword identification model shares the same neural network with the Siamese model but has a different training loss. We therefore train the keyword identification model and the deep Siamese model jointly in a multi-task learning style so that they improve each other. Multi-task learning enables our solution to learn better keyword weights from both the retrieval signal and the regression target, yielding representations that encode keyword weights and support better matching.

To verify its effectiveness, we evaluate the proposed MKSM both in an online production environment and on public benchmark datasets. Specifically, MKSM has been deployed for the online service search scenario of a popular social application used by over 100 million users. The online A/B test in real production shows that MKSM concretely improves the user experience of service search in terms of click-through rate (CTR) and retrieval rate (RR). Furthermore, the empirical results on offline benchmarks also demonstrate considerable improvements over the baselines.

To summarize, our contributions are three-fold:

  • We introduce an adaptable keywords identification model to learn better representations for queries and documents.

  • We propose a novel semantic retrieval framework MKSM which joins the keywords identification method to a Siamese model for semantic retrieval in the form of multi-task learning.

  • Extensive online A/B tests and experiments on two offline benchmarks verify the effectiveness of our proposed model.

2 Related Works

A variety of deep matching models have been proposed for information retrieval problems [15]. Siamese models for semantic retrieval started from the Deep Structured Semantic Model (DSSM) [8], which mapped both query and document into the same semantic space and performed retrieval by maximizing cosine similarity. ARC-I [5] and CLSM [23] then replaced the fully connected networks with convolutional neural networks (CNNs) and max pooling to extract features, capturing more contextual information in the semantic vector representation. LSTM-DSSM [17] further replaced CNNs with Long Short-Term Memory (LSTM) networks to capture contextual information over longer sequences. Sentence-BERT [18] modified the pre-trained BERT network with Siamese and triplet structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity, reducing the effort of finding the most similar pair from 65 h with BERT/RoBERTa to about 5 s while maintaining BERT's accuracy. More recently, TwinBERT [13] used twin-structured BERT-like encoders to encode the query and document respectively, with a crossing layer combining the two embeddings into a similarity score. ColBERT [10] introduced a late interaction architecture that independently encodes the query and the document with BERT and then applies a cheap yet powerful interaction step to model their fine-grained similarity. Furthermore, [14] proposed a simple neural model that combines the efficiency of dual encoders with some of the expressiveness of more costly attentional architectures, and explored sparse-dense hybrids to capitalize on the precision of sparse retrieval. Accurately representing text and its contextual information has long been an active research direction and is the focus of MKSM. Neither the models above nor systems deployed in real businesses, such as [7] (from Facebook) and MOBIUS [3] (from Baidu), learn global contextual word weights.

3 Methods

In this section, we first provide the problem definition of the semantic retrieval task. Then we propose a comprehensive overview of our framework and further introduce the implementations of each component in the MKSM framework. Finally, we describe the multi-task training process of our structure.

Fig. 1. The hierarchical framework of MKSM. The framework is divided into two parts: the left part (Part A) illustrates the semantic retrieval process, while the right part (Part B) represents the keywords identification task.

3.1 Problem Definition

The semantic retrieval task can be described as a matching problem M that assigns a matching score to each query q and document d pair. Here, we use a single symbol d to stand for the entire document, which usually contains more than one field (e.g., name, description, etc.). Before calculating the matching score \(m_s\), every string is embedded as a semantic vector by some embedding method E, such as the BERT language model. In our framework, in addition to embedding the query and the document, the keywords identification model can be regarded as an independent function K. Hence, the keywords weighted Siamese model for semantic retrieval can be represented as Eq. (1).

$$\begin{aligned} m_s = match(q, d) = M(K(E_q(q)), K(E_d(d))). \end{aligned}$$
(1)

3.2 Framework Overview

From a horizontal view, MKSM comprises three parts, BERT semantic representation, keyword weight correction, and matching score calculation, as shown in Fig. 1. The framework, divided into a semantic retrieval part and a keywords identification part, starts by encoding the query, document, and text with a shared pre-trained BERT language model to obtain the corresponding embeddings (Emb). In the keywords identification module, a shared fully connected (FC) layer followed by a softmax is appended to learn the word weights (Weight). The Weight is then used to weigh the embeddings of the query and the various fields of the document to obtain better feature vectors (vec). The contrastive loss (with L\(_2\) normalization) measures the relevance of the query and document representations in the semantic retrieval part, while the mean squared error (MSE) loss is the objective of the keywords identification module. The total loss is the sum of the contrastive loss and the MSE loss weighted by two hyperparameters \(\alpha \) and \(\beta \), respectively.
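For concreteness, the following is a minimal TensorFlow sketch of one plausible reading of the shared weighting step: the [CLS] embedding is projected by the FC layer and softmax to per-token weights, which are then applied to the token embeddings. The names (`weight_layer`, `keyword_weights`, `weighted_representation`, `PADDING_SIZE`, `HIDDEN_DIM`) are illustrative placeholders, not the exact production implementation.

```python
import tensorflow as tf

PADDING_SIZE = 128   # maximum sequence length after padding (assumed)
HIDDEN_DIM = 768     # BERT hidden size (assumed)

# Shared weight layer L_k: FC layer followed by a softmax over token positions.
weight_layer = tf.keras.layers.Dense(PADDING_SIZE)

def keyword_weights(cls_embedding):
    """Map the [CLS] embedding (batch, HIDDEN_DIM) to token weights (batch, PADDING_SIZE)."""
    return tf.nn.softmax(weight_layer(cls_embedding), axis=-1)

def weighted_representation(token_embeddings, cls_embedding):
    """Weigh the token embeddings (batch, PADDING_SIZE, HIDDEN_DIM) by the learned weights."""
    w = keyword_weights(cls_embedding)                              # (batch, PADDING_SIZE)
    return tf.reduce_sum(token_embeddings * w[:, :, None], axis=1)  # (batch, HIDDEN_DIM)
```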

3.3 The Keywords Identification Model

Unlike statistical or rule-based keyword detection methods, we model the keywords identification task as a regression task that fits a word importance sequence \(s_k\) from an input sequence s.

For the offline public benchmark, we treat each positive document as one click and each negative document as no click to estimate the click rate. As shown in Fig. 2, we consider the click rates \(a_i\) of the documents and all related second-order queries \(s'\) to generate the keyword weights of the first-order query s. Specifically, we generate labels as follows (a code sketch of this query-side procedure is given after the list):

  1. We collect a large number of high-relevance retrieval logs containing the query s and remove all stop words from the logs.

  2. For the first-order query s, we dig out all clicked documents d and their corresponding click rates a.

  3. For each clicked document \(d_j\), we find all second-order queries that retrieve it, assuming that queries retrieving the same document have similar semantic meanings.

  4. We count the word frequencies \(f_{w_i}^{d_j}\) in the second-order queries of document \(d_j\) and normalize them as \(f_{w_i}^{d_j}=f_{w_i}^{d_j}/\sum _{i} f_{w_i}^{d_j}\).

  5. Finally, we combine the normalized word frequencies of all documents by weighted averaging to generate the keyword weight \(f_{w_i}=\sum _j a_j f_{w_i}^{d_j}/\sum _{i} f_{w_i}^{d_j}\).
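The pure-Python sketch below illustrates the query-side label-mining steps above under a simplified log schema, assumed to be a list of `(second_order_query, clicked_document, click_rate)` records for one first-order query; the schema, the stop-word list, and the final normalization over words are illustrative assumptions rather than the exact pipeline.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of"}  # placeholder stop-word list (step 1)

def mine_query_keyword_weights(click_logs):
    """Sketch of the label-mining procedure; step numbers refer to the list above."""
    doc_click_rate = {}
    doc_queries = defaultdict(list)
    for query, doc, click_rate in click_logs:          # steps 2-3: group second-order
        doc_click_rate[doc] = click_rate               # queries by the document they retrieved
        doc_queries[doc].append(query)

    doc_word_freq = {}                                 # step 4: per-document normalized
    for doc, queries in doc_queries.items():           # word frequencies
        freq = defaultdict(float)
        for q in queries:
            for w in q.split():
                if w not in STOP_WORDS:
                    freq[w] += 1.0
        total = sum(freq.values()) or 1.0
        doc_word_freq[doc] = {w: f / total for w, f in freq.items()}

    weights = defaultdict(float)                       # step 5: click-rate-weighted average
    for doc, freqs in doc_word_freq.items():
        for w, f in freqs.items():
            weights[w] += doc_click_rate[doc] * f
    total = sum(weights.values()) or 1.0               # final normalization over words (assumed)
    return {w: v / total for w, v in weights.items()}
```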

Similarly, we generate the document keyword labels by assuming that the documents clicked for the same query have similar semantic meanings. Specifically, we dig out all queries that retrieve the document and all other documents related to those queries. We count the word frequencies in all related documents and then normalize them according to the click rates and the sum of frequencies.

Fig. 2. Keyword weight label mining from second-order queries in history logs.

As presented in the right part of Fig. 1, the keywords identification model includes three components: (1) a representation component, (2) a weight layer (\(L_k\)), and (3) loss calculation. We use the embedding of the "[CLS]" token from the BERT sentence encoder as the initial representation. A fully connected layer whose hidden size equals the padding size is then appended to learn the word weights. Finally, the MSE loss in Eq. (2) is used to optimize the parameters of the weight layer.

$$\begin{aligned} loss_{keywords} = \frac{1}{P} \sum _{i=1}^{P}\left( l_{i} - L_k(BERT(s)_{[CLS]})_i\right) ^{2}, \end{aligned}$$
(2)

where \(l_i\) denotes the observed (mined) keyword weight of the i-th token of s, and P is the padding size.
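In TensorFlow terms, assuming `predicted_weights` is the weight-layer output on the [CLS] embedding (e.g., the `keyword_weights` sketch from Sect. 3.2; whether the softmax is applied before the MSE is our reading, not stated explicitly) and `labels` holds the mined keyword weights, Eq. (2) amounts to:

```python
import tensorflow as tf

def keyword_loss(labels, predicted_weights):
    """MSE objective of Eq. (2); both tensors have shape (batch, PADDING_SIZE).
    The result is averaged over tokens and over the batch."""
    return tf.reduce_mean(tf.square(labels - predicted_weights))
```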

3.4 The Siamese Retrieval Model

Like most deep matching models [3, 7, 13, 18, 24], the retrieval model employs a twin-structured Siamese framework, as shown in the left part of Fig. 1. The structure consists of a representation part and a matching part. The representation part has three layers of representation: (1) the initial representation, (2) the weighted representation (\(L_k^{\prime }\)), and (3) the final representation (\(L_f\)). The initial representation is the average of the BERT embeddings of all tokens. The weighted representation (\(L_k^{\prime }\)) is the initial representation combined with the weight layer (\(L_k\)) of the keywords identification model. The final representation is the input of the matching part, which computes the matching scores. We optimize the model by minimizing the contrastive loss (\(loss_{matching}\)) presented in Eq. (3).

$$\begin{aligned} loss_{matching} = \frac{1}{2N} \sum _{i = 1}^{N} y (\text {D}(r_q, r_d)_i)^{2} + (1 - y) \max (m - \text {D}(r_q, r_d)_i, 0)^{2}, \end{aligned}$$
(3)

where y is the relevance label, which equals 1 (relevant) or 0 (irrelevant), \(\text {D}\) is the Euclidean distance given by Eq. (4), and m is a margin threshold.

$$\begin{aligned} \text {D}(r_q, r_d) = \left\| r_q - r_d\right\| _{2} = \left( \sum _{i = 1}^{P}\left( r_q^{i} - r_d^{i}\right) ^{2}\right) ^{\frac{1}{2}} \end{aligned}$$
(4)
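A minimal TensorFlow sketch of Eqs. (3)-(4) is given below, assuming `r_q` and `r_d` are the final (L2-normalized) query and document representations and `y` holds the 0/1 relevance labels; the default margin value is a placeholder.

```python
import tensorflow as tf

def contrastive_loss(r_q, r_d, y, margin=1.0):
    """Contrastive loss of Eq. (3) with the Euclidean distance of Eq. (4)."""
    dist = tf.norm(r_q - r_d, axis=-1)                            # D(r_q, r_d), Eq. (4)
    pos = y * tf.square(dist)                                     # pull relevant pairs together
    neg = (1.0 - y) * tf.square(tf.maximum(margin - dist, 0.0))   # push irrelevant pairs apart
    return 0.5 * tf.reduce_mean(pos + neg)                        # 1/(2N) * sum over the batch, Eq. (3)
```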

3.5 The Multi-task Learning Strategy

As stated in Sects. 1, 3.3 and 3.4, to enable the keywords identification model to learn adaptive keyword weights, we train the keywords identification model and the Siamese retrieval model together. The combined loss function can be represented as Eq. (5),

$$\begin{aligned} loss = \alpha \times loss_{matching} + \beta \times loss_{keywords}, \end{aligned}$$
(5)

where \(\alpha \) and \(\beta \) are two hyper-parameters.

Training repeats the following two back-propagation steps until the Siamese model learns good representations of queries and documents (a TensorFlow sketch of one training step follows the list).

  1. First, back-propagate through the keyword identification model.

  2. Second, back-propagate through the Siamese retrieval model with the parameters of the shared weight layer fixed.
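The sketch below shows one such alternating training step in TensorFlow 2; `bert`, `encode`, the batch fields, the learning rate, and the exact variable partition are placeholders inferred from the description above (it reuses `weight_layer`, `keyword_loss`, and `contrastive_loss` from the earlier sketches), not the production implementation.

```python
import tensorflow as tf

ALPHA, BETA = 0.6, 0.4
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)  # the paper uses AdamW; plain Adam here for simplicity

def train_step(batch, bert, encode):
    # Step 1: back-propagate through the keyword identification branch
    # (shared BERT encoder + shared weight layer L_k).
    with tf.GradientTape() as tape:
        seq = bert(batch["tokens"])                       # assumed output shape (batch, seq_len, hidden)
        cls = seq[:, 0, :]                                # [CLS] embedding (assumed layout)
        pred = tf.nn.softmax(weight_layer(cls), axis=-1)  # FC + softmax, Sect. 3.2
        kw_loss = BETA * keyword_loss(batch["keyword_labels"], pred)
    kw_vars = bert.trainable_variables + weight_layer.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(kw_loss, kw_vars), kw_vars))

    # Step 2: back-propagate through the Siamese branch with the shared
    # weight layer excluded from the update (i.e., its parameters fixed).
    with tf.GradientTape() as tape:
        r_q = encode(batch["query_tokens"])               # weighted + final representation
        r_d = encode(batch["doc_tokens"])
        match_loss = ALPHA * contrastive_loss(r_q, r_d, batch["label"])
    siamese_vars = bert.trainable_variables               # weight_layer intentionally omitted;
                                                          # any extra final-layer variables would be added here
    optimizer.apply_gradients(zip(tape.gradient(match_loss, siamese_vars), siamese_vars))
    return kw_loss, match_loss
```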

4 Experiments

In this section, we first introduce the details of the experiment settings. Then we discuss the experiment results, including offline performance, online evaluation, and ablation study. Case study results and discussion on our work are in the supplementary materials.

Table 1. Dataset Statistics

4.1 Experiments Setup

Evaluation Datasets. We validate the performance of MKSM on two datasets: a public benchmark and a private one. MS MARCO [16] is a well-known benchmark from Microsoft sampled from Bing's search query logs. To construct our Private dataset, we extract daily service search logs from a popular instant messaging application and manually label the relevance; we additionally apply a label noise detection method based on confident learning [12] to purify this dataset. MS MARCO is an English dataset, while Private is a Chinese benchmark. Table 1 summarizes the detailed statistics of these two datasets.

Evaluation Metrics. Because our approach focuses on the matching stage of semantic retrieval, we choose Normalized Cumulative Gain (NCG) [22] as the evaluation metric. It is well suited to query-document matching because it reflects the number of relevant documents returned without considering their specific ranking. NCG is computed as

$$\begin{aligned} \text {NCG} = \frac{\text {CG}}{i\text {CG}}, \end{aligned}$$
(6)

where \(\text {CG}\) (Cumulative Gain) is the sum of all the relevance scores in the recall set, and \(i\text {CG}\) is the ideal \(\text {CG}\), i.e., the sum of the relevance scores of the ideal document recall set. Specifically, \(\text {CG}\) is defined as

$$\begin{aligned} \text {CG} = \sum _{i = 1}^{T}{relevance\_score_i}, \end{aligned}$$
(7)

where \(T\) is the size of the recall set (i.e., the truncation level in NCG@T).
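As a concrete illustration, NCG@T can be computed as in the plain-Python sketch below; the argument layout (per-query relevance lists) is an assumption for illustration.

```python
def ncg_at_t(retrieved_relevance, candidate_relevance, t):
    """Eqs. (6)-(7): CG over the top-t recall set, normalized by the ideal CG.

    retrieved_relevance: relevance scores of the documents the system returned.
    candidate_relevance: relevance scores of all candidate documents for the query,
    from which the ideal recall set is built.
    """
    cg = sum(retrieved_relevance[:t])                               # Eq. (7)
    icg = sum(sorted(candidate_relevance, reverse=True)[:t]) or 1.0
    return cg / icg                                                 # Eq. (6)
```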

Baselines. We compare our proposed MKSM framework with 6 representative retrieval baseline models. These methods can be categorized into the following classes (the detailed discussion of these methods is presented in Sect. 2):

  • Classical retrieval methods: TF-IDF [20] and BM25 [19].

  • Deep Siamese models: CLSM [23] and USE [1].

  • Pre-trained language model: BERT [2].

  • Keywords weighted model: BERT+TF-IDF [21].

Implementation Details. All implementations mentioned in this paper are based on TensorFlow. We train MKSM in parallel on 4 NVIDIA Tesla V100 GPUs. The semantic vector representations of queries and documents are based on the pre-trained BERT language model, with a padding size of 128 or 1024 for the two datasets. \(\alpha \) and \(\beta \) are set to 0.6 and 0.4, respectively. AdamW, an improved Adam optimizer, is used for training the MKSM framework. To make the inference phase more efficient, document embeddings are pre-computed offline. The cosine similarity between the query and document representations is used as the matching score.
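The inference path described above can be sketched as follows; `encode` is a placeholder for the trained MKSM query/document encoder and is not part of the paper's interface.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    """query_vec: (dim,); doc_matrix: (num_docs, dim); returns (num_docs,) matching scores."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    d = doc_matrix / (np.linalg.norm(doc_matrix, axis=1, keepdims=True) + 1e-12)
    return d @ q

# Offline: embed every document once and store the matrix in the index.
# doc_matrix = np.stack([encode(doc) for doc in corpus])
# Online: embed the query and score it against the pre-computed document vectors.
# scores = cosine_scores(encode(query), doc_matrix)
# top_k = np.argsort(-scores)[:10]
```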

4.2 Overall Performance

Table 2 shows the comparison results in terms of NCG@T, where underlined numbers are the best baseline results and bold numbers are the best results overall. MKSM and MKSM\(_{\text {SEP}}\) differ only in whether the keywords identification model and the Siamese retrieval model are trained jointly or separately.

From the results in Table 2, we make the following observations:

  • Our proposed MKSM framework achieves the best performance among all retrieval models on both benchmarks. Specifically, on the MS MARCO and Private datasets, MKSM obtains at least 0.9\(\%\) and 0.7\(\%\) improvement in NCG, respectively. These results indicate the superiority of MKSM over the baselines on both English and Chinese benchmarks with different document lengths.

    Table 2. Overall performance on MS MARCO and Private datasets
  • BERT+TF-IDF performs better than BERT on most metrics, indicating that prior knowledge in the query representation significantly improves retrieval performance. Moreover, both MKSM and MKSM\(_{\text {SEP}}\) outperform BERT+TF-IDF, which demonstrates that learned keyword identification provides better prior knowledge than the static TF-IDF statistics, whether trained separately or in the multi-task manner. This is because our keywords identification model provides keyword weight information that is essential for the retrieval task.

  • MKSM outperforms MKSM\(_{\text {SEP}}\) on all metrics, which indicates that the training strategy matters and that multi-task learning injects the prior knowledge into the Siamese model more effectively.

Fig. 3. Online evaluations.

4.3 Online Evaluation

We conduct A/B testing in the service retrieval scenario, comparing the proposed model MKSM with the current baseline, a distilled BERT.

The whole online experiment lasts 15 days. We monitor A/A testing for the first five days, run A/B testing for the following five days, and run A/A testing again in the last five days. 15% of the users are randomly selected as the experimental group, and another 15% form the control group. During A/A testing, all users are served by the distilled BERT. During A/B testing, users in the control group receive retrieval results from the distilled BERT, while users in the experimental group receive the MKSM semantic retrieval results. Note that the MKSM click experiment shares the same exposure as the distilled BERT so that any improvement can be attributed to the new semantic retrieval design.

Figures 3(a) and 3(b) show the improvement of the experimental group over the control group in click-through rate (CTR) and retrieval rate (RR), as defined in Eq. (8). The system is relatively stable in terms of CTR and RR during the A/A testing. During the A/B testing, which starts on day 6, a significant improvement over the distilled BERT baseline is clearly observed: the improvements in CTR and RR obtained by MKSM are at least 2\(\%\) and 6\(\%\), respectively. In the final five days, we conduct A/A testing again, replacing the MKSM framework with the distilled BERT, and the improvement decays rapidly, which further confirms the validity of the A/B testing result.

$$\begin{aligned} \text {CTR} = \frac{\#click}{\#exposure}, \text {RR} = \frac{\varDelta exposure}{\#exposure}, \end{aligned}$$
(8)

where \(\#\) denotes the number of clicks or exposures, and \(\varDelta exposure\) denotes the increment of good results in the exposures.

4.4 Ablation Study

In this subsection, to study the effectiveness of each component and confirm that MKSM is the best combination, we construct several variants of MKSM that differ in individual components: MKSM\(_{\text {SEP}}\), BERT, and MKSM\(_{\text {[CLS]}}\). Specifically, MKSM\(_{\text {SEP}}\) trains the keywords identification model and the Siamese model separately, BERT is the pure pre-trained language model without any prior knowledge, and MKSM\(_{\text {[CLS]}}\) uses the embedding of the [CLS] token in BERT as the initial representation of queries and documents. The performance comparison on the Private benchmark is presented in Table 3.

From the results, we can confirm that:

  • Compared with MKSM\(_{\text {SEP}}\), MKSM shows better performance, which indicates that the multi-task learning manner makes the Siamese model and the keywords identification model interact more effectively.

  • To demonstrate the superiority of the keywords identification model, we compare MKSM with BERT, a pure pre-trained language model without prior knowledge, and the results reflect the effectiveness of our keywords identification model. Moreover, in contrast to fixed statistical information (referred to as "BERT+TF-IDF" in Table 2), MKSM with the keywords identification model shows better retrieval performance, which further indicates the superiority of our proposed framework.

  • MKSM achieves better performance than MKSM\(_{\text {[CLS]}}\), which means that using the average of all token embeddings as the initial representation of queries and documents is slightly better than using only the [CLS] token embedding. Compared with the [CLS] embedding alone, the averaged output of the BERT encoder provides more useful information for keyword identification and retrieval.

Table 3. Ablation Study of MKSM

5 Conclusion

In this paper, we propose a novel semantic retrieval model, MKSM, which utilizes a keywords identification model and a multi-task learning strategy to introduce practical prior knowledge into a Siamese model. MKSM automatically learns the keywords in queries and documents by integrating the keyword weight layer, providing better final representations for calculating matching scores. Extensive experiments and rigorous analysis in online A/B tests and on offline benchmarks demonstrate that MKSM outperforms other modern deep matching models on semantic retrieval.