
1 Introduction

Pre-trained language models (PTMs) such as BERT [7], XLNet [26], and RoBERTa [15] have achieved promising results on various natural language processing tasks [17, 22, 28]. However, the domain-specific knowledge required for certain tasks is not sufficiently learned through pre-training on open-domain corpora. Incorporating external knowledge, such as knowledge graphs (KGs), can enhance PTMs’ performance on downstream tasks. Researchers have primarily focused on two approaches for integrating KGs into PTMs: embedding-based and task-based.

The embedding-based approach, such as ERNIE-Tsinghua [29], KEPLER [23] and KELM [1], employs entity embeddings or natural language descriptions of KGs for pre-training, while the task-based approach, such as LIBERT [10] and SentiLR [9], incorporates pre-training tasks to acquire factual knowledge. However, current task-based models only consider entity information at the word level, disregarding lexical-semantic level and relationship information. This limitation hampers their ability to capture comprehensive knowledge within KGs.

We introduce K-DLM, a domain-adaptive language model pre-training framework with a KG that combines the embedding-based and task-based approaches. K-DLM utilizes the masked language model (MLM) from BERT, employing entity and phrase level masking to pre-train on Chinese corpora [19, 20]. By integrating a common-sense knowledge base with the domain KG, K-DLM enhances both universal and domain-specific knowledge. It employs soft-labeling to predict entity categories and sememes of phrases, while learning relationship information through supervised contrastive learning. Additionally, we propose a novel entity replacement strategy to create positive and negative samples for relationship learning. Our experiments demonstrate the superior performance of K-DLM over previous models, particularly on domain-specific tasks, making it an effective pre-training framework for knowledge-driven problems involving domain KGs.

In this paper, we evaluate K-DLM on six tasks across 17 Chinese datasets in open and specific domains. The results show that K-DLM performs well on open-domain tasks, especially those involving sememes, and that the category information of entities is crucial for NER tasks. K-DLM also performs better on domain-specific tasks owing to its ability to utilize relationship knowledge. Overall, the main contributions of K-DLM are as follows:

  • We propose a domain-adaptive language model pre-training framework with a KG (K-DLM).

  • K-DLM can fully capture word and lexical-semantic level entity information as well as relationship information in the KG.

  • By incorporating external knowledge, K-DLM significantly outperforms previous models not only on all domain-specific tasks but also on most open-domain NLP tasks.

2 Related Works

2.1 Embedding-Based Approaches

KG embedding, as represented by TransE [2], models relationships by operating on low-dimensional embeddings of entities in KGs. ERNIE-Tsinghua [29] introduces the KG into pre-trained language models by combining the language and knowledge embeddings obtained by TransE. However, this approach presents a Heterogeneous Embedding Space (HES) problem where the language and knowledge embeddings are not obtained simultaneously. To address this issue more effectively, K-BERT [13] and CoLAKE [18] explicitly include knowledge triples in the training corpus to pre-train the language model and learn the knowledge representation concurrently. KEPLER [23] learns entity representation directly from entity description text and combines it with relationship embedding obtained by TransE. Furthermore, KELM [1] converts knowledge triples into fluent and natural sentences and adds them to the corpus of the pre-training model. By transforming heterogeneous KGs into text, the vector-space of knowledge representation becomes more consistent with that of language representation.

Fig. 1. Overview of the proposed Domain-Adaptive Language Model Pre-Training Framework with KG (K-DLM)

2.2 Task-Based Approaches

Since the release of BERT [7], various pre-training tasks have been proposed for different purposes, including learning external knowledge. ERNIE-Baidu [19] improved BERT’s masking strategy to incorporate entity information from KGs. To overcome the limitations of predicting only single words when masked, ERNIE-Baidu masks all tokens that compose a complete phrase or entity simultaneously. SentiLR [9] extends MLM to Label-Aware MLM by adding emotional polarity to each word, while SenseBERT [11] predicts masked words and their super senses in WordNet [16] simultaneously to integrate semantic KGs into pre-trained language models. WKLM [24] replaces entities in the sentence with the same type of entities in Wikipedia and trains the model to recognize these replacements.

3 Method

In this section, we introduce K-DLM, a framework consisting of three steps: pre-processing and two-stage pre-training. The overall architecture of K-DLM is illustrated in Fig. 1.

3.1 Knowledge Graph Fusion

Before two-stage pre-training, we merge the domain-specific KG with HowNet, guided by two fundamental principles:

  • We classify entities into their corresponding sememes in HowNet when the categories of entities in the KG align with those in HowNet.

  • We retain the categories in the domain-specific KG when the categories of entities in the KG do not align with those in HowNet.

We refer to the categories of entities and sememes of phrases in the fused KG collectively as "sememes".
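To make the two fusion rules concrete, the following is a minimal sketch in Python. The dictionary-based representations of the domain KG and of the HowNet alignment (domain_kg, hownet_category_to_sememe) are hypothetical, since neither ServiceKG nor the fusion code is released with the paper.

```python
# Minimal sketch of the two fusion rules above. The data structures are
# hypothetical: the real ServiceKG and HowNet interfaces are not public.
def fuse_kg(domain_kg, hownet_category_to_sememe):
    """Assign each entity a label ("sememe") in the fused KG.

    domain_kg: dict mapping entity -> domain-specific category
    hownet_category_to_sememe: dict mapping an aligned category -> HowNet sememe
    """
    fused = {}
    for entity, category in domain_kg.items():
        if category in hownet_category_to_sememe:
            # Rule 1: the category aligns with HowNet, so classify the entity
            # into the corresponding HowNet sememe.
            fused[entity] = hownet_category_to_sememe[category]
        else:
            # Rule 2: no alignment, so retain the domain-specific category.
            fused[entity] = category
    return fused
```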

Fig. 2. Construction of the mapping from characters to sememes. The set of possible sememes of each character is the union of the possible sememes of all words that contain it

3.2 Masked Language Model and Chinese Word Sememe Prediction

During the first pre-training stage, K-DLM undergoes two tasks: masked language modeling and Chinese word sememe prediction. For masked language modeling, entity and phrase level masking strategies are employed. In the Chinese word sememe prediction task, K-DLM is trained to predict sememes in the fused KG. To accommodate KG integration, we modify the embedding layer and pre-training objective while utilizing the Transformer Encoder [21].

Embedding Layer. The embedding layer combines multiple embeddings to generate the input representation. We modify the input embedding \(E_{word}\) by summing four embeddings:

$$\begin{aligned} E_{word} = E_{tok} + E_{sem} + E_{seg} + E_{pos} \end{aligned}$$
(1)

where \(E_{seg}\) and \(E_{pos}\) follow the original BERT. For \(E_{tok}\), we utilize entity and phrase level masking strategies, masking entire words instead of individual Chinese characters. To capture the linguistic characteristics of Modern Chinese, we introduce a new split-and-merge mapping strategy. Let \(X=(x_0, x_1, \cdots , x_n)\) denote the vocabulary indices of a sentence \((c_0, c_1, \cdots , c_n)\), where n is the sentence length and \(x_i \in \mathbb {R}^{D_W}\). \(E_{sem}\) is computed using a two-layer mapping:

$$\begin{aligned} E_{sem} = SMX \end{aligned}$$
(2)

where \(M \in \mathbb {R}^{D_S\times D_W}\) is a static mapping from characters to the union of their possible sememes, constructed from the fused KG, \(S \in \mathbb {R}^{d\times D_S}\) is a learnable mapping from sememes to the internal Transformer dimension d, \(D_W\) is the vocabulary size, and \(D_S\) is the size of the sememe vocabulary. Figure 2 illustrates an example of the construction process for M.
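As an illustration of Eq. (2), the PyTorch sketch below builds the static map M from a hypothetical char_to_sememes dictionary (character id to the ids of all its possible sememes, i.e., the union in Fig. 2) and learns S as a linear projection. The class and argument names, and the multi-hot formulation, are our assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class SememeEmbedding(nn.Module):
    """Sketch of E_sem = S M X (Eq. 2). M is a static multi-hot map from each
    character to the union of its possible sememes; S is a learnable projection
    from the sememe space to the Transformer hidden dimension d."""

    def __init__(self, char_to_sememes, vocab_size, num_sememes, hidden_dim):
        super().__init__()
        # M in R^{D_S x D_W}: M[s, c] = 1 if sememe s is possible for character c.
        M = torch.zeros(num_sememes, vocab_size)
        for char_id, sememe_ids in char_to_sememes.items():
            M[sememe_ids, char_id] = 1.0      # union over all words containing c
        self.register_buffer("M", M)          # static: built from the fused KG
        # S in R^{d x D_S}: learnable mapping from sememes to dimension d.
        self.S = nn.Linear(num_sememes, hidden_dim, bias=False)

    def forward(self, input_ids):             # input_ids: (batch, seq_len)
        multi_hot = self.M.t()[input_ids]     # (batch, seq_len, D_S)
        return self.S(multi_hot)              # E_sem: (batch, seq_len, d)
```

The full input representation of Eq. (1) is then obtained by summing this output with the token, segment, and position embeddings.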

Fig. 3. Chinese word sememe prediction

Pre-Training Objective. We enhance the original Masked Language Model (MLM) used in BERT by introducing entity and phrase level masking. Specifically, we mask all Chinese characters that belong to a complete word and require the model to recover the entire word during MLM pre-training. In addition, we predict the allowed sememes of whole Chinese words rather than of individual characters. Thus, we propose \(\mathcal {L}_{CSP}\) for Chinese word sememe prediction:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{CSP}=&-\log \sum _{ s\in \bigcap \limits _{c\in w}PS(c)}p(s|context) \\&-\sum _{ s\in \bigcap \limits _{c\in w}PS(c)} \frac{1}{|\bigcap \limits _{c\in w}PS(c)|} \log p(s|context) \end{aligned} \end{aligned}$$
(3)

where c is a character-level masked token, w is the whole Chinese word to which c belongs, and PS(c) denotes the set of possible sememes of c. The second penalty term encourages the model to assign near-uniform probability to all allowed sememes of the masked token, which improves generalization. The output embeddings of tokens belonging to the same word are averaged according to the word segmentation boundaries and used to predict all possible sememes of the whole Chinese word. Figure 3 illustrates Chinese word sememe prediction. Taking the MLM task and the CSP task together as the objective of the first-stage pre-training, we introduce the entities in the KG into K-DLM:

$$\begin{aligned} \mathcal {L}_{stage1}=\mathcal {L}_{MLM}+\mathcal {L}_{CSP} \end{aligned}$$
(4)
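For clarity, a minimal sketch of the CSP loss in Eq. (3) and of the stage-one objective in Eq. (4) is given below. The tensor names, the prior averaging of masked-token outputs into sememe_logits, and the small numerical clamps are our assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def csp_loss(sememe_logits, allowed_mask):
    """Sketch of Eq. (3). For each masked word:
       sememe_logits: (num_words, D_S) scores over sememes, obtained by averaging
                      the output embeddings of the word's masked characters.
       allowed_mask:  (num_words, D_S) binary mask of the allowed sememes, i.e.
                      the intersection of PS(c) over the characters c of the word."""
    log_probs = F.log_softmax(sememe_logits, dim=-1)
    probs = log_probs.exp()
    # First term: negative log of the total probability mass on allowed sememes.
    term1 = -torch.log((probs * allowed_mask).sum(dim=-1).clamp_min(1e-9))
    # Second term: cross-entropy against a uniform target over allowed sememes,
    # pushing their predicted probabilities toward equality.
    uniform = allowed_mask / allowed_mask.sum(dim=-1, keepdim=True).clamp_min(1.0)
    term2 = -(uniform * log_probs).sum(dim=-1)
    return (term1 + term2).mean()

# Stage-1 objective (Eq. 4): loss = mlm_loss + csp_loss(sememe_logits, allowed_mask)
```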
Fig. 4. The construction of positive and negative samples. The positive sample is obtained by synonymous replacement. Negative sample 1 illustrates entity replacement based on unique relationships, while negative sample 2 demonstrates replacement using non-unique relationships

3.3 Contrastive Learning

To incorporate entity relationships into K-DLM, we conduct the second stage of pre-training due to the inherent complexity of learning explicit relationships through MLM tasks. Following the contrastive framework of SimCLR [3], we introduce a novel replacement strategy for relationship types in the knowledge graph, generating positive and negative examples. Our second-stage pre-training objective employs a cross-entropy loss with in-batch negatives [4].
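The second-stage objective can be sketched as follows. The use of cosine similarity, the temperature value, and the pooling of sentences into fixed-size embeddings are illustrative assumptions; the cross-entropy over in-batch negatives follows [3, 4].

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """Cross-entropy with in-batch negatives (sketch). anchor_emb and
    positive_emb are (batch, dim) sentence embeddings of original sentences and
    their positive replacements; for each anchor, the positives of the other
    sentences in the batch act as negatives."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.t() / temperature          # (batch, batch)
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)                # diagonal = true pairs
```

Hard negatives produced by the replacement strategies described next can be encoded in the same way and appended as additional columns of the similarity matrix.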

Replacement Strategy. In the fused KG, relationships encompass both general semantic relationships (e.g., synonyms, hypernyms, and hyponyms) and domain-specific relationships between entities. To connect corpus entities to the KG, we employ an off-the-shelf entity linking tool and discard sentences without entities. For an entity e, we define a positive candidate set P(e) and a negative candidate set N(e). We propose three replacement strategies for relationships:

  • Semantic Relationship Replacement: Synonymous replacements are considered positive samples, while hypernymous and hyponymous replacements are treated as semantic changes and used to construct negative samples. Specifically, synonyms of e are added to P(e), and hypernyms and hyponyms of e are added to N(e).

  • Unique Relationship Replacement: This strategy is applied when a sentence contains multiple entities. We design a replacement strategy for each entity based on relationship uniqueness. For entities \(e_1\) and \(e_2\) in a sentence, assuming \(e_1\) remains constant and \(e_2\) is replaced, the relationship between \(e_1\) and \(e_2\) is denoted as r, with type t. If r is the unique relationship of type t for \(e_1\), we select an entity of the same type as \(e_2\) from the KG and add it to \(N(e_2)\). To increase task difficulty, we calculate the edit distance between the original entity and each entity of the same type, randomly selecting from the 10 entities with the smallest edit distance as replacements.

  • Non-unique Relationship Replacement: This strategy is employed when \(e_1\) has relationships with other entities in the KG, excluding r, of type t. Now, r is considered a non-unique relationship for \(e_1\). We randomly select an entity from all entities with the same relationship to \(e_1\) and add it to \(N(e_2)\).

Figure 4 illustrates the creation of positive and negative samples. The "direct" relationship between directors and films is used for illustration only and is not included in ServiceKG.
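A sketch of the hard-negative sampling used in unique-relationship replacement is given below. The function names and the plain Levenshtein implementation are ours; the rule of choosing among the 10 same-type entities with the smallest edit distance follows the description above.

```python
import random

def edit_distance(a, b):
    """Levenshtein distance computed with a single rolling row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def sample_hard_negative(entity, same_type_entities, k=10):
    """Unique-relationship replacement: pick one of the k same-type KG entities
    closest to the original entity in edit distance as a hard negative."""
    candidates = sorted(
        (e for e in same_type_entities if e != entity),
        key=lambda e: edit_distance(entity, e),
    )
    return random.choice(candidates[:k])
```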

4 Experiments

In this section, we present the details of the training setup and conduct experiments on 17 Chinese datasets, of which 13 are open-domain and 4 are domain-specific, to answer the following research questions:

  • RQ1: What role do sememes and the categories of phrases and entities play in open-domain classification tasks?

  • RQ2: How does our proposed method perform compared with other approaches to introducing KGs into pre-trained language models?

  • RQ3: Can our proposed method benefit from a domain KG on domain-specific tasks?

4.1 Experiment Setup

Pre-training Corpora. To evaluate our proposed method and compare it with previous works [6, 13], we pre-train our model on five Chinese corpora: WikiZh, used to train bert-base-chinese in [7]; WebtextZh, used to train K-BERT in [13]; Sogou baike; Baike QA; and NewsZh. Their total size is similar to that of the pre-training corpus of Chinese RoBERTa-wwm-ext in [6], which is not publicly available. We construct our pre-training corpus, named "ext", from these five encyclopedia, QA, and news corpora to ensure a fair comparison with the pre-training corpus of Chinese RoBERTa-wwm-ext.

Knowledge Graph. We utilize HowNet [8] as our source of common sense knowledge. Unlike K-BERT [13], we integrate only some concepts and sememes from HowNet into the domain KG, named ServiceKG. ServiceKG is constructed from billions of search logs and contains around 60k nodes of ten types and 200k relations of five types. Due to copyright reasons, we cannot publish the complete ServiceKG. The sememes and relations from HowNet are used in all our experiments, including the results marked as ServiceKG in Table 3. Limited by model scale and computing power, we selected the top 97 sememes out of 2,196, which cover over 70% of the sememe labels in HowNet.
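The truncation of the sememe vocabulary can be sketched as a simple frequency-coverage selection; the function below is illustrative only, since the exact selection procedure used to obtain the 97 sememes is not specified beyond the coverage figure.

```python
from collections import Counter

def select_top_sememes(sememe_labels, coverage=0.70):
    """Keep the most frequent sememes until they cover the target fraction of
    all sememe label occurrences (a sketch of the truncation described above)."""
    counts = Counter(sememe_labels)
    total = sum(counts.values())
    kept, covered = [], 0
    for sememe, freq in counts.most_common():
        kept.append(sememe)
        covered += freq
        if covered / total >= coverage:
            break
    return kept
```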

Baselines. We compare our proposed K-DLM with five baselines in this paper:

  • Google BERT, the official BERT (Chinese) pretrained on WikiZh [7].

  • Chinese RoBERTa-wwm-ext, the RoBERTa-like BERT pretrained on ext corpus [6].

  • K-BERT, the K-BERT pretrained on WikiZh and WebtextZh, utilizing HowNet as a KG [13].

  • RoBERTa-wwm, our implementation of RoBERTa-wwm pretrained on WikiZh and WebtextZh.

  • RoBERTa-wwm-ext, our implementation of RoBERTa-wwm-ext pretrained on our ext corpus.

Evaluation Benchmarks. To evaluate the performance of our proposed K-DLM, we conducted experiments on 17 datasets belonging to six natural language understanding tasks, as follows:

  • Natural Language Inference: CMNLI [25], XNLI [5].

  • Winograd Schema Challenge: CLUEWSC2020 [25].

  • Semantic Similarity: AFQMC [25], LCQMC [14], CSL [25].

  • Sentiment Analysis: Book-Review [13], Chnsenticorp [13], Shopping [13], Weibo [13].

  • Named Entity Recognition: MSRA-NER [12], Finance-NER [13], Medicine-NER [13].

  • Text Multi-Class Classification: TNEWS [25], IFLYTEK [25], Affair-CLS, Service-CLS.

Out of the 17 datasets mentioned above, Finance-NER, Medicine-NER, Affair-CLS, and Service-CLS are domain-specific, while the rest are open-domain. Affair-CLS and Service-CLS are five-class datasets from two vertical search scenarios within WeChat, a platform with billions of daily active users. Affair-CLS is related to government affairs search, while Service-CLS is related to service search. In our experiments, we perform coarse-grained binary classification by grouping labels 0 to 2 as weakly relevant and labels 3 and 4 as strongly relevant.

Training Details. To clearly demonstrate the effect of introducing the KG, we used the pre-training configuration of RoBERTa-wwm [6], which employs the WordPiece encoding scheme. Our hyperparameters aligned with those of Google BERT, with the model size matching the \(BERT_{base}\) configuration (L=12, H=768, A=12), for a total of 102M parameters. Inputs were constructed as DOC-SENTENCES as in RoBERTa-wwm. We utilized the LAMB optimizer [27] with a batch size scaled from 512 to 32K on 128 Tesla V100 GPUs with 32 GB VRAM. The model was pre-trained for 15,625 steps with an initial learning rate of 5e-3, using a warm-up strategy for the first 20% of steps followed by a linear decay of the learning rate.
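The warm-up and linear-decay schedule can be reproduced with a standard LambdaLR wrapper, as sketched below. The optimizer in the usage note is a placeholder, since PyTorch does not ship LAMB; the constants mirror the settings reported above.

```python
import torch

TOTAL_STEPS = 15_625
WARMUP_RATIO = 0.20
PEAK_LR = 5e-3

def build_scheduler(optimizer, total_steps=TOTAL_STEPS, warmup_ratio=WARMUP_RATIO):
    """Linear warm-up for the first 20% of steps, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage (AdamW as a stand-in for LAMB, which PyTorch does not provide):
# optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)
# scheduler = build_scheduler(optimizer)
```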

Table 1. Results of various models on classification tasks in CLUE benchmark (Acc.%)
Table 2. Results of various models on seven open-domain tasks (F1% for MSRA-NER, Acc.% for others)

4.2 Overall Performance

In this section, we compare our proposed K-DLM with the baseline pre-trained language models described above.

Classification Tasks in CLUE Benchmark (RQ1). We compare our K-DLM with Chinese RoBERTa-wwm-ext [6] to highlight the advantages of incorporating sememe and category knowledge. Table 1 presents the results on the development sets of six sentence classification tasks in the CLUE benchmark [25]. The performance of Google BERT and Chinese RoBERTa-wwm-ext is obtained from the public leaderboard of the CLUE benchmark; for K-DLM, we use the hyperparameters provided by Chinese RoBERTa-wwm-ext without further tuning. Our observations can be categorized as follows:

  • Sememes help correctly classify sentences with categories based on common-sense phrases or entities, showing a positive impact on tasks such as TNEWS, CMNLI, CLUEWSC2020, and CSL.

  • In contrast, when domain-specific entities consisting of common characters affect the category (AFQMC and IFLYTEK), sememes lead to incorrect category predictions, suggesting that adding these entities to the KG could resolve the issue.

Table 3. Results of various models on specific-domain tasks (%)

Other Open-Domain Tasks (RQ2). To compare our K-DLM with K-BERT, which is equipped with HowNet, we pre-trained both RoBERTa-wwm and our K-DLM on the same corpus as K-BERT (HowNet) as stated in [13]. We then evaluated the models on seven open-domain tasks, with each dataset divided into train, dev, and test subsets. We fine-tuned the models on the train subset, selected the best model based on the dev subset, and evaluated its performance on the test subset. The experimental results are presented in Table 2.

  • K-DLM did not show a significant performance improvement for sentiment analysis tasks (i.e., Book-Review, Chnsenticorp, and Shopping) because sentiment mainly relies on emotion words and negations rather than knowledge. Moreover, for colloquial-style sentences from social media (i.e., Weibo), inaccurate sememe predictions impaired the model’s ability to judge emotions.

  • K-DLM outperforms K-BERT on common knowledge-dependent tasks (XNLI, LCQMC, and MSRA-NER) by addressing a problem encountered by K-BERT. Fine-tuning K-BERT requires word segmentation and NER, which introduce errors and restrict knowledge utilization. In contrast, our character-level sememe incorporation in K-DLM benefits downstream tasks without entity linking and improves decision-making for entities outside the KG.

Specific-Domain Tasks (RQ3). We conduct experiments on four specific-domain tasks to assess whether domain KG benefits K-DLM. Following the experiment setup in [13] for the Finance-NER and Medicine-NER tasks, we evaluate models equipped with HowNet on these tasks. For our self-developed ServiceKG, we fine-tune the RoBERTa-wwm as described in Sect. 4.2 using the method proposed by K-BERT with this KG, and obtain K-BERT (ServiceKG) for comparison with our K-DLM. The results are summarized in Table 3.

  • In domain-specific NER tasks, HowNet’s financial and medical knowledge aids entity identification in sentences. Our K-DLM outperforms K-BERT (HowNet) in terms of precision and F1 score, indicating its successful classification of entities into correct categories with the assistance of sememes.

  • For our query intention classification tasks (Affair-CLS and Service-CLS), we focus more on the relationships between entities (i.e., services offered to something) than on the types of entities in the query. Therefore, phrase and entity level masking has no significant effect. Because queries are relatively short, adding entity-relation-entity triples into a query, as K-BERT does, can cause non-negligible semantic drift. In contrast, our K-DLM introduces relations by replacement without changing the sentence length, resulting in improved performance in short text classification.

5 Conclusion

In summary, our proposed K-DLM framework utilizes a Chinese soft-label scheme, split-and-merge mapping strategy, and replacement-based relation injection strategy for short text processing. This approach enhances the utilization of sememes and category information, leading to improved model performance in vertical domain query understanding while avoiding semantic drift. Experimental results demonstrate that using sememes of Chinese words enhances the performance of open-domain classification tasks relying on common knowledge. We conducted comparisons with alternative KG introduction methods to assess efficacy and applicability. Additionally, our approach enables adaptability to tasks across diverse domains by leveraging domain-specific KGs during pre-training.