
1 Introduction

Recently, the introduction of pre-trained language models (PLMs), including GPT [18], BERT [3], and ELMo [17], among many others, has brought tremendous success to natural language processing (NLP) research. Typically, the training of such a model consists of two successive stages: a pre-training phase and a fine-tuning phase. During the pre-training phase, the model is first pre-trained on an unsupervised dataset; during the fine-tuning phase, it is then fine-tuned on downstream supervised NLP tasks. To date, these models have obtained the best performance on various NLP tasks; some of the most prominent examples are BERT and the BERT-based SpanBERT [5] and ALBERT [8]. These PLMs are trained on large unsupervised corpora through unsupervised training objectives. However, it is not obvious that the model parameters obtained during the unsupervised pre-training phase are well-suited to support this kind of transfer learning. In particular, when only a small amount of supervised data is available for the target NLP task, fine-tuning the pre-trained model is potentially brittle. Furthermore, supervised fine-tuning of the pre-trained model requires substantial amounts of task-specific supervised training data, which are not always available. For example, in the GLUE benchmark [25], the Winograd Schema dataset [9] has only 634 training examples, too few for fine-tuning a natural language inference (NLI) task. Moreover, although PLMs such as BERT can learn contextualized representations across many NLP tasks (i.e., they are task-agnostic), leveraging PLMs alone still leaves domain-specific challenges unresolved: BERT is trained on general-domain corpora only and captures general language knowledge from its training data, but severely lacks domain- or task-specific data. For example, texts in the financial domain often contain unique vocabulary, such as stock and bond types, and the available labeled datasets are very small (sometimes only a few hundred samples). In this paper, to overcome the aforementioned issues, we propose a novel three-stage BERT architecture (called PPBERT), in which we add a second stage of training, namely ‘post-training’, to improve the original BERT architecture.

Typically, there are two directions to pursue a new state of the art in the era of pre-trained language models. One is to construct novel neural network architectures on top of PLMs, such as BERTserini [26] and BERTCMC [15]. The other is to optimize pre-training itself, as in GPT 2.0 [18], MT-DNN [10], SpanBERT [5], and ALBERT [8]. In this paper, we present another method to improve PLMs. We present a ‘pre-training’+‘post-training’+‘fine-tuning’ three-stage paradigm and further present a language model named PPBERT. Compared with the original BERT architecture, which is based on the standard ‘pre-training’+‘fine-tuning’ approach, we do not fine-tune the pre-trained model directly, but rather first post-train it on domain- or task-related training data, which helps to better incorporate task-awareness and domain-awareness knowledge into the pre-trained model and also reduces bias from the training data. More specifically, our framework involves three sequential stages: a pre-training stage on large-scale corpora (see Subsect. 2.1), a post-training stage on task- or domain-related datasets via a multi-task continual learning method (see Subsect. 2.2), and a fine-tuning stage on target datasets, even with few or no labeled samples (see Subsect. 2.3). Thus, PPBERT benefits from a regularization effect, since it leverages cross-domain or cross-task data, which helps the model generalize better with limited data and adapt to new domains or tasks.

To sum up, on a wide variety of tasks our proposed post-training process outperforms the existing BERT benchmark, and it achieves substantially better performance on small datasets and domain-specific tasks in particular. Specifically, we compare our model with BERT baselines on the GLUE and SuperGLUE benchmark tasks and consistently and significantly outperform BERT on all 16 tasks (8 GLUE tasks and 8 SuperGLUE tasks), raising the GLUE average score to 87.02 (an absolute improvement of 2.97 over BERT) and pushing the SuperGLUE average score to 74.55 (an absolute improvement of 5.55). More remarkably, our model is more flexible and pluggable: the post-training approach can be directly plugged into other BERT-based PLMs. In our ablation studies, we plug the post-training strategy into the original BERT (i.e., PPBERT) and its variant ALBERT (called PPALBERT), respectively. Our approaches advance the SOTA results for five popular question answering datasets, surpassing previous pre-trained models by at least 1 point in absolute accuracy. Moreover, in further ablation studies, the best model obtains SOTA results on small datasets (1/20 of the training set). All of these results clearly demonstrate the exceptional generalization capability of our proposed three-stage paradigm via post-training.

Fig. 1.

An illustration of the architecture of our PPBERT, a ‘pre-train’-‘post-train’-then-‘fine-tune’ three-stage BERT. Compared with the standard BERT architecture, which follows the two-stage ‘pre-train’-then-‘fine-tune’ paradigm, we do not fine-tune the pre-trained model directly, but rather add a second stage of training (called ‘post-training’). More specifically, during the pre-training stage we first conduct unsupervised pre-training on a large-scale corpus; during the post-training stage we post-train the pre-trained model on task- or domain-related datasets; and during the fine-tuning stage we fine-tune on downstream supervised NLP tasks.

2 The Proposed Model: PPBERT

As shown in Fig. 1, the standard BERT is built on a two-stage paradigm, ‘pre-training’+‘fine-tuning’. Compared with this traditional approach, PPBERT does not fine-tune the pre-trained model directly after pre-training, but rather continues to post-train it on task- or domain-related corpora, which helps to reduce bias. During post-training, the PPBERT framework continuously updates the pre-trained model. The overall architecture of PPBERT is shown in Fig. 1, and a minimal sketch of the resulting training pipeline is given below.
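To make the three-stage flow concrete, the following is a minimal sketch assuming PyTorch and the Hugging Face Transformers library; the checkpoint name is only an example of a pluggable backbone, and `post_train` and `fine_tune` are hypothetical helpers standing in for the procedures of Subsect. 2.2 and Subsect. 2.3, not the authors' released code.

```python
# Minimal sketch of the 'pre-train'-'post-train'-then-'fine-tune' pipeline.
# Assumptions: PyTorch + Hugging Face Transformers; `post_train` and `fine_tune`
# are hypothetical helpers standing in for Sect. 2.2 and Sect. 2.3.
from transformers import AutoModel, AutoTokenizer

BACKBONE = "bert-base-uncased"   # pluggable: e.g. "albert-base-v2" for a PPALBERT-style model

def build_ppbert(post_training_tasks, target_dataset):
    # Stage 1 (pre-training): reuse an existing pre-trained checkpoint as-is.
    tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
    backbone = AutoModel.from_pretrained(BACKBONE)

    # Stage 2 (post-training): continue training on task/domain-related data
    # (CoQA, SNLI, YELP in this paper) via multi-task continual learning.
    backbone = post_train(backbone, post_training_tasks)      # hypothetical helper (Sect. 2.2)

    # Stage 3 (fine-tuning): adapt the post-trained model to the target task.
    return fine_tune(backbone, tokenizer, target_dataset)     # hypothetical helper (Sect. 2.3)
```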

2.1 Pre-training

The training procedure of our proposed PPBERT involves two training stages before fine-tuning: a pre-training stage and a post-training stage. As BERT outperforms most existing models, we do not re-implement it but focus on the second training stage, post-training. The pre-training procedure follows that of the BERT model: we first use the original BERT and then adopt a joint post-training method to enhance it. Thus, our proposed PPBERT is more flexible and pluggable, as the post-training approach can be plugged into other BERT-based language models, such as ALBERT [8] and SpanBERT [5], rather than being applied only to the original BERT.

2.2 Post-training

Compared with the original BERT architecture, which follows the two-stage ‘pre-training’+‘fine-tuning’ paradigm, we do not fine-tune the pre-trained model directly, but rather first post-train it on task- or domain-related training data. That is, we add a second training stage, the ‘post-training’ stage, on an intermediate task before target-task fine-tuning.

Training Details. The aim of the post-training stage is to continuously train the pre-trained model on task- or domain-related annotated data, learning task or domain knowledge from different post-training tasks by keeping the pre-trained model updated. This brings a major challenge: how to train these post-training tasks in a continual way, and efficiently post-train on a new task without forgetting the knowledge learned before.

Inspired by [2, 22] and [16], which show that continual learning can train a model on several tasks in sequence, we find that the standard continual learning method trains the model on only one task at a time, with the drawback that it easily forgets previously learned knowledge. Likewise, inspired by [10, 12] and [4, 13], which show that multi-task learning allows different training corpora to train sub-parts of a neural network, we find that although multi-task learning can train multiple tasks at the same time, it requires all customized pre-training tasks to be prepared before training can proceed, so this method takes as much time as continual learning, if not more. We therefore present a multi-task continual learning method to tackle this problem. More specifically, whenever a new post-training task arrives, the multi-task continual learning method first initializes the model with the previously learned parameters and then trains the newly introduced task together with the original tasks simultaneously, which ensures that the learned parameters encode the previously acquired knowledge. More crucially, during post-training we allocate K training iterations to each task and distribute these K iterations across different stages of training. In addition, instead of updating parameters over a whole batch, we divide a batch into several sub-batches and accumulate gradients on those sub-batches before a parameter update, which allows a smaller sub-batch to be consumed in each iteration and is more conducive to fast iteration with distributed training. As a result, PPBERT can continuously update the pre-trained model using the multi-task continual learning method, which guarantees the efficiency of post-training without forgetting previously learned knowledge. A minimal sketch of this training loop is given below.
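As an illustration of this procedure, the sketch below shows one way to implement the multi-task continual post-training loop with per-task iteration budgets and gradient accumulation over sub-batches, assuming PyTorch; the task interface (`task.name`, `task.loader`, `task.compute_loss`), the values of K and the number of sub-batches, and the optimizer settings are our own assumptions rather than the paper's exact training code.

```python
# Minimal sketch of multi-task continual post-training with gradient accumulation.
# Assumptions: PyTorch; each task object exposes a name, a dataloader via `task.loader`,
# and a `task.compute_loss(backbone, batch)` method; K, NUM_SUB_BATCHES, and the
# optimizer settings are illustrative.
import itertools
import torch

K = 10000              # training iterations allocated to each task (illustrative value)
NUM_SUB_BATCHES = 4    # each batch is split into this many sub-batches before one update

def post_train(backbone, tasks, lr=2e-5):
    """Post-train `backbone`, introducing tasks one at a time but always training the
    newly introduced task together with all previously introduced tasks."""
    optimizer = torch.optim.AdamW(backbone.parameters(), lr=lr)
    active = []                                    # tasks introduced so far
    for new_task in tasks:
        active.append(new_task)                    # new task starts from current parameters
        streams = {t.name: iter(itertools.cycle(t.loader)) for t in active}
        for _ in range(K):                         # K iterations budgeted per task
            optimizer.zero_grad()
            for task in active:                    # old and new tasks trained together
                for _ in range(NUM_SUB_BATCHES):   # accumulate gradients on sub-batches
                    batch = next(streams[task.name])
                    loss = task.compute_loss(backbone, batch)
                    (loss / NUM_SUB_BATCHES).backward()
            optimizer.step()                       # one parameter update per accumulated batch
    return backbone
```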

Post-training Datasets. As discussed above, fine-tuning the pre-trained model directly on the target task faces the following main challenges: i) only a small amount of supervised training data is available during the fine-tuning phase, so fine-tuning the pre-trained model is potentially brittle; ii) supervised fine-tuning requires substantial amounts of task-specific supervised training data, which are limited, indirect, and not always available; iii) leveraging BERT alone leaves domain- or task-specific questions unresolved. To enhance the performance of the pre-trained model, we need to effectively fuse task knowledge (from supervised data of related NLP tasks) or domain knowledge (from related in-domain supervised data). Question answering (QA), a common NLP task that derives an answer from a given question, requires reasoning over facts relevant to the question and a deep semantic understanding of the document; thus, a large-scale supervised QA corpus can benefit most NLP tasks. Similarly, natural language inference (NLI, a.k.a. RTE) and sentiment analysis (SA) are two other important and basic tasks for natural language understanding. We therefore use a QA dataset (CoQA), an NLI dataset (SNLI), and an SA dataset (YELP) as post-training datasets, and post-train our model on CoQA, SNLI, and YELP simultaneously.

In this work, for generality and wide applicability of our proposed PPBERT, we use only CoQA, SNLI, and YELP as post-training datasets. Note that, because PPBERT adopts the efficient multi-task continual learning training method (Sect. 2.2), its post-training datasets are easily scalable and can be further combined with other datasets, including domain-specific data.

2.3 Fine-Tuning

In the fine-tuning stage, we first initialize the PPBERT model with the post-trained parameters and then fine-tune it further on the supervised dataset of the specific task, as sketched below. In general, each downstream task ends up with its own fine-tuned model.
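As a minimal illustration of this stage, the sketch below fine-tunes a sequence-classification head starting from a post-trained checkpoint, assuming the Hugging Face Transformers library; the checkpoint directory name, label count, and hyperparameters are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of stage 3: fine-tune the post-trained model on one target task.
# Assumptions: PyTorch + Hugging Face Transformers; "./ppbert-post-trained" is a
# hypothetical directory containing the post-trained weights; epochs, learning rate,
# and the classification setup are illustrative.
import torch
from transformers import AutoModelForSequenceClassification

def fine_tune(post_trained_dir, train_loader, num_labels, epochs=3, lr=2e-5):
    # Initialize from the post-trained parameters rather than the original BERT weights.
    model = AutoModelForSequenceClassification.from_pretrained(
        post_trained_dir, num_labels=num_labels)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in train_loader:        # batch: dict of tensors, including "labels"
            optimizer.zero_grad()
            loss = model(**batch).loss    # task-specific supervised loss
            loss.backward()
            optimizer.step()
    return model                          # each downstream task gets its own fine-tuned model

# Hypothetical usage: mrpc_model = fine_tune("./ppbert-post-trained", mrpc_loader, num_labels=2)
```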

3 Experiments

3.1 Tasks

To evaluate our proposed approach, we use a comprehensive set of experimental tasks, as follows:

i) in Sect. 3, eight tasks in the GLUE benchmark [25] and eight tasks in the SuperGLUE benchmark [24];

ii) in Sect. 4, five question answering tasks, two natural language inference tasks, and two domain adaptation tasks: financial sentiment analysis and financial question answering.

We expect these NLP tasks to particularly benefit from the proposed ‘pre-training’+‘post-training’+‘fine-tuning’ three-stage paradigm.

3.2 Datasets

This subsection briefly describes the datasets.

GLUE. The General Language Understanding Evaluation (GLUE) benchmark [25] is a collection of eight datasets for evaluating NLU tasks. GLUE consists of a series of NLP task datasets (see Table 1), including: Corpus of Linguistic Acceptability (CoLA), Multi-genre Natural Language Inference (MNLI), Recognizing Textual Entailment (RTE), Quora Question Pairs (QQP), Semantic Textual Similarity Benchmark (STS-B), Stanford Sentiment Treebank (SST-2), Question Natural Language Inference (QNLI), and Microsoft Research Paraphrase Corpus (MRPC).

Table 1. Summary of the GLUE benchmark.

SuperGLUE. Similar to GLUE, the SuperGLUE benchmark [24] is a new benchmark consisting of more difficult language understanding task datasets, including: BoolQ, CommitmentBank (CB), Choice of Plausible Alternatives (COPA), Multi-Sentence Reading Comprehension (MultiRC), Reading Comprehension with Commonsense Reasoning (ReCoRD), Recognizing Textual Entailment (RTE), Words in Context (WiC), and Winograd Schema Challenge (WSC).

SQuAD. The Stanford Question Answering Dataset (SQuAD) is one of the most popular machine reading comprehension datasets. SQuAD is a typical extractive machine reading comprehension task: given a question and a paragraph of context, the aim is to extract from the document the text span that answers the question. SQuAD has two versions: SQuAD [20], in which the provided document always contains an answer, and SQuAD v2.0 [19], in which some questions cannot be answered from the provided document.

Financial Datasets. To better demonstrate the generality of our post-training approach, we further perform domain adaptation experiments on two financial tasks: the FiQA sentiment analysis (SA) dataset and the FiQA question answering (QA) dataset. As part of the companion proceedings of the WWW’18 conference, [14] released these two very small financial datasets (FiQA).

Table 2. The overall performance of PPBERT and the comparison against BERT models on GLUE benchmark.

Notes: The results on the GLUE benchmark [25], where the results on the test set are scored by the GLUE evaluation server and the results on the dev set are the median of three runs. The metrics for these tasks are shown in Table 1. Bold text indicates results on par with or surpassing human performance. \(^\ddag \) indicates our proposed model. \(^\dag \) indicates the original BERT model [3].

Table 3. Results on SuperGLUE benchmark.

Notes: All results are based on a 24-layer architecture (LARGE model). PPBERT results on the development set are a median over three runs. Model references: \(^\S \): ([24]).

Additional Benchmarks. As shown in Table 6, we use additional datasets for extractive question answering tasks, including RACE [7], NewsQA [23], TriviaQA [6], and HotpotQA [28]. More details are provided in the supplementary materials.

3.3 Experimental Results

We evaluate the proposed PPBERT on two popular NLU benchmarks: GLUE and SuperGLUE. We compare PPBERT with the standard BERT model and demonstrate the effectiveness of ‘post-training’.

GLUE Results. We evaluate performance on the GLUE benchmark with the large and base models of each approach, and report the results of each method on both the development and test sets. The detailed experimental results on GLUE are presented in Table 2. As shown in the BASE model columns of Table 2, PPBERT\(\mathrm{_{BASE}}\) achieves an average score of 81.53 and outperforms the standard BERT\(\mathrm{_{BASE}}\) on all 8 tasks. As shown in the test-set part of the LARGE model section of Table 2, PPBERT\(\mathrm{_{LARGE}}\) outperforms BERT\(\mathrm{_{LARGE}}\) on all 8 tasks and achieves an average score of 85.03. We observe similar results on the dev set, where PPBERT\(\mathrm{_{LARGE}}\) achieves an average score of 87.02, a 2.97 improvement over BERT\(\mathrm{_{LARGE}}\). From these data we can see that PPBERT\(\mathrm{_{LARGE}}\) matches or even outperforms human-level performance.

SuperGLUE Results. Table 3 shows the performance on the 8 SuperGLUE tasks. As shown in Table 3, PPBERT significantly outperforms BERT on all 8 tasks. The main gains from PPBERT are on MultiRC (+6.5) and ReCoRD (+6.7), which account for much of the rise in PPBERT’s SuperGLUE score. Also, as Table 3 shows, there is still a large gap between human performance (89.79) and the performance of PPBERT (74.55).

Overall Trends. Table 2 and Table 3 show our results on GLUE and SuperGLUE with and without ‘post-training’, respectively. As shown, we compare the proposed method to standard BERT benchmarks on 16 baseline tasks and find that our proposed PPBERT outperforms BERT on every task. Since PPBERT has the same architecture and pre-training objective as standard BERT in the pre-training phase, the main gain is attributed to the ‘post-training’ phase. Considering where the gains occur, PPBERT is particularly strong on natural language inference and question answering tasks, and less strong on syntax-oriented tasks. In the GLUE benchmark (we observe similar results on SuperGLUE), for example: i) for the question answering tasks (QNLI, MultiRC, ReCoRD) and the natural language inference tasks (MNLI and RTE), we achieve significant accuracy gains of at least 1 point; ii) for the sentiment task (SST-2), we observe a smaller gain (+0.8), mainly because the accuracy is already high (95.7); iii) for the syntax-oriented single-sentence task (CoLA), we observe the smallest gain of all tasks (+0.2), which mirrors the results reported in [1], who show that few pre-training tasks other than language modeling offer any advantage for CoLA; iv) for the MRPC and RTE tasks, as shown in Table 2 and Table 3, it is interesting that we find consistent improvements after post-training. This reveals that the PPBERT representation learned by ‘pre-training’+‘post-training’ allows much more effective domain adaptation than the BERT representation learned by ‘pre-training’ only.

4 Ablation Study and Analyses

4.1 Cooperation with Other Pre-trained LMs

Our proposed PPBERT is more flexible and pluggable, in that the post-training approach can be plugged into other BERT-based PLMs rather than being applied only to the original BERT model. We further validate the performance of the ‘post-training’ approach on different pre-trained LMs. We compare post-training by plugging it into the original BERT (i.e., PPBERT) and into its variant ALBERT (called PPALBERT), respectively. We also further post-train the most recently proposed PPALBERT with one additional QA dataset (SearchQA), and call it PPALBERT\(\mathrm{_{LARGE}}\)-QA.

Comparisons to SOTA Models. We evaluate our models on the popular SQuAD benchmark (Sect. 3.2). The performance of each model is evaluated with two standard metrics: the F1 score and the exact match (EM) score. The F1 score measures token-level precision and recall and is less strict than the EM score; the EM score measures whether the model output exactly matches the ground-truth answers (see the sketch below).
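For reference, both metrics can be computed as in the following sketch, which follows the commonly used SQuAD-style evaluation (answers are normalized by lowercasing, removing punctuation and articles, and collapsing whitespace before comparison); this is our own illustration rather than code from the paper.

```python
# SQuAD-style EM and F1 (token overlap), as commonly defined for extractive QA.
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    """1.0 if the normalized prediction matches the normalized answer exactly, else 0.0."""
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction, ground_truth):
    """Harmonic mean of token-level precision and recall; less strict than EM."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: a partially overlapping span gets EM = 0.0 but a nonzero F1.
print(exact_match("Eiffel Tower in Paris", "Eiffel Tower"))   # 0.0
print(f1_score("Eiffel Tower in Paris", "Eiffel Tower"))      # ~0.67
```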

Table 4. Comparison with state-of-the-art results on the Dev set of SQuAD.

Notes: Results on SQuAD 1.1/2.0 development dataset. Best scores are in bold texts, and the previous best scores are underlined.

Table 4 details the performance gains when exploiting each of the three post-trained LMs on the SQuAD datasets (the two versions, respectively). As shown in Table 4, on the SQuAD 1.1 dev set, adding the post-training stage improves the EM by 1.1 points (84.1\(\rightarrow \)85.2) and the F1 by 1.2 points (90.9\(\rightarrow \)92.1) compared with the BERT baseline. Similarly, PPALBERT\(\mathrm{_{LARGE}}\) also outperforms the ALBERT\(\mathrm{_{LARGE}}\) baseline, by 0.3 EM and 0.2 F1. In particular, PPALBERT\(\mathrm{_{LARGE}}\)-QA, which uses further post-training, improves over PPALBERT\(\mathrm{_{LARGE}}\) by 0.1 EM and 0.1 F1. We observe similar results on the SQuAD v2.0 development set, where the most recently proposed PPALBERT sets a new state of the art, achieving 87.7 EM and 90.5 F1.

Performance on Other QA and NLI Tasks. Furthermore, we conduct extensive experiments on six NLP tasks concerning semantic relationships, including two natural language inference benchmarks (QNLI and MNLI-m, both from GLUE) and four extractive question answering benchmarks (TriviaQA, RACE, HotpotQA, and NewsQA). For all benchmarks except RACE, we use the same fine-tuning method as for SQuAD; RACE differs from the others in that it is a multiple-choice QA dataset (a minimal sketch of this setup is given below). The experimental results for PPALBERT are shown in Table 5. As depicted in Table 5, both PPALBERT\(\mathrm{_{LARGE}}\) and PPALBERT\(\mathrm{_{LARGE}}\)-QA achieve state-of-the-art accuracy across all settings. Overall, as expected, utilizing ‘pre-training’ alone is inferior to our proposed ‘pre-training’-then-‘post-training’ method. The experimental results described above (Sect. 4.1) indicate that our training paradigm is very flexible and that the proposed post-training approach can be easily plugged into other PLMs. More remarkably, we achieve new SOTA performance over existing baselines.
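To make the multiple-choice setup concrete, the sketch below shows how a RACE-style example can be scored with a multiple-choice head, assuming the Hugging Face Transformers library; the checkpoint name, the toy passage, and the answer options are illustrative, and in practice one would load the post-trained PPALBERT weights instead.

```python
# Minimal sketch of multiple-choice QA scoring (RACE-style) with a multiple-choice head.
# Assumptions: Hugging Face Transformers + PyTorch; "albert-base-v2" is a stand-in for
# the post-trained PPALBERT checkpoint, and the passage/question/options are toy data.
import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer

checkpoint = "albert-base-v2"    # in practice: the post-trained PPALBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMultipleChoice.from_pretrained(checkpoint)

passage = "The museum opens at nine in the morning and closes at five."
question = "When does the museum close?"
options = ["At nine.", "At noon.", "At five.", "It never closes."]

# Pair the passage+question with each candidate option; the model scores all options jointly.
enc = tokenizer([f"{passage} {question}"] * len(options), options,
                return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}   # shapes: (1, num_choices, seq_len)
labels = torch.tensor([2])                             # index of the correct option

outputs = model(**inputs, labels=labels)               # cross-entropy over the choices
loss, logits = outputs.loss, outputs.logits            # logits shape: (1, num_choices)
```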

Table 5. Performance on six QA and NLI tasks.

Notes: The details of NewsQA, TriviaQA, HotpotQA, and RACE are shown in Table 6. QNLI and MNLI-m are from GLUE. Model references: \(^\dag \): ([5]), \(^\ddag \): ([11]), \(^\S \): ([8]).

Table 6. The details of QA datasets.

5 Conclusion

In this paper, we present a ‘pre-training’+‘post-training’+‘fine-tuning’ three-stage paradigm and a language model named PPBERT based on it, which serves as a supplementary framework for the standard ‘pre-training’+‘fine-tuning’ two-stage architecture. The proposed three-stage paradigm helps to incorporate task-awareness and domain knowledge into the pre-trained model and also reduces bias in the training corpus. PPBERT benefits from a regularization effect, since it leverages cross-domain or cross-task data, which helps the model generalize better with limited data and adapt to new domains or tasks. With the latest PLMs as baselines and encoder backbones, PPBERT is evaluated on 24 well-known benchmarks, on which it outperforms strong baseline models and obtains new SOTA results. We hope this work encourages further research into language model training; future work includes the choice of other transfer learning sources, such as computer vision (CV).