Abstract
In this paper, we present a ‘pre-training’+‘post-training’+‘fine-tuning’ three-stage paradigm, a supplementary framework to the standard ‘pre-training’+‘fine-tuning’ language model approach. Based on this three-stage paradigm, we further present a language model named PPBERT. Unlike the original BERT architecture, which follows the standard two-stage paradigm, we do not fine-tune the pre-trained model directly, but rather first post-train it on a domain- or task-related dataset. This helps to better incorporate task-aware and domain-aware knowledge into the pre-trained model, and also reduces bias from the training dataset. Extensive experimental results indicate that the proposed model improves over the baselines on 24 NLP tasks, including eight GLUE benchmarks, eight SuperGLUE benchmarks, and six extractive question answering benchmarks. More remarkably, our proposed model is flexible and pluggable: the post-training approach can be plugged into other PLMs that are based on BERT. Extensive ablations further validate its effectiveness and its state-of-the-art (SOTA) performance. The source code, pre-trained models, and post-trained models are publicly available.
1 Introduction
Recently, the introduction of pre-trained language models (PLMs), including GPT [18], BERT [3], and ELMo [17], among many others, has brought tremendous success to natural language processing (NLP) research. Typically, such a model is trained in two successive stages: a pre-training phase and a fine-tuning phase. During the pre-training phase, the model is first trained on an unsupervised dataset; during the fine-tuning phase, it is then fine-tuned on downstream supervised NLP tasks. To date, these models have obtained the best performance on various NLP tasks; some of the most prominent examples are BERT and the BERT-based SpanBERT [5] and ALBERT [8]. These PLMs are trained on large unsupervised corpora through unsupervised training objectives. However, it is not obvious that the model parameters obtained during the unsupervised pre-training phase are well suited to support this kind of transfer learning. In particular, during the fine-tuning phase only a small amount of supervised text data is available for the target NLP task, so fine-tuning the pre-trained model is potentially brittle. Moreover, supervised fine-tuning of the pre-trained model requires substantial amounts of task-specific supervised training data, which are not always available. For example, in the GLUE benchmark [25], the Winograd Schema dataset [9] has only 634 training examples, too few for fine-tuning on a natural language inference (NLI) task. Furthermore, although PLMs such as BERT can learn contextualized representations across many NLP tasks (i.e., they are task-agnostic), leveraging PLMs alone still leaves domain-specific challenges unresolved: BERT is trained on general-domain corpora only and captures general language knowledge from the training data, but severely lacks domain- or task-specific data.
For example, texts in the financial domain often contain unique vocabulary, such as stock and bond types, and the sizes of labeled datasets are also very small (sometimes only a few hundred samples). In this paper, to overcome the aforementioned issues, we propose a novel three-stage BERT architecture (called PPBERT), in which we add a second stage of training, namely ‘post-training’, to improve on the original BERT architecture.
Typically, there are two directions for pursuing new state-of-the-art results in the post-pre-training PLM era. One is to construct novel neural network architectures on top of PLMs, like BERTserini [26] and BERTCMC [15]. The other is to optimize pre-training itself, like GPT 2.0 [18], MT-DNN [10], SpanBERT [5], and ALBERT [8]. In this paper, we present another method to improve PLMs: a ‘pre-training’+‘post-training’+‘fine-tuning’ three-stage paradigm, and a language model named PPBERT built on it. Unlike the original BERT architecture, which is based on the standard ‘pre-training’+‘fine-tuning’ approach, we do not fine-tune pre-trained models directly, but rather first post-train them on a domain- or task-related training dataset. This helps to better incorporate task-aware and domain-aware knowledge into the pre-trained model and can also reduce bias in the training data. More specifically, our framework involves three sequential stages: a pre-training stage on large-scale corpora (see Subsect. 2.1), a post-training stage on task- or domain-related datasets via a multi-task continual learning method (see Subsect. 2.2), and a fine-tuning stage on the target datasets, even with few or no labeled samples (see Subsect. 2.3). Thus, PPBERT benefits from a regularization effect, since it leverages cross-domain or cross-task data, which helps the model generalize better with limited data and adapt better to new domains or tasks.
To sum up, our proposed post-training process outperforms the existing BERT baseline on a wide variety of tasks, and achieves substantially better performance on small datasets and domain-specific tasks in particular. Specifically, we compared our model with BERT baselines on the GLUE and SuperGLUE benchmarks and consistently, significantly outperformed BERT on all 16 tasks (8 GLUE tasks and 8 SuperGLUE tasks), raising the GLUE average score to 87.02, an absolute improvement of 2.97 over BERT, and pushing the SuperGLUE score to 74.55, an absolute improvement of 5.55. More remarkably, our model is flexible and pluggable: the post-training approach can be plugged directly into other BERT-based PLMs. In our ablation studies, we plug the post-training strategy into the original BERT (i.e., PPBERT) and its variant ALBERT (called PPALBERT), respectively. Our approaches advance the SOTA results on five popular question answering datasets, surpassing the previous pre-trained models by at least 1 point in absolute accuracy. Moreover, in further ablation studies, our best model obtains SOTA results on small datasets (1/20 of the training set). All of this clearly demonstrates the exceptional generalization capability that our proposed three-stage paradigm gains via post-training.
2 The Proposed Model: PPBERT
As shown in Fig. 1, the standard BERT is built on a two-stage paradigm, ‘pre-training’+‘fine-tuning’. Compared with traditional pre-training methods, PPBERT does not fine-tune the pre-trained model directly after pre-training, but rather continues to post-train it on a task- or domain-related corpus, which helps to reduce bias. During post-training, our proposed framework continuously updates the pre-trained model. The architecture of PPBERT is shown in Fig. 1.
2.1 Pre-training
The training procedure of our proposed PPBERT consists of two stages: a pre-training stage and a post-training stage. As BERT outperforms most existing models, we do not re-implement it but focus on the second training stage: post-training. The pre-training procedure follows that of the BERT model. We first use the original BERT and then adopt a joint post-training method to enhance it. Thus, our proposed PPBERT is flexible and pluggable: the post-training approach can be plugged into other BERT-based language models, such as ALBERT [8] and SpanBERT [5], not only the original BERT.
2.2 Post-training
Compared with the original BERT architecture and its two-stage ‘pre-training’+‘fine-tuning’ paradigm, we do not fine-tune the pre-trained model directly, but rather first post-train it on a task- or domain-related training dataset. That is, we add a second training stage, the ‘post-training’ stage, on an intermediate task before target-task fine-tuning.
Training Details. The post-training stage aims to continually train the pre-trained model on task- or domain-related annotated data, learning task or domain knowledge from different post-training tasks by continuously updating the pre-trained model. This brings a key challenge: how to train these post-training tasks in a continual way, efficiently post-training on a new task without forgetting the knowledge learned before.
Inspired by [2, 22] and [16], which show that Continual Learning can train a model on several tasks in sequence, we find that the standard Continual Learning method trains the model on only one task at a time, with the drawback that it easily forgets previously learned knowledge. Concurrently, inspired by [10, 12] and [4, 13], which show that Multi-task Learning can use different training corpora to train sub-parts of a neural network, we find that, although Multi-task Learning can train multiple tasks at the same time, it requires all customized pre-training tasks to be prepared before training can proceed, so this method takes as much time as continual learning, if not more. We therefore present a multi-task continual learning method to tackle this problem. More specifically, whenever a new post-training task arrives, the method first uses the previously learned parameters to initialize the model, and then trains the newly introduced task together with the original tasks simultaneously, ensuring that the learned parameters still encode the previously acquired knowledge. Crucially, during post-training we allocate K training iterations to each task and assign these K iterations to different stages of training. In addition, instead of updating parameters over a full batch, we divide each batch into sub-batches and accumulate gradients over these sub-batches before each parameter update; this allows a smaller sub-batch to be consumed in each iteration, which is more conducive to fast iteration with distributed training. As a result, PPBERT can continuously update the pre-trained model with the multi-task continual learning method, guaranteeing efficient post-training without forgetting previously learned knowledge.
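The training loop described above — all current tasks trained jointly, with gradients accumulated over sub-batches before each parameter update — can be sketched as follows. This is a minimal illustration under our own naming; the model, functions, and hyperparameters are ours, not the paper's released code:

```python
import torch
import torch.nn as nn

def multi_task_continual_step(model, optimizer, task_batches, num_sub_batches=2):
    """One post-training iteration: every current task contributes gradients,
    each task batch is split into sub-batches whose gradients accumulate in
    .grad, and a single parameter update follows."""
    optimizer.zero_grad()
    for x, y in task_batches:  # all current tasks, trained jointly
        for sx, sy in zip(x.chunk(num_sub_batches), y.chunk(num_sub_batches)):
            loss = nn.functional.mse_loss(model(sx).squeeze(-1), sy)
            (loss / num_sub_batches).backward()  # gradients accumulate across sub-batches
    optimizer.step()  # one update per iteration

def post_train(model, task_batches, k_iters=10, lr=0.05):
    """Allot K iterations per round; when a new task arrives, extend
    `task_batches` and keep training from the previously learned parameters."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(k_iters):
        multi_task_continual_step(model, optimizer, task_batches)
    return model
```

In a real setup the model would be a pre-trained BERT encoder with per-task losses rather than a toy regression head, but the control flow — joint tasks, sub-batch gradient accumulation, one update per iteration — is the same.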
Post-training Datasets. As discussed above, fine-tuning directly on the target task has the following main challenges: i) during the fine-tuning phase, only a small amount of supervised training data is available, so fine-tuning the pre-trained model is potentially brittle; ii) supervised fine-tuning of the pre-trained model requires substantial amounts of task-specific supervised training data, which are limited, indirect, and not always available; iii) leveraging BERT alone leaves domain- or task-specific questions unresolved. To enhance the performance of the pre-trained model, we need to effectively fuse task knowledge (from supervised data of related NLP tasks) or domain knowledge (from related in-domain supervised data). As a common NLP task, question answering (QA), which derives an answer from a given question, requires reasoning over facts relevant to the question and deep semantic understanding of the document; thus, a large-scale supervised QA corpus can benefit most NLP tasks. Similarly, NLI (a.k.a. RTE) and sentiment analysis (SA) are two other important, basic tasks for natural language understanding. We therefore use a QA dataset (CoQA), an NLI dataset (SNLI), and an SA dataset (YELP) as post-training datasets, and post-train our model on CoQA, SNLI, and YELP simultaneously.
In this work, for the generality and wide applicability of our proposed PPBERT, we use only CoQA, SNLI, and YELP as post-training datasets. Note that, because PPBERT adopts the efficient multi-task continual learning method (Sect. 2.2), its set of post-training datasets is easily scalable and can be further combined with other datasets, including domain-specific data.
2.3 Fine-Tuning
In the fine-tuning stage, we first initialize the PPBERT model with the post-trained parameters, and then further fine-tune it with supervised data from the specific task. In general, each downstream task has its own fine-tuned model after fine-tuning.
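The per-task setup can be sketched as follows: every downstream task starts from the same post-trained parameters and gets its own fresh task head, so fine-tuning yields one model per task. This is our own illustrative sketch, not the paper's code:

```python
import copy
import torch.nn as nn

def fine_tune_models(post_trained_encoder, task_heads):
    """Build one model per downstream task, each initialized from the shared
    post-trained encoder plus a fresh task-specific head; each model is then
    fine-tuned separately on its own supervised task data."""
    models = {}
    for task, head in task_heads.items():
        encoder = copy.deepcopy(post_trained_encoder)  # start from post-trained weights
        models[task] = nn.Sequential(encoder, head)    # fine-tune this copy per task
    return models
```

Deep-copying the encoder keeps the fine-tuned models independent, matching the statement that each task ends up with its own fine-tuned model.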
3 Experiments
3.1 Tasks
To evaluate our proposed approach, we use a comprehensive set of experimental tasks, as follows:
i) in Sect. 3, eight tasks in the GLUE benchmark [25] and eight tasks in the SuperGLUE benchmark [24];
ii) in Sect. 4, five question answering tasks, two natural language inference tasks, and two domain adaptation tasks: financial sentiment analysis and financial question answering.
We expect these NLP tasks to benefit particularly from the proposed ‘pre-training’+‘post-training’+‘fine-tuning’ three-stage paradigm.
3.2 Datasets
This subsection briefly describes the datasets.
GLUE. The General Language Understanding Evaluation (GLUE) benchmark [25] is a collection of eight datasets for evaluating NLU tasks (see Table 1): Corpus of Linguistic Acceptability (CoLA), Multi-Genre Natural Language Inference (MNLI), Recognizing Textual Entailment (RTE), Quora Question Pairs (QQP), Semantic Textual Similarity Benchmark (STS-B), Stanford Sentiment Treebank (SST-2), Question Natural Language Inference (QNLI), and Microsoft Research Paraphrase Corpus (MRPC).
SuperGLUE. Similar to GLUE, the SuperGLUE benchmark [24] is a newer benchmark with more difficult language understanding tasks, including: BoolQ, CommitmentBank (CB), Choice of Plausible Alternatives (COPA), Multi-Sentence Reading Comprehension (MultiRC), Reading Comprehension with Commonsense Reasoning (ReCoRD), Recognizing Textual Entailment (RTE), Words in Context (WiC), and Winograd Schema Challenge (WSC).
SQuAD. The Stanford Question Answering Dataset (SQuAD) is one of the most popular machine reading comprehension datasets. SQuAD is a typical extractive machine reading comprehension task: given a question and a paragraph of context, the aim is to return a text span extracted from the document that answers the question. SQuAD has two versions: SQuAD 1.1 [20], in which the provided document always contains an answer, and SQuAD v2.0 [19], in which some questions cannot be answered from the provided document.
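Extractive span prediction of the kind SQuAD requires is typically decoded by scoring candidate (start, end) position pairs and picking the best valid one. A minimal sketch of this common decoding convention (our own illustration, not the paper's code):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return the (start, end) token indices of the highest-scoring span with
    start <= end and length capped at `max_len`, the usual decoding step for
    extractive QA models that emit per-token start/end scores."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]  # spans are scored additively
            if score > best_score:
                best_score, best = score, (i, j)
    return best
```

The predicted answer is then the context text covered by the returned token indices; SQuAD v2.0 systems additionally compare this score against a no-answer score.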
Financial Datasets. To better demonstrate the generality of our post-training approach, we further perform domain adaptation experiments on two financial tasks: the FiQA sentiment analysis (SA) dataset and the FiQA question answering (QA) dataset. These two very small financial datasets (FiQA) were released by [14] as part of the companion proceedings of the WWW’18 conference.
Notes: Results on the GLUE benchmark [25]. Test-set results are scored by the GLUE evaluation server; dev-set results are the median of three runs. The metrics for these tasks are shown in Table 1. Bold text indicates results on par with or surpassing human performance. \(^\ddag \) indicates our proposed model; \(^\dag \) indicates the original BERT model [3].
Notes: All results are based on a 24-layer architecture (LARGE model). PPBERT results on the development set are the median over three runs. Model references: \(^\S \): [24].
Additional Benchmarks. As shown in Table 6, we use additional datasets for extractive question answering tasks, including RACE [7], NewsQA [23], TriviaQA [6], and HotpotQA [28]. More details are provided in the supplementary materials.
3.3 Experimental Results
We evaluate the proposed PPBERT on two popular NLU benchmarks: GLUE and SuperGLUE. We compare PPBERT with the standard BERT model and demonstrate the effectiveness of ‘post-training’.
GLUE Results. We evaluate performance on the GLUE benchmark with both the LARGE and BASE models of each approach, and report the results of each method on the development and test sets. The detailed results on GLUE are presented in Table 2. As shown in the BASE-model columns of Table 2, PPBERT\(\mathrm{_{BASE}}\) achieves an average score of 81.53 and outperforms standard BERT\(\mathrm{_{BASE}}\) on all 8 tasks. As shown in the test-set part of the LARGE-model section of Table 2, PPBERT\(\mathrm{_{LARGE}}\) outperforms BERT\(\mathrm{_{LARGE}}\) on all 8 tasks and achieves an average score of 85.03. We observe similar results on the dev set, where PPBERT\(\mathrm{_{LARGE}}\) achieves an average score of 87.02, a 2.97 improvement over BERT\(\mathrm{_{LARGE}}\). From these data we can see that PPBERT\(\mathrm{_{LARGE}}\) matches or even exceeds human-level performance.
SuperGLUE Results. Table 3 shows the performance on the 8 SuperGLUE tasks. As shown, PPBERT significantly outperforms BERT on all 8 tasks. The main gains are on MultiRC (+6.5) and ReCoRD (+6.7), both of which account for much of the rise in PPBERT’s SuperGLUE score. However, as Table 3 also shows, there is still a large gap between human performance (89.79) and the performance of PPBERT (74.55).
Overall Trends. Tables 2 and 3 show our results on GLUE and SuperGLUE with and without ‘post-training’, respectively. We compare the proposed method to standard BERT baselines on 16 tasks and find that our proposed PPBERT outperforms BERT on every task. Since PPBERT has the same architecture and pre-training objective as standard BERT in the pre-training phase, the main gain is attributable to ‘post-training’. Considering the gains, PPBERT is especially strong at natural language inference and question answering tasks, and less so at syntax-oriented tasks. In the GLUE benchmark (we observe similar results on SuperGLUE), for example: i) on the question answering tasks (QNLI, MultiRC, ReCoRD) and the natural language inference tasks (MNLI and RTE), we achieve significant accuracy gains of at least 1 point; ii) on the sentiment task (SST-2), we observe a smaller gain (+0.8), mainly because the accuracy was already high (95.7); iii) on the syntax-oriented task (CoLA), we observe the smallest gain of all tasks (+0.2); this mirrors the results reported in [1], who show that few pre-training tasks other than language modeling offer any advantage for CoLA; iv) on the MRPC and RTE tasks, as shown in Tables 2 and 3, we interestingly find consistent improvements after post-training. This reveals that the PPBERT representation learned by ‘pre-training’+‘post-training’ allows much more effective domain adaptation than the BERT representation learned by ‘pre-training’ only.
4 Ablation Study and Analyses
4.1 Cooperation with Other Pre-trained LMs
Our proposed PPBERT is flexible and pluggable: the post-training approach can be plugged into other BERT-based PLMs, not only the original BERT model. We further validate the performance of the ‘post-training’ approach on different pre-trained LMs by plugging it into the original BERT (i.e., PPBERT) and its variant ALBERT (called PPALBERT), respectively. We also further post-train PPALBERT with one additional QA dataset (SearchQA), and call the result PPALBERT\(\mathrm{_{LARGE}}\)-QA.
Comparisons to SOTA Models. We evaluate our models on the popular SQuAD benchmark (Sect. 3.2). The performance of each model is evaluated on two standard metrics: F1 score and exact match (EM) score. The F1 score measures token-level precision and recall and is less strict than the EM score, which measures whether the model output exactly matches a ground-truth answer.
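The two metrics can be sketched as follows, using the common SQuAD-style answer normalization (lowercasing, stripping punctuation and articles); this is the standard convention, not necessarily the paper's exact evaluation script:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """EM: 1.0 iff the normalized prediction equals the normalized answer."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-overlap F1 between normalized prediction and answer."""
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    num_same = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction that covers two of three gold tokens with no spurious tokens gets full precision but two-thirds recall, hence an F1 of 0.8; official evaluation additionally takes the maximum over multiple gold answers.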
Notes: Results on SQuAD 1.1/2.0 development dataset. Best scores are in bold texts, and the previous best scores are underlined.
Table 4 details the performance gains from each of the three post-trained LMs on the two versions of the SQuAD dataset. As shown, on the SQuAD 1.1 dev set, adding a post-training stage improves the BERT baseline by 1.1 EM points (84.1\(\rightarrow \)85.2) and 1.2 F1 points (90.9\(\rightarrow \)92.1). Similarly, PPALBERT\(\mathrm{_{LARGE}}\) outperforms the ALBERT\(\mathrm{_{LARGE}}\) baseline by 0.3 EM and 0.2 F1, and PPALBERT\(\mathrm{_{LARGE}}\)-QA, with further post-training, improves over PPALBERT\(\mathrm{_{LARGE}}\) by 0.1 EM and 0.1 F1. We observe similar results on the SQuAD v2.0 development set, where PPALBERT sets a new state of the art, achieving 87.7 EM and 90.5 F1.
Performance on Other QA and NLI Tasks. Furthermore, we conduct extensive experiments on six NLP tasks involving semantic relationships, including two natural language inference benchmarks (QNLI and MNLI-m, both from GLUE) and four extractive question answering benchmarks (TriviaQA, RACE, HotpotQA, and NewsQA). For all benchmarks except RACE, we use the same fine-tuning method as for SQuAD; unlike the others, RACE is a multiple-choice QA dataset. The experimental results for PPALBERT are shown in Table 5: both PPALBERT\(\mathrm{_{LARGE}}\) and PPALBERT\(\mathrm{_{LARGE}}\)-QA achieve state-of-the-art accuracy across all settings. Overall, as expected, ‘fine-tuning’ alone is inferior to our proposed ‘post-training’-then-‘fine-tuning’ method. The experimental results described above (Sect. 4.1) indicate that our training paradigm is very flexible, and the proposed post-training approach can easily be plugged into other PLMs. More remarkably, we achieve new SOTA performance over existing baselines.
Notes: The details of NewsQA, TrivaQA, HotpotQA and RACE are shown in Table 6. QNLI and MNLI-m are from GLUE. Model references: \(^\dag \): ([5]), \(^\ddag \): ([11]), \(^\S \): ([8]).
5 Conclusion
In this paper, we presented a ‘pre-training’+‘post-training’+‘fine-tuning’ three-stage paradigm and a language model named PPBERT based on it, a supplementary framework to the standard ‘pre-training’+‘fine-tuning’ two-stage architecture. Our proposed three-stage paradigm helps to incorporate task-aware and domain knowledge into the pre-trained model, and also reduces bias in the training corpus. PPBERT benefits from a regularization effect, since it leverages cross-domain or cross-task data, which helps the model generalize better with limited data and adapt better to new domains or tasks. With the latest PLMs as baselines and encoder backbones, PPBERT is evaluated on 24 well-known benchmarks, where it outperforms strong baseline models and obtains new SOTA results. We hope this work encourages further research into language model training; future work includes the choice of other transfer learning sources, such as computer vision.
References
Bowman, S.R., Pavlick, E., Grave, E.: Looking for Elmo’s friends: sentence-level pretraining beyond language modeling. CoRR abs/1812.10860 (2018)
Chen, Z., Liu, B.: Lifelong Machine Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2nd edn. Morgan & Claypool Publishers, Williston (2018). https://doi.org/10.2200/S00832ED1V01Y201802AIM037
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1423
Hou, M., Chen, X., Huang, S., Xie, S., Zhou, G.: Generalizing deep multi-task learning with heterogeneous structured networks. In: Proceedings of ICLR (2020)
Joshi, M., Chen, D., Liu, Y., Weld, D.S., Zettlemoyer, L., Levy, O.: SpanBERT: improving pre-training by representing and predicting spans. CoRR abs/1907.10529 (2019). http://arxiv.org/abs/1907.10529
Joshi, M., Choi, E., Weld, D.S., Zettlemoyer, L.: TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In: Barzilay, R., Kan, M. (eds.) Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, 30 July – 4 August, Volume 1: Long Papers, pp. 1601–1611. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/P17-1147
Lai, G., Xie, Q., Liu, H., Yang, Y., Hovy, E.H.: RACE: large-scale reading comprehension dataset from examinations. In: Palmer, M., Hwa, R., Riedel, S. (eds.) Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, 9–11 September, 2017, pp. 785–794. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/d17-1082
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: A lite BERT for self-supervised learning of language representations. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net (2020), https://openreview.net/forum?id=H1eA7AEtvS
Levesque, H.J., Davis, E., Morgenstern, L.: The Winograd Schema Challenge (2012)
Liu, X., et al.: The microsoft toolkit of multi-task deep neural networks for natural language understanding. In: Celikyilmaz, A., Wen, T. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, ACL 2020, Online, 5–10 July, 2020, pp. 118–126. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-demos.16
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019). http://arxiv.org/abs/1907.11692
Liu, Z., Huang, D., Huang, K., Li, Z., Zhao, J.: FinBERT: a pre-trained financial language representation model for financial text mining. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, 5–10 January, 2021, Yokohama, Japan, pp. 4513–4519 (2020)
Liu, Z., Huang, K., Huang, D., Liu, Z., Zhao, J.: Dual head-wise coattention network for machine comprehension with multiple-choice questions. In: d’Aquin, M., Dietze, S., Hauff, C., Curry, E., Cudré-Mauroux, P. (eds.) CIKM 2020: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, 19–23 October, 2020, pp. 1015–1024. ACM (2020). https://doi.org/10.1145/3340531.3412013
Maia, M., et al. (eds.): Proceedings of WWW. ACM (2018). https://doi.org/10.1145/3184558
Ohsugi, Y., Saito, I., Nishida, K., Asano, H., Tomita, J.: A simple but effective method to incorporate multi-turn context with BERT for conversational machine comprehension. CoRR abs/1905.12848 (2019). http://arxiv.org/abs/1905.12848
Parisi, G.I., Kemker, R., Part, J.L., Kanan, C., Wermter, S.: Continual lifelong learning with neural networks: a review. Neural Networks. 113, 54–71 (2019). https://doi.org/10.1016/j.neunet.2019.01.012
Peters, M.E., et al.: Deep contextualized word representations. In: Walker, M.A., Ji, H., Stent, A. (eds.) Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, 1–6 June, 2018, Volume 1 (Long Papers), pp. 2227–2237. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/n18-1202
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training. In: Proceedings of Technical Report, OpenAI (2018). https://github.com/openai/finetune-transformer-lm
Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for SQuAD. In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, 15–20 July, 2018, Volume 2: Short Papers, pp. 784–789. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/P18-2124, https://www.aclweb.org/anthology/P18-2124/
Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: Su, J., Carreras, X., Duh, K. (eds.) Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, 1–4 November, 2016, pp. 2383–2392. The Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/d16-1264
Reddy, S., Chen, D., Manning, C.D.: CoQA: a conversational question answering challenge. Trans. Assoc. Comput. Linguist. 7, 249–266 (2019). https://transacl.org/ojs/index.php/tacl/article/view/1572
Sun, Y., Wang, S., Li, Y.: ERNIE: enhanced representation through knowledge integration. CoRR abs/1904.09223 (2019). http://arxiv.org/abs/1904.09223
Trischler, A., et al.: NewsQA: a machine comprehension dataset. In: Proceedings of the 2nd Workshop on Representation Learning for NLP (2017)
Wang, A., et al.: SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December, 2019, Vancouver, BC, Canada, pp. 3261–3275 (2019). https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May, 2019. OpenReview.net (2019). https://openreview.net/forum?id=rJ4km2R5t7
Yang, W., et al.: End-to-end open-domain question answering with BERTserini. In: Ammar, W., Louis, A., Mostafazadeh, N. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June, 2019, Demonstrations, pp. 72–77. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-4013
Yang, Z., Dai, Z., Yang, Y., Carbonell, J.G., Salakhutdinov, R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December, 2019, Vancouver, BC, Canada, pp. 5754–5764 (2019). https://proceedings.neurips.cc/paper/2019/hash/dc6a7e655d7e5840e66733e9ee67cc69-Abstract.html
Yang, Z., et al.: HotpotQA: a dataset for diverse, explainable multi-hop question answering. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October – 4 November, 2018, pp. 2369–2380. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/d18-1259
Acknowledgements
We would like to thank the reviewers for their helpful comments and suggestions to improve the quality of the paper. The authors gratefully acknowledge the financial support provided by the Basic Scientific Research Project (General Program) of Department of Education of Liaoning Province, the University-Industry Collaborative Education Program of the Ministry of Education of China (No.202002037015).
© 2021 Springer Nature Switzerland AG
Liu, Z., Lin, W., Shi, Y., Zhao, J. (2021). A Robustly Optimized BERT Pre-training Approach with Post-training. In: Li, S., et al. Chinese Computational Linguistics. CCL 2021. Lecture Notes in Computer Science(), vol 12869. Springer, Cham. https://doi.org/10.1007/978-3-030-84186-7_31
Print ISBN: 978-3-030-84185-0
Online ISBN: 978-3-030-84186-7