
1 Introduction

Pretrained Language Models (PLMs) represent a new paradigm in Natural Language Processing (NLP) [22]. As shown in Fig. 1, a PLM is a large neural network pre-trained on a large-scale text corpus through self-supervised learning, which acquires general linguistic and common-sense knowledge from the corpus independently of any specific downstream task. Pre-training can be regarded as a form of regularisation that prevents the model from overfitting small data [16]. After being pre-trained, a PLM needs to be fine-tuned for a specific downstream task.

Fig. 1. The training process of a language model

In early NLP tasks, low-dimensional dense vectors were often used to represent a language's syntactic or semantic features through various deep neural networks [32]. However, since deep neural networks usually have many parameters and the datasets used for training are limited, such models often overfit. Transfer learning can apply the knowledge learnt in a source domain to the learning task in a target domain [40], alleviating the pressure caused by limited manually annotated data. However, unlabelled data is far more plentiful than labelled data, so it is necessary to learn how to extract useful information from unlabelled data; the emergence of self-supervised and unsupervised learning addresses this need. The Transformer [41], a deep learning model built around the self-attention mechanism, was proposed to overcome the slow training and low efficiency of Recurrent Neural Networks (RNNs) [36] and to enable fast parallel computation. Since then, PLMs have entered a boom phase. Large-scale PLMs such as BERT [11] and GPT [33] have been hugely successful, and various improvements to them have been made to solve a wide range of NLP tasks.

Although PLMs are crucial to NLP tasks, there are few surveys that help researchers quickly understand various PLMs from different viewpoints and determine the appropriate ones for their specific NLP projects. To address this, we provide a survey of PLMs in this paper. We found only two surveys on PLMs through Google Scholar, and ours differs from both. The first, by Li et al. [22], concerns the general task definition, the mainstream architectures of PLMs for text generation, the usage of existing PLMs to model different input data and satisfy particular properties in the generated text, and several critical fine-tuning strategies for text generation. However, they did not discuss the mainstream PLMs one by one as we do in this paper. The second survey, by Qiu et al. [32], comprehensively reviews PLMs and, in particular, systematically categorises them. However, it was published in March 2020, so it does not cover PLMs released afterwards, notably the Chinese PLMs of 2020 and 2021. We therefore focus on the past two years, especially on Chinese PLMs.

The rest of this paper is organised as follows. Section 2 briefs three main machine learning methods for training PLMs. Section 3 recalls early PLMs that focus on word vectors. Section 4 reviews the second generation of PLMs, including ELMo, BERT, GPT, and their derivatives. Section 5 briefs several Chinese PLMs and compares them with several typical English PLMs. Section 6 lists the main NLP tasks for which PLMs can be used and gives an application example for each task. Finally, Sect. 7 summarises this paper and outlines future work.

2 Basic Machine Learning Methods for PLMs

This section will brief three machine learning methods used by PLMs: Long Short-Term Memory (LSTM) [19], the Attention Mechanism (AM) [6], and the Transformer [41].

2.1 Long Short-Term Memory

RNNs are often used to process sequence data for tasks such as machine translation and sentiment analysis, but they are short-term memory networks: when faced with a long data sequence, it is difficult to carry earlier information forward because RNNs suffer from vanishing gradients during backpropagation. LSTM is an improved RNN model. It adds an input gate, a forget gate and an output gate to control and retain information, which overcomes the limitation of short-term memory. The forget gate controls how much of the cell state at the previous moment is retained at the current moment. The input gate determines how much of the current input enters the cell state. Finally, the output gate controls how much of the cell state is used as the current output value of the LSTM. However, in both RNNs and LSTMs, much of the information carried by the input word vectors may still be lost for very long sequences.
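To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM time step; the stacked weight layout and the variable names are assumptions made for illustration rather than the configuration of any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the stacked parameters of the
    forget, input, output, and candidate transforms (4*hidden rows each)."""
    z = W @ x_t + U @ h_prev + b          # pre-activations for all gates
    h = h_prev.shape[0]
    f = sigmoid(z[0*h:1*h])               # forget gate: keep part of the old cell state
    i = sigmoid(z[1*h:2*h])               # input gate: admit new information
    o = sigmoid(z[2*h:3*h])               # output gate: expose part of the cell state
    g = np.tanh(z[3*h:4*h])               # candidate cell state
    c_t = f * c_prev + i * g              # updated cell state
    h_t = o * np.tanh(c_t)                # new hidden state (the LSTM output)
    return h_t, c_t
```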

2.2 Attention Mechanism

AM assigns a weight to each hidden state in the encoder and feeds the weighted sum of these hidden states to the decoder, so the model pays more attention to the inputs relevant to the current task. The AM acts between the encoder and the decoder. When RNNs are integrated with the attention mechanism, they can predict a particular part of the output sequence while focusing on a specific part of the input sequence, generating higher-quality output. Thus, Yu et al. [49] integrated LSTM with an AM and a bidirectional LSTM for a Chinese question answering system, which overcomes the difficulties caused by Chinese grammar, semantics, and lexical limitations in the Chinese question answering dataset. The AM+LSTM model retains the intermediate outputs of the LSTM encoder on the input sequences and then trains the model to selectively attend to these inputs when producing the output sequences.

Later on, the self-AM was proposed [41]. The self-AM acts within the encoder or the decoder and can connect distant words in the same sentence. General embedding methods such as Word2Vec need context to disambiguate word senses and lose the sequence information of the sentence; self-attention can effectively address these problems. Moreover, the Transformer replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-head self-attention. Multi-head attention attends to information from different representation subspaces at different positions, leading to a dramatic improvement in training speed [24, 27].
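To make the weighted-summation idea behind attention and its multi-head variant concrete, the following is a minimal NumPy sketch of scaled dot-product self-attention with several heads; the random projection matrices and the absence of masking are simplifying assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (heads, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V                                  # weighted sum of the values

def multi_head_self_attention(X, n_heads, rng=np.random.default_rng(0)):
    """Project X into per-head Q, K, V, attend, then concatenate the heads."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    def split(M):  # (seq, d_model) -> (heads, seq, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    out = scaled_dot_product_attention(Q, K, V)         # (heads, seq, d_head)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

# Toy usage: 6 tokens with 64-dimensional embeddings and 4 heads.
print(multi_head_self_attention(np.random.default_rng(1).standard_normal((6, 64)), 4).shape)
```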

Fig. 2. Transformer architecture used by GPT and GPT-2

2.3 Transformer

The Transformer [41] stacks multiple encoders and decoders. Each encoder block contains a self-attention layer and a feed-forward neural network, while each decoder block additionally contains an encoder-decoder attention layer between the two. The advantage of the Transformer is that it avoids the slow and inefficient training of RNN models and uses self-attention to enable fast parallel computation. Moreover, it exploits the representational capacity of Deep Neural Networks (DNNs) to improve the efficiency and performance of the model. After the Transformer was proposed, PLMs entered a boom phase. Figure 2 shows the Transformer architectures used by GPT [33] and GPT-2 [34].
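For readers who want to experiment, the sketch below builds a small encoder stack with PyTorch's built-in Transformer layers; the dimensions (d_model=512, 8 heads, 6 layers) mirror common defaults and are assumptions, not the decoder-only configuration used by GPT or GPT-2.

```python
import torch
import torch.nn as nn

# A small encoder stack in the spirit of the original Transformer.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                           dim_feedforward=2048, dropout=0.1)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# A batch of 2 sequences, each 10 tokens long, already embedded into 512 dims.
x = torch.randn(10, 2, 512)   # (sequence length, batch, d_model) by default
out = encoder(x)              # contextualised representations, same shape
print(out.shape)              # torch.Size([10, 2, 512])
```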

3 Early PLMs

From 2013 to 2021, PLMs have been upgraded year by year. As early as 2013, Mikolov et al. [28] proposed the first PLM, called Word2Vec, which generates word vectors: given a corpus, the trained model maps each word to a vector quickly and effectively. After Word2Vec trains the word vectors, each word has a fixed-dimensional vector corresponding to its semantics. Word2Vec is also widely used as a word embedding model in sentiment analysis, with excellent performance [1]. Word2Vec offers two algorithms for generating word vectors, Skip-Gram (SG) and Continuous Bag of Words (CBoW). SG performs well on small training data, while CBoW is very efficient on large training data and achieves higher accuracy than SG for frequent words.
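A minimal sketch of training Word2Vec with Gensim (assuming Gensim 4.x) is shown below; the toy corpus and hyperparameters are illustrative only, and the `sg` flag switches between the CBoW and Skip-Gram algorithms.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences (real corpora are far larger).
sentences = [["the", "door", "is", "open"],
             ["the", "farm", "is", "in", "the", "open", "countryside"]]

# sg=1 selects Skip-Gram; sg=0 (the default) selects CBoW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=20)

vec = model.wv["open"]                 # one fixed vector per word, regardless of context
print(vec.shape)                       # (50,)
print(model.wv.most_similar("open"))   # nearest neighbours in the embedding space
```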

Later on, Pennington, Socher, and Manning [30] proposed GloVe to overcome the shortcomings of Word2Vec, whose vector dimensionality is low and which cannot fully exploit the statistics of the corpus. Moreover, GloVe generalises better than Word2Vec in the word embedding process. GloVe combines the matrix factorisation method with the shallow-window-based method, so it can capture both local and global information about specific words in the corpus, which is necessary for improving its performance.

Both Word2Vec and GloVe map each input word to a fixed-dimensional vector representation, so the generated word vectors are context-independent. Thus, they cannot handle linguistic phenomena such as polysemy. For example, "open" has different meanings in "the door is open" and "the farm is in the open countryside", so it is unreasonable for its word vectors in the two sentences to be the same. Moreover, since downstream tasks use only the resulting word vectors, the models themselves are no longer needed, which keeps the computational cost low.
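This context independence can be seen by loading pre-trained GloVe vectors through the gensim-data downloader; the `glove-wiki-gigaword-100` package name is an assumption about what that catalogue provides, and the word "open" receives a single static vector no matter which sentence it appears in.

```python
import gensim.downloader as api

# Downloads 100-dimensional GloVe vectors trained on Wikipedia + Gigaword.
glove = api.load("glove-wiki-gigaword-100")

print(glove["open"].shape)                 # (100,) -- one static vector per word
print(glove.most_similar("open", topn=5))  # neighbours reflect all senses at once
```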

4 Second Generation of PLMs

This section will review representative PLMs of the second generation.

4.1 ELMo Model

To address polysemy and model complex context, Peters et al. [31] proposed ELMo (Embeddings from Language Models) in 2018. ELMo learns word vectors from the internal states of a deep bidirectional language model: it extracts embeddings from a bidirectional LSTM pre-trained on a sizeable unsupervised corpus, and the resulting embeddings are a weighted combination of the internal layers that can easily be added to existing models. For a downstream task, ELMo extracts the word representations produced by each layer of the pre-trained network and adds them to the task model as new embeddings. It is thus a typical feature-based PLM. ELMo performs very well on many NLP tasks in different domains [3, 15, 47].
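The weighted combination of internal layers can be written as a task-specific weighted sum; the NumPy sketch below illustrates that formula with random placeholder activations rather than a real pre-trained bidirectional language model, and the layer count and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder activations from a 2-layer biLM plus the token layer:
# shape (num_layers, sequence_length, hidden_size).
layer_outputs = rng.standard_normal((3, 12, 1024))

# Task-specific softmax-normalised layer weights s_j and scalar gamma,
# which would normally be learned together with the downstream model.
s = np.exp([0.2, 1.0, 0.5]); s /= s.sum()
gamma = 1.0

# ELMo-style embedding: gamma * sum_j s_j * h_j, giving one vector per token.
elmo_embeddings = gamma * np.tensordot(s, layer_outputs, axes=1)
print(elmo_embeddings.shape)   # (12, 1024)
```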

4.2 BERT Family

ELMo is built from one-way language models, and its ability to model semantic information is limited. To remove these limitations, at the end of 2018 Google AI launched the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) [11], which uses Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) for deep bidirectional joint training. The task of MLM is to randomly erase one or several words in a given sentence and predict the erased words from the remaining ones. Of the selected words, 80% are replaced with the [MASK] token, 10% are replaced with a random word, and 10% are left unchanged, which helps the model grasp contextual information. Figure 3 shows the MLM process. Moreover, BERT uses NSP to capture the relationship between sentences [11]. Its performance, ease of use and versatility surpass many models: the difference between BERT's pre-training and its training for a specific downstream task lies only in the top output layer, so it can be used for many tasks. BERT has achieved significant improvements on 11 basic NLP tasks.

Fig. 3. Masked LM
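The masked-word prediction that MLM learns can be tried directly with a pre-trained BERT through the Hugging Face `fill-mask` pipeline; the model name and example sentence below are illustrative choices, not part of the original BERT paper.

```python
from transformers import pipeline

# BERT predicts the token hidden behind [MASK] from its bidirectional context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The door is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```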

The emergence of BERT has greatly promoted the development of the NLP field, and researchers have since proposed many improved models based on it. RoBERTa [23] uses a larger dataset, changes static masking to dynamic masking, and drops the NSP task. ALBERT [21] shares parameters across layers, which significantly reduces the number of parameters, factorises the embedding matrix into two smaller matrices, and replaces BERT's NSP task with SOP (Sentence-Order Prediction). XLNet [48] uses permutation language modelling, which captures bidirectional context, and has clear advantages over BERT in text generation tasks with long document inputs. ELECTRA [7] replaces the MLM in BERT with RTD (Replaced Token Detection), which removes the mismatch caused by the [MASK] token appearing in pre-training but not in fine-tuning. ELECTRA outperforms BERT given the same computing power, data, and model parameters, and it also outperforms RoBERTa and XLNet given the same amount of computation.

4.3 GPT Family

Although unlabelled text corpora are abundant, there is very little labelled data for learning specific tasks. To address this issue, in 2018, Radford et al. [33] proposed the GPT (Generative Pre-trained Transformer) model. Its training includes two stages. The first stage is unsupervised pre-training, which learns a high-capacity language model on a large text corpus. The unsupervised pre-training in GPT mainly uses the Transformer's decoder layers; Fig. 2(a) shows the process. First, the sum of the word embedding and the position embedding is fed in. Then, after 12 Transformer layers, the predicted vector and the vector of the last word are obtained, and the vector of the last word is used as the input for subsequent fine-tuning. The second stage is supervised fine-tuning, which adapts the model to discriminative tasks with labelled data. GPT is the first model to integrate the modern Transformer architecture with a self-supervised pre-training objective [45]. It surpassed previous models in various assessments of natural language inference, classification, question answering, and semantic similarity.

In 2019, Radford et al. [34] proposed GPT-2, which is trained to predict the next word in a sentence. GPT-2 can answer questions, summarise texts, and translate texts without training in a specific field. It still uses the one-way Transformer of GPT with simple adjustments: Fig. 2(b) shows the smallest GPT-2 model, which moves layer normalisation to the front of each sub-block and adds a layer normalisation after the final self-attention block. The training data of GPT-2 is greatly improved in quantity, quality, and breadth, but the number of network parameters has also increased, with the largest GPT-2 reaching 48 layers. As a result, GPT-2 performs well both in zero-shot settings (especially on small datasets) and on long texts that require long-distance dependencies.
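Zero-shot text continuation with a pre-trained GPT-2 can likewise be sketched with the `text-generation` pipeline; the prompt and generation parameters here are arbitrary examples rather than settings from the GPT-2 paper.

```python
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # make the sampled continuation reproducible

outputs = generator("Pretrained language models are",
                    max_length=40, num_return_sequences=2)
for out in outputs:
    print(out["generated_text"])
```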

In 2020, Brown et al. [5] proposed GPT-3, which has 175 billion parameters. GPT-3 performs excellently on many NLP datasets, covering translation, question answering, and text filling. It is particularly strong at text generation, producing text that is almost indistinguishable from human-written text. Although GPT-3 made significant progress, it does not follow users' real intentions very well and often produces untruthful, harmful, or unhelpful outputs. To remove this flaw, OpenAI used reinforcement learning from human feedback to fine-tune GPT-3; the resulting model is called InstructGPT [29]. The three main steps of its training process are: 1) perform supervised learning on a manually written demonstration dataset, 2) train a reward model (RM) on human comparisons of model outputs, and 3) use the RM as the reward function for reinforcement learning. After over a year of testing, experiments show that although InstructGPT still makes simple errors, compared with GPT-3 it reduces harmful output and significantly improves the ability to follow user intentions.

5 Chinese PLM

This section will discuss important Chinese PLMs.

5.1 PLMs from IFLYTEK and Harbin Institute of Technology

In 2019, Cui et al. [9] of IFLYTEK and Harbin Institute of Technology released the Chinese PLM BERT-wwm, based on whole word masking: for Chinese, if part of a word is masked, the other characters of the same word are masked as well. Their experiments show that BERT-wwm outperforms BERT on various Chinese NLP tasks. They also increased the amount of training data and the number of training steps to upgrade BERT-wwm to BERT-wwm-ext. In the same year, they released a series of further Chinese PLMs based on BERT, such as Chinese RoBERTa [26], which achieved good experimental results. In 2020, Cui et al. [8] trained the Chinese PLM XLNet-mid on a large-scale Chinese corpus using the open-source XLNet code. It surpassed BERT-wwm and BERT-wwm-ext on most NLP tasks and achieved significant performance improvements on machine reading comprehension tasks.
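Several of these Chinese checkpoints are distributed through the Hugging Face hub; the sketch below loads one of them, assuming the commonly used `hfl/chinese-bert-wwm-ext` identifier corresponds to the release described above.

```python
from transformers import AutoTokenizer, AutoModel

# Whole-word-masking Chinese BERT released by the HFL group (assumed model id).
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")
model = AutoModel.from_pretrained("hfl/chinese-bert-wwm-ext")

inputs = tokenizer("预训练语言模型", return_tensors="pt")   # "pretrained language model"
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, sequence length, 768)
```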

5.2 ERNIE Family from Baidu

In 2019, Zhang et al. [51] at Baidu released the Chinese PLM ERNIE, whose general semantic representation ability is greatly enhanced by uniformly modelling the grammatical structure, lexical structure, and semantic information of its training data. Moreover, they use a higher-quality Chinese corpus, making ERNIE more effective on Chinese NLP tasks. Their experiments show that ERNIE surpassed BERT on 5 Chinese NLP tasks. In December 2019, ERNIE topped the leaderboard of GLUE (General Language Understanding Evaluation), an authoritative benchmark in NLP. In 2020, Sun et al. [39] released ERNIE 2.0, which extracts more valuable information from the training corpus through continual multi-task learning. Their experiments show that ERNIE 2.0 outperforms BERT and XLNet on 16 tasks, including the English tasks of the GLUE benchmark and several similar tasks in Chinese. In 2021, Sun et al. [38] released ERNIE 3.0, which integrates autoregressive and autoencoder networks through a universal semantic layer and task-specific layers. Once the pre-training of the universal semantic layer is completed, it is not updated any more; only the task-specific layers are fine-tuned for downstream tasks, significantly improving efficiency. Their experiments show that ERNIE 3.0 outperforms state-of-the-art models on 54 Chinese NLP tasks.

5.3 TinyBERT from Huawei

Large-scale PLMs such as BERT have huge numbers of parameters and complex computation, making it challenging to deploy them on edge devices with limited computing power and memory. To this end, many model compression techniques have been proposed, mainly including quantisation [17], weight pruning [18], and knowledge distillation [35]. In 2019, Huawei proposed TinyBERT [20], which uses a new knowledge distillation method to perform Transformer distillation in both the pre-training and the task-specific learning stages. This two-stage learning framework enables TinyBERT to acquire both the general knowledge and the task-specific knowledge of the "teacher" BERT. The results show that although TinyBERT is only 13.3% of the size of BERT, its inference is 9.4 times faster, and its performance on the GLUE benchmark is comparable to BERT's.
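The core of knowledge distillation, matching the student's softened output distribution to the teacher's, can be sketched in a few lines of PyTorch; this is a generic soft-label loss, not the layer-wise Transformer distillation objective actually used by TinyBERT, and the temperature and mixing weight are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine a soft loss against the teacher with the usual hard-label loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a batch of 4 examples and 3 classes.
student = torch.randn(4, 3, requires_grad=True)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student, teacher, labels))
```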

5.4 WuDao Family from BAAI

In March 2021, the Beijing Academy of Artificial Intelligence (BAAI) released the large-scale Chinese PLM WuDao 1.0, also called WenHui. It replaces the Transformer module in GPT with Transformer-XL [10], which generates more human-like text and better maintains content consistency. It can also learn concepts across different modalities, overcoming the limitation that large-scale self-supervised PLMs do not possess such cognitive capabilities. Two months later, BAAI released WuDao 2.0 with 1.75 trillion parameters, China's first trillion-parameter PLM, with ten times as many parameters as GPT-3. WuDao 2.0 can be applied not only to text but also to the visual domain: it can generate pictures from text and retrieve text from pictures. WuDao 2.0 achieved first place on 9 benchmarks in terms of precision.

Table 1. Comparison of the characteristics of some pre-trained models
Table 2. Performance comparison of some pretrained models

5.5 PLUG from Alibaba Dharma Academy

In April 2021, Alibaba Dharma Academy released the world's largest Chinese text PLM, PLUG (Pre-training for Language Understanding and Generation). Building on the strengths of their NLU (Natural Language Understanding) model StructBERT [44] and their NLG (Natural Language Generation) model PALM [33], they jointly train the NLU and NLG components of PLUG. The joint training makes PLUG understand an input text better and generate more relevant content accordingly. It is also trained on more than 1 TB of high-quality Chinese data and set a new record of 80.614 on the Chinese GLUE language understanding tasks. PLUG performs excellently in long text generation such as novel creation, poetry generation, and intelligent question answering. Its goal is to surpass humans on various Chinese NLP tasks.

5.6 Comparison of Some Chinese and English PLMs

Table 1 and Table 2 compare some Chinese and English PLMs on the GLUE/CLUE, MRPC, and SST-2 datasets. GLUE is a benchmark for evaluating and analysing the performance of models across a range of existing NLU tasks [42], and CLUE is a Chinese NLU evaluation benchmark [46]. The SST-2 (Stanford Sentiment Treebank v2) dataset consists of movie-review sentences with binary sentiment labels, drawn from a treebank of 215,154 annotated phrases [37]. MRPC (Microsoft Research Paraphrase Corpus), introduced by Dolan et al. [13], is a corpus of 5,801 sentence pairs collected from newswire articles.
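The GLUE tasks mentioned in the tables can be pulled down with the Hugging Face `datasets` library; the snippet below loads SST-2 and MRPC as an example (CLUE is distributed separately, so it is not shown here).

```python
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")   # binary sentiment classification
mrpc = load_dataset("glue", "mrpc")   # paraphrase detection on sentence pairs

print(sst2["train"][0])   # {'sentence': ..., 'label': ..., 'idx': ...}
print(mrpc["train"][0])   # {'sentence1': ..., 'sentence2': ..., 'label': ..., 'idx': ...}
```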

6 Practical Applications of Pretrained Models

This section will briefly review the main NLP tasks to which PLMs can be applied.

6.1 Sentiment Analysis

During the COVID-19 pandemic, it is critical to identify the characteristics of negative public sentiment and provide scientific guidance to alleviate public concerns. To analyse the sentiment of online reviews more accurately, Wang et al. [43] first use an unsupervised BERT model to classify the sentiment of the collected texts and then use the TF-IDF algorithm to extract text topics. The accuracy of this method outperforms all the baseline NLP algorithms.
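A rough analogue of this two-step pipeline, BERT-based sentiment labels followed by TF-IDF keyword extraction, can be sketched as follows; the default classifier and the toy reviews are placeholders, not the configuration used by Wang et al.

```python
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["The new measures are reassuring and well explained.",
           "I am worried the restrictions will never end."]

# Step 1: label each review with a pre-trained sentiment classifier.
classifier = pipeline("sentiment-analysis")
labels = classifier(reviews)

# Step 2: extract the highest-weighted TF-IDF terms as rough "topics".
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(reviews)
terms = vectorizer.get_feature_names_out()
for review, label, row in zip(reviews, labels, tfidf.toarray()):
    top_terms = [terms[i] for i in row.argsort()[::-1][:3]]
    print(label["label"], top_terms)
```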

6.2 Named Entity Recognition

To improve the accuracy of biomedical named entity recognition in low-resource languages and the efficiency of text reading, Boudjellal et al. [4] proposed a model named ABioNER based on BERT. They first pre-trained AraBERT on a general-domain Arabic corpus and a corpus of Arabic biomedical literature and then fine-tuned it using a single NVIDIA GPU. Their results demonstrate that building a monolingual BERT model on small-scale biomedical data can improve understanding of biomedical-domain data.
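Named entity recognition with a fine-tuned BERT can be exercised through the token-classification pipeline; the general-purpose English model identifier below is an assumption chosen for illustration and stands in for ABioNER, which targets Arabic biomedical text.

```python
from transformers import pipeline

# A generic English NER pipeline; ABioNER itself is an Arabic biomedical model.
ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")

for entity in ner("Aspirin was first synthesised at Bayer in Germany."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```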

6.3 Summarisation

Liu, Wu, and Luo [25] proposed a method for summarising legal case documents. They first extract five key components of a legal case document, turning the summarisation problem into five text compression and integration problems for sentences of five different categories. They then fine-tune a GPT-2 model for each key component, use the five fine-tuned models to compress and integrate the text of each component, and finally concatenate the five component summaries to obtain the summary of the entire legal case document. Extensive experiments confirm the effectiveness of their approach.

6.4 Question Answering

Current BERT-based question answering systems suffer from several problems: such a system may return wrong answers or no answer at all, cannot aggregate questions, and considers only text content while ignoring the relationships between entities in the corpus, so it may be unable to validate its answer to a question. To address these issues, Do and Phan [12] developed a question answering system based on BERT and a knowledge graph. They used BERT to build two classifiers: (1) a BERT-based text classifier for content information and (2) a BERT-based triple classifier for link information. Their experiments show that their method significantly outperforms state-of-the-art methods in accuracy and execution time.
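For comparison with such knowledge-graph-augmented systems, a plain extractive BERT-style question answering baseline looks like the sketch below; it answers from a given passage only, uses the pipeline's default SQuAD-fine-tuned model, and has none of the link-information classifiers described above.

```python
from transformers import pipeline

qa = pipeline("question-answering")   # defaults to a SQuAD-fine-tuned model

result = qa(question="What does MLM randomly erase?",
            context="The task of MLM is to randomly erase one or several words "
                    "in a given sentence and predict them from the remaining words.")
print(result["answer"], round(result["score"], 3))
```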

6.5 Machine Translation

Zhang et al. [50] proposed a BERT-based method for machine translation called BERT-JAM, which has the following features. First, BERT-JAM fuses BERT's multi-layer representations into an overall representation that the neural machine translation model can use. Second, it dynamically integrates the BERT representation with the encoder/decoder representations. Third, it is fine-tuned with a three-phase optimisation strategy that gradually unfreezes different components to overcome catastrophic forgetting during fine-tuning. Their experiments show that BERT-JAM achieves state-of-the-art performance on multiple translation tasks.

7 Conclusions

Even before fine-tuning, PLMs already perform very well; after fine-tuning, their performance is even better, and the fine-tuned models converge well. PLMs have therefore been used for many NLP tasks [2, 11, 14, 48]. This paper provides a survey of PLMs to help researchers quickly understand various PLMs and determine which ones are appropriate for their specific NLP projects. Specifically, we brief the main machine learning methods used by PLMs and review early PLMs, the main state-of-the-art PLMs, and several well-known Chinese PLMs. Moreover, we compare the performance of some mainstream PLMs. In addition, we list the main NLP tasks for which PLMs have been used and review some state-of-the-art work on each of them.

Although the emergence and application of PLMs have driven the rapid development of many NLP tasks, PLM technology still faces many challenges due to the complexity of natural language. First, the performance of PLMs is far from its upper limit: longer training and larger datasets could still improve it. Secondly, fine-tuning is required when applying a PLM to downstream tasks, but this fine-tuning is task-specific, which may result in low efficiency. Moreover, when applied in specialised fields such as biomedicine and law, PLMs may learn and amplify the biases of domain-specific datasets; for example, a PLM may generate biases against certain age groups or genders. Finally, there are many different languages and many ways of expressing linguistic information, so a single pre-trained language model cannot fully meet people's needs. Hence, multi-lingual PLMs and multi-modal PLMs have become a particular focus of attention, and improving their performance is vital for meeting various needs now and in the future.