
1 Introduction

The rate of adoption of NLP applications by companies and customers is increasing rapidly, mostly due to the progress made by deep learning (DL) and transformer-based pre-trained language models (LMs) [21]. Some of these LMs can even be used and personalized directly, without any knowledge of machine learning or coding.

The field of NLP contains many tasks, and new tasks are proposed each year by the NLP research community. In deep learning-based NLP, some tasks are more studied than others. In the last couple of years, DL transformer-based LMs have achieved state-of-the-art performance on the majority of NLP tasks.

The field of NLP does not have a universal evaluation metric that can be used to evaluate the performance of new models on every task, but rather a variety of metrics, such as BLEU [19] and ROUGE [15], among others. These metrics are used for specific tasks: BLEU is used in machine translation (MT), for example, and ROUGE for summarization. However, when we want to evaluate the generalization of an LM on multiple tasks at once, we are confronted with a major problem, which is the lack of a universal, unique metric for all, or at least a subset of, NLP tasks. This is one of NLP's open challenges [10] and is attracting more research in recent years. The study of this problem is the core of this paper, where we provide an overview of evaluation metrics and multitask NLP benchmarks, along with our proposed octaNLP benchmarking approach for comparing the generalization capabilities of DL transformer-based language models.

This paper is organized as follows. Section 2 overviews the most used NLP evaluation metrics. Section 3 describes the available multitask NLP benchmarks. In Sect. 4, we overview the most important DL transformer-based pre-trained LMs. In Sect. 5, we discuss the limitations of the available multitask NLP benchmarks and propose our octaNLP benchmark for comparing the generalization performance of transformer-based pre-trained LMs on multiple downstream tasks simultaneously. Finally, we conclude the paper.

2 Evaluation Metrics in NLP

In the field of NLP, there is no single metric that can be used to evaluate the performance of a system on all NLP tasks, but rather a set of metrics that are used depending on the task. In the case of classification, for example, the accuracy metric can be used, which indicates the percentage of correct classifications. Other metrics can also be used for classification, like F1, exact match, and the Matthews correlation coefficient [17]. These classification metrics are not specific to NLP but are used in a wide range of areas and disciplines (a short sketch of computing them is given at the end of this section). On the other hand, there are metrics that are specific to NLP; the most used ones are listed below:

BLEU: The bilingual evaluation understudy (BLEU) [19] is an automatic metric that was initially defined to evaluate machine translation (MT) systems. However, it is now also used in other Natural Language Generation (NLG) tasks, such as summarization and dialogue. The BLEU score compares a candidate translation to one or more reference translations. The score ranges between 0 and 1, with 1 indicating a perfect translation. BLEU has several strong advantages: it is automatic, language independent, and has been shown to correlate highly with human judgment.
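As a minimal sketch, a sentence-level BLEU score can be computed with NLTK as follows; the candidate and reference sentences are toy examples, and smoothing is applied because short sentences would otherwise yield zero counts for higher-order n-grams.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy example: one candidate translation and two reference translations (tokenized)
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
candidate = "the cat sat on the mat".split()

# Smoothing avoids a zero score when higher-order n-grams have no matches
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")  # value between 0 and 1
```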

ROUGE: Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [15] is a set of metrics used to evaluate automatic summarization or machine translation systems. ROUGE metrics compare a candidate summary or translation to one or more reference summaries or translations.
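A minimal sketch of ROUGE computation, assuming the rouge-score package (one of several available ROUGE implementations); the texts are invented.

```python
from rouge_score import rouge_scorer

reference = "the cat was found under the bed"
candidate = "the cat was under the bed"

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence)
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(name, f"P={result.precision:.2f} R={result.recall:.2f} F={result.fmeasure:.2f}")
```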

In Table 1, we list the most used metrics in NLP along with their associated tasks.

Table 1 The most used metrics in NLP along with their associated tasks
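For illustration, the general-purpose classification metrics mentioned at the beginning of this section (accuracy, F1, and the Matthews correlation coefficient) can be computed with the scikit-learn library; the labels below are invented for the sake of the example.

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Hypothetical gold labels and system predictions for a binary classification task
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))     # fraction of correct predictions
print("F1      :", f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print("MCC     :", matthews_corrcoef(y_true, y_pred))  # Matthews correlation coefficient [17]
```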

3 Multitask NLP Benchmarks

decaNLP: The Natural Language Decathlon (decaNLP) [18] benchmark was introduced in 2018. The goal of this benchmark is to evaluate single models that can generalize to many different NLP tasks simultaneously. The tasks included in the benchmark are semantic parsing, natural language inference, question answering, document summarization, machine translation, sentiment analysis, semantic role labeling, goal-oriented dialogue, pronoun resolution, and relation extraction. All these tasks are framed as question answering problems and are trained jointly. All training instances are in the form of (question, context, answer) triplets. To be able to evaluate the generalization of NLP models across all tasks simultaneously, the creators of decaNLP defined their own score, called the decaScore, which is simply the sum of the scores of all tasks. The creators of decaNLP also provided and evaluated three baseline models: a pointer-generator sequence-to-sequence (S2S) model [24], an S2S model augmented with self-attentive encoder and decoder layers [26], and an S2S model augmented with a coattention mechanism [32]. In addition to the three baseline models, the creators of decaNLP built their own model, which they called the multitask question answering network (MQAN). MQAN learns all decaNLP tasks jointly and does not require any task-specific modules or parameters. This model achieved improved performance on the majority of decaNLP tasks.
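As an illustration of how decaNLP frames every task and aggregates results, the triplet format and the decaScore (a plain sum of the per-task scores) could be expressed as follows; the task names and score values are placeholders, not actual decaNLP results.

```python
# Every decaNLP training instance is a (question, context, answer) triplet
example = {
    "question": "What is the summary?",
    "context": "A long news article to be summarized ...",
    "answer": "A short summary of the article.",
}

# Hypothetical per-task scores (each task uses its own metric, e.g. F1, accuracy, BLEU, ROUGE)
task_scores = {
    "question_answering": 70.1,
    "machine_translation": 25.3,
    "summarization": 19.8,
    # ... one entry per decaNLP task (10 in total)
}

# The decaScore is simply the sum of all per-task scores
deca_score = sum(task_scores.values())
print("decaScore:", deca_score)
```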

GLUE: Similar to decaNLP, the General Language Understanding Evaluation (GLUE) [28] benchmark aims to drive research in general NLP models that can generalize well to a variety of different tasks. However, the scope of GLUE is more limited than decaNLP, because GLUE is only concerned with Natural Language Understanding (NLU) tasks. These tasks, along with their associated datasets and metrics, are listed in Table 2. To evaluate the general performance of NLP models across all tasks, GLUE defines a single score, which is simply the average score over all tasks, with every task weighted equally. For tasks with multiple metrics, the benchmarking algorithm of GLUE first averages those metrics to obtain a single task score (a minimal sketch of this aggregation is given after Table 2). Since its release, a large number of models have been tested on the benchmark, especially transformer-based pre-trained language models. Recent models have surpassed human performance on GLUE for the majority of its tasks.

Table 2 GLUE tasks along with their associated datasets and metrics
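A minimal sketch of the GLUE aggregation scheme described above, assuming hypothetical per-task metric values; the official score is computed by the GLUE evaluation server.

```python
# Hypothetical per-task results; tasks with multiple metrics are averaged internally first
glue_results = {
    "CoLA": {"matthews_corr": 60.0},
    "SST-2": {"accuracy": 94.0},
    "MRPC": {"accuracy": 88.0, "f1": 91.0},          # multi-metric task: averaged first
    "STS-B": {"pearson": 89.0, "spearman": 88.5},    # multi-metric task: averaged first
}

task_scores = [sum(metrics.values()) / len(metrics) for metrics in glue_results.values()]
glue_score = sum(task_scores) / len(task_scores)     # equal weight per task
print(f"GLUE score: {glue_score:.1f}")
```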

SuperGLUE: Like GLUE, the SuperGLUE [27] benchmark aims to evaluate general NLP models on a variety of tasks simultaneously. This benchmark was introduced after recent models surpassed human performance on the majority of GLUE tasks, which made GLUE no longer suitable for tracking progress towards general NLU models. SuperGLUE differs from GLUE in that it contains more difficult and challenging NLU tasks with more diverse task formats. SuperGLUE adopts the same scoring philosophy as GLUE, weighting each task equally and averaging all task scores to provide a single general score. The tasks used in SuperGLUE, along with their associated datasets and metrics, are listed in Table 3.

SentEval: SentEval [4] is a benchmark and a toolkit for evaluating the quality of universal general-purpose sentence representations. The goal of this benchmark is to drive research in finding sentence representations that yield good results when applied to a variety of different downstream NLP tasks. SentEval contains a diverse set of tasks including binary and multi-class classification, entailment and semantic relatedness, Semantic Textual Similarity (STS), paraphrase detection, caption-image retrieval, and sentiment analysis.

Table 3 SuperGLUE tasks along with their associated datasets and metrics

4 Transformer-Based Pre-trained Language Models

In this section, we provide a short overview of the most important transformer-based pre-trained language models. Most of these models are based on BERT, one of the first and most influential transformer-based pre-trained language models. Figure 1 shows the models that were derived from BERT, along with what was added to them.

BERT [6]: a pre-trained model based on the transformer architecture. BERT is designed to learn deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. It is also pre-trained on a next sentence prediction task to learn sentence relationships.
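As a brief illustration of how such a pre-trained model can be used off the shelf, the sketch below queries BERT's masked language modelling head through the Hugging Face transformers library (assuming the library is installed); the input sentence is arbitrary.

```python
from transformers import pipeline

# Load the pre-trained BERT masked language model (weights are downloaded on first use)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token from both the left and the right context
for prediction in fill_mask("The goal of NLP is to make computers understand [MASK] language."):
    print(prediction["token_str"], round(prediction["score"], 3))
```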

SemBERT [34]: This model is capable of explicitly absorbing contextual semantics over a BERT backbone. SemBERT keeps the convenient usability of its BERT precursor, with light fine-tuning and without substantial task-specific modifications. Compared with BERT, SemBERT is just as simple in concept but more powerful. It obtains new state-of-the-art results or substantially improves results on ten reading comprehension and language inference tasks.

StructBERT [29]: This model was built by incorporating language structures into pre-training. Specifically, it is trained with two auxiliary tasks that make the most of the sequential order of words and sentences, leveraging language structures at the word and sentence levels, respectively.

ALBERT [12]: This model introduces two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT: splitting the embedding matrix into two smaller matrices (factorized embedding parameterization) and sharing repeated layers across groups (cross-layer parameter sharing).
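The effect of the factorized embedding can be illustrated with a quick parameter count, using the ALBERT paper's notation (vocabulary size V, embedding size E, hidden size H); the concrete configuration in the comment is only an example.

```latex
% Embedding-layer parameter count, before and after ALBERT's factorization
\underbrace{O(V \times H)}_{\text{BERT-style embedding}}
\;\longrightarrow\;
\underbrace{O(V \times E + E \times H)}_{\text{ALBERT factorized embedding}}, \qquad E \ll H
% e.g. V = 30000, H = 4096, E = 128:
% 30000 \times 4096 \approx 123\text{M} parameters vs. 30000 \times 128 + 128 \times 4096 \approx 4.4\text{M}
```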

ELECTRA [3]: a pre-training approach that trains two transformer models: a generator and a discriminator. The generator's role is to replace tokens in a sequence, and it is therefore trained as a masked language model. The discriminator, which is the model used for downstream tasks, tries to identify which tokens in the sequence were replaced by the generator.
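The replaced-token-detection objective can be illustrated with a toy sketch in which the generator's proposals are faked with random choices; no actual neural networks are involved, the point is only how the discriminator's labels are derived.

```python
import random

original = "the quick brown fox jumps over the lazy dog".split()

# Generator: a masked language model proposes replacements at some masked positions
# (here faked with random choices purely for illustration)
masked_positions = [1, 6]
corrupted = list(original)
for pos in masked_positions:
    corrupted[pos] = random.choice(["quick", "fast", "the", "a"])

# Discriminator targets: 1 if a token differs from the original, 0 otherwise
labels = [int(tok != orig) for tok, orig in zip(corrupted, original)]
print(list(zip(corrupted, labels)))
```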

T5 [22]: an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, in which each task is converted into a text-to-text format.
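A minimal sketch of this text-to-text interface through the Hugging Face transformers library (assuming transformers and sentencepiece are installed); the task prefix and input sentence are illustrative.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is expressed as text-to-text: a task prefix selects the behaviour
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```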

BART [14]: This model combines bidirectional and auto-regressive transformers. It is a denoising autoencoder built with a sequence-to-sequence model that can tackle a wide range of NLP tasks, from NLU to NLG, and it is particularly effective when fine-tuned for text generation. BART achieved new state-of-the-art results on numerous tasks such as dialogue, question answering, and summarization.
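For instance, a BART checkpoint fine-tuned for summarization can be used as follows; this is a sketch assuming the publicly available facebook/bart-large-cnn checkpoint, and the input text is a placeholder.

```python
from transformers import pipeline

# BART fine-tuned on CNN/DailyMail for abstractive summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "A long news article whose content we would like to condense ..."
summary = summarizer(article, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```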

Fig. 1 Models that were derived from BERT, along with what was added to them

5 Our OctaNLP Benchmarking Approach

In Sect. 3, we reviewed the available multitask NLP benchmarks. We saw that decaNLP includes 10 diverse tasks with different evaluation metrics, such as F1, accuracy, BLEU, and ROUGE. The variety of tasks and metrics makes decaNLP a well-suited benchmark for evaluating the generalization of NLP models. However, decaNLP was released before BERT, one of the first transformer-based pre-trained language models. Therefore, it is not known whether the benchmark is compatible with those kinds of models. At the time of writing, and to the best of our knowledge, no transformer-based pre-trained language model has been tested on decaNLP.

As for GLUE and SuperGLUE, we saw in the same section that these two benchmarks are only concerned with evaluating generalization on NLU tasks. The lack of any NLG task, such as machine translation, summarization, or dialogue, prevents these two benchmarks from evaluating generalization capabilities across all NLP tasks.

As for the SentEval benchmark, its only goal is to evaluate the generalization of sentence representations, and, like GLUE and SuperGLUE, it is only concerned with NLU tasks.

To overcome the limitations of these multitask benchmarks, we propose a novel benchmark, which we call octaNLP, for evaluating the generalization capabilities of transformer-based pre-trained language models. The 8 tasks that we included in our benchmark cover the two pillars of NLP, NLU and NLG. Therefore, we believe that our benchmark is more suitable for evaluating generalization capabilities across all NLP tasks.

In Table 4, we list the datasets that we considered in our benchmark, along with their associated tasks and evaluation metrics.

Table 4 The datasets adopted in octaNLP, along with their associated tasks and evaluation metrics

We defined a single overall score, which we call the octaScore, which is simply the average of the scores of all 8 tasks. We applied our benchmarking approach to two recent transformer-based pre-trained language models, BART and T5. Table 5 shows the results of these two models on each individual task, along with the overall octaScore. As future work, we plan to apply our octaNLP benchmark to other models.
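A minimal sketch of the octaScore computation; the task names and score values below are placeholders, not the benchmarking results reported in Table 5.

```python
# Hypothetical per-task scores for one model on the 8 octaNLP tasks
# (each task keeps its own metric, e.g. accuracy, F1, BLEU, ROUGE)
octa_results = {
    "task_1": 85.0, "task_2": 78.5, "task_3": 64.2, "task_4": 90.1,
    "task_5": 27.4, "task_6": 41.8, "task_7": 73.0, "task_8": 88.6,
}

# The octaScore is the unweighted average over the 8 task scores
octa_score = sum(octa_results.values()) / len(octa_results)
print(f"octaScore: {octa_score:.2f}")
```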

From this experiment, we can see that the T5 model achieved the best octaScore, which indicates that it generalizes better than BART across diverse NLP tasks. This is because T5 is a text-to-text model, meaning that it approaches every NLP task in the same manner, as text input to text output.

Table 5 Benchmarking results of BART and T5 on octaNLP benchmark

6 Conclusion and Future Work

Transformer-based pre-trained language models have achieved remarkable results on many individual NLP tasks but still lack the generalization capabilities needed to be applied to multiple tasks simultaneously. In this paper, we provided an overview of NLP evaluation metrics, multitask benchmarks, and transformer-based pre-trained language models. We presented the limitations of the current multitask benchmarks and proposed our octaNLP benchmark for comparing the generalization capabilities of transformer-based pre-trained language models on multiple downstream NLP tasks simultaneously. As future work, we plan to test the multitask generalization capabilities of other transformer-based pre-trained language models using our octaNLP benchmark.