1 Introduction

Natural language generation (NLG) based on deep learning techniques has received significant research attention in recent years. Typical NLG applications include text summarization [17, 31] (generating a summary of a given article), machine translation [8, 24] (generating text in another language from the source text), and question generation [4, 29] (automatically generating questions from a given context paragraph). Such NLG applications play important roles in modern AI systems.

For NLG, a common practice is to employ the Transformer [27]. The Transformer is an encoder-decoder architecture: the encoder consists of encoding layers that process the input one layer after another, while the decoder consists of decoding layers used for text generation. In the literature, there are three variants of the Transformer for language generation, which we review as follows (a minimal sketch contrasting their generation loops is given after the list).

  • Encoder-Decoder (full Transformer): A direct employment of the Transformer for text generation. In this architecture, the input is encoded by the encoder, and the decoder predicts the next token using the shifted-right [27] decoding scheme. Representatives of this architecture are the BERT2BERT model [24] and the BART model [14].

  • Encoder-Only: In this line of work, only the encoder part of the Transformer is used for text generation. Representatives of this architecture are the BERT-HLSQG model [4] and the BERT-GEN model [19]. The main idea is to predict the next token by repeatedly running the encoder over the sequence generated so far.

  • Decoder-Only: In this line of work, only the decoder part of the Transformer is used. A well-known representative is the GPT-2 model [22]. The idea is to predict the next token by auto-regressively iterating the decoder.

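To make the distinction concrete, the sketch below contrasts the greedy generation loop of each variant. This is an illustrative sketch only: the model objects (encoder, decoder, encoder_lm, decoder_lm) are stand-in callables that return per-position vocabulary logits, and the special-token ids follow the BERT convention ([CLS] = 101, [SEP] = 102, [MASK] = 103) as an assumption rather than a detail taken from the cited systems.

    import torch

    CLS, SEP, MASK, MAX_LEN = 101, 102, 103, 20  # illustrative ids / length

    def encoder_decoder_generate(encoder, decoder, src):
        """Full Transformer: encode the source once, then decode auto-regressively
        ("shifted right"): the decoder sees the tokens generated so far plus the
        encoder memory and predicts the next token."""
        memory = encoder(src)                          # bidirectional source encoding
        out = [CLS]
        for _ in range(MAX_LEN):
            logits = decoder(torch.tensor(out), memory)
            nxt = int(logits[-1].argmax())
            if nxt == SEP:
                break
            out.append(nxt)
        return out[1:]

    def encoder_only_generate(encoder_lm, src):
        """Encoder-Only: append a [MASK] slot to source + generated text, re-run the
        encoder over the whole sequence, and read off the predicted token; one full
        encoder pass per generated token."""
        seq, generated = list(src), []
        for _ in range(MAX_LEN):
            logits = encoder_lm(torch.tensor(seq + [MASK]))
            nxt = int(logits[-1].argmax())
            if nxt == SEP:
                break
            seq.append(nxt)
            generated.append(nxt)
        return generated

    def decoder_only_generate(decoder_lm, src):
        """Decoder-Only (GPT-2 style): treat source + generated text as one
        left-to-right sequence and keep extending it."""
        seq, generated = list(src), []
        for _ in range(MAX_LEN):
            logits = decoder_lm(torch.tensor(seq))
            nxt = int(logits[-1].argmax())
            if nxt == SEP:
                break
            seq.append(nxt)
            generated.append(nxt)
        return generated
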
Although all three architectures have their own advocates, our investigation shows that the state-of-the-art (SOTA) results on various NLG tasks are based on the full Transformer architecture. The consensus is therefore that, when the training dataset is sufficiently large, the full Transformer is the best choice.

However, the comparison of the three Transformer variants under low-resource dataset settings remains under-explored. Example scenarios of low-resource text generation are the matching of medical questions [18], reading comprehension in Persian [12], and machine translation from Cherokee to English [32].

In this paper, we investigate the performance differences among the full Transformer, Encoder-Only, and Decoder-Only architectures under low-resource dataset settings. Specifically, we explore the following research question:

Which architecture is the best choice for text generation tasks under low-resource dataset settings?

Specifically, this paper reports experimental results of applying the three architectures to four different tasks (Paraphrase Generation, Machine Translation, Question Generation, and Abstractive Text Summarization). In contrast to the conclusion drawn under rich-resource settings, we find no consistent result indicating which architecture is the best under low-resource settings.

From the experiments, we make the following observations for text generation under low-resource dataset settings.

  • First, NLG tasks requiring semantic understanding, such as abstractive text summarization and question generation, are better tackled with the full Transformer.

  • Second, the Encoder-Only architecture shows better performance for NLG tasks requiring only lexical reformation or rewriting, such as paraphrase generation.

  • Third, the Decoder-Only architecture seems not to be a good choice when a low-resource dataset setting is considered.

2 Related Work

In the literature, there are three main Transformer variants for NLG: the full Transformer, Encoder-Only, and Decoder-Only architectures. The representative of the full Transformer is the BERT2BERT model [24], the representative of the Encoder-Only architecture is the BERT-GEN model [19], and the representative of the Decoder-Only architecture is the GPT-2 model [22]. All three architectures have been employed for various NLG applications. For evaluating the performance of NLG models, there are four commonly used benchmark tasks: Paraphrase Generation, Machine Translation, Question Generation, and Abstractive Text Summarization. We find that, although each model has its own advocates, the SOTA results on the four tasks are mainly obtained with the full Transformer; see Table 1. In fact, the general consensus when selecting an NLG architecture is to use the full Transformer.

Table 1. The current SOTA research on rich-resource generation tasks.

The works [1, 24] compare the performance of the full Transformer and the Decoder-Only architecture. The study [24] replaces the weights of the full Transformer with pre-trained checkpoints and compares the result with GPT-2, while [1] pre-trains the full Transformer with task-relevant information and compares it with GPT-2. Both studies conclude that the full Transformer is the winner.

However, we note that the existing comparisons are based on rich-resource training settings. To the best of our knowledge, the comparison under low-resource settings has not been explored. In this paper, we use the datasets listed in Table 1 to compare the three architectures under low-resource settings.

With regard to low-resource settings, the study in [9] points out that the amount of data considered "low-resource" varies across tasks. For example, [30] treats 350K instances as low-resource for question response generation, while [6] treats 10K instances as low-resource for abstractive text summarization. To maintain uniformity and to consider a more demanding scenario, we take 1K and 3K training instances for each task to compare the models.

Note that there are many techniques for addressing low-resource settings, such as data augmentation [5] and transfer learning [26], which make more effective use of limited data. We emphasize that the goal of this study is to compare the strengths and weaknesses of the architectures when directly trained on the given insufficient data.

3 Performance Comparison

In this section, we conduct experiments on the four generation tasks mentioned above to observe the performance differences among the compared architectures.

For a fair comparison, all models are trained from randomly initialized weights. Furthermore, we consider two low-resource dataset settings, 3K and 1K: we randomly select 3000 and 1000 instances from the original datasets as training data to simulate the low-resource setting, and we use the originally released test sets for performance evaluation. The reported scores are averaged over three randomly selected training sets. We evaluate performance with the evaluation package released by [25], which includes BLEU 1, BLEU 2, BLEU 3, BLEU 4 [21], METEOR [2], and ROUGE [16] scripts.
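A minimal sketch of this sampling-and-averaging protocol is given below; train_model_and_score is a hypothetical placeholder that trains one architecture on a subset and returns a single metric value.

    import random

    def low_resource_score(full_train, test_set, size, train_model_and_score,
                           seeds=(0, 1, 2)):
        """Draw three independent low-resource subsets (size = 1000 or 3000),
        train and evaluate on each, and report the averaged metric."""
        scores = []
        for seed in seeds:
            subset = random.Random(seed).sample(full_train, size)
            scores.append(train_model_and_score(subset, test_set))
        return sum(scores) / len(scores)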

3.1 Model Setup

We built the Encoder-Decoder, Encoder-Only, and Decoder-Only architectures based on the PyTorch implementation of BERT, and initialized all weights randomly when training each task. All tasks use the BERT-Base Cased vocabulary (28,996 words) and follow the settings of [27], with the hidden dimension set to 512, the number of attention heads set to 8, and the feed-forward dimension set to 2048.

Note that we adjust the number of layers of the compared models for a fair comparison. If the architectures used the same number of layers, their total numbers of parameters would differ significantly, raising the concern of an unfair comparison. Therefore, we adjust the number of layers of each implemented architecture so that the parameter counts match closely. We set the Encoder-Decoder to have one layer on each side (one encoder layer and one decoder layer), whereas the number of layers of the Encoder-Only and Decoder-Only models is set to 7. The total number of parameters of each implemented model is close to 52M.
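For illustration, the following sketch counts parameters for the three layer configurations using plain torch.nn layers rather than our BERT-based implementation. The embedding scheme shown (separate source/target embeddings for the Encoder-Decoder, an untied LM head, and a GPT-2-style decoder block without cross-attention, modelled by an encoder layer that would be run with a causal mask) is an assumption made for this sketch; with it, each configuration lands near the 52M figure.

    import torch.nn as nn

    VOCAB, D, H, FF = 28996, 512, 8, 2048

    def n_params(*modules):
        return sum(p.numel() for m in modules for p in m.parameters())

    embed = lambda: nn.Embedding(VOCAB, D)
    lm_head = lambda: nn.Linear(D, VOCAB)
    enc_layer = lambda: nn.TransformerEncoderLayer(D, H, FF)
    dec_layer = lambda: nn.TransformerDecoderLayer(D, H, FF)

    # Encoder-Decoder: 1 encoder + 1 decoder layer, separate src/tgt embeddings.
    encoder_decoder = n_params(embed(), embed(), lm_head(),
                               nn.TransformerEncoder(enc_layer(), 1),
                               nn.TransformerDecoder(dec_layer(), 1))
    # Encoder-Only: 7 encoder layers.
    encoder_only = n_params(embed(), lm_head(), nn.TransformerEncoder(enc_layer(), 7))
    # Decoder-Only: 7 GPT-2-style blocks (no cross-attention), i.e. encoder layers
    # combined with a causal mask at training/inference time.
    decoder_only = n_params(embed(), lm_head(), nn.TransformerEncoder(enc_layer(), 7))

    for name, n in [("Encoder-Decoder (1+1)", encoder_decoder),
                    ("Encoder-Only (7)", encoder_only),
                    ("Decoder-Only (7)", decoder_only)]:
        print(f"{name}: {n / 1e6:.1f}M parameters")  # each prints roughly 52M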

The dropout probability between Transformer layers is set to 0.1. The Adamax optimizer is applied during training with an initial learning rate of 5e-5, and the batch size is set to 50. The number of epochs is set to 60 for Encoder-Only and 100 for Decoder-Only and Encoder-Decoder. All of our models are trained on two TITAN RTX GPUs.
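A minimal sketch of this optimisation setup is shown below, with a stand-in module in place of the actual architectures and an assumed [PAD] id of 0 (the BERT vocabulary's padding token); the loss choice is likewise an assumption.

    import torch
    import torch.nn as nn

    model = nn.TransformerEncoder(                 # stand-in for one of the three models
        nn.TransformerEncoderLayer(512, 8, 2048, dropout=0.1), num_layers=7)

    optimizer = torch.optim.Adamax(model.parameters(), lr=5e-5)
    loss_fn = nn.CrossEntropyLoss(ignore_index=0)  # ignore [PAD] positions (assumption)

    BATCH_SIZE = 50
    EPOCHS = {"Encoder-Only": 60, "Decoder-Only": 100, "Encoder-Decoder": 100}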

3.2 Paraphrase Generation

Paraphrase Generation is the task of taking a source sentence and generating a sentence with a different syntactic structure but the same semantic meaning. We use the GLUE-QQP dataset [28] to compare model performance.

GLUE-QQP: A collection of question pairs from the community question-answering website Quora, each tagged with 0 or 1, where 1 means the two sentences are semantically identical and 0 means they are not. The dataset contains 134,378 instances labeled as 1. We set the maximum length of a source sentence to 105 tokens and of a target sentence to 97 tokens.
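A data-preparation sketch for this setting is shown below, keeping only pairs labelled 1 and truncating to the stated maximum lengths; the TSV column names (question1, question2, is_duplicate) follow the usual GLUE-QQP layout and, together with the word-level truncation, are assumptions made for illustration.

    import csv

    MAX_SRC, MAX_TGT = 105, 97

    def load_qqp_paraphrases(path):
        """Return (source, target) pairs for the instances labelled as paraphrases."""
        pairs = []
        with open(path, encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
                if row["is_duplicate"] == "1":               # keep positives only (~134K)
                    src = " ".join(row["question1"].split()[:MAX_SRC])
                    tgt = " ".join(row["question2"].split()[:MAX_TGT])
                    pairs.append((src, tgt))
        return pairs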

Results. Table 2 shows the GLUE-QQP validation results. Encoder-Only achieves the best results in both the 1K and 3K settings. We attribute this to the characteristics of the paraphrase generation task: since paraphrasing is mainly about swapping words or making grammatical changes, the Encoder-Only model, which predicts one token at a time over the full bidirectional context, can effectively capture the points that need to be changed.

3.3 Machine Translation

The Machine Translation task translates source text in one language into text in another language. We conduct our comparison on WMT14 German-English Newstest2014 [3].

WMT14 German-English Newstest 2014: The original dataset contains 4.5M sentence pairs. We set the goal to translate German into English, with the maximum length of a German sentence set to 234 tokens and of an English sentence to 124 tokens.

Table 2. GLUE-QQP evaluation results
Table 3. WMT14 German-English Newstest 2014 test results

Results. Table 3 shows the WMT14 German-English Newstest 2014 test results. The Encoder-Decoder obtains the best scores on both BLEU 3 and BLEU 4. We think that machine translation requires a deep understanding of the grammatical differences between languages. Under this task characteristic, the full Transformer is a better fit: the encoder takes charge of understanding the source language, while the decoder is responsible for generating the target language.

Table 4. SQuAD 73K test results
Table 5. CNN/DailyMail test results

3.4 Question Generation

The question generation task takes a context text and an answer phrase as input and generates a question corresponding to the given answer phrase. We evaluate performance on SQuAD 73K [7]. SQuAD contains 536 articles with 100K questions (and the corresponding answers) about these articles.

SQuAD 73K: Following the setting of [7], SQuAD 73K [23] is divided into a training set (80%), a development set (10%), and a test set (10%). We set the maximum length of the context to 422 tokens, of the question to 50 tokens, and of the answer to 15 tokens.

Results. Table 4 shows the SQuAD 73K test results. The best results are obtained by the Encoder-Decoder in both the 1K and 3K settings. We think that question generation requires understanding the context and the answer before relevant questions can be generated. Similar to machine translation, the Encoder-Decoder is the most suitable architecture for understanding paragraph-level information.

3.5 Abstractive Text Summarization

The goal of the Abstractive Text Summarization task is to take an article and generate a coherent and semantically correct abstract. We evaluate model performance on CNN/DailyMail [10, 20].

CNN/DailyMail: The corpus has 286,817 training pairs, 13,368 validation pairs, and 11,487 test pairs. We set the maximum length of an article to 435 tokens and of the corresponding abstract to 73 tokens.

Results. Table 5 shows the CNN/DailyMail test results. The Encoder-Decoder again obtains the best ROUGE-L scores in both the 1K and 3K settings. We think that abstractive text summarization still requires understanding the context of the article in order to generate a relevant summary. Therefore, the Encoder-Decoder architecture is again the best fit for this task.

3.6 Result Discussion

Based on the experimental results, we think that the full Transformer can efficiently leverage the bidirectional information captured by the encoder and the auto-regressive capability of the decoder for coherent text generation, which makes it suitable for generation tasks requiring semantic understanding. The Encoder-Only architecture can predict the next word using bidirectional information, but it cannot exploit this information for generation as efficiently; nevertheless, it provides excellent performance on lexical reformation or rewriting tasks such as paraphrase generation. The Decoder-Only architecture can only use the preceding context to predict the next word and cannot let input tokens attend to each other in both directions, so it is inferior to the full Transformer on tasks that require paragraph-level semantic understanding.
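The visibility difference discussed above can be made explicit with attention masks: an encoder-style (bidirectional) mask blocks nothing, whereas a decoder-style (causal) mask hides future positions. The sketch below uses the additive float-mask convention of torch.nn.Transformer and is purely illustrative.

    import torch

    seq_len = 5
    bidirectional_mask = torch.zeros(seq_len, seq_len)        # every position sees every other
    causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")),
                             diagonal=1)                      # -inf above the diagonal
    print(causal_mask)  # row i may attend only to positions 0..i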

4 Conclusion

In this paper, we conducted experiments to compare the three main NLG architectures and determine which one is more effective for each task under low-resource settings. In contrast to the previous conclusion for rich-resource settings (that the full Transformer is always the winner), we find that the Encoder-Only architecture is a good choice for tasks requiring only text reformation or rewriting, such as paraphrase generation. However, if an NLG task requires paragraph-level semantic understanding, the full Transformer is still the best choice.