
1 Introduction

The original NLI task, known as Recognizing Textual Entailment [8, 9], requires a machine learning model to capture the semantic relationship between a given pair of premise and hypothesis sentences. This relationship falls into one of the classes Entailment, Contradiction, or Neutral. In recent years, the Natural Language Inference task has achieved significant success and plays a crucial role because it supports many NLP tasks such as machine reading comprehension [18] and question answering [4]. A remarkable point in this task is the presence of many high-quality large datasets in different languages, ranging from high-resource languages such as English [2, 20, 26] and Chinese [15] to low-resource languages such as Korean [14], Indonesian [17], and Persian [1]. As a low-resource language, Vietnamese still faces many limitations that hold back research on the NLI task. However, the research community has recently witnessed the launch of the ViNLI dataset, developed by Huynh et al. [16] for Vietnamese. This dataset has already yielded some positive research results and is expected to promote more and better research outcomes in the future.

It can be seen that there is an interplay between datasets and machine learning models. In other words, datasets play an essential role in evaluating machine learning models, and machine learning models keep evolving to dramatically improve accuracy on NLI tasks. In particular, the appearance of the transformer architecture [24] was a leap forward for many NLP tasks, including NLI. Since then, BERTology models [10] have become a trend thanks to their transformer-based architecture. However, we still do not fully understand why BERT performs so well, a question that many researchers are trying to answer.

In this paper, we investigate the behavior of the pre-trained BERT language model and its variants through the lens of the Vietnamese NLI task. Vietnamese is an interesting language, but it has not received much research attention. Building on the current research results on the ViNLI dataset [16], we set up the experiments in this paper. We deeply analyze the features contained in ViNLI to see what affects the performance of pre-trained models. This study helps us better understand both pre-trained models and the ViNLI dataset. We hope these analyses point to potential future studies that further improve results on the Vietnamese NLI task.

2 Related Work

In recent years, many NLI datasets have been built to study the effectiveness of machine learning models such as deep learning and transfer learning models. Many large benchmark datasets related to natural language inference have been introduced. Specifically, the SNLI dataset [2], introduced in 2015, is a large manually labeled dataset from Stanford University. Then, a series of other datasets appeared for English, such as STS-B [3] and QQP [5], introduced in 2017 and 2018. In 2018, MultiNLI [26], a large dataset with 433K pairs, was also published for this language. In addition, datasets for various languages have emerged in the NLP research community, including FarsTail [1] for Persian, KorNLI & KorSTS [14] for Korean, IndoNLI [17] for Indonesian, and OCNLI [15] for Chinese. Regarding multilingual datasets, the XNLI dataset [7] was released in 2018 with more than 112K pairs covering 15 languages. For Vietnamese, the ViNLI dataset was introduced by Huynh et al. [16] to promote NLI research in this language.

Natural language inference research is growing rapidly thanks to the explosion of high-quality large datasets and deep learning models. While neural-network-based models such as RNN [11] and Bi-LSTM [12] have achieved good performance on this task, transformer-based models play a vital role. BERT was published by Devlin et al. [10]; its architecture consists of a variable number of Transformer encoder layers and self-attention heads. With this architecture, BERT achieves state-of-the-art results for several Natural Language Understanding tasks on datasets such as the GLUE benchmark [25], SQuAD [22], and SWAG [28]. For the NLI task, multilingual pre-trained transformer models such as multilingual BERT [10], XLM-R [6], and SBERT [23] give strong results on MultiNLI [26], XNLI [7], QQP [5], and STS-B [3]. PhoBERT [19] is a monolingual pre-trained model developed solely for Vietnamese that also gives positive results on many NLP tasks such as text classification, natural language inference, and named entity recognition.

3 Dataset

The ViNLI benchmark [16] is used for evaluating the accuracy of pre-trained models. Statistics on the dataset are shown in Table 1. ViNLI is an open-domain dataset built on Vietnamese news text. It is quite large for Vietnamese at the moment, with 30,376 premise-hypothesis sentence pairs manually annotated by humans. What sets ViNLI apart from other datasets is that, in addition to the three labels Entailment, Contradiction, and Neutral, it has a fourth label, Other. The authors added the Other label to distinguish such pairs from the Neutral label.

Table 1. The number of premise-hypothesis pairs in the ViNLI dataset.

4 Experiments and Results

This section presents experiments with multilingual pre-trained models on the ViNLI dataset. Following prior work [2, 16], we use accuracy and F1-score to evaluate the performance of these models.
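As a minimal sketch, both metrics can be computed with scikit-learn; the label lists below are placeholders, and the macro averaging for F1 is our assumption since the averaging scheme is not specified here.

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder label lists; in practice they are the gold labels and the
# model predictions on the ViNLI development or test set.
gold = ["entailment", "neutral", "contradiction", "entailment"]
pred = ["entailment", "contradiction", "contradiction", "entailment"]

accuracy = accuracy_score(gold, pred)
macro_f1 = f1_score(gold, pred, average="macro")  # averaging scheme is an assumption
print(f"Accuracy: {accuracy:.2%}, F1: {macro_f1:.2%}")
```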

4.1 Data Preparation

The ViNLI benchmark dataset is used for the experiments on pre-trained models. However, according to the experimental results of Huynh et al. [16], the best model achieves very high accuracy on the Other label, above 98%, so we focus on analyzing the dataset with the three labels Contradiction, Entailment, and Neutral. Therefore, before setting up the experiments, we remove the sentence pairs labeled Other from the train, dev, and test sets.
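A minimal sketch of this filtering step is given below; the file layout and column names are our assumptions, since ViNLI's exact distribution format is not described here.

```python
import pandas as pd

# Hypothetical setup: each ViNLI split is assumed to be a CSV file with a
# "label" column containing the four ViNLI labels.
KEEP = {"entailment", "contradiction", "neutral"}

for split in ["train", "dev", "test"]:
    df = pd.read_csv(f"vinli_{split}.csv")
    df = df[df["label"].str.lower().isin(KEEP)]  # drop pairs labeled Other
    df.to_csv(f"vinli_{split}_3labels.csv", index=False)
```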

4.2 Experiment Settings

Besides the experiments with pre-trained models, including multilingual BERT [10], PhoBERT [19], and XLM-R [6], established on ViNLI by Huynh et al. [16], we also carry out an experiment with another pre-trained model, SBERT [23]. The SBERT model is pre-trained on many different languages, including Vietnamese. We use the pre-trained models provided by HuggingFace's library in our experiments. We set up the parameters of the SBERT model as follows: learning_rate = 1e−05, batch_size = 16, max_length = 256, and epoch = 10.
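The sketch below illustrates such a fine-tuning setup with the hyperparameters above, using HuggingFace's transformers and datasets libraries. The checkpoint identifier and the CSV file/column names are assumptions, not the exact configuration used in the paper.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# Hypothetical checkpoint: any multilingual Sentence-BERT checkpoint
# covering Vietnamese could be substituted here.
MODEL_NAME = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
LABEL2ID = {"entailment": 0, "contradiction": 1, "neutral": 2}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

# File and column names follow the hypothetical output of the filtering
# step in Sect. 4.1 (columns: "premise", "hypothesis", "label").
data = load_dataset("csv", data_files={"train": "vinli_train_3labels.csv",
                                       "dev": "vinli_dev_3labels.csv"})

def encode(batch):
    enc = tokenizer(batch["premise"], batch["hypothesis"],
                    truncation=True, max_length=256)
    enc["labels"] = [LABEL2ID[label] for label in batch["label"]]
    return enc

data = data.map(encode, batched=True,
                remove_columns=data["train"].column_names)

args = TrainingArguments(output_dir="sbert-vinli",
                         learning_rate=1e-5,
                         per_device_train_batch_size=16,
                         num_train_epochs=10)

trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"], eval_dataset=data["dev"],
                  data_collator=DataCollatorWithPadding(tokenizer))
trainer.train()
```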

4.3 Experimental Results

The experimental results are shown in Table 2. Compared with the experimental results of Huynh et al. [16], the performance of the SBERT model is the lowest, with accuracy on the dev and test sets of 59.29% and 58.17%, respectively. Moreover, there is a rather large gap between SBERT and the other pre-trained models, especially the XLM-R_large model: the difference in accuracy is more than 23% on both the dev and test sets.

Table 2. Machine performances on the development and test sets of the ViNLI dataset. Results of mBERT, PhoBERT, and XLM-R are from Huynh et al. [16].

5 Result Analysis

In this section, we analyze the results of these pre-trained models to explore how the characteristics of the ViNLI dataset affect their performance. The aspects of ViNLI that we are interested in analyzing include the influence of the annotation rules, word overlap, and sentence length on performance, the ability of pre-trained models to exploit annotation artifacts, and error analysis through confusion matrices.

5.1 Effects of Annotation Rules

According to Huynh et al. [16], annotators had to follow an annotation guideline to build the ViNLI dataset. The guideline presents suggested rules for annotators to write a hypothesis corresponding to a premise sentence. To analyze how the characteristics of the ViNLI construction method affect the results of the pre-trained models, we investigate how the rules for creating hypothesis sentences of the entailment and contradiction labels affect model performance. The lists of rules for creating entailment and contradiction hypothesis sentences are shown in Table 3 and Table 4. We selected 200 premise-hypothesis pairs with the entailment label and 200 pairs with the contradiction label from the test set for analysis. For these 400 sentence pairs, we annotated the hypothesis-creation rules following the guidelines of Huynh et al. [16]. The percentages of each rule used to generate entailment and contradiction hypotheses are shown in Table 3 and Table 4, respectively.

In terms of entailment rules, we found that annotators tended to use the "Replace words with synonyms" rule the most, at 56%. Rules like "Add or remove modifiers that do not radically alter the meaning of the sentence" and "Change active sentences into passive sentences and vice versa" also account for a significant share of the annotators' writing style, at 54% and 35%, respectively. In contrast, rules like "Turn adjectives into relative clauses", "Create conditional sentences", or "Turn the object into relative clauses" are the least used to create entailment hypotheses, ranging from 1% to under 4%.

The accuracy results of the pre-trained models on the entailment rules in Table 3 are interesting, with many similarities and differences between the models. All four models, SBERT, mBERT, PhoBERT, and XLM-R, have their worst performance on sentence pairs generated from the rule "Turn adjectives into relative clauses"; the mBERT model does not correctly predict any of these pairs, while the other three models correctly predict half of them. In addition, the rule "Create conditional sentences" also causes difficulty for the mBERT, PhoBERT, and XLM-R models, with lower accuracy than on the other rules. The SBERT model has its highest accuracy on the two rules "Replace words with synonyms" and "Create conditional sentences", with over 66%. Furthermore, the PhoBERT model performs best on entailment pairs generated from the rule "Add or remove modifiers that do not radically alter the meaning of the sentence", with 86.11%. Both the mBERT and XLM-R models have their highest accuracy on the rule "Turn the object into relative clauses", with 85.71% and 100%, respectively.

Table 3. Statistics of rules used to generate entailment hypotheses and the accuracy of pre-trained models on these rules.

Regarding contradiction rules, annotators most frequently use the "Replace words with antonyms" rule to generate hypothesis sentences, at over 36%, around five times as high as the "Opposite of time" rule, which has the lowest percentage. In addition, the percentages of contradiction hypotheses generated from the "Opposite of quantity" and "Opposite of time" rules are quite low. Table 4 shows that the four pre-trained models have the best predictive ability on pairs created with the "Use negative words" rule, with high accuracy; in particular, the mBERT and XLM-R models achieve nearly 90%. The XLM-R model has no difficulty with sentence pairs belonging to the "Other" rules, reaching absolute accuracy of 100%, even though the number of such pairs in the dataset is the lowest. The analysis also shows that the SBERT model has the worst performance on hypotheses generated from the "Replace words with antonyms" rule, with only 32.87%, whereas the predictive ability of the PhoBERT and XLM-R models on this rule is quite high, at 82.19% and 87.67%, respectively. Besides, the mBERT model has its lowest accuracy on sentence pairs from the rule "Wrong reasoning about an event", and PhoBERT's accuracy is lowest on the "Opposite of time" rule, at around 50%.

Table 4. Statistics of rules used to generate contradiction hypotheses and the accuracy of pre-trained models on these rules.

We also analyze how combining multiple rules to write a hypothesis affects the performance of pre-trained models. The distribution of the number of rules used in a hypothesis is shown in Table 5, along with the performance of the pre-trained models. In general, the number of rules used to generate entailment hypotheses is evenly distributed over 1, 2, and more than 2 rules. In contrast, most contradiction hypotheses are written using a single rule (63%), with a lower percentage (37%) generated from more than one rule. On the entailment label, we observe that mBERT has its best accuracy on hypotheses written with only 1 rule, and its accuracy decreases as the number of rules increases. In contrast, the performance of the XLM-R model increases as the number of rules used to generate the entailment hypothesis increases. Both the SBERT and PhoBERT models have their best predictive ability on entailment hypotheses with 2 rules and remain stable with 1 or more than 2 rules. For the contradiction label, all four pre-trained models predict better when the hypothesis is generated from multiple rules.

Table 5. The effect of the number of rules used in entailment and contradiction hypothesis sentences on the performance of the pre-trained models.
Fig. 1. The effect of word overlap on the accuracy of pre-trained models.

5.2 Effects of Word Overlap

To analyze whether word overlap between premise and hypothesis sentences in ViNLI affects the performance of pre-trained models, we calculate the word overlap of premise-hypothesis pairs on the test set using three different metrics: Jaccard similarity, the Longest Common Subsequence (LCS), and the new token rate, similar to [17]. We then analyze the accuracy of the models according to these measures.
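A minimal sketch of how the three overlap measures could be computed is given below; whitespace tokenization is an assumption, and the paper's exact tokenization may differ.

```python
def jaccard(premise: str, hypothesis: str) -> float:
    """Unordered token-level overlap (Jaccard similarity)."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(p | h)

def lcs_length(premise: str, hypothesis: str) -> int:
    """Length in characters of the longest common (ordered) subsequence."""
    prev = [0] * (len(hypothesis) + 1)
    for ch_p in premise:
        cur = [0]
        for j, ch_h in enumerate(hypothesis, start=1):
            cur.append(prev[j - 1] + 1 if ch_p == ch_h
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def new_token_rate(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that do not appear in the premise."""
    p = set(premise.lower().split())
    h = hypothesis.lower().split()
    return sum(tok not in p for tok in h) / len(h)
```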

First, we use Jaccard similarity to measure the degree of unordered word overlap at the token level; the accuracy of the models by Jaccard is shown in Fig. 1a. The XLM-R model has the best performance across all Jaccard ranges. The accuracy of the SBERT model is quite low when the Jaccard score is less than 40% and then increases slightly as Jaccard ranges from 42% to more than 80%. All three models mBERT, PhoBERT, and XLM-R have their worst performance when the Jaccard score between the premise and the hypothesis is less than 20%, and performance increases dramatically as the Jaccard score increases. However, the accuracy of the PhoBERT model decreases significantly when the Jaccard score between premise and hypothesis is more than 80%.

Second, we use LCS to measure the degree of ordered word overlap between the premise and the hypothesis at the character level. The accuracy of the models according to LCS is shown in Fig. 1b. The XLM-R model has the highest performance and remains relatively stable across most LCS levels compared to the other models. While the PhoBERT model performs poorly on premise-hypothesis pairs with an LCS of less than 20 characters, the mBERT model has difficulty when sentence pairs have an LCS of less than 20 characters or more than 60 characters.

Third, we also analyze the results of the models according to the ratio of new words in the hypothesis compared to the premise. The results are shown in Fig. 1c. The performance of most pre-trained models decreases remarkably as the new word rate increases from 0 to more than 80%.

From these analysis results, it can be seen that the degree of word overlap between the premise and the hypothesis sentences significantly influences the accuracy of the pre-trained models.

5.3 Effect of Sentence Length

Another issue we are interested in analyzing is the effect of the length of the premise-hypothesis pair on the performance of pre-trained models. Model accuracy on the test set with respect to the premise length, the hypothesis length, and the total length of the premise and hypothesis in tokens is shown in Figs. 2a, 2b, and 2c, respectively. We found that the accuracy of most models increases significantly as the premise length rises from 1–10 tokens to 21–30 tokens. While the accuracy of the PhoBERT and mBERT models continues to increase slightly as the premise length rises to more than 50 tokens, that of the XLM-R and SBERT models decreases slightly.

Regarding hypothesis length, the accuracy of the mBERT and XLM-R models decreases significantly as the hypothesis length increases from 1 to 40 tokens, followed by a gradual escalation when the hypothesis is longer than 40 tokens. Looking at Fig. 2b, the performance of the PhoBERT and SBERT models is roughly the same when the hypothesis length is in the range of 1 to 40 tokens. SBERT's performance then surges above 80% when the hypothesis length increases to 41–50 tokens before dropping below 60% when the hypothesis is longer than 50 tokens, while PhoBERT shows the opposite trend.

We find that the performance of the SBERT, mBERT, PhoBERT, and XLM-R models is relatively high when the total length of the premise and hypothesis is between 1 and 20 tokens; the XLM-R model is even almost entirely correct. However, the performance of these models drops significantly as this total length increases from 20 tokens to more than 100 tokens.

Fig. 2. The effect of the length of premise and hypothesis sentences on pre-trained models.

Table 6. Hypothesis-only baselines for ViNLI.

5.4 Hypothesis-only Model Analysis

Inspired by the research of [21], we investigate whether annotation artifacts leave any clues in the hypothesis sentence that help language inference models correctly predict the label. The performance of models trained with only the hypotheses is shown in Table 6. The XLM-R and PhoBERT models achieve fairly impressive results, with accuracy on the test set of 56.63% and 57.68%, respectively. Besides, we calculate Pointwise Mutual Information (PMI) [13] to observe which words in the hypothesis sentences can distinguish the labels from each other. The PMI results for the top 5 words of each label are shown in Table 7. For the entailment label, we find it quite interesting that the word "không" actually represents this class. This is entirely different from the OCNLI [15] and IndoNLI [17] datasets, where negative lexical items dominate the hypothesis sentences of the contradiction label. In addition, the words "có" and " " can be a sign to discriminate the neutral class from the other classes. However, the PMI results also show that some words can represent multiple classes, such as "và" and "trong", so the lexical differences between classes are not too influential. Therefore, the ViNLI dataset remains difficult for pre-trained models that try to rely only on hypothesis sentences for prediction.
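For reference, PMI for a (word, label) pair is defined as PMI(word, label) = log p(word, label) / (p(word) p(label)). A minimal sketch of computing it over the hypothesis sentences is shown below; whitespace tokenization and the frequency cutoff are our assumptions, and any smoothing used in related work is omitted.

```python
import math
from collections import Counter

def pmi_by_label(hypotheses, labels, min_count=10):
    """PMI(word, label) over hypothesis sentences.

    Counts each word at most once per sentence; the whitespace tokenization
    and min_count cutoff are assumptions, not the paper's exact setup.
    """
    joint, word_freq, label_freq = Counter(), Counter(), Counter()
    total = 0
    for hyp, lab in zip(hypotheses, labels):
        for tok in set(hyp.lower().split()):
            joint[(tok, lab)] += 1
            word_freq[tok] += 1
            label_freq[lab] += 1
            total += 1
    pmi = {}
    for (tok, lab), count in joint.items():
        if word_freq[tok] < min_count:
            continue  # skip rare words to reduce noise
        p_joint = count / total
        p_word = word_freq[tok] / total
        p_label = label_freq[lab] / total
        pmi[(tok, lab)] = math.log(p_joint / (p_word * p_label))
    return pmi
```

The top 5 words per label in Table 7 can then be obtained by sorting the returned scores for each label.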

Table 7. Top 5 (word, label) pairs PMI for different labels of ViNLI.

5.5 Error Analysis by Confusion Matrices

Figure 3 illustrates the confusion matrices of the four pre-trained models, SBERT, mBERT, PhoBERT, and XLM-R, on the development set. While the SBERT, mBERT, and PhoBERT models erroneously predict a significant number of sentence pairs with the CONTRADICTION label as NEUTRAL, many contradictory sentence pairs are mistakenly predicted by the XLM-R model as ENTAILMENT. In addition, the rates at which ENTAILMENT sentence pairs are mispredicted as CONTRADICTION and as NEUTRAL are quite similar for each model, except for the mBERT model, which makes more false predictions toward CONTRADICTION than toward NEUTRAL. For sentence pairs with the NEUTRAL label, the XLM-R model has the best prediction ability. Meanwhile, the mBERT model incorrectly predicts a significant number of NEUTRAL pairs as CONTRADICTION.
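A confusion matrix like one panel of Fig. 3 can be produced with scikit-learn; the label lists in the sketch below are placeholders for the gold labels and one model's predictions on the development set.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Placeholder label lists; in practice they are the development-set gold
# labels and the predictions of one pre-trained model.
label_names = ["entailment", "contradiction", "neutral"]
dev_gold = ["entailment", "contradiction", "neutral", "contradiction"]
dev_pred = ["entailment", "neutral", "neutral", "entailment"]

cm = confusion_matrix(dev_gold, dev_pred, labels=label_names)
ConfusionMatrixDisplay(cm, display_labels=label_names).plot(cmap="Blues")
plt.show()
```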

Fig. 3. Confusion matrix of pre-trained language models on the development set.

6 Conclusion and Future Work

To analyze the performance of pre-trained models on the Vietnamese NLI task, we experimented with the SBERT model on the ViNLI dataset and conducted an in-depth analysis of the other pre-trained models evaluated by Huynh et al. [16]. There are many interesting findings relating the data characteristics to the accuracy of the models. In particular, most models have relatively low accuracy on entailment hypotheses generated from the rules "Turn adjectives into relative clauses" and "Create conditional sentences". Contradiction hypotheses generated from the "Use negative words" rule are straightforward for the models to predict correctly. In addition, when multiple rules are combined to create a contradiction hypothesis, the models predict more accurately. Word overlap and premise and hypothesis length also significantly affect model performance. Pre-trained models are able to make predictions based on clues from annotation artifacts, although the accuracy is not very high.

In the future, we will also study techniques to improve the accuracy of the models, such as data augmentation. Besides, we will explore other transformer models like mT5 [27], a pre-trained text-to-text transformer covering many languages.