1 Introduction

Natural Language Processing (NLP) has been an indispensable contributor to the worldwide rise of Artificial Intelligence (AI). One NLP downstream task with a plethora of practical applications is Recognizing Textual Entailment (RTE), also called Natural Language Inference (NLI). Its goal is to determine whether a natural-language hypothesis h can be inferred from a given premise p, and it is commonly treated as a classification problem: given the two inputs, hypothesis and premise, classify the relationship between them into one of three classes: 'entailment', 'contradiction', or 'neutral'. Moreover, most forms of meaningfulness in language can be viewed as a form of entailment, contradiction, or neutrality in context [1, 2]. Hence, NLI has played a crucial role in advancing downstream NLP applications such as Question Answering, Text Summarization, and Machine Reading Comprehension.

Currently, many works promote the development of this field, both by publishing high-quality NLI datasets and by improving NLI models toward human-level performance. In particular, a plethora of large-scale NLI datasets in various languages and domains has been published, such as SNLI [3], MultiNLI [4], and ViNLI [5]. On the modeling side, Transformer-based language models such as XLM-R [6], InfoXLM [7], PhoBERT [8], and mBART [9] have performed well on this task. Moreover, when fine-tuned and evaluated on different benchmark datasets, these models have surpassed non-expert human performance.

However, effectively transferring NLI progress to other downstream NLP tasks still requires considerable effort from the NLP community. One of the factors shown to affect model performance is the length of the premise [10]. Most recent works address the task at the sentence level, which may lack contextual information. As a result, although these models still achieve competitive results, they are not good at performing inference over longer text, which is a central feature of downstream NLP tasks [10]. As indicated in [11], inference is made based on contextual information and a collection of facts. Deducing and then connecting hidden facts from a given context is an essential part of human language understanding, involving many steps and much information. Hence, using only the information from a single sentence may not be enough to sufficiently address downstream NLP tasks that require processing long text. Therefore, works investigating the NLI task at the passage level have received much attention [10, 12, 13].

Fig. 1 Transformation of the original sentence-level premise ViNLI dataset into the longer-premise ViNLI dataset based on the provided context

For Vietnamese, to our knowledge, NLI datasets are quite rare, comprising only the monolingual ViNLI [5] and the bilingual Vietnamese-English NLI dataset [14]. In addition, these Vietnamese datasets are at the sentence level. Therefore, to investigate whether a longer premise can improve model performance, we leveraged the contexts additionally provided in the ViNLI dataset to generate a long-premise ViNLI dataset, as shown in Fig. 1. Compared to other benchmark datasets, ViNLI [5] was designed with four labels (ENTAILMENT, CONTRADICTION, NEUTRAL, and OTHER) instead of three (ENTAILMENT, CONTRADICTION, and NEUTRAL) to cover certain circumstances that arise in real-life scenarios.

In this paper, we restrict our focus in solving the NLI task to the "entailment" class, as it plays an important role in downstream tasks such as Question Answering; the other three classes are left unchanged. Moreover, we not only emphasize the need for a long-premise NLI dataset but also pay attention to how valuable the information in the premise is. Specifically, we develop a framework named LMCK that combines pre-trained language models with context-based external knowledge generated by applying our rules together with information retrieval models such as BM25 [15], TF-IDF [16], Sentence-BERT [17], and SXLM-R [18]. We also experiment with two types of pre-trained models on the NLI task, encoders (XLM-R, PhoBERT, InfoXLM) and an encoder-decoder (mBART), on our converted long-premise NLI dataset. Our investigation demonstrates that, besides longer premises, context-based external knowledge is an important factor for better performance on the NLI task. For this task, the results show that encoders are better than the encoder-decoder. Most importantly, our approach achieves state-of-the-art performance on the ViNLI dataset.

The rest of the paper is organized as follows. In Section 2, we provide an overview of previous works about the Natural Language Inference task and Context-based external knowledge. Section 3 describes the methodology which is used for experiments in this paper. Then, we present the whole experiment, including the dataset, experimental settings, and our results in Section 4. Finally, Section 5 presents the conclusion and future works.

2 Related works

We consider previous works in the areas of both Natural Language Inference (Section 2.1) and context-based external knowledge (Section 2.2).

2.1 Natural language inference

Since 2005, the NLP community has witnessed significantly growing popularity of the Recognizing Textual Entailment (RTE) task, now known as Natural Language Inference (NLI), due to the emergence of the PASCAL Recognizing Textual Entailment (RTE) challenges [19]. The key to this popularity is that an RTE system determines the relationship between two given text fragments by employing techniques from across NLP to address semantic inference, a prominent issue shared by many NLP applications. Two years later, the third RTE challenge [20] introduced a limited number of longer texts, up to a paragraph in length, to make the challenge more oriented to realistic scenarios; this was one of the most influential steps for later work on RTE and its applications. Subsequent RTE challenges such as RTE-5 [21], RTE-6 [22], and RTE-7 [23] required participants to apply RTE systems mainly to specific application settings. In particular, all three challenges (RTE-5, RTE-6, RTE-7) are situated in the Summarization application setting.

Recently, with the challenge of more comprehensive scenarios, there has been a plethora of work improving both datasets and techniques. On the one hand, the most well-known NLI benchmarks include the Stanford Natural Language Inference (SNLI) dataset [3] and the expanded Multi-Genre NLI corpus (MultiNLI) [4], which attempts to tackle the limitations of SNLI; specifically, it introduced genre labels for each sentence pair to focus on domain adaptation. Besides, there are several task-specific NLI datasets, including Question-answering NLI (QNLI) [24], SciTail [25], Dialogue NLI [26], and Vietnamese-English NLI [14]. In addition, there are various monolingual NLI datasets, including OCNLI [27] for Chinese, IndoNLI [28] for Indonesian, SICK-NL [29] for Dutch, and ViNLI [5] for Vietnamese. However, all the above datasets are either at the sentence level or do not consider relationships inferred from more than a single sentence.

Therefore, NLI datasets with longer texts have been built out of necessity to address inference in real-life situations. In 2014, the Approximate Textual Entailment (ATE) dataset used in Image Captioning [30] was created based on FLICKR30k; each item includes a premise set of four captions and a short phrase as the hypothesis. Similarly, the Multiple Premise Entailment (MPE) dataset [31] was proposed as a challenging task in which each hypothesis sentence is paired with an unordered set of premise sentences describing the same event from FLICKR30k. Regarding NLI, Adversarial NLI [32] is a human-and-model-in-the-loop dataset in which longer contexts are considered in the premise. ConTRol [13] is a dataset for contextual reasoning over long texts; compared to Adversarial NLI, whose contexts are single paragraphs, the contexts of ConTRol are much longer and span multiple paragraphs. Inspired by the works above, to investigate the potential of longer premises for the Vietnamese NLI task, we leverage the contexts additionally provided in the ViNLI dataset and convert it from single-sentence premises into multiple-sentence premises (i.e., from sentence level to passage level).

On the other hand, due to the increasing growth of large-scale NLI datasets, deep learning models such as RNNs [33], BiLSTM [34], and ESIM [35] have surpassed traditional machine learning models (Skip-gram, CBOW [36]). In recent years, however, the advent of the Transformer architecture [37] completely changed how researchers deal with the NLI task and its applications. In particular, numerous models have been proposed and achieved significant performance by employing either the encoder architecture, including BERT [38], XLM-R [6], and InfoXLM [7], or the encoder-decoder architecture, including BART [39] and T5 [40]. For Vietnamese, Transformer-based models such as PhoBERT [8] and ViT5 [41] have also achieved positive results.

Although the effectiveness of transformer-based models in the NLI task is significant, in this work, we demonstrate that the performance of NLI models that use pre-trained models can be augmented with context-based external knowledge.

As far as Vietnamese NLI is concerned, it has only recently become a research subject in the Vietnamese NLP community, so there has not yet been much work in this field. The advent of [14] as a shared task in VLSP has drawn more Vietnamese researchers' attention. With great effort, several outstanding works [5, 42,43,44,45] were proposed. In particular, the studies [42, 43] are entries in the shared task [14]: [42] utilized pre-trained multilingual language models, while [43] employed data augmentation to deal with the Vietnamese and English-Vietnamese Textual Entailment tasks. [5] made a major contribution to the Vietnamese NLP community by creating the first monolingual Vietnamese NLI dataset, ViNLI. [44] proposed a method to build a Vietnamese dataset for training Vietnamese inference models that work on native Vietnamese texts. [45] presented an experiment combining semantic word representation through the SRL task with the context representation of BERT-related models for the NLI problem. Despite many attempts, there is still no work using context-based external knowledge to enhance model performance on the Vietnamese NLI dataset.

2.2 Context-based external knowledge

Utilizing context-based external knowledge has been shown to improve performance on many NLP downstream tasks [13, 46,47,48,49,50,51]. There are two main approaches to utilizing context-based external knowledge: graph-based and information retrieval-based.

For graph-based external knowledge in the field of Natural Language Inference (NLI), there have been many attempts, such as [47, 52, 53]. In particular, Wang et al. [47] presented a combination of text-based, graph-based, and text-and-graph-based models that leverage external knowledge to improve performance on the NLI problem. Chen et al. [52] developed a model with WordNet-based co-attention that uses five engineered features from WordNet for each pair of words from the premise and hypothesis. Meanwhile, Pan et al. [53] used external knowledge from Knowledge Graphs (KGs) in text-based RTE models, applying Personalized PageRank to generate contextual subgraphs with reduced noise and encoding these subgraphs with graph convolutional networks to capture the structural and semantic information in KGs. All in all, most of these works employ neural networks to represent the triplets of knowledge graphs, and such approaches usually need to train a knowledge-graph embedding beforehand. According to [54], despite their effectiveness, existing methods for generating knowledge-graph embeddings still suffer from several severe limitations: additional information, such as entity types and relation paths, is ignored, even though it could further improve embedding accuracy.

When it comes to information retrieval-based external knowledge, there are two types of representations for the retriever: bag-of-words (BOW) based sparse representations [55] and dense representations from neural networks [56]. Sparse representations rely on BOW, so a rule-based scoring function such as TF-IDF or BM25 is used for ranking; this allows adaptation to a range of large-scale search scenarios. This approach has been widely explored for various NLP downstream applications, including Question Answering [57, 58] and Machine Translation [59, 60]. Dense representation-based retrieval (DPR) [56], in contrast, has received a lot of attention in recent years. Dense representations are obtained from encoders such as Transformers trained with task-specific data, and these methods have been shown to yield better recall than sparse representations on different tasks. However, DPR cannot process longer documents, usually handling fewer than 128 tokens [56].

In this paper, we focus on obtaining external knowledge by leveraging information retrieval-based approaches. Therefore, to find the most suitable retriever for our work, we employ both types: sparse retrieval using traditional information retrieval (IR) models such as TF-IDF [16] and BM25 [15], and representation-based retrieval using SBERT [17] and SXLM-R [18].

3 Methodology

Our LMCK system combines the exploitation of semantic information for the NLI task (Section 3.1) with pre-trained language models (Section 3.2). In particular, the system includes three phases: Context-based Sentence Extraction, Long-premise Generation, and Inference (see Fig. 2). As presented in Section 3.1, the Context-based Sentence Extractor, the core of Phase 1, is responsible for extracting external knowledge from the given context of a document. After that, in Phase 2, the most relevant sentences are added to the premise sentences to generate our converted long premises. For the pre-trained language models (Section 3.2) in Phase 3, we use two types of architectures on the NLI task: encoders (XLM-R, PhoBERT, InfoXLM) and an encoder-decoder (mBART).

Fig. 2 Overview of large language models enhanced with contextual knowledge (LMCK) system

3.1 Context-based sentence extractor

As analyzed in [5], besides relying on the content of the premise, annotators tend to write hypotheses for entailment samples based on the corresponding premise's situation (i.e., the premise's context). This can make inference difficult for models. Hence, to capture the semantic relations better, we employ an Information Retrieval method together with our rules to obtain the important semantic information.

In this component, the main task is to retrieve a proper subset (\(S_1\), \(S_2\), ..., \(S_n\)) of each premise's given context from the ViNLI dataset, i.e., the sentences used for inferring the annotators' corresponding hypothesis H, relating to the premise P, or relating to the combination of hypothesis and premise \(H+P\). A proper subset results from identifying a subset of sentences from which an entailment system can judge whether the statement H is entailed or not.

In this work, we conduct experiments on two types of information retrieval-based approaches: a sparse retriever and a representation-based retriever. For the sparse retriever, we employ traditional information retrieval (IR) models such as TF-IDF [16] and BM25 [15]. While TF-IDF is a term-scoring method using the cosine similarity measure, BM25 is a method for scoring documents in response to a query. Specifically, TF-IDF and BM25 are shown in (1) and (2), respectively [61]:

$$\begin{aligned} \text {TF-IDF}(D, Q) = \sum _{t \in Q} \sqrt{f(t,D)} \cdot \big (1 + \log IDF(t)\big )^2 \end{aligned}$$
(1)
$$\begin{aligned} \text {BM25}(D, Q) = \sum _{i} IDF(q_i)\, \frac{f(q_i,D) \cdot (k_1 + 1)}{f(q_i,D) + k_1 \cdot \left( 1 - b + b \cdot \frac{|D|}{avgdl}\right) } \end{aligned}$$
(2)

where D is a document, Q is a query, f(t, D) and f(\(q_i\), D) are the term frequencies of t and \(q_i\) in document D, |D| is the length of document D in words, avgdl is the average document length in the text collection from which documents are drawn, and IDF is the inverse document frequency. However, there is a slight difference between IDF(t) and IDF(\(q_i\)), shown in (3) and (4), respectively. \(k_1\) and b are free parameters; in this work, we set \(k_1\) to 1.5 and b to 0.75.

$$\begin{aligned} IDF(t) = \log \frac{N}{df(t)} + 1 \end{aligned}$$
(3)
$$\begin{aligned} IDF(q_i) = \ln \left( \frac{N-n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right) \end{aligned}$$
(4)

where N is the total number of documents in the collection, df(t) is the number of documents containing t, and n(\(q_i\)) is the number of documents containing \(q_i\).
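For illustration, the scoring in (2) and (4) can be sketched in a few lines of Python; the pre-tokenized toy corpus below is our own simplifying assumption, not part of the original setup.

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one document against a query with BM25, following (2) and (4)."""
    N = len(corpus)                                  # total number of documents
    avgdl = sum(len(d) for d in corpus) / N          # average document length in words
    tf = Counter(doc_terms)                          # f(q_i, D): term frequency in D
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)       # n(q_i): documents containing q_i
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)           # equation (4)
        denom = tf[q] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[q] * (k1 + 1) / denom                     # equation (2)
    return score

# Toy usage: rank (tokenized) context sentences against a hypothesis.
corpus = [["cavani", "scored", "30", "goals"], ["the", "season", "ended", "early"]]
query = ["cavani", "goals"]
ranked = sorted(corpus, key=lambda d: bm25_score(query, d, corpus), reverse=True)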

For the representation-based retriever, we use SBERT [17] and SXLM-R [18]. A representation-based retriever, also called a dual-encoder, employs two independent encoders such as BERT [38] to encode the query and the documents respectively, and then estimates their relevance by computing a single similarity score between the two representations. In particular, SBERT [17] adopts two independent BERT-based encoders to encode the two input sentences, then adds a pooling operation on top of the BERT output to derive a fixed-sized sentence embedding; the relevance score between them is computed by cosine similarity. To fine-tune BERT, the authors create siamese and triplet networks [62] to update the weights so that the produced sentence embeddings are semantically meaningful and comparable with cosine similarity. Compared to SBERT [17], SXLM-R [18] employs two independent XLM-R encoders and is fine-tuned with the Multiple Negatives Ranking (MNR) loss [63]. The loss function is given by (5):

$$\begin{aligned} L = - \frac{1}{N}\cdot \frac{1}{K} \sum _{i=1}^{K}\left[ S(x_i, y_i)- \log \sum _{j=1}^{K} e^{S(x_i, y_j)}\right] \end{aligned}$$
(5)

To evaluate and choose the best IR method, an evaluation dataset was created manually by three well-educated annotators to assess the accuracy of these models. Firstly, we provide them with the same data, consisting of premise-hypothesis sentence pairs and a context, drawn from the training set of ViNLI. We ask them to read the content of the premise and hypothesis carefully, then judge whether additional contextual information is needed to generate the hypothesis from the premise. If so, the annotators choose the 3 most relevant sentences in the context on which they think the hypothesis was based; we denote the most relevant, the second most relevant, and the third most relevant sentence as Top_1, Top_2, and Top_3, respectively. In this process, the annotators work independently. At the end of the process, we only keep those samples for which all three annotators agree that the context is important for writing the hypothesis. As a result, we obtain three datasets corresponding to the three annotators, each containing 300 samples. Figure 3 shows an example of our evaluation data.

Fig. 3 An example of manually generating data to evaluate IR models. In the example, the sentences Top_1, Top_2, and Top_3 are highlighted in green, orange, and blue, representing the choices of annotator 1, annotator 2, and annotator 3, respectively

Besides, we design three experiments to evaluate these IR models on the dataset; they differ only in the inputs given to the IR models, as described in Fig. 4. After processing the inputs into the respective embeddings, these IR models calculate the similarity between the embeddings and return a list of the 3 most relevant context-based sentences.
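To make the retrieval step concrete, a minimal sketch of a representation-based ranker is given below; the sentence-transformers checkpoint named here is a publicly available multilingual stand-in, not necessarily the exact SXLM-R weights we use.

from sentence_transformers import SentenceTransformer, util

# A public multilingual dual encoder, used here only as a stand-in for SXLM-R.
encoder = SentenceTransformer("paraphrase-xlm-r-multilingual-v1")

def top_k_context_sentences(query, context_sentences, k=3):
    """Rank the sentences of a context against a query (H, P, or H+P) and return the top k."""
    query_emb = encoder.encode(query, convert_to_tensor=True)
    sent_emb = encoder.encode(context_sentences, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, sent_emb)[0]            # one cosine score per context sentence
    top = scores.topk(k=min(k, len(context_sentences)))
    return [context_sentences[int(i)] for i in top.indices]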

Fig. 4 Three different experiments evaluating these IR models on the dataset

We use accuracy@3 (i.e. Acc@3) to evaluate the effectiveness of these IR models. For each model, the final accuracy result is the mean of Acc@3 over 3 annotators. Acc@3 is computed as follows:

$$\begin{aligned} Acc@3 = \frac{X \times 100}{N} \end{aligned}$$
(6)

where X is the number of predicted sentences that appear in the set of sentences selected by the annotators, and N is the total number of annotators in this work.

Table 1 The results of information retrieval component evaluation according to Acc@3

The results of the evaluation of the IR models are shown in Table 1. We observe that the pre-trained SXLM-R model [18] gives the highest results in most experiments: it achieves 57.22 in Experiment 1, and 55.33 and 59.11 in Experiments 2 and 3, respectively. Therefore, the SXLM-R model is the core of our Information Retrieval component.

3.2 Pre-trained language models for NLI

To compare with our proposed method, we conduct experiments with several powerful baseline methods using state-of-the-art pre-trained language models.

3.2.1 Pre-trained language models

In this paper, we used four powerful pre-trained language models that are helpful for Vietnamese NLP tasks:

  • PhoBERT [8] is a monolingual pre-trained model for Vietnamese trained based on RoBERTa [64] with 135M parameters for the base version and 370M for the large version.

  • XLM-R [6] is an improved version of XLM based on RoBERTa model [64]. XLM-R is trained with a cross-lingual masked language modeling objective on data in 100 languages, including Vietnamese from Common Crawl.

  • InfoXLM [7] is a multilingual pre-trained model for over 100 languages with a new cross-lingual pre-training task named cross-lingual contrast (XLCO).

  • mBART [9] is a multilingual encoder-decoder model that is based on BART [39]. mBART is trained with a combination of span masking and sentence shuffling objectives on a subset of 25 languages, including Vietnamese from Common Crawl.

3.2.2 NLI methods using pre-trained language models

The NLI model structures of the encoder and encoder-decoder are illustrated in Fig. 5. For the encoder models (i.e., PhoBERT, XLM-R, and InfoXLM), following [38], given a premise p and a hypothesis h, we concatenate the premise-hypothesis pair into a single sequence. However, because the premise in this work is longer (passage level) than in other works, we place the hypothesis first and the premise second, rather than the usual premise-then-hypothesis order. Specifically, the input takes the form [CLS]+h+[SEP]+p+[SEP], where [CLS] and [SEP] are the special classification and separator tokens. After encoding with the pre-trained model, the last layer's hidden representation of the [CLS] token is fed into an MLP with softmax for classification. For the sequence-to-sequence model (i.e., mBART), we feed the same sequence to both the encoder and the decoder and use the last hidden state for classification. The class with the highest probability is chosen as the model prediction.
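A hedged sketch of the encoder-based setup follows; the checkpoint name, label order, and truncation length are illustrative assumptions (any of the encoders above could be substituted).

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["ENTAILMENT", "NEUTRAL", "CONTRADICTION", "OTHER"]

# Any of the encoders above could be used; XLM-R large is shown as an example.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-large",
                                                           num_labels=len(LABELS))

def predict(hypothesis, premise, max_length=256):
    """Encode the pair as [CLS] + h + [SEP] + p + [SEP] (model-specific special tokens) and classify."""
    enc = tokenizer(hypothesis, premise, truncation=True,
                    max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits             # classification head over the first token
    return LABELS[int(logits.argmax(dim=-1))]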

Fig. 5 The model structure of the encoder and the encoder-decoder ("E" represents ENTAILMENT, "N" represents NEUTRAL, "C" represents CONTRADICTION, and "O" represents OTHER)

4 Experiments and results

4.1 Dataset and experimental design

After determining the best IR model, we conduct experiments on various types of model inputs to address our research questions. First and foremost, we design 4 different experiments, as follows.

  • Experiment 1 - Hypothesis, Context: Since the given context contains the premise and its contextual knowledge, we use the context of each corresponding premise-hypothesis pair as the premise. The average length of a context is 319.9 words.

  • Experiment 2 - Hypothesis, Top(C, P): The premise is one of the sentences of the corresponding context, which is provided in addition to the pair. Contextual knowledge is therefore obtained by applying the best IR model, SXLM-R [18], to the premise and its context, and the retrieved sentences are added to the premise to form the new premise.

  • Experiment 3 - Hypothesis, Top(C, H): As described in [5], the hypothesis was created based on the content of the premise or the situation of the premise. Thus, we apply the best IR model SXLM-R [18] to the hypothesis and its situation (i.e., its context) to obtain contextual knowledge for the premise.

  • Experiment 4 - Hypothesis, Top(C, H+P): Given how closely the formation of premise and hypothesis are related, we assume that contextual knowledge for the premise can be obtained by applying the best IR model SXLM-R [18] to the combination of hypothesis and premise together with its context. A sketch of how these four inputs could be constructed is given after this list.
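The sketch below illustrates, under our assumptions, how the four premise variants could be assembled; `retrieve` stands for a top-k retriever such as the SXLM-R-style ranker sketched in Section 3.1, and the field names of `example` are hypothetical.

def build_premise(example, variant, retrieve, k=3):
    """Assemble the premise for Experiments 1-4 from one ViNLI example.

    `example` is assumed to expose `premise`, `hypothesis`, and the list of
    `context_sentences`; `retrieve(query, sentences, k)` returns the top-k
    context sentences for a query.
    """
    p, h, ctx = example["premise"], example["hypothesis"], example["context_sentences"]
    if variant == "context":        # Experiment 1: the whole context as the premise
        return " ".join(ctx)
    if variant == "top_c_p":        # Experiment 2: premise + Top(C, P)
        return " ".join([p] + retrieve(p, ctx, k))
    if variant == "top_c_h":        # Experiment 3: premise + Top(C, H)
        return " ".join([p] + retrieve(h, ctx, k))
    if variant == "top_c_hp":       # Experiment 4: premise + Top(C, H+P)
        return " ".join([p] + retrieve(h + " " + p, ctx, k))
    raise ValueError(f"unknown variant: {variant}")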

In addition, motivated by how a hypothesis was written, we present a simple rule supporting the generation of better context-based external knowledge as the premise. Our rule is shown in Fig. 6.

Fig. 6 Our rule in generating better context-based external knowledge for entailment only

As indicated in [5], a hypothesis was created based on the content or situation of the premise (i.e., the context of the premise). Therefore, we strongly believe that we can capture the semantic similarity between context and hypothesis by adopting IR methods. However, after running experiments on the processed dataset, we discovered that model performance deteriorates because the added information confuses samples of every label except entailment. Therefore, we propose the above rule (see Fig. 6) to avoid this confusion while achieving the desired improvements; the rule is used in Experiment 5. Specifically, the inputs of Experiment 5 are listed as follows:

  • Input 1: Hypothesis,

  • Input 2: the rule in Fig. 6 applied to (Premise, Top(C, H))

Most experiments are designed to extract more information from the context and incorporate it into the premise sentence as input to the natural language inference model, whereas the hypothesis statements are kept the same. To observe how these datasets vary in length compared to the ViNLI baseline dataset, we compute the full average length of input 1 (premise + Top_1, Top_2, and Top_3 sentences) for each experiment, shown in Table 2. The average premise length of most experiments is significantly longer than that of the original ViNLI dataset.

Table 2 The average length (in words) of the premise in each experiment

With Experiment 1 especially, the premise is quite long, at about 330 words. In addition, for each case of adding 1, 2, or 3 sentences from the context to the premise sentences of Experiments 2, 3, 4, and 5, the average premise length increases significantly. These average length statistics help us choose the max input length parameter of the pre-trained transformer models appropriately.

As described above, the data generated for Experiment 5 differs from the others. Experiments 2, 3, and 4 always add context information to the pairs of inference sentences (in three settings: +1, +2, and +3 sentences), regardless of the label. Meanwhile, in Experiment 5 we focus on whether it is necessary to extract more contextual information for sentence pairs of the ENTAILMENT label by setting thresholds and rules in the Context-based Sentence Extractor. In particular, after applying the best IR model SXLM-R [18] to a hypothesis and its context, we check whether the label of the sample is ENTAILMENT; if so, we further check whether the sample needs more contextual knowledge using our rules. Therefore, not all sentence pairs of the ENTAILMENT label in the ViNLI dataset receive additional contextual information. Figure 7 shows the number of ENTAILMENT sentence pairs in the ViNLI dataset that do and do not need additional contextual information. We find that more than 50% of the ENTAILMENT sentence pairs in the training, development, and test sets of ViNLI need more context. With this considerable amount, we hope that models trained on the new data can solve the difficult cases of the ENTAILMENT label.
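Schematically, the Experiment 5 construction can be sketched as follows; `needs_more_context` is a hypothetical placeholder for the rule shown in Fig. 6, and `retrieve` is a top-k retriever as before.

def build_experiment5_premise(example, retrieve, needs_more_context, k=3):
    """Add retrieved context sentences only to ENTAILMENT pairs that the rule flags.

    `retrieve(query, sentences, k)` is a top-k retriever (e.g. the SXLM-R sketch above);
    `needs_more_context(example)` is a placeholder for the rule shown in Fig. 6.
    """
    p, h, ctx = example["premise"], example["hypothesis"], example["context_sentences"]
    if example["label"] == "ENTAILMENT" and needs_more_context(example):
        return " ".join([p] + retrieve(h, ctx, k))   # append Top(C, H) sentences
    return p                                         # all other pairs keep the original premise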

Fig. 7 The number of sentence pairs with ViNLI's ENTAILMENT label that need more context information in Experiment 5

Generating data by adding contextual information to premise sentences, as in our experiments, increases the premise length. While the length of the hypothesis sentence remains the same, the number of words in the hypothesis that do not appear in the premise changes. We analyze this feature of the data because the rate of new words affects model accuracy: the authors of the ViNLI dataset [5] found that the higher the rate of new words, the more difficult it is for models to predict accurately. Therefore, we analyze the data of Experiment 5 to observe the new-word rate on sentence pairs with the ENTAILMENT label, as shown in Table 3. We compute statistics for all three data creation cases of Experiment 5, namely adding the Top_1, Top_2, and Top_3 sentences to the premise sentence of the ENTAILMENT label. First, we notice that the ENTAILMENT data in all three cases of Experiment 5 has a significantly lower rate of new words than the ENTAILMENT label of the original ViNLI dataset, which allows a model to capture the semantic relationship between premise and hypothesis better. Besides, we also notice that the new-word rate gradually decreases as the premise adds Top_1, Top_2, and Top_3, respectively. This can help models make more accurate predictions when the necessary context information is added.
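For reference, the new-word rate discussed here can be computed as the share of hypothesis tokens absent from the (extended) premise; the whitespace tokenization in this sketch is a simplification, since in practice Vietnamese word segmentation would be applied first.

def new_word_rate(premise, hypothesis):
    """Percentage of hypothesis words that do not occur in the premise."""
    premise_vocab = set(premise.lower().split())
    hyp_tokens = hypothesis.lower().split()
    new = [w for w in hyp_tokens if w not in premise_vocab]
    return 100.0 * len(new) / max(len(hyp_tokens), 1)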

4.2 Experimental settings

In all of our experiments, following the original work on the ViNLI dataset [5], we report the accuracy score as the primary evaluation metric.

As described in Section 4, our approaches depend on pre-trained language models such as XLM-R, PhoBERT, mBART, and InfoXLM. Therefore, we use the models XLM-R\(_{large}\), PhoBERT\(_{large}\), mBART\(_{large}\), and InfoXLM\(_{large}\), respectively, downloaded from the Hugging Face Library. The network parameters are optimized with AdamW [66] and a linear learning rate scheduler, following the Hugging Face default setup. The hyperparameters that we tuned include the number of epochs, batch size, and learning rate. In particular, we set a batch size of 16 and a learning rate of 1e-5 for all component models. Because of the input length, we set the max length to 256 for Top_1 and Top_2, and train with a max length of 512 for Top_3, where Top_1, Top_2, and Top_3 denote the amount of context-based external knowledge, represented as sentences. All experiments in this paper are conducted on Google Colab Pro.
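A minimal sketch of this fine-tuning setup is shown below; the checkpoint name, the placeholder step count, and the batch format are assumptions, while the batch size, learning rate, optimizer, scheduler, and max lengths follow the settings above.

import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          get_linear_schedule_with_warmup)

model_name = "xlm-roberta-large"        # likewise for the PhoBERT, InfoXLM, and mBART checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

max_length = 256                        # 512 when Top_3 sentences are appended
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
num_training_steps = 1000               # placeholder: epochs * number of batches in practice
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=num_training_steps)

def run_epoch(batches, device="cuda"):
    """One training epoch; each batch holds lists of hypotheses, premises, and integer labels."""
    model.to(device).train()
    for batch in batches:               # batch size 16 in our setting
        enc = tokenizer(batch["hypothesis"], batch["premise"], padding=True,
                        truncation=True, max_length=max_length, return_tensors="pt").to(device)
        labels = torch.tensor(batch["label"]).to(device)
        loss = model(**enc, labels=labels).loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()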

Table 3 The ratio of new words in the hypothesis compared to the premise for sentence pairs labeled ENTAILMENT in Experiment 5, compared with the original ViNLI dataset
Table 4 Experiment 5 results with the model’s input as Top_n(Information Retrieval\(_{Entailment}\)[Context, Hypothesis]), Hypothesis

4.3 Results and discussions

According to Table 4, Transformer models with the encoder architecture, especially XLM-R, outperform the others (PhoBERT, InfoXLM, mBART). Besides, mBART, an encoder-decoder model, performs better than PhoBERT and InfoXLM in most settings. PhoBERT, a monolingual pre-trained language model for Vietnamese, gives accuracies of 85.25%, 80.21%, and 85.65% for Top_1, Top_2, and Top_3, respectively. Meanwhile, mBART yields 85.89%, 79.77%, and 86.26% for Top_1, Top_2, and Top_3, respectively, and InfoXLM achieves 85.02%, 82.81%, and 84.78%. The best reported performance is given by the XLM-R model, with 89.5% accuracy. Although our rule only focuses on the "Entailment" label, our approach successfully attains state-of-the-art performance, compared to the 85.99% reported in the original work.

4.3.1 Model performance on different premise lengths

As mentioned earlier, we designed our experiments beyond the sentence level using context-based external knowledge represented as multiple sentences. Therefore, after conducting the experiments, we compare the performance of models trained on multi-sentence premises with that of models trained on the original single-sentence premises, on the new dev set, to gain insight into how context length affects the performance of Transformer-based NLI models. The result is displayed in Fig. 8. When the premise length increases, model performance drops accordingly: the best model, XLM-R, drops from 89.23% (Top_1) to 77.57% (All Context), and the sharpest decline is observed for InfoXLM, which falls from 85.21% to 25.59%. Consequently, the results demonstrate that a longer premise is integral to achieving better performance on the NLI task, but the value of the information in the premise also affects model performance. A similar conclusion holds for the results of the models on the new test set.

Fig. 8 Performance on different premise lengths

Table 5 Model performance per label in ViNLI

4.3.2 Model performance on different labels

We compare model performance across labels between [5] and our approach. Notably, in the original work with four labels, models such as XLM-R perform well on Contradiction and Neutral but struggle when deciding the Entailment relationship. However, as shown in Table 5, our approach can significantly improve the models' decisions on Entailment. In particular, the Entailment accuracies of InfoXLM, PhoBERT, mBART, and XLM-R increase from 86.33% to 91.21%, 87.96%, 89.31%, and 91.47%, respectively. Furthermore, our approach not only enhances performance on the Entailment label; context-based external knowledge also improves the other labels. Specifically, the accuracies of XLM-R on the Contradiction, Neutral, and Other labels increase by 2.88%, 1.46%, and 0.41%, respectively; the performance of PhoBERT on the Other label improves by 0.43%; and mBART improves on Contradiction by 0.4%.

Additionally, Fig. 9 shows one of the prominent cases that our approach can tackle but the original work could not. Figure 9 demonstrates an example of the challenges brought by analytical reasoning: the original premise sentence concerns how good Cavani was in that season, and the hypothesis describes how gifted Cavani was in that season. Models need to determine the facts through analysis and deduction, and the lexical overlap between the premise and hypothesis is low. The best model in the original work [5], XLM-R, incorrectly chose the Neutral label, while our approach, which adds context-based external knowledge mentioning the number of goals and the related records Cavani scored in that season, correctly predicts the Entailment label.

Fig. 9 Example of a case that context-based external knowledge can address, which the original did not. The green indicates the original prediction. The red indicates the correct label and our model's prediction. Reasoning clues are highlighted in the context

4.3.3 Model performance on different inputs

To perform well on the NLI task, humans need more information about the context of the premise and hypothesis, and so do pre-trained language models. Therefore, we experimented with various types of model input, as displayed in Fig. 10.

Besides our main experiment conducted with context-based external knowledge (i.e., Experiment 5), we designed 4 other experiments, as described in Section 4.1. Despite the length and information that the full context provides in Experiment 1, these models did not perform well on the NLI task, achieving 77.26%, 73.72%, 76.53%, and 25.59% for XLM-R, PhoBERT, mBART, and InfoXLM, respectively. In Experiments 2, 3, and 4, after identifying how the premise was created in [5], we applied the best IR model SXLM-R [18] to obtain the premise's contextual information. However, compared to Experiment 5 (i.e., our main contribution), these experiments perform worse: InfoXLM gives the best accuracies of 83.98% and 85.08% in Experiments 2 and 4, while in Experiment 3, XLM-R outperforms the others with 78.37% accuracy. Thus, we conclude that a longer premise is an indispensable factor in improving model performance on the NLI task; however, more attention should be paid to the information in the premise.

Fig. 10 Model performance on different inputs

5 Conclusion and future works

In this paper, we leverage the only open-domain, high-quality NLI dataset for Vietnamese (ViNLI) to automatically create a long-premise Vietnamese NLI dataset and assess the efficiency of a longer premise. We demonstrate that our approach obtains better performance in inferring semantic information by infusing context-based external knowledge created by combining our rules with information retrieval techniques. Therefore, we show not only that a longer premise is integral to achieving better performance on the NLI task, but also that the value of the information in the premise affects the performance of models. Besides, we experiment with both encoder and encoder-decoder models and find that, for this task, the encoder is more suitable than the encoder-decoder. Moreover, our approach achieves state-of-the-art performance for the task of natural language inference with 4 classes on the ViNLI dataset.

However, there are still limitations in our work. In particular, our approach only focuses on the 'Entailment' class to improve model performance. Therefore, in the future, we will pay more attention to designing a more general framework that exploits relevant knowledge not only for 'Entailment' but also for the other classes, based on the given dataset and context. Another direction worth mentioning is exploring new ways to extract relevant knowledge efficiently to improve performance on the Vietnamese NLI task. Ultimately, we aspire to apply our system to downstream NLP tasks such as question answering and summarization.