
1 Introduction

Today, many websites host QA forums where users can post questions and answer other users’ questions. However, users usually have to wait for responses. Moreover, the volume of question answering data has grown enormously, so new questions often duplicate the meaning of questions already in the database. To reduce latency and effort, QA systems based on information retrieval (IR), which retrieve a good answer from an existing answer collection, are essential. QA relies on open-domain datasets, such as texts on the web, or closed-domain datasets, such as collections of medical papers like PubMed [8], to find relevant passages. Moreover, during the COVID-19 pandemic, people have paid more attention to their health, and the number of questions posted on health forums has increased rapidly. Therefore, QA in the medical domain plays an important role. Lexical gaps between queries and relevant documents, which occur when the two use different words to describe similar content, have been a significant issue. Table 1 shows a typical example of this issue in our dataset. Previous studies applied word embeddings to estimate semantic similarity between texts to address this problem [26]. Various studies have applied deep neural networks and BERT to extract semantically meaningful representations of texts [11]. Notably, SBERT has recently achieved state-of-the-art performance on several tasks, including retrieval tasks [7]. This paper focuses on exploring SBERT models fine-tuned with multiple negatives ranking (MNR) loss.

Our contributions are threefold: (1) we introduce ViHealthQA, a dataset containing 10,015 question-answer passage pairs in the medical domain; (2) we propose a two-stage QA system based on SBERT with MNR loss; (3) we perform multiple experiments, including traditional models such as BM25, TF-IDF cosine similarity, and a language model, to compare against our system.

Table 1. A typical example of lexical gaps in the ViHealthQA dataset.

2 Related Work

In early-stage works on QA retrieval, several studies [3] presented sparse vector models. Using unigram word counts, these models map queries and documents to vectors with many zero values and rank similarity values to extract potential documents. In 2008, Manning et al. [14] carried out many experiments to gain a deeper understanding of the role of such vectors, including how to compare queries with documents. Moreover, many researchers [4, 19] have paid attention to BM25 methods for IR tasks.

IR methods with sparse vectors have a significant drawback: the lexical gap challenge. A solution to this problem is to use dense embeddings to represent queries and documents. This idea was proposed early with the LSI approach [2]. However, the most well-known model is BERT, which applies encoders to compute embeddings for queries and documents. Liu et al. [13] added a final mean pooling layer and then calculated similarity values between the outputs, whereas Karpukhin et al. [9] used the initial CLS token. Many studies [10, 12] applied BERT and reached significant results. Notably, SBERT [18] uses Siamese and triplet network structures to produce semantically meaningful sentence embeddings. Several works have applied SBERT to Semantic Textual Similarity (STS) and Natural Language Inference (NLI) benchmarks. In 2021, Ha et al. [5] utilized SBERT to find similar questions in community question answering, running several experiments on SBERT with multiple losses, including MNR loss.

Because our task is in the medical domain, we reviewed some related corpora. For example, CliCR [22] comprises around 100,000 gap-filling queries based on clinical case reports, and MedQA [28] includes answers to real-world multiple-choice questions. In Vietnam, Nguyen et al. [24] published ViNewsQA in 2021, which includes 22,057 human-generated question-answer pairs and supports machine reading comprehension tasks.

3 Task Description

There are n question-answer passage pairs in the database. We have a collection of questions \({Q = \{q_1, q_2, ..., q_n\}}\) and a collection of answer passages \({A = \{a_1, a_2, ..., a_n\}}\). Our task is to build models that, given a question \({q_i \in Q}\), retrieve its precise answer passage \({a_i \in A}\).

4 Dataset

4.1 Dataset Characteristics

We release ViHealthQA, a novel Vietnamese dataset for question answering and information retrieval that includes 10,015 question-answer passage pairs. We collected the data from the Vinmec and VnExpress websites using the BeautifulSoup library. These are forums where users ask health-related questions that are answered by qualified doctors. The dataset consists of four features: index, question, answer passage, and link.

4.2 Overall Statistics

After the data collection phase, we divide our dataset into train, dev, and test sets. In particular, there are 7,009 pairs in Train, 993 pairs in Dev, and 2,013 pairs in Test (Table 2).

According to Table 3, most answer passages are in the range of 101–300 words (34.1%), followed by 301–500 words (31.13%), 501–700 words (15.88%), and 701–1,000 words (9.98%). Longer answer passages (over 1,000 words) account for only a small proportion (7.58%).

Table 2. Statistics of ViHealthQA dataset.
Table 3. Distribution of the answer passage length (%).

4.3 Vocabulary-Based Analysis

To understand the vocabulary of the medical domain, we use the WordClouds tool to visualize the words that appear most frequently in the dataset (Fig. 1). Table 4 shows the 10 most frequent words; these words are related to the medical domain. Besides, users ask many questions about coronavirus (COVID-19), children, inflammatory diseases, and allergies.

Table 4. Top 10 common words in the ViHealthQA dataset.
Fig. 1. Word distribution of ViHealthQA.

5 SPBERTQA: A Two-Stage Question Answering System Based on Sentence Transformers

In this paper, we propose a two-stage question answering system called SPBERTQA (Fig. 2), consisting of a BM25-based sentence retriever and SBERT using PhoBERT fine-tuned with MNR loss. After training, the inputs (the question and the document collection) are fed into BM25-SPhoBERT. Then, we rank the cosine similarity scores between the sentence-embedding outputs to extract the top K candidate documents.

Fig. 2. Overview of our system.

5.1 BM25 Based Sentence Retriever

We aim to train the model by focusing on the meaningful content of our dataset. Thus, we propose a sentence retriever stage that extracts, from every answer passage, the K sentences most relevant to the corresponding question. Moreover, this stage helps work around the maximum sequence length of pre-trained BERT models, which is 512 tokens (\(max\_seq\_length\) of PhoBERT = 256 tokens), while answer passages with over 300 tokens account for more than 65.47% of Train.

We use BM25 for the first stage because BM25 generally brings good results in IR systems [20]. Besides, most answer passages have fewer than four sentences (the average number of sentences per answer passage is 3.95, Table 2), so we choose \(K = 5\). A minimal sketch of this retriever is shown below.
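For illustration, the following Python sketch shows how such a sentence retriever could be implemented with the rank_bm25 package; the naive sentence splitting, whitespace tokenization, and variable names are simplifying assumptions rather than our exact implementation.

```python
# Minimal sketch of the BM25-based sentence retriever (stage 1).
# Assumes word-segmented Vietnamese text; rank_bm25 is one possible BM25 implementation.
from rank_bm25 import BM25Okapi

def retrieve_top_sentences(question, answer_passage, k=5):
    """Return the k sentences of answer_passage most relevant to question."""
    # Naive sentence split; in practice a proper Vietnamese sentence splitter would be used.
    sentences = [s.strip() for s in answer_passage.split(".") if s.strip()]
    if len(sentences) <= k:
        return sentences
    tokenized_sentences = [s.lower().split() for s in sentences]
    bm25 = BM25Okapi(tokenized_sentences)
    scores = bm25.get_scores(question.lower().split())
    top_idx = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # Keep the original sentence order to preserve readability of the shortened passage.
    return [sentences[i] for i in sorted(top_idx)]
```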

5.2 SBERT Using PhoBERT and Fine-Tuning with MNR Loss

Multiple Negatives Ranking (MNR) Loss: MNR loss works well for IR and semantic search [7]. The loss function is given by Equation (1).

$$\begin{aligned} L=-\frac{1}{N} \cdot \frac{1}{K} \cdot \sum _{i=1}^{K}\left[ S\left( x_{i}, y_{i}\right) -\log \sum _{j=1}^{K} e^{S\left( x_{i}, y_{j}\right) }\right] \end{aligned}$$
(1)

In every batch, there are K positive pairs (\({x_i, y_i}\): question and positive answer passage), and each positive pair is combined with \(K - 1\) random negative answer passages \({(y_j, j \ne i)}\). The similarity between a question and an answer passage, \(S(x, y)\), is cosine similarity. Moreover, N is the Train size.

In the second stage, we use the pre-trained PhoBERT model. PhoBERT [15] is the first public large-scale monolingual language model for Vietnamese; its pre-training approach is based on RoBERTa, which optimizes the BERT pre-training procedure for more robust performance. We then fine-tune PhoBERT with MNR loss, as sketched below.
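As an illustration, the second stage could be set up with the sentence-transformers library roughly as follows; the data-loading variable train_pairs is a placeholder, mean pooling is assumed, and the hyperparameters follow Section 6.3.

```python
# Sketch of stage 2: SBERT built on PhoBERT and fine-tuned with MNR loss
# using the sentence-transformers library (data loading and names are illustrative).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

word_embedding = models.Transformer("vinai/phobert-base", max_seq_length=256)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

# Each training example is a (question, positive answer passage) pair;
# MNR loss treats the other passages in the same batch as negatives.
train_examples = [InputExample(texts=[q, a]) for q, a in train_pairs]  # train_pairs: assumed list
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=15,
    optimizer_params={"lr": 2e-5},
)
```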

6 Experiments

6.1 Comparative Methods

We compare our system with traditional methods such as BM25, TFIDF-Cos, and LM; pre-trained PhoBERT; and fine-tuned SBERT models such as BM25-SXLMR and BM25-SmBERT.

BM25. BM25 is an optimized version of TF-IDF. Equation (2) gives the BM25 score of a document D for a query q, where \({d_{avg}}\) is the average document length. BM25 adds two parameters: k balances the contributions of term frequency and IDF, and b adjusts the importance of document length normalization. In 2008, Manning et al. [14] suggested \(k \in [1.2, 2.0]\) and \(b = 0.75\) as reasonable values.

$$\begin{aligned} BM25(D, q)=\underbrace{\frac{f(q, D)*(k+1)}{f(q,D)+k *\left( 1-b+b * \frac{|D|}{d_{avg}}\right) }}_{TF} *\underbrace{\log \left( \frac{N-N(q)+0.5}{N(q)+0.5}+1\right) }_{IDF} \end{aligned}$$
(2)
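For clarity, the following Python function is a direct transcription of Equation (2) for a single query term; the parameter names are illustrative.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k=1.2, b=0.75):
    """Score one query term against one document, following Eq. (2).

    tf: frequency of the term in the document, f(q, D)
    doc_len: length of the document |D|
    avg_doc_len: average document length d_avg
    n_docs: number of documents in the collection, N
    doc_freq: number of documents containing the term, N(q)
    """
    tf_part = tf * (k + 1) / (tf + k * (1 - b + b * doc_len / avg_doc_len))
    idf_part = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    return tf_part * idf_part
```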

TF-IDF Cosine Similarity (TFIDF-Cos). Cosine similarity is one of the most popular similarity measures applied in information retrieval and is superior to other measures such as the Jaccard and Euclidean measures [21]. Let a and b be the TF-IDF bag-of-words vectors of the question and the answer passage, respectively. The similarity between a and b is calculated by Equation (3) [16].

$$\begin{aligned} Cos(a, b)=\frac{a \cdot b}{\Vert a\Vert \Vert b\Vert }=\frac{\sum _{1}^{n} a_{i} b_{i}}{\sqrt{\sum _{1}^{n} a_{i}^{2}} \sqrt{\sum _{1}^{n} b_{i}^{2}}} \end{aligned}$$
(3)
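A minimal sketch of this baseline with scikit-learn might look as follows; the default vectorizer settings and variable names are assumptions, not our exact configuration.

```python
# Sketch of the TFIDF-Cos baseline with scikit-learn (parameter choices are illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_by_tfidf_cos(question, answer_passages, top_k=10):
    """Rank answer_passages by TF-IDF cosine similarity to the question."""
    vectorizer = TfidfVectorizer()
    passage_matrix = vectorizer.fit_transform(answer_passages)   # one row per passage
    question_vec = vectorizer.transform([question])
    scores = cosine_similarity(question_vec, passage_matrix)[0]  # Eq. (3) for each pair
    ranked = sorted(range(len(answer_passages)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]
```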

Language Model (LM). An LM is a probabilistic model of text [23]. Questions and answers are modeled with a probability distribution over sequences of words. The original and most basic method is the unigram query likelihood model with smoothing (Equation (4)).

$$\begin{aligned} P\left( q_{i} \mid D\right) =\left( 1-\alpha _{D}\right) *P_{ml}\left( q_{i} \mid D\right) +\alpha _{D}*P\left( q_{i} \mid C\right) \end{aligned}$$
(4)

\(P_{ml}(q_i \mid D)\) is the maximum likelihood estimate of term \(q_i\) under the language model derived from D, and \(P(q_i \mid C)\) is its unigram probability in a background corpus C, used to avoid zero scores [27]. Besides, various smoothing methods differ in how they set \({\alpha _{D}}\), with \({\alpha _{D} \in [0,1]}\).
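A minimal sketch of this unigram query likelihood model with Jelinek-Mercer-style smoothing is shown below; the fixed alpha value and the token lists are illustrative assumptions.

```python
# Sketch of the unigram query-likelihood model with smoothing (Eq. (4)).
from collections import Counter

def query_likelihood(query_tokens, doc_tokens, corpus_tokens, alpha=0.1):
    """Score a document for a query by multiplying smoothed unigram probabilities."""
    doc_counts, corpus_counts = Counter(doc_tokens), Counter(corpus_tokens)
    doc_len, corpus_len = len(doc_tokens), len(corpus_tokens)
    score = 1.0
    for term in query_tokens:
        p_ml = doc_counts[term] / doc_len if doc_len else 0.0           # P_ml(q_i | D)
        p_bg = corpus_counts[term] / corpus_len if corpus_len else 0.0  # P(q_i | C)
        score *= (1 - alpha) * p_ml + alpha * p_bg
    return score
```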

PhoBERT. We directly use PhoBERT (without fine-tuning) to encode questions and answer passages. Then, we rank the top K answer passages with the highest cosine similarity scores to the corresponding question.
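A rough sketch of this baseline is given below; wrapping PhoBERT with mean pooling via the sentence-transformers library, as well as the placeholder inputs, are assumptions for illustration.

```python
# Sketch of the PhoBERT baseline: encode texts and rank passages by cosine similarity.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("vinai/phobert-base")  # wraps PhoBERT with mean pooling (assumed)
question_emb = encoder.encode(["question text"], convert_to_tensor=True)       # placeholder question
passage_embs = encoder.encode(answer_passages, convert_to_tensor=True)         # answer_passages: assumed list
scores = util.cos_sim(question_emb, passage_embs)[0]
top_k = scores.topk(10).indices.tolist()  # indices of the 10 highest-scoring passages
```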

BM25-SXLMR. This model is similar to ours, but in the second stage we use XLM-RoBERTa instead of PhoBERT. XLM-RoBERTa [1] was pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages (including Vietnamese).

BM25-SmBERT. This model is similar to ours, but in the second stage we use multilingual BERT, introduced in [17]. It is a Transformer model pre-trained on a large Wikipedia corpus covering 104 languages (including Vietnamese) with a masked language modeling (MLM) objective.

6.2 Data Preprocessing

We pre-process the data by lowercasing and removing uninterpretable characters (e.g., newlines and extra whitespace). To tokenize the data, we employ the RDRSegmenter of VnCoreNLP [25]. Moreover, stop words can become a source of noise for traditional methods, which work well on pairs with high word matching between query and answer. Therefore, we also remove stop words: first, we use TF-IDF to extract stop words, and then we remove these words from the data.
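The cleaning steps can be sketched as follows; the VnCoreNLP word segmentation and the TF-IDF-based stop-word list are applied separately and are only referenced here as assumptions.

```python
# Minimal sketch of the text cleaning step; Vietnamese word segmentation with
# VnCoreNLP's RDRSegmenter is performed separately and is not reproduced here.
import re

def clean_text(text, stop_words=None):
    """Lowercase, normalize whitespace, and optionally drop stop words."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()   # collapse newlines and extra whitespace
    if stop_words:                             # stop_words: low-TF-IDF terms (assumed set)
        text = " ".join(tok for tok in text.split() if tok not in stop_words)
    return text
```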

6.3 Experimental Settings

We choose xlm-roberta-base, bert-base-multilingual-cased, and vinai/phobert-base as the pre-trained models. Then, we fine-tune SBERT for 15 epochs with a batch size of 32, a learning rate of \(2e^{-5}\), and a maximum sequence length of 256. Our experiments are performed on a single NVIDIA Tesla P100 GPU on the Google Colaboratory server.

6.4 Evaluation Metric

P@K (Equation (5)) is the percentage of questions for which the exact answer passage appears among the top K retrieved passages [24].

$$\begin{aligned} P@K =\frac{1}{|Q|} \sum _{q \in Q}\left\{ \begin{array}{cl} 1 &{} a_{q} \in A_{K}(q) \\ 0 &{} otherwise \end{array}\right. \end{aligned}$$
(5)

where \({Q = \{q_1, q_2, ..., q_n\}}\) is the collection of questions with \({q \in Q}\), \({A = \{a_1, a_2, ..., a_n\}}\) is the collection of answer passages, \({a_q}\) is the exact answer passage of question q, and \({A_K(q) \subseteq A}\) is the set of K most relevant passages extracted for question q.

Besides, mean average precision (mAP) is used to evaluate the performance of models.
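For reference, the two metrics can be computed roughly as follows; ranked_ids and gold_id are assumed data structures mapping each question to its ranked passage ids and to its exact answer passage id, respectively. With exactly one relevant passage per question, the average precision of a question reduces to the reciprocal rank of its gold passage.

```python
# Sketch of P@K (Eq. (5)) and mAP over a set of questions (data structures are assumed).
def precision_at_k(ranked_ids, gold_id, k):
    """Fraction of questions whose exact answer appears in the top-k retrieved passages."""
    hits = sum(1 for q, ranking in ranked_ids.items() if gold_id[q] in ranking[:k])
    return hits / len(ranked_ids)

def mean_average_precision(ranked_ids, gold_id):
    """mAP; with one relevant passage per question, AP reduces to the reciprocal rank."""
    total = 0.0
    for q, ranking in ranked_ids.items():
        if gold_id[q] in ranking:
            total += 1.0 / (ranking.index(gold_id[q]) + 1)
    return total / len(ranked_ids)
```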

Table 5. Results on Dev and Test with P@K score (%).
Table 6. Results on Dev and Test with mAP score (%).

7 Results and Analysis

7.1 Results and Discussion

As shown in Tables 5 and 6, our system achieves the best performance, with a 62.25% mAP score, a 50.92% P@1 score, and an 83.76% P@10 score on Test. BM25-SXLMR and BM25-SmBERT, which use multilingual BERT models, do not work better than our system, which uses monolingual PhoBERT. Compared to the PhoBERT model without fine-tuning, the models fine-tuned with MNR loss (BM25-SXLMR, BM25-SmBERT, and our system) achieve good results, which shows that using MNR loss to fine-tune models is suitable for this task.

7.2 Analysis

To examine whether our system is more robust than traditional methods and whether traditional methods suffer from the lexical gap issue, we run the models on pairs with a lexical overlap (the number of duplicate words between question and answer passage, denoted X) from 0 to 10. As shown in Fig. 3, with \(X < 4\), the bag-of-words methods cannot extract the precise answer; especially with \(X = 0\), these models mostly do not work. In contrast, with \(X = 0\), the fine-tuned models achieve over a 50% P@1 score, and from \(X = 3\) onward they achieve over an 80% P@1 score. Moreover, we provide typical examples from Dev predicted by BM25, LM, and our system (Table 7). ID 169 has word matching between the question and the answer passage, so BM25, LM, and our system can all retrieve the precise answer. In contrast, in ID 776, no word of the question appears in the answer passage. Hence, a model must understand the semantic background instead of capturing high lexical overlap to retrieve the precise answer. BERT models capture context and meaning better than bag-of-words methods [6]. In particular, SBERT can derive semantically meaningful sentence embeddings [18]. Therefore, our system based on sentence transformers can find the exact answer passage for the question with ID 776.

Fig. 3. Results of lexical overlap experiments with P@1 (%).

Table 7. Examples in Dev predicted by traditional methods and our system.

8 Conclusion and Future Work

In this paper, we first presented ViHealthQA, a dataset comprising 10,015 question-answer passage pairs in the medical domain. Every answer passage is a doctor’s reply to the corresponding user’s question, so ViHealthQA is suitable for real search engines. Secondly, we proposed SPBERTQA, a two-stage question answering system based on sentence transformers, and evaluated it on our dataset. Our proposed system performs best, outperforming bag-of-words models and fine-tuned multilingual pre-trained language models, and it alleviates the lexical gap problem.

In future work, we plan to add a machine reading comprehension (MRC) module, which extracts answer spans from answer passages so that users can grasp the meaning of the answer faster.