
1 Introduction

Question answering (QA) is the task of providing accurate responses to questions based on a passage. In other words, QA systems enable users to ask questions and retrieve answers using natural language queries [1], and they can be viewed as an advanced form of information retrieval [2]. QA has also been used to build dialogue systems and chatbots designed to simulate human conversation. Processing a question involves two main steps. The first is to examine the structure of the user's query. The second is to convert the question into a meaningful question formula that is compatible with the domain of the QA system [3]. The majority of modern NLP problems revolve around unstructured data, which entails extracting the data from a source such as a JSON file, processing it, and then using it as needed. Based on the implementation approach, the task of extracting answers to questions falls into one of four types:

1. IR-QA (Information retrieval based)
2. NLP-QA (Natural language processing based)
3. KB-QA (Knowledge based)
4. Hybrid QA.

2 General Architecture

The architecture of a question answering system is as follows: the user asks a question, and this query is then used to extract all possible answers from the given context. The general architecture of a question answering system is depicted in Fig. 1.

Fig. 1 Question answering systems [4]: question → question processing → document processing → answer processing → answer

2.1 Question Processing

Given a question as input, the question processing module processes and analyzes the question so that the machine can understand its context.
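As an illustration only, the sketch below infers a question's expected answer type from its leading wh-word; the categories and rules here are our own assumptions, not part of any specific system described above.

```python
# Minimal heuristic sketch of question processing: map a question's wh-word
# to an expected answer type. The categories and rules are illustrative
# assumptions, not a standard taxonomy.

QUESTION_TYPE_RULES = {
    "who": "PERSON",
    "when": "DATE/TIME",
    "where": "LOCATION",
    "how many": "NUMBER",
    "how much": "QUANTITY",
    "why": "REASON",
    "what": "ENTITY/DEFINITION",
    "which": "ENTITY",
    "how": "MANNER",
}

def expected_answer_type(question: str) -> str:
    """Return a coarse expected answer type based on the leading wh-phrase."""
    q = question.lower().strip()
    # Check longer cues first so "how many" is not matched as plain "how".
    for cue in sorted(QUESTION_TYPE_RULES, key=len, reverse=True):
        if q.startswith(cue):
            return QUESTION_TYPE_RULES[cue]
    return "UNKNOWN"

if __name__ == "__main__":
    print(expected_answer_type("When was the LUNAR system created?"))   # DATE/TIME
    print(expected_answer_type("How many articles does SQuAD cover?"))  # NUMBER
```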

2.2 Document Processing

Once the question has been given as input, the next major task is to parse the entire context passage to locate candidate answer spans. In this stage, the related results that satisfy the given query are collected in accordance with the rules and keywords.
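As a minimal sketch of one common way to realize this stage, and assuming scikit-learn is available, candidate passages can be ranked by TF-IDF cosine similarity to the question; the passages and question below are invented purely for illustration.

```python
# Illustrative document-processing step: rank candidate passages against the
# question with TF-IDF cosine similarity (scikit-learn assumed to be installed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "The LUNAR system was created in 1971 to help lunar geologists analyze rock samples.",
    "BASEBALL answered questions about American League games played over one season.",
    "SQuAD contains more than 100,000 question-answer pairs drawn from Wikipedia articles.",
]
question = "Which system helped geologists analyze lunar rock samples?"

vectorizer = TfidfVectorizer(stop_words="english")
passage_vectors = vectorizer.fit_transform(passages)   # one row per passage
question_vector = vectorizer.transform([question])     # reuse the same vocabulary

scores = cosine_similarity(question_vector, passage_vectors)[0]
best = scores.argmax()
print(f"Best passage (score {scores[best]:.2f}): {passages[best]}")
```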

2.3 Answer Processing

After the document processing stage, similarity between the question and the retrieved content is checked so that the related answer can be displayed. Once a candidate answer has been identified, a set of heuristics is applied to it in order to extract and display only the relevant word or phrase that answers the question.
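As a minimal end-to-end sketch of the three stages above, assuming the Hugging Face transformers library and a SQuAD-fine-tuned checkpoint are available, an extractive QA pipeline takes the question and the retrieved context and returns the answer span:

```python
# Minimal extractive QA sketch using a SQuAD-fine-tuned checkpoint
# (assumes the Hugging Face `transformers` library is installed).
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",  # example checkpoint, swappable
)

context = (
    "The LUNAR system was created in 1971 to aid lunar geologists in accessing, "
    "comparing, and evaluating the chemical composition of lunar rock and soil."
)
result = qa(question="When was the LUNAR system created?", context=context)

# The pipeline returns the extracted span plus a confidence score and character offsets.
print(result["answer"], result["score"], result["start"], result["end"])
```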

3 Background

In 1951, Alan Turing asked, “Can digital computers think?” He asserted that a machine could be said to be thinking if it could participate in a teleprinter conversation and imitate a human completely, without any telltale differences. In 1952, the Hodgkin–Huxley model [5] showed how the brain creates a system that resembles an electrical network using neurons. According to Hans Peter Luhn [6], “the weight of a term that appears in a document is simply proportional to the frequency of the term”. Artificial intelligence (AI), natural language processing (NLP), and their applications have all been influenced by these events. The BASEBALL program, created in 1961 by Green et al. [7] to answer questions about baseball games played in the American League over the course of a season, is the best-known early question answering system. The LUNAR system [8], created in 1971 to help lunar geologists easily access, compare, and evaluate the chemical composition of lunar rock and soil during the Apollo Moon missions, is another well-known piece of work in this field. Several earlier systems, including SYNTHEX, LIFER, and PLANES [9], also attempted to answer questions. Figure 2 depicts the stages of evolution of the NLP models.

Fig. 2 Evolution of NLP models [10]: bag of words → TF-IDF → co-occurrence matrix → word2vec/GloVe → transformer models → ELMo/BERT/XLNet

4 Benchmarks in NLP

Benchmarks are standard sets of tasks, agreed upon by a large community, that are used to assess the performance of different systems or models. To ensure broad acceptance, results are usually reported on multiple standard benchmarks. Some of the most widely used benchmarks are GLUE, SuperGLUE, SQuAD1.1, and SQuAD2.0.

4.1 GLUE (General Language Understanding Evaluation)

General Language Understanding Evaluation, also known as GLUE, is a sizable collection of resources for developing, testing, and analyzing natural language understanding systems. It was released in 2018 and remains widely used today. Its components are as follows (a minimal loading sketch follows the list):

1. A benchmark of nine sentence- or sentence-pair language understanding tasks constructed on well-established existing datasets and chosen to cover a wide range of dataset sizes, text genres, and degrees of difficulty;
2. A leaderboard to find the top overall model;
3. A diagnostic dataset to assess and analyze the model’s performance in relation to a variety of linguistic issues encountered in the natural language domain.
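As an illustrative sketch, and assuming the Hugging Face datasets library is installed, an individual GLUE task can be loaded by its configuration name (MRPC is used here purely as an example):

```python
# Illustrative sketch: load one GLUE task via the Hugging Face `datasets` library
# (library availability assumed; "mrpc" is just one of the nine task names).
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")
print(mrpc)               # train / validation / test splits
print(mrpc["train"][0])   # a sentence pair with a paraphrase label
```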

4.2 SuperGLUE

SuperGLUE is an updated version of the GLUE benchmark, introduced in 2019. It follows the design of GLUE but provides a whole new set of improved and more difficult language understanding tasks, requires stronger reasoning, and ships with a new public leaderboard. Currently, the Microsoft Alexander v-team with Turing NLRv5 is leading the leaderboard with an overall score of 91.2.
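As a further sketch under the same assumption (the Hugging Face datasets library), a SuperGLUE task such as BoolQ can be loaded by name; note that its fields differ from GLUE's sentence-pair format.

```python
# Illustrative sketch: load one SuperGLUE task (BoolQ) via the `datasets` library.
from datasets import load_dataset

boolq = load_dataset("super_glue", "boolq")
print(boolq["train"][0])   # fields: question, passage, label (yes/no)
```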

4.3 SQuAD1.1 (Stanford Question Answering Dataset 1.1)

SQuAD, the Stanford Question Answering Dataset, was introduced in 2016 and consists of reading comprehension data based on Wikipedia articles. This version of the dataset contains 100,000+ question-answer pairs on 500+ articles.

4.4 SQuAD2.0 (Stanford Question Answering Dataset 2.0)

SQuAD2.0, or Stanford Question Answering Dataset 2.0, combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially to look similar to answerable ones. SQuAD2.0 therefore tests the ability of a system not only to answer questions when possible, but also to determine when no answer can be found in the passage. Currently, IE-NET (ensemble) by RICOH_SRCB_DML leads the leaderboard with an EM score of 90.93 and an F1 score of 93.21.
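As a sketch, again assuming the Hugging Face datasets library, SQuAD2.0 can be loaded directly, and its unanswerable questions are recognizable by their empty answer lists:

```python
# Illustrative sketch: load SQuAD2.0 and count unanswerable questions,
# which are marked by an empty `answers["text"]` list.
from datasets import load_dataset

squad_v2 = load_dataset("squad_v2")
validation = squad_v2["validation"]

unanswerable = sum(1 for ex in validation if len(ex["answers"]["text"]) == 0)
print(f"{unanswerable} of {len(validation)} validation questions are unanswerable")
```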

5 Research

In this systematic literature review (SLR), we follow the guidelines provided by Okoli and Schabram [11] and Keele [12], which emphasize: purpose of the literature review, searching the literature, practical screening, quality appraisal, and data extraction. The amount of written digital information has increased exponentially, necessitating the use of increasingly sophisticated search tools, as noted by Pinto et al. [13] and Bhoir and Potey [14]. Unstructured data is being gathered and stored at previously unheard-of rates, and its volume keeps growing, as observed by Bakshi et al. [15], Malik et al. [16], and Chali et al. [17], among others. The main difficulty is creating a model that can effectively extract data and knowledge for various tasks. In this situation, the goal of question answering systems is to glean as many correct answers to questions as possible. This SLR is guided by the research questions in Table 1 in an effort to understand how question answering techniques, tools, algorithms, and systems work and perform, as well as their dependability in carrying out the task.

Table 1 Research questions to be addressed

We gathered as many English-language journal articles and papers as possible from different digital libraries and reputed publications using various keywords, and tried to provide strong evidence related to the research questions tabulated above.

RQ_1: Fig. 3 shows the popularity of the various models on the basis of the number of papers published in each category every year. We can observe that BERT-based models are the most popular in this category.

Fig. 3 Popularity of the models: a stacked bar graph of five model families per year, 2018 to 2021; BERT has the highest value in all years, with no data in 2018

RQ_2: Fig. 4 shows the various question answering fields in which QA models are used. We can see that general-domain QA dominates.

Fig. 4 Distribution of question answering areas: general 88%, open domain 6%, community 2%, answer selection 2%, knowledge based 1%, generative 1%

RQ_3: Fine-tuning of different models has given rise to various improvements over the existing models. Moreover, applying different techniques on top of an existing model can yield a new model that improves on it. For example, different BERT-based models such as ALBERT, RoBERTa, and DistilBERT, with different parameter counts, are used according to need, as shown in Table 2.

Table 2 Different applications using different models
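As an illustration of how such variants are swapped in, and assuming the transformers library plus the public checkpoints named below are available, the same extractive-QA head can be attached to different base encoders simply by changing the checkpoint name:

```python
# Illustrative sketch: attach the same extractive-QA head to different
# BERT-family encoders by changing the checkpoint name
# (assumes `transformers` is installed and the listed public checkpoints exist).
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

checkpoints = [
    "bert-base-uncased",
    "roberta-base",
    "distilbert-base-uncased",
    "albert-base-v2",
]

for name in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForQuestionAnswering.from_pretrained(name)
    # Parameter counts differ sharply, which is why the lighter variants are
    # preferred when latency or memory is constrained.
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name}: {params_m:.0f}M parameters")
```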

RQ_4: This is the main purpose of the literature review. This question is answered with the support of Table 3. Many papers have been taken into consideration for this comparison [8, 18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38]. Here, we considered only three models, as these are the main base models that predominate in the question answering domain.

Table 3 Areas in which the models work

6 Conclusion

A question answering system using NLP techniques is a much more complicated process than other types of information retrieval systems. Closed-domain QA systems are able to give more accurate answers than open-domain QA systems, but they are restricted to a single domain. After the screening phases, we can see that attention-based models are the most preferred among researchers. We also observed that researchers have equally turned to hybrid approaches, such as graph attention, and to applying different mechanisms on top of a base model to make their job easier. The contribution of this work is a systematic outline of the different question answering systems that perform well across different tasks. Future work should explore the possibility of a model that can outperform all of these models.