
1 Introduction

Question Generation (QG) is defined as the task of automatically generating yes/no, factual and Wh-questions from different forms of input, such as a database, raw text or a semantic representation [1]. QG is not an easy task: it requires not only an in-depth understanding of the input source and the context but also the ability to generate questions that are grammatical and semantically correct. In education, questions are generally constructed as well as assessed by tutors and are crucial for stimulating self-learning and evaluating students' knowledge [2, 3]. It has been shown that a student learns more deeply when prompted by questions [4].

QG can be used in an adaptive intelligent tutoring system or a dialog system [4], for improving question answering [5,6,7,8], for evaluating factual consistency in various text generation tasks [9,10,11], or for automatic assessments [12,13,14,15], including course materials. One strategy a tutor uses to evaluate a learner's comprehension is to ask questions, based on text already provided to the learner, that the learner must answer, or to ask the learner to generate questions from the available text. Different types of questions can be asked of a learner, or a learner may be asked to generate them, to gain an understanding of the learner's comprehension. Some of these question types are gap fill questions (GFQs; [16,17,18]), multiple choice questions (MCQs; [19,20,21]), factoid-based questions (FBQs; [20,21,22,23]) and deep learning questions (DLQs; [24,25,26,27]).

Traditional QG has used syntactic rules with linguistic features to generate FBQs from either a sentence or a paragraph [28,29,30,31,32]. However, QG research has started to utilize "neural" techniques to generate deeper questions [14, 24,25,26,27]. More recent research has relied on pre-trained Transformer-based models for generating questions that are more answer-aware [33,34,35,36].

The goal of this study was to create a multilingual database (English and Spanish) of automatically generated sets of questions using artificial intelligence (AI), machine learning (ML), and large language models (LLMs) in particular, for a culturally sensitive health intelligent tutoring system (ITS) for the Hispanic population. This ITS is being developed as part of a larger NSF-funded project. To build this database, this study carried out a number of tasks to answer the following research questions (RQs):

  • RQ1: What existing systems or models can be used to automatically generate different sets of questions given the context for English as well as Spanish?

  • RQ2: Is the quality of questions generated consistent across different texts and thresholds? If the quality is not consistent, then what factors impact or determine the quality of questions generated?

  • RQ3: What are the optimal values for these factors impacting the question generation quality?

2 Methods

2.1 Data Acquisition

For the purposes of this research, data needed to be fed into our question generators as text. Therefore, eleven short videos recorded in both English and Spanish were transcribed into 1–3 paragraphs of text each. While the topic of each video varied, the theme of every video revolved around the domain of cancer survivorship. Otter.ai was initially used for the transcription tasks but proved inconsistent in transcribing English and ineffective at transcribing Spanish dialog altogether. Human transcribers proficient in English and Spanish produced our final transcriptions.

The data descriptives (number of sentences and words) shown in Table 1 for each transcription were obtained using spaCy, an open-source NLP Python library.
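For illustration, the following is a minimal sketch of how such descriptives can be computed with spaCy; the pipeline names and the sample sentence are assumptions, not the exact setup used in this study.

```python
# Minimal sketch: counting sentences and words per transcription with spaCy.
# The small English/Spanish pipeline names below are assumptions.
import spacy

nlp_en = spacy.load("en_core_web_sm")   # English pipeline
nlp_es = spacy.load("es_core_news_sm")  # Spanish pipeline

def descriptives(text, nlp):
    doc = nlp(text)
    n_sentences = sum(1 for _ in doc.sents)
    n_words = sum(1 for tok in doc if not tok.is_punct and not tok.is_space)
    return n_sentences, n_words

# Hypothetical example
print(descriptives("Cancer survivorship begins at diagnosis. It continues for life.", nlp_en))
```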

Table 1. Data Descriptives for the English and Spanish Expert/Reference Transcriptions

It should be noted that there are differences between the data descriptives for the English and Spanish transcriptions. Some of these differences result from variations in the length of the videos for the two languages; the remaining differences are caused by linguistic differences between the two languages.

2.2 Data Preparation

The final transcriptions were then split into groups based on language and topic before being written into separate rows of a CSV file. Depending on the experiment, the texts were entered in the CSV file differently: whole texts and individual texts were separated into English-only, Spanish-only and multilingual versions.
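A minimal sketch of this preparation step is shown below; the field names and file naming are illustrative assumptions rather than the exact format used.

```python
# Illustrative sketch of the data-preparation step: each transcription is written to
# its own row of a CSV file, grouped by language. Field names are assumptions.
import csv

transcriptions = [
    {"language": "en", "topic": "nutrition", "text": "..."},
    {"language": "es", "topic": "nutrition", "text": "..."},
]

with open("transcriptions_en.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["language", "topic", "text"])
    writer.writeheader()
    for row in transcriptions:
        if row["language"] == "en":   # English-only version; Spanish-only and
            writer.writerow(row)      # multilingual versions are built analogously
```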

2.3 Language Models

Language models (LMs) are integral to natural language processing (NLP), the process by which a computer is able to understand, analyze and generate human language. To do so, LMs are trained on large datasets of text gathered from a variety of sources such as books, articles, and Wikipedia. After being trained on this data, LMs are able to make predictions based on recognized patterns in natural language that aid in a number of NLP tasks [37].

Large Language Models (LLMs).

LLMs are an evolution of LMs that are trained on considerably larger datasets and domains. They employ self-attention in Transformers to capture long-range dependencies in text and to enable parallelization. Using in-context learning, models can be adapted to a specific context or prompt, allowing LLMs to create more coherent and human-like outputs and taking major strides toward the advancement of NLP tasks and artificial intelligence [38, 39].

Chat GPT.

Chat GPT is an LLM that uses the transformer architecture to process input data and create an adequate response. We used the 'davinci-003' engine for its ability to handle large prompts and instruction-following tasks. This model was trained on Common Crawl, web texts, books and Wikipedia [40].
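For illustration, a hedged sketch of how questions can be requested from this engine through the legacy (pre-1.0) openai Python Completions API is given below; the exact prompt wording and generation parameters used in the study are assumptions.

```python
# Sketch of prompting the davinci-003 engine for question generation via the legacy
# openai Completions API (openai<1.0). Prompt wording and parameters are assumptions.
import openai

openai.api_key = "YOUR_API_KEY"

def generate_questions(context, n_questions=3):
    prompt = (f"Generate {n_questions} questions based on the following text:\n\n"
              f"{context}\n\nQuestions:")
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0.7,
    )
    return response.choices[0].text.strip()

print(generate_questions("Cancer survivors should schedule regular follow-up visits.", 3))
```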

Valhalla/t5-base-e2e-qg.

This is an LLM pre-trained on the SQuADv1 dataset, which consists of questions created from Wikipedia articles where the answer to each question is a segment of text from the corresponding reading passage. T5 stands for Text-to-Text Transfer Transformer, a model that reframes all NLP tasks into a unified text-to-text format where the input and output are always text strings.
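A minimal usage sketch with the Hugging Face text2text-generation pipeline is shown below; the "generate questions:" prefix and the "<sep>" separator follow the model card and may need adjusting.

```python
# Sketch: end-to-end question generation with valhalla/t5-base-e2e-qg via the
# Hugging Face pipeline. Prefix and separator follow the model card.
from transformers import pipeline

qg = pipeline("text2text-generation", model="valhalla/t5-base-e2e-qg")

context = "Cancer survivorship begins at the time of diagnosis and continues for life."
output = qg("generate questions: " + context, max_length=256)[0]["generated_text"]

# The model emits questions separated by a "<sep>" marker
questions = [q.strip() for q in output.split("<sep>") if q.strip()]
print(questions)
```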

mrm8488/bert2bert-spanish-question-generation.

This is an LLM pre-trained on the SQuAD dataset translated to Spanish. It utilizes a bert2bert model, which means both the encoder and decoder are powered by BERT models.
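A hedged usage sketch is given below, loading the checkpoint named above as an encoder-decoder model; the Spanish input text is illustrative and the generation settings are assumptions.

```python
# Sketch: Spanish question generation with a bert2bert encoder-decoder checkpoint.
# The model identifier is the one named above; generation settings are assumptions.
from transformers import AutoTokenizer, EncoderDecoderModel

model_id = "mrm8488/bert2bert-spanish-question-generation"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = EncoderDecoderModel.from_pretrained(model_id)

context_es = "La supervivencia al cáncer comienza en el momento del diagnóstico."
inputs = tokenizer(context_es, return_tensors="pt", truncation=True)
output_ids = model.generate(inputs.input_ids, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```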

T5 (Small, Base, and Large) and mT5 (Small and Base).

A transformer-based architecture that uses a text-to-text approach: tasks such as translation, question answering, and classification are given to the model as text input, and it is trained to generate the corresponding target text [41]. mT5 is the multilingual variant of T5.

Flan-T5 (Small, Base and Large).

This model is similar to T5 but is instruction-tuned and is therefore capable of performing various zero-shot NLP tasks, as well as few-shot in-context learning tasks.
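As an illustration of this zero-shot behavior, the sketch below prompts a Flan-T5 checkpoint with a plain-language instruction; the checkpoint name and instruction wording are assumptions.

```python
# Sketch: zero-shot question generation with an instruction-tuned Flan-T5 checkpoint.
# Checkpoint name and instruction wording are assumptions.
from transformers import pipeline

flan = pipeline("text2text-generation", model="google/flan-t5-base")

context = "Cancer survivors benefit from regular physical activity and a balanced diet."
prompt = f"Generate 3 questions about the following text: {context}"
print(flan(prompt, max_length=128)[0]["generated_text"])
```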

BART (Base and Large) and mBART (Large).

A denoising autoencoder for pre-training sequence-to-sequence models, trained by corrupting text with an arbitrary noising function and learning to reconstruct the original text. It uses a standard transformer-based neural machine translation architecture [42]. mBART is the multilingual variant of BART.

2.4 Experimentation

A number of LMs were tested in order to compare their capabilities on the task of multilingual question generation. The LMs Chat GPT, valhalla/t5-base-e2e-qg, T5 (Small, Base, and Large), mrm8488/bert2bert-spanish-question-generation, mT5 (Small and Base), Flan-T5 (Small, Base and Large), BART (Base and Large) and mBART (Large) were chosen for our experiments. Each of these models was given a prompt to produce a certain number of questions (3, 5, 7 and 10) using one of the transcribed texts as the context.

The length and combination of texts given to each model depended on the language it was designed to handle. LMs that specialized in either English or Spanish were given each of the 11 texts individually; the texts in the two languages were then combined into a single text to create a multilingual text, and the question generators were prompted to create either 3, 5, 7 or 10 questions based on the text. The multilingual models performed the same task in both languages, plus a multilingual version in which the individual Spanish and English texts corresponding to the same topic were fed through at the same time. Some of the models could not process the combined Spanish and English whole text, so this test was eliminated for them. Each model was evaluated by judging the outputs it produced for coherency, spelling and accuracy.
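The following sketch outlines this experimental grid; the models mapping and the generate callables are placeholders for the model wrappers described above, not code from the study.

```python
# Sketch of the experimental grid: every model is prompted to produce 3, 5, 7 and 10
# questions for each transcribed text. `models` and `generate` are placeholders.
QUESTION_COUNTS = [3, 5, 7, 10]

def run_experiment(models, texts):
    """models: {name: callable(context, n_questions)}, texts: {(language, topic): context}."""
    results = []
    for name, generate in models.items():
        for (language, topic), context in texts.items():
            for n in QUESTION_COUNTS:
                results.append({
                    "model": name,
                    "language": language,
                    "topic": topic,
                    "n_requested": n,
                    "questions": generate(context, n),
                })
    return results
```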

3 Results

The evaluation of the models discussed so far was carried out manually: the meaningfulness, syntax and semantic soundness of the output text (in this case, the generated questions) were all taken into consideration for each of the target languages. Chat GPT using the 'davinci-003' engine performed best at producing 3, 5, 7 and 10 questions based on English, Spanish and multilingual texts of various sizes. Figure 1 shows an example output obtained from Chat GPT for a given context when the model was prompted to generate 3 questions.

Fig. 1. Example output from Chat GPT for the given context when prompted to generate 3 questions

Although the model 'mT5' was capable of generating questions in both English and Spanish, it was not as consistent as Chat GPT. Pre-trained models such as 't5-base-e2e-qg', 'bert2bert-spanish-question-generation', 'T5', 'Flan-T5', 'BART' and 'mBART' worked well in their respective languages but were not able to handle multilingual question generation tasks and therefore failed this phase of testing. The 't5-base-e2e-qg' and 'bert2bert-spanish-question-generation' models also failed to create the correct number of questions when prompted to do so. Tables 2 and 3 summarize each model's performance and sample outputs (generated questions) for the multilingual and monolingual models.

Table 2. Question Generation Performance for Different Multilingual Models
Table 3. Question Generation Performance for Different Monolingual Models

4 Evaluation Metrics

There are no evaluation metrics designated specifically for question generation, and it is also challenging to define a gold standard of proper questions to ask. Useful criteria for evaluating questions include the meaningfulness of a question, how syntactically correct it is, and how semantically sound it is, but these criteria are very difficult to quantify. As a result, most QG systems rely on human evaluation: a set of questions is randomly sampled from the generated questions, and human experts or annotators are asked to rate the quality of the questions on a 5-point Likert scale.

Like any task that requires humans to create references, including human question generation, human evaluation of generated questions is costly and time-consuming. Therefore, some commonly used automatic evaluation metrics for natural language generation (NLG), such as BiLingual Evaluation Understudy (BLEU) [43], Metric for Evaluation of Translation with Explicit ORdering (METEOR) [44], National Institute of Standards and Technology (NIST) [45], and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [46], are also widely used for question generation. Although these metrics remain in frequent use, some studies have shown that they do not correlate well with adequacy, coherence and fluency [47,48,49,50], because they compute similarity between the source and the target sentence (in this case, the generated question) based on overlapping n-grams.
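For concreteness, a small example of this kind of n-gram overlap scoring, using the NLTK implementation of sentence-level BLEU, is shown below; the reference and candidate questions are hypothetical.

```python
# Example of n-gram overlap scoring: sentence-level BLEU between a hypothetical
# reference question and a generated candidate (NLTK implementation).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "when does cancer survivorship begin".split()
candidate = "at what point does cancer survivorship begin".split()

smooth = SmoothingFunction().method1  # avoids zero scores for short sentences
print(sentence_bleu([reference], candidate, smoothing_function=smooth))
```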

To overcome the issues encountered with these popular NLG evaluation metrics, a few new metrics have recently been proposed [51,52,53]. Unlike the existing metrics, these new metrics [52] consider several question-specific factors, such as named entities, content and function words, and the question type, when evaluating the "answerability" of a question given the context [51, 52]. The authors of [53] proposed a dialogue evaluation model called ADEM that, in addition to word overlap statistics, uses a hierarchical RNN encoder to capture semantic similarity.

Some more recent metrics that also evaluate candidate questions given reference questions are BLEURT [54,55,56], BERTScore [57] and MoverScore [58]. BERTScore and MoverScore use Bidirectional Encoder Representations from Transformers (BERT) [37] embeddings for token-level matching instead of n-gram overlap. BLEURT is a regression-based metric that uses supervised learning to train a regression layer that mimics human judgment.
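A minimal sketch of embedding-based scoring with the bert-score package is shown below; the candidate and reference questions are hypothetical.

```python
# Sketch: scoring a candidate question against a reference with the bert-score
# package (embedding-based token matching rather than n-gram overlap).
from bert_score import score

candidates = ["When does cancer survivorship begin?"]
references = ["At what point does cancer survivorship start?"]

P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```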

The focus of this study was question generation; as a result, we used human evaluation to determine good-quality questions for the creation of our question corpora. However, to scale our work to a larger corpus, we will explore these automatic evaluation metrics in future work.

5 Discussion

This study was conducted to create a database of multilingual questions in both English and Spanish and to answer the three research questions (RQ1–RQ3) discussed in the Introduction (Sect. 1).

5.1 RQ1: What Existing Systems or Models can be Used to Automatically Generate Different Sets of Questions Given the Context for English as Well as Spanish?

We found that several existing systems and models can be used to automatically generate questions given the context for both English and Spanish. The models tested for English in this study include 'valhalla/t5-base-e2e-qg', 'T5 (Small, Base, and Large)', 'Flan-T5 (Small, Base and Large)' and 'BART (Base and Large)'. For Spanish question generation, the 'mrm8488/bert2bert-spanish-question-generation' model was implemented, and for mixed data (containing both English and Spanish), 'Chat GPT', 'mT5 (Small and Base)' and 'mBART (Large)' were tested. Chat GPT outperformed all other models at generating different sets of questions (3, 5, 7 and 10 in this study) for English, Spanish and multilingual texts of various sizes, considering each question's meaningfulness, syntax and semantic soundness. Other multilingual models such as 'mT5' showed promising results, but the questions produced were repetitive and lacked meaningfulness compared to those generated by Chat GPT. Models such as T5 (Small, Base, and Large), Flan-T5 (Small, Base and Large), valhalla/t5-base-e2e-qg and BART (Base and Large) also produced coherent questions but were limited to single-language contexts.

5.2 RQ2: Is the Quality of Questions Generated Consistent Across Different Texts and Thresholds? If the Quality is not Consistent, then What Factors Impact or Determine the Quality of Questions Generated?

The length of the transcript and the number of questions our models were prompted to generate correlated with the overall quality of the outputs across all the models tested in this study: longer transcripts and fewer prompted questions produced higher-quality outputs. It should also be noted that each of the models was pre-trained on different corpora and therefore performed better or worse than its counterparts during this experiment.

5.3 RQ3: What are the Optimal Values for These Factors Impacting the Question Generation Quality?

During our experiments we observed that a text/transcript of at least 100 words is sufficient to generate 5–7 questions of reasonable quality. When models were prompted to produce 10 or more questions based on texts containing 100 words or fewer, the meaningfulness, syntax and semantic soundness of the outputs decreased notably.
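The sketch below encodes this observation as a simple heuristic for capping the requested number of questions; the thresholds are the ones reported in this study, not universal constants, and the helper name is illustrative.

```python
# Illustrative heuristic based on the observation above: cap the number of requested
# questions by the length of the source text. Thresholds come from this study only.
def recommended_question_count(text: str, requested: int) -> int:
    n_words = len(text.split())
    if n_words < 100:
        return min(requested, 5)  # short texts: asking for more degrades quality
    return min(requested, 7)      # ~100+ words supported 5-7 reasonable questions
```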

5.4 Evaluation

Even though the focus of this study was question generation rather than question evaluation, we feel that the next step of this study will be to evaluate these generated questions in order to build a corpus of high-quality questions for the healthcare domain. Due to the small size of the corpus used in this study, we relied on human evaluation to determine the quality of the generated questions. However, for scalability and for future studies with larger corpora, we need to thoroughly investigate and improve upon the existing schemes to accurately measure the quality of questions, in particular deep questions.