
1 Introduction

Question Generation (QG) is defined as the task of automatically generating yes/no, factual and Wh-questions from different forms of input, such as a database, raw text or a semantic representation [1]. QG is not an easy task: it requires not only an in-depth understanding of the input source and the context but also the ability to generate questions that are grammatical and semantically correct. In education, questions are generally constructed as well as assessed by tutors and are crucial for stimulating self-learning and evaluating students' knowledge [2, 3]. It has been shown that a student learns more deeply when prompted by questions [4].

QG can be used in an adaptive intelligent tutoring system or a dialog system [4], for improving question answering [5,6,7,8], for evaluating factual consistency in various text generation tasks [9,10,11], or for automatic assessments [12,13,14,15], including course materials. One strategy a tutor uses to evaluate a learner's comprehension is to ask questions, based on text already provided to the learner, that the learner must answer, or to ask the learner to generate questions from the available text. Different types of questions can be asked of a learner, or a learner may be asked to generate them, to gain an understanding of the learner's comprehension. Some of these question types are gap fill questions (GFQs; [16,17,18]), multiple choice questions (MCQs; [19,20,21]), factoid-based questions (FBQs; [20,21,22,23]) and deep learning questions (DLQs; [24,25,26,27]).

Traditional QG has used syntactic rules with linguistic features to generate FBQs from either a sentence or a paragraph [28,29,30,31,32]. However, QG research has started to utilize "neural" techniques to generate deeper questions [14, 24,25,26,27]. More recent research has relied on pre-trained Transformer-based models for generating questions that are more answer-aware [33,34,35,36].

The goal of this study was to create a multilingual database (English and Spanish) of automatically generated sets of questions using artificial intelligence (AI), machine learning (ML), and large language models (LLMs) in particular, for a culturally sensitive health intelligent tutoring system (ITS) for the Hispanic population. This ITS is being developed as part of a larger NSF-funded project. To build this database, this study carried out a number of tasks to answer the following research questions (RQs):

  • RQ1: What existing systems or models can be used to automatically generate different sets of questions given the context for English as well as Spanish?

  • RQ2: Is the quality of questions generated consistent across different texts and thresholds? If the quality is not consistent, then what factors impact or determine the quality of questions generated?

  • RQ3: What are the optimal values for these factors impacting the question generation quality?

2 Methods

2.1 Data Acquisition

For the purposes of this research, data needed to be fed into our question generators as text. Therefore, eleven short videos recorded in both English and Spanish were transcribed into 1–3 paragraphs of text each. While the topic of each video varied, the theme of every video revolved around the domain of cancer survivorship. Otter.ai was initially used for the transcription tasks but proved inconsistent in transcribing English and ineffective at transcribing Spanish dialog altogether. Human transcribers proficient in English and Spanish produced our final transcriptions.

The data descriptives (number of sentences and words) shown in Table 1 for each transcription were obtained using spaCy, an open-source NLP Python library.
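For illustration, the following is a minimal sketch of how such descriptives can be computed with spaCy; the pipeline names and the sample sentence are assumptions, not the exact setup used in this study.

```python
# Minimal sketch: counting sentences and words per transcription with spaCy.
# The small English/Spanish pipeline names below are assumptions.
import spacy

nlp_en = spacy.load("en_core_web_sm")   # English pipeline
nlp_es = spacy.load("es_core_news_sm")  # Spanish pipeline

def descriptives(text, nlp):
    doc = nlp(text)
    n_sentences = sum(1 for _ in doc.sents)
    n_words = sum(1 for tok in doc if not tok.is_punct and not tok.is_space)
    return n_sentences, n_words

# Hypothetical example
print(descriptives("Cancer survivorship begins at diagnosis. It continues for life.", nlp_en))
```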

Table 1. Data Descriptives for the English and Spanish Expert/Reference Transcriptions

It should be noted that there are differences between the data descriptives for the English and Spanish transcriptions. Some of these differences result from variations in the length of the videos for the two languages; the remaining differences are caused by linguistic differences between the two languages.

2.2 Data Preparation

The final transcriptions were then split into groups based on language and topic before being written into separate rows of a CSV file. Depending on the experiment, the texts were entered in the CSV file differently: whole texts and individual texts were separated into English-only, Spanish-only and multilingual versions.
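A minimal sketch of this preparation step is shown below; the field names and file naming are illustrative assumptions rather than the exact format used.

```python
# Illustrative sketch of the data-preparation step: each transcription is written to
# its own row of a CSV file, grouped by language. Field names are assumptions.
import csv

transcriptions = [
    {"language": "en", "topic": "nutrition", "text": "..."},
    {"language": "es", "topic": "nutrition", "text": "..."},
]

with open("transcriptions_en.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["language", "topic", "text"])
    writer.writeheader()
    for row in transcriptions:
        if row["language"] == "en":   # English-only version; Spanish-only and
            writer.writerow(row)      # multilingual versions are built analogously
```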

2.3 Language Models

Language models (LMs) are integral to natural language processing (NLP), the process by which a computer is able to understand, analyze and generate human language. To do so, LMs are trained on large datasets of text gathered from a variety of sources such as books, articles, and Wikipedia. After being trained on this data, LMs are able to make predictions based on recognized patterns in natural language that aid in a number of NLP tasks [37].

Large Language Models (LLMs).

LLMs are an evolution of LMs that are trained on considerably larger datasets and domains. They employ self-attention in Transformers to capture long-range dependencies in text and to enable parallelization. Using in-context learning, models can be adapted to a specific context or prompt, allowing LLMs to create more coherent and human-like outputs and taking major strides toward the advancement of NLP tasks and artificial intelligence [38, 39].

Chat GPT.

Chat GPT is an LLM that uses the transformer architecture to process input data and create an adequate response. We used the 'davinci-003' engine for its ability to handle large prompts and instruction-following tasks. This model was trained on Common Crawl, web texts, books and Wikipedia [40].
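For illustration, a hedged sketch of how questions can be requested from this engine through the legacy (pre-1.0) openai Python Completions API is given below; the exact prompt wording and generation parameters used in the study are assumptions.

```python
# Sketch of prompting the davinci-003 engine for question generation via the legacy
# openai Completions API (openai<1.0). Prompt wording and parameters are assumptions.
import openai

openai.api_key = "YOUR_API_KEY"

def generate_questions(context, n_questions=3):
    prompt = (f"Generate {n_questions} questions based on the following text:\n\n"
              f"{context}\n\nQuestions:")
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0.7,
    )
    return response.choices[0].text.strip()

print(generate_questions("Cancer survivors should schedule regular follow-up visits.", 3))
```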

Valhalla/t5-base-e2e-qg.

This is an LLM pre-trained on the SQuADv1 dataset, which consists of questions created from Wikipedia articles where the answer to each question is a segment of text from the corresponding reading passage. T5 stands for Text-to-Text Transfer Transformer, a model that reframes all NLP tasks into a unified text-to-text format where the input and output are always text strings.
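A minimal usage sketch with the Hugging Face text2text-generation pipeline is shown below; the "generate questions:" prefix and the "<sep>" separator follow the model card and may need adjusting.

```python
# Sketch: end-to-end question generation with valhalla/t5-base-e2e-qg via the
# Hugging Face pipeline. Prefix and separator follow the model card.
from transformers import pipeline

qg = pipeline("text2text-generation", model="valhalla/t5-base-e2e-qg")

context = "Cancer survivorship begins at the time of diagnosis and continues for life."
output = qg("generate questions: " + context, max_length=256)[0]["generated_text"]

# The model emits questions separated by a "<sep>" marker
questions = [q.strip() for q in output.split("<sep>") if q.strip()]
print(questions)
```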

mrm8488/bert2bert-spanish-question-generation.

This is an LLM pre-trained on the SQuAD dataset translated to Spanish. It utilizes a bert2bert model, which means both the encoder and decoder are powered by BERT models.
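A hedged usage sketch is given below, loading the checkpoint named above as an encoder-decoder model; the Spanish input text is illustrative and the generation settings are assumptions.

```python
# Sketch: Spanish question generation with a bert2bert encoder-decoder checkpoint.
# The model identifier is the one named above; generation settings are assumptions.
from transformers import AutoTokenizer, EncoderDecoderModel

model_id = "mrm8488/bert2bert-spanish-question-generation"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = EncoderDecoderModel.from_pretrained(model_id)

context_es = "La supervivencia al cáncer comienza en el momento del diagnóstico."
inputs = tokenizer(context_es, return_tensors="pt", truncation=True)
output_ids = model.generate(inputs.input_ids, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```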

T5 (Small, Base, and Large) and mT5 (Small and Base).

A transformer-based architecture that uses a text-to-text approach: tasks such as translation, question answering, and classification are given to the model as text input, and it is trained to generate the corresponding target text [41]. mT5 is the multilingual variant of T5.

Flan-T5 (Small, Base and Large).

This model is similar to T5 but is instruction-tuned and is therefore capable of performing various zero-shot NLP tasks, as well as few-shot in-context learning tasks.
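As an illustration of this zero-shot behavior, the sketch below prompts a Flan-T5 checkpoint with a plain-language instruction; the checkpoint name and instruction wording are assumptions.

```python
# Sketch: zero-shot question generation with an instruction-tuned Flan-T5 checkpoint.
# Checkpoint name and instruction wording are assumptions.
from transformers import pipeline

flan = pipeline("text2text-generation", model="google/flan-t5-base")

context = "Cancer survivors benefit from regular physical activity and a balanced diet."
prompt = f"Generate 3 questions about the following text: {context}"
print(flan(prompt, max_length=128)[0]["generated_text"])
```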

BART (Base and Large) and mBART (Large).

A denoising autoencoder for pre-training sequence-to-sequence models, trained by corrupting text with an arbitrary noising function and learning to reconstruct the original text. It uses a standard transformer-based neural machine translation architecture [42]. mBART is the multilingual variant of BART.

2.4 Experimentation

A number of LMs were tested in order to compare their capabilities on the task of multilingual question generation. The LMs Chat GPT, valhalla/t5-base-e2e-qg, T5 (Small, Base, and Large), mrm8488/bert2bert-spanish-question-generation, mT5 (Small and Base), Flan-T5 (Small, Base and Large), BART (Base and Large) and mBART (Large) were chosen for our experiments. Each of these models was given a prompt to produce a certain number of questions (3, 5, 7 and 10) using one of the transcribed texts as the context.

The length and combination of texts given to each model depended on the language it was designed to handle. LMs that specialized in either English or Spanish were given each of the 11 texts individually; the texts in the two languages were then combined into a single text to create a multilingual text, and the question generators were prompted to create either 3, 5, 7 or 10 questions based on the text. The multilingual models performed the same task in both languages, plus a multilingual version in which the individual Spanish and English texts corresponding to the same topic were fed through at the same time. Some of the models could not process the combined Spanish and English whole text, so this test was eliminated for them. Each model was evaluated by judging the outputs it produced for coherency, spelling and accuracy.
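The following sketch outlines this experimental grid; the models mapping and the generate callables are placeholders for the model wrappers described above, not code from the study.

```python
# Sketch of the experimental grid: every model is prompted to produce 3, 5, 7 and 10
# questions for each transcribed text. `models` and `generate` are placeholders.
QUESTION_COUNTS = [3, 5, 7, 10]

def run_experiment(models, texts):
    """models: {name: callable(context, n_questions)}, texts: {(language, topic): context}."""
    results = []
    for name, generate in models.items():
        for (language, topic), context in texts.items():
            for n in QUESTION_COUNTS:
                results.append({
                    "model": name,
                    "language": language,
                    "topic": topic,
                    "n_requested": n,
                    "questions": generate(context, n),
                })
    return results
```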

3 Results

The evaluation of the models discussed so far was carried out manually: the meaningfulness, syntax and semantic soundness of the output text (in this case, the generated questions) were all taken into consideration for each of the target languages. Chat GPT using the 'davinci-003' engine performed best at producing 3, 5, 7 and 10 questions based on English, Spanish and multilingual texts of various sizes. Figure 1 shows an example output obtained from Chat GPT for a given context when the model was prompted to generate 3 questions.

Fig. 1. Example output from Chat GPT for the given context when prompted to generate 3 questions

Although the model 'mT5' was capable of generating questions in both English and Spanish, it was not as consistent as Chat GPT. Pre-trained models such as 't5-base-e2e-qg', 'bert2bert-spanish-question-generation', 'T5', 'Flan-T5', 'BART' and 'mBART' worked well in their respective languages but were not able to handle multilingual question generation tasks and therefore failed this phase of testing. The 't5-base-e2e-qg' and 'bert2bert-spanish-question-generation' models also failed to create the correct number of questions when prompted to do so. Tables 2 and 3 summarize each model's performance and sample outputs (generated questions) for the multilingual and monolingual models.

Table 2. Question Generation Performance for Different Multilingual Models
Table 3. Question Generation Performance for Different Monolingual Models

4 Evaluation Metrics

There are no evaluation metrics designated specifically for question generation, and it is also challenging to define a gold standard of proper questions to ask. Useful criteria for evaluating questions include the meaningfulness of a question, how syntactically correct it is, and how semantically sound it is, but these criteria are very difficult to quantify. As a result, most QG systems rely on human evaluation: a set of questions is randomly sampled from the generated questions, and human experts or annotators are asked to rate the quality of the questions on a 5-point Likert scale.

Like any task that requires humans to create references, including human question generation, human evaluation of generated questions is costly and time-consuming. Therefore, some commonly used automatic evaluation metrics for natural language generation (NLG), such as BiLingual Evaluation Understudy (BLEU) [43], Metric for Evaluation of Translation with Explicit ORdering (METEOR) [44], National Institute of Standards and Technology (NIST) [45], and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [46], are also widely used for question generation. Although these metrics remain in frequent use, some studies have shown that they do not correlate well with adequacy, coherence and fluency [47,48,49,50], because they compute similarity between the source and the target sentence (in this case, the generated question) based on overlapping n-grams.
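For concreteness, a small example of this kind of n-gram overlap scoring, using the NLTK implementation of sentence-level BLEU, is shown below; the reference and candidate questions are hypothetical.

```python
# Example of n-gram overlap scoring: sentence-level BLEU between a hypothetical
# reference question and a generated candidate (NLTK implementation).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "when does cancer survivorship begin".split()
candidate = "at what point does cancer survivorship begin".split()

smooth = SmoothingFunction().method1  # avoids zero scores for short sentences
print(sentence_bleu([reference], candidate, smoothing_function=smooth))
```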

To overcome the issues encountered with these popular NLG evaluation metrics, a few new metrics have recently been proposed [51,52,53]. Unlike the existing metrics, these new metrics [52] consider several question-specific factors, such as named entities, content and function words, and the question type, when evaluating the "answerability" of a question given the context [51, 52]. The authors of [53] proposed a dialogue evaluation model called ADEM that, in addition to word overlap statistics, uses a hierarchical RNN encoder to capture semantic similarity.

Some more recent metrics that also evaluate candidate questions given reference questions are BLEURT [54,55,56], BERTScore [57] and MoverScore [58]. BERTScore and MoverScore use Bidirectional Encoder Representations from Transformers (BERT) [37] embeddings for token-level matching instead of n-gram overlap. BLEURT is a regression-based metric that uses supervised learning to train a regression layer that mimics human judgment.
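A minimal sketch of embedding-based scoring with the bert-score package is shown below; the candidate and reference questions are hypothetical.

```python
# Sketch: scoring a candidate question against a reference with the bert-score
# package (embedding-based token matching rather than n-gram overlap).
from bert_score import score

candidates = ["When does cancer survivorship begin?"]
references = ["At what point does cancer survivorship start?"]

P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```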

The focus of this study was question generation; as a result, we used human evaluation to determine good-quality questions for the creation of our question corpora. However, to scale our work to a larger corpus, we will explore these automatic evaluation metrics in future work.

5 Discussion

This study was conducted to create a database of multilingual questions in both English and Spanish and to answer the three research questions (RQ1–RQ3) discussed in the Introduction (Sect. 1).

5.1 RQ1: What Existing Systems or Models can be Used to Automatically Generate Different Sets of Questions Given the Context for English as Well as Spanish?

We found that several existing systems and models can be used to automatically generate questions given the context for both English and Spanish. The models tested for English in this study include 'valhalla/t5-base-e2e-qg', 'T5 (Small, Base, and Large)', 'Flan-T5 (Small, Base and Large)' and 'BART (Base and Large)'. For Spanish question generation, the 'mrm8488/bert2bert-spanish-question-generation' model was implemented, and for mixed data (containing both English and Spanish), 'Chat GPT', 'mT5 (Small and Base)' and 'mBART (Large)' were tested. Chat GPT outperformed all other models at generating different sets of questions (3, 5, 7 and 10 in this study) for English, Spanish and multilingual texts of various sizes, considering each question's meaningfulness, syntax and semantic soundness. Other multilingual models such as 'mT5' showed promising results, but the questions produced were repetitive and lacked meaningfulness compared to those generated by Chat GPT. Models such as T5 (Small, Base, and Large), Flan-T5 (Small, Base and Large), valhalla/t5-base-e2e-qg and BART (Base and Large) also produced coherent questions but were limited to single-language contexts.

5.2 RQ2: Is the Quality of Questions Generated Consistent Across Different Texts and Thresholds? If the Quality is not Consistent, then What Factors Impact or Determine the Quality of Questions Generated?

The length of the transcript and the number of questions our models were prompted to generate correlated with the overall quality of the outputs across all the models tested in this study: longer transcripts and fewer prompted questions produced higher-quality outputs. It should also be noted that each of the models was pre-trained on different corpora and therefore performed better or worse than its counterparts during this experiment.

5.3 RQ3: What are the Optimal Values for These Factors Impacting the Question Generation Quality?

During our experiments we observed that a text/transcript of at least 100 words is sufficient to generate 5–7 questions of reasonable quality. When models were prompted to produce 10 or more questions based on texts containing 100 words or fewer, the meaningfulness, syntax and semantic soundness of the outputs decreased notably.
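The sketch below encodes this observation as a simple heuristic for capping the requested number of questions; the thresholds are the ones reported in this study, not universal constants, and the helper name is illustrative.

```python
# Illustrative heuristic based on the observation above: cap the number of requested
# questions by the length of the source text. Thresholds come from this study only.
def recommended_question_count(text: str, requested: int) -> int:
    n_words = len(text.split())
    if n_words < 100:
        return min(requested, 5)  # short texts: asking for more degrades quality
    return min(requested, 7)      # ~100+ words supported 5-7 reasonable questions
```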

5.4 Evaluation

Even though the focus of this study was question generation rather than question evaluation, we feel that the next step of this study will be to evaluate these generated questions in order to build a corpus of high-quality questions for the healthcare domain. Due to the small size of the corpus used in this study, we relied on human evaluation to determine the quality of the generated questions. However, for scalability and for future studies with larger corpora, we need to thoroughly investigate and improve upon the existing schemes to accurately measure the quality of questions, in particular deep questions.