Introduction

The introduction of innovative artificial intelligence (AI) tools brings exciting possibilities. A noteworthy example of such tools is the Generative Pretrained Transformer (GPT), a large language model (LLM) developed by OpenAI. ChatGPT, a chatbot variant of GPT-3.5, was publicly introduced at the end of November 2022 and garnered a user base of one million within just five days [1]. Consequently, it has been suggested that the release date of ChatGPT be considered a pivotal milestone, marking the division between the pre-ChatGPT and post-ChatGPT eras [2].

GPT operates based on the principles of natural language processing (NLP), an area that has witnessed substantial advancements in recent years, particularly with the emergence of LLMs [3]. These models undergo extensive training with textual data, equipping them with the ability to produce text that closely resembles human writing, provide precise responses to queries, and perform other language-related tasks with a high level of accuracy [4].

AI has become integrated not only into medical education [5, 6] but also into higher education through diverse applications [7, 8]. In the specific context of assessment in medical education, ChatGPT has shown varying levels of performance across national medical exams [9,10,11,12,13]. A recent study showed that the GPT-4 version of ChatGPT answered more than 85% of the questions in the United States Medical Licensing Examination correctly [14]. Despite the presence of studies that focus on providing human-generated questions to ChatGPT, there has been a lack of research focused on posing ChatGPT-generated questions to humans.

Using ChatGPT to generate questions can be classified as a type of automatic item generation (AIG). There are two main groups of methods in AIG [15]: template-based and non-template-based. The template-based method, as demonstrated in the literature [16,17,18], has shown satisfactory levels of validity evidence, even in national examinations [19]. Moreover, template-based methods have demonstrated promising outcomes not only in English but also in Chinese, French, Korean, Spanish, and Turkish [20, 21]. Despite this impressive success, the template-based AIG process continues to depend more on expert effort than on the NLP techniques used in non-template-based AIG.

Although a study highlights the ongoing integration of ChatGPT into the practices of medical school members, including the use of ChatGPT for writing multiple-choice questions [22], none of the existing studies has assessed the quality of assessment content generated by ChatGPT. Only three studies have proposed prompts for generating multiple-choice questions using ChatGPT [23,24,25]. Given that ChatGPT has exhibited academic hallucinations [2] and made inaccurate claims, such as asserting that “the human heart only has two chambers” [26], a thorough evaluation is necessary. Therefore, there is a need for studies that examine the quality of multiple-choice items generated by ChatGPT.

This study aimed to determine the feasibility of generating multiple-choice questions using ChatGPT in terms of item difficulty and discrimination levels.

Methods

Study setting and participants

This psychometric research was carried out at the Gazi University Faculty of Medicine, Ankara, Turkey. The psychometric analyses were conducted as part of an internal evaluation process to inform the related faculty board about the exam. This study was part of a research project on automatic item generation in different languages. The Gazi University Institutional Review Board approved the project (code: 2023–1116). This study constitutes the part of the project that involved ChatGPT-generated questions in English.

During the fourth year of the six-year undergraduate medical program, a clerkship consisting of a series of small-group activities was carried out to help students learn the principles of rational prescribing using the WHO 6-Step Model [27]. These activities focused on cases primarily related to hypertension. Following the training, students took a written examination that consisted of multiple-choice questions. As the language of the program was English, both the training and the examination were carried out in English. Students were required to take the exam as part of their curriculum. A total of 99 fourth-year medical students enrolled in the undergraduate medical program were considered eligible for participation in the study. As our aim was to include all eligible students, we did not conduct a sample size calculation.

Question generation

The multiple-choice questions were created in August–September 2023 using the “Free Research Preview of ChatGPT” (August 3 Version). We opted not to utilize GPT-4, even though it offers enhanced capabilities compared to GPT-3.5 (offered as a free research preview), primarily because GPT-4’s monthly subscription cost could impede its accessibility in developing countries. Table 1 presents the prompt template we utilized. The prompt asks users to fill in two parts: “[PLEASE INSERT A TOPIC]” and “[PLEASE INSERT A DIFFICULTY LEVEL (E.G. EASY, DIFFICULT)].”

Table 1 The prompt template

The prompt’s origins can be traced back to Esh Tatla, a medical student who initially developed it for medical students [28]. Subsequently, it was further refined and incorporated into the academic literature by a medical education researcher [24].

In the process of question generation, we took into account the specific requirements of the examination, which were aligned with local needs. Given that the training primarily focused on essential hypertension cases, our goal was to generate questions covering the subjects listed in Table 2. For each of these topics, we tasked ChatGPT with generating both an easy and a difficult multiple-choice question.

Table 2 The inserted elements to the prompt template
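For illustration, the sketch below shows how the two placeholders could be filled for every topic and difficulty combination. It is a minimal sketch only: the prompt body from Table 1 and the topic list from Table 2 are represented by placeholder strings, and in the study each filled prompt was submitted manually through the ChatGPT web interface rather than through code.

```python
# Illustrative sketch only: PROMPT_TEMPLATE stands in for the actual prompt body
# in Table 1, and TOPICS stands in for the subjects listed in Table 2.
PROMPT_TEMPLATE = (
    "<prompt text from Table 1 containing the placeholders "
    "[PLEASE INSERT A TOPIC] and "
    "[PLEASE INSERT A DIFFICULTY LEVEL (E.G. EASY, DIFFICULT)]>"
)
TOPICS = ["<topic 1 from Table 2>", "<topic 2 from Table 2>"]  # hypothetical entries
DIFFICULTY_LEVELS = ["easy", "difficult"]


def build_prompts(template: str, topics: list[str], levels: list[str]) -> list[str]:
    """Fill both placeholders for every topic x difficulty combination."""
    prompts = []
    for topic in topics:
        for level in levels:
            prompt = template.replace("[PLEASE INSERT A TOPIC]", topic)
            prompt = prompt.replace(
                "[PLEASE INSERT A DIFFICULTY LEVEL (E.G. EASY, DIFFICULT)]", level
            )
            prompts.append(prompt)
    return prompts


# In the study, each resulting prompt was pasted into the ChatGPT web interface;
# no API calls were made.
for prompt in build_prompts(PROMPT_TEMPLATE, TOPICS, DIFFICULTY_LEVELS):
    print(prompt)
```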

Expert panel and test administration

The questions generated by ChatGPT underwent a review process conducted by a panel of experts comprising members of the rational pharmacotherapy board as well as other subject matter experts. Each of these experts had over five years of experience in rational prescribing training and in the development of questions for medical school assessments. They evaluated each question based on two key criteria:

  • Criterion 1: “Is there any problem in terms of scientific/clinical knowledge? Is the question clear? Is there only one correct answer? Is the information provided in the question sufficient to find the correct answer? Is the question high-quality?”

  • Criterion 2: “Is this question suitable for the unique context of rational drug prescribing training carried out in the school?”

Reviewers were tasked with evaluating the scientific acceptability of the questions through their expertise (Criterion 1) and verifying their suitability for integration into the official clerkship exam (Criterion 2). Importantly, it was explicitly emphasized that they were not authorized to make any changes to the questions. All ten questions were considered scientifically sound and clear. Each question had only one correct answer, and the information provided was sufficient to find it. However, eight questions were excluded due to their unsuitability for our medical school context (Criterion 2). This decision was based on various factors, one of which was the inclusion of a correct option related to the Dietary Approaches to Stop Hypertension (DASH) diet. The DASH diet is US-based terminology and was not covered in our training. Two questions (#3 and #10) were left for inclusion in the exam.

We integrated the questions generated by ChatGPT (Table 3) into the test. To address cultural considerations, we replaced specific patient names such as “Mr. Johnson” with generic terms such as “a patient” or “the patient.” The questions were otherwise unchanged. In total, the test comprised 25 single-best-answer multiple-choice questions, combining the ChatGPT-generated items with questions written by human authors. The test was conducted in physical classroom settings, supervised by proctors.

Table 3 The ChatGPT-generated questions that were included in the exam

Statistical analysis

We conducted a psychometric analysis based on Classical Test Theory. We performed item-level analysis to determine two parameters: the item difficulty and item discrimination indices. Item difficulty was calculated by dividing the cumulative score of examinees on an item by the maximum attainable score. Item discrimination was calculated using the point-biserial correlation (computed with the Spearman correlation in SPSS 22.0 for Windows; SPSS Inc., Chicago, IL, USA). This allowed us to determine an individual item’s capacity to differentiate between high-performing and lower-performing students.
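As a minimal sketch of these two indices, the code below computes item difficulty and point-biserial item discrimination from a hypothetical 0/1-scored response matrix. It is not the SPSS procedure used in the study, and the data shown are invented for illustration.

```python
# Minimal sketch of the two Classical Test Theory indices described above,
# using a hypothetical 0/1-scored response matrix (rows = examinees, columns = items).
import numpy as np
from scipy.stats import pointbiserialr

scores = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
])  # invented data for illustration

# Item difficulty: cumulative score on the item divided by the maximum attainable
# score (for dichotomous items, the proportion of correct responses).
difficulty = scores.sum(axis=0) / scores.shape[0]

# Item discrimination: point-biserial correlation between each item score and the
# total test score.
total_scores = scores.sum(axis=1)
discrimination = []
for i in range(scores.shape[1]):
    r, _ = pointbiserialr(scores[:, i], total_scores)
    discrimination.append(r)

print("difficulty:", difficulty)
print("discrimination:", np.round(discrimination, 2))
```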

Although large-scale standardized tests require a point-biserial correlation of no less than 0.30 for an item, values in the mid-to-high 0.20s can be considered acceptable for locally written, classroom-type assessments [29]. Furthermore, we assessed the response distribution for each answer option to identify non-functioning distractors. We adhered to the established criterion that defines functional distractors as those chosen by more than 5% of examinees [29].
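A minimal sketch of this distractor check is shown below, using invented response counts for a single five-option item; the option labels, counts, and answer key are hypothetical.

```python
# Hypothetical response counts for one five-option item (n = 99 examinees).
option_counts = {"A": 2, "B": 61, "C": 24, "D": 9, "E": 3}
key = "B"  # hypothetical correct answer
n_examinees = sum(option_counts.values())

# A distractor is considered functional if chosen by more than 5% of examinees.
functional_distractors = {
    option: (count / n_examinees) > 0.05
    for option, count in option_counts.items()
    if option != key
}
print(functional_distractors)  # {'A': False, 'C': True, 'D': True, 'E': False}
```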

Results

Both items demonstrated point-biserial correlations that exceeded the acceptable threshold of 0.30. Question #3 had three options (A, D, and E) that did not exceed the 5% response level, whereas Question #10 had no non-functional distractors. The specific values for these indices and the response percentages can be found in Table 4. The mean difficulty and discrimination levels of the remaining 23 items were 0.68 and 0.21, respectively.

Table 4 Item difficulty and discrimination values and response percentages

Discussion

In automatic item generation (AIG) for medical education, researchers have favored template-based methods over non-template-based ones because non-template-based methods could not provide feasible multiple-choice questions [21]. In this study, we found that ChatGPT is able to generate multiple-choice questions with acceptable item psychometrics. Our findings showed that the questions can effectively differentiate between students who perform at high and low levels. To the best of our knowledge, this is the first study that reveals the psychometric properties of ChatGPT-generated questions in the English language administered within an authentic medical education context.

The findings point to the beginning of an AI-driven era for AIG, as an alternative to template-based methods. This transformation is readily observable in our new ability to produce appropriate case-based multiple-choice questions with minimal human effort, accomplished simply by entering a prompt. The efficiency achieved through AI would appear remarkable to test developers from a decade ago, who were engaged in the effortful task of manually writing multiple-choice questions.

The increased potential observed in GPT-3 is likely due to its training dataset, which is ten times larger than those of previous models [30]. The data may not perfectly align with the test’s purpose, but this suggests that specialized language models could help generate better multiple-choice questions. However, it is essential to recognize that the quality of multiple-choice questions is closely tied to prompt quality. This emphasizes the importance of prompt engineering skills for medical teachers and test developers in the future. They can take the prompt we used as a starting point because it is customizable to generate various types of single-best-answer multiple-choice questions, in contrast to the prompt developed only for generating NBME-style (National Board of Medical Examiners) questions [23].

AI-based AIG has some drawbacks as well. While our findings have shown ChatGPT’s ability to generate multiple-choice questions with acceptable psychometrics, it is not infallible. It is crucial for test developers to remember that ChatGPT, like any AI model, relies on the data it has been trained on and may sometimes provide inaccurate or outdated content. For instance, certain explanations provided by ChatGPT contained contradictory content regarding the effect of beta-blockers in gout patients, despite the absence of issues with the questions themselves. We did not encounter any problems because we did not use the explanations; however, if explanations are used, for example, for formative purposes, caution and expert oversight remain essential to ensure their correctness and relevance [25]. Hence, while generating questions with ChatGPT is efficient, constructing an entire exam without any revisions can be challenging. The questions still require review and revision by subject matter experts [31]. This difficulty arises because the generated questions may, for example, lack scientific validity or include elements that do not align with the specific context of a medical school. For instance, the term “DASH diet” is unfamiliar within our training, which led the subject matter experts to opt against including that question in the exam.

Another significant drawback is the “black box” nature of AI models. Although this enables us to generate unique questions each time, we cannot know precisely how our input will affect the output. In contrast, template-based AIG offers a level of control and customization that can be valuable when revising and correcting hundreds of questions at once [21]. While generating questions with AI can be efficient, test developers must balance the efficiency offered by AI against the need for control.

There are some limitations in our study. The first is that it is based on a limited set of questions and a small number of participants from a single university. Although the inclusion of more questions would have been preferable, the need for compliance with official regulations constrained our ability to expand the number of questions included in the exam. Future studies with more questions are necessary to determine the applicability of the findings across a wider range of fields. However, it is important to recognize that extending these results to different subjects or medical institutions may present challenges due to the constantly evolving nature of LLMs. Another limitation is that relying solely on the point-biserial correlation as the primary measure of quality may not have encompassed all relevant quality measures; a detailed qualitative analysis of item content would provide valuable additional information.

Conclusion

This study investigated the feasibility of using a ChatGPT prompt to generate clinical multiple-choice questions, because a major challenge faced by medical schools is the labor-intensive task of writing case-based multiple-choice questions for assessing higher-order skills. The findings showed that ChatGPT-generated questions exhibit psychometric properties that meet acceptable standards, which presents a significant opportunity to make test development more efficient. However, further research is essential to corroborate or challenge these findings.