Introduction

Recent decades have shown that multiple choice questions (MCQs) are among the best tools for assessing the higher levels of the cognitive domain. For most comprehensive, licensing, and screening tests, MCQs are used as a versatile tool to gauge the competencies of a medical student. If constructed appropriately, they can assess higher cognitive processes of Bloom’s taxonomy such as interpretation, synthesis, and knowledge application rather than merely testing recall of facts [1]. In the dynamic field of medical education, framing MCQs is considerably more cumbersome and time-consuming than framing descriptive questions [2].

The most commonly used type of MCQ is the single best response type, i.e. the type A MCQ with four options [3]. The major problems with MCQs are the difficulty of constructing plausible distractors, especially when assessing higher cognitive skills; ambiguity when more than one answer appears correct; scores being influenced by the reading ability of the student; the probability of guessing the correct answer; and the inability of an MCQ to differentiate between high and low performers. Other problems are the difficulty of understanding why a student chose an incorrect answer and the tendency of students to overinterpret an MCQ (item). These problems are well noted in the MCQs framed in most medical undergraduate subjects, including Obstetrics and Gynaecology.

Although MCQs are not a preferred tool for the psychomotor and affective domains, properly constructed MCQs can overcome the flaws mentioned above [4]. Item analysis determines how each MCQ (item) functions in terms of its level of difficulty and its ability to separate high and low performers [4]. This helps in meeting all learning outcomes, providing highly structured, well-designed tasks, and maintaining uniform standards; it also safeguards validity, reliability, and educational impact.

MCQ item analysis consists of the difficulty index (DIF I; the percentage of students who answered the item correctly), the discrimination index (DI; how well the item distinguishes high achievers from non-achievers), distractor effectiveness (DE; whether the item's distractors are well constructed), and internal consistency reliability (how well the items correlate with one another). Items are the questions, statements, or scenarios used as the assessment instrument. Each item is evaluated against these indices because a flawed item can distort the assessment and cause it to fail [5]. Item analysis is a relatively simple and valuable procedure that provides a method for analysing student responses, interpreting the knowledge achieved by the students, and obtaining information about the quality of the test items. In this study, we performed item analysis of single best response MCQs, as they are regarded as an efficient tool for assessing students' level of academic learning.

The index study would provide a platform to change the way MCQs are selected for formative assessment as part of undergraduate curriculum implementation. It would also help in the preparation of a standard question bank in Obstetrics and Gynaecology. The objectives of our study were to perform post-validation item analysis of MCQs constructed by medical faculty for the formative assessment of final-year medical students and to explore the association of the difficulty index (p value) and the discrimination index (DI) with distractor efficiency (DE). We also assessed student and faculty feedback on a 5-point Likert scale.

Methodology

This was a prospective, analytical study carried out in the Department of Obstetrics and Gynaecology, Ananta Medical College & Research Centre, Rajsamand, Rajasthan, from January to December 2021. The study involved the first 50 final-year M.B.B.S. students who gave their consent to take the MCQ test at the semester's end, as well as all the faculty members of the department.

Item Analysis [6]

The results of the students’ performance in the formative assessment were used to calculate the facility value (FV), discrimination index (DI), and distractor efficiency (DE).

Facility Value (FV) or Difficulty Index or Facility Index:

$$ \text{FV} = \frac{\text{HAG} + \text{LAG}}{N} \times 100\% $$

where HAG is the number of correct responses in the high-achiever group (first 30% of scorers), LAG is the number of correct responses in the low-achiever group (last 30% of scorers), the groups being classified according to their performance in the MCQ test, and N is the total number of students in the two groups. FV ranges from 0 to 100; FV ≥ 85% indicates an easy question, 51–84% a moderately difficult question, and ≤ 50% a hard question.
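As a purely hypothetical illustration (not data from this study): if 12 of 15 high achievers and 6 of 15 low achievers answer an item correctly, then

$$ \text{FV} = \frac{12 + 6}{30} \times 100\% = 60\%, $$

which places the item in the moderate difficulty range.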

Discrimination Index (DI):

$$ \text{DI} = \frac{2 \times (\text{HAG} - \text{LAG})}{N} $$

Its maximum value is 1.0. A value ≥ 0.35 is considered good, < 0.2 is unacceptable, and values in between are intermediate, depending on the type and intention of the test.
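For the same hypothetical item as above,

$$ \text{DI} = \frac{2 \times (12 - 6)}{30} = 0.4, $$

which exceeds 0.35 and would therefore be considered a good discriminator.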

Distractor Efficiency

The incorrect options in an MCQ are the distractors. A poor distractor (NFD, non-functional distractor) is one selected by fewer than 5% of the students, while a good one (functional distractor, FD) is selected by 5% or more of the students, suggesting that it is plausible and not a dummy [7].
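As a hypothetical illustration: in a four-option item, if each of the three distractors is chosen by at least 5% of the students, all three are functional and the item's distractor efficiency is 100%; if none of them reaches the 5% mark, all three are NFDs and the distractor efficiency is 0%.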

Validation

After the item analysis presentation, feedback was taken from the involved faculty members. All the MCQs submitted by the faculty were peer reviewed before the test was given to the undergraduate students. All the students were given a 5-point Likert-scale questionnaire as feedback on the formative assessment.

The following phases were observed while conducting the study:

Phase 1: After due permission was obtained from the Principal and Controller and ethical approval from the IEC, the MEU department was informed of the planned seminar and the group activity that would follow it. A written circular was dispatched through the HOD to all faculty members of the Department of Obstetrics and Gynaecology.

The sensitization seminar was conducted in two sessions: the first covered the theory of designing an MCQ along with item analysis, and the second consisted of a group activity on setting MCQs and applying item analysis, with examples.

Phase 2: A total of 25 type A MCQs on topics already covered in previous classes (5 topics, with 5 questions from each) were collected. The MCQs were pooled from junior faculty members (Senior Residents and Assistant Professors) and peer reviewed by Professors and Associate Professors. The MCQs were a combination of recall, image-based, and case-based questions. Every type A MCQ consisted of a stem and four options. All 50 final-year M.B.B.S. students had to select the best answer out of the four choices. Each correct answer was awarded 1 mark, and there was no negative marking in this test. The duration of the assessment was 60 min. The students' performance was used to determine the level of difficulty and the power of discrimination of each item using Microsoft Office Excel. Based on the marks obtained, students were divided into three groups: high achievers (top 33%), mid achievers (middle 33%), and low achievers (bottom 33%). After the assessment, feedback was obtained from students and faculty in the form of a 5-point Likert-scale questionnaire.
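For illustration, a minimal Python sketch of this per-item calculation is given below; the study itself carried out the computation in Microsoft Office Excel, and the 0/1 response matrix in the sketch is randomly generated dummy data, not the study's data.

```python
# Minimal, illustrative sketch only; the study performed this calculation in Excel.
import numpy as np

rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(50, 25))   # responses[s, i] = 1 if student s answered item i correctly (dummy data)

totals = responses.sum(axis=1)                  # each student's total score
order = np.argsort(totals)                      # students ranked from lowest to highest scorer
n_group = len(totals) // 3                      # roughly a third of the students in each tail group
lag = responses[order[:n_group]]                # low-achiever group (bottom third)
hag = responses[order[-n_group:]]               # high-achiever group (top third)

n = 2 * n_group                                 # total students in the two groups
fv = (hag.sum(axis=0) + lag.sum(axis=0)) / n * 100   # facility value per item, in %
di = 2 * (hag.sum(axis=0) - lag.sum(axis=0)) / n     # discrimination index per item

for item, (f, d) in enumerate(zip(fv, di), start=1):
    print(f"Item {item:2d}: FV = {f:5.1f}%  DI = {d:+.2f}")
```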

Phase 3: Applying item analysis for post-validation of the MCQ question paper.

Phase 4: To prepare a question bank and resource material after applying item analysis.

Statistical Analysis

The indices were calculated using the formulae given in the Methods. All values are expressed as the mean ± SD over the total number of items. Correlations were considered significant at the 0.01 level. The analysis was performed using IBM SPSS 23.0.
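As a rough illustration of this step (the actual analysis was done in IBM SPSS 23.0), the summary statistics and pairwise Pearson correlations could be computed as follows; the per-item values in the sketch are hypothetical placeholders, not the study's data.

```python
# Illustrative sketch only; the study's statistics were computed with IBM SPSS 23.0.
import numpy as np
from scipy.stats import pearsonr

fv = np.array([46.7, 60.0, 33.3, 73.3, 26.7, 53.3, 40.0, 66.7])      # facility values (%), placeholders
di = np.array([0.27, 0.40, 0.13, 0.33, 0.20, 0.47, 0.27, 0.33])      # discrimination indices, placeholders
de = np.array([100.0, 66.7, 100.0, 33.3, 100.0, 66.7, 100.0, 66.7])  # distractor efficiencies (%), placeholders

# Mean ± SD of each index over the items
print(f"FV: {fv.mean():.1f} ± {fv.std(ddof=1):.1f} %")
print(f"DI: {di.mean():.2f} ± {di.std(ddof=1):.2f}")
print(f"DE: {de.mean():.1f} ± {de.std(ddof=1):.1f} %")

# Pairwise Pearson correlations, significance taken at the 0.01 level
for (x, xn), (y, yn) in [((fv, "FV"), (di, "DI")),
                         ((fv, "FV"), (de, "DE")),
                         ((di, "DI"), (de, "DE"))]:
    r, p = pearsonr(x, y)
    verdict = "significant" if p < 0.01 else "not significant"
    print(f"{xn} vs {yn}: r = {r:+.2f}, p = {p:.3f} ({verdict} at the 0.01 level)")
```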

Results

The OBGYN MCQ test, consisting of 25 single best response MCQs, was taken by 50 final-year M.B.B.S. students. Their mean score was 11.58 ± 4.18 (maximum marks: 25). The mean scores in the two groups, LAG and HAG, were 8.44 ± 2.22 and 14.72 ± 3.21, respectively, and the difference was statistically significant (p = 0.001) (Table 1). The highest score was 22, and the lowest was 6. The respective values of FV, DI, and DE for all 25 MCQs are given in Fig. 1. The mean FV was 46.3% ± 19.4%, indicating that the test paper was moderately difficult. No question was easy (FV > 85%), 36% of the questions were moderately difficult (FV between 51 and 84%), and 64% were hard (FV < 50%) for the students (Fig. 2).

Table 1 Distribution of mean score of HAG, LAG, FV, DI & DE
Fig. 1

Distribution of scores of all the students according to the MCQs

Fig. 2

Facility value and distribution

The mean DI of the test was 0.3 ± 0.1, which is within the acceptable range of the discrimination index. Of the 25 MCQs (items), two items (8%) had DI < 0.2, which is unacceptable. The remaining 23 items (92%) could discriminate between HAG and LAG. Nine (36%) MCQs had excellent DI, as shown in Fig. 3. Though items 4, 10, 13, 17, and 18 were too easy, they had acceptable DI to differentiate between HAG and LAG (Fig. 1).

Fig. 3

Distribution of discrimination index [DI] in the study

Out of 25 items, there were 75 distractors. The mean DE was 82 ± 19.8%, which is quite good. Forty-eight per cent of the items had functional distractors (Fig. 4). Only 20% of items had two NFDs (DE 50%).

Fig. 4

Distribution of NFDs [non-functional distractors] with the distractor efficiency [DE]

Feedback was taken from students and faculty on the item analysis and the quality of the MCQs. The questions covered whether they were confident that item analysis would be useful, how interesting the whole exercise was, how much effort was involved in making MCQs, whether doing item analysis is important, whether the whole exercise was satisfying, and how frequently faculty should carry out such an activity. The faculty rated the effort required to construct valid MCQs at 4.7/5, and they found the whole exercise comparatively less interesting (3.1/5) (Fig. 5).

Fig. 5

Feedback on 5-point Likert scale

Discussion

Framing MCQs has always been challenging, as multiple parameters must be kept in mind. The majority of high-stakes assessments follow this format. Many studies have been conducted on the item analysis of MCQs so that they become more valid and reliable and have a measurable educational impact.

Our question paper had a mean FV of 46%, which means it was relatively difficult. In our study, 23 (92%) MCQs had acceptable to excellent discriminating power, and 2 items had unacceptable DI; those 2 items need to be revised, while the rest were good at discriminating between HAG and LAG students. The DI values of the present study are comparable with item analysis studies by Date et al. [8] and others [9], which reported similar findings, with 78% of items having acceptable to excellent discriminating power (DI > 0.20) and 24% having poor discriminating power (DI ≤ 0.20). Sometimes DI can be negative, i.e. low achievers answer a particular item correctly more often than high achievers, as reported in some studies [10, 11]. The reasons for a negative DI can be a wrong key, ambiguous framing of the question, or generalised poor preparation of students. Items with negative DI decrease the validity of the test and should be removed from the collection of questions. Items that are either too easy or too difficult have poor DI; in item analysis, FV and DI should therefore be interpreted together.

Distractor analysis tells us whether the distractors used in the items are functional or non-functional. The mean DE in our study was 82%, and around half of the items had functional distractors. More NFDs make an item easier and decrease its DE. In a similar study by Garg et al. [12] among medical students in Delhi, the mean DI was 0.3 ± 0.17 and the mean DE was 63.4 ± 33.3%. The DI is similar to that of the index study, but the DE is much lower, indicating the presence of better distractors in the items of our study.

Pande et al. [13], Shete et al. [14], and Karelia et al. [15] showed that the difficulty index correlated positively with the discrimination index, although the correlation was not statistically significant; these findings mirror those of our study. In contrast, Sim and Rasiah [16] and Mitra et al. [17] reported a poor correlation between the difficulty index and the discrimination index. Similar to us, Khilnani et al. [18] also observed a positive correlation for the DI–DE pair and the FV–DI pair, and, like us, a negative correlation between FV and DE.

Questions that are too easy or too difficult are less discriminating. Hence, these questions need to be reconstructed to a moderate level of difficulty, either by changing the stem or by supplying more plausible distractors that do not test the interpretative or language skills of students, as also inferred by Rao et al. in their study [19]. An MCQ properly developed to suit a particular group of students will have moderate difficulty and high discrimination, as concluded by Izah et al. in their article [20]. Thus, the difficulty and discrimination indices serve as indicators of the functional quality of each item [21].

Carneson et al. found in their study that, if appropriately constructed, MCQs can assess higher functions of Bloom's taxonomy such as interpretation, synthesis, and application of knowledge [1]. Case et al. expressed the item index as a percentage ranging from 0 to 100: the higher the percentage, the easier the item, with the recommended range being 30–70% [2]. In their textbook chapter ‘Item analysis’, Singh et al. define it as a process of assembling, summarising, and using information from students’ responses to assess the quality of the given test [6]. Kaur et al. also support the use of this technique to modify or remove a faulty item from subsequent tests; in this way, a valid, reliable question bank for any speciality can be prepared [22].

Strengths of the study: This is the first study of its kind on item analysis in a major clinical subject, Obstetrics and Gynaecology. Most of the items were of acceptable difficulty and had acceptable to excellent discrimination scores. Around 50% of the items had functional distractors. We also took feedback from students and faculty on a 5-point Likert scale.

Limitations: Sixty-four per cent of the items framed were hard as per the facility value, whereas a well-constructed MCQ paper should have the maximum number of questions in the moderate difficulty range. The index study involved a relatively small number of items and students, and it included only one subject out of the various medical subjects. Increasing the number of items can improve the reliability of the study design. The point biserial correlation (PBS) identifies items that are odd ones out, i.e. items that do not test the same construct as the rest of the test; this parameter could be included in further studies, as it increases both the reliability and the validity of the test. Similarly, reliability coefficients and standard errors of measurement can be used to make the items more reliable. The refined items can thus form part of a unique question bank of type A MCQs for both formative and summative assessment purposes.
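To make the suggested PBS extension concrete, a minimal sketch is given below. This analysis was not performed in the present study: the correctness matrix is randomly generated dummy data, and the 0.2 screening cut-off is an illustrative assumption, not a threshold drawn from the study.

```python
# Illustrative sketch of the point biserial correlation (PBS) suggested above; not performed in this study.
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(1)
responses = rng.integers(0, 2, size=(50, 25))   # responses[s, i] = 1 if student s answered item i correctly (dummy data)
totals = responses.sum(axis=1)                  # each student's total score

for item in range(responses.shape[1]):
    rest_score = totals - responses[:, item]    # score on the remaining items (item under test excluded)
    r_pb, p = pointbiserialr(responses[:, item], rest_score)
    if r_pb < 0.2:                              # flag items only weakly related to the rest of the test (assumed cut-off)
        print(f"Item {item + 1}: point biserial r = {r_pb:.2f} (p = {p:.3f}) — review this item")
```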

Conclusion

There is a dire need to train medical faculty in constructing MCQs for formative and summative assessment, as MCQs constitute the majority of professional examination papers. Very easy and very difficult MCQs have low discrimination scores.

Item analysis should be an integral and regular activity for medical faculty in order to build subject-specific question banks. Items having average difficulty and high discrimination with functioning distractors should be incorporated into tests to improve the validity of the tests as well as the effectiveness of the questions.