1 Introduction

Questions play a significant role in the teaching-learning process [28]. Preparing questions and assessing their answers manually are time-consuming and laborious tasks [12]. Therefore, automatic question generation and automatic answer grading have caught the attention of educationalists and researchers [17]. Questions are of two types: objective and subjective [14]. For objective questions, examinees are asked to select the correct answer from a set of options or to fill blanks with words. Multiple-choice, true-false, and fill-in-the-blank are the most popular assessment tools [16].

Multiple-choice questions (MCQs) have many advantages, including quick evaluation, uniform scoring, and less testing time [12]. Therefore, many competitive examinations use MCQ papers to assess candidates' merit. MCQs are also effective in active learning environments [49] and outcome-based education (OBE) [43] systems.

An MCQ has three main components [46]: the stem, the answer key, and the distractors. The stem forms the body of an MCQ; it is an interrogative sentence (for a wh-question) or a sentence with a gap (for a fill-in-the-blank question). The answer key is the correct option of an MCQ, and the distractors are the wrong options intended to divert the examinee from selecting the correct answer.

Not all sentences of a text are suited to generating MCQ stems [36]. A sentence with appropriate information can lead to a stem. Therefore, identifying informative sentences in a text plays an important role in MCQ generation. Different techniques have been employed in the literature for selecting informative sentences, such as sentence length [21], the appearance of a particular word [50], part-of-speech patterns [14], summarization [9], and parse structure [36].

Similarly, not all words of an informative sentence can serve as the answer key. Answer key selection is the task of determining which word or phrase will be replaced or removed from the sentence to generate the stem [36]. Term frequency (TF) is the simplest and often an efficient approach to determining the key in a sentence [13]. Sometimes TF-IDF is applied as an alternative to term frequency [25]. Other techniques such as part-of-speech matching [42], parse structure [23], pattern matching [23], and semantic information [1] have also been used in the literature for selecting the key.

After selecting the key from an informative sentence, the next task is to transform the sentence into a question form (stem). Several approaches have been used in the literature to generate the stem, such as selecting an appropriate wh-word [35], dependency structure [1], discourse connectives [3], and semantic information [40].

Distractors are also important in MCQ generation [22]: the quality of the distractors determines the quality of an MCQ. When the distractors fail to challenge the examinees, the correct answer is chosen too easily and the MCQ degrades. Part-of-speech information [2], frequency counts [13], WordNet [27], domain ontologies [29], the distributional hypothesis [1], and semantic analysis [4, 41] have been used in the literature to generate distractors.

Despite substantial research effort, generating MCQs with suitable distractors remains a challenging task, and existing systems are still not effective in real educational applications [46]. We have observed that simple sentences are more useful for generating MCQs than complex and compound sentences. In this paper, we use a pipeline for simple sentence generation [15]. Next, the simple sentences are ranked based on topic words to select informative sentences for creating MCQ stems. The topic words are identified using rapid automatic keyword extraction (RAKE) [48]. We propose a distractor generation technique based on feature-based unsupervised clustering. Finally, string similarity and semantic similarity are explored within clusters to select the final distractors, i.e., those closest to the answer key. The salient contributions of this article are as follows:

  • We have proposed a complete framework for MCQ generation that includes stem generation, answer-key identification, and distractor generation from a text-based learning material for educational assessment.

  • We have proposed a semantic feature-based clustering approach for distractor generation that improves state-of-the-art accuracy.

  • The system can generate multiword distractors, which makes it more attractive.

2 Related work

This section presents the related existing methods found in the literature. Table 1 summarizes the methods used for MCQ generation and their limitations. We also discuss the challenges and how our proposed method bridges the gaps.

Table 1 Comparative analysis of MCQ generation methods in the previous literature
NLP-based methods:

Agarwal and Mannem [2] proposed a system that generates gap-filling questions from a textbook, using syntactic and lexical features of the document without relying on any external resource. Narendra et al. [44] employed a summarizer (MEAD) to select informative sentences for cloze question generation and proposed an approach to select distractors using a domain-specific knowledge base. Bhatia et al. [10] described a pattern-based approach to select sentences for generating MCQs: a set of patterns derived from existing questions is used to select sentences from Wikipedia, and an approach for generating named-entity distractors is also proposed. Afzal and Mitkov [1] proposed a dependency-based unsupervised approach that extracts semantic relations to generate MCQs automatically; questions are generated from these relations, and distractors via a distributional similarity measure. Majumder and Saha [35] presented a parse-tree matching approach for selecting informative sentences; they focused mainly on selecting suitable sentences for generating MCQs and left distractor generation as future work. In another work, Majumder and Saha [36] applied topic modeling and parse-structure similarity for selecting informative sentences, generating distractors with a name dictionary and a set of rules. Alsubait et al. [5] proposed an ontology-based MCQ generation system evaluated by domain experts. Pugh et al. [47] developed a framework for generating high-quality MCQs employing cognitive models, creating test items that assess clinical decision-making. Santhanavijayan et al. [49] proposed an automatic system for generating MCQs on any user-defined domain: summary sentences are transformed into stems, and similarity metrics such as hypernyms and hyponyms are used to generate distractors. Patra and Saha [46] presented a method to generate named-entity distractors for MCQs.

ML-based methods:

Goto et al. [24] developed a system for generating and evaluating multiple-choice cloze questions from text. The system extracts informative sentences based on preference learning and estimates blank parts using a sequence labeling model, a conditional random field (CRF); it cannot generate distractors for blanks requiring more than two words. Du et al. [19] proposed an attention-based framework, and later an encoder-decoder model [20], for generating questions from a given paragraph; the method does not address distractor generation. Yuan et al. [56] proposed a text-to-text learning method for question generation. Subramanian et al. [51] suggested a key-phrase detection framework for question generation, where the key phrases are detected using a neural network. Liu et al. [32] proposed a regression model over orthographic, phonological, and semantic features that automatically generates Chinese MCQs using a mixed similarity strategy, employing machine learning for distractor generation. Sun et al. [52] utilized a sequence-to-sequence model that takes the answer as a cue for the question. Kim et al. [26] similarly suggested an answer separation module for question generation.

Challenges and bridging the gaps:

The primary challenges of MCQ generation from text are selecting suitable sentences for questions, identifying answer phrases, and selecting relevant distractors. Existing NLP-based methods mainly address question generation but suffer from poorly performing sub-tasks such as sentence selection and simplification. We note that suitable distractor generation for MCQs still needs much attention.

Recently, sharp advances in computational hardware and machine learning algorithms have opened up new possibilities in NLP. The main drawback of machine learning approaches is that they demand a large volume of training data, which is difficult to obtain in many cases. The selection of question sentences, answer keys, and distractors also depends on the content of the corpus and requires learning beyond sequence-to-sequence models. Therefore, most research in the last decade has focused on NLP-based methods. This study proposes an automatic system for generating MCQs from text-based learning materials. It also focuses on distractor generation from the same learning materials using a novel distance-metric approach. This research will help teachers and organizations generate MCQs automatically from learning content to assess learners.

3 Proposed method

MCQ stems are generated from simple sentences using topic words. The topic words, or keywords, define the domain or topic of the corpus. In this paper, we first use a technique that identifies the existing simple sentences in the text corpus and generates simple sentences from complex and compound sentences. Next, useful keywords are extracted from the corpus, and the simple sentences are ranked based on these keywords to identify informative sentences. The system also uses a preprocessing step to resolve co-references [37]. The best keyword of an informative sentence is selected as the answer key. Finally, a new feature-based clustering approach is proposed for distractor generation. Figure 1 shows the overall view of the proposed MCQ generation system, and the complete procedure is given in Algorithm 1. The following subsections elaborate the steps of the system.

Fig. 1 The overall view of the proposed MCQ generation system

3.1 Simple sentence identification

A simple sentence is built of one independent clause; a compound or complex sentence, on the other hand, consists of at least two clauses [6]. First, we separate the existing simple sentences from the other sentences by checking for a single independent clause, using the technique described in [14]. Das et al. [15] analyzed the dependency structure [38] of input sentences and proposed a technique to generate simple sentences from complex and compound sentences. A compound sentence consists of two or more independent clauses; their approach generates two or more simple sentences by splitting it. A complex sentence has at least one independent clause and one or more dependent clauses; their approach generates one or more simple sentences by extracting the independent clauses and ignoring the dependent clauses. We inherit this technique for generating simple sentences from complex and compound sentences; a minimal sketch of the splitting idea is shown below.
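As an illustration, the following sketch splits a compound sentence over its dependency parse. It assumes spaCy's English dependency labels and simplifies the method of Das et al. [15], for instance in how a shared subject is copied into later clauses.

```python
# A rough sketch of compound-sentence splitting over a dependency parse.
# Assumes spaCy's English labels; simplifies the approach of Das et al. [15].
import spacy

nlp = spacy.load("en_core_web_sm")

def split_compound(sentence):
    """Return one simple sentence per independent clause of a compound sentence."""
    root = next(nlp(sentence).sents).root
    conj_heads = [t for t in root.children
                  if t.dep_ == "conj" and t.pos_ in ("VERB", "AUX")]
    if not conj_heads:                                 # already a simple sentence
        return [sentence]
    banned = {t for h in conj_heads for t in h.subtree}
    subject = [t for t in root.children if t.dep_ in ("nsubj", "nsubjpass")]
    # First clause: the root's subtree minus the conjoined clauses,
    # coordinators, and punctuation.
    clauses = [" ".join(t.text for t in root.subtree
                        if t not in banned and t.dep_ not in ("cc", "punct"))]
    for head in conj_heads:
        # A conjoined clause that drops its subject borrows it from the first clause.
        has_subject = any(t.dep_ in ("nsubj", "nsubjpass") for t in head.children)
        toks = sorted(([] if has_subject else subject) + list(head.subtree),
                      key=lambda t: t.i)
        clauses.append(" ".join(t.text for t in toks if t.dep_ != "punct"))
    return clauses

print(split_compound("Gandhi led the movement and Nehru wrote many letters."))
# -> ['Gandhi led the movement', 'Nehru wrote many letters']
```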

3.2 Keyword identification

A keyword is a word or a set of words that provides a content clue for a document. Term frequency (TF) is a popular approach for determining the keywords in a document [33]. Sometimes, term frequency-inverse document frequency (TF-IDF) is applied as an alternative for identifying keywords in an individual document [18]. However, TF and TF-IDF are not useful for finding multiword keywords. Statistical association measures such as pointwise mutual information (PMI), the Dice coefficient [14], and Jaccard similarity are often used to determine multiword keywords in a document; the well-known TextRank approach uses Jaccard similarity for extracting keywords [31]. Another popular technique is RAKE (rapid automatic keyword extraction) [55], an unsupervised statistical method that is independent of the corpus domain and language. It can generate longer keywords that often carry more meaning than individual words, and it is computationally more effective than TextRank while obtaining comparable or higher precision and recall. We use RAKE to identify the keywords in our corpus, customizing it to keep only keywords tagged as proper nouns (required POS = ['NNP', 'NNPS']) or numbers (required POS = ['CD']) so as to generate more suitable distractors. The RAKE score of a word is calculated in equation (1), where deg(w) is the degree and freq(w) is the frequency of word w in the corpus.

$$ \rho(w)=\frac{deg(w)}{freq(w)} $$
(1)

The problem is represented by an undirected graph with words as nodes. The degree deg(w) of a word is the degree of its node: two nodes are connected by an undirected edge when they occur within the same candidate keyword. A higher degree means that the word occurs more often and appears in longer candidate keywords; thus, the degree of a word reflects how frequently it co-occurs with other words in candidate keywords. To find multiword keywords, RAKE looks for pairs of words that are adjacent to one another, in the same order, at least twice in the same document; a new candidate keyword is then formed as the combination of those words. The RAKE score ρ(k) of a keyword k is computed by summing the scores ρ(wi) of its member words, as shown in equation (2), where wi is the ith member word of the keyword (1 ≤ i ≤ n), and n is the number of individual words in the keyword. A toy implementation of this scoring is sketched after equation (2).

$$ \rho(k)=\sum\limits^{n}_{i=1}\rho(w_{i}) $$
(2)
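The scoring of equations (1) and (2) can be sketched as follows. This is a toy implementation: the stopword list and the regex tokenizer are stand-ins, and the POS filter (NNP/NNPS/CD) described above is omitted for brevity.

```python
# A toy implementation of RAKE scoring per equations (1)-(2). A real setup
# uses a full stopword list and, in our system, keeps only NNP/NNPS/CD keywords.
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "is", "was", "to"}

def rake_scores(text):
    # Candidate keywords: maximal runs of non-stopword tokens.
    words = re.findall(r"[a-z0-9]+", text.lower())
    candidates, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                candidates.append(current)
            current = []
        else:
            current.append(w)
    if current:
        candidates.append(current)

    freq, deg = defaultdict(int), defaultdict(int)
    for cand in candidates:
        for w in cand:
            freq[w] += 1
            deg[w] += len(cand)          # co-occurrence degree, counting w itself
    word_score = {w: deg[w] / freq[w] for w in freq}            # equation (1)
    return {" ".join(c): sum(word_score[w] for w in c)          # equation (2)
            for c in candidates}

text = "Raja Ram Mohan Roy founded the Brahmo Samaj and the Brahmo Samaj grew."
print(max(rake_scores(text).items(), key=lambda kv: kv[1]))
```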

3.3 Stem generation and answer key identification

A sentence consists of meaningful keywords (each of one or more words) and stopwords. Since stopwords carry no weight information, we exclude them when calculating the sentence weight for informative sentence selection. The sentence weight w(s) is therefore obtained by combining the weights of the individual keywords that belong to the sentence. A keyword with more words is assigned a higher weight: the weight w(ki) of the ith keyword is the number of individual words in it. Finally, the weight w(s) of a sentence s is calculated using equation (3), where p is the number of keywords present in s. Top-ranked sentences are selected as informative sentences for generating MCQ stems; a minimal sketch of this ranking follows equation (3).

$$ w(s)=\sum\limits^{p}_{i=1}w(k_{i}) $$
(3)
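A minimal sketch of the ranking under equation (3) is shown below; it assumes the keyword list comes from the RAKE step, and it matches keywords by simple substring search, which is a simplification.

```python
# A minimal sketch of sentence ranking per equation (3): a sentence's weight is
# the sum of the word counts of the keywords it contains (stopwords contribute
# nothing). Substring matching is a simplifying assumption.
def sentence_weight(sentence, keywords):
    s = sentence.lower()
    return sum(len(k.split()) for k in keywords if k.lower() in s)

def rank_sentences(sentences, keywords, top_fraction=0.5):
    ranked = sorted(sentences, key=lambda s: sentence_weight(s, keywords),
                    reverse=True)
    return ranked[: max(1, int(len(ranked) * top_fraction))]

sentences = ["Raja Ram Mohan Roy founded the Brahmo Samaj.",
             "He was widely respected.",
             "The Brahmo Samaj was founded in Calcutta in 1828."]
keywords = ["Raja Ram Mohan Roy", "Brahmo Samaj", "Calcutta", "1828"]
print(rank_sentences(sentences, keywords))
```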

Among several candidates, the best keyword of an informative sentence is identified as the answer key based on its word length and RAKE score. We have noticed that a multiword keyword usually carries more significant meaning than a single-word keyword and is therefore better suited to act as the answer key. Once the answer key is identified, the Stanford Named Entity Recognizer (NER) is used to determine its category.Footnote 1 The stem is formed by replacing the answer key with a suitable wh-word. The parse-tree structure of the informative sentence is examined to place the wh-word at an appropriate position and to decide how much of the sentence to keep before truncating it [36]; Fig. 2 shows the parse-tree structure of an informative sentence using Stanford Tregex [30]. For example, 'who', 'where', and 'when' are used for the 'person', 'location', and 'time/date' categories, respectively. When the answer key cannot be categorized by the NER, the system instead generates a fill-in-the-blank stem by replacing the answer key with a blank. A minimal sketch of this transformation is given below.
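The sketch below uses the wh-word mapping stated above. The NER labels and the naive string replacement are illustrative assumptions; the actual system positions the wh-word using the parse tree and Tregex, as described in [36].

```python
# A minimal sketch of stem generation: replace the answer key with a wh-word
# chosen by its NER category, or with a blank when the category is unknown.
WH_WORDS = {"PERSON": "who", "LOCATION": "where", "DATE": "when", "TIME": "when"}

def make_stem(sentence, answer_key, ner_category):
    wh = WH_WORDS.get(ner_category)
    if wh is None:                                 # fill-in-the-blank fallback
        return sentence.replace(answer_key, "_____")
    if sentence.startswith(answer_key):            # sentence-initial answer key
        wh = wh.capitalize()
    return sentence.replace(answer_key, wh).rstrip(". ") + "?"

print(make_stem("Raja Ram Mohan Roy founded the Brahmo Samaj.",
                "Raja Ram Mohan Roy", "PERSON"))
# -> 'Who founded the Brahmo Samaj?'
```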

Fig. 2 The parse tree structure of a sentence using Stanford Tregex [30]

3.4 Distractor generation

Several researchers have proposed different approaches for generating distractors, but without adequate success [46]. Generating distractors for a multiword answer key is more complex than for a unigram key [12]. Here, we propose a method that identifies multiword distractors using the K-means clustering algorithm. The number of clusters for the candidate distractors is determined automatically using the elbow method [11], which finds a near-optimal K for K-means. We then select the final three distractors from the cluster that contains the answer key. The overall distractor generation technique is presented in Algorithm 2.


The bag-of-words (BOW) model [58] is the simplest approach for clustering words, while word2vec [34] is a well-known feature learning and language modeling technique in natural language processing (NLP). We therefore used word2vec instead of the BOW model for clustering the keywords. Figure 3 shows a sample clustering result using word2vec features; a sketch of this baseline follows. The result shows that word2vec features alone are not adequate for generating distractors: for example, 'Raja Ram Mohan Roy' and '1931' were grouped into the same cluster.
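The baseline of Fig. 3 can be sketched as follows, assuming gensim and scikit-learn. Training a toy embedding on the corpus itself is an assumption for illustration; multiword keywords are embedded by averaging their member-word vectors.

```python
# A minimal sketch of the word2vec + K-means baseline of Fig. 3.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

corpus = [["raja", "ram", "mohan", "roy", "founded", "brahmo", "samaj"],
          ["the", "brahmo", "samaj", "was", "founded", "in", "calcutta"],
          ["he", "died", "in", "1833", "in", "bristol"]]
model = Word2Vec(corpus, vector_size=50, min_count=1, seed=1)

def keyword_vector(keyword):
    # Multiword keywords: average the vectors of their member words.
    words = [w for w in keyword.lower().split() if w in model.wv]
    return np.mean([model.wv[w] for w in words], axis=0)

keywords = ["raja ram mohan roy", "brahmo samaj", "calcutta", "1833"]
X = np.vstack([keyword_vector(k) for k in keywords])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(keywords, labels)))
```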

Fig. 3 Toy example of K-means clustering (K = 4) on keywords (PCA-reduced data using word2vec features)

Feature selection is one of the biggest challenges in distractor generation. We combine different features into our proposed feature set, which is used to evaluate the distractor generation technique with unsupervised K-means clustering; a more refined feature set yields more accurate clusters of candidate distractors. In the first stage of the experiment, the RAKE score (ρ(k)) and the unigram (ku), bigram (kb), trigram (kt), and quadgram (kq) keyword indicators are taken as features. We then add part-of-speech features (e.g., noun (nn), proper noun (nnp), number (cd)) and named-entity features (person (per), organization (org), location (loc), and date (date)). In total, these twelve keyword features are used for grouping. The feature set f(k) of a keyword is given in equation (4); for example, the feature set of the keyword 'Raja Ram Mohan Roy' is {9, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0}. Figure 4 shows the clustering results using our proposed feature set, and a sketch of the feature construction follows the figure.

$$ f(k)=\{\rho(k), k_{u}, k_{b}, k_{t}, k_{q}, nn, nnp, cd, per, org, loc, date\} $$
(4)
Fig. 4 Toy example of K-means clustering (K = 4) on keywords (PCA-reduced data using our proposed feature set)
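A minimal sketch of the feature vector of equation (4) is given below; the POS and NER tags are assumed to come from the taggers mentioned earlier and are passed in directly here for illustration.

```python
# A minimal sketch of the twelve-dimensional keyword features of equation (4).
def keyword_features(keyword, rake_score, pos_tag, ner_tag):
    n = len(keyword.split())
    ngram = [int(n == i) for i in (1, 2, 3, 4)]               # ku, kb, kt, kq
    pos = [int(pos_tag == t) for t in ("NN", "NNP", "CD")]    # nn, nnp, cd
    ner = [int(ner_tag == t)                                  # per, org, loc, date
           for t in ("PERSON", "ORGANIZATION", "LOCATION", "DATE")]
    return [rake_score] + ngram + pos + ner

# Reproduces the example from the text: a four-word proper-noun person name.
print(keyword_features("Raja Ram Mohan Roy", 9, "NNP", "PERSON"))
# -> [9, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]
```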

It is difficult to determine the appropriate number of clusters for the candidate distractors, as it depends on the corpus. Here, we use the standard elbow method [11], which automatically identifies the number of clusters from the keyword features of the corpus; a sketch of the elbow selection follows. Figure 5 shows the typical elbow-based cutoff used to determine the number of clusters. The distractors are then selected from the cluster that also contains the answer key.
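The sketch below assumes scikit-learn. Picking the point of maximum distance from the chord joining the first and last points of the inertia curve is a common heuristic for locating the elbow; it is one possible reading of [11], not necessarily the exact variant used.

```python
# A minimal sketch of elbow-based selection of K for K-means.
import numpy as np
from sklearn.cluster import KMeans

def elbow_k(X, k_max=30):
    ks = list(range(1, min(k_max, len(X)) + 1))
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks]
    pts = np.column_stack([ks, inertias])
    start, end = pts[0], pts[-1]
    chord = (end - start) / np.linalg.norm(end - start)
    normal = np.array([-chord[1], chord[0]])
    # Perpendicular distance of each (k, inertia) point to the start-end chord
    # (inertias may be rescaled to [0, 1] first for scale robustness).
    dists = np.abs((pts - start) @ normal)
    return int(pts[np.argmax(dists), 0])

# Usage with the keyword feature matrix X from the previous sketch:
# K = elbow_k(X)
```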

Fig. 5 The elbow method to find the optimal number of clusters (K = 21) for candidate distractors

The efficiency of the clustering is measured using the Rand index (RI) [54]. The RI computes the similarity between two clusterings by considering all pairs of samples and counting the pairs that are assigned to the same or different clusters in the predicted and true clusterings. The true clusters of keywords were generated manually based on the relevance of candidate distractors. The RI score is then adjusted for chance, yielding the ARI score of equation (5):

$$ ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]} $$
(5)

The ARI is thus guaranteed to be close to 0.0 for random labeling, independently of the number of clusters and samples, and equal to 1.0 when the two clusterings are identical (up to a permutation of labels). The ARI between the predicted and true clusters is shown in Fig. 6; the score is maximal when the number of clusters K is 21. A minimal sketch of the ARI computation follows.
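The evaluation of equation (5) can be done with scikit-learn as sketched below; the label vectors here are illustrative.

```python
# A minimal sketch of equation (5) via scikit-learn's adjusted Rand index.
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]   # manually built clusters of keywords
pred_labels = [0, 0, 1, 2, 2, 2]   # K-means output
print(adjusted_rand_score(true_labels, pred_labels))
# 1.0 iff the clusterings are identical up to a permutation of labels
```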

Fig. 6 The ARI score between the predicted and true clusters

4 Results

This section evaluates the results of the proposed MCQ generation system. Since the system consists of different modules, we conducted three experiments to test its quality in different ways. The following subsections cover the performance evaluation metrics, the dataset, the experiments, a discussion of the results, and the computational efficiency of the system.

4.1 Performance evaluation metrics

The effectiveness is mainly measured using the following metrics [45]. Precision and recall are defined as follows, where TP is the number of true positives, FP the number of false positives, FN the number of false negatives, and TN the number of true negatives (Fig. 7):

Fig. 7 The confusion matrix

$$ Precision~(PE) = \frac{TP}{TP + FP} $$
(6)
$$ Recall~(RE) = \frac{TP}{TP + FN} $$
(7)

Another popular metric is the F1 score, the harmonic mean of precision and recall. It is applicable when a balance between precision and recall is needed and the class distribution is uneven (i.e., TN + FP is high). The F1 score is defined as follows:

$$ F1~Score (FS) = 2 * \frac{PE * RE}{PE+RE} $$
(8)

The accuracy of the proposed system is evaluated using equation (9); a small sketch computing all four metrics follows it.

$$ Accuracy~(ACC) = \frac{(TP + TN)}{TP + FP + FN + TN} $$
(9)
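For completeness, the four metrics can be computed from the confusion-matrix counts as follows; the counts in the example are illustrative.

```python
# A minimal sketch of equations (6)-(9) from confusion-matrix counts.
def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)                           # equation (6)
    recall = tp / (tp + fn)                              # equation (7)
    f1 = 2 * precision * recall / (precision + recall)   # equation (8)
    accuracy = (tp + tn) / (tp + fp + fn + tn)           # equation (9)
    return precision, recall, f1, accuracy

print(metrics(tp=90, fp=10, fn=5, tn=45))  # illustrative counts
```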

4.2 Dataset

Several in-house datasets have been used in the literature to measure the correctness of MCQ generation systems, and most systems are evaluated by human evaluators [46]. There is no openly available gold-standard dataset to evaluate the proposed system [12]. Therefore, we created a test dataset and checked the performance of the system using human evaluators; five evaluators were employed to check the correctness of the system-generated results. The test corpus was created by extracting the web pages of fourteen Indian leaders and eleven Indian social reformers.Footnote 2 It comprises 25 documents consisting of 1893 sentences.

4.3 Experiments

We conducted three experiments to assess the system quality. Experiment 1 evaluates the accuracy of informative sentence selection, Experiment 2 the accuracy of the system-generated stems with answer keys, and Experiment 3 the accuracy of distractor generation.

Experiment 1

In the first experiment, we evaluated the informative sentences selected for stem generation. The sentence selection task mainly depends on the keywords and the simple sentences. The extracted top-ranked keywords are visualized in Fig. 8. The simple sentences are ranked based on the keywords to select informative sentences for creating suitable stems. After the informative sentences were selected, five experts were asked to mark each sentence as relevant or irrelevant, and ground truths were generated based on the average of their votes. The average accuracy of informative sentence selection is shown in Fig. 9.

Fig. 8 Keyword identification (required POS: ['NNP', 'NNPS']) using the RAKE score. Purple denotes person names, black location names, green dates, and blue the miscellaneous category, as identified by the four-class NER

Fig. 9 The accuracy of selecting the top-ranked informative sentences. The accuracy varies from 99% to 92% as the selection grows from the top 10% to the top 50% of sentences

Experiment 2

In the next experiment, we evaluated the relevance of the generated MCQ stems. Stem generation depends on the accuracy of the informative sentences. Four experts were asked to mark each stem as relevant or irrelevant, and ground truths were generated based on their votes. Figure 10 shows the stem generation results for the top-ranked 50% of informative sentences.

Fig. 10 The accuracy of stem generation with respect to the top 50% of informative sentences

Experiment 3

After selecting the candidate set of distractors, string similarity is checked using the Levenshtein distance [57] and semantic similarity using latent semantic analysis (LSA) [7] to obtain distractors that are close enough to the answer key. The two scores are then aggregated to rank the candidate distractors within the cluster, and the top three are chosen as the final set of distractors for the answer key. For evaluation, the correctness of distractors is measured by the average distractor score. The distractor score of a question qt is denoted by δ(qt): δ(qt) = 1, 2, or 3 for one, two, or three correct distractors of qt, respectively. If the number of questions is z, the total number of distractors is 3z. The accuracy of distractors (α) is given by equation (10). Table 2 compares the accuracy of the proposed cluster-based distractor generation method with different state-of-the-art methods on our dataset. Four system-generated sample MCQs are shown in Table 3. A sketch of the final ranking step is given after equation (10).

$$ \alpha=\frac{{\sum}^{z}_{t=1}\delta(q_{t})}{3z}\times 100 $$
(10)
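The final ranking step can be sketched as follows, assuming scikit-learn for the LSA space; the equal weighting of the two similarities and the tiny corpus are illustrative assumptions.

```python
# A minimal sketch of final distractor selection: normalized Levenshtein
# similarity plus LSA cosine similarity, aggregated by a simple sum.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def levenshtein(a, b):
    # Classic single-row dynamic-programming edit distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def rank_distractors(answer_key, candidates, corpus, top_n=3):
    # LSA space: TF-IDF followed by truncated SVD over the corpus sentences.
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(corpus)
    n_comp = max(1, min(10, min(X.shape) - 1))
    svd = TruncatedSVD(n_components=n_comp, random_state=0).fit(X)
    lsa_vec = lambda text: svd.transform(tfidf.transform([text]))

    key_vec = lsa_vec(answer_key)
    scored = []
    for c in candidates:
        str_sim = 1 - (levenshtein(answer_key.lower(), c.lower())
                       / max(len(answer_key), len(c)))
        sem_sim = cosine_similarity(key_vec, lsa_vec(c))[0, 0]
        scored.append((str_sim + sem_sim, c))   # aggregate the two scores
    return [c for _, c in sorted(scored, reverse=True)[:top_n]]

corpus = ["Raja Ram Mohan Roy founded the Brahmo Samaj.",
          "Swami Vivekananda founded the Ramakrishna Mission.",
          "Ishwar Chandra Vidyasagar reformed Bengali education."]
print(rank_distractors("Raja Ram Mohan Roy",
                       ["Swami Vivekananda", "Ishwar Chandra Vidyasagar",
                        "Ramakrishna Mission", "Calcutta"], corpus))
```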
Table 2 Accuracy of distractor generation
Table 3 Sample MCQs generated automatically by the system. The asterisk (*) indicates the correct answer; the other three options are the distractors

4.4 Discussion of results

The system was tested in various ways in the experiments of Section 4.3. The accuracy of the top-ranked keywords and sentences is adequate. Note that we considered only precision here: for question generation, precision matters more than recall because the exactness of the generated questions is more important than completeness [1]. Figure 9 shows that a higher-ranked sentence is more likely to be selected as informative, since it contains more meaningful keywords. Figure 10 similarly shows a near-linear curve: top-ranked informative sentences generate more suitable stems. We considered the top half of the informative sentences for stem generation to increase precision. Due to the lack of a labeled dataset, we used a clustering approach based on keyword features for generating distractors.

4.5 Computational efficiency of the system

The proposed system is a two-step process: question generation (stem and answer key) and distractor generation. The question generation step depends on the number of sentences in the dataset and has a complexity of O(n), where n is the number of sentences. The distractor generation step is executed by K-means clustering and depends on the number of keywords used for clustering. We measured the execution time of the proposed method on the dataset using an Intel i7 processor (3.2 GHz) with 8 GB of RAM. Figure 11 shows the execution time of the system for varying numbers of sentences; the runtime is almost linear in the number of sentences.

Fig. 11 The execution time for varying numbers of sentences

5 Conclusion

To meet the increasing demand for MCQs in competitive examinations and educational assessment, especially in e-learning and active learning frameworks, the automatic generation of multiple-choice test items from text-based course material has become a popular research area among educationalists and natural language processing researchers. In this paper, we have proposed an approach to generate MCQs with automatically generated distractors. First, we extract the topic words from the corpus. Then, we identify the simple sentences and rank them based on the topic words. The question sentence (stem) is generated by replacing the answer key with an appropriate wh-word or a blank (gap). To generate the option set of the MCQ, distractors are selected in such a way that they are closely related to the answer key. A feature-based clustering technique is employed to select the candidate set of distractors, and the number of clusters is identified automatically using the elbow method in K-means. The Levenshtein distance and latent semantic analysis are combined as string and semantic similarity measures for selecting the final set of distractors. The findings suggest that the proposed approach can produce good-quality MCQs and be useful at various levels of assessment.