1 Introduction

Answer sheet evaluation forms the heart and soul of examination systems around the globe. Examinations are carried out regularly for classes ranging from nursery to higher education, in the form of end-term examinations, internal examinations, and weekly tests. This puts considerable pressure on the evaluator to check answer sheets and allot marks within a given time frame. Since the evaluator is also involved in teaching, this leaves very little time to dedicate fully to answer sheet evaluation. Yet this task is crucial and needs proper focus from the evaluator. It also needs to be impartial: it is common for the prejudices of the teacher to affect the marks of the student. For instance, teachers tend to be slightly more inclined to give good marks to more obedient students, a psychological tendency that cannot be ignored.

In recent years, technology has made its way into classrooms, enabling a more efficient learning environment. From online lectures to e-classrooms, digitization has led to the evolution of a teacher- and student-friendly learning ambience. Automation of answer sheet evaluation has also been attempted in the past, but high levels of accuracy have not been achieved: earlier attempts in this field have either been inaccurate or extensively time-consuming. Through this paper, we aim to amalgamate natural language processing concepts such as context identification and text similarity into question answering, to facilitate the whole process of short-answer script evaluation in an accurate and time-efficient manner. Text similarity is a widely used technique for quantifying the relatedness between two texts [1,2,3,4,5,6,7,8,9]. This concept has been studied by researchers worldwide for applications in various domains [10,11,12,13,14,15], and text similarity approaches are being refined on a regular basis to provide optimal results for these applications [16,17,18,19,20,21,22,23,24,25].

Some of the latest research in the field of short answer evaluation highlights that fuzzy WordNet graphs play a significant role in the analysis. Vii et al. [26] show that a WordNet graph for the ideal answer can be generated and then used to create a set of keywords essential for evaluating the student's answer sheet. Since that work uses WordNet as the sense repository, it establishes context in an elaborate manner; however, it lacks fruitful results because testing is done only on a synthetic dataset. To add more relevance to such works, it becomes essential for us to include in this paper a set of more elaborate results obtained by testing on a larger dataset.

To provide meaningful explanations for sheet analysis, handwriting recognition can also be incorporated into the process, as depicted in [27]. Sijimol and Varghese [27] present a model that learns from previous data based on the handwriting of an individual, using cosine sentence similarity. However, this is not a practical approach, and the testing data used is not sufficient for this analysis.

Van Hoecke [28] proposes an algorithm that utilizes sentence-based summarization techniques for grading students' short answers. This poses a limitation: sentence-based ranking is not always accurate, so any error due to faulty sentence ranking is relayed and carried forward into the grading as well. Moreover, most sentence-ranking algorithms are based on machine translation and similarity scores, which are not very accurate; hence, these types of approaches are of limited practical use. Roy et al. [29] compare and contrast the various existing techniques for short answer grading. This type of study is useful for us as it enables us to briefly outline the shortcomings of the existing state-of-the-art techniques.

This paper proposes a novel method for automatically evaluating the answer sheets of students using a machine learning based approach. The technique adopted is the generation of WordNet graphs. WordNet [5, 23], developed at Princeton University, is a computational lexicon consisting of words and the various relations between them. WordNet can be viewed as a graph in which nodes represent words and edges represent relations between words. WordNet is widely used in the literature for resolving several natural language processing tasks, including word sense disambiguation, machine translation, and information retrieval [20, 21]. WordNet graphs play a significant role in information retrieval, as they help incorporate semantic significance and structural dependencies [22].

In this paper, WordNet is used to find the text similarity between the ideal answer provided by the teacher and the answer provided by the student in the answer sheet. WordNet graphs are constructed to represent the ideal answer and the answer under evaluation, and the similarity between these two graphs is computed based on the appearance of common nodes. The marks for the answer under evaluation are assigned in proportion to the similarity between the two graphs. Results for the proposed method are obtained on a dataset consisting of 400 students' answer sheets, selected so as to incorporate both similarity and diversity in the dataset.

The rest of the paper is organized as follows: Sect. 2 highlights the background study related to text similarity. Section 3 describes the proposed approach. Section 4 explains the results obtained. Section 5 concludes the work and states the relevant future scope.

2 Background Study

The main concept utilized in this paper for answer sheet evaluation is finding the text similarity between the ideal answer and the answer provided by the student. To study the latest trends in the field of text similarity, Web of Science (WoS) was taken as the data source. The following query was used for extracting the research papers pertaining to this field:

$$ \text{TI} = (\text{``Text Similarity''}) $$

The research papers were obtained through the above-mentioned query for the years 1989–2017 [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]. The keywords occurring in these research papers were analyzed to visualize a keyword co-occurrence network, as shown in Fig. 1. These keywords depict the various research topics associated with text similarity.

Fig. 1

Keyword co-occurrence network visualization for research papers in “text similarity”

It can be observed from Fig. 1 that graph theory/techniques, semantic dependencies, and structural dependencies are closely associated with this field. Hence, in this paper, a combination of these is used to propose a novel method for calculating text similarity as applied to answer sheet evaluation.

3 Proposed Approach

This section highlights the proposed approach for evaluating the answer sheets of students in an automated manner. As concluded from the previous section, graph theory and semantic and structural dependencies play a significant role in text similarity calculation. Hence, in this paper, a machine learning oriented, WordNet graph-based method is proposed for answer sheet evaluation. WordNet is an online lexical database that stores the senses of a word according to its various part-of-speech tags, with numerous semantic relationships intertwined to form a huge lexical network. WordNet graphs have been widely used in the literature for resolving lexical issues such as word sense disambiguation [20, 22]. The WordNet graph generated in this paper uses the semantic relations hypernym, hyponym, meronym, and holonym.
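As a minimal illustration (not the paper's implementation), these four relations can be queried through NLTK's WordNet interface; the synset `car.n.01` is used here only as an example:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

car = wn.synset('car.n.01')         # the 'automobile' sense of "car"
print(car.hypernyms())              # more general concepts, e.g. motor_vehicle.n.01
print(car.hyponyms()[:3])           # more specific concepts (first three)
print(car.part_meronyms()[:3])      # parts of a car
print(car.member_holonyms())        # wholes that 'car' belongs to (may be empty)
```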

Siddiqi et al. [24] highlight that several types of short answer evaluation exist, such as those dealing with “True–False” questions, fill-in-the-blanks, sentence completion, “description required”, “justification required”, “example required”, etc. The method proposed in this paper deals with short answer evaluation for questions where a brief description, with a relevant short explanation if needed, is to be provided by the student. Context can be well established for short answer evaluation using WordNet, but for larger queries the context dissolves; for instance, it is difficult to automatically evaluate answers that contain technical words, since not all of them are available in WordNet. Other types of questions may be handled in the future. The method is explained in Table 1.

Table 1 Proposed method for automated answer sheet evaluation using WordNet graph-based text similarity

To illustrate this, let us take the following question and its ideal answer:

  • Question: What is a car?

  • Answer (ideal, as provided by the teacher): Car is a vehicle with four wheels.

This answer text is treated as query Q1. The proposed method is implemented in Python using the Natural Language Toolkit (NLTK) library.

Q1 is tokenized and POS-tagged. The resulting POS tag set is as follows:

Tagged words set for Q1 = [(‘car’, ‘NN’), (‘is’, ‘VBZ’), (‘a’, ‘DT’), (‘vehicle’, ‘NN’), (‘with’, ‘IN’), (‘four’, ‘CD’), (‘wheels’, ‘NNS’)].
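A minimal sketch of this step, assuming the standard NLTK tokenizer and tagger resources are installed (resource names may differ slightly across NLTK versions):

```python
import nltk

# One-time resource downloads (uncomment on first run):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

q1 = "Car is a vehicle with four wheels"
tagged_q1 = nltk.pos_tag(nltk.word_tokenize(q1.lower()))
print(tagged_q1)
# Expected (tagger output may vary slightly by NLTK version):
# [('car', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('vehicle', 'NN'),
#  ('with', 'IN'), ('four', 'CD'), ('wheels', 'NNS')]
```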

For the sake of simplicity, the content words chosen for generating the WordNet graph are (‘car’, ‘NN’), (‘vehicle’, ‘NN’), and (‘wheels’, ‘NNS’). The semantic relations hypernym, hyponym, meronym, and holonym are used for this purpose. The WordNet graph is generated using a depth-first search algorithm [20]; it is shown in Fig. 2 and has 49 nodes.
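The paper does not give the traversal code itself; the sketch below is one possible reconstruction that performs a depth-limited DFS over the four relations and collects the node set of the resulting graph. The `depth` parameter and the specific NLTK relation methods chosen are assumptions:

```python
from nltk.corpus import wordnet as wn

# NLTK method names for the four relations used by the proposed method.
RELATIONS = ('hypernyms', 'hyponyms', 'part_meronyms', 'member_holonyms')

def wordnet_nodes(content_words, depth=2):
    """Depth-limited DFS from every synset of each content word.

    Returns the node set of the WordNet graph as a set of synset names.
    """
    nodes = set()

    def dfs(synset, d):
        if synset.name() in nodes:
            return                      # already visited
        nodes.add(synset.name())
        if d == 0:
            return                      # depth limit reached
        for relation in RELATIONS:
            for neighbour in getattr(synset, relation)():
                dfs(neighbour, d - 1)

    for word, tag in content_words:
        pos = wn.NOUN if tag.startswith('NN') else None
        for synset in wn.synsets(word, pos=pos):
            dfs(synset, depth)
    return nodes

ns = wordnet_nodes([('car', 'NN'), ('vehicle', 'NN'), ('wheels', 'NNS')])
```

The exact node count depends on the traversal depth chosen, so the set obtained this way need not match the 49 nodes of Fig. 2 exactly.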

Fig. 2

WordNet graph for ideal answer

The node set of this graph (NS) is as follows:

  • NS= {“Synset(‘Wheeled_Vehicle.N.01’)”, “Synset(‘Car.N.01’)”, “Synset(‘Wheel.V.02’)”, “Synset(‘Wheel.N.01’)”, “Synset(‘Travel.V.01’)”, “Synset(‘Valve.N.03’)”, “Synset(‘Car_Wheel.N.01’)”, “Synset(‘Handwheel.N.02’)”, “Synset(‘Rack.N.04’)”, “Synset(‘Car.N.04’)”, “Synset(‘Cable_Car.N.01’)”, “Synset(‘Minivan.N.01’)”, “Synset(‘Helm.N.01’)”, “Synset(‘Ride.V.02’)”, “Synset(‘Compartment.N.02’)”, “Synset(‘Van.N.05’)”, “Synset(‘Vehicle.N.01’)”, “Synset(‘Steering_System.N.01’)”, “Synset(‘Steering_Wheel.N.01’)”, “Synset(‘Wheel.V.03’)”, “Synset(‘Wagon_Wheel.N.01’)”, “Synset(‘Lathe.N.01’)”, “(‘Vehicle’, ‘NN’)”, “Synset(‘Bicycle_Wheel.N.01’)”, “Synset(‘Instrumentality.N.03’)”, “Synset(‘Sprocket.N.02’)”, “Synset(‘Wheel.N.04’)”, “Synset(‘Conveyance.N.03’)”, “Synset(‘Cab.N.01’)”, “Synset(‘Bicycle.N.01’)”, “Synset(‘Vehicle.N.03’)”, “Synset(‘Bicycle.V.01’)”, “Synset(‘Medium.N.01’)”, “Synset(‘Passenger_Van.N.01’)”, “Synset(‘Vehicle.N.02’)”, “Synset(‘Motor_Vehicle.N.01’)”, “Synset(‘Roulette_Wheel.N.01’)”, “(‘Wheels’, ‘NNS’)”, “(‘Car’, ‘NN’)”, “Synset(‘Wheel.N.03’)”, “Synset(‘Car.N.03’)”, “Synset(‘Fomite.N.01’)”, “Synset(‘Self-Propelled_Vehicle.N.01’)”, “Synset(‘Wagon.N.01’)”, “Synset(‘Car.N.02’)”, “Synset(‘Handwheel.N.01’)”, “Synset(‘Travel.V.05’)”, “Synset(‘Wheel.V.01’)”, “Synset(‘Truck.N.01’)”}

The answer sheets will be evaluated based on these nodes. There may exist two basic types of answer sheets:

  (a) the answer written by the student matches logically with the ideal answer;

  (b) the answer written by the student does not match the ideal answer and is not relevant to the context either.

  • Case 1: When the student has written an accurate and logical answer according to the context

Let us suppose that the 1st candidate has written the answer as:

  • Q2: Car has wheels and an engine.

Now, in order to evaluate the 1st candidate answer sheet, Q2 is tokenized and tagged as follows:

Tagged words set for Q2 = [(‘car’, ‘NN’), (‘has’, ‘VBZ’), (‘wheels’, ‘NNS’), (‘and’, ‘CC’), (‘an’, ‘DT’), (‘engine’, ‘NN’)]

where NN = Noun, VBZ = Verb (3rd person singular present), DT = Determiner, CC = Coordinating Conjunction, IN = Preposition, CD = Cardinal Digit

For the sake of simplicity, the content words chosen for generating the WordNet graph are (‘car’, ‘NN’), (‘wheels’, ‘NNS’), and (‘engine’, ‘NN’). The WordNet graph is generated as shown in Fig. 3; it has a total of 53 nodes.

Fig. 3

WordNet graph for 1st candidate answer sheet

The node set of this graph (NS1) is as follows:

  • NS1= {“Synset(‘Engine.N.02’)”, “Synset(‘Wheeled_Vehicle.N.01’)”, “Synset(‘Car.N.01’)”, “Synset(‘Wheel.V.02’)”, “Synset(‘Wheel.N.01’)”, “Synset(‘Travel.V.01’)”, “Synset(‘Valve.N.03’)”, “Synset(‘Motor.N.01’)”, “Synset(‘Car_Wheel.N.01’)”, “Synset(‘Handwheel.N.02’)”, “Synset(‘Instrument_Of_Torture.N.01’)”, “(‘Engine’, ‘NN’)”, “Synset(‘Rack.N.04’)”, “Synset(‘Engine.N.04’)”, “Synset(‘Car.N.04’)”, “Synset(‘Cable_Car.N.01’)”, “Synset(‘Minivan.N.01’)”, “Synset(‘Helm.N.01’)”, “Synset(‘Ride.V.02’)”, “Synset(‘Compartment.N.02’)”, “Synset(‘Van.N.05’)”, “Synset(‘Automobile_Engine.N.01’)”, “Synset(‘Steering_System.N.01’)”, “Synset(‘Steering_Wheel.N.01’)”, “Synset(‘Wheel.N.04’)”, “Synset(‘Locomotive.N.01’)”, “Synset(‘Instrument_Of_Punishment.N.01’)”, “Synset(‘Lathe.N.01’)”, “Synset(‘Bicycle.V.01’)”, “Synset(‘Sprocket.N.02’)”, “Synset(‘Machine.N.01’)”, “Synset(‘Instrument.N.01’)”, “Synset(‘Bicycle_Wheel.N.01’)”, “(‘Wheels’, ‘NNS’)”, “Synset(‘Cab.N.01’)”, “Synset(‘Wagon_Wheel.N.01’)”, “Synset(‘Bicycle.N.01’)”, “Synset(‘Engine.N.01’)”, “Synset(‘Passenger_Van.N.01’)”, “Synset(‘Wheel.V.03’)”, “Synset(‘Motor_Vehicle.N.01’)”, “Synset(‘Roulette_Wheel.N.01’)”, “Synset(‘Wheel.N.03’)”, “(‘Car’, ‘NN’)”, “Synset(‘Car.N.03’)”, “Synset(‘Travel.V.05’)”, “Synset(‘Self-Propelled_Vehicle.N.01’)”, “Synset(‘Wagon.N.01’)”, “Synset(‘Device.N.01’)”, “Synset(‘Car.N.02’)”, “Synset(‘Handwheel.N.01’)”, “Synset(‘Wheel.V.01’)”, “Synset(‘Truck.N.01’)”}

Now, the nodes that match between NS1 and NS are found and placed in N:

  • N = {“Synset(‘Wheeled_Vehicle.N.01’)”, “Synset(‘Car.N.01’)”, “Synset(‘Wheel.V.02’)”, “Synset(‘Wheel.N.01’)”, “Synset(‘Travel.V.01’)”, “Synset(‘Valve.N.03’)”, “Synset(‘Car_Wheel.N.01’)”, “Synset(‘Handwheel.N.02’)”, “Synset(‘Rack.N.04’)”, “Synset(‘Car.N.04’)”, “Synset(‘Cable_Car.N.01’)”, “Synset(‘Minivan.N.01’)”, “Synset(‘Helm.N.01’)”, “Synset(‘Ride.V.02’)”, “Synset(‘Compartment.N.02’)”, “Synset(‘Van.N.05’)”, “Synset(‘Steering_System.N.01’)”, “Synset(‘Steering_Wheel.N.01’)”, “Synset(‘Wheel.N.04’)”, “Synset(‘Lathe.N.01’)”, “Synset(‘Bicycle.V.01’)”, “Synset(‘Sprocket.N.02’)”, “Synset(‘Bicycle_Wheel.N.01’)”, “(‘Wheels’, ‘NNS’)”, “Synset(‘Cab.N.01’)”, “Synset(‘Wagon_Wheel.N.01’)”, “Synset(‘Bicycle.N.01’)”, “Synset(‘Passenger_Van.N.01’)”, “Synset(‘Wheel.V.03’)”, “Synset(‘Motor_Vehicle.N.01’)”, “Synset(‘Roulette_Wheel.N.01’)”, “Synset(‘Wheel.N.03’)”, “(‘Car’, ‘NN’)”, “Synset(‘Car.N.03’)”, “Synset(‘Self-Propelled_Vehicle.N.01’)”, “Synset(‘Wagon.N.01’)”, “Synset(‘Car.N.02’)”, “Synset(‘Handwheel.N.01’)”, “Synset(‘Wheel.V.01’)”, “Synset(‘Truck.N.01’)”}

It can be observed that N consists of 40 nodes (|N| = 40), which means that out of the 49 nodes in the ideal answer graph, 40 match with the 1st candidate answer graph. The answer is therefore highly relevant to the given context, and for a 10-mark question it can be marked as (40 × 10/49) ≈ 8.1.
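Given the two node sets, the scoring step reduces to a set intersection. A minimal sketch (the function name and stand-in sets are illustrative, not the paper's code):

```python
def score_answer(ideal_nodes, candidate_nodes, max_marks=10):
    """Marks proportional to the candidate's overlap with the ideal-answer graph."""
    common = ideal_nodes & candidate_nodes
    return len(common) * max_marks / len(ideal_nodes)

# Stand-in sets reproducing the worked example: |NS| = 49, |N| = 40.
print(score_answer(set(range(49)), set(range(40))))  # 40 * 10 / 49 ≈ 8.16
```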

  • Case 2: When the answer written by the student does not match with the ideal answer and is not relevant to the context either.

Now suppose the 2nd candidate has written the answer as:

  • Q3: Car is used for transportation.

Now, in order to evaluate the 2nd candidate answer sheet, Q3 is tokenized and tagged as follows:

Tagged words set for Q3 = [(‘car’, ‘NN’), (‘is’, ‘VBZ’), (‘used’, ‘VBN’), (‘for’, ‘IN’), (‘transportation’, ‘NN’)]

where NN = Noun, VBZ = Verb (3rd person singular present), IN = Preposition, VBN = Verb (past participle)

For the sake of simplicity, the content words chosen for generating the WordNet graph are (‘car’, ‘NN’) and (‘transportation’, ‘NN’). The WordNet graph is generated as shown in Fig. 4; it has a total of 27 nodes.

Fig. 4

WordNet graph for 2nd candidate answer sheet

The node set of this graph (NS2) is as follows:

  • NS2= {“Synset(‘Be.V.02’)”, “Synset(‘Exist.V.01’)”, “Synset(‘Equal.V.01’)”, “Synset(‘Practice.V.04’)”, “Synset(‘Exploit.V.01’)”, “Synset(‘Be.V.10’)”, “Synset(‘Secondhand.S.02’)”, “Synset(‘Used.A.01’)”, “(‘Is’, ‘VBZ’)”, “Synset(‘Constitute.V.01’)”, “Synset(‘Be.V.12’)”, “Synset(‘Be.V.11’)”, “Synset(‘Exploited.S.02’)”, “Synset(‘Use.V.01’)”, “Synset(‘Use.V.02’)”, “Synset(‘Stay.V.01’)”, “Synset(‘Embody.V.02’)”, “Synset(‘Be.V.03’)”, “Synset(‘Use.V.06’)”, “Synset(‘Cost.V.01’)”, “Synset(‘Take.V.02’)”, “Synset(‘Be.V.05’)”, “Synset(‘Use.V.03’)”, “Synset(‘Use.V.04’)”, “Synset(‘Be.V.01’)”, “Synset(‘Be.V.08’)”, “(‘Used’, ‘VBN’)”}

Now, the nodes that match between NS2 and NS are found and placed in N:

  • N = Φ // null set

It can be observed that in this case N contains no nodes, i.e. |N| = 0: out of the 49 nodes in the ideal answer graph, none matches the 2nd candidate answer graph. The answer is therefore not relevant to the given context, and hence it would be marked zero. The results for the illustrative example are summarized in Table 2.

Table 2 Results for the considered example

4 Results and Evaluation

To test the effectiveness of this approach, a dataset of answer sheets from 400 students was collected. The answer sheets belong to the subject of social studies; it was observed through experimentation that the proposed system does not apply well to technical subjects such as computer science engineering, because WordNet does not contain all the technical words and definitions. For the result evaluation, these 400 answer sheets were checked beforehand by the teachers. The sheets were scanned, and their text was converted into a machine-readable format using Optical Character Recognition (OCR). The answers in these sheets were then re-evaluated using the proposed method. The marks obtained by the proposed method and the actual teacher-assigned marks were compared to calculate the Root Mean Square Error (RMSE) using Eq. 1.

$$ RMSE = \sqrt{\frac{\sum_{i=1}^{n} \left(X_{obs,i} - X_{model,i}\right)^{2}}{n}} $$
(1)

where Xobs,i = marks of the answer sheet as evaluated by the teacher, Xmodel,i = marks of the answer sheet as calculated by the proposed method, and n = number of observations = 400.
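Equation 1 translates directly into a few lines of Python. A minimal sketch with illustrative marks (not the paper's data):

```python
import math

def rmse(observed, modelled):
    """Root mean square error between teacher marks and predicted marks."""
    n = len(observed)
    return math.sqrt(sum((o - m) ** 2 for o, m in zip(observed, modelled)) / n)

# Illustrative values only; the paper's dataset has n = 400.
print(rmse([8.0, 6.5, 9.0], [8.1, 6.0, 9.2]))  # ≈ 0.316
```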

Table 3 summarizes the performance of the proposed method compared with the state-of-the-art when applied to the considered dataset. Better results are obtained owing to the novelty of the proposed algorithm, which takes into consideration the degree of semantic relatedness of the candidate answer to the ideal answer provided by the teacher/evaluator. This would in turn help in impartial evaluation of the answer sheets.

Table 3 Standard deviation of accuracy and time for the proposed method vs. state-of-the-art methods when tested on the synthetic dataset

Hence, it can be concluded that the proposed method yields promising results. This can be attributed to the fact that the state-of-the-art does not take semantic relationships and lexical expansion into consideration, whereas the proposed method does. It should also be highlighted that IndusMarker [24] generates its word cloud in an automated manner but requires the evaluator to analyze it manually; the proposed system, on the other hand, generates the WordNet graphs and assigns the scores automatically. This in turn helps reduce the time of evaluation, which is another significant aspect of answer sheet checking. To further increase accuracy, more measures of semantic relatedness need to be incorporated.
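One candidate for such an additional measure, offered purely as an illustration of this future direction and not as part of the proposed method, is the Wu-Palmer similarity already available in NLTK:

```python
from nltk.corpus import wordnet as wn

# Wu-Palmer similarity scores two synsets by the depth of their most
# specific common ancestor; values lie in (0, 1], higher = more related.
car, vehicle = wn.synset('car.n.01'), wn.synset('vehicle.n.01')
print(car.wup_similarity(vehicle))
```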

5 Conclusion and Future Scope

This paper proposes a novel concept for answer sheet evaluation using text similarity applied to WordNet graphs. The answer sheets are evaluated by identifying the common nodes between the node set of the ideal answer WordNet graph and that of the candidate answer WordNet graph. This kind of evaluation combines the significant concepts related to text similarity, namely semantic and structural dependencies. The root mean square error for the proposed approach was found to be 0.319 when tested on a dataset consisting of 400 students' answer sheets. Unlike the state-of-the-art, the proposed method generates the WordNet graphs and assigns the scores automatically, which in turn helps reduce the time of evaluation. This shows that the proposed approach to answer sheet evaluation yields promising results in terms of both accuracy and time of evaluation. This work is suitable in scenarios where the student spells the concerned words correctly, since no WordNet graph can be generated for erroneous non-words. In the future, this work might be extended to incorporate measures that resolve this issue; although marks are deducted for incorrect spellings in manual evaluation too, some partial credit can still be assigned there.