
1 Introduction

Dialogue-based tutoring (DBT) platforms such as AutoTutor [6], Rimac [1], DeepTutor [24] and the Watson Tutor [28] have shown great promise in meeting individual students' needs. In such systems, the tutoring platform interacts with the student by asking questions and provides individualized feedback based on the student's answers. To provide appropriate feedback and rectify student mistakes, accurately understanding student answers is crucial. However, devising a generic short answer grading system that performs well across different questions and domains of study is challenging due to variations in data distribution (differences in the language used, the length and depth of answers, and the use of non-sentential answers, among other issues).

Various Deep Learning (DL) based techniques have been explored for short answer grading [2, 11, 12, 17, 25]. However, the limited availability of labeled data (reference and student answer pairs) often prohibits meaningful training; furthermore, due to the domain discrepancy between public corpora and short answer grading corpora, augmenting training data with the former is not efficient. Lately, transfer learning has largely supplanted these older DL techniques and has had a substantial impact on the state of Natural Language Processing (NLP) [16]. The main concept behind transfer learning is to apply the knowledge from one or more source tasks to a target task [18]. Broadly, a target task can use knowledge from the labeled data of other tasks or from unlabeled data, the latter called self-taught learning [21]. In NLP, word embedding is one of the most influential transfer models due to its capability of capturing the semantic context of a word by producing vector representations of words from large unlabeled corpora such as Wikipedia and news articles [13].

Moving toward more robust transfer learning models, Peters et al. introduced contextualized word representations (called Embeddings from Language Models, or ELMo) [19]. ELMo captures contextual information in word representations by combining the hidden states of multiple bidirectional LSTMs with the initial embeddings. In 2018, several novel fine-tuning language models such as Universal Language Model Fine-tuning (ULMFiT) [9] and OpenAI's Generative Pre-Training (GPT) [20] were proposed, followed by a robust transfer language model called Bidirectional Encoder Representations from Transformers, or BERT [5]. OpenAI's GPT and BERT adopt the Transformer, a novel and efficient language model architecture based on a self-attention mechanism [27], to learn text representations. However, while OpenAI's GPT uses a unidirectional attention approach (the decoder in the Transformer), BERT uses a bidirectional one (the encoder in the Transformer) to better understand the text context. BERT is trained in two phases. In the pre-training phase, deep bidirectional representations inherited from the nature of the Transformer encoder are learned from huge unlabeled corpora. In the fine-tuning phase, task-specific labeled data is used and parameter tuning is performed to optimize results for a specific problem, such as question answering or short answer grading.

In this work, we experiment with fine-tuning a pre-trained BERT language model and explore the following questions:

  • How well do Transformer-based DL approaches (we use BERT as it is the latest iteration of such models) apply to short answer grading?

  • How much does fine-tuning, involving the collection of domain-specific labeled answers, impact the results obtained?

  • What is the amount of training (number of epochs) needed in order to produce an optimized model using this approach?

  • How well does the same fine-tuned Transformer-based model work across different domains of study for the short answer scoring task?

We begin with an overview of recent approaches in short answer grading, and an overview of BERT and the BERT model architecture, before presenting details on our experiments designed to answer these questions.

2 Related Work

Broadly speaking, the literature pertaining to the problem of short answer grading can be categorized into two groups: (1) earlier approaches that relied heavily on hand-crafted features, and (2) recent deep learning approaches that require minimal, if any, feature engineering.

2.1 Hand-Crafted Features

Mohler and Mihalcea [15] and Mohler et al. [14] are among the earliest research works on automatic short answer grading. These approaches relied on various word similarity measures, corpus-based measures, and alignment of parses of reference and student answers. A benchmark in the field was established with the Student Response Analysis challenge as part of SemEval-2013 [7]. Participating approaches relied on a range of hand-crafted features including corpus-based word similarities, WordNet-based word similarities, part-of-speech tags, sentence parsing, and n-grams; one of the participants also explored domain adaptation. Broadly, the problem of Student Response Analysis is modeled as a special case of Textual Entailment or Semantic Textual Similarity. Ramachandran et al. [22] proposed extracting phrase patterns from reference answers to form the basis of a scoring approach. The approach improves over earlier work in that it explicitly extracts semantic information at the sentence level, as opposed to earlier word-level similarity metrics. Ramachandran and Foltz [23] proposed a short answer grading approach based on text summarization.

2.2 Deep Learning Approaches

With the advances in deep learning, various works have leveraged these approaches. Sultan et al. [26] represented a sentence as the sum of the word embeddings [13] of its tokens, in conjunction with other features. The approach uses word embeddings obtained by deep learning on a large corpus; however, representing a sentence as the sum of its word embeddings ignores structural information. Thus, as a logical extension, subsequent works have explored more sophisticated ways to obtain feature representations of answer sentences. Mueller and Thyagarajan [17] proposed a Long Short-Term Memory (LSTM) based Siamese network to compare a student answer against a reference answer. They observe that one of the major limitations in training LSTM networks is the lack of a large amount of training data, and generate additional pairs of answers by replacing words in the original dataset. The extended dataset is then used to train LSTM networks for short answer grading. The data-intensive nature of deep learning approaches has emerged as an interesting research issue, particularly in data-starved problems such as short answer grading.

Transfer learning has evolved into a promising research direction to address this. The premise is that a generic representation of natural language can be learned from a data-rich generic task and then transferred to downstream tasks that may have limited data. Research efforts to learn universal sentence embeddings for task-specific transfer have yielded impressive improvements on various benchmarks. Notable works include InferSent [4], ELMo [19], ULMFiT [9], GPT [20], and BERT [5]. Saha et al. [25] explored sentence embedding features from InferSent in conjunction with traditional token features. In another recent work, Marvaniya et al. [12] showed that short answer grading based on sentence embedding features can be further improved by leveraging their proposed scoring rubric approach. The current state of short answer grading research has shown that the transfer of sentence embeddings is useful, yet non-contextual approaches encounter limitations on downstream tasks. In this study, we aim to demonstrate the ability and various characteristics of BERT (a recent and robust transfer language model) for short answer grading with limited domain-specific training data.

3 BERT for Short Answer Grading

The broad premise of BERT [5] is that there is a high-level language model that can be encoded into the network irrespective of the downstream task. This high-level language model is learned with two self-supervised objectives: (1) a Masked Language Model (MLM) for a deep bidirectional representation and (2) Next Sentence Prediction (NSP) for understanding the relationship between sentences; this training leverages multiple large corpora. The resultant model, often called the pre-trained BERT model, forms the basis for downstream target tasks. For the task of short answer grading, we perform fine-tuning in the form of Sentence Pair Classification. This model allows us to classify a pair of reference and student answers into desired categories such as correct, incorrect, and contradictory.
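For illustration, the sketch below shows how a reference/student answer pair and its 3-way label might be arranged for sentence pair classification. The example answers, label names, and helper function are hypothetical and are not drawn from our datasets.

```python
# Hypothetical example of packing a (reference answer, student answer) pair
# for BERT-style sentence pair classification with 3-way labels.
LABELS = {"correct": 0, "incorrect": 1, "contradictory": 2}

def make_example(reference: str, student: str, label: str):
    """Pair a reference answer with a student answer and a 3-way label.

    A tokenizer later converts this pair into a single sequence of the form
    [CLS] reference tokens [SEP] student tokens [SEP], with segment ids
    distinguishing the two answers.
    """
    return {
        "text_a": reference,   # first segment (segment id 0)
        "text_b": student,     # second segment (segment id 1)
        "label": LABELS[label],
    }

example = make_example(
    reference="Light is required for photosynthesis.",
    student="Plants can photosynthesize in complete darkness.",
    label="contradictory",
)
```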

Fig. 1. BERT model architecture for short answer grading. We employed the Sentence Pair Classification task-specific model using BERT. To describe the details of the model, we use the same colors for the same representations as in [5, 27].

3.1 BERT Model Architecture

As described in Devlin et al. [5], BERT takes a single token sequence, either from a single text sentence for the MLM objective or from a pair of text sentences (with a [SEP] token added between them as a separator) for the NSP objective. A special classification token [CLS] is added at the front of each sequence, and its final representation is used as input to the classification-task layer. As shown in Fig. 1, the input representations are obtained by summing the token, segment, and learned position embeddings. The segment embeddings identify which sentence each token comes from, and the position embeddings encode the position of each token in the sequence. This is the input to the first Transformer encoder layer, and the output of this layer is fed into the next Transformer layer; BERT stacks multiple such Transformer layers. Each Transformer encoder is composed of two major parts: a self-attention layer with multiple attention heads, followed by token-wise feed-forward layers. Each attention head plays a role akin to a kernel in a convolutional neural network (ConvNet), except that it computes a weighted average. As part of the self-attention mechanism, BERT computes three vectors for each token (called query, key, and value) by multiplying the token representation with three trainable weight matrices (\(W^{Q}\), \(W^{K}\), \(W^{V}\), respectively). Like kernels in a ConvNet, the weight matrices emphasize different parts of the input, and a separate set is learned for every head.

$$\begin{aligned} \varvec{q}_{j}^{i}=\varvec{x}_{j} W_{i}^{Q} \quad \quad \varvec{k}_{j}^{i}=\varvec{x}_{j} W_{i}^{K} \quad \quad \varvec{v}_{j}^{i}=\varvec{x}_{j} W_{i}^{V} \end{aligned}$$
(1)

where \(\varvec{q}_{j}^{i}\), \(\varvec{k}_{j}^{i}\), and \(\varvec{v}_{j}^{i}\) are the query, key, and value vectors (projections), respectively, for the jth token \(\varvec{x}_{j}\) in the ith head. Then, with the query and key vectors, BERT calculates attention weights by applying, in sequence: (1) the dot product of the query vector of a particular token with all the key vectors (\(\varvec{k}_{1}^{i} ... \varvec{k}_{n}^{i}\) in the ith head, where n is the number of tokens), (2) a scaling of the dot products by \(\frac{1}{\sqrt{d_{k}}}\), where \(d_{k}\) is the dimension of the key vectors, and (3) a softmax normalization. The scaling factor \(\frac{1}{\sqrt{d_{k}}}\) shrinks large dot products to avoid extremely small gradients from the softmax.

$$\begin{aligned} aw_{j1}^{i},...,aw_{jn}^{i}=\texttt {softmax}\Big ((\varvec{q}_{j}^{i}\cdot \varvec{k}_{1}^{i},..., \varvec{q}_{j}^{i}\cdot \varvec{k}_{n}^{i})\frac{1}{\sqrt{d_{k}}}\Big ) \end{aligned}$$
(2)

where \(aw_{jk}^{i}\) is the kth normalized attention weight for the jth token in the ith head. The attention weights capture how much each token is related to a particular token in head\(_{i}\). BERT multiplies each value vector by the corresponding attention weight and sums up the weighted results. The output vector contains bi-directional attention information, with the value vectors of related tokens contributing more than others.

$$\begin{aligned} \varvec{z}_{j}^{i}=aw_{j1}^{i}\varvec{v}_{1}^{i}+...+aw_{jn}^{i}\varvec{v}_{n}^{i} \end{aligned}$$
(3)

where \(\varvec{z}_{j}^{i}\) is the output of the self-attention layer for the jth token and \(\varvec{v}_{k}^{i}\) is the kth value vector in the ith head. There is one \(\varvec{z}_{j}\) per attention head. To aggregate these results, BERT concatenates all \(\varvec{z}_{j}\) vectors and multiplies them by a weight matrix. The resulting vector, which carries the attention information from all heads, is summed with the original token representation, followed by layer normalization [3]. Each of the final vectors (representing a particular token) is then passed independently through the corresponding fully connected feed-forward network. This full procedure is repeated once per Transformer encoder, and at the last Transformer encoder the final output for the [CLS] token is used as the sequence representation. Up to this point, this is the pre-training model, and BERT can leverage a huge unlabeled corpus of text to construct a high-level language model. Then, for short answer grading, BERT uses the labeled data not only to fine-tune the pre-trained model but also to construct a classification model through a feed-forward classification layer on top of the pre-trained model.
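To make Eqs. (1)-(3) concrete, the following NumPy sketch implements a single multi-head self-attention layer with the residual connection and layer normalization described above. It is an illustrative simplification under our own naming (random weights, no dropout, and no feed-forward sublayer), not the actual BERT implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    # Layer normalization [3]: zero mean, unit variance per token vector.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    """X: (n_tokens, d_model); W_Q/W_K/W_V: (n_heads, d_model, d_k); W_O: (n_heads*d_k, d_model)."""
    n_heads, _, d_k = W_Q.shape
    head_outputs = []
    for i in range(n_heads):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]   # Eq. (1): query, key, value projections
        scores = (Q @ K.T) / np.sqrt(d_k)              # Eq. (2): scaled dot products
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over tokens
        head_outputs.append(weights @ V)               # Eq. (3): weighted sum of value vectors
    Z = np.concatenate(head_outputs, axis=-1) @ W_O    # concatenate heads, project back to d_model
    return layer_norm(X + Z)                           # residual connection + layer normalization

# Toy usage with random weights: 4 tokens, d_model = 8, 2 heads of size 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(2, 8, 4)) for _ in range(3))
W_O = rng.normal(size=(8, 8))
print(multi_head_self_attention(X, W_Q, W_K, W_V, W_O).shape)  # (4, 8)
```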

4 Experiments

We evaluated our proposed approach on two datasets:

  1. 3-way dataset of SemEval-2013 [7]: We used the SciEntsBank dataset for the 3-way task in the SemEval-2013 challenge. The data consists of questions, reference answers, student answers, and three-way labels (correct, incorrect, and contradictory, or in short co, ic, and cd, respectively) in the science domain. The SemEval-2013 challenge involves three classification subtasks on three given test sets: unseen answers (UA), unseen questions (UQ), and unseen domains (UD).

  2. Two psychology domain datasets: These datasets contain a collection of questions, reference answers, student answers, and three-way labels (correct, partially-correct, and incorrect, or in short co, pc, and ic, respectively). They are based on student answers from two psychology-related textbooks (one on behavioral physiology, with a lot of technical language, and the other on developmental psychology, with mostly non-technical material). Each student response is manually annotated by three experts. Groundtruth is obtained by majority voting over the three annotations.

As shown in Table 1, the class distribution of both datasets is highly skewed. Due to this class imbalance, we select macro-average-F1 to observe how our proposed approach performs overall compared with other recent approaches. Macro-average-F1 computes the F1 score independently for each class and then takes the average of all F1 scores. We additionally report results in terms of accuracy and weighted-average-F1, but due to the class imbalance in the datasets, these two metrics may provide biased evidence.
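As a minimal illustration of the metrics (the labels and predictions below are made up and not from our data), macro-average-F1 and weighted-average-F1 can be computed with scikit-learn as follows:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical 3-way labels: 0 = correct, 1 = incorrect, 2 = contradictory.
y_true = [0, 0, 0, 0, 1, 1, 2, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

acc  = accuracy_score(y_true, y_pred)
m_f1 = f1_score(y_true, y_pred, average="macro")     # unweighted mean of per-class F1
w_f1 = f1_score(y_true, y_pred, average="weighted")  # per-class F1 weighted by class support

# On skewed data, the macro average penalizes poor performance on rare classes,
# whereas accuracy and the weighted average are dominated by the majority class.
print(f"Acc={acc:.2f}  M-F1={m_f1:.2f}  W-F1={w_f1:.2f}")
```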

Table 1. Details of class distribution and train-test split protocols for the SciEntsBank 3-way dataset of the SemEval-2013 challenge and our psychology domain 1 and 2 datasets. The test set of SciEntsBank is divided into three different test sets for the three subtasks: unseen answers (UA), unseen questions (UQ), and unseen domains (UD).

4.1 Pre-training Setup

We chose the BERT\(_{\textsc {base}}\), Uncased pre-trained model, which was pre-trained on the concatenation of BooksCorpus (800M words) and English Wikipedia (2,500M words). Uncased means that the text was converted to lower-case before tokenization and any accent markers were dropped. BERT uses WordPiece embeddings [29] with a 30,000 token vocabulary, and input sequences of up to 512 tokens are supported. The details of the BERT\(_{\textsc {base}}\) model can be found in [5].
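As an illustration of WordPiece tokenization and the sentence-pair input format (using the Hugging Face transformers library rather than the original BERT release, so this is an approximation of our setup; the answer strings are hypothetical):

```python
from transformers import BertTokenizer

# Load the uncased BERT-base WordPiece vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

reference = "Photosynthesis converts light energy into chemical energy."
student = "Plants turn sunlight into sugars via photosynthesis."

# Encode the answer pair as [CLS] reference [SEP] student [SEP];
# token_type_ids mark which segment (0 or 1) each token belongs to.
encoding = tokenizer(reference, student, truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding["token_type_ids"])
```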

4.2 Fine-Tuning Setup

For fine-tuning the pre-trained BERT\(_{\textsc {base}}\) model together with a classification layer, we converted the two datasets into tab-separated values (TSV) files. We set the learning rate of the Adam optimizer to 2e-5 for SemEval-2013 and 3e-5 for the two psychology domain datasets, with the same batch size of 32. We also gradually reduced the training size down to 20% of the entire set to observe how much labeled data is required for fine-tuning, and varied the number of epochs from 4 to 12 to observe how many epochs BERT and the classifier require to complete fine-tuning. For the fine-tuning process, we used two NVIDIA Tesla P100 GPUs (16 GB graphics memory each) and 120 GB of system memory.
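A condensed sketch of this fine-tuning setup is shown below. It uses the Hugging Face transformers library rather than the original BERT release we actually used, so the API calls, toy data, and variable names are illustrative assumptions; only the hyperparameters (learning rate, batch size, number of labels) mirror the values stated above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3).to(device)           # 3-way grading labels

# Toy stand-ins for the TSV rows: (reference answer, student answer, label id).
rows = [("Light is required for photosynthesis.",
         "Photosynthesis needs light.", 0)]
enc = tokenizer([r[0] for r in rows], [r[1] for r in rows],
                truncation=True, padding=True, max_length=512,
                return_tensors="pt")
labels = torch.tensor([r[2] for r in rows])

loader = DataLoader(TensorDataset(enc["input_ids"], enc["token_type_ids"],
                                  enc["attention_mask"], labels),
                    batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # 3e-5 for the psychology sets
model.train()
for epoch in range(4):                                       # 4 epochs proved sufficient (Sect. 4.3)
    for input_ids, token_type_ids, attention_mask, y in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    token_type_ids=token_type_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=y.to(device))
        out.loss.backward()                                  # cross-entropy over the [CLS] logits
        optimizer.step()
```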

Table 2. Performance on the SciEntsBank dataset of SemEval-2013 [7]. All results marked with \(\ddag \) are as reported in [25]. MEAD [23], Graph [23] and Marvaniya et al. [12] report results on the unseen answers protocol only, as their approaches are designed for this scenario. Accuracy (Acc), macro-average-F1 (M-F1), and weighted-average-F1 (W-F1) are reported in percentage.
Table 3. Performance comparison of human agreements and the proposed method on our two psychology (psych.) domain datasets. Accuracy (Acc), macro-average-F1 (M-F1), and weighted-average-F1 (W-F1) are reported in percentage.

4.3 Results and Analysis

We performed a set of experiments to study various aspects of the proposed BERT\(_{\textsc {base}}\) model for the problem of short answer grading, including (1) performance comparison with published literature and human agreement, (2) sufficiency of fine-tuning in terms of supervised data requirements and the number of training epochs, (3) applicability of a fine-tuned model to a different domain, and (4) the ability to jointly fine-tune for multiple domains. Based on the various experiments and their results on the benchmark SciEntsBank dataset and our two psychology domain datasets, we make the following key observations:

Effectiveness of Transfer Learning: As shown in Tables 2 and 3, the fine-tuned model yields impressive results on all the datasets. On the SciEntsBank dataset, we establish state-of-the-art results. Compared to the previous state of the art, Saha et al. [25], which combines sentence embeddings from InferSent [4] with token features, we report improvements ranging from 6% up to 10% in macro-average-F1. Note that the unsupervised pre-training of BERT makes it possible to leverage a huge amount of existing natural language material. This puts the approach at an advantage over techniques such as InferSent [4] that require a large supervised (and therefore expensive and limited) corpus for pre-training.

Fig. 2. Macro-average-F1 scores with different sizes of training sets for the two domains, overlaid with human performance. Evaluations are done on a held-out test set of 20%.

On our datasets, we obtain impressive macro-average-F1 scores ranging from 80% up to 85%, indicating the robustness of the model's transferability to the target task of short answer grading. We also report human performance on our datasets as a baseline against which the model can be compared. As outlined earlier, each student response is annotated by three experts. The variability in the annotations enables us to establish a human performance baseline. Table 3 compares each human annotator against the majority vote (MV) in terms of accuracy (Acc), macro-average-F1 (M-F1), and weighted-average-F1 (W-F1).

Effectiveness for Data-Starved Problems: Task-specific supervised fine-tuning is possible with a small number of samples. On the SciEntsBank dataset, the training set includes \(\sim \)5K samples, which yields results better than task-specific learning. To further study this property of the model, we designed an experiment to train the model with small portions of the training data. Figure 2 shows the performance in terms of macro-average-F1 when the training data is reduced from 80% of the whole set to a mere 20%. Evaluation is done on a constant held-out test set consisting of 20% of the samples. Note the decrease in the slope as the training set expands, suggesting diminishing returns as training data is added. The increase in M-F1 is about 10% as the training set grows from 20% to 80%. For data-starved problems, a rather generous trade-off can therefore be made to obtain reasonably good performance with limited task-specific fine-tuning data. Interestingly, the M-F1 with 40% training data is in the same range as human performance (shown in Table 3).

Effectiveness of Training Epochs on Fine-Tuning: We also performed experiments fine-tuning BERT with a varying number of epochs. We observed that fine-tuning for 4 and 12 epochs does not yield significantly different macro-average-F1 results (85.7 vs. 85.4 on domain 1, and 82.2 vs. 83.7 on domain 2, respectively), indicating that task-specific transfer takes place within the first few epochs.

Table 4. Cross- and joint-domain fine-tuning. Accuracy (Acc), macro-average-F1 (M-F1), and weighted-average-F1 (W-F1) are reported in percentage.

Effectiveness in Cross- and Joint-Domain Fine-Tuning: We further evaluated the fine-tuned model's ability to generalize to unseen domains. Table 4 reports the performance of the fine-tuned models on both domains. It shows that the model fine-tuned using domain 1 yields very poor results on domain 2, and vice versa. This suggests that domain-specific supervised data is indeed required for efficient fine-tuning. As a follow-up, we fine-tuned a model using a combined set of data from both domains, which yields results comparable to domain-specific tuning. This provides evidence that the model can be jointly fine-tuned for both domains.

5 Conclusion

This paper conclusively demonstrates that Transformer-based pre-trained models push the state of the art in short answer grading to a level that may be approaching the ceiling of what is possible. In comparison with human scorers, the model learns the "wisdom of the crowd", surpassing the performance of any individual human scorer on our datasets. The amount of fine-tuning needed is reasonable; even with just a few thousand labeled samples, we are able to obtain superior results. We also show that while a model fine-tuned on data from one domain cannot be directly applied to grading in other domains, it is possible to create a single model, fine-tuned on data from multiple domains, that works for each of them. Going forward, we expect to investigate whether adding an additional domain-specific text corpus to a pre-trained model improves its ability to process language for that domain. We will continue to experiment with ways to minimize the amount of fine-tuning (e.g., through characterization of which types of labeled samples yield the highest marginal improvement during fine-tuning, thus allowing for more efficient data collection for automated grading). Finally, work on model management, reuse of models, and devising efficient methods to add new labeled samples to existing fine-tuned models will be of interest so that a model can adapt over time.