1 Introduction

Assessing the knowledge acquired by students is one of the most important aspects of the learning process, as it provides feedback that helps students correct their misunderstandings and improves their overall learning performance. Traditionally, assessment is conducted by instructors or teachers. However, this assessment paradigm is unsuitable in many cases, especially when teaching resources are not readily available. To address this gap, many computer-assisted assessment approaches have been developed to automate the assessment process [1].

One specific task, automatic short answer grading (ASAG), whose objective is to automatically score students' free-text answers against the corresponding reference answer [9], has attracted great attention from a variety of research communities, and some promising results have already been obtained [5, 7, 8, 9, 10]. However, ASAG remains challenging for two main reasons. First, student answers are free text and can express the same content in many different ways, so an ASAG approach needs a deep semantic understanding of the student answers. Second, the questions or assessments (and the corresponding reference answers) are usually open-ended and span different domains, so an ASAG approach should be general and applicable to different scenarios.

In this paper, to address the challenges above, we take advantage of recent advances in natural language processing [2, 12] and propose a deep learning framework that tackles the ASAG problem end to end. Specifically, our framework uses attention mechanisms to understand the semantics of the student and reference answers by focusing on the most relevant information. It is also flexible and efficient, as it can easily be extended with extra neural layers while still training quickly thanks to its high degree of parallelism. Our main contributions are summarized as follows: (1) We propose an end-to-end approach that does not require any feature engineering effort to tackle the short answer grading problem; (2) We develop a novel framework that is able to model the relation between student and reference answers by accurately identifying matching information and understanding the semantic meaning; and (3) The proposed framework can be used in a wide range of domains and scales easily to large datasets. It is demonstrated on a large-scale real-world dataset collected from millions of K-12 students.

2 Our Approach

In this section, we introduce our proposed framework, whose overall structure is shown in Fig. 1. Before detailing each component, we first introduce the notation. We use bold lowercase letters for vectors and bold uppercase letters for matrices. Subscripts denote the vector index, which in most cases is the position of a word in its sentence. Superscripts denote the category of a vector.

Fig. 1. The overview of our model (better viewed in color).

Transformer Layer. The inputs of the transformer layer are the student and reference answers, which are two sequences of words denoted as \(\{\mathbf{w}_1^q, \mathbf{w}_2^q, \cdots , \mathbf{w}_n^q\}\) and \(\{\mathbf{w}_1^p, \mathbf{w}_2^p, \cdots , \mathbf{w}_n^p\}\), respectively, where \(\mathbf{w}_i^q\) and \(\mathbf{w}_i^p\) are pre-trained word embeddings. Next, the transformer model [12] is applied as \(\{\mathbf{h}_1^*, \mathbf{h}_2^*, \cdots , \mathbf{h}_n^*\} = \mathrm{transformer}(\mathbf{w}_1^*, \mathbf{w}_2^*, \cdots , \mathbf{w}_n^*)\), where \(*\in \{p, q\}\). Each \(\mathbf{h}_i^q\) and \(\mathbf{h}_i^p\) is a word representation that carries the contextual information of its sentence in the student and reference answers, respectively.
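As an illustration of how contextual vectors \(\mathbf{h}_i^*\) can be obtained from word embeddings \(\mathbf{w}_i^*\), the following NumPy sketch implements a heavily simplified encoder step: single-head scaled dot-product self-attention followed by a residual connection and layer normalization. It is not the full transformer of [12] (no multiple heads, positional encoding, or feed-forward sublayer), and the dimensions, random weights, and the choice to share parameters between the two answers are assumptions made only for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and (roughly) unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_block(W, Wq, Wk, Wv):
    """Simplified single-head encoder step (an assumption, not the full model).

    W is an (n, d) matrix of word embeddings for one answer; the result is an
    (n, d) matrix of contextual vectors.
    """
    Q, K, V = W @ Wq, W @ Wk, W @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n) pairwise scores
    attn = softmax(scores, axis=-1)           # each row sums to 1
    return layer_norm(W + attn @ V)           # residual connection + layer norm

rng = np.random.default_rng(0)
n, d = 5, 8                                   # hypothetical sequence length and width
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)]

w_q = rng.normal(size=(n, d))                 # stand-in student answer embeddings
w_p = rng.normal(size=(n, d))                 # stand-in reference answer embeddings
h_q = encoder_block(w_q, *params)             # contextual student vectors
h_p = encoder_block(w_p, *params)             # contextual reference vectors
```

Each output row depends on every input row of the same answer, which is what makes the representation contextual.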

Multiway Attention. We design the multiway attention layer to capture the relations between student and reference answers. It consists of two blocks. The first is a self-attention block, in which each \(\mathbf{h}_i^*\) attends to every \(\mathbf{h}_j^*, j \in \{1,2,\cdots , n\}\) to obtain a new representation \(\mathbf{s}_i^*\), \(*\in \{p, q\}\). The second is a cross-attention block, in which each \(\mathbf{h}_i^q\) attends to every \(\mathbf{h}_j^p, j \in \{1,2,\cdots , n\}\) to obtain another set of new representations \(\mathbf{h}_i^t, t \in \{a, s, m, d\}\), where \(a\), \(s\), \(m\), and \(d\) denote the additive, subtractive, multiplicative, and dot-product attention mechanisms, respectively [11].
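The two blocks can be sketched in NumPy as follows. The text does not give the exact parameterization of the four scoring functions, so the forms below are common textbook choices and should be read as assumptions: additive \(\mathbf{v}^T \tanh(\mathbf{W}_1 \mathbf{h}_i + \mathbf{W}_2 \mathbf{h}_j)\), subtractive \(\mathbf{v}^T \tanh(\mathbf{W}_s (\mathbf{h}_i - \mathbf{h}_j))\), multiplicative (bilinear) \(\mathbf{h}_i^T \mathbf{W}_m \mathbf{h}_j\), and scaled dot-product \(\mathbf{h}_i^T \mathbf{h}_j / \sqrt{d}\). All weights and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, k = 4, 6, 5                          # hypothetical length, width, hidden size
W1, W2, Ws = (rng.normal(scale=0.1, size=(d, k)) for _ in range(3))
Wm = rng.normal(scale=0.1, size=(d, d))    # bilinear weight
v = rng.normal(scale=0.1, size=k)

# Each scorer takes a: (n_q, 1, d) and b: (1, n_k, d) and returns (n_q, n_k) scores.
def additive(a, b):
    return np.tanh(a @ W1 + b @ W2) @ v

def subtractive(a, b):
    return np.tanh((a - b) @ Ws) @ v

def multiplicative(a, b):
    return ((a @ Wm) * b).sum(axis=-1)

def dot_product(a, b):
    return (a * b).sum(axis=-1) / np.sqrt(a.shape[-1])

def attend(queries, keys, score_fn):
    """Every query row attends over all key rows; returns (n_q, d)."""
    scores = score_fn(queries[:, None, :], keys[None, :, :])
    weights = softmax(scores, axis=-1)     # attention distribution per query
    return weights @ keys                  # weighted sum of key vectors

h_q = rng.normal(size=(n, d))              # stand-in contextual student vectors
h_p = rng.normal(size=(n, d))              # stand-in contextual reference vectors

# Self-attention block: each answer attends to itself (scorer choice is an assumption).
s_q = attend(h_q, h_q, dot_product)
s_p = attend(h_p, h_p, dot_product)

# Cross-attention block: student positions attend to reference positions, four ways.
scorers = {"a": additive, "s": subtractive, "m": multiplicative, "d": dot_product}
h_cross = {t: attend(h_q, h_p, fn) for t, fn in scorers.items()}
```

All four cross-attention outputs have the same shape, so they can be combined position by position in a later layer.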

Inside Aggregation. This layer aggregates the outputs of the multiway attention layer into a single sequence representation \(\mathbf{Z}\). Specifically, we first concatenate the outputs of the cross-attention and self-attention blocks position by position, separately for each block, and feed them to different position-wise feed-forward networks to obtain the compressed representations \(\mathbf{g}_i^*\), \(*\in \{p, q, c\}\), where \(p\), \(q\), and \(c\) denote the reference answer sequence, the student answer sequence, and the cross-attention sequence, respectively. We then concatenate the outputs \(\mathbf{g}_i^*\) position by position and, after another Transformer block, obtain the new sequence representation \(\mathbf{Z} = \mathrm{transformer}([\mathbf{g}_i^p, \mathbf{g}_i^q, \mathbf{g}_i^c]), i \in \{1,2,\cdots , n\}\), which contains the information in the student and reference answers and the relations between them.
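A minimal NumPy sketch of the concatenation and compression steps is given below. It assumes two-layer ReLU networks for the position-wise feed-forward step, random stand-in inputs, and hypothetical sizes. The superscripts follow the Transformer Layer notation (\(q\) for the student answer, \(p\) for the reference answer), and the final Transformer block that produces \(\mathbf{Z}\) is omitted, so the sketch stops at its input.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Apply the same two-layer ReLU network independently to every row of x."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def make_ffn(rng, d_in, d_hidden, d_out):
    # Randomly initialized parameters, for illustration only.
    return (rng.normal(scale=0.1, size=(d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(scale=0.1, size=(d_hidden, d_out)), np.zeros(d_out))

rng = np.random.default_rng(0)
n, d, k = 4, 6, 3                              # hypothetical length, width, compressed width

# Stand-ins for the multiway attention outputs.
s_q = rng.normal(size=(n, d))                  # self-attention output, student answer
s_p = rng.normal(size=(n, d))                  # self-attention output, reference answer
cross = [rng.normal(size=(n, d)) for _ in range(4)]  # outputs for t in {a, s, m, d}

# Concatenate the four cross-attention outputs position by position.
c_in = np.concatenate(cross, axis=-1)          # (n, 4d)

# A separate feed-forward network compresses each stream.
ffn_q = make_ffn(rng, d, 2 * d, k)
ffn_p = make_ffn(rng, d, 2 * d, k)
ffn_c = make_ffn(rng, 4 * d, 2 * d, k)
g_q = position_wise_ffn(s_q, *ffn_q)           # (n, k)
g_p = position_wise_ffn(s_p, *ffn_p)           # (n, k)
g_c = position_wise_ffn(c_in, *ffn_c)          # (n, k)

# Input to the final Transformer block: [g_i^p, g_i^q, g_i^c] at each position i.
z_in = np.concatenate([g_p, g_q, g_c], axis=-1)  # (n, 3k)
```

Because the feed-forward networks act on each position separately, the sequence length \(n\) is preserved and only the feature dimension changes.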

Prediction Layer. This layer produces the evaluation of the student answer. Specifically, we first convert the aggregated sequence representation \(\mathbf{Z}\) to a fixed-length vector with a self-attention pooling layer. This transformation is defined as \(\mathbf{x} = \mathrm{softmax}(\mathbf{w}_1^z \tanh(\mathbf{W}_2^z \mathbf{Z}^T))\mathbf{Z}\), where \(\mathbf{w}_1^z\) and \(\mathbf{W}_2^z\) are parameters learned during training. We then build a feed-forward network that takes \(\mathbf{x}\) as input and outputs a two-dimensional vector, which is passed through a softmax function to obtain the final probabilistic evaluation vector. The first entry gives the probability that the answer is wrong, and the second entry gives the probability that it is right. The training objective is to minimize the cross-entropy loss against the binary right/wrong labels.
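The pooling formula and the two-way softmax output can be checked with a short NumPy sketch. The shapes are assumptions chosen so that the products are well defined: \(\mathbf{Z}\) is \(n \times m\), \(\mathbf{W}_2^z\) is \(r \times m\), and \(\mathbf{w}_1^z\) has length \(r\). A single linear layer stands in for the feed-forward network, and all parameters are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(Z, w1, W2):
    """x = softmax(w1 tanh(W2 Z^T)) Z: one weight per position, then a weighted sum."""
    weights = softmax(w1 @ np.tanh(W2 @ Z.T))   # (n,), sums to 1
    return weights @ Z, weights                 # pooled vector has shape (m,)

def predict(x, W, b):
    # Linear layer (stand-in for the feed-forward network) followed by softmax.
    return softmax(x @ W + b)                   # [P(wrong), P(right)]

def cross_entropy(probs, label):
    # Negative log-probability assigned to the true class (0 = wrong, 1 = right).
    return -np.log(probs[label])

rng = np.random.default_rng(0)
n, m, r = 4, 9, 5                               # hypothetical sizes
Z = rng.normal(size=(n, m))                     # stand-in aggregated sequence
w1 = rng.normal(size=r)
W2 = rng.normal(size=(r, m))

x, weights = attention_pool(Z, w1, W2)
probs = predict(x, rng.normal(size=(m, 2)), np.zeros(2))
loss = cross_entropy(probs, label=1)            # loss if the answer is labeled right
```

The pooled vector \(\mathbf{x}\) has the same length regardless of \(n\), which is what allows answers of different lengths to share one classifier.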

3 Experiments

In this section, we conduct experiments on a large real-world educational dataset, which contains 120,000 pairs of student answers and question analysis from an online education platform, each labeled with a binary value indicating whether the student has the right answer. The positive and negative instances are balanced. We randomly select 30,000 samples as our test data and use the rest for training and validation. The hyperparameters of our model are selected by internal cross-validation. We use both AUC and accuracy as our evaluation metrics; for both metrics, a higher value indicates better performance.
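For reference, both metrics can be computed from predicted right-answer probabilities as in the NumPy sketch below. The 0.5 decision threshold for accuracy and the toy labels and scores are assumptions for illustration, not values from our experiments. AUC is computed here as the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half.

```python
import numpy as np

def accuracy(labels, scores, threshold=0.5):
    """Fraction of instances whose thresholded prediction matches the label."""
    preds = (scores >= threshold).astype(int)
    return float(np.mean(preds == labels))

def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # all positive/negative score pairs
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Toy example (hypothetical values).
labels = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

acc = accuracy(labels, scores)                  # 0.75: one right answer falls below 0.5
roc_auc = auc(labels, scores)                   # 0.75: 3 of 4 pairs are ranked correctly
```

The pairwise form uses memory proportional to the number of positive-negative pairs, so a rank-based implementation is preferable for a test set of 30,000 samples.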

We compare our model with several state-of-the-art baselines: (1) logistic regression (LR); (2) gradient boosted decision tree (GBDT) [3, 13]; (3) multichannel convolutional neural networks (TextCNN) [4]; (4) sentence embedding by Bidirectional Transformer block (Bi-Transformer) [12]; (5) Multiway Attention Network (MAN) [11]; and (6) Manhattan LSTM with max pooling (MaLSTM) [6].

3.1 Experimental Results

We report the experimental results in Table 1. From the table, we observe that our model outperforms all of the baselines. We argue that this is because our model is able to effectively capture the semantic information between student and reference answers. This is supported by the fact that MAN shows the best performance among all baselines, as it not only aggregates sentence information within the Transformer block but also matches words in the query sentence and the answer sentence using multiple attention functions.

Table 1. ASAG performance comparison on a real-world K-12 dataset.

4 Conclusion

In this paper, we present our multiway attention network for automatic short answer grading. We use transformer blocks and attention mechanisms to extract answer matching information. To comprehensively capture the semantic relations between the reference answer and the student answers, we apply multiway attention functions instead of a single attention channel. Experimental results on a large real-world education dataset demonstrate the effectiveness of the proposed framework. Several directions need further exploration. We may use one attention mechanism with multiple heads instead of multiple attention mechanisms, and we may replace the transformer block with another type of sentence encoder, such as a self-attention network or a hierarchical attention network.