Automatic Short Answer Grading via Multiway Attention Networks

Liu, Tiaoqiao; Ding, Wenbiao; Wang, Zhiwei; Tang, Jiliang; Huang, Gale Yan; Liu, Zitao

doi:10.1007/978-3-030-23207-8_32

Tiaoqiao Liu²⁰,
Wenbiao Ding²⁰,
Zhiwei Wang²¹,
Jiliang Tang²¹,
Gale Yan Huang²⁰ &
…
Zitao Liu²⁰

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 11626))

Included in the following conference series:

International Conference on Artificial Intelligence in Education

3557 Accesses
27 Citations

Abstract

Automatic short answer grading (ASAG), which autonomously score student answers according to reference answers, provides a cost-effective and consistent approach to teaching professionals and can reduce their monotonous and tedious grading workloads. However, ASAG is a very challenging task due to two reasons: (1) student answers are made up of free text which requires a deep semantic understanding; and (2) the questions are usually open-ended and across many domains in K-12 scenarios. In this paper, we propose a generalized end-to-end ASAG learning framework which aims to (1) autonomously extract linguistic information from both student and reference answers; and (2) accurately model the semantic relations between free-text student and reference answers in open-ended domain. The proposed ASAG model is evaluated on a large real-world K-12 dataset and can outperform the state-of-the-art baselines in terms of various evaluation metrics.

Z. Wang—Work was done when the authors did internship in TAL AI Lab.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Improving Short Answer Grading Using Transformer-Based Pre-training

Incorporating Question Information to Enhance the Performance of Automatic Short Answer Grading

Going deeper: Automatic short-answer grading by combining student and question models

Article 01 January 2020

1 Introduction

Assessing the knowledge acquired by students is one of the most important aspects of the learning process as it provides feedback to help students correct their misunderstanding of knowledge and improves their overall learning performance. Traditionally, the assessing paradigm is often conducted by instructors or teachers. However, this access paradigm is not suitable in many cases especially when teaching resources are not readily available. To address this gap, many computer-assisted assessment approaches are developed to automate the assessment process [1].

One specific task, automatic short answer grading (ASAG), whose objective is to automatically score the free-text answers from students according to the corresponding reference answer [9], has attracted great attentions from a variety of research communities and some promising results have been already obtained [5, 7,8,9,10]. However, ASAG still remains challenging mainly for two reasons. Firstly, the student answers are expressed in different ways of free texts. Thus, it requires the ASAG approach to have a deep semantic understanding of the student answers. Secondly, the questions or assessments (and the corresponding reference answers) usually are open-ended and across different domains. The ASAG approach should be general and applicable into different scenarios.

In this paper, to address challenges above, we take the advantage of recent advances in natural language processing field [2, 12] and propose a deep learning framework to tackle the ASAG problem in an end-to-end approach. Specifically, our framework utilizes attention mechanisms to understand the semantics of student and reference answers with most relevant information and is very flexible and efficient as it can be easily extended with extra neuron layers while still maintaining fast training speed thanks to its high parallelization ability. Our main contributions are summarized as follows: (1) We propose an end-to-end approach that does not require any feature engineering effort to tackle the short answer grading problem; (2)We develop a novel framework that is able to modeling the relation between student and reference answers by accurately identifying matching information and understanding the semantic meaning; and (3) The proposed framework can be used in a wide range of domains and is easily scalable for large-scale datasets. It is demonstrated on a large-scale real-world dataset collected from millions of K-12 students.

2 Our Approach

In this section, we introduce our proposed framework, the overall structure is shown in Fig. 1. Before detailing each component next, we first introduce the notations. We use bold lower case letters for vectors and bold upper case letters for matrices. We use subscript to represent the vector index, which is the index of word in each sentence in most cases. We also use superscript to represent the category of vectors.

Transformer Layer. The input of the transformer layer is the student and reference answer, which are two sequences of words and denoted as \(\{\mathbf{w}_1^q, \mathbf{w}_2^q, \cdots , \mathbf{w}_n^q\}\) and \(\{\mathbf{w}_1^p, \mathbf{w}_2^p, \cdots , \mathbf{w}_n^p\}\), respectively, where \(\{\mathbf {w}_i^q\}\) and \(\{\mathbf {w}_i^p\}\) are the pre-trained word embeddings. Next, the transformer [12] model is applied as: \(\{\mathbf{h}_1^*, \mathbf{h}_2^*, \cdots , \mathbf{h}_n^*\}\) = transformer(\(\mathbf{w}_1^*, \mathbf{w}_2^*, \cdots , \mathbf{w}_n^*)\), where \(*\in \{p, q\}\) and each \(\{\mathbf {h}_i^q\}\) and \(\{\mathbf {h}_i^p\}\) are the word embeddings that contain its contextual sentence information in the student and reference answers, respectively.

Multiway Attention. We design the multiway attention layer to capture the relations between student and reference answers. Specifically, it consists of two blocks. The first is self-attention block where each \(\mathbf{h}_i^*\) will attend every \(\mathbf{h}_j^*, j \in \{1,2,\cdots , n\}\) to obtain new representation \(\mathbf{s}_i^*\), \(*\in \{p, q\}\). The second is cross-attention block in which each \(\mathbf{h}_i^q\) will attend every \(\mathbf{h}_j^p, j \in \{1,2,\cdots , n\}\) to obtain another set of new representations \(\mathbf{h}_i^t, t \in \{a, s, m, d\}\), where a, s, m, d are addictive, subtractive, multiplicative, and dot-product attention mechanisms, respectively [11].

Inside Aggregation. This layer is designed to aggregate multiway attention layer outputs to a single representation \(\mathbf{z}\). Specifically, we first concatenate the outputs from cross-attention and self-attention blocks by positions respectively and feed them to different position-wise feed forward networks to obtain the compressed representations \(\mathbf{g}_i^*\), \(*\in \{p, q, c\}\), where p, q, c represent student answer sequence, reference answer sequence, and cross-attention sequence, respectively. We concatenate the outputs \(\mathbf{g}_i^*\) by positions and after another Transformer block, we get new sequence representation \(\mathbf{Z} = transformer([g_i^p, g_i^q, g_i^c]), i \in \{1,2,\cdots , n\}\) which contains the information in student and reference answers and the relations between them.

Prediction Layer. The evaluation of student answer will be produced by this layer. Specifically, we first convert the aggregated sequence representation \(\mathbf{Z}\) to a fixed-length vector with self-attention pooling layer. This transformation is defined as: \(\mathbf{x} = softmax(\mathbf{w_1^z} tanh(\mathbf{W_2^z} \mathbf{Z^T}))\mathbf{Z}\), where \(\mathbf{w_1^z}\) and \(\mathbf{W_2^z}\) are learned parameters during training step. Then we build a feed forward network that takes \(\mathbf{x}\) as input and outputs a two-dimensional vector. The output vector is sent to a softmax function to obtain the final probabilistic evaluation vector. The first entry gives the probability of wrong answer while the second entry gives right answer probability. The objective is to minimize the cross entropy of the relevance labels.

3 Experiments

In this section, we conduct experiments on a large real-world educational data, which contains 120,000 pairs of student answers and question analysis from an online education platform, each labeled with binary value indicating whether the student has the right answer. The positive and negative instances are balanced and we randomly select 30,000 samples as our test data and use the rest for validation and training. The hyperparameters of our model are selected by internal cross validation. We use both AUC and accuracy as our evaluation metrics and for both metrics, a higher value indicates better performance.

We compare our model with several state-of-the-art baselines. More specifically, we choose: (1) Logistic regression (LR). (2) Gradient boosted decision tree (GBDT) [3, 13]. (3) Multichannel convolutional neural networks (TextCNN) [4]. (4) Sentence embedding by Bidirectional Transformer block (Bi-Transformer) [12]. (5) Multiway Attention Network (MAN) [11]. And (6) Manhattan LSTM with max pooling (MaLSTM) [6].

3.1 Experimental Results

We report the experimental results in Table 1. From the table, we observe that our model outperforms all of the baselines. We argue that this is because our model is able to effectively capture the semantic information between student and reference answers. This is confirmed by the fact that MAN shows the superior performance among all baselines, as it not only aggregates sentence information within Transformer block, but matches words in both query sentence and answer sentence from multiple attention functions.

Table 1. ASAG performance comparison on a real-world K-12 dataset.

Full size table

4 Conclusion

In this paper we present our multi-way attention network for automatic short answer grading. We use transformer blocks and attention mechanisms to extract answer matching information. To comprehensively capture the semantic relations between the reference answer and the student answers, we apply multiway attention functions instead of single attention channel. Experiment results on a large real-world education dataset demonstrate the effectiveness of the proposed framework. There are several directions that need further exploration. We may use one attention mechanism with multiple heads instead of multiple attention mechanisms and we may replace transformer block with other type of sentence encoder like self-attention network or hierarchical attention network.

References

Daradoumis, T., Bassi, R., Xhafa, F., Caballé, S.: A review on massive e-learning (MOOC) design, delivery and assessment. In: 2013 Eighth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), pp. 208–213. IEEE (2013)
Google Scholar
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Friedman, J.H.: Stochastic gradient boosting (1999)
Google Scholar
Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
Mitchell, T., Russell, T., Broomhead, P., Aldridge, N.: Towards robust computerised marking of free-text responses (2002)
Google Scholar
Mueller, J., Thyagarajan, A.: Siamese recurrent architectures for learning sentence similarity. In: AAAI, vol. 16, pp. 2786–2792 (2016)
Google Scholar
Nielsen, R.D., Ward, W., Martin, J.H.: Recognizing entailment in intelligent tutoring systems. Nat. Lang. Eng. 15(4), 479–501 (2009)
Article Google Scholar
Ramachandran, L., Cheng, J., Foltz, P.: Identifying patterns for short answer scoring using graph-based lexico-semantic text matching. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 97–106 (2015)
Google Scholar
Saha, S., Dhamecha, T.I., Marvaniya, S., Sindhgatta, R., Sengupta, B.: Sentence level or token level features for automatic short answer grading?: use both. In: Penstein Rosé, C., et al. (eds.) AIED 2018. LNCS (LNAI), vol. 10947, pp. 503–517. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93843-1_37
Chapter Google Scholar
Sultan, M.A., Salazar, C., Sumner, T.: Fast and easy short answer grading with high accuracy. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1070–1075 (2016)
Google Scholar
Tan, C., Wei, F., Wang, W., Lv, W., Zhou, M.: Multiway attention networks for modeling sentence pairs. In: IJCAI, pp. 4411–4417 (2018)
Google Scholar
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Google Scholar
Ye, J., Chow, J.H., Chen, J., Zheng, Z.: Stochastic gradient boosted distributed decision trees. In: CIKM 2009 (2009)
Google Scholar

Download references

Author information

Authors and Affiliations

TAL AI Lab, Beijing, China
Tiaoqiao Liu, Wenbiao Ding, Gale Yan Huang & Zitao Liu
Data Science and Engineering Lab, Michigan State University, East Lansing, USA
Zhiwei Wang & Jiliang Tang

Authors

Tiaoqiao Liu
View author publications
You can also search for this author in PubMed Google Scholar
Wenbiao Ding
View author publications
You can also search for this author in PubMed Google Scholar
Zhiwei Wang
View author publications
You can also search for this author in PubMed Google Scholar
Jiliang Tang
View author publications
You can also search for this author in PubMed Google Scholar
Gale Yan Huang
View author publications
You can also search for this author in PubMed Google Scholar
Zitao Liu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Zitao Liu .

Editor information

Editors and Affiliations

University of Sao Paulo, Sao Paulo, Brazil
Seiji Isotani
University of Malaga, Málaga, Spain
Eva Millán
Carnegie Mellon University, Pittsburgh, PA, USA
Amy Ogan
DePaul University, Chicago, IL, USA
Peter Hastings
Carnegie Mellon University, Pittsburgh, PA, USA
Bruce McLaren
University College London, London, UK
Rose Luckin

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Liu, T., Ding, W., Wang, Z., Tang, J., Huang, G.Y., Liu, Z. (2019). Automatic Short Answer Grading via Multiway Attention Networks. In: Isotani, S., Millán, E., Ogan, A., Hastings, P., McLaren, B., Luckin, R. (eds) Artificial Intelligence in Education. AIED 2019. Lecture Notes in Computer Science(), vol 11626. Springer, Cham. https://doi.org/10.1007/978-3-030-23207-8_32

Download citation

DOI: https://doi.org/10.1007/978-3-030-23207-8_32
Published: 21 June 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23206-1
Online ISBN: 978-3-030-23207-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Automatic Short Answer Grading via Multiway Attention Networks

Abstract

Similar content being viewed by others

Improving Short Answer Grading Using Transformer-Based Pre-training

Incorporating Question Information to Enhance the Performance of Automatic Short Answer Grading

Going deeper: Automatic short-answer grading by combining student and question models

1 Introduction

2 Our Approach

3 Experiments

3.1 Experimental Results

4 Conclusion

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

Automatic Short Answer Grading via Multiway Attention Networks

Abstract

Similar content being viewed by others

Improving Short Answer Grading Using Transformer-Based Pre-training

Incorporating Question Information to Enhance the Performance of Automatic Short Answer Grading

Going deeper: Automatic short-answer grading by combining student and question models

1 Introduction

2 Our Approach

3 Experiments

3.1 Experimental Results

4 Conclusion

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation