
1 Introduction

Automated essay scoring (AES) is the task of employing computer programs to assign grades to essays based on their content, grammar, and structure. It has become an important educational application of natural language processing (NLP). For example, Educational Testing Service (ETS) uses AES systems to evaluate the writing ability of students. Such systems can also be applied to quality assessment and pricing of user-generated content. Typically, AES systems treat the task as a regression problem over handcrafted features (e.g., length-based and lexical features), and most of them achieve good results [1, 10, 16]. However, such systems require feature engineering, which is time-consuming and labor-intensive. Therefore, many researchers have turned to neural networks, which can model complex patterns without human assistance [3, 6, 11, 14].

Previous works mainly focus on the text itself [6, 13, 14] and neglect the topic information that connects essays to their prompts. Prompts specify the requirements and topics for students’ writing. It is commonly observed that off-prompt essays receive low scores, whereas high-scoring essays stay relevant to the prompt. Chen and Li [2] extracted document-level similarity between the essay and the topic for scoring and achieved good performance, but document-level features alone may lose fine-grained information. To measure more precisely how each part of an essay adheres to the prompt, Zhang and Litman [17] proposed a co-attention based neural network that models the similarity at sentence level. However, some prompts are highly abstract, making it hard to score an essay based only on its similarity to the prompt. Thus, we introduce example-based learning as an auxiliary task to capture such hidden features.

Our main contributions are as follows:

  • We design a dynamic semantic matching block that captures hidden features through example-based learning, which serves as an auxiliary task for AES.

  • We propose a hierarchical model that extracts semantic features at both the sentence level and the document level, which are useful for evaluating the coherence and relevance of essays.

  • Experimental results show that our model achieves higher Quadratic Weighted Kappa (QWK) scores than previous methods on five of the eight prompts of the ASAP dataset.

2 Related Work

Automated essay scoring (AES) systems have been deployed to assign grades to essays for decades. The first AES system, Project Essay Grade, was created in 1966 and relies on linguistic surface features [12]. Recent works mainly use neural networks for automated essay scoring. Dong and Zhang [5] employed a two-layer CNN to learn sentence and essay representations. In contrast, Taghipour and Ng [14] used an LSTM, which effectively learned features for scoring. However, these works consider only the essay itself and ignore its relatedness to the topic.

High-scoring essays keep closely to the prompt. Some researchers therefore consider the relevance of the essay to the given prompt, since an essay cannot receive a high score if it is off-topic. There are many ways to compute this relevance. Higgins et al. [8] extracted sentence features based on semantic similarity measures and discourse structure to capture breakdowns in coherence. Chen and Li [2] proposed hierarchical neural networks that use the similarity between the essay and the topic as auxiliary information for scoring. All of these works take prompt relevance into account, as it is an important part of scoring guidelines. However, semantic matching against the prompt itself is difficult because prompts consist of abstract, general sentences. In our approach, we instead generate relevance features by semantically matching the essay against high-scoring essays and use them as auxiliary features for prediction.

3 Model

In this section, we describe the proposed hierarchical model, AES-SE, which consists of three parts: 1) a coherence modeling block, 2) a relevance modeling block, and 3) a dynamic semantic matching block (Fig. 1).

Fig. 1. An overview of our model. There are three parts: the coherence modeling block, the relevance modeling block, and the dynamic semantic matching block. All extracted features are concatenated and fed to a dense layer for the final score.

3.1 Coherence and Relevance Modeling

To model semantic coherence within a document and its relevance to the prompt, we apply the coherence modeling block and the relevance modeling block. Considering only features within local cliques is not enough [7, 9]; instead, we use a self-attention mechanism to capture semantic changes across the whole document.

Sentence Representation. To capture lexical-semantic relations among words, we use pre-trained BERT [4] to obtain the sentence representation \(S_i\):

$$\begin{aligned} S_i=BERT(W_e) \end{aligned}$$
(1)

where \(W_e\) are the words of each sentence in the essay.
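For concreteness, Eq. (1) can be implemented as follows. This is a minimal sketch using the HuggingFace Transformers library; the toolkit choice and the use of the [CLS] hidden state as the sentence vector are our assumptions, since the paper does not specify them (mean pooling over tokens is a common alternative).

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_sentences(sentences):
    """Map each sentence W_e to a fixed-size representation S_i (Eq. 1)."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch)
    # Take the [CLS] hidden state as the sentence vector (768-dim for BERT-base).
    return out.last_hidden_state[:, 0, :]      # shape: (n_sentences, 768)

S = encode_sentences(["The essay opens with a claim.",
                      "It then gives supporting evidence."])
```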

Coherence Modeling. To extract the coherence feature of the essay, we use a self-attention mechanism to compute the similarity between sentences:

$$\begin{aligned} score(S_i,S_j)=S_i^{\mathrm {T}}W_aS_j \end{aligned}$$
(2)

where \(S_i\) and \(S_j\) are sentences from the essay \(\{S_1,S_2,S_3,...,S_n\}\), \(W_a\) is a weight matrix to be learned, and the score function \(score(S_i,S_j)\) measures how similar the two sentences are.

$$\begin{aligned} \alpha _{ij}=\frac{exp(score(S_i,S_j))}{\sum _{k=1}^{n}exp(score(S_i,S_k))} \end{aligned}$$
(3)

where \(\alpha _{ij}\) is the normalized attention weight between \(S_i\) and \(S_j\).

$$\begin{aligned} S_{i}^{coh}=\sum _{j=1}^{n}\alpha _{ij}S_j \end{aligned}$$
(4)

Finally, we take the weighted sum of the sentence representations as the coherence feature \(S_{i}^{coh}\).
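Eqs. (2)–(4) amount to bilinear attention over sentence vectors. A minimal PyTorch sketch follows; the module name and dimensions are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class BilinearAttention(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.W_a = nn.Parameter(torch.empty(dim, dim))   # learnable W_a
        nn.init.xavier_uniform_(self.W_a)

    def forward(self, queries, keys):
        # Eq. (2): score(S_i, S_j) = S_i^T W_a S_j, for all pairs -> (n_q, n_k)
        scores = queries @ self.W_a @ keys.t()
        alpha = torch.softmax(scores, dim=-1)            # Eq. (3)
        return alpha @ keys                              # Eq. (4): weighted sum

attn = BilinearAttention()
S = torch.randn(12, 768)     # 12 sentence vectors of an essay (hypothetical)
S_coh = attn(S, S)           # one coherence feature per sentence
```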

Relevance Modeling. It is observed that high-scoring essays always stick to the topic. To model prompt relevance, we compute the similarity between the essay and its assigned prompt. The process is almost the same as coherence modeling, except that the similarity is computed between sentences of the essay and sentences of the prompt. The resulting relevance representation is \(S_{i}^{rel}\).
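Continuing the sketch above, the relevance features can be obtained by attending from essay sentences to prompt sentences (the prompt tensor here is hypothetical):

```python
P = torch.randn(4, 768)      # 4 sentence vectors of the prompt (hypothetical)
S_rel = attn(S, P)           # one relevance feature per essay sentence
```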

3.2 Example-Based Learning

High-scoring essays usually share some consistent features. Therefore, we design a dynamic semantic matching block that captures these hidden features from high-scoring essays as auxiliary information for holistic scoring.

Example Selection. To select typical examples, we use the k-means algorithm. We first pick out the full-mark essays and encode their sentences with BERT. We then take the averaged sentence vector of each essay as the input to k-means. Finally, we select the essays closest to the cluster centers as examples.
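A possible implementation of this selection step with scikit-learn is sketched below; the number of examples q is a hyperparameter the text leaves unspecified.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_examples(essay_vectors, q=5):
    """essay_vectors: (n_essays, dim) averaged BERT sentence vectors of the
    full-mark essays. Returns the indices of the q example essays."""
    km = KMeans(n_clusters=q, n_init=10, random_state=0).fit(essay_vectors)
    examples = []
    for center in km.cluster_centers_:
        dists = np.linalg.norm(essay_vectors - center, axis=1)
        examples.append(int(np.argmin(dists)))   # essay closest to the centroid
    return examples
```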

Dynamic Semantic Matching. According to psychological research, it is hard for people to pay close attention to many things at the same time [15]. While understanding a text deeply, our focus dynamically shifts among sentences. We therefore design the dynamic semantic matching block to focus on the significant sentences while taking the information learned at each step into account. To get the document representation of the essay, we utilize an attention mechanism to integrate the sentences:

$$\begin{aligned} T_i=V_c^{T}tanh(W_cS_i+b) \end{aligned}$$
(5)
$$\begin{aligned} \gamma _i=\frac{exp(T_i)}{\sum _{k=1}^{n}exp(T_k)} \end{aligned}$$
(6)

where \(\gamma _i\) is the attention weight, and \(V_c\), \(W_c\), and b are parameters to be trained. The document representation \(h_e\) is the weighted sum of the sentence vectors \(S_i\):

$$\begin{aligned} h_{e}=\sum _{i=1}^{n}\gamma _iS_i \end{aligned}$$
(7)
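Eqs. (5)–(7) are additive attention pooling; a minimal PyTorch sketch follows, with the layer dimensions as assumptions.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.W_c = nn.Linear(dim, dim)              # W_c S_i + b
        self.V_c = nn.Linear(dim, 1, bias=False)    # projection to a scalar

    def forward(self, S):
        T = self.V_c(torch.tanh(self.W_c(S)))       # Eq. (5): (n, 1)
        gamma = torch.softmax(T, dim=0)             # Eq. (6)
        return (gamma * S).sum(dim=0)               # Eq. (7): (dim,)

pool = AttentionPooling()
h_e = pool(torch.randn(12, 768))   # document vector of a (hypothetical) essay
```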

The same procedure is applied to the example essay to obtain its document representation \(h_s\). The inputs of the dynamic semantic matching block are the sentence vectors of the input essay \(T_e=\{S_1,S_2,S_3,...,S_n\}\) and of the example essay \(\{S_1^{'},S_2^{'},S_3^{'},...,S_m^{'}\}\). At each step, an important sentence is chosen as the current input of an LSTM using an attention mechanism. The choosing function \(F_c(T_e,\hat{h_{t-1}},h_s)\) is formulated as follows:

$$\begin{aligned} Z_i=V_d^{T}tanh(W_dS_i+U_d\hat{h_{t-1}}+M_dh_s) \end{aligned}$$
(8)
$$\begin{aligned} \delta _i=\frac{exp(Z_i)}{\sum _{k=1}^{n}exp(Z_k)} \end{aligned}$$
(9)
$$\begin{aligned} \hat{a_{t}}=\sum _{i=1}^{n}\delta _i S_i \end{aligned}$$
(10)

where \(V_d\), \(W_d\), \(U_d\), and \(M_d\) are parameters to be trained, \(h_s\) is the document representation of the example essay, and \(\hat{h_{t-1}}\) is the hidden state of the LSTM at the previous step:

$$\begin{aligned} \hat{h_t}=\text{ LSTM }(\hat{a_t},\hat{h_{t-1}}) \end{aligned}$$
(11)

From the LSTM we obtain the final output \(\hat{h_e}\), which encodes the comparison of the essay with the example. Comparing the example with the essay in the same way yields \(\hat{h_s}\). We then feed them into a multi-layer perceptron (MLP) to calculate the relation feature R:

$$\begin{aligned} R=MLP(\hat{h_e},\hat{h_s},\hat{h_e}\odot \hat{h_s},\hat{h_e}-\hat{h_s}) \end{aligned}$$
(12)

where \(\odot \) denotes element-wise product. We repeat this process for each of the example essays and average the results to obtain the matching feature \(\hat{H}\):

$$\begin{aligned} \hat{H}=\frac{1}{q}\sum _{i=1}^{q}R_i \end{aligned}$$
(13)

where q is the number of example essays.
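Putting Eqs. (8)–(13) together, the block below is a hedged PyTorch sketch of dynamic semantic matching. It reuses AttentionPooling from the sketch above; the number of reading steps and the MLP sizes are our assumptions, as the paper does not report them.

```python
import torch
import torch.nn as nn

class DynamicMatcher(nn.Module):
    def __init__(self, dim=768, steps=8):
        super().__init__()
        self.steps = steps
        self.W_d = nn.Linear(dim, dim, bias=False)
        self.U_d = nn.Linear(dim, dim, bias=False)
        self.M_d = nn.Linear(dim, dim, bias=False)
        self.V_d = nn.Linear(dim, 1, bias=False)
        self.cell = nn.LSTMCell(dim, dim)

    def read(self, S, h_s):
        """Read sentence matrix S (n, dim) guided by the other text's
        document vector h_s (dim,); returns the last LSTM state."""
        h = S.new_zeros(1, S.size(1))
        c = S.new_zeros(1, S.size(1))
        for _ in range(self.steps):
            # Eq. (8): Z_i = V_d^T tanh(W_d S_i + U_d h_{t-1} + M_d h_s)
            Z = self.V_d(torch.tanh(self.W_d(S) + self.U_d(h) + self.M_d(h_s)))
            delta = torch.softmax(Z, dim=0)             # Eq. (9)
            a = (delta * S).sum(dim=0, keepdim=True)    # Eq. (10): soft focus
            h, c = self.cell(a, (h, c))                 # Eq. (11)
        return h.squeeze(0)

matcher = DynamicMatcher()
mlp = nn.Sequential(nn.Linear(4 * 768, 256), nn.ReLU(), nn.Linear(256, 64))
pool = AttentionPooling()        # from the pooling sketch above

def relation_features(S_essay, S_example):
    h_e_doc, h_s_doc = pool(S_essay), pool(S_example)   # Eq. (7) for both texts
    h_e = matcher.read(S_essay, h_s_doc)                # essay read vs. example
    h_s = matcher.read(S_example, h_e_doc)              # example read vs. essay
    # Eq. (12): R = MLP([h_e; h_s; h_e * h_s; h_e - h_s])
    return mlp(torch.cat([h_e, h_s, h_e * h_s, h_e - h_s]))

# Eq. (13): average R over the q example essays, e.g.
# H_hat = torch.stack([relation_features(S, ex) for ex in examples]).mean(dim=0)
```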

3.3 Scoring

After obtaining the coherence feature \(S^{coh}\) and the relevance feature \(S^{rel}\) for each sentence, we concatenate them and feed the sequence into a BI-LSTM to model the document. All hidden states are then fed into a mean-over-time layer. The functions are defined as follows, where n denotes the number of sentences in an essay and \(h_t\) is the hidden state of the BI-LSTM at time t.

$$\begin{aligned} h_t=\text{ BI-LSTM } (h_{t-1},[S_t^{coh};S_t^{rel}]) \end{aligned}$$
(14)
$$\begin{aligned} H=\frac{1}{n}\sum _{t=1}^{n}h_t \end{aligned}$$
(15)

Finally, we use the sigmoid function to compute the final score.

$$\begin{aligned} y=\sigma (W_y[H;\hat{H}]+b_y) \end{aligned}$$
(16)

where \(W_y\) and \(b_y\) are the weight matrix and bias, H is the semantic representation of the essay, and \(\hat{H}\) is the semantic matching feature.
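A minimal sketch of this scoring head (Eqs. (14)–(16)) is given below; the hidden sizes are assumptions, as the paper does not report them.

```python
import torch
import torch.nn as nn

class Scorer(nn.Module):
    def __init__(self, dim=768, hidden=128, match_dim=64):
        super().__init__()
        self.bilstm = nn.LSTM(2 * dim, hidden, bidirectional=True,
                              batch_first=True)
        self.out = nn.Linear(2 * hidden + match_dim, 1)

    def forward(self, S_coh, S_rel, H_hat):
        x = torch.cat([S_coh, S_rel], dim=-1).unsqueeze(0)  # (1, n, 2*dim)
        h, _ = self.bilstm(x)                               # Eq. (14)
        H = h.mean(dim=1).squeeze(0)                        # Eq. (15): mean over time
        # Eq. (16): sigmoid over [H; H_hat] -> score in (0, 1)
        return torch.sigmoid(self.out(torch.cat([H, H_hat])))

scorer = Scorer()
y = scorer(torch.randn(12, 768), torch.randn(12, 768), torch.randn(64))
```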

As the loss function, we use the mean squared error (MSE) [6], i.e., the average squared difference between the predicted scores and the gold ones:

$$\begin{aligned} mse(y,y^*)=\frac{1}{N}\sum _{i=1}^{N}(y_i-y_i^*)^2 \end{aligned}$$
(17)

where \(y_i\) is the predicted score and \(y_i^*\) is the gold score.
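For illustration (assuming, as is common practice on the ASAP dataset, that gold scores are normalized to [0, 1] to match the sigmoid output; the values below are made up):

```python
import torch
import torch.nn as nn

pred = torch.tensor([0.72, 0.55, 0.90])   # predicted scores y
gold = torch.tensor([0.80, 0.50, 1.00])   # normalized gold scores y*
loss = nn.MSELoss()(pred, gold)           # Eq. (17)
```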

4 Experiments

In this section, we introduce the dataset, the evaluation metric, and the experimental results.

4.1 Dataset

We use the ASAP (Automated Student Assessment Prize) dataset, as it has been widely used to evaluate the performance of AES systems. It contains 12,976 essays written by students from Grade 7 to Grade 10, covering 8 prompts of different genres, and each essay was scored by 2 human graders.

4.2 Evaluation Metric

Quadratic Weighted Kappa (QWK) is the official evaluation metric of the ASAP competition; it measures the agreement between ratings assigned by humans and ratings predicted by AES systems. Since we evaluate on the ASAP dataset, we adopt QWK as our evaluation metric.
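QWK can be computed, for example, as scikit-learn's quadratically weighted Cohen's kappa; the ratings below are illustrative only.

```python
from sklearn.metrics import cohen_kappa_score

human = [8, 9, 7, 10, 6]     # human ratings
model = [8, 8, 7, 10, 7]     # system predictions
qwk = cohen_kappa_score(human, model, weights="quadratic")
```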

Table 1. Comparison with state-of-the-art methods on the ASAP dataset

4.3 Experimental Results

We evaluate AES-SE and the baselines on the ASAP dataset. Table 1 reports the QWK scores on the eight prompts, with the best results in bold. The baselines include RNN, GRU, LSTM, CNN, EASE, SKIPFLOW LSTM, and HISK+BOSWE+\(\nu \)-SVR, which previously achieved state-of-the-art performance on the ASAP dataset. AES-SE outperforms HISK+BOSWE+\(\nu \)-SVR [3] on five of the eight prompts and also on the average: averaged over the eight prompts, AES-SE reaches a QWK of 0.788, which is 0.3% higher than HISK+BOSWE+\(\nu \)-SVR [3]. Thus, AES-SE sets a new state of the art on five of the eight prompts and on the averaged QWK score.

5 Conclusion

In this paper, we present a hierarchical model, AES-SE, with an auxiliary task for automated essay scoring. We use BERT to encode sentences, capturing lexical-semantic relations among words. We jointly consider coherence and relevance features to evaluate cohesion and task achievement. Moreover, with the dynamic semantic matching block, the similarity of an essay to high-scoring essays is computed as auxiliary information for scoring. Finally, we concatenate all the extracted features and compute the final score. Experimental results show that our model outperforms the previous state-of-the-art method, improving the average QWK score by 0.3%, and achieves a significant 11.7% improvement over feature-engineering baselines. In future work, we will explore domain adaptation in our model.