1 Introduction

In the Chinese National College Entrance Examination and the Senior High School Entrance Examination, evaluating the writing quality of essays has long been a time-consuming task whose results may lack consistency across human raters. Previous essay assessment work has focused on leveraging linguistic features of essays, such as those related to rhetoric and idioms.

Coherence is a fundamental concept in essay assessment and is particularly useful for assessing how well an essay is organized. It can be broken down into the cohesion between sentences and the fluency of transitions between paragraphs. Coherence plays a vital role in ensuring the clarity, conciseness and fluency of an essay, and is therefore crucial for improving overall writing quality.

While early work on coherence evaluation can be traced back to the entity grid model (Barzilay and Lapata 2008; Guinaudeau and Strube 2013), recent works (Nguyen and Joty 2017; Mesgar and Strube 2018; Moon et al. 2019; Farag and Yannakoudakis 2019) have focused on modeling coherence with neural networks of various architectures, such as CNNs, LSTMs and Transformers. These models achieve notable performance when a sufficient amount of labeled data is provided. As emphasized above, manually labeling the coherence of essays relies on expert knowledge and requires a significant amount of time and cost. Hence, modeling coherence in a low-resource setting is crucial in many real-world scenarios and applications. However, most previous coherence models assume that sufficient labeled data is available, and the low-resource setting remains less explored.

In this paper, we present an approach that pretrains RoBERTa with whole word masking (WWM) on Chinese middle school student essays collected from an external source. WWM is performed in an unsupervised manner, adding little cost to the original low-resource setting, and is effective in capturing the general characteristics of middle school essays. Subsequently, the pretrained RoBERTa is finetuned on a small set of training data for coherence evaluation.

Although we pretrain RoBERTa, our method is easy to employ and applicable to most transformer-based language models. Experimental results on Chinese essays written by middle school students, provided by NLPCC2023 Shared Task 7, demonstrate that this simple strategy achieves fair performance. We also report the performance of several variants of our method, including pseudo labeling and adding an additional neural network on top of RoBERTa, which provides insights into modifications that are likely to cause performance drops.

The contributions of this work are as follows:

  • We collected and curated a substantial amount of middle school student essay data relevant to the task.

  • We propose a simple yet effective pretraining method that comes with little additional cost under a low-resource setting.

  • We carry out experiments on several methods beyond pretraining to provide future work with evidence on effective approaches for coherence evaluation.

2 Related Work

Several theories have been proposed to characterize coherence (Mann and Thompson 1988; Grosz et al. 1995; Asher and Lascarides 2003). Inspired by Centering Theory (Grosz et al. 1995), early coherence evaluation models (Barzilay and Lapata 2008; Guinaudeau and Strube 2013) were proposed to distinguish coherent from incoherent texts with the entity grid model.

Later works have designed neural network architectures for coherence modeling: the Neural Local Coherence Model (Nguyen and Joty 2017), which uses a CNN to capture local coherence features in an essay; LSTM variants for modeling potentially longer coherence relationships (Moon et al. 2019; Farag and Yannakoudakis 2019; Mesgar and Strube 2018); and multi-task learning that jointly trains a Bi-LSTM to score coherence and predict the type of grammatical role (GR) between a dependent and its head. As transformer-based models became widespread across NLP tasks, recent works have adopted transformer-based architectures for coherence evaluation. For instance, Jeon and Strube (2022) proposed an entity-based neural local coherence model which encodes an essay with XLNet.

Coherence evaluation can also be incorporated into other tasks to boost their performance. For instance, a model for automated essay scoring (AES) can use coherence evaluation as one of its components for assessing the organization score of an essay (He et al. 2022; Song et al. 2020), which greatly improves the effectiveness of essay scoring.

3 Method

When pretraining a language model, general corpora such as Chinese Wikipedia are typically used to capture linguistic knowledge that is universal across NLP tasks. However, essays written by middle school students may differ substantially from the corpora on which the language model is pretrained. In particular, grammatical and logical errors are frequently found in such essays, which creates a gap between the language model and the downstream coherence evaluation task for middle school students' essays.

To this end, we pretrain RoBERTa on middle school student essays with the whole word masking (WWM) strategy so that RoBERTa gains a better understanding of the general content and structure of the essays. Pretraining with WWM is performed in an unsupervised way, so it can be easily adopted in our setting where few labeled examples are available. We choose whole word masking because it outperforms individual character masking on various Chinese NLP tasks (Cui et al. 2021).

Whole Word Masking (WWM) primarily changes how training data are generated during the pretraining phase. With the original WordPiece tokenization, a complete word may be split into several subwords, and during the generation of training samples these subwords are masked independently at random. With WWM, if any WordPiece subword of a complete word is selected for masking, all other subwords belonging to the same word are masked as well, so the whole word is masked.

It is important to note that the term “mask” here refers to several possible actions, such as replacing the token with [MASK], keeping the original token, or replacing it with a random word; it is not limited to the case where a word is replaced with the [MASK] label.
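
For concreteness, the following minimal sketch illustrates WWM on a segmented Chinese sentence. It assumes word boundaries are supplied by an external Chinese word segmenter (e.g., LTP); the function name whole_word_mask, the toy vocabulary and the 80/10/10 replacement scheme follow the standard BERT recipe rather than our exact pretraining pipeline.

```python
# A minimal sketch of whole word masking (WWM), assuming word boundaries come
# from an external Chinese word segmenter; not the exact pipeline used in our work.
import random

MASK = "[MASK]"
VOCAB = ["他", "喜", "欢", "读", "书", "学", "校"]  # toy vocabulary for random replacement

def whole_word_mask(tokens, word_spans, mask_prob=0.15):
    """tokens: list of characters/subwords; word_spans: list of (start, end)
    index ranges, each covering all pieces of one word."""
    labels = [None] * len(tokens)            # None = not selected for prediction
    masked = list(tokens)
    for start, end in word_spans:
        if random.random() >= mask_prob:     # select whole words, not single pieces
            continue
        for i in range(start, end):
            labels[i] = tokens[i]            # the model must predict the original piece
            r = random.random()
            if r < 0.8:
                masked[i] = MASK             # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return masked, labels

# Example: "他 / 喜欢 / 读书" segmented into three words over five characters.
tokens = ["他", "喜", "欢", "读", "书"]
word_spans = [(0, 1), (1, 3), (3, 5)]
print(whole_word_mask(tokens, word_spans, mask_prob=0.5))
```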

Subsequently, we finetune the pretrained RoBERTa on the labeled dataset. Specifically, we add a linear classifier head on top of the [CLS] token representation produced by RoBERTa, and finetune the model with the standard cross entropy loss:

$$\begin{aligned} \mathcal {L}_{class} = -\sum \limits _{i=1}^n \sum \limits _{c=1}^{|C|} p(y^{(i)}_c|x^{(i)}) \log q(y^{(i)}_c|x^{(i)}) \end{aligned}$$
(1)

where \(p(y^{(i)}_c|x^{(i)})\) and \(q(y^{(i)}_c|x^{(i)})\) are the true and predicted probabilities of the c-th class for the i-th training instance, respectively.
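
A minimal finetuning sketch with the transformers library is shown below. The checkpoint path ./roberta-wwm-essay and the label mapping are illustrative assumptions; the library's built-in sequence classification head corresponds to the linear classifier over the [CLS] representation, and the returned loss is the cross entropy of Eq. (1).

```python
# A minimal finetuning sketch, assuming the pretrained checkpoint is stored under
# "./roberta-wwm-essay" and coherence labels are {0: poor, 1: moderate, 2: excellent}.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("./roberta-wwm-essay")
model = AutoModelForSequenceClassification.from_pretrained(
    "./roberta-wwm-essay", num_labels=3)     # linear head over the [CLS] representation

essays = ["这是一篇中学生作文……", "另一篇作文……"]   # toy batch
labels = torch.tensor([2, 1])

batch = tokenizer(essays, padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
outputs = model(**batch, labels=labels)       # cross-entropy loss as in Eq. (1)
outputs.loss.backward()
print(outputs.loss.item(), outputs.logits.argmax(dim=-1))
```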

4 Experiments

4.1 Dataset and Evaluation Metric

We carry out experiments on the dataset provided by NLPCC2023 Shared Task 7 Track 1: Coherence Evaluation. The dataset consists of Chinese essays written by middle school students, where the coherence of each essay is rated on a three-level scale of excellent, moderate and poor. Within the dataset, 60 essays form the training set, while another 5,000 essays serve as the test set. The data statistics are shown in Table 1.

Table 1. Statistics of Coherence Evaluation dataset

For the pretraining dataset, we crawl essay data from the website Lele Ketang. These Chinese essays are written by middle school students from grades 7 to 12. During pretraining we split the essays so that all data is utilized while keeping the text length appropriate for the language model, which yields approximately 200,000 essays. Statistics on the length of these data are provided in Table 2.
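
One possible splitting scheme is sketched below. Beyond keeping the text length appropriate for the model, the exact procedure is not constrained, so the paragraph-boundary heuristic and the 500-character limit are assumptions made for illustration.

```python
# A hedged sketch of one possible splitting scheme for long crawled essays.
def split_essay(text, max_chars=500):
    """Split an essay at paragraph boundaries into chunks of roughly max_chars
    characters; a single paragraph longer than max_chars is kept as one chunk."""
    chunks, current = [], ""
    for para in text.split("\n"):
        if current and len(current) + len(para) + 1 > max_chars:
            chunks.append(current)            # flush the current chunk
            current = ""
        current = (current + "\n" + para).strip() if current else para
    if current:
        chunks.append(current)
    return chunks
```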

The performance of our method and the baselines is evaluated with macro precision (P), recall (R), F1-score (F1) and accuracy (acc). We perform 5-fold cross validation on the training set and report the test set performance of the models with the best validation accuracies.
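
The evaluation protocol can be summarized by the following sketch, assuming scikit-learn is available; train_and_eval is a hypothetical stand-in for the finetuning routine of Sect. 4.2, and the use of stratified folds is an assumption.

```python
# A minimal sketch of the evaluation protocol; `train_and_eval` is a hypothetical
# callback that finetunes on the given fold and returns (model, validation accuracy).
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def macro_metrics(y_true, y_pred):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return {"P": p, "R": r, "F1": f1, "acc": accuracy_score(y_true, y_pred)}

def cross_validate(texts, labels, train_and_eval, n_splits=5):
    # 5-fold cross validation on the 60 labeled training essays: keep the model
    # with the best validation accuracy, then score that model on the test set.
    best_acc, best_model = -1.0, None
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in skf.split(texts, labels):
        model, val_acc = train_and_eval(train_idx, val_idx)
        if val_acc > best_acc:
            best_acc, best_model = val_acc, model
    return best_model, best_acc
```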

Table 2. Statistics of the length of the pretraining data

4.2 Implementation Details

We implement our model with PyTorch and the transformers library. For pretraining, we use Roberta-chinese-wwm-ext-large as the base pretrained model and train it for 10 epochs using the AdamW optimizer with default parameters. The batch size is set to 16 and the learning rate to 2e-5. Training is performed on two NVIDIA RTX 3090 GPUs. Next, we finetune the pretrained RoBERTa with the AdamW optimizer for 20 epochs, with the learning rate set to 1e-5 and the batch size fixed at 8. All other parameters are set to their default values.
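
The pretraining setup roughly corresponds to the following sketch, assuming the base checkpoint is hfl/chinese-roberta-wwm-ext-large from the HuggingFace hub and the crawled essays are stored as plain text. The data paths are illustrative, and for Chinese WWM the collator additionally needs word-boundary references (e.g., a chinese_ref field produced with a segmenter such as LTP).

```python
# A hedged sketch of the WWM pretraining setup with the hyperparameters reported
# above (10 epochs, AdamW, batch size 16, lr 2e-5). Paths and the data pipeline
# are assumptions; Chinese WWM also requires word-boundary references.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForWholeWordMask, Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "hfl/chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

dataset = load_dataset("text", data_files={"train": "essays.txt"})["train"]
dataset = dataset.map(lambda e: tokenizer(e["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="./roberta-wwm-essay", num_train_epochs=10,
                         per_device_train_batch_size=8,   # 2 GPUs -> effective batch 16
                         learning_rate=2e-5, save_strategy="epoch")
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
```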

4.3 Baselines

We compare our proposed method against several of its variants, which serve as baselines:

PFT: our proposed method; pretraining on task-related data with WWM, followed by finetuning.

PFT+HAN: a hierarchical attention pooling network (HAN) on top of the RoBERTa pretrained on task-related data. An attention pooling layer over RoBERTa maps each essay into a sequence of paragraph representations, and another attention pooling layer maps the paragraph representations into an essay representation for final classification. The punctuation in each paragraph is embedded as a single vector that is concatenated to the corresponding paragraph representation (a sketch is given after this list).

PFT+HAN+pseudo: pseudo labels are assigned to the unlabeled test set using the PFT model, and the original training set is augmented with the pseudo-labeled data to train HAN.
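
For reference, a hedged sketch of the PFT+HAN baseline is given below. The hidden sizes, the punctuation-encoding details and the helper names are assumptions; the sketch only illustrates the two levels of attention pooling and the concatenated punctuation vector described above.

```python
# A hedged sketch of the PFT+HAN baseline: attention pooling over token states from
# the pretrained RoBERTa yields one vector per paragraph, a second attention pooling
# layer yields the essay vector, and a punctuation embedding is concatenated to each
# paragraph vector. Dimensions and punctuation handling are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class AttentionPooling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, states):                        # states: (seq_len, dim)
        weights = torch.softmax(self.score(states), dim=0)
        return (weights * states).sum(dim=0)          # weighted sum -> (dim,)

class HANCoherenceModel(nn.Module):
    def __init__(self, encoder_path="./roberta-wwm-essay", punct_dim=32, num_labels=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_path)
        dim = self.encoder.config.hidden_size
        self.token_pool = AttentionPooling(dim)        # tokens -> paragraph vector
        self.punct_proj = nn.Linear(dim, punct_dim)    # punctuation summary vector
        self.para_pool = AttentionPooling(dim + punct_dim)  # paragraphs -> essay vector
        self.classifier = nn.Linear(dim + punct_dim, num_labels)

    def forward(self, paragraph_batches, punct_masks):
        # paragraph_batches: one tokenized batch (batch size 1) per paragraph;
        # punct_masks: one boolean mask per paragraph marking punctuation tokens.
        para_vecs = []
        for batch, punct_mask in zip(paragraph_batches, punct_masks):
            hidden = self.encoder(**batch).last_hidden_state.squeeze(0)  # (len, dim)
            para = self.token_pool(hidden)
            punct = self.punct_proj(hidden[punct_mask].mean(dim=0))      # punctuation tokens
            para_vecs.append(torch.cat([para, punct], dim=-1))
        essay = self.para_pool(torch.stack(para_vecs))
        return self.classifier(essay)
```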

Table 3. Performance comparison on the test set. Best results are in bold.

5 Results

Experimental results are shown in Table 3. We observe that, even with a small amount of labeled data, combining finetuning with task-related pretraining is an effective strategy that outperforms a random guess by a large margin, achieving an accuracy of 43.99%. Contrary to our expectation, PFT suffers a large drop in performance when an auxiliary hierarchical attention pooling network is added, reaching an accuracy of 35.15%. Further augmenting PFT+HAN with pseudo-labeled test data slightly improves the accuracy (38.78%) but worsens the precision, recall and F1 of PFT+HAN. This suggests that the knowledge learned by PFT does not transfer well to the test set, since the pseudo-labeled dataset has a harmful effect on the PFT+HAN model. In short, neither adding an auxiliary network nor adding a pseudo-labeled dataset to PFT yields further improvements over simple PFT.

6 Conclusion

Coherence is a fundamental concept in essay assessment in that it plays a vital role in ensuring the clarity, conciseness and fluency of an essay. Due to the prohibitive cost of manually assigning coherence labels to essays, developing coherence models under low-resource settings is important in various real-world scenarios. In this paper, we propose an effective approach for the Chinese coherence evaluation task. Specifically, we address the challenge of having only a small amount of labeled data by pretraining RoBERTa on a large amount of task-related data in an unsupervised manner and finetuning the pretrained model on the labeled data.

Experimental results on Chinese essays written by middle school students demonstrate that our simple approach outperforms a random guess by a large margin despite the limited amount of labeled data. In addition to its simplicity, our method is applicable to transformer-based coherence evaluation models other than RoBERTa. However, both adding an auxiliary network and adding a pseudo-labeled dataset to our original method had negative effects on performance, indicating that more investigation is needed to carefully design auxiliary networks or self-training strategies.