
1 Introduction

In recent years, introducing weak supervision into text classification has become increasingly popular in research because labels are often lacking in practice, resulting in many robust techniques and tools for text classification with weak labels [1, 2]. Researchers have already addressed many educational problems using powerful machine learning models, such as knowledge diagnosis [3, 5], learning performance prediction [4, 6], and learning path optimization recommendation [7]. However, the association between Chinese academic review comments and their subsequent grades has not been touched upon. These reviews are generally used to evaluate whether a graduate student can be awarded a master's degree and to control the quality of graduate education. This problem falls into the short-text classification category, a fundamental task in natural language processing (NLP) [8] that has been studied in many applications, including automated customer relationship management [9], paragraph summarization [10], and question answering systems [11].

In general, commonly used text classification methods comprise two main tasks, i.e., feature engineering and classification. Feature engineering covers the primary steps for obtaining word vectors, including data cleaning, removal of unnecessary characters and words, and vectorization with TF-IDF (Term Frequency-Inverse Document Frequency), Word2Vec, or GloVe (Global Vectors for Word Representation). Afterwards, classification techniques such as support vector machines [12] and XGBoost [13] are applied.

The nature of human language often challenges traditional text representations with high sparsity. As a result, most existing models end up with simple features while still incurring high computation costs. Moreover, the ambiguity of human language complicates feature extraction from texts, and it is difficult to acquire sufficient labeled text data. To resolve these issues, many state-of-the-art deep learning NLP techniques, e.g., BERT [14], ERNIE [15], and GPT [16], have been exploited in weakly supervised settings to achieve near-supervised accuracy in text classification. Yosi et al. proposed an end-to-end document retrieval system using a weakly supervised deep model with BERT and GPT-2 [17]; Zihan et al. proposed a weakly supervised text classification method based on keyword graphs [18].

While weakly supervised text classification has been widely researched, it has been overlooked for Chinese educational texts. In this study, we propose a Chinese text classification workflow to learn the relation between expert-provided comments and their annotated grades for academic articles. More specifically, we introduce a Chinese short-text classifier by extending a popular seed-word-based model [23] into a systematic workflow. In the proposed method, we first convert texts into vectors using Chinese preprocessing, then exploit bidirectional encoder representations from Transformers [19] to integrate contextualization features, followed by a hierarchical attention network that performs the text classification. Our method can address the existing conflict between expert-given comments and their grades and provides better classification accuracy. Altogether, automatic grading of expert-given reviews would considerably streamline the educational evaluation process.

The remainder of our paper is organized as follows: (1) Sect. 2 introduces the data and the proposed workflow; (2) Sect. 3 provides and analyzes the experimental results; and (3) Sect. 4 concludes our study and discusses future work.

In this study, we provide a thorough comparison to demonstrate the superiority of our method over traditionally used linear models such as SVM and NB, and nonlinear methods such as CNN.

2 Methodology

2.1 Dataset Collection

We collected academic dissertation reviews from over 140 Chinese universities for this study. The data includes 4,310 reviews spanning the years 2014–2020.

Table 1. The used Chinese dissertation review dataset

These reviews contain basic information about the students, the academic graduation thesis title, evaluation reviews and revision comments, expert-provided grades, etc. Usually, the evaluation reviews and the revision comments lead to the expert-provided grades, which are divided into five categories, i.e., Excellent, Good, General, Bad, and Poor. We use the English translations of the assigned grades instead of the Chinese originals, as shown in Fig. 1.

Table 2. The class distribution in our dataset

With regard to our approach, we removed the non-graded reviews, leaving a total of 4,296 reviews for our study.

We had a fair amount of data for this study, but the data distribution among the classes was highly imbalanced, as can be observed in Fig. 1. Table 1 summarizes the dataset's statistics, and Table 2 summarizes the data available for each class in our final dataset.

Fig. 1. Data distribution in the review category.

We anticipated that this data was noisy because the expert-assigned grades fluctuate according to individual perspectives. This assumption was also validated by the models' performance, and our previous paper demonstrated this phenomenon [20].

2.2 The Proposed Workflow

To design our workflow, we developed a data preparation system that preprocesses Chinese text data and prepares the data frame used by our selected model. Following the text preprocessing, we selected a seed-word-based contextualized weak supervision model, “ConWea” [23], which leverages contextualized representation techniques to generate a contextualized vector for each word occurrence. In our experiments, we used “Bert-base-Chinese” [24] for the Chinese text embeddings. Afterwards, the model generates pseudo labels and trains a neural text classifier on those labels and the contextualized corpus, while expanding the initial seed word list through the ranking system it employs.

We feed our collected text data into the workflow along with the initial seed words. To begin, we pass the collected Chinese review texts through the preprocessing steps to clean them up and make the data frame ready for the model [23]. Then the model-ready data frame and the seed words are provided to the model. The model disambiguates the seed words by explicitly learning different senses (meanings) of each word from contextualized word embeddings. It first performs k-means clustering over the occurrences of each word in the vocabulary to identify potentially different senses, then eliminates the ambiguous keyword senses, producing a fully contextualized corpus. The contextualized corpus and the generated pseudo labels are used to train the hierarchical attention network classifier; a ranking system then extends the initially given seed words, and these expanded seed words are used iteratively alongside the contextualized corpus.
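
The sense disambiguation step can be illustrated with the minimal sketch below. It is our own simplified illustration rather than ConWea's exact implementation: it assumes per-occurrence BERT vectors have already been extracted (see the BERT example in Sect. 2.2) and uses an illustrative cluster count of two senses per word.

# Minimal sketch of sense disambiguation via clustering of contextual embeddings.
# occurrence_vectors: (n_occurrences, hidden_dim) array of per-occurrence BERT vectors.
from sklearn.cluster import KMeans

def disambiguate_word(occurrence_vectors, n_senses=2):
    """Cluster all contextual vectors of one word into candidate senses."""
    km = KMeans(n_clusters=n_senses, n_init=10, random_state=0)
    sense_ids = km.fit_predict(occurrence_vectors)
    return sense_ids, km.cluster_centers_

# Occurrences falling into different clusters can then be rewritten as distinct
# tokens (e.g. "word$0", "word$1") to build the contextualized corpus.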

Figure 4 illustrates the whole method and shows the data flow through our proposed workflow.

Data Preparation.

To begin, we addressed the word segmentation issue of our collected text data, as Chinese text lacks explicit word-to-word separation. We used “jieba” [21], a natural language toolkit that works well with Chinese texts, to obtain proper word segmentation.

Figure 2 shows the unsegmented data and Fig. 3 shows the properly segmented Chinese texts in our dataset.

Fig. 2. Collected unsegmented text data

Our data cleansing process entails removing unnecessary characters and noise-introducing words, as well as adding certain words to the “jieba” dictionary.

Fig. 3. Word-segmented Chinese review comments

We opted for “jieba” for tokenization and vocabulary generation; our dataset yielded a vocabulary of 9,927 words. For the sentence segmentation task, we used “harvesttext” [22], another well-performing toolkit for natural language tasks on Chinese texts.

Fig. 4. The proposed workflow of Chinese auto-grading model training

We used a Chinese stop word list from a GitHub repository [31] and refined it based on our corpus.
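
A minimal sketch of this preparation stage is given below. The custom dictionary and stop-word file names are placeholders for our corpus-specific resources, and the commented harvesttext call assumes its sentence-splitting helper.

# Sketch of the Chinese preprocessing step: custom dictionary entries,
# word segmentation with jieba, and stop-word filtering.
# "custom_words.txt" and "stopwords.txt" are placeholder file names.
import jieba

for w in open("custom_words.txt", encoding="utf-8"):
    jieba.add_word(w.strip())          # domain terms jieba would otherwise split

stopwords = {w.strip() for w in open("stopwords.txt", encoding="utf-8")}

def segment(review):
    tokens = jieba.lcut(review)        # word segmentation
    return [t for t in tokens if t.strip() and t not in stopwords]

# Sentence segmentation (needed for the document-sentence-word hierarchy) can be
# done with harvesttext, assuming its cut_sentences helper:
#   from harvesttext import HarvestText
#   sentences = HarvestText().cut_sentences(review)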

Chinese Corpus Contextualization.

Following data preparation, the corpus and the seed words are provided to the selected model. We utilize the seed-word-based contextualized weak supervision method “ConWea” [23] in the proposed workflow, which leverages contextualized representation techniques to generate a contextualized vector for each word occurrence. We fed the model with the corpus and the initial seed words; in our experiments, we used “Bert-base-Chinese” to obtain the text embeddings. The model then groups the occurrences of the same word into an adaptive number of interpretations based on the label-indicative seed words, resulting in a contextualized corpus.

BERT for Chinese Text.

BERT stands for bidirectional encoder representations from Transformers, a transformer-based machine learning technique pre-trained by Google [14]. It was created to help computers understand the meaning of ambiguous human language in text by utilizing the surrounding text, thereby comprehending context. BERT's key technical feature is applying the bidirectional training of the Transformer, a popular attention model, to language modelling. This is in contrast to previous efforts, which looked at a text sequence either from left to right or as a combination of separate left-to-right and right-to-left training. Leveraging this capability, it is pre-trained on Masked Language Modeling [26] and Next Sentence Prediction. We use the version “Bert-base-Chinese” [24] in our experiments, as we are working with Chinese text, to get a contextualized vector representation of every word occurrence.
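
One way to obtain such per-occurrence vectors with the Hugging Face transformers library is sketched below; this is our own illustrative code under that assumption, not the exact extraction routine used inside ConWea.

# Sketch: contextualized vectors for every token of a review using "bert-base-chinese".
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def contextual_vectors(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state: (1, seq_len, 768); one vector per (sub)token occurrence
    return outputs.last_hidden_state.squeeze(0)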

Classifier Training and Expanding Seed Words.

Leveraging the provided seed words for each label, the model generates pseudo labels for the contextualized Chinese corpus and trains a neural classifier on those labels. We used the class names as the initial seed words.

In our experiments, we used a hierarchical attention network [25], which considers the hierarchical structure of the text data, i.e., document-sentence-word, and integrates an attention mechanism that finds the most important words and sentences in a document considering the context. The architecture is shown in Fig. 4.

There are two levels of attention: word-level attention identifies the important words in each sentence, and sentence-level attention does the same for the sentences in the document. The contextualized corpus alongside the seed words is fed to the HAN model, and the predicted pseudo labels are used for document classification.
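
The attention pooling applied at each of the two levels can be sketched as a small PyTorch module; this is a simplified illustration under our own assumptions, whereas the actual classifier is the HAN of [25] as used inside ConWea.

# Illustrative additive attention pooling, applied over word vectors within a
# sentence and again over sentence vectors within a document.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, hidden_dim, attn_dim=100):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)
        self.context = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, h, mask=None):
        # h: (batch, seq_len, hidden_dim) -- encoder outputs over words or sentences
        u = torch.tanh(self.proj(h))             # (batch, seq_len, attn_dim)
        scores = self.context(u).squeeze(-1)     # (batch, seq_len)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        alpha = torch.softmax(scores, dim=-1)    # attention weights
        return (alpha.unsqueeze(-1) * h).sum(dim=1)  # weighted sum -> (batch, hidden_dim)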

Leveraging the contextualized Chinese corpus along with the predicted labels, the model ranks the contextualized words, and the top words for each class are added to that class's seed word list. To select the top words, the model considers how label-indicative each word is, its frequency in the documents of the class, and its unusualness. The seed word expansion and document classification happen iteratively inside “ConWea” [23], and the number of iterations is controlled by a tunable hyperparameter T. We set it to 7 because the method converges by then, so the expanded seed sets and the classification performance become stable.
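
To make the ranking concrete, the sketch below combines the three signals into one illustrative score; the exact formula used by ConWea [23] differs, so this only conveys the idea.

# Illustrative ranking of candidate seed words for one class.
# label_indicativeness: fraction of documents containing the word that are
#   currently predicted as this class; freq_in_class: occurrences of the word
#   in documents of this class; doc_freq: documents containing the word.
import math

def seed_score(label_indicativeness, freq_in_class, doc_freq, n_docs):
    rarity = math.log(n_docs / (1 + doc_freq))          # "unusualness" of the word
    return label_indicativeness * math.log(1 + freq_in_class) * rarity

# Words are scored per class and the top-ranked ones are appended to that class's
# seed list before the next of the T iterations.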

2.3 Model Selection

We applied three commonly used text classifiers to this dataset. We implemented Linear Support Vector Machines (LSVM) and Multinomial Naive Bayes (MNB) as linear classifiers, and additionally used a nonlinear model based on a classical Convolutional Neural Network (CNN). We evaluated these models' performance on our dataset and compared the results with those of our proposed workflow. We used ReLU [27] as the activation function in our classical CNN design and ADAM [28] as the optimizer. For LSVM and MNB, stochastic gradient descent (SGD) training was used, and appropriate parameter tuning was performed.
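
A minimal sketch of the linear baselines with scikit-learn is given below; the hyperparameter values are placeholders rather than the exact settings we tuned, and segment() refers to the preprocessing sketch in Sect. 2.2.

# Sketch of the linear baselines: TF-IDF features with an SGD-trained linear SVM
# and a Multinomial Naive Bayes model.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

lsvm = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=segment)),
    ("clf", SGDClassifier(loss="hinge", alpha=1e-4)),   # hinge loss = linear SVM
])

mnb = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=segment)),
    ("clf", MultinomialNB(alpha=0.5)),
])

# lsvm.fit(train_texts, train_labels); mnb.fit(train_texts, train_labels)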

2.4 Performance Evaluation Metrics

We implemented several validation metrics to observe the models' performance in depth. We used the accuracy score, macro-precision, macro-recall, macro-F1 score, weighted-precision, weighted-recall, and weighted-F1 score to properly examine the ability of each classifier. The accuracy score is the overall fraction of correctly classified samples.

Our classification task is a multi-class classification task; therefore, we calculate the True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN) per class. As we have five classes, we denote them by \(x = a, b, c, d, e\). The precision \(P_x\) and recall \(R_x\) per class are calculated by

$$P_x = \frac{TP_x}{TP_x\,+\,FP_x}$$
(1)
$$R_x = \frac{TP_x}{TP_x\,+\,FN_x}$$
(2)

For macro-precision (MP), macro-recall (MR), and macro-F1 score (MF1), we use the equations below:

$$MP = \frac{1}{5}\sum\nolimits_{x\,=\,a}^{e} P_x, \quad MR = \frac{1}{5}\sum\nolimits_{x\,=\,a}^{e} R_x, \quad MF1 = \frac{2\,MP\,MR}{MP\,+\,MR}$$
(3)

As our dataset is highly imbalanced, to get a proper evaluation we also compute the weighted-precision (WP), weighted-recall (WR), and weighted-F1 score (WF1) as

$$WP = \frac{\sum\nolimits_{x\,=\,a}^{e} P_x N_x}{\sum\nolimits_{x\,=\,a}^{e} N_x}, \quad WR = \frac{\sum\nolimits_{x\,=\,a}^{e} R_x N_x}{\sum\nolimits_{x\,=\,a}^{e} N_x}, \quad WF1 = \frac{2\,WP\,WR}{WP\,+\,WR}$$
(4)

where \(N_x\) indicates the total number of data samples in the \(x\)-th class.
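
These metrics can be computed directly with scikit-learn, as sketched below for predicted labels y_pred against the expert-annotated grades y_true.

# Sketch: computing the evaluation metrics defined above with scikit-learn.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    macro = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
    weighted = precision_recall_fscore_support(y_true, y_pred, average="weighted", zero_division=0)
    return {
        "accuracy": acc,
        "macro_precision": macro[0], "macro_recall": macro[1], "macro_f1": macro[2],
        "weighted_precision": weighted[0], "weighted_recall": weighted[1], "weighted_f1": weighted[2],
    }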

3 Result Evaluation

3.1 Overall Results

Table 3 shows the classification accuracy of all the methods evaluated on our dataset. As can be seen, by leveraging contextualization our proposed workflow yields about 20% better accuracy than the Linear Support Vector Machine classifier, 25% better accuracy than the Multinomial Naive Bayes classifier, and 13% better accuracy than the nonlinear CNN classifier, demonstrating its superiority over the widely used traditional methods. The observations remain consistent across all classification metrics.

Table 3. The prediction accuracy on our data set by using all mentioned methods

Figure 5 illustrates the classification prediction accuracy of all the methods in each category of our dataset. We can observe the uniform superiority of our proposed workflow over every other method. All of the approaches had difficulty distinguishing between the classes “Bad” and “Poor”; the reason lies in the existing conflicts between expert-given comments and their subsequent grades.

Fig. 5. The classification accuracy of MNB, CNN, our method, and LSVM

Table 4 illustrates the arbitrariness with which experts assign these two grades, which raises a conflict between the comments and the annotated grades and at times causes the workflow to misclassify. Still, the workflow handles this issue better than the other methods. The substantial superiority of our approach is also demonstrated on highly imbalanced datasets such as ours, which has very little data in the classes “Poor” (128) and “Bad” (252) compared to the other classes.

Table 4. Example of arbitrariness in the expert annotated grades

Nonetheless, the classification accuracy for these classes is comparable to that of the classes with more data, which the other methods failed to accomplish.

3.2 Confusion Matrix

Figure 6 shows the confusion matrices of the methods used in this experiment, i.e., LSVM, CNN, MNB, and our proposed workflow, for the classification task.

All of the approaches had difficulty distinguishing between the classes “Bad” and “Poor”, which again stems from the existing conflicts between expert-given comments and their subsequent grades. However, our proposed workflow shows significant superiority in addressing this issue.

Fig. 6. Confusion matrices of all four methods used

4 Conclusion and Discussion

This study proposes a deep learning-based workflow to explore the relation between expert-given comments and their subsequent grades in Chinese graduation thesis evaluation. The proposed workflow introduces an effective Chinese text classifier that leverages Chinese text preprocessing and one of the popular seed-word-based weakly supervised methods [23] to create a contextualized corpus, followed by a hierarchical attention network.

We used 4,296 evaluation reviews in our dataset, collected from over 140 higher educational institutions in China. Our proposed workflow achieved 93% accuracy, an improvement of 13% over the classical CNN, 20% over LSVM, and an astounding 25% over the NB-based method. In addition, our method performed uniformly well in classifying each grading class across the various metrics applied.

In the future, we will work on addressing the problem of noisy and imbalanced labels that currently exists in educational settings and develop scientific solutions to mitigate it. Further data analysis studies will also be developed, including visualization [29], important factor discovery [30], linear dimensionality reduction of data [32], further work on unsupervised order-graph regularized sparse dictionary learning [33], students' knowledge diagnosis [34], prediction of undergraduate students' learning performance [35], expanded research on deep learning-based exploration of the impact of tumor-infiltrating immune cells [36], etc.