
1 Introduction

Self-explanation is defined as generating explanations to oneself, explaining concepts, procedures, and solutions in order to deepen understanding of the material [1]. Its learning benefits have long been recognized [2]. The iSTART system represents the leading line of research on evaluating self-explanations; it guides learners through exercises to support active reading and thinking [3].

In mathematics, a quiz is solved by following a procedure step by step. We therefore proposed a method that checks whether students can describe each step in their self-explanation by comparing the similarity between a human-created sample answer and the students’ self-explanations [4]. If the information and terms required for the unit were included, the student was judged to have the relevant knowledge; if not, the student was judged to be missing some knowledge components. We defined a “Rubric” as a set of can-do descriptors that explicitly lists all the essential knowledge components of a quiz, and a “Sample Answer” as a model self-explanation containing those knowledge components, prepared according to the rubric step numbers (Table 1). In this study, we propose a model that automatically generates sample answers in place of human-created ones. Our contributions have a wide range of implications, such as scoring self-explanations and generating self-explanation scaffolding templates based on the sample sentences.

Table 1. Rubrics and a sample answer of self-explanation in a quiz.

2 Data Collection and Model Architecture

We collected the data from January 1, 2020, to December 31, 2021, using the LEAF platform [5], which consists of a digital reading system named BookRoll and a learning analytics tool named LAViEW (Fig. 1). For this experiment, we chose quizzes with at least five answers; the number of quizzes was 25, and the total number of answers was 1434. Figure 2 illustrates the proposed model, which consists of (i) a vectorizing component, (ii) a clustering component, and (iii) an extracting component. As the vectorizing component, we adopted Sentence-BERT with a Japanese pre-trained BERT model to represent the sentences [6, 7]. As the clustering component, we employed K-means, an unsupervised learning model. We generate meaning-intensive clusters through unsupervised learning in order to reproduce the solution steps in mathematics. From an educational point of view, a problem for junior high school students would contain at least two and at most six steps of unit knowledge components, so we set the number of clusters in the range of 3–5 using the elbow method. As the extracting component, the most representative sentences are extracted from each semantic cluster and sorted by weighting them with their position in the problem, which is obtained from the pen strokes. LexRank [8] was tested for extracting the most representative sentence from each cluster. The input is all the self-explanation sentences associated with a quiz, and the output is a summary containing the knowledge components for that quiz.
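As a concrete illustration, the sketch below shows how the three components could be wired together in Python. It is not the authors' implementation: the multilingual sentence-transformers model name is a stand-in for the Japanese Sentence-BERT model used in the study, the elbow rule is reduced to a simple relative-improvement threshold over k ∈ {3, 4, 5}, the LexRank step is approximated by selecting the sentence closest to each cluster centroid, and `positions` is a hypothetical per-sentence pen-stroke position used only for ordering.

```python
# Minimal sketch of the (i) vectorizing, (ii) clustering, and (iii) extracting
# components, under the assumptions stated above.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity


def pick_k_by_elbow(embeddings, k_range=(3, 4, 5)):
    """Fit K-means for each candidate k and apply a crude elbow rule."""
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0)
                .fit(embeddings).inertia_ for k in k_range]
    for i in range(1, len(k_range)):
        # Stop increasing k once the relative drop in inertia is below 10%.
        if (inertias[i - 1] - inertias[i]) / inertias[i - 1] < 0.10:
            return k_range[i - 1]
    return k_range[-1]


def generate_sample_answer(sentences, positions):
    # (i) Vectorize every self-explanation sentence of the quiz.
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = encoder.encode(sentences)

    # (ii) Cluster the sentences into 3-5 semantic clusters (solution steps).
    k = pick_k_by_elbow(embeddings)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)

    # (iii) Extract the most representative sentence of each cluster
    #       (centroid proximity here; the paper tests LexRank [8]).
    representatives = []
    for c in range(k):
        members = np.where(labels == c)[0]
        centroid = embeddings[members].mean(axis=0, keepdims=True)
        best = members[cosine_similarity(embeddings[members], centroid).ravel().argmax()]
        representatives.append(best)

    # Order the representatives by pen-stroke position (the paper weights the
    # ranking by position; simple sorting is used in this sketch).
    representatives.sort(key=lambda i: positions[i])
    return [sentences[i] for i in representatives]
```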

Fig. 1. The students input a sentence of explanation each time they think they have completed a step in their answer during the playback; the self-explanations are therefore temporally associated with the pen stroke data.

Fig. 2. Overall model architecture.

3 Experiments

Firstly, we set rubrics for each quiz for evaluation (Table 2). Secondly, two of the authors and one assistant evaluated the machine-generated self-explanations to determine whether they contained the necessary knowledge components. The Fleiss’ kappa coefficient [9] was initially 0.518; after the three evaluators discussed their differences, the final coefficient was 0.870. Table 2 shows the human evaluation results: for 72% of the quizzes, the model generated all of the knowledge components (at most five per quiz).
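For reference, inter-rater agreement of this kind can be computed with the statsmodels package as sketched below; the 0/1 ratings are hypothetical placeholders, not the study's annotations.

```python
# Hedged sketch of the agreement check, assuming statsmodels is available.
# Rows are evaluated items, columns are the three raters; a rating of 1 means
# the generated sentence was judged to contain the required knowledge component.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
])  # hypothetical placeholder data

table, _ = aggregate_raters(ratings)  # per-item counts for each rating category
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")
```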

Next, we evaluated the similarity between the human-created and machine-generated sentences using several metrics: BERTScore and BLEU [10, 11]. In addition, we conducted a Spearman correlation analysis to investigate the correlations between these summary metrics and the human evaluation. The Human Evaluation Score (HES) was assigned according to how well the machine-generated answers covered the knowledge components in the rubrics.
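The evaluation can be reproduced in outline with the bert-score, sacrebleu, and scipy packages, as in the sketch below. The sentence pairs and HES values are hypothetical English placeholders (the study's data are Japanese, for which lang="ja" would be used), and the HES is assumed here to be normalized to [0, 1] so that the RMSE against the metric scores is comparable.

```python
# Illustrative sketch of the automatic evaluation; all data below are
# hypothetical placeholders, not taken from the study.
import numpy as np
from bert_score import score as bert_score
from sacrebleu import sentence_bleu
from scipy.stats import spearmanr

machine = ["substitute x = 2 into the equation",
           "solve the resulting equation for y",
           "check the answer against the condition"]
human = ["substitute the value x = 2 into the first equation",
         "then solve for y",
         "finally verify that the solution satisfies the condition"]
hes = np.array([1.0, 0.5, 0.75])  # hypothetical HES, assumed normalized to [0, 1]

# BERTScore F1 per sentence pair (for the study's Japanese data, lang="ja").
_, _, f1 = bert_score(machine, human, lang="en")
f1 = f1.numpy()

# Sentence-level BLEU per pair, rescaled from sacrebleu's 0-100 range.
bleu = np.array([sentence_bleu(m, [h]).score / 100.0 for m, h in zip(machine, human)])

# Spearman correlation and RMSE between each metric and the HES.
for name, metric in [("BERTScore", f1), ("BLEU", bleu)]:
    rho, _ = spearmanr(metric, hes)
    rmse = float(np.sqrt(np.mean((metric - hes) ** 2)))
    print(f"{name}: Spearman rho = {rho:.2f}, RMSE = {rmse:.3f}")
```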

Table 2. Missing knowledge components of each quiz by human evaluation.
Table 3. The similarity evaluation (F1).
Table 4. RMSE and correlations between HES and metrics.

Table 3 presents the F1 scores for each similarity metric. The highest was BERTScore, with an average of 0.719. Table 4 shows the correlations and RMSE between HES and the metrics. The correlation for BERTScore was 0.48, indicating a moderate correlation. For RMSE, BERTScore had the smallest error at 0.273, while the other metrics exceeded 0.5, a substantial difference.

4 Conclusion

This study attempted to generate sample self-explanation sentences from collected data. The 1434 self-explanations collected from 25 quizzes were fed into the model, and for 72% of the quizzes it generated all of the knowledge components (at most five per quiz). The similarity between human-created and machine-generated sentences was 0.715 (BERTScore), with a significant correlation of R = 0.48. The results suggest that the proposed model can generate sample answers covering the necessary knowledge components, and that improving BERTScore accuracy correlates with extracting the essential knowledge components.