
1 Introduction

The integration of computer-based learning platforms (CBLPs) into classrooms, and the willingness of teachers to use them in various capacities, has enabled researchers to explore the effectiveness of CBLPs through data-driven methods. Consequently, researchers have focused on developing their CBLPs to alleviate difficult or tedious tasks that teachers face in their everyday classroom activities. Researchers, however, have faced challenges in supporting open-ended problems due to the variance in student answers. While recent advancements in NLP and machine learning have made progress toward automating the assessment of open-ended questions in various domains, the evaluation of open-ended responses remains a predominantly manual task for teachers. Writing is a critically important skill: it provides students with an avenue to exhibit their thought processes and their ability to formulate arguments and justify their work [11, 26]. In mathematics, it enables teachers to gauge whether students have a strong comprehension of mathematical concepts. Furthermore, teachers can leverage open-ended problems to identify situations where students may be able to answer closed-ended problems correctly by shallowly learning and applying procedural rules [17, 23].

The assessment of open-response problems is a largely subjective task. Students often give concise responses to open-ended math problems, which further underscores the subjective nature of grading them. While the grading of open-ended responses often relies on rubrics or other standardized procedures to streamline evaluation, teachers frequently account for contextual factors: a student's past academic performance, persistence exhibited during lessons, or other qualities may affect the grades a teacher assigns. It is important to emphasize that this does not necessarily mean that the grading is unfair; accounting for student ability can positively impact students through personalized feedback [13, 14]. However, teachers' use of contextual information in assessing student performance presents a unique challenge in automating the grading of open-ended responses and raises concerns about ensuring fairness.

Our goal in this work is to build on our prior work to explore teacher grading behavior on open-ended problems and the role student identity plays in the assigned grades, and to explore the effects of anonymized vs. non-anonymized data on the automated grading and feedback generation of open-response problems. As such, this paper aims to address the following research questions:

  1. Does using anonymized grades in NLP models mitigate possible biases introduced by student identity?

  2. What factors affect the teacher's perception of AI agents in the automated grading of open-ended math problems?

  3. How does teacher perception of AI agents influence their behavior?

2 Background

Growth and innovation in Education Technology (Ed-Tech) have influenced the adoption and regular use of CBLPs in classrooms. By easing the logging of data, the adoption of CBLPs has motivated researchers to explore the effectiveness of various design paradigms, from traditional teacher-driven designs to self-paced learning, peer learning, discussion-oriented learning, demonstration-focused learning, and flipped classrooms. Many platforms provide a selection of these features for teachers and students to leverage rather than focusing on a single one. Similarly, researchers have taken varied approaches in prioritizing the focus of their platforms. Some provide a generic platform to host content and leverage crowdsourcing to address learner needs, such as generating problems and solutions [2, 7] or collecting hints and explanations [4, 27]. Other platforms focus on specific domains, such as writing skills [3, 22], mathematics [5, 12], or programming [21], to facilitate learning by providing content that addresses the specific needs of learners. It is important to note that these two approaches are not mutually exclusive: platforms often combine designs that focus on a specific domain with crowdsourcing features that address learner needs.

The automated grading and feedback generation of open-ended problems has been particularly challenging. Researchers have explored various approaches to provide real-time feedback and assessment of open responses to support students. Similar efforts have been made to support teachers by automating the assessment of open-ended responses, including hand-crafted, boutique pattern matching [24] and deconstructing grading rubrics into knowledge components [25]. The rapid growth and innovation of NLP have provided a significant advantage in automating the assessment of open-ended student responses. Researchers have explored NLP in evaluating a diverse range of responses, from short-answer responses in mathematics [1, 9] to long-form responses such as essays [3, 16]. Neural network models such as Word2Vec [19], GloVe [20], and BERT [8] have enabled the ability to capture semantic and contextual information from responses. While deep learning models have improved NLP performance, they require a large corpus of data that is often not readily available or easy to compile.
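As a minimal sketch of how such models capture semantic information, the snippet below embeds two hypothetical student responses with a pretrained BERT model and compares them. The model name, example responses, and mean-pooling choice are illustrative assumptions rather than the setup used in this work.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative only: embed two short (hypothetical) student responses with a
# pretrained BERT model and compare them.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

responses = [
    "I multiplied both sides by 2 to isolate x.",
    "Doubling each side gets x by itself.",
]

inputs = tokenizer(responses, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings (ignoring padding) to get one vector per response.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)

# Cosine similarity reflects semantic closeness despite different wording.
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```

Such contextual embeddings can then serve as features for a downstream grade predictor, which is where the need for a sufficiently large labeled corpus arises.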

While researchers have explored the effectiveness of NLP models in the automatic grading of and feedback generation for responses, there is a need to examine the effectiveness of such automation while accounting for fairness. Most examinations of fairness revolve around the algorithm's performance [10, 15] and model generalizability across target groups to identify possible biases [6, 18]. However, post hoc analysis of models can be rather challenging: biases can only be mitigated if we are conscious of their existence beforehand or explicitly test for them across specific attributes, such as gender or ethnicity. We propose exploring the utility and effectiveness of NLP models when trained on anonymized data vs. when trained on non-anonymized data.
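The sketch below illustrates one way such a comparison could be set up: the same responses are paired with the grades assigned under each condition, a simple text classifier is trained on each label set, and held-out agreement is compared. The pipeline, variable names, and evaluation choice are assumptions for illustration, not the models proposed in this work.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def heldout_kappa(responses, grades, seed=0):
    """Train a simple grade predictor and report agreement on held-out data."""
    X_train, X_test, y_train, y_test = train_test_split(
        responses, grades, test_size=0.2, random_state=seed
    )
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return cohen_kappa_score(y_test, model.predict(X_test))

# Hypothetical usage: `responses` is a list of student answer strings, and the
# two grade lists hold the scores assigned under each condition.
# kappa_anonymized     = heldout_kappa(responses, grades_anonymized)
# kappa_non_anonymized = heldout_kappa(responses, grades_non_anonymized)
```

Comparing the two held-out agreement values would indicate whether training on anonymized grades changes model behavior relative to training on grades assigned with knowledge of student identity.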

3 Teacher Grading Behavior

In prior work, we reported on a pilot study in which we asked 14 teachers to grade anonymized open-ended responses from students who had worked on three open-response problems in the month prior to the study. Of the 14 teachers, only 9 completed the study. The data corpus included only the students of the 14 teachers in the pilot study. A random sample of 25 responses was generated per teacher, and we checked that at least 10 of the 25 responses came from that teacher's own students. If the random sample contained fewer than 10 responses from their students, additional responses were randomly selected from their students for the teacher to grade, as sketched below. If a teacher had none of their students in the random sample, they were assigned an additional 10 responses, bringing the total number of problems they graded to 35. Table 1 reports on the 9 teachers who completed the study along with the total number of problems (N) they graded non-anonymized beforehand and anonymized during the pilot study. Some teachers had fewer than 10 problems to grade because we had to remove duplicate answers (e.g., empty responses or answers of "I do not know") to ensure that each teacher graded a unique set of responses.
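A minimal sketch of this assignment logic is shown below; the data representation and function name are hypothetical and serve only to make the procedure concrete.

```python
import random

def assign_responses(all_responses, teacher_id, sample_size=25, min_own=10):
    """Hypothetical sketch of the sampling procedure described above.

    `all_responses` is a list of (student_id, teacher_id, response) tuples;
    the structure is an assumption for illustration only.
    """
    sample = random.sample(all_responses, sample_size)
    own = [r for r in sample if r[1] == teacher_id]
    remaining_own = [
        r for r in all_responses if r[1] == teacher_id and r not in sample
    ]

    if len(own) == 0:
        # No responses from this teacher's students: assign 10 extra,
        # bringing the total to 35.
        sample += random.sample(remaining_own, min(min_own, len(remaining_own)))
    elif len(own) < min_own:
        # Top up with additional responses from this teacher's own students.
        needed = min_own - len(own)
        sample += random.sample(remaining_own, min(needed, len(remaining_own)))

    return sample
```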

As shown in Table 1, we explored teachers' grading behavior by applying Cohen's kappa to measure the variation in their grading of student responses when anonymized vs. non-anonymized. We found the agreement coefficient to be as low as k = 0.163 and as high as k = 0.67, which was concerning, as it indicated that teachers disagreed with themselves when scoring the same students across conditions. The agreement was lower than anticipated, indicating significant differences in teacher grading behavior when students were anonymized. Given that grades are assigned on a 5-point scale, and a teacher's assessment may reasonably vary by a small degree, we also explored a relaxed calculation of kappa. We computed the intra-rater reliability of each teacher with an off-by-one adjustment: if the absolute difference in score across conditions was one or less, we treated the scores as equivalent. The adjustment resulted in notably higher kappas, indicating that teachers have consistent general grading behavior. We also computed the average difference in grades across conditions. While most teachers graded more leniently when they knew the student's identity, some teachers were more lenient when the student was anonymized.
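As a concrete illustration, the sketch below computes the exact and off-by-one-adjusted intra-rater kappa for one hypothetical teacher. The grade values and the choice to snap near-agreements to the non-anonymized grade are assumptions about how the relaxed calculation could be implemented, not the exact procedure used here.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical example: per-response grades on a 5-point scale (here assumed
# 0-4) given by one teacher under the two conditions. Values are illustrative.
grades_non_anonymized = np.array([4, 3, 2, 4, 1, 0, 3, 2, 4, 3])
grades_anonymized     = np.array([3, 3, 1, 4, 2, 0, 2, 0, 4, 2])

# Standard intra-rater agreement across conditions.
kappa_exact = cohen_kappa_score(grades_non_anonymized, grades_anonymized)

# Off-by-one adjustment: one way to relax the calculation is to treat any pair
# of scores differing by at most one point as equivalent, e.g., by snapping the
# anonymized grade to the non-anonymized grade before recomputing kappa.
adjusted = np.where(
    np.abs(grades_anonymized - grades_non_anonymized) <= 1,
    grades_non_anonymized,
    grades_anonymized,
)
kappa_relaxed = cohen_kappa_score(grades_non_anonymized, adjusted)

# Average signed difference: positive values indicate more lenient grading
# when the teacher knew the student's identity.
mean_diff = np.mean(grades_non_anonymized - grades_anonymized)

print(f"exact kappa:   {kappa_exact:.3f}")
print(f"relaxed kappa: {kappa_relaxed:.3f}")
print(f"mean grade difference (non-anonymized - anonymized): {mean_diff:.2f}")
```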

Table 1. Exploring the grading behavior of teachers when they had access to students’ identity vs. when students were anonymized.

3.1 Analysis Plan

We are currently designing a larger study, expanding our pilot study, to explore teacher grading behavior and to investigate whether the pattern in which some teachers grade more leniently than others repeats itself across a larger sample of teachers. The more extensive study will also provide the data needed to train NLP models and compare model performance when trained on anonymized grades versus non-anonymized grades.