
1 Introduction

Science educators and policy makers (NGSS Lead States 2013; OECD 2014) agree that richly integrating authentic inquiry with science content will promote well-honed learning strategies and allow students to apply and transfer their science knowledge in more flexible ways as is needed for tomorrow’s jobs (Hilton and Honey 2011). As a result, as schools in the United States adopt the Next Generation Science Standards (NGSS), educators will need to (1) incorporate more inquiry experiences into instruction, (2) assess their students’ inquiry practices/skills, and (3) ensure that each student demonstrates adequate progress on these.

Meeting these goals, however, poses significant challenges (Fadel et al. 2007). First, educators may not have adequate time, lab space, and/or physical materials for inquiry (Staer et al. 1998), particularly in schools with large class sizes (e.g., in Oregon there can be 50 students in a class). Second, grading inquiry is difficult, subjective, and time-intensive (Deters 2005). Third, teachers need immediate and actionable data to identify which of the many types of difficulties students are experiencing (Kuhn 2005) in order to foster students’ growth (Shute 2008), but current assessments yield data too late for teachers to impact students’ learning (Pellegrino et al. 2001). Fourth, developing authentic inquiry tasks and assessments is difficult because inquiry is multifaceted and ill-defined (Williamson et al. 2006), and as a result, there are too few empirically tested resources to assess and support inquiry (Krajcik et al. 2000; Schneider et al. 2005). Lastly, since inquiry practices need to be honed over time, students need to engage in authentic inquiry multiple times across the school year, and without an automated solution, the grading burden on teachers is extremely onerous.

To add to these issues, the most recent international comparisons of science performance show that American students continue to fall behind their peers. For example, in 2015, the United States ranked 25th worldwide on a key educational survey, the Program for International Student Assessment (PISA; Organization for Economic Cooperation and Development 2018). This is no doubt related, at least in part, to the many student difficulties that have been demonstrated for all of the inquiry skills identified by NGSS (2013). Specifically, students have trouble forming testable hypotheses (Chinn and Brewer 1993; Klahr and Dunbar 1988; Kuhn et al. 1995; Njoo and de Jong 1993; van Joolingen and de Jong 1997; Glaser et al. 1992) and difficulty testing their hypotheses (van Joolingen and de Jong 1991b, 1993; Kuhn et al. 1992; Schauble et al. 1991). They also have difficulty conducting experiments (Glaser et al. 1992; Reimann 1991; Tschirgi 1980; Shute and Glaser 1990; Kuhn 2005; Schunn and Anderson 1998, 1999; Harrison and Schunn 2004; McElhaney and Linn 2008, 2010).

When interpreting data during inquiry, a key NGSS inquiry practice and the one addressed in this chapter, students have several different types of difficulties. They may draw conclusions based on confounded data (Klahr and Dunbar 1988; Kuhn et al. 1992; Schauble et al. 1995), state conclusions that are inconsistent with their data (Kanari and Millar 2004), change ideas about causality (Kuhn et al. 1992), and/or have difficulty making a valid inference and reconciling previous conceptions with their collected data, falling back on prior knowledge (Schauble 1990; Kanari and Millar 2004) and thereby exhibiting confirmation bias during inquiry (Klayman and Ha 1987; Dunbar 1993; Quinn and Alessi 1994; Klahr and Dunbar 1988). They also fail to relate the outcomes of experiments to the theories being tested in the hypothesis (Schunn and Anderson 1999; Chinn and Brewer 1993; Klahr and Dunbar 1988).

When warranting their claims with evidence, one of the five essential features of classroom inquiry identified by the National Research Council (2011), students often provide little to no justification (McNeill and Krajcik 2011; Schunn and Anderson 1999) and create claims that do not answer the question posed (McNeill and Krajcik 2011). Students can also rely on theoretical arguments rather than on experimental evidence during warranting (Kuhn 1991; Schunn and Anderson 1999).

Lastly, they have difficulty developing rich explanations of their findings (Krajcik et al. 1998; McNeill and Krajcik 2007). When students provide reasoning for their claims, they often use inappropriate data by drawing on data that do not support their claim (McNeill and Krajcik 2011; Kuhn 1991; Schunn and Anderson 1999), make no mention of specific evidence (Chinn et al. 2008), or state generally that an entire data table is evidence (McNeill and Krajcik 2011; Chinn et al. 2008).

In this work, we sought to unpack the difficulties associated with data interpretation and warranting claims in particular.

2 Our Solution: Inq-ITS (Inquiry Intelligent Tutoring System; www.inqits.com)

In response to calls such as the Next Generation Science Standards, as well as teachers’ assessment challenges and students’ learning challenges, we have developed a solution that leverages schools’ existing computing resources to help teachers with inquiry assessment by providing automatic, formative data and to help students learn these skills by providing real-time, personalized scaffolds as they engage in inquiry. Inq-ITS (Inquiry Intelligent Tutoring System) is a lightweight LMS, providing computer-based assessment and tutoring for science inquiry skills. It is a no-install, entirely browser-based learning and assessment tool created using evidence-centered design (Mislevy et al. 2012) in which middle school students conduct inquiry using science microworlds (Gobert 2015). Within Inq-ITS, which consists of different interactive simulations within microworlds, or virtual labs, for different domains in physical, life, and earth science, students “show what they know” by forming questions, collecting data, analyzing their data, warranting their claims, and explaining findings using a claim-evidence-reasoning framework, all key inquiry practices (NGSS Lead States 2013). As students work, the inquiry work products they create and processes they use are automatically assessed using our patented assessment algorithms (Gobert et al. 2016a, b). These assessment algorithms were built and validated using student data (Sao Pedro et al. 2010, 2012a, 2013b, c, 2014; Gobert et al. 2012, 2013, 2015; Moussavi et al. 2015, 2016a). They have been shown to be robust when tested across inquiry activities with diverse groups of students and match human coders with high precision (precision values ranging from 84% to 99%; Sao Pedro et al. 2012a, b, 2013a, b, 2014, 2015).

3 Others’ Prior Research on Scaffolding Inquiry

Given student difficulties with inquiry as previously described, providing support to students for inquiry is critical if the Next Generation Science Standards (2013) or other policies emphasizing authentic science practices (e.g., OECD 2018) are to be realized. Scaffolds for inquiry can help students achieve success they could not on their own (Kang et al. 2014; McNeill and Krajcik 2011) and can lead to a better understanding of scientific concepts and the purpose of experimentation, as well as the inquiry skills used in experimentation (Kirschner et al. 2006). For example, providing scaffolding for a PhET simulation on circuit construction led students to be more explicit in their testing (such as adding a voltmeter or connecting an ammeter in the circuit); this systematicity also transferred once scaffolding was removed (Roll et al. 2014). Additionally, the specific skill of collecting controlled trials, a lynchpin skill of inquiry, can be learned via strategy training and transfers to other topics (Klahr and Nigam 2004). Scaffolding can also be used to help students make connections between experimental data and real-world scenarios (Schauble et al. 1995). Lastly, scaffolding students’ explanations during inquiry can yield positive effects on learning (Edelson et al. 1995; McNeill et al. 2006). Taken together, these results demonstrate the potential for deeper inquiry learning when students are provided with adequate support.

One drawback, however, to many of these studies is that the scaffolding is provided by a teacher, takes the form of text-based worksheets, or comes in some other form that is neither scalable nor fine-grained, i.e., operationalized at the sub-skill level. Additionally, these approaches typically require a student to know when they need help; however, students may not have the metacognitive skills needed to do so (Aleven and Koedinger 2000; Aleven et al. 2004).

In our system, by contrast, we use an automated approach that detects students’ problems with inquiry and provides computer-based scaffolding in real time in order to support the acquisition and development of inquiry skills/practices (Gobert et al. 2013; Sao Pedro et al. 2013b, c, 2014; Gobert and Sao Pedro 2017). These scaffolds are designed to address specific aspects of scientific inquiry at a fine-grained level and can help students receive the help they need by targeting the exact sub-skill on which they are having difficulty. Our identification of each of the sub-skills underlying each of the science practices described by the NGSS (2013) is described elsewhere (Gobert and Sao Pedro 2017). This approach provides both scalable assessment of science inquiry practices and scalable guidance, so that students can get help while they are having difficulty. Scaffolding in real time has been shown to better support students’ learning in general (Koedinger and Corbett 2006) and in inquiry learning in particular (Gobert et al. 2013; Gobert and Sao Pedro 2017). This approach has a great benefit over the others in that it is scalable, so that NGSS practices, as described, can be learned.

4 Inq-ITS’ Prior Work on Efficacy of Scaffolding

In our work, we have shown that our scaffolding can help students who did not know two skills related to planning and conducting experiments (NGSS Lead States 2013) – testing hypotheses and designing controlled experiments – acquire these skills and transfer them to a new science topic. These findings were robust both within the topic in which students were scaffolded and across topics for each domain studied (physical, life, and earth science), with scaffolded students maintaining and/or improving their skills in new topics when scaffolding was removed compared to those who did not receive scaffolding (Sao Pedro et al. 2013a, b, 2014).

With regard to the inquiry practices of interest in this chapter, namely, interpreting data and warranting claims, we recently conducted a systematic analysis of a subset of our data to address whether our scaffolding with Rex is supporting students in the acquisition and transfer of these inquiry skills. Later in the chapter, we provide an additional study, using Bayesian Knowledge Tracing (BKT) (Corbett and Anderson 1995), a computational approach allowing for the analysis of the fine-grained sub-skills underlying our practices of data interpretation and warranting claims.

Our data were drawn from 357 students in six middle school classes in the Northeast of the United States. Students completed two microworlds (Flower and Density) in either the Rex (N = 156) or No Rex (N = 201) condition. Mixed repeated measures ANOVAs on both interpretation skill and warranting skill were performed. An independent variable of time phase (repeated) was included in order to account for how participants consecutively completed two microworlds: Flower and Density. In the Flower virtual lab, none of the students received scaffolding from Rex, so performance in this virtual lab was used as the baseline. In the Density virtual lab, students were randomly assigned to either the Rex or No Rex condition. The Rex condition meant that Rex was available to assist students as they engaged in the microworld, whereas the No Rex condition meant that Rex was not available and could not be triggered. The results focused on the interaction effect of time × condition. Effect size was calculated using Cohen’s d. All significance testing for the primary analyses was conducted with an alpha level of .05. Our main interest was the effect of Rex’s scaffolding on learning.
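For readers wishing to run a comparable analysis on their own data, the sketch below illustrates a condition (between-subjects) × virtual lab (within-subjects) mixed ANOVA and a paired Cohen's d for one contrast. The data file, column names, and use of the pingouin library are illustrative assumptions, not the analysis scripts used in this study.

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format scores: one row per student per virtual lab, with
# columns student, condition ("Rex"/"No Rex"), lab ("Flower"/"Density"), score.
df = pd.read_csv("interpretation_scores.csv")

# Mixed repeated measures ANOVA; the condition * lab interaction is the term of interest.
aov = pg.mixed_anova(data=df, dv="score", within="lab",
                     subject="student", between="condition")
print(aov)

# Paired Cohen's d for the Flower -> Density change within the Rex condition.
rex = df[df["condition"] == "Rex"].sort_values("student")
d = pg.compute_effsize(rex.loc[rex["lab"] == "Density", "score"],
                       rex.loc[rex["lab"] == "Flower", "score"],
                       paired=True, eftype="cohen")
print(round(d, 2))
```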

Table 8.1 presents the estimated means of interpretation skill and warranting skill in the Rex and No Rex conditions, as well as standard errors, the lower and upper bounds of the 95% confidence intervals, F values, and Cohen’s d effect sizes for the pairwise analyses.

Table 8.1 Statistics for condition × time in the Flower and Density virtual labs

4.1 Data Interpretation

There was a significant two-way interaction between condition × time for data interpretation skill, F(2, 710) = 12.25, p < 0.001 (see Table 8.1 and Fig. 8.1). The pairwise comparisons showed that students’ interpretation substantially improved in both the No Rex (mean increased from .68 to .74, p = .010, d = .26) and Rex conditions (mean increased from .66 to .84, p < .001, d = .79). This implies that students’ interpretation skills improved when they used the virtual lab even without scaffolding from Rex. In the second virtual lab, Density, students in the Rex condition achieved higher scores on interpreting data than students in the No Rex condition, with a medium effect size. These findings confirm that students who received Rex’s support experienced greater improvement on interpretation skills relative to students who did not receive support from Rex.

Fig. 8.1
figure 1

Estimated means of condition × time in Flower and Density microworlds, respectively

4.2 Warranting Claims

There was a significant two-way interaction between condition × time for warranting skill, F(2, 710) = 10.40, p = 0.001 (see Table 8.1 and Fig. 8.1). The pairwise comparisons showed that students’ performance on warranting claims substantially improved in both the No Rex (mean increased from .37 to .68, p < .001, d = 1.02) and Rex conditions (mean increased from .30 to .77, p < .001, d = 1.51). This implies that students’ skills at warranting claims improved when they used the virtual lab with or without scaffolding from Rex. Results also showed that there were no significant differences in students’ skills at warranting claims when they conducted the first virtual lab, Flower, without Rex scaffolding. However, in the second virtual lab, Density, students in the Rex condition achieved higher scores on warranting claims than students in the No Rex condition, with a small effect size. These findings further confirm that students who received Rex’s support experienced greater improvement on warranting claims skills relative to students who did not receive support from Rex.

4.3 Using Advanced Analytical Approaches to Study the Fine-Grained Effects of Scaffolding on Students’ Data Interpretation and Warranting Claims

In this study, we hypothesized that an automated scaffolding approach that provides personalized feedback would help students learn data interpretation skills and warranting claims skills. As such, we developed scaffolds within Inq-ITS that react when students have difficulty on these key skills and sub-skills (McNeill and Krajcik 2011; Gotwals and Songer 2009; Kang et al. 2014; Berland and Reiser 2009).

4.4 Method

4.4.1 Participants

Data were collected from 160 eighth grade students from the same school in the Northeast of the United States using Inq-ITS Density activities. All the students had previously used Inq-ITS, but not with this new scaffolding capacity.

4.4.2 Materials

Inq-ITS Density Virtual Lab Activities

For each Inq-ITS virtual lab, there are typically three or four inquiry activities, consisting of driving questions that help guide students through the inquiry phases. Within each activity, students conduct inquiry by first articulating a testable hypothesis using a hypothesis widget with pulldown menus. They then experiment by collecting data with an interactive simulation through the manipulation of variables (Fig. 8.2a). Once they have collected all of their data, they interpret the results of their experiment by forming a claim in a claim widget (similar to that used for hypothesizing) and selecting trials as evidence (Fig. 8.2b). Finally, students write a short open-response report that summarizes their findings from their inquiry using a claim-evidence-reasoning format (McNeill and Krajcik 2011).

Fig. 8.2a
figure 2

In the “collect data” phase of the Inq-ITS Density virtual lab, students collect data to test their hypothesis

Fig. 8.2b
figure 3

After collecting data, students analyze their data. They review the data they collected, use pulldown menus to describe the trends found in their data, and select the evidence (trials) to support their claim

In this study, three Density virtual lab activities were used. These activities aim to foster understanding about the density of different liquid substances (water, oil, and alcohol). In the first activity, the goal was to determine if the shape of the container affected the density of the liquid; the second was to determine if the amount of liquid affected the density; and the third was to determine if the type of liquid affected the density.

4.4.3 Procedure

Students worked on the Density activities in a computer lab at their school for the length of one science class (approximately 50 min). Each student worked independently on a computer at their own pace, meaning that not all students completed the entire set of activities by the end of the class period. Students were randomly assigned to one of two conditions: either the “Interpretation Scaffolding” (n = 78) or “No Interpretation Scaffolding” (n = 82) condition. For the first activity, none of the students, regardless of condition, received scaffolding. This allowed us to collect a baseline for each student on the targeted data interpretation sub-skills. For the next two activities, the students in the “Interpretation Scaffolding” condition received scaffolding during hypothesizing, data collection, and data interpretation. The students in the “No Interpretation Scaffolding” condition only received scaffolding during hypothesizing and data collection. The scaffolding during hypothesizing and data collection ensured that all students, regardless of scaffolding condition, had both a testable hypothesis and relevant, controlled data with which they could correctly carry out data interpretation (this design also allows us to isolate and systematically study the effects of scaffolding for data interpretation skills, as opposed to the two practices that precede it in the inquiry process). Thus, students in both conditions worked in the same environment and on the same activities with access to hypothesizing and data collection scaffolding. The only difference was the presence of data interpretation scaffolding for one condition (Interpretation Scaffolding condition).

Evaluation of Inquiry Sub-skills

For data interpretation and warranting claims, there are eight main sub-skills that are evaluated in the system using the work products students create. These work products are their claim (selecting the appropriate variables and relationship between them) and supporting evidence (selecting relevant, controlled trials from their data table that reflect the relationship stated in their claim). These sub-skills and the specific criteria with which they are evaluated can be seen in Table 8.2. Since these sub-skills, defined in the context of this activity, are well-defined (Gobert and Sao Pedro 2017), they are evaluated using knowledge-engineered rules that specify whether the sub-skill has been demonstrated. For example, for the sub-skill “Claim DV” shown in Table 8.2, the system evaluates whether or not the student has correctly chosen a variable that is measured, not changeable, within the simulation (a dependent variable) in the appropriate part of the claim. Within the context of the Density virtual lab, the appropriate dependent variable is “density of the liquid.” So if the student states “density of the liquid” as the dependent variable, they would be marked as correctly demonstrating the DV sub-skill. However, if the student chooses another variable, such as one of the independent variables like “type of liquid,” as the dependent variable, then they would be scored as incorrectly demonstrating the DV sub-skill. As another example, for the sub-skill “interpreting the IV/DV relationship,” a rule checks that the relationship between the independent and dependent variables specified in the claim is reflected in the data collected by the student. Elaborating further, if a student claims that “When I increased the size of the container the density of the liquid stayed the same” and their data reflects that relationship, that sub-skill would be scored as correct. If the data they collected did not reflect that relationship, the sub-skill would be scored as incorrect. The evaluation rules yield a binary measure of correctness on each sub-skill (i.e., the results are presented as being correct or incorrect rather than having levels of correctness). This allows us to tease apart separate components (the sub-skills) within the broader skill of analyzing data.
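To make the rule-based evaluation concrete, the sketch below encodes two of these sub-skills (Claim DV and interpreting the IV/DV relationship) as simple predicates over a student's claim and selected trials. The data model, variable names, and relationship labels are assumptions made for this illustration, not Inq-ITS's actual implementation.

```python
# Measured (not changeable) variables in this hypothetical Density activity.
DEPENDENT_VARIABLES = {"density of the liquid"}

def claim_dv_correct(claim):
    """Claim DV sub-skill: the variable placed in the DV slot of the claim must be
    one that is measured, not manipulated, in the simulation."""
    return claim["dv"] in DEPENDENT_VARIABLES

def iv_dv_relationship_correct(claim, selected_trials):
    """Interpreting the IV/DV relationship: the relationship stated in the claim must
    be reflected in the selected trials (assumed ordered by increasing IV value)."""
    dv_values = [trial[claim["dv"]] for trial in selected_trials]
    if len(dv_values) < 2:
        return False                       # at least two trials are needed to show a trend
    if claim["relationship"] == "stayed the same":
        return len(set(dv_values)) == 1
    if claim["relationship"] == "increased":
        return all(b > a for a, b in zip(dv_values, dv_values[1:]))
    if claim["relationship"] == "decreased":
        return all(b < a for a, b in zip(dv_values, dv_values[1:]))
    return False

# Example: "When I increased the size of the container, the density stayed the same."
claim = {"iv": "size of container", "dv": "density of the liquid",
         "relationship": "stayed the same"}
trials = [{"size of container": "small", "density of the liquid": 1.0},
          {"size of container": "large", "density of the liquid": 1.0}]
print(claim_dv_correct(claim), iv_dv_relationship_correct(claim, trials))  # True True
```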

Table 8.2 Data interpretation sub-skills

Scaffolds in Inq-ITS

Inq-ITS delivers scaffolds to students in text format via a pedagogical agent named Rex, a cartoon dinosaur (Fig. 8.3). Scaffolding is triggered automatically when a student completes their data analysis and at least one of the sub-skills is incorrectly demonstrated (evaluated by the knowledge-engineered rules discussed previously). This proactive scaffolding approach helps to support students in their inquiry processes (Schauble 1990; de Jong 2006) by preventing students from engaging in ineffective behaviors (Buckley et al. 2006; Sao Pedro 2013). This proactive approach is also important because students may not be aware that they need help (Aleven and Koedinger 2000; Aleven et al. 2004). Once scaffolding is triggered, students may also ask Rex for additional clarification and support.

Fig. 8.3
figure 4

Example scaffold delivered by Rex during data interpretation

The scaffolds are designed to adapt to students’ skill level by both providing multiple levels of automatic scaffolds and allowing students to request further help or clarification (once support is auto-triggered), as needed. In this way, the scaffolds personalize each student’s learning, recognizing that different students may need different amounts of help to successfully hone different sub-skills. The data interpretation scaffolds address four categories of procedurally-oriented difficulties that focus on the eight aforementioned sub-skills evaluated within data interpretation and warranting claims (Moussavi et al. 2015). These data interpretation and warranting claims scaffold categories are:

  1. The Claim IV/DV does not match the hypothesis IV/DV.

  2. The trials selected for warranting are not properly controlled or relevant to the claim.

  3. The claim does not reflect the data selected.

  4. The claim is incorrect as to whether it supports/does not support the hypothesis.

Since students may require scaffolding support for none, one, or many of these sub-skills, the scaffolds are designed to address these in the order listed above, so that each step of data interpretation is completed before moving on to the next. For example, it is impossible for students to correctly select relevant trials for warranting if they have not specified an appropriate IV and DV in their claim. Therefore, difficulty with creating a claim with the correct IV and DV (i.e., category 1) is scaffolded first, until the sub-skill is demonstrated correctly, before another difficulty is addressed. On the other hand, if a student’s only difficulty is with stating whether or not the claim supports the hypothesis, then the first three scaffolding categories are skipped and the student receives only the specific scaffolds that address category 4.

When students make multiple errors within the same category, we follow a sequence that increases the level of feedback given to the student. For the first error, a scaffold is provided to orient students to the current task. If the same error is repeated, they are then guided through the necessary procedural skills. Finally, the system provides a “bottom-out” hint telling students the procedure to follow. In this way, the student receives more and more targeted support, similar to cognitive tutors (e.g., Anderson et al. 1995; Corbett and Anderson 1995; Koedinger and Corbett 2006).
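A minimal sketch of this selection and escalation logic, assuming hypothetical category names and a three-level hint ladder, is shown below; it illustrates the behavior described above rather than reproducing the Inq-ITS code.

```python
# Scaffold categories in priority order (category 1 is addressed before category 2, etc.).
CATEGORY_ORDER = [
    "claim_matches_hypothesis_ivdv",    # category 1
    "controlled_relevant_trials",       # category 2
    "claim_reflects_selected_data",     # category 3
    "support_of_hypothesis_correct",    # category 4
]

HINT_LEVELS = ["orient", "procedural_guidance", "bottom_out"]

def next_scaffold(evaluations, error_counts):
    """Return (category, hint level) for the highest-priority failed category,
    escalating the hint level each time the same category is failed again."""
    for category in CATEGORY_ORDER:
        if not evaluations[category]:                           # first failed category wins
            level = min(error_counts.get(category, 0), len(HINT_LEVELS) - 1)
            return category, HINT_LEVELS[level]
    return None                                                 # everything correct: no scaffold

# A student whose only difficulty is category 4 skips the scaffolds for categories 1-3.
evals = {"claim_matches_hypothesis_ivdv": True, "controlled_relevant_trials": True,
         "claim_reflects_selected_data": True, "support_of_hypothesis_correct": False}
print(next_scaffold(evals, {"support_of_hypothesis_correct": 1}))
# -> ('support_of_hypothesis_correct', 'procedural_guidance')
```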

In sum, these scaffolds adapt to each student’s skill level, providing as much or as little support as each student needs to successfully hone different sub-skills.

Data Analysis Approaches

Due to the complexities and sub-skills inherent in the inquiry practices of data interpretation and warranting claims, an advanced analytical method using an extension of Bayesian Knowledge Tracing (Corbett and Anderson 1995) is better suited to address the effects of scaffolding on students’ learning and transfer of the sub-skills of inquiry under investigation here (Sao Pedro et al. 2013b). Bayesian Knowledge Tracing (BKT henceforth), a cognitive modeling approach to approximating the mastery of sub-skills in intelligent tutoring systems, is a powerful technique, and its prediction of student performance is as good as or better than similar algorithms that aggregate performance over time in order to infer student skill (e.g., Baker et al. 2011). Additionally, our group has shown that this approach is effective for modeling students’ learning of inquiry, both with and without the presence of scaffolding (Sao Pedro et al. 2013b).

4.4.4 Bayesian Knowledge Tracing

Bayesian Knowledge Tracing (BKT) (Corbett and Anderson 1995) estimates the likelihood that a student knows a particular skill (or sub-skill) and disentangles between “knowing” and “demonstrating” that skill (or sub-skill) based on prior opportunities in which students attempt to demonstrate a particular skill. BKT assumes that knowledge of a skill is binary (either a student knows the skill or they do not) and that skill demonstration is also binary (either a student demonstrates a skill or they do not).

Mathematically, four parameters are used to model whether a student knows a skill: L0, T, G, and S (Corbett and Anderson 1995). L0 is the probability of initial knowledge, i.e., that the student is already in the “learned state” before they start the first problem. T is the probability of learning, i.e., the chance that the student goes from the “unlearned state” to the “learned state” over the course of doing all of the problems in the sequence. G is the probability of guessing, i.e., the chance that a student in the “unlearned state” answers the problem correctly. Lastly, S is the probability of slipping, i.e., the chance that a student in the “learned state” answers the problem incorrectly (Corbett and Anderson 1995). The parameters G and S mediate the difference between “knowing” a skill and “showing” a skill. A student who shows the skill may not actually know it, contributing to G. Conversely, a student who knows the skill may not always show it, contributing to S. BKT, in this formulation, assumes that skills are not forgotten (Corbett and Anderson 1995); once a student is in the “learned state,” they cannot forget and go back to the “unlearned state.” Instead, if a student in the “learned state” does not “show” a skill at a specific practice opportunity, they are considered by the model to have “slipped,” i.e., they were not able to show the skill at that time despite knowing it. This then affects the S parameter but does not change what state the student is considered to be in. See Fig. 8.4.
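The sketch below shows the standard BKT update for a single sub-skill: the estimate of P(L) is first conditioned on whether the student demonstrated the sub-skill (using G and S), and the learning transition T is then applied. The parameter values are illustrative only, not fitted values from this study.

```python
def bkt_update(p_learned, correct, guess, slip, learn):
    """One BKT update: condition P(L) on the observed correctness, then apply the
    learning transition. There is no forgetting in this formulation."""
    if correct:
        evidence = p_learned * (1 - slip)
        posterior = evidence / (evidence + (1 - p_learned) * guess)
    else:
        evidence = p_learned * slip
        posterior = evidence / (evidence + (1 - p_learned) * (1 - guess))
    return posterior + (1 - posterior) * learn

p = 0.3                                    # L0: prior probability of knowing the sub-skill
for correct in [False, True, True]:        # one student's sequence of evaluations
    p = bkt_update(p, correct, guess=0.2, slip=0.1, learn=0.15)
print(round(p, 3))                         # estimated P(L) after three opportunities
```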

Fig. 8.4
figure 5

Bayesian Knowledge Tracing model

Prior work by Sao Pedro (2013) extended the traditional BKT model to account for the presence of a tutor intervention, similar to that of Beck et al. (2008). To incorporate scaffolding into the BKT framework, they introduced the dichotomous observable variable Scaffolded = {True, False} and conditioned the learning rate (T) on that observable, leading to two distinct learning rate parameters – Tscaffolded and Tunscaffolded. This resulted in the following equations for computing P(Ln), the likelihood of knowing a skill (Sao Pedro 2013):

$$P(L_n \mid \mathrm{Scaffolded}_n = \mathrm{True}) = P(L_{n-1} \mid \mathrm{Prac}_n) + \left(1 - P(L_{n-1} \mid \mathrm{Prac}_n)\right) \cdot P(T_{\mathrm{scaff}})$$

$$P(L_n \mid \mathrm{Scaffolded}_n = \mathrm{False}) = P(L_{n-1} \mid \mathrm{Prac}_n) + \left(1 - P(L_{n-1} \mid \mathrm{Prac}_n)\right) \cdot P(T_{\mathrm{unscaff}})$$

We follow this approach to determine whether data interpretation scaffolds are supporting students’ learning.
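In implementation terms, the extension amounts to choosing between two learning rates depending on whether the practice opportunity was scaffolded, as in the brief sketch below (parameter values are again illustrative).

```python
def bkt_update_scaffolded(p, correct, scaffolded, guess, slip, t_scaff, t_unscaff):
    """One BKT update in which the learning rate depends on whether the practice
    opportunity was scaffolded (after Sao Pedro 2013)."""
    if correct:
        cond = p * (1 - slip) / (p * (1 - slip) + (1 - p) * guess)
    else:
        cond = p * slip / (p * slip + (1 - p) * (1 - guess))
    learn = t_scaff if scaffolded else t_unscaff
    return cond + (1 - cond) * learn

# One student's (correct, scaffolded) observations for a single sub-skill.
p = 0.3                                    # L0
for correct, scaffolded in [(False, True), (True, True), (True, False)]:
    p = bkt_update_scaffolded(p, correct, scaffolded,
                              guess=0.2, slip=0.1, t_scaff=0.35, t_unscaff=0.1)
print(round(p, 3))
```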

One of the main assumptions of BKT is that skills are independent. This means that each skill that we want to track has to be modeled separately. Because of this, there were certain design considerations we had to make when fitting our data to the BKT model, specifically with regard to how the scaffolding condition and practice opportunities were defined. These considerations are discussed in the following section.

4.4.5 Data Preparation Extensions to Leverage the BKT Framework

The data logged here differs from typical data logs due to how the data interpretation scaffolds were integrated into the system. In the system, all of the data interpretation sub-skills are designed to be evaluated at once. However, the data interpretation scaffolds are designed to only address one sub-skill at a time in order to give directed support, as described above. For example, if a student submits their analysis and is evaluated as both choosing an incorrect IV and an incorrect IV/DV relationship, even though they will have been evaluated on every data interpretation sub-skill, they will only receive the scaffold for one of their errors, in this case the error of the incorrect IV. Once the student revises their analysis and submits again, they are once more evaluated on all of the data interpretation sub-skills, regardless of what specific aspects of their analysis they changed.

Considering this and the fact that in BKT analysis every sub-skill is considered separately and has its own model, it became important to consider how the BKT framework defined the scaffolding condition and practice opportunity in order to create an accurate model. These design decisions for the BKT model are described in more detail below.

4.4.6 Determining Scaffolding Condition

Not all of the 78 students in the Interpretation Scaffolding condition needed the data interpretation scaffolds, and while some students used only one scaffold, others used multiple scaffolds targeting multiple sub-skills. Since BKT operates under the assumption of independence of skills, it would not be appropriate to label all of these students as having been scaffolded. Arguably, it is more important to model the scaffolds students received on a per-skill basis, rather than simply considering them as scaffolded or not. Because of this, scaffolding was considered at the sub-skill level, so that any scaffolds a student received for one specific sub-skill had no bearing on the student’s scaffolding classification for the other sub-skills. In the BKT model for the Claim DV sub-skill, for example, a student is considered to have been in the scaffolding condition only if they ever received the specific scaffold directly addressing the Claim DV sub-skill, regardless of any other scaffolds they may or may not have received. As a result, a student may be in the scaffolding condition for only one sub-skill’s BKT model or for the models of several different sub-skills.
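In practice, this amounts to assigning the scaffolding condition per (student, sub-skill) pair rather than per student, as in the minimal sketch below; the log format and names are assumptions.

```python
# Hypothetical log of which scaffolds each student received, one entry per
# (student, targeted sub-skill) pair.
scaffold_log = [("s1", "claim_dv"), ("s1", "evidence_controlled"), ("s2", "claim_dv")]
scaffolded_pairs = set(scaffold_log)

def in_scaffolding_condition(student, sub_skill):
    """True only if this student ever received a scaffold targeting this sub-skill."""
    return (student, sub_skill) in scaffolded_pairs

print(in_scaffolding_condition("s1", "claim_dv"))             # True
print(in_scaffolding_condition("s2", "evidence_controlled"))  # False
```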

4.4.7 Determining Number of Practice Opportunities

In Inq-ITS, students click to submit their data interpretation, after which the system records all of the actions as one practice opportunity and evaluates all of the sub-skills (Gobert et al. 2013). When scaffolding is being used, students who have been evaluated as incorrectly demonstrating any sub-skill receive scaffolding and are redirected to their data interpretation. Any subsequent actions students perform (up until submitting again) are considered part of a new practice opportunity for all sub-skills regardless of which specific sub-skill(s) were worked on, which can make it seem as though students require more practice opportunities to master a sub-skill than they actually do. For example, as shown in Table 8.3, based on the evaluations, it looks like after three practice opportunities, the student is still incorrectly demonstrating the “claim” and “support” sub-skills. However, if we look at the student’s actions, we can see that the student was only focused on correctly demonstrating the “DV” sub-skill (due to the scaffolding received) and was not actually working on the other two sub-skills. Therefore, it would not be accurate to say that the student had three practice opportunities for the “claim” and “support” sub-skills. This, then, needs to be accounted for in the BKT models in order to more accurately assess students’ probability of learning.

Table 8.3 Example of practice opportunity succession

The option considered here was to collapse student evaluations for each sub-skill within each activity into one practice opportunity. This acts as a “pre-smoothing” of the data, and while it treats the data in a slightly coarser way because of the rolling up of practice opportunities, it yields a simpler model with fewer parameters. In collapsing students’ evaluations, all of the evaluations for one sub-skill within an activity were examined, and a student received a correct evaluation for a particular sub-skill only if all of their evaluations for that sub-skill were correct. This was done because if a student ever incorrectly demonstrated a sub-skill, it could be assumed that the student most likely did not know the sub-skill to begin with. This collapses the student’s evaluations from Table 8.3 into the single practice opportunity shown in Table 8.4.
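A small sketch of this collapsing step, assuming a simple long-format log of evaluations, is shown below.

```python
import pandas as pd

# Hypothetical log: three submissions of the same sub-skill within one activity.
evaluations = pd.DataFrame({
    "student":   ["s1", "s1", "s1"],
    "activity":  [1, 1, 1],
    "sub_skill": ["claim_dv", "claim_dv", "claim_dv"],
    "correct":   [False, False, True],
})

# Within each activity, a sub-skill counts as correct only if it was always correct.
collapsed = (evaluations
             .groupby(["student", "activity", "sub_skill"])["correct"]
             .all()
             .reset_index())
print(collapsed)   # one collapsed practice opportunity per activity, marked incorrect here
```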

Table 8.4 Example of collapsed evaluation

Therefore, the BKT analysis was performed for each of the assessed data interpretation and warranting claims sub-skills, using the scaffolding extension of the BKT framework developed by Sao Pedro (2013), as previously described.

4.4.8 Fitting BKT Model Parameters

To learn the parameters (L0, TScaff, TUnscaff, G, S) from student data for each of the BKT models (one model per targeted data interpretation sub-skill), we used a brute-force grid search approach (Baker et al. 2010) to find the parameters that minimize the sum of squared residuals (SSR) between the predicted probability of demonstrating a skill and the actual data, as done in Sao Pedro et al. (2013b).
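The sketch below illustrates such a brute-force grid search for a single-learning-rate BKT model; the study's models additionally split T into TScaff and TUnscaff as described above. The grid step, parameter bounds, and toy sequences are assumptions.

```python
import itertools
import numpy as np

def predict_correct(seq, l0, t, g, s):
    """Predicted P(correct) at each opportunity for one student's 0/1 sequence."""
    p, preds = l0, []
    for correct in seq:
        preds.append(p * (1 - s) + (1 - p) * g)      # P(correct) before observing this attempt
        if correct:
            cond = p * (1 - s) / (p * (1 - s) + (1 - p) * g)
        else:
            cond = p * s / (p * s + (1 - p) * (1 - g))
        p = cond + (1 - cond) * t                    # learning transition, no forgetting
    return np.array(preds)

def grid_search(sequences, step=0.05):
    """Return the (L0, T, G, S) grid point minimizing the sum of squared residuals."""
    grid = np.arange(step, 1.0, step)
    best, best_ssr = None, np.inf
    for l0, t, g, s in itertools.product(grid, repeat=4):
        if g > 0.5 or s > 0.5:                       # common plausibility bound on guess/slip
            continue
        ssr = sum(((predict_correct(seq, l0, t, g, s) - np.array(seq, dtype=float)) ** 2).sum()
                  for seq in sequences)
        if ssr < best_ssr:
            best, best_ssr = (l0, t, g, s), ssr
    return best, best_ssr

# Each inner list is one student's collapsed evaluations for a single sub-skill.
params, ssr = grid_search([[0, 1, 1], [1, 1, 1], [0, 0, 1]])
print(params, round(ssr, 3))
```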

4.4.9 Determining Goodness of the BKT Models

Once the BKT parameters were determined, they were applied to the model, and then its predictive performance was tested against the same set of data used to construct the model. Although cross-validation helps to ensure that the models are accurate and can be applied to new students, it requires a held-out validation data set collected from a similar population. Since this work is exploratory in nature in that it is examining the first set of data collected with the data interpretation scaffolds, we did not have a held-out data set that could be used for this purpose. As such, the same set of data used for training was also used for validation, which can lead to over-fitting of the model. In ongoing work, we are addressing this limitation by using a held-out test set to test the models.

As in Sao Pedro et al. (2013b), performance was measured using A′ (Hanley and McNeil 1982), which is the probability that the detector will be able to correctly label two examples of students’ skill evaluation when in one the student is correctly demonstrating the skill and in the other the student is not. An A′ of 0.5 is indicative of chance performance, and an A′ of 1.0 is indicative of perfect performance.
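Computationally, A′ can be estimated as the proportion of positive-negative pairs that the model ranks correctly, counting ties as one half (equivalent to the area under the ROC curve). The scores and labels below are made up for illustration.

```python
def a_prime(labels, scores):
    """Probability that a randomly chosen positive example receives a higher score
    than a randomly chosen negative one, with ties counted as 0.5."""
    pairs = [(sp, sn)
             for sp, lp in zip(scores, labels) if lp == 1
             for sn, ln in zip(scores, labels) if ln == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0 for sp, sn in pairs)
    return wins / len(pairs)

labels = [1, 0, 1, 1, 0]                   # 1 = skill demonstrated, 0 = not demonstrated
scores = [0.9, 0.4, 0.7, 0.6, 0.6]         # model's predicted probabilities
print(round(a_prime(labels, scores), 3))   # 0.917; 0.5 is chance, 1.0 is perfect
```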

5 Results

Our goal is to determine whether our automated scaffolding approach helps students acquire data interpretation sub-skills. We first look at a descriptive analysis of the frequency with which scaffolds were used across the activities. We also look at error rates for the sub-skills to get an initial look at students’ progress with and without scaffolding. Then, as mentioned, we used our BKT extensions to approximate student learning of the data interpretation sub-skills and to make inferences about whether scaffolding was effective.

Descriptive Analysis

Table 8.5 shows the number of students who received any data interpretation scaffold in an activity and the total number of scaffolds triggered in an activity. Not all of the students were able to finish the third activity within the time frame of their science class, contributing to the lower number of students in Activity 3. Looking at these numbers, we can see that by the third activity, fewer students received scaffolds and that these students, overall, required less scaffolding support to successfully demonstrate the data interpretation sub-skills that we evaluate. This gives an initial indication that the scaffolding support, in its entirety, is helping students successfully interpret the data they collected and warrant their claims with data.

Table 8.5 Students using any data interpretation scaffold

We next looked at the error rates for four of the data interpretation sub-skills most tightly related to the evaluations that trigger the scaffolds. Error rate is defined as the percentage of students who demonstrated that error in each activity. The graphs in Fig. 8.5 show the error rate of students in each of the two conditions (Interpretation Scaffolding condition and No Interpretation Scaffolding condition) as they worked through the three activities.

Fig. 8.5
figure 6

Error rate analysis

As shown in these graphs (Fig. 8.5), student difficulty/error was present in each of these sub-skills, with the sub-skills “Interpreting correct IV/DV relationship” and “Interpreting hypothesis/claim relationship” having the highest initial error rates, regardless of condition. Furthermore, this analysis revealed that students in the “Interpretation Scaffolding” condition started with a higher error rate but ended with a lower error rate. For example, for the sub-skill “Warranting with controlled trials,” on their first opportunity, students in the Interpretation Scaffolding condition had an error rate of 0.33 compared to an error rate of 0.26 exhibited by the students in the No Interpretation Scaffolding condition. However, by their third opportunity, students in the Interpretation Scaffolding condition had a much lower error rate of 0.05, which was below the error rate of 0.16 exhibited by the students in the No Interpretation Scaffolding condition. This indicates that students in the Interpretation Scaffolding condition improved faster than the students in the No Interpretation Scaffolding condition.

The descriptive analyses suggest that scaffolding appears to be effective at helping students acquire these sub-skills. We next conduct a deeper inferential analysis using the BKT modeling framework described previously.

Inferential Analysis with Bayesian Knowledge Tracing

As described previously, we fit BKT models using the student data collected and use A′ (Hanley and McNeil 1982) to measure the goodness of the models. Recall that an A′ of 0.5 is indicative of chance performance and an A′ of 1.0 is indicative of perfect performance. The A′ values for this analysis can be seen in Table 8.6. In this case, performance was measured to be relatively high for all of the sub-skills, with A′ values between 0.69 and 0.81, allowing for parameter interpretation. However, again, since cross-validation was not done, it is possible that some of these models may be over-fitting to some student data (cf. Sao Pedro et al. 2013).

Table 8.6 A′ values showing high performance of the BKT models

The results from the BKT analysis indicate that the data interpretation scaffolds were effective in supporting the acquisition of data interpretation sub-skills. This can be seen through the values of the probability of learning. This value represents the chance that the student goes from the unlearned state to the learned state over the course of the activities. As can be seen in the data in Table 8.7, the probability of learning for students receiving data interpretation scaffolding is higher for all but one of the evaluated sub-skills. This sub-skill, selecting an IV for the claim, also has a high probability of initial knowledge, which could indicate that the sub-skill is not being learned because so many students already know it (e.g., Sao Pedro et al. 2014). Also, compared to another sub-skill with a relatively high probability of initial knowledge – such as the Evidence sub-skill – the Claim IV sub-skill is noisier to assess, likely because it might be highly related to the content in each activity.

Table 8.7 BKT parameters for each sub-skill

6 Discussion

The goal of this work was to test the efficacy of our data interpretation scaffolding on the sub-skills underlying the practices of data interpretation and warranting claims. We tested this in two ways: using analysis of variance on the aggregate scores for each practice (data interpretation and warranting claims), as well as an innovative extension to Bayesian Knowledge Tracing (BKT) that accounts for the presence of scaffolding when approximating mastery of each of the sub-skills of interest (Sao Pedro et al. 2013b). We also developed modifications to this framework, which allow it to be applied when condition and practice opportunity can be defined on different levels (i.e., activity level vs. skill level).

In developing our BKT extension, this work contributes a fine-grained method for unpacking the effect of scaffolding via logged, process data. Our extension to BKT was used as a modeling paradigm to track the sub-skills underlying data interpretation and warranting claims. This study was done within a complex domain of science inquiry, in which the student data, practice opportunity counts, and evaluated skills were not as clearly delineated as in previous studies in which BKT was used to evaluate educational interventions (Koedinger et al. 2010). This work provides a framework for how data in these complex environments can be treated before BKT can be used.

This work also explores modifying the BKT framework to represent and track students’ learning of the targeted data interpretation sub-skills with and without scaffolding. Further analyses are needed to determine the efficacy of this model and its accuracy in comparison to other models. As the data used for this work was collected as an initial study of the data interpretation/warranting claims scaffolds, additional data will be used to cross-validate the predictive performance of the models used here and provide greater assurance in interpreting the parameters of the model. This method could then be used as students work through multiple domains with scaffolding to assess the efficacy of these scaffolds across a larger number of practice opportunities (e.g., Sao Pedro et al. 2014). This will also allow us to assess how scaffolding can impact the transfer of these skills from one science domain to another. Additionally, we will use this method on studies without scaffolding, which will give us data to better understand how this skill develops naturally.

Regarding inquiry, this work builds on prior research (Kang et al. 2014; McNeill and Krajcik 2011; Schauble 1990) on the nature of data interpretation and warranting claims skills, their assessment, and scaffolding. This work makes a contribution to the prior research on argumentation practices for inquiry by conceptualizing and framing the data interpretation and warranting claims practices as necessary but not sufficient for appropriate scientific argumentation.

When it comes to unpacking the broad components of explanation, Toulmin’s (1958) model of argumentation is typically used (McNeill and Krajcik 2011; Gotwals and Songer 2009; Kang et al. 2014; Berland and Reiser 2009), breaking down argumentation into three main components: claims, evidence, and reasoning. The interpretation of evidence and the creation of an evidence-based explanation or argument are both key practices in national science standards and essential for fostering students’ science literacy (McNeill and Krajcik 2011; Kang et al. 2014).

We feel that unpacking the inquiry practices associated with data interpretation and warranting claims separately from students’ data on claims, evidence, and reasoning, as expressed in open-response format, is important because if students are having problems analyzing their data, they won’t be able to successfully engage in explanation and argumentation. Our prior work has shown that a number of students are not able to articulate a correct explanation or argument despite knowing the data interpretation skills (Li et al. 2017). Moreover, there are large numbers of students who are mis-assessed when their open responses are used as the only source of assessment: there are students who are skilled at science but cannot convey what they know in words (i.e., false negatives), as well as students who are skilled at parroting what they have read or heard but do not understand the science they are writing about (i.e., false positives; Gobert 2016). In short, relying solely on students’ writing for assessment accurately measures what students know only if they are good at articulating their knowledge in words.

To this end, we conceptualize/frame data interpretation and warranting claims practices as underlying the argumentation practices necessary for communicating science findings and thus find it necessary to study these skills separately from students’ overall written explanations and arguments. Conceptualizing and supporting students on the components of the explanation framework – claim, evidence, and reasoning – in an automated and fine-grained way with appropriate sub-skills can help us unpack and target known difficulties documented by previous research (Gotwals and Songer 2009; McNeill and Krajcik 2011; Schunn and Anderson 1999). While we could make the assessment of these skills easier by designing activities that only target one skill at a time, this would be a much less authentic way of conducting inquiry. This work attempts to disentangle the effects of learning support delivered via automatic scaffolds that apply to individual sub-skills in an environment where multiple performance-based skills are being practiced and assessed at once. This gives us the nuance to examine these complex practices (as set forth by NGSS) and allows us to look at specifically what aspects students are having difficulty with and work to target those exact difficulties before moving on to students’ claims, evidence, and reasoning.

Lastly, this work provides a scalable solution toward the assessment and scaffolding of these practices and in doing so represents a scalable solution to supporting teachers and students in NGSS practices.