Introduction

Need for computer assisted instruction for teaching basic statistics concepts

One of the key concepts in research methodology is the concept of a variable. Students fail or make slow progress in research methods and basic statistics courses if they have trouble identifying types of variables. The following remark comes from an experienced basic statistics course instructor: “I suspect that types of variables in statistics are tough for students to grasp. Rereading the standard definitions, they don’t seem to make any sense unless they know first what a variable is.” Students would benefit from an effective, low-cost online instructional program that they could use at their convenience to bring themselves up to speed on the variable concept.

Historically, computer-assisted statistics instruction (CAI) has had the potential to increase student engagement with course content because it allows interactivity and self-paced learning (Larreamendy-Joerns et al. 2005). The statistics teaching software used in classrooms ranges from demonstrations and occasional interactive exercises to completely Web-based courses (Symanzik and Vulkasinovic 2006).

Although the use of CAI in statistics education has grown to 80 % of universities, research on its effectiveness in teaching statistics has produced conflicting findings (Larwin and Larwin 2011). In their meta-analysis, the authors found that CAI has a moderate impact on student achievement and that “it is not a panacea for all the ills that might plague statistics education” (p. 272), but that it can serve as a very beneficial enhancement or supplement to small class sections (Larwin and Larwin 2011).

According to some scholars, technology needs to be used not only for computing numbers but also for concept exploration, which would enhance student learning (Moore 1997; Friel 2007; Garfield et al. 2000). Chance et al. (2007) suggest that technology has the potential to expand the range of visualization techniques that can help learners understand concepts.

Sklar and Zwick (2009) provided recommendations for designing animated presentations in statistics and instructional approaches for teaching the material to non-statistics majors. One of their recommendations is to segment animated presentations into scenes rather than use a single continuous presentation. The authors also emphasize embedding questions in the presentation to assess the viewer’s understanding of the concept. Finally, they contend that more empirical research is needed to identify the best methods for teaching statistics to non-statistics majors in multimedia environments.

Wender and Muehlboeck (2003) investigated whether computer-animated graphics were more effective than static images in teaching statistics. Four statistical concepts were presented and explained to students in class, with graphics in either static or animated form. Student scores on retention and understanding of the concepts presented were significantly higher when animated graphics were used. The authors suggested that animations designed for teaching concepts should focus on the functional relationships among sub-concepts and present their interdependencies explicitly (Wender and Muehlboeck 2003).

“Given the increased emphasis on stand-alone tools and distance learning, research on the role of interactivity, engagement, and feedback takes on increased importance as educators continue work on improving the efficacy of technology-based statistics instruction.” (Sosa et al. 2011, p. 122). To address this need, a low-cost online program for teaching the concept of variables (referred to here as the Program) was designed and evaluated for its overall effectiveness. The Program was designed as a research platform for testing and evaluating the implemented program features and instructional design decisions.

Systemic approach to design engineering of the program

Many researchers have noted the need for research on specific design features and strategies in computer-assisted instruction. In their meta-analysis, Sosa et al. (2011) suggested that research on computer-assisted instruction in statistics would benefit from closer collaboration on measurement and design across researchers. They emphasized the importance of research focused on specific features of interest.

Ormel et al. (2012) noted the need for research on the specific strategies used to produce new theoretical understanding through the design of instructional solutions. In their meta-analysis, Bernard et al. (2009) underscored the importance of studies that focus on specific features of interest; they identified interactions of design features with learner characteristics, and more sensitive approaches to testing the impact of learner-based features in computer-assisted instruction, as key directions for future research.

While conducting this study, we worked systematically and simultaneously toward the dual goals of solution development and theoretical understanding by studying what happened when those solutions came to life in real classrooms (McKenney and Reeves 2012). This case study uses a systemic approach: existing knowledge is used to construct a web-based instructional solution, and a series of formative evaluations of the implemented solution is conducted to ensure the effectiveness of the developed product. The following systematic approach to incorporating existing knowledge in the Program design was implemented.

  • Identifying the benefits of multimedia and instructional methods that can potentially contribute to the effectiveness of stand-alone instructional software

  • Defining potentially effective Program features and instructional design decisions with respect to the learning task (teaching the concept of variable) and the target population (low and high prior knowledge students in an Educational Psychology class and two Basic Statistics classes).

  • Implementing the evidence-based research findings in the design of the Program.

  • Building the Program as a research platform that allows online data collection about program use and student progress.

The following systematic approach to the evaluation of the Program was used.

Evaluation 1 (see Fig. 1).

  • Evaluating the overall effectiveness of the Program by comparing student performance in each of the experimental conditions to the no-treatment condition.

  • Testing possible solutions for the key program features in experimental conditions.

  • Collecting information about other potentially effective program features and instructional strategies through surveys. The findings serve as a rationale for further design experiments in the second round of the evaluation series, if needed (Evaluation 2, 3, etc.).

  • Collecting and analyzing the above data with respect to the student level of prior knowledge.

Evaluation 2 (see Fig. 2).

  • Based on student perceptions of other program features (see Evaluation 1), identifying the features (if any) to be tested in the experimental conditions of Evaluation 2.

  • Collecting information about other potentially effective program features and instructional strategies through surveys. The findings serve as a rationale for further design experiments in the third round of the evaluation series, if any.

  • Collecting and analyzing the above data with respect to students’ level of prior knowledge.

Fig. 1 Systemic approach for designing effective instructional multimedia

Literature review

Knowledge sources informing the design of instructional strategies in the program

On the one hand, for software to be effective, its instructional principles must be consistent with what is known about how people learn (Mayer 2008). On the other hand, multimedia can promote meaningful learning by varying both the number of representations provided to students and the degree of student interactivity (Moreno and Valdez 2005). The literature review was conducted with both of these aspects of multimedia in mind.

Compare and contrast strategy for teaching concepts

Concept learning can be facilitated by an instructor (or software) that presents examples and non-examples, and by having students solve problems that require comparing the defining features of concepts. According to Richards and Godfarb (1986), “concept reasoning can be mastered through central tendency information, logical rules or single episodes depending upon which of these is activated in a particular task situation.”

In their instructional design model, Tennyson and Cocchiarella (1986) identified two basic relationships between concepts: successive and coordinate. They recommended that when teaching coordinate concepts, giving students the opportunity to compare and contrast examples of one coordinate concept with examples of another helps them develop discrimination skills. Thus, coordinate concepts need to be presented simultaneously (Tennyson and Cocchiarella 1986).

Litchfield (1987) also recommends presenting a set of concepts simultaneously. Even though the attributes of different concepts are easily confused, this kind of presentation prompts students to compare and contrast the similarities and differences between concepts and helps them clarify the individual concepts (Litchfield 1987).

Taking into consideration the limited capacity of working memory, students can manipulate no more than four information elements at one time (Miller 1956). Thus, presenting no more than four concepts at a time can potentially benefit students.

Types of visuals

Previous research comparing student learning with text and illustrations versus animation and narration has produced inconsistent findings. On the one hand, several studies have suggested that learning is enhanced in computer-based animation environments (Park 1994; Tversky et al. 2002). For example, learners with limited spatial ability benefit from animations because they may have trouble mentally animating how a complex system works from a series of static diagrams (Hegarty and Sims 1994).

Some empirical results have indicated that animation is superior to static images in terms of retention and transfer of information (Mayer and Moreno 2002; Craig et al. 2002; Moreno et al. 2001). Mayer and Moreno (2002) stressed that animation design should be grounded in the principles of the cognitive theory of multimedia learning; such design has the potential to promote learner understanding. Gulz and Haake (2006) found that animations can make the learning experience more engaging. Animation appears to be most effective when presenting concepts or information students may have difficulty visualizing (Betrancourt 2005; Narayanan and Hegarty 2002).

On the other hand, in many studies dealing with abstract, scientific, or technical content, animation did not prove beneficial compared to static pictures (Tversky et al. 2002). Lowe (2003) showed that low prior knowledge students often focus on perceptually salient rather than thematically relevant features of an animation. Clark and Mayer (2007) recommended using static illustrations unless there is a compelling instructional rationale for animation. Hasler et al. (2007) noted possible disadvantages of poorly designed animations: animation may impose greater cognitive processing demands than static visuals when critical objects and their relations disappear during the animation.

Research on learner-controlled pacing and segmentation of animation has also produced inconsistent findings. Whereas some researchers observed positive effects of pacing and segmentation (Mayer et al. 2003), others found that system-paced instruction was more beneficial for learning (Tabbers et al. 2004). Plotzner and Lowe (2004) noted that animations are frequently used for their attention-gaining effect, and that we know little about how animation needs to be designed in order to facilitate learning. One suggestion for building effective instructional animations is to construct them in ways that tap the positive features of static illustrations (Mayer et al. 2005).

Program feedback

Computer-based interactive learning environments engage the learner in an interactive learning exchange. Computer-generated feedback can support this interaction because it provides learners with information they may use to correct errors (Valdez 2012). The effects of different types and forms of informative feedback have been investigated in multiple instructional contexts and have yielded inconsistent findings (see the reviews by Azevedo and Bernard 1995; Bangert-Drowns et al. 1991; Butler and Winne 1995; Clariana 1993; Mason and Bruning 2001; Mory 1992, 1996, 2004). Verification (correct/incorrect) feedback alone has not been very effective at promoting learning (Bangert-Drowns et al. 1991; Moreno 2004).

Most researchers believe that the program feedback that facilitates the greatest gains in learning must include both verification and elaboration; this combination strengthens students’ correct responses and is more effective than simply providing correct/incorrect feedback (e.g., Bangert-Drowns et al. 1991).

As to optimal feedback timing, there is substantial disagreement among researchers on whether feedback should be given immediately after each test question is answered (Keller 1983) or delayed, giving errors a chance to dissipate (Kulhavy 1977; Kulhavy and Anderson 1972; Kulhavy and Stock 1989). According to Keller’s recommendations, formative (corrective) feedback needs to be provided when it will be immediately useful (Keller 1983).

In contrast, Butler and Roediger (2008) found that delayed feedback benefited students when given soon after the test rather than immediately after each item was answered. A meta-analysis of 53 studies by Kulik and Kulik (1988) showed that studies using actual classroom materials and quizzes found immediate feedback more effective than delayed feedback, whereas in laboratory studies delayed feedback was more effective.

Another important aspect of feedback, the number of steps, was addressed by Spector et al. (2008), who emphasized the need for research on individual and situational conditions regarding the number of feedback steps and cycles. In his review of 30 studies, Clariana (1993) compared single-try feedback types (immediate knowledge of result, immediate knowledge of correct response, delayed feedback, no feedback) with multiple-try feedback and did not find overall differences between single-try and multiple-try feedback. With regard to prior knowledge differences, however, multiple tries are most effective with high prior knowledge students, and a single attempt with correct-answer feedback is most effective with low prior knowledge students (Clariana 1993).

Program design based on evidence-based findings from previous research on teaching the concept of variables

The Program was designed as a set of 20 scenarios, an “extensive network of episodes involving the concept” (Richards and Godfarb 1986, p. 34). Students needed to identify the independent variable (IV), dependent variable (DV), controlled variable or constant (CV), and the levels of the independent variable (LIV). In addition, theory explanation pop-ups (brief explanations of the types of variables and examples of their use) were available on each screen so that learners could develop their concept reasoning based on “central tendency information, logical rules” (Richards and Godfarb 1986, p. 34).

The number of types of variables to discriminate among increased gradually through the Program: 2 variables in scenario 1, 3 variables in scenarios 2–8, and 4 variables in scenarios 9–20. A screenshot of a problem scenario is presented in Fig. 2.
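
Purely as an illustration (this is not the authors’ data structure), the scenario progression just described could be encoded as follows:

```python
# Hypothetical sketch of the difficulty progression described above: the number
# of variable types each scenario asks the learner to discriminate among.
SCENARIO_VARIABLE_TYPE_COUNTS = [2] * 1 + [3] * 7 + [4] * 12  # scenarios 1, 2-8, 9-20
assert len(SCENARIO_VARIABLE_TYPE_COUNTS) == 20
```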

Fig. 2 A screenshot of a problem scenario

Visuals

The animations on the screens end as still images showing the completed state of the process. This strategy compensates for the fact that human perception is limited by the speed of neural processing of temporally changing information (Betrancourt 2005). The use of animation is not excessive: the animations emphasize the functional relationships between variables and make them explicit by portraying interdependencies (Wender and Muehlboeck 2003). Pointing arrows and highlighting of important information are used to compensate for the fact that low prior knowledge students often focus on perceptually salient rather than thematically relevant features of information (Lowe 2003).

To be able to compare and decide on the type of visuals to implement in the final version of the Program, we included three formats of problem scenario presentation in the program design:

  • Text scenarios augmented with animation showing the functional relationships between variables (8 problems)

  • Text scenarios augmented with still images (8 problems)

  • Text only scenarios (4 problems)

Program feedback

The design of the program feedback (see Fig. 2) included the following features:

  1. The Program uses response-contingent feedback to provide both verification (knowledge of result, KR, presented as smiley faces) and item-specific elaboration (elaborative feedback, EF, presented as text) that explains the answer. This decision was based on recommendations to use immediate feedback for retention of procedural or conceptual knowledge (Shute 2008; Kulik and Kulik 1988).

  2. Students receive feedback regardless of the correctness of their response, so that the learner can develop understanding even when a correct answer was simply a guess.

  3. A strategy of engaging the learner in mindful processing of the program feedback was implemented.

The types of variables in the training were manipulated in such a way that an item classified as an independent variable in problem scenario A became a controlled variable (constant) in scenario B. This strategy put students in situations where the probability of a correct answer was low. According to Kulhavy and Stock’s certitude model of feedback (Kulhavy and Stock 1989), learners who are informed that their answer is wrong when they are confident it is correct will “exert much effort to find out what was remiss in their thinking” (Mory 2004, p. 749). This strategy makes the learner mindfully engage with the feedback content and pay more attention to the defining versus non-defining features of variables.

Research hypotheses

Hypotheses regarding overall effectiveness of the program

Hypothesis 1a

The Program will facilitate retention of the concept of variable, and thus the average knowledge gain in the control condition will be statistically significantly lower than in each of the experimental conditions (single try condition or two tries condition).

Hypothesis 1b

The difference in average knowledge gain between the single try condition and the two tries condition will be statistically significant. Since the only difference between the experimental conditions is feedback type, higher knowledge gains would indicate the more effective type of feedback for the Program.

Hypotheses regarding the effect of feedback type and student level of prior knowledge on knowledge retention

Hypothesis 2a

There will be a statistically significant difference in knowledge gain between low prior knowledge (LPK) students in the single try condition and the two tries condition.

Hypothesis 2b

There will be a statistically significant difference in knowledge gain between high prior knowledge (HPK) students in the single try condition and the two tries condition.

Hypotheses regarding the effect of student level of prior knowledge on student perceptions of program helpfulness

Hypothesis 3a

There will be a statistically significant difference in student ratings of their satisfaction with the Program between the LPK students in the single try condition and the two tries condition.

Hypothesis 3b

There will be a statistically significant difference in student ratings of their satisfaction with the Program between the HPK students in the single try condition (C2) and the two tries condition.

Hypotheses regarding the effect of problem presentation format on student perceptions of program helpfulness

Hypothesis 4a, 4b, and 4c

There will be a statistically significant difference in student survey ratings of how the formats of problem presentation (animation plus text, still image plus text, text-only) helped them remember the key concepts (Hypothesis 4a), understand them (Hypothesis 4b), and maintain their attention (Hypothesis 4c).

Student perceptions of instructional strategies

Which instructional strategies did the students find the most and the least helpful? To answer this question, descriptive statistics were used to analyze the student survey ratings.

Methods

Participants

Participants were 90 undergraduate students (27 male, 63 female) at a Midwestern university in the USA (average age: males 20.0, SD = 2.00; females 21.0, SD = 7.13). They came from two Basic Statistics courses for non-statistics majors and an Educational Psychology course. Students were not required to perform at any criterion level on any measure given during the experiment in order to receive course credit.

Treatments

Hypotheses 1a and 1b

The three comparison groups (Condition 1, Condition 2, and Condition 3) consisted of 30, 29, and 31 students, respectively. Condition 1 was the control group (no-treatment condition); Condition 2 and Condition 3 were the experimental conditions. The dependent variable was student knowledge gain, measured as the difference between the pre-test and the delayed post-test (taken 5 days after the training). The independent variable was program feedback type: single try in Condition 2 and two tries in Condition 3; the only difference between the experimental conditions was this feedback type. Single try feedback consisted of knowledge of result (KR) feedback presented as green or red smiley faces (for correct or incorrect responses, respectively). Two tries feedback presented the correct/incorrect status on the first try, and both the correct/incorrect status and the explanatory feedback (an explanation of the correct answer) on the second try (see Fig. 2).
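
A minimal sketch of the two feedback regimes as described in this paragraph (the function and value names are hypothetical, not the Program’s actual code):

```python
# Illustrative only: single try shows knowledge-of-result (KR) smiley feedback;
# two tries shows KR on the first attempt and KR plus an explanation on the second.
def feedback_for(response_is_correct: bool, attempt: int, condition: str) -> dict:
    """Feedback to display after one answer submission; condition is 'single_try' or 'two_tries'."""
    kr = "green_smiley" if response_is_correct else "red_smiley"
    if condition == "single_try":
        return {"kr": kr, "elaboration": None, "allow_second_try": False}
    if attempt == 1:  # two_tries, first attempt: verification only
        return {"kr": kr, "elaboration": None, "allow_second_try": not response_is_correct}
    # two_tries, second attempt: verification plus explanatory feedback
    return {"kr": kr, "elaboration": "explanation_of_correct_answer", "allow_second_try": False}
```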

Student knowledge gain was compared between Condition 1 and Condition 2, and between Condition 1 and Condition 3, to test the overall effectiveness of the Program (Hypothesis 1a). Knowledge gain was compared between the two experimental conditions to determine the more effective type of feedback, single try or two tries (Hypothesis 1b).

Hypotheses 2a and 2b

We conducted a factorial experiment with a 2 × 2 between-subjects design. One factor was the type of feedback, single try or two tries; the other was the level of student domain-specific prior knowledge, low or high. We compared the knowledge gain of the LPK students and of the HPK students in the two experimental conditions, Condition 2 (LPK, n = 15; HPK, n = 14) and Condition 3 (LPK, n = 17; HPK, n = 14).

Hypotheses 3a and 3b

A second factorial experiment with the same 2 × 2 between-subjects design was conducted, with feedback type (single try or two tries) and level of domain-specific prior knowledge (low or high) as the factors. We compared the average ratings of satisfaction with the Program between the single try condition (LPK, n = 15; HPK, n = 14) and the two tries condition (LPK, n = 17; HPK, n = 14), separately for the LPK and the HPK students.

Hypotheses 4a, 4b, and 4c

Each of the three comparison groups included both LPK and HPK students. The independent variable was the format of problem scenario presentation: animation plus text, still image plus text, or text only. The dependent variable was student ratings of the following survey items: how the Program helped them remember the key concepts (Hypothesis 4a), understand the key concepts (Hypothesis 4b), and maintain their attention (Hypothesis 4c).

Student perceptions of instructional strategies

The helpfulness of the instructional strategies implemented in the Program (learning from theory explanation pop-ups and learning from program feedback) was rated by both the LPK and HPK students in both experimental conditions (n = 60). Descriptive statistics were used to analyze the obtained data.

Instruments

The data collection instruments for each condition are presented in Table 1.

Table 1 Experimental conditions and data collection instruments

Tests

Student learning performance was assessed with a pre-test taken before the training and a delayed post-test taken on the 5th day after the training. The same test was used for the pre-test and the post-test. Each test item was designed as a problem scenario. In each scenario, participants were asked to make five selections from five dropdown menus: independent variable (IV), dependent variable (DV), controlled variable (CV), level of independent variable (LIV), and “I want to know”. The fifth choice, “I want to know”, was added to discourage random answers (see Fig. 3). Each correct answer was scored as one point, for a maximum score of 50 points.
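
As an illustration of the scoring rule (a hypothetical helper, not the Program’s code), each of the ten scenarios contributes up to five points, one per correct dropdown selection:

```python
# Illustrative scoring: responses and answer_key are lists of 10 dicts keyed by
# 'IV', 'DV', 'CV', 'LIV', and 'I want to know'; one point per matching selection.
def score_test(responses, answer_key):
    return sum(
        1
        for given, correct in zip(responses, answer_key)
        for slot in correct
        if given.get(slot) == correct[slot]
    )  # maximum: 10 items x 5 selections = 50 points
```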

Fig. 3 A screenshot of a test item

Each test item was designed to test student knowledge of the defining features of independent, dependent, and controlled (constant) variables and the level of the independent variable, and asked students to differentiate among them. The problem scenarios came from various fields: science, education, health care, nutrition, business, engineering, cooking, etc. This strategy was used to support future transfer of the acquired concepts to a variety of contexts. The test was found to be highly reliable (10 test items, 5 questions in each test item; α = .84).

Surveys

Students’ perceptions of the overall effectiveness of the Program and of individual program features were collected through Survey 1 (the Post-training Survey), taken immediately after the training (see Fig. 4 in the Appendix).

Survey 1 consisted of 11 items that measured student opinions of the program features and overall experience with the Program. A 5-point rating scale was used (5 = strongly agree, 1 = strongly disagree). The survey was found to be reliable (11 survey items, 3–6 questions in each survey item; α = .77).
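
For readers who want to reproduce the reliability figures on their own data, a minimal sketch of Cronbach’s alpha (the standard formula; not the authors’ analysis script) is shown below:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix (rows = respondents, columns = items)."""
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of respondents' total scores
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)
```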

Survey items 1–3 allowed the researchers to collect data about how specific instructional strategies (learning through explanatory feedback, learning by solving problems, learning through theory explanations) and specific formats of problem scenario presentation (animation plus text, still image plus text, text only) helped students learn about types of variables. The data were collected with respect to the following types of information processing:

  • Helped the student recall the information learned through the Program (remembering)

  • Helped the student identify the variable type (understanding)

  • Helped the student maintain their attention

Survey items 4–6 collected student perceptions of the overall effectiveness of the Program. Survey items 7–11 helped the researchers understand how individual program features were perceived by students, to make sure that none of the features was redundant (i.e., received low ratings).

Survey 2 (the Follow-up Survey) was administered immediately after the post-test (5 days after the training episode). Survey 2 was a shortened version of Survey 1 that included only the first two items. The assumption was that the difference in ratings between the surveys would indicate how well the program features helped students retain the information; for example, if the ratings went up or stayed the same, this could indicate that the Program helped the students retain the information.

Procedure

The whole evaluation was done online, including the sign-up process, the pre-test, assignment to conditions, the training episode, the post-training survey (Survey 1) administered immediately after the training, the delayed post-test (5 days after the training), and the follow-up survey (Survey 2) administered immediately after the delayed post-test. Students were allowed to participate in the study at their convenience over a period of 3 weeks; the only requirement was to follow the timeline.

The instructors of three undergraduate courses asked their students to volunteer to participate. As part of the recruitment process, students were asked not to use any instructional materials for studying types of variables other than the Program training (20–40 min) before or during the research study.

Participants received no formal instruction on the concept of variable either before the training or during the period between the training and the delayed post-test. The researchers reminded the participants by email to take the delayed post-test on the 5th day after the training.

The participants signed into the Program in a random order at their convenience. The following schema was used to assign students to conditions, with the goal of having the same proportion of high and low prior knowledge students in each condition. Based on the pre-test results, the Program assigned each participant low prior knowledge (LPK) or high prior knowledge (HPK) status: participants with a total pre-test score of 25 of 50 or lower were considered LPK students, and those with a score of 26 of 50 or higher were considered HPK students.

As the participants signed in, they were assigned sequentially to Conditions 1, 2, 3, 1, 2, 3, and so on within their prior knowledge stratum. The first LPK student who signed in was assigned to Condition 1, the second LPK student to Condition 2, and so on; the same procedure was used for the students identified by the Program as HPK students. Thus, regardless of LPK or HPK status, participants were assigned to Conditions 1, 2, and 3 sequentially.
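
A minimal sketch of this assignment schema (hypothetical names; not the Program’s actual code), combining the pre-test cutoff with round-robin assignment within each stratum:

```python
from itertools import count

# One independent counter per prior knowledge stratum.
_counters = {"LPK": count(), "HPK": count()}

def assign(pretest_score: int) -> tuple[str, int]:
    """Return (prior knowledge status, condition number) for a newly signed-in participant."""
    stratum = "LPK" if pretest_score <= 25 else "HPK"  # 26+ of 50 counts as high prior knowledge
    condition = next(_counters[stratum]) % 3 + 1       # rotates 1, 2, 3, 1, 2, 3, ...
    return stratum, condition
```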

The participants accessed the Program at home at their convenience. Time on task was recorded. The average time spent on the pre-test was 14.0 min (HPK: 14.6 min; LPK: 13.6 min) and 15.0 min (HPK: 17.5 min; LPK: 12.7 min) in the single try condition and the two tries condition, respectively.

The average time spent on the post-test was 11.3 min (HPK: 10.6 min; LPK: 11.8 min) and 11.9 min (HPK: 12.3 min; LPK: 11.6 min) in the single try condition and the two tries condition, respectively. The average time spent on the training episode was 23.8 min (HPK: 22.4 min; LPK: 24.8 min) and 28.2 min (HPK: 25.5 min; LPK: 30.5 min) in the single try condition and the two tries condition, respectively.

Evaluation 1 results

Data analysis

Data from ten students were excluded from the statistical analysis because these students did not follow the timeline: three took the post-test on the third or fourth day, four did not take the post-test, and three took the post-test 6–9 days after the training.

In this study, we first determined the overall effectiveness of the Program with respect to each feedback type. Second, we investigated the effectiveness of the Program with respect to both feedback type and student prior knowledge. Third, student perceptions of the overall program design were analyzed in relation to student level of prior knowledge. Fourth, student perceptions of how the format of problem presentation (animation plus text, still image plus text, text-only) helped them remember and understand the key concepts and maintain their attention were analyzed. Finally, average student ratings of the implemented instructional strategies were determined.

Statistical results

Hypotheses 1a & 1b

A one-way between-subjects ANOVA was conducted to compare the effect of the program training on students’ knowledge gain between the pre-test and delayed post-test for Condition 1 (C1), Condition 2 (C2), and Condition 3 (C3). Pairwise tests of the hypotheses were conducted using a Bonferroni-adjusted alpha level of .017 per test (.05/3). There was a significant effect on student knowledge gain [F(2, 87) = 34.2, p < 0.001]. The results indicated that the knowledge gain was significantly lower in C1 (M = 1.54, SD = 2.52) than in both C2 (M = 14.9, SD = 9.16, d = −1.99) and C3 (M = 16.06, SD = 8.58, d = −2.30). Hypothesis 1a was confirmed. The pairwise comparison of the knowledge gain in C2 and C3 was non-significant [F(2, 87) = 1.13, p = 0.838, d = −0.128]. Therefore, Hypothesis 1b was rejected. The knowledge gain between the pre-test and delayed post-test was 30.8 % in C2 and 30.0 % in C3.
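
The analysis just reported can be sketched as follows (simulated data stand in for the study’s scores; this is not the authors’ script):

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Standardized mean difference with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled_sd

# Knowledge gain = delayed post-test minus pre-test; simulated arrays, one per condition.
rng = np.random.default_rng(0)
c1, c2, c3 = rng.normal(1.5, 2.5, 30), rng.normal(14.9, 9.2, 29), rng.normal(16.1, 8.6, 31)

f_stat, p_omnibus = stats.f_oneway(c1, c2, c3)   # one-way between-subjects ANOVA
print("omnibus:", round(f_stat, 2), round(p_omnibus, 4))
alpha_adjusted = 0.05 / 3                        # Bonferroni adjustment for three pairwise tests
for label, (a, b) in {"C1 vs C2": (c1, c2), "C1 vs C3": (c1, c3), "C2 vs C3": (c2, c3)}.items():
    t_stat, p = stats.ttest_ind(a, b)
    print(label, round(cohens_d(a, b), 2), "significant" if p < alpha_adjusted else "not significant")
```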

Hypotheses 2a & 2b

Because of the small number of LPK students in each experimental condition (15 in Condition 2 and 17 in Condition 3) and of HPK participants (14 in Condition 2 and 14 in Condition 3), non-parametric tests were used to analyze the effects of feedback type within each prior knowledge level. Two-sample Wilcoxon rank-sum (Mann–Whitney) tests were computed, one within each prior knowledge level. The results suggested no statistically significant difference between the underlying distributions of the knowledge gain scores of the LPK students in the single try condition (M = 19.67, SD = 9.78) and in the two tries condition (M = 19.2, SD = 9.11) (Z = 0.189, p = 0.850, d = 0.052).

As to the HPK students in Condition 2 (M = 9.86, SD = 4.88) and Condition 3 (M = 12.29, SD = 6.29), their knowledge gain scores were not significantly different either (Z = −0.761, p = 0.447, d = −0.431).
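
A sketch of the corresponding non-parametric comparison (illustrative only; the arrays would hold the actual gain scores):

```python
from scipy import stats

def compare_gains(gain_single_try, gain_two_tries):
    """Two-sample Wilcoxon rank-sum (Mann-Whitney) test of knowledge gain between conditions."""
    u_stat, p_value = stats.mannwhitneyu(gain_single_try, gain_two_tries, alternative="two-sided")
    return u_stat, p_value

# Run once for the LPK students and once for the HPK students, e.g.:
# compare_gains(lpk_gain_c2, lpk_gain_c3); compare_gains(hpk_gain_c2, hpk_gain_c3)
```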

Hypotheses 3a & 3b

Two-sample Wilcoxon rank-sum (Mann–Whitney) tests were computed, one within each prior knowledge level. The results suggested a statistically significant difference between the underlying distributions of the average satisfaction ratings of the LPK students in the single try condition (M = 4.07, SD = 0.41) and in the two tries condition (M = 4.38, SD = 0.32) (Z = 2.1714, p = 0.030, d = −0.837). On average, the LPK students in the two tries condition were more satisfied with the Program than those in the single try condition.

As to the HPK students, there was no statistically significant difference in ratings between the single try condition (M = 4.29, SD = 0.43) and the two tries condition (M = 4.54, SD = 0.45) (Z = −1.447, p = 0.147, d = −0.588). In both conditions, the HPK participants were highly satisfied with the Program. The average student ratings of overall satisfaction with the Program are presented in Appendix Table 3.

Hypotheses 4a, 4b, & 4c

A one-way ANOVA was conducted to compare the effects of three formats of problem presentation (animation plus text, still image plus text, text-only) on each of the following dimensions: helping students remember and understand the concept of variables and helping students maintain their attention.

There was a significant effect of problem presentation format on students’ ability to remember the information learned through the Program [F(2, 57) = 17.1, p = 0.001]. The results indicated that the ratings were significantly lower for the text-only format (M = 3.15, SD = 0.81) than for text augmented with a still image (M = 3.74, SD = 0.82) or text augmented with animation (M = 4.08, SD = 1.00). Therefore, Hypothesis 4a was confirmed.

There was a significant effect of problem presentation format on students’ ability to understand the information learned through the Program [F(2, 57) = 16.76, p = 0.001]. The ratings were significantly lower for the text-only format (M = 3.16, SD = 0.88) than for text augmented with a still image (M = 3.77, SD = 0.89) or text augmented with animation (M = 4.13, SD = 1.01). Therefore, Hypothesis 4b was confirmed.

There was a significant effect of problem presentation format on students’ ability to maintain their attention while learning through the Program [F(2, 57) = 51.00, p = 0.001]. The ratings were significantly lower for the text-only format (M = 2.53, SD = 1.11) than for text augmented with a still image (M = 3.70, SD = 1.05) or text augmented with animation (M = 4.48, SD = 1.03). Therefore, Hypothesis 4c was confirmed. The average student survey ratings of problem scenario presentation format, by type of cognitive processing, are presented in Appendix Table 4.

Student perceptions of instructional strategies

Descriptive statistics were used to analyze student perceptions of the most and least helpful program features. The results are presented in Table 2. Student ratings of the implemented instructional strategies were high regardless of level of prior knowledge.

Table 2 Means (and SD) of students’ survey ratings of their perceptions of the effectiveness of the instructional strategies implemented in the Program (1 = strongly disagree, 5 = strongly agree)

Rationale for conducting evaluation 2

The results of Evaluation 1 indicated statistically significant differences between student ratings of the different formats of problem scenario presentation (animation plus text, still image plus text, and text-only). The differences were found on the following dimensions: how the Program helped students remember the key concepts (Hypothesis 4a), understand the key concepts (Hypothesis 4b), and maintain their attention (Hypothesis 4c). We concluded that this effect needs further testing in Evaluation 2 (see Fig. 1). The format of problem scenario presentation will be tested in three experimental conditions in Evaluation 2, with the presentation format as the only difference between the conditions. This will allow the designers to make the final decision on the most effective type of problem scenario presentation, which will then be implemented in the final version of the software. The second round of the evaluation (Evaluation 2) will be conducted according to the plan described in the Introduction section of this manuscript (see Fig. 1).

Discussion

This paper describes the design and evaluation of the Program for teaching types of variables. The Program was designed as a research platform. A systemic approach to program design and evaluation allowed careful consideration of the elements that made the teaching principles used in the Program effective.

The results of this evaluation can be used to guide program design and further evaluation. They cannot be generalized to larger populations because of the small number of participants in the sub-conditions (Condition 2, LPK: 15 students; Condition 2, HPK: 14 students; Condition 3, LPK: 17 students; Condition 3, HPK: 14 students) and the resulting insufficient power to detect small differences.

Based on the results of the evaluation, the Program can facilitate retention of the variable concept regardless of feedback type: five days after the training, students demonstrated a significant knowledge gain. The results showed that the number of attempts allowed in the program feedback did not have a significant effect on knowledge gain for either the LPK or the HPK students. In contrast, according to Clariana’s (1999) review of studies comparing AUC (answer-until-correct) feedback with STF (single-try feedback), “For low prior knowledge students, single-try feedback was more effective than multiple-try feedback, ES = 0.11” (p. 88).

Interestingly, the LPK students preferred the two tries feedback even though they did not show higher knowledge gain; this phenomenon needs further investigation. As to the HPK students, there was no statistically significant difference in student ratings of satisfaction with the Program between the single try condition and the two tries condition. In contrast, according to Clariana (1999), “for high prior knowledge learners, multiple-try feedback was better, ES = 0.39” (p. 88).

Student perceptions of how the format of problem presentation (animation plus text, still image plus text, text-only) helped them remember and understand the key concepts and maintain their attention indicated the need for the next stage of the evaluation (Evaluation 2). Students gave higher ratings to problem scenarios augmented with animation than to those augmented with still images or presented as text only. There was a significant effect of problem presentation format on students’ ability to remember the information learned through the Program [F(2, 57) = 17.1, p = 0.001], understand the information [F(2, 57) = 16.7, p = 0.001], and maintain their attention [F(2, 57) = 51.00, p = 0.001]. These findings contradict the results obtained in other studies dealing with abstract, scientific, or technical content (Tversky et al. 2002). A possible explanation is that the animations in the Program were built in ways that included the positive features of static images (Mayer et al. 2005).

The instructional strategies used in the Program received high ratings from both LPK and HPK students, which may indicate that direct instruction through theory explanations and immediate explanatory program feedback were important for students’ success. No redundant (low-rated) program features were detected during the evaluation.

Surprisingly, when student ratings in Survey 1 (administered immediately after the training) and Survey 2 (administered after the post-test, 5 days after the training) were compared, the ratings in Survey 2 were higher (see Appendix Tables 5, 6). Since Survey 2 was a shortened version of Survey 1 and was taken immediately after the post-test, these findings could serve as an indication that the Program helped students retain the concept of variables.

Conclusion

The systemic approach used for the evaluation of the Program allowed the designers to limit the design space of potential features to the small subset that needed to be explored. It helped the designers decide on the inclusion, exclusion, and further investigation of candidate program features; the most effective ones will be included in the final version of the Program.

Based on the results of Evaluation 1 presented in this paper, the overall effectiveness of the Program was established. As indicated by the significant knowledge gain, the Program can facilitate retention of the variable concept regardless of the type of feedback (single try or two tries) and the level of student prior knowledge (low or high). Students found all the program features helpful, and no redundant features were detected. Had that not been the case, the systemic approach would have helped us identify the features that did not contribute to student success.

The systemic approach was also helpful in identifying the specific features that contributed to student knowledge gain. Based on the results of Evaluation 1, we can conclude that it is not the feedback type that made the difference. The results of Evaluation 1 also demonstrated that students, regardless of their level of prior knowledge, found animated problem scenarios more helpful than scenarios augmented with still images or text-only scenarios. The effect of problem presentation format on students’ knowledge gain will be tested in Evaluation 2, after which the final decision will be made on which problem scenario format will be used in the final version of the Program.

Limitations and implications for further research

The major limitation of this study is insufficient power to detect small differences that might underlie the phenomenon. For this reason, the findings about the lack of significant differences in knowledge gain between the LPK students in Conditions 2 and 3, and between the HPK students in Conditions 2 and 3, should be treated with caution. The findings from this research cannot be generalized to larger populations.

The effect sizes for the analysis of student knowledge gain (d = 1.99 for the single try condition and d = 2.30 for the two tries condition) exceeded Cohen’s (1988) convention for a large effect (d = .80). The combination of instructional methods and program features may have contributed to this result. Future research is needed to disentangle which features of the computer treatments contributed to the differences in test performance.

Other factors worthy of future study include the format of problem scenario presentation (text augmented with animation, text augmented with still images, and text only). Further research is also needed to determine the effectiveness of the Program on higher-level learning criterion measures such as transfer tasks.

Another limitation of the study is gender imbalance: more than two-thirds of the participants were female. Even though we diversified the selection of problem scenarios, supporting animations, and images, and ensured their high quality, differences between males’ and females’ perceptions of the visuals used remain a possibility.

The contribution of this research

This research demonstrates a holistic method of conducting a formative evaluation of newly designed software while using the software as a research platform. Conducting repeated formative evaluations of instructional software can be resource intensive. The initial formative evaluation described here demonstrates the use of a framework that allows the designer to limit the number of formative evaluation studies by collecting information not only on student knowledge gain, but also on specific program features of interest and on the interaction of design features with student level of prior knowledge. This formative evaluation demonstrates how to conduct design experiments in the background and refine instructional software design in a non-intrusive fashion.