Introduction

The American Association for the Advancement of Science (AAAS) and the National Science Foundation (NSF), in their recent (2010) call to action, Vision and Change, along with an older but still resonating call by the National Research Council in the National Science Education Standards (1996), suggest that a more active learning approach can lead to greater gains in true conceptual understanding as well as greater retention in the STEM subjects. Following this advice, many instructors have worked to transform their instructional approaches to reflect a more active, student-centered philosophy. However, many default to previously written exams based on recalling massive amounts of biological facts rather than focusing on the science process skills they are trying to teach (Momsen et al. 2013). Unfortunately, there are potentially negative consequences of this arrangement. First, the exams do not appear to reflect the goals of these courses, and likely the goals of all instructors: to improve student reasoning and encourage deep conceptual understanding through practices implemented in class as well as through the encouragement of appropriate study habits by students outside of class. The second potential negative consequence is the focus of the present study: the level of the exam questions themselves (e.g., a focus on retention of facts) may influence student learning throughout the course. As we develop in more detail below, these influences could be a consequence of learning from testing (see Roediger and Karpicke 2006, for a review), of test-expectancy effects fostered by the level of quizzes and exams administered in the course (McDaniel et al. 1994; Thiede et al. 2011), or both.

In a recent review of the relevant assessment literature, Joughin (2010) discussed the three seminal and most influential works establishing the influence of assessments on students: Becker, Geer, and Hughes’ Making the grade: The academic side of college life (1968); Snyder’s The hidden curriculum (1971); and Miller and Parlett’s Up to the mark: A study of the examination game (1974). Joughin concluded that there was reasonable support for several tenets of research on assessment’s role in student learning: assessments influence students’ distribution of effort, their approach to learning, and their study behaviors. However, Joughin notes that most of the studies have weaknesses that limit their generalizability. In addition, we note that they do little to inform researchers about the causal mechanisms behind the effects observed. Several additional reviews support Joughin’s conclusions, showing connections between assessments and student motivation (Harlen and Deakin Crick 2002; Van Etten et al. 2008) and student approaches to learning (Dickie 2003; Struyven et al. 2005).

Based on both the laboratory work in cognitive and educational psychology (e.g., Thiede et al. 2011) and the classroom studies mentioned above, we posit that the nature of the quizzes and unit exams that students receive throughout the semester potentially impacts cognitive aspects of their learning: both the effort and strategy they apply during classroom activities and the type of self-directed learning (e.g., study strategies) in which they engage outside the classroom. Accordingly, we hypothesize that quiz and unit exam levels uniquely impact student performance on the final exam, even when course content and instruction are held constant.

In this study, assessment levels differed according to Bloom’s taxonomy (Bloom 1984). Bloom’s taxonomy of cognitive domains is a well-established framework for categorizing assessment items into six levels according to the thinking patterns required (hereafter referred to as “Bloom’s”). Figure 1 illustrates the revised version of Bloom’s (Anderson et al. 2001). It is generally accepted that the first two levels of Bloom’s, remember and understand, require only minimal levels of understanding and are considered lower-order cognitive skills (Crowe et al. 2008; Zoller 1993). It has been suggested that the third level of Bloom’s, apply, is at an intermediate level (Crowe et al. 2008), whereas the three higher levels of Bloom’s (analyze, evaluate, and create) require higher-order cognitive skills (Zoller 1993). The taxonomy was originally designed as a cumulative hierarchy in which mastery of the lower levels of the taxonomy was prerequisite to performance at higher levels (Anderson et al. 2001; Krathwohl 2002). For example, a question written at the analyze level of the taxonomy would require mastery of the basic content (i.e., remember and understand) in order to perform the analysis required by the question. Many researchers have attempted to test the assumptions of Bloom’s model and its hierarchical nature, but the results have been mixed (Hill and McGaw 1981; Kropp et al. 1966; Madaus 1973; Seddon 1978).

Fig. 1 The revised Bloom’s taxonomy

From an applied perspective, Momsen et al. (2010) established that biology instructors’ learning objectives were usually written at higher levels of Bloom’s, implying that educators regard these higher levels as centrally important, whereas basic researchers often emphasize that mastery at lower levels of the taxonomy is critically important as well (e.g., Sternberg, Grigorenko, and Zhang 2008). In the present study, we retain the terminology “high-level” and “low-level” questions to reflect the origins of the distinction we draw, not to imply anything about the relative importance of these levels of assessment and performance.

To test the effect of exam item level (low-level or high-level Bloom’s questions) on student learning (performance on the final exam), we compared two sections of introductory biology. In one section, all quizzes and unit exams were written at the “Remember” level of Bloom’s, thereby requiring no more than memorization of material to perform well (for purposes of exposition, we label this the low-level condition). In the other section, all of the quizzes and unit exams were written at high levels of Bloom’s, designed to require higher-order cognitive skills (labeled the high-level condition). Students in both sections then received an identical cumulative final exam, composed of lower-level questions focused on memory for factual knowledge and higher-level questions focused on application, analysis, and evaluation.

We reasoned that the performance patterns on the final exam could be influenced by the learning gained from the prior quizzes and unit exams themselves (a direct effect of testing) and/or by students adjusting their in-class learning and/or out-of-class study strategies throughout the semester to match the expected assessments (an indirect effect of testing, such as a test-expectancy effect; e.g., see McDaniel et al. 1994). We emphasize at the outset that, like many other classroom studies examining the influence of quizzing/testing on authentic learning outcomes such as course exams (McDaniel et al. 2013; McDaniel et al. 2012; Roediger et al. 2011), the current study does not allow determination of whether any test-level (low-level, high-level) effects are a consequence of direct testing effects, test-expectancy effects (an indirect effect), or a host of other indirect effects that might vary across the test-level manipulation (such as effects on metacognitive accuracy, study policies that are more frequent or more spaced, or exposure to different information). Nevertheless, we can still appeal to the testing-effect and test-expectancy literatures to anticipate several possible patterns of outcomes from the present test-level manipulation.

The Testing Effect

The quizzing/exam manipulations could directly benefit final exam performance. In what is termed the ‘testing effect’, actively retrieving target information has been shown to be much more effective at enhancing retention of that material than simply re-reading (or restudying) the information. Testing-effect research has generally focused on test questions requiring simple retrieval of facts (Carpenter and DeLosh 2006; Carpenter et al. 2008; Carpenter and Pashler 2007; Carpenter et al. 2009; Carrier and Pashler 1992; Chan and McDermott 2007; Johnson and Mayer 2009; McDaniel et al. 2007; Rohrer et al. 2010) and aligns most directly with the current low-level quiz/exam condition. The testing effect (and associated processes) for high-level questions has received less attention, but the evidence suggests that testing can increase transfer of knowledge to new questions (Carpenter 2012). Related to the present study, McDaniel et al. (2013) found evidence suggesting that quiz questions requiring application of a concept enhance performance on exam questions targeting a different application of the same target concept. Further, these application quiz questions also benefited performance on exam questions targeting memory for the concept terminology (relative to when the terminology was not quizzed). In contrast, quiz questions targeting only memory of concept terminology did not facilitate application of these concepts on a later test.

In the current study, the high-level condition included not only application questions but also analyze and evaluate questions. Cautiously extending from the above preliminary patterns, we reasoned that the high-level quiz/exam questions could directly improve performance on high-level questions on the final exam. We also reasoned that high-level quiz/exam questions could produce performance on low-level final exam items equal to that produced by the low-level quiz/exam questions. Of course, it remains possible that high-level quiz/exam questions (which included higher levels than application, as used in McDaniel et al. 2013) would promote substantial elaboration of conceptual information, thereby yielding an advantage on low-level items as well.

By contrast, any direct testing effects in the current study could follow a transfer-appropriate processing (TAP) pattern (e.g., Fisher and Craik 1977; McDaniel et al. 1978); that is, significant benefits emerge when the level of questions on the prior tests aligns with the level of questions on the final exam, provided previously tested concepts/constructs are targeted. We acknowledge that the TAP explanation has not fared well in accommodating standard findings from the testing-effect literature in terms of matching/mismatching question format (multiple choice vs. short answer questions; Carpenter and DeLosh 2006; Kang et al. 2007; McDaniel et al. 2007; McDermott et al. in press). Nevertheless, it remains entirely possible that the view would apply to the present circumstances. Here the conditions differ in terms of question level, with one level requiring remembering (low-level) and the other focusing on problem solving (high-level analysis and evaluation). Thus, misaligning the quiz/exam question level with the final exam question level may be associated with profound processing differences, thereby giving students a selective advantage on final exam items that align most closely in processing with the quiz/exam items. This leads to the clear possibility of a cross-over interaction such that the low-level condition produces better performance on the low-level final exam items than does the high-level condition, with the reverse true for the high-level final exam items.

Test Expectancy Effect

Recent laboratory findings suggest that test expectancies stimulate studying that appropriately matches the demands of the anticipated test (Finley and Benjamin 2012; Thiede et al. 2011). Applied to the present context, quizzes and unit exams throughout the semester would provide students with practice on different types of test questions (as in Finley and Benjamin and in Thiede et al.) and could thus stimulate test expectancies that might guide students’ study activities. Expectations about how test expectancy could influence performance hinge on the kinds of preparation (studying) that high-level quiz/exam questions stimulate. One account might suggest that in preparing for high-level quiz/exam questions, a student would focus on problem solving, analysis, and evaluation, thereby specifically honing those skills. These honed skills would then give the high-level condition students an advantage on high-level final exam questions; a similar advantage would be expected for low-level condition students on low-level final exam questions. This reasoning suggests a transfer-appropriate processing pattern like that described in the preceding paragraph and reported under laboratory conditions (cf. Thiede et al. 2011).

A more complex account hinges on the idea that, in order to perform at higher levels of Bloom’s, students must have mastered the lower levels (i.e., remembering and understanding basic terminology specific to the subject) (Bloom 1984). Accordingly, students’ expectancy that high-level Bloom questions will appear on the test would presumably lead them to master the basic facts before extending this learning to applying the facts and using them for analysis. Indeed, under this expectancy the basic facts should be elaborated with regard to analysis and evaluation, thereby increasing retention of the facts per se (cf. McDaniel and Donnelly 1996, with regard to astrophysics concepts; also see Mayer 2003, for a review of the positive effects on learning of answering deep-level questions while studying material). Thus, the idea is that students in the high-level condition, having a high-level expectancy, will outperform students in the low-level condition on both low-level and high-level final exam items.

To summarize, a consideration of effects (testing effect and test-expectancy effect) that could plausibly be operative in the current study suggests two possible general patterns. One pattern would reflect selective benefits of each condition on performance on matched final exam items. The other pattern would reflect cascading effects of the high-level quiz/exam questions on both high-level and low-level final exam items. The idea here is that application, analysis, and evaluation encompass processes that benefit retention (memory). Such effects would be evidenced by higher performance of the high-level condition on high-level final exam items, and critically by equivalent or even superior performance on low-level final exam items.

Student attitudes toward assessments are generally negative, regardless of format, given the effect assessments can have on course grades. However, students likely enter class with preconceptions about assessment format. In an analysis of 77 introductory biology courses (and 9713 assessment items), Momsen et al. (2010) found that 93 % of the items assessed low-level Bloom’s (remember or understand). Given this, we expect that students, who likely come to an introductory biology course expecting low-level exams, will be dissatisfied with the high-level exams presented in this study and may express this dissatisfaction in the form of increased negative comments about the exams.

Methods

Ethics Statement

Permission for human subjects use was obtained by the Institutional Review Board of the first author’s university and written consent was obtained from all participants.

Course and Participants

The participants were undergraduate students enrolled in two sections (∼90 students each) of non-majors general biology at a large private western university. The course, which met three 1-h periods per week, is part of the general education required core and covered the entire biology curriculum, from molecular and cellular biology, to genetics and biotechnology, to evolution and ecology. Both sections were taught in the same inquiry fashion using the learning cycle (Bybee 1993; Lawson 2002): specifically, exploratory activities to introduce each unit followed by term introduction and concept application activities. Homework assignments were identical between sections. Students enrolled in the course are typically non-science majors and range from freshmen to seniors.

Experimental Design

A quasi-experimental nonequivalent-groups design was utilized. Steps were taken to ensure as much group equivalence as possible between the two treatment groups [i.e., same instructor, identical classrooms, course materials, textbook (Belk & Maier, Biology: Science for Life), resources, curriculum, and expected learning outcomes]. One section was assigned to a low-level (herein referred to as LL; N = 84) assessment format and the other section was assigned to a high-level (herein referred to as HL; N = 85) assessment format; the assigned format encompassed all weekly quizzes as well as the three unit exams given throughout the semester. In both conditions, the quizzes consisted of 10 questions of various formats (multiple-choice, fill-in-the-blank, and short answer) administered through the course management system at the end of each unit. The exams consisted of 100 multiple-choice questions and were administered in the University Testing Center, where exams are proctored to students outside of class time. Both sections took a common final exam, which consisted of half low-level and half high-level questions. The final exam was comprehensive and tested the same (or similar) concepts but with new scenarios, such that no question appearing on a quiz or unit exam was repeated on the final.

Low-level items were defined as “Remember” on the revised Bloom scale (Anderson et al. 2001). High-level items were defined as items falling into “Apply,” “Analyze,” and “Evaluate.” [Note that none of the items were “Create” due to the constraints of multiple-choice testing.] These items required students to go beyond a simple understanding of the concepts and use them appropriately, e.g., apply them to a new situation, analyze data and draw appropriate conclusions, or evaluate the validity of information based on these concepts. In order to create a distinct difference between exams, we chose not to include items at the “Understand” level. In summary, low-level items required memory of terms and definitions, whereas high-level items required remembering a concept’s definition and then using a higher-order skill with that concept (application, analysis, or evaluation).

Occasionally, a particular low-level item on the low-level unit exam was directly subsumed within a high-level item on the corresponding high-level exam in the alternative treatment (for an example, see Appendix, questions i and xi; both cover haploid number). More often, however, the concepts were related but the high-level item was not just an extension of the low-level item (for an example, see Appendix, questions vii and xviii; both cover electron shells and atomic bonding but solicit different aspects of this concept). Selected items are included in the Appendix. Items were classified by three independent researchers trained in Bloom’s taxonomy. Items were topic-matched between the two exam formats to ensure that the same content was being tested in each condition. In addition, based on the instructor’s extensive experience with the material and the course, low-level items were selected to ensure that exams were equally difficult between treatments (confirmed by exam averages, as reported in the Results section).

There were a total of 14 weekly quizzes. Quizzes were taken independently, and answers were not discussed in class. Students could choose to visit a teaching assistant outside of class to discuss quizzes and get feedback if they so desired (very few students took advantage of this). However, students were not allowed access to quizzes outside of TA office hours, and thus quizzes were not available to be used as study materials. There were three unit exams spaced evenly throughout the semester. Exams were never discussed in class, and students could not take exams home as study materials. We did, however, incentivize students with five points of extra credit (a relatively nominal portion of the 895 total points in the course) to discuss their exams with teaching assistants. They were allowed to see the exam but not take it with them. On average, about 50 % of students in each condition chose to take advantage of this opportunity.

Dependent Measures

Initial Reasoning Ability

A key factor involved in performance, especially in biology, is scientific reasoning ability. Scientific reasoning ability is correlated with college level biology achievement; it is also very closely related to science process skills (e.g., controlling variables, interpreting data, drawing conclusions) and is highly correlated with a student’s ability to perform at higher levels of Bloom’s (Lawson et al. 2000b). Thus, students with higher reasoning abilities have an advantage on test items requiring procedural skills (e.g., science process skills). To control for this possibility, we assessed student reasoning ability using Lawson’s Classroom Test of Scientific Reasoning Skills (LCTSR, ver. 2000, Lawson 1978) and used it as a covariate in our analysis of achievement scores.

The LCTSR consists of 24 items used to assess initial reasoning ability. Scoring procedures, validity and reliability of the test are discussed in Lawson et al. (2000a). Briefly, scores from 0 to 8 are level 3, or concrete operational thinkers. Scores from 9 to 14 are low level 4, or students transitioning from concrete to formal operations. Scores from 15 to 20 are high level 4, or students transitioning from formal to post-formal operations. Scores from 21 to 24 are level 5, or post-formal operational thinkers. The reasoning test was administered as an in-class assignment at the beginning of the course, and students were given a fixed number of points for its completion.
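To make the scoring bands concrete, the following Python sketch maps a raw LCTSR total onto the levels described above; the function name and label strings are ours and purely illustrative, not part of the published instrument.

```python
# Illustrative only: classify an LCTSR total score (0-24) into the reasoning
# bands described above. Function name and labels are hypothetical.

def classify_lctsr(score: int) -> str:
    """Map an LCTSR total score to its reasoning level."""
    if not 0 <= score <= 24:
        raise ValueError("LCTSR scores range from 0 to 24")
    if score <= 8:
        return "Level 3 (concrete operational)"
    if score <= 14:
        return "Low Level 4 (transitioning from concrete to formal operations)"
    if score <= 20:
        return "High Level 4 (transitioning from formal to post-formal operations)"
    return "Level 5 (post-formal operational)"

# Example: the treatment-group means reported in the Results section fall in
# the High Level 4 band.
print(classify_lctsr(19))
print(classify_lctsr(18))
```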

Achievement

Student achievement was assessed by a common course final exam. The exam, informed by several standardized biology exams, consisted of 20 low-level multiple-choice items and 21 high-level multiple-choice items. Items were designed and then categorized into Bloom levels by three independent researchers trained in assessing levels of Bloom’s Taxonomy. Items were discussed and modified until all raters came to an agreement on the Bloom’s level. Because so many different constructs were being measured by this exam, reliability was difficult to determine and overall internal consistency was not expected to be high. A Cronbach’s alpha for the 41 content questions was determined to be 0.66. Students were assigned a low-level achievement score by averaging their performance on the 20 low-level items and a high-level achievement score by averaging their performance on the 21 high-level items.
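As an illustration of the achievement scoring, the sketch below computes the two sub-scores and Cronbach’s alpha from a hypothetical student-by-item matrix of 0/1 responses; the column names and the synthetic data are stand-ins of our own, not the actual exam data.

```python
# Sketch of the achievement scoring described above, assuming a student-by-item
# matrix of 0/1 responses (41 columns: 20 low-level, 21 high-level).
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of the total)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Synthetic responses: rows = students, columns = low_01..low_20, high_01..high_21
responses = pd.DataFrame(
    np.random.default_rng(0).integers(0, 2, size=(169, 41)),
    columns=[f"low_{i:02d}" for i in range(1, 21)] + [f"high_{i:02d}" for i in range(1, 22)],
)

low_score = responses.filter(like="low_").mean(axis=1)    # low-level achievement (proportion correct)
high_score = responses.filter(like="high_").mean(axis=1)  # high-level achievement (proportion correct)
alpha = cronbach_alpha(responses)                          # reported as 0.66 for the real data
```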

Attitudes

Student comments were taken from end-of-semester student course evaluations. Students were asked to respond to the statement “Evaluations are a good measure of learning” on an eight-point Likert scale from “Very Strongly Disagree” to “Very Strongly Agree.” Students were also given the opportunity to give any additional comments about the course in a free response portion. Comments about the exams were taken from each treatment condition and used to qualitatively judge student satisfaction with the evaluation component of the course.

Results

Reasoning Ability

Students were administered the 24-question LCTSR at the beginning of the semester to assess whether the treatment conditions were equally matched. Student scores indicated that the HL treatment group, on average, had slightly higher reasoning skills than the LL treatment group (M(HL) = 19.4, M(LL) = 17.8, t(167) = 2.61, p = .01, ηp² = .039). However, both treatment groups had formal reasoning skills (defined by scores above 14; see Lawson et al. 2000c) and therefore should have been capable of learning the theoretical concepts in the course. Nevertheless, as a consequence of this difference, the LCTSR was used as a covariate in analyses of treatment effects.
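For readers who wish to reproduce this style of baseline comparison, a minimal sketch follows: an independent-samples t test plus the eta-squared effect size computed as t²/(t² + df); applied to the reported values (t = 2.61, df = 167), the formula recovers the .039 above. The arrays below are synthetic stand-ins, not the actual data.

```python
# Sketch of the baseline LCTSR comparison: independent-samples t test plus the
# eta-squared effect size, t^2 / (t^2 + df). Synthetic data stand in for the
# two sections' real scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
hl_scores = np.clip(rng.normal(19.4, 3.5, 85), 0, 24)  # HL section (n = 85), synthetic
ll_scores = np.clip(rng.normal(17.8, 3.5, 84), 0, 24)  # LL section (n = 84), synthetic

res = stats.ttest_ind(hl_scores, ll_scores)
df = len(hl_scores) + len(ll_scores) - 2                 # 167
eta_sq = res.statistic**2 / (res.statistic**2 + df)      # 2.61**2 / (2.61**2 + 167) ≈ .039
print(f"t({df}) = {res.statistic:.2f}, p = {res.pvalue:.3f}, eta^2 = {eta_sq:.3f}")
```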

Quiz and Unit Exam Performance

Table 1 displays raw scores on each of the 14 quizzes within each treatment group. Scores on quizzes varied from approximately 5.5 to 9 out of a possible 10 points. Simple comparisons showed that on quizzes 1, 3, 6, 9, and 11, the HL treatment group outperformed the LL treatment group; whereas, on quizzes 5, 7, 12, and 13, the LL treatment group outperformed the HL treatment group. The remaining quizzes were statistically equivalent. Thus, no pattern emerged that would indicate that one condition was losing motivation throughout the semester or that one condition was disproportionately benefitting from quizzing.

Table 1 Average quiz scores between treatment conditions

Table 2 displays mean percentages on each of the three unit exams as a function of exam level (low level, high level). Inspection of this table reveals that means were virtually identical across the two types of exams. A 3 (exam number) × 2 (treatment condition) mixed-model analysis of covariance (ANCOVA), with LCTSR as the covariate, confirmed that mean scores did not differ across the low-level and high-level exams (F < 1) and that this equivalence held for all three exams [F(2, 332) = 1.44, p > .23, ηp² = .009, for the interaction; also see Table 2 for t values comparing the two treatments on each of the three exams]. Accordingly, any treatment effects on final exam performance are not a function of divergent quiz or unit exam performances across the treatment conditions (LL, HL).

Table 2 Average unit exam scores between treatment conditions

Several effects did emerge from the ANCOVA. Exam percentages varied significantly as a function of exam number, F(2, 332) = 4.52, MSE = .015, p = .01, ηp² = .027; Table 2 shows that Exam 2 scores were somewhat higher than Exam 1 and 3 scores. The LCTSR was positively related to exam scores, F(1, 166) = 33.08, MSE = .06, p < .0001, ηp² = .17, and more so for the earlier exams (r = .47 and .39 for Exams 1 and 2, respectively) than for the last exam (r = .19), F(2, 332) = 5.19, MSE = .015, p < .006, ηp² = .030 (for the interaction of LCTSR with exam number).
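The repeated-measures ANCOVA reported here was presumably run in a standard statistics package; as a rough, assumption-laden illustration, an analogous model can be specified in Python as a linear mixed model with a random intercept per student. All variable names and the data below are our own synthetic stand-ins.

```python
# Illustrative approximation of the 3 (exam) x 2 (condition) repeated-measures
# ANCOVA: a linear mixed model with a random intercept per student.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_students = 169                                   # 84 LL + 85 HL
students = np.arange(n_students)
condition = np.where(students < 84, "LL", "HL")
lctsr = np.clip(rng.normal(18.6, 3.5, n_students), 0, 24)

# Long format: one row per student per unit exam (synthetic scores)
long_df = pd.DataFrame({
    "student": np.repeat(students, 3),
    "condition": np.repeat(condition, 3),
    "exam": np.tile([1, 2, 3], n_students),
    "lctsr": np.repeat(lctsr, 3),
})
long_df["score"] = np.clip(
    0.4 + 0.01 * long_df["lctsr"] + rng.normal(0, 0.1, len(long_df)), 0, 1
)

# Fixed effects: condition, exam, their interaction, and the LCTSR covariate;
# the random intercept per student captures the repeated measures.
model = smf.mixedlm(
    "score ~ C(condition) * C(exam) + lctsr",
    data=long_df,
    groups=long_df["student"],
)
print(model.fit().summary())
```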

Achievement

Final exam scores were analyzed with a 2 (question level: low level, high level) × 2 (treatment) mixed-model ANCOVA, with final-exam question level as a within-subjects factor, treatment as a between-subjects factor, and LCTSR as the covariate. The treatment main effect was significant [F(1, 166) = 7.15, p = .008, ηp² = .041], indicating that students who took high-level unit exams (HL treatment, adjusted mean = .54) generally scored higher than students who took low-level unit exams (LL treatment, adjusted mean = .50). Importantly, this treatment main effect did not interact with question level (F < 1). Figure 2 (unadjusted means) shows that the higher scores for the HL condition were clearly evident for both the low-level final exam questions and the high-level final exam questions. To confirm this observation, and because of the theoretical significance of the result, we conducted separate ANCOVAs (with LCTSR as a covariate) on scores for the low-level and high-level questions. For the low-level questions, the HL condition (adjusted mean = .63) scored significantly higher than the LL condition (adjusted mean = .59) [F(1, 166) = 4.19, MSE = .02, p < .05, ηp² = .025]; for the high-level questions, the HL condition (adjusted mean = .46) similarly scored higher than the LL condition (adjusted mean = .41) [F(1, 166) = 6.32, MSE = .02, p < .02, ηp² = .037].
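The follow-up analyses on each sub-score are ordinary between-subjects ANCOVAs; a minimal sketch follows, again with hypothetical column names and synthetic data standing in for the real scores.

```python
# Sketch of the follow-up ANCOVAs on the final-exam sub-scores: OLS of each
# sub-score on treatment condition plus the LCTSR covariate.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n = 169
df = pd.DataFrame({
    "condition": np.where(np.arange(n) < 84, "LL", "HL"),
    "lctsr": np.clip(rng.normal(18.6, 3.5, n), 0, 24),
})
# Synthetic sub-scores (proportion correct on low- and high-level items)
df["low_score"] = np.clip(0.3 + 0.015 * df["lctsr"] + rng.normal(0, 0.1, n), 0, 1)
df["high_score"] = np.clip(0.1 + 0.018 * df["lctsr"] + rng.normal(0, 0.1, n), 0, 1)

for outcome in ("low_score", "high_score"):
    fit = smf.ols(f"{outcome} ~ C(condition) + lctsr", data=df).fit()
    print(outcome)
    print(anova_lm(fit, typ=2))  # F test for the condition effect, adjusting for LCTSR
```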

Fig. 2 a Low-level and b high-level achievement sub-scores for each treatment condition. For both sub-scores, the HL treatment significantly outperformed the LL treatment (p < .05 and p < .02, respectively). Error bars represent 95 % confidence intervals

The overall ANCOVA also indicated that students correctly answered more low-level questions than high-level questions, F(1, 166) = 34.74, MSE = .009, p < .0001, ηp² = .173. Reasoning level (LCTSR) assessed at the beginning of the semester was positively associated with final exam scores, F(1, 166) = 41.50, MSE = .02, p < .0001, ηp² = .200. This association was significantly modulated by question level, F(1, 166) = 5.65, MSE = .009, p < .02, ηp² = .033, such that students’ reasoning level was more strongly associated with their ability to solve high-level questions (r = .50) than low-level questions (r = .33). This pattern is sensible because performance on high-level items requires both content knowledge and reasoning skills, whereas for low-level items reasoning ability is presumably less necessary.

Attitudes

Response rate on end-of-semester student course evaluations was high (74 % in the LL treatment and 80 % in the HL treatment). Student impressions of the evaluations used in the course were generally lukewarm. In response to the statement, “Evaluations are a good measure of learning,” the average response (on an 8-point Likert scale) was a 5.5 in the LL treatment and 5.3 in the HL treatment, between “Somewhat Agree” and “Agree.” Open-ended comments were searched for feedback specifically about exams. In the LL treatment, only 8 students chose to comment on the exams, mostly with disappointment in their performance. In the HL treatment, comments were much more plentiful and it appears that students recognized that the exams were testing higher-level thinking skills but resented their poor performance (see Table 3).

Table 3 Student impressions of exams

Discussion

The assessment level incorporated into the course had a significant impact on students’ conceptual understanding and final achievement scores. A striking pattern emerged such that students who routinely took quizzes and unit exams requiring higher-order thinking not only showed deeper conceptual understanding, evidenced by higher scores on high-level questions, but also showed greater retention of the facts, as evidenced by higher scores on low-level questions. The effect sizes were in the small-to-medium range (by convention, ηp² = .01 is considered small and .06 medium). From a practical standpoint, differences in scores between the low-level and high-level conditions were equivalent to approximately a half grade level (e.g., the difference between a B and a B+). Returning to our original theoretical perspectives, these results best fit the expectation outlined in the introduction that high-level quizzing and testing would have cascading effects, such that the focus on high-level, deep conceptual understanding (on quizzes/unit exams) would confer benefits both on final exam items that targeted deep use of information (application, analysis, evaluation) and on exam items that targeted memory of target information. As outlined in the introduction, these effects can potentially be explained from both a test-expectancy and a testing-effect perspective. Below, we also sketch some other indirect consequences of testing that could potentially be involved in the present effects.

Briefly, a test-expectancy perspective would suggest that students adjusted the focus and strategies of their in-class learning and/or at-home studying to best match the demands of the quizzes and exams. Much as students with a surface or achieving study strategy (Biggs 1987) will tailor their study habits to match expectations, students presented with assessments throughout the course that require simple memory of facts (low-level exams) likely focus their cognitive strategies on memorizing terms and definitions while neglecting to practice applying the material. The higher-level exams, on the other hand, would prompt a change in student expectations and, as a consequence, a change in the focus of their studying from low- to high-level tasks. In particular, students might have begun to focus their studying on exercises that integrate, evaluate, and apply the material. By focusing on integration and application in their studying, we assume that students in the high-level quiz/exam condition would also need to learn the terminology and basic understanding tapped by the low-level final exam questions.

Of course, the assessment-level manipulation could have fostered a host of other indirect effects. For instance, high-level quizzes and exams could have prompted more, or more widely spaced, studying for the final exam than did low-level quizzes and exams (perhaps because the high-level quizzes/exams created the anticipation of a more challenging final exam). Another possibility is that students in the high-level quiz/exam condition were more likely to seek feedback and review their exams with the teaching assistants than were students in the low-level condition. To inform this possibility, we calculated the percentage of students who consistently met with teaching assistants to review their exams. The results disfavored this possibility, as students in the low-level condition met more often with teaching assistants to review exams (M = 55 %) than did students in the high-level condition (M = 39 %, p = .04). Finally, it could be that the high-level quizzes/exams allowed students to more effectively calibrate their metacomprehension of the material, which in turn might be expected to guide more efficient study policies for the final exam (i.e., more effective deployment of study time; cf. Thomas and McDaniel 2007).

The present patterns might also be a consequence of testing (on the quizzes and exams) directly affecting learning and, consequently, performance on the final exam. The idea is that students practiced at high-level processing of the material on quizzes/exams better learned the conceptual aspects of the material that required applying, analyzing, and evaluating than did students practiced at remembering the material (on quizzes/exams). This would be revealed as an advantage on the high-level questions presented on the final exam. Additionally, testing on high-level questions could stimulate processing that enhanced memory for the target information (e.g., McDaniel et al. 2013), thereby possibly playing a role in producing superior performance on low-level (memory-based) final exam questions relative to low-level testing. In addition to (or instead of) high-level items stimulating more extensive processing when answering the quizzes and unit exams (than low-level items), it is possible that the high-level items (quiz and unit exam) exposed students to greater amounts of information than did the low-level items, due to the more complex nature of the high-level items (see the Appendix for sample high- and low-level unit exam items). If so, then this additional exposure per se could have contributed to the increased achievement (final exam performance) of students in the high-level condition.

An alternative to the general idea that the benefits of the high-level condition were a consequence of direct and/or indirect cognitive effects is that students in the low-level condition found the quizzes and unit exams to mismatch the way the course was being taught (an inquiry, problem-based approach); as a consequence, their motivation suffered relative to students in the high-level condition. That is, students in the low-level condition may have become less motivated to engage in class and to study than those in the high-level condition, thereby producing lower final exam performance. At least two findings argue against this interpretation. First, if motivation were declining in the low-level condition, we would expect to have seen quiz and unit exam scores drop throughout the semester. However, quiz and unit exam scores remained consistent between conditions. Second, the attitudinal data indicated that students in the low-level condition considered the quizzes and unit exams to be typical and at the level expected. Of the 62 students in the low-level condition who submitted end-of-semester course evaluations, only eight made any reference to exams, and those references were generally about difficulty, not format. This suggests that factually oriented exams are typical of college science courses and expected by students (as seen in Momsen et al. 2010). In the high-level treatment, many more references were made to exams, and most comments referred to the types of questions on these exams.

The student comments lead us to two conclusions. First, students recognized that the exams were testing their ability to think and that this was somewhat novel and unexpected. Second, students were extremely dissatisfied with this type of testing, despite their increased learning. Testing that requires application of material is challenging for students. It requires a real effort to understand the material and not just a cramming session to memorize definitions. Students are rarely exposed to this type of assessment and therefore find it to be uncomfortable and difficult. Many students expressed to the instructor their frustration with their inability to effectively study for these exams. It was our impression that their frustration was not due to a lack of ability but rather to a lack of experience with exams requiring study strategies aimed at deep conceptual learning and critical thinking. As was pointed out in a recent review, students often do not recognize how they learn and thus do not appreciate many beneficial learning tasks (Bjork et al. 2013).

Interestingly, the LCTSR was correlated with performance on all three unit exams as well as the final exam, especially with high-level items; teaching through inquiry did not eliminate this correlation. This has two implications. First, the robust association between the LCTSR and high-level item performance indicates that high-level assessment items indeed require scientific reasoning skills in addition to simple content knowledge. Second, it suggests that students with higher reasoning skills have a distinct advantage on high-level assessments. It follows, then, that a more explicit focus on teaching scientific reasoning skills is perhaps warranted when high-level assessments are used.

Overall, our findings are in line with the hierarchical assumption of Bloom’s taxonomy of knowledge processes. The observed general benefits of the high-level quiz/unit exam condition (relative to the low-level condition) on the final exam suggest that preparing for high-level questions (application, analysis, and evaluation) necessitated mastery of information at lower levels of the taxonomy (memory and comprehension) as well. We also suggest that, in order to perform on items at higher levels of Bloom’s taxonomy, students may need to engage in elaborative practices when answering high-level questions that enhance their understanding of the basic terminology. Of course, these possibilities await direct evaluation in more controlled experiments. Regardless of the eventual explanation, the findings show cascading effects such that consistent experience with high-level quiz and exam items (as identified by Bloom’s taxonomy) not only facilitates performance on subsequent questions requiring application, analysis, and evaluation, but also facilitates performance on items that require memory and understanding of basic terminology (relative to a steady diet of low-level quiz and exam items).

Conclusions and Educational Implications

The present results reinforce the assumption that assessments inform students of expectations for the course and further indicate that such expectations can have important consequences for student learning outcomes. Writing exams that require higher-order thinking skills is certainly a challenging task for an instructor, especially in the multiple-choice format that can be easily administered in a large-enrollment course. However, higher-order assessments may be a key factor in stimulating students to acquire a deep understanding of the material, an understanding that supports not only application, analysis, and evaluation but also better retention of the core facts. By contrast, adopting the more typical (and perhaps easier for the instructor) approach of giving factual-recall exams does students a disservice: these kinds of exams are less likely to foster critical thinking and application of knowledge, and they do not appear to promote acquisition and retention of factual information to the extent stimulated by higher-order exams (given throughout the course).

This study illustrates the importance of higher-level assessment in promoting scientific understanding. Using backward design (Wiggins et al. 1998), instructors are encouraged to first identify desired results or learning outcomes. In biology, these learning outcomes should certainly include content knowledge, but we suggest they should also include the learning of scientific process skills. The second step in backward design is to determine what evidence would demonstrate that learning outcomes were met and to design assessments accordingly. The third step is to design learning activities that align with the assessments. By aligning our learning activities with our assessments and our assessments with our learning outcomes, we provide greater opportunities for students to demonstrate what they have learned. Many science instructors fail to design assessments that effectively test scientific process skills, often defaulting to test-bank-generated exams that test only content knowledge. Assessments should be designed to truly test scientific process skills and, as such, should be written at higher levels of Bloom’s taxonomy. Not only does this assessment format provide appropriate evidence of the attainment of desired learning outcomes, but this study shows that it also directs student learning, focusing students’ study efforts on these desired skills and ultimately leading to deep conceptual understanding. Effective assessment can turn students’ time inside and outside of class into productive learning time, rather than bouts of rote memorization.