Introduction

A central aim of science education is to “provide learners with ample opportunities to construct, re-construct and consolidate their own internally consistent frameworks about the epistemological foundations of science” (Abd-El-Khalick 2012, p. 360). Hodson (2014) points out that students should acquire conceptual and theoretical knowledge (‘learn science’) as well as develop an understanding of the characteristics of scientific inquiry including the role and status of the generated knowledge (‘learn about science’). Both in the philosophy of science and in science instruction research, the importance of models is being increasingly acknowledged (e.g., Bailer-Jones 2009; Giere 1988; Oh and Oh 2011; Windschitl et al. 2008). It can be argued that an elaborate understanding of models as scientists’ tools of inquiry marks an important step in learning about the practice of science (Gilbert 2004; Harrison and Treagust 2000; Passmore et al. 2014). Thus, science educational standards place a considerable emphasis on models and their use in science in order for students to understand the nature of science and develop scientific literacy (KMK 2005; NGSS Lead States 2013).

The relevance of models in the scientific community and the demand for their integration into school curricula have led researchers to investigate students’ and teachers’ understandings of the nature of models (e.g., Chittleborough et al. 2005; Grosslight et al. 1991; Justi and Gilbert 2003; Schwarz et al. 2009; Schwarz and White 2005; Treagust et al. 2002; Trier and Upmeier zu Belzen 2009; Van Driel and Verloop 1999, 2002). Most of the findings suggest that students as well as teachers have little awareness of the role that models play in the process of research but see them mainly as representations of a given phenomenon that show or describe it in a comprehensible manner. However, teachers are expected to have a profound understanding of the nature of models and modelling in science (Oh and Oh 2011) and to be able to diagnose their students’ understanding of models in order to determine specific starting points for effective interventions (Justi and Van Driel 2005).

Existing instruments (e.g., Chittleborough et al. 2005; Grosslight et al. 1991; Grünkorn et al. 2014; Krell 2013; Sins et al. 2009; Treagust et al. 2002) used to assess students’ understanding of models are suitable for educational research but cannot readily be employed by teachers when seeking to obtain individual feedback on their students. The scoring of open-ended tasks is time consuming, and the results from large-scale assessments cannot be related to individuals. Consequently, an efficient individual diagnostic instrument is needed (Hartig et al. 2008). We aim to support teachers by providing them with a theory-based diagnostic instrument that uses forced choice tasks. Unlike in multiple choice tasks, in forced choice tasks none of the answer options is wrong. By choosing their most preferred out of three given options, the respondents indicate their understanding of the topic along a construct continuum (McCloy et al. 2005). The intended use of the instrument is to give teachers direct feedback on their students’ understanding of the nature of models (Gogolin and Krüger 2015).

When diagnosing students’ understanding of models in a school context, the validity of the diagnosis is of great importance as it enables educators to draw sound conclusions from the results and to plan appropriate interventions. Nevertheless, many researchers point out that the question of validity is not given enough consideration during the development of diagnostic instruments in educational research (e.g., Leighton 2004). Moreover, there is a demand for methods that identify how students react and respond to test items (Leighton 2004). In this regard, the ‘Standards for Educational and Psychological Testing’ stress the use of different sources to collect evidence for validity and the need to consider evidence across relevant subgroups of test takers (AERA et al. 2014). In order to ensure that the diagnostic instrument can serve its intended use and that the interpretations drawn on the basis of the forced choice tasks are valid, we took different approaches, two of which will be presented in this article as an example of how to integrate students’ reactions and responses to test items into the validity argument. First, in order to judge validity based on relations to other variables (AERA et al. 2014), we compared the findings of three different diagnostic assessment methods, including the forced choice tasks. Second, to find evidence for validity based on response processes, we performed think-aloud protocols to evaluate whether the diagnostic forced choice tasks are understood by the students (AERA et al. 2014). In both approaches, special consideration was given to investigating whether valid inferences from the results can be drawn for all students or only for some subgroups.

Theoretical Background

The Nature of Models

Many educational researchers point out the potential of models as educational tools that can be used by teachers to help their students learn scientific content or learn about the nature of science (e.g., Gilbert 2004; Harrison and Treagust 2000; Oh and Oh 2011; Passmore et al. 2014; Windschitl et al. 2008). This potential may lie in the role that models play as tools of scientific discovery (Upmeier zu Belzen and Krüger 2010): they are the result of a theory-driven modelling process and allow the deduction of further research questions and hypotheses (e.g., Giere 1988; Van der Valk et al. 2007). Mahr (2011) highlights from a philosophical point of view that models “do not incarnate any form of truth, but rather forms of demonstrability, possibility, and choice” (p. 303). In his epistemic pattern of model-being, he moves away from the desire to define the nature of models and instead determines interdependent relationships that justify something to be conceived of as a model. In this frame, a model is (1) distinct from its representation as a model-object, (2) in its function as a medium, a model of something, and (3) in a methodological view, a model for something. For example, a computer-simulated Tyrannosaurus rex model (the model-object) is based on assumptions drawn from paleontological bone findings, thus being a model of the original T. rex. Moreover, as a model for predicting, it allows the testing of hypotheses, which are drawn from the model itself about the manner in which the T. rex possibly walked or ran. This concept is in line with the distinction between the retrospective view of models as representations and the prospective view of models as a scientific tool for making sense of the world (Krell et al. 2012).

The present study is based on the theoretical framework of model competence by Upmeier zu Belzen and Krüger (2010). Competencies may be described as “context-specific cognitive dispositions that are acquired by learning and needed to successfully cope with certain situations or tasks in specific domains” (Klieme et al. 2008, p. 9). The theoretical framework comprises the five aspects ‘nature of models’, ‘multiple models’, ‘purpose of models’, ‘testing models’, and ‘changing models’ in order to describe views that students, teachers and scientists may have about models. In this article, we will present diagnostic tasks that focus on the aspect ‘nature of models’ (Table 1). Upmeier zu Belzen and Krüger (2010) propose three levels of understanding for each of the aspects, reflecting the epistemological perspectives on models as both products and methods of science (e.g., Mahr 2011; Passmore et al. 2014; Treagust et al. 2004). As part of the empirical evaluation of the theoretical framework, Grünkorn et al. (2014) developed a category system of students’ perspectives for each of the five aspects (see Table 1).

Table 1 The theoretical framework for students’ understandings of the nature of models (Grünkorn et al. 2014; Upmeier zu Belzen and Krüger 2010)

When expressing views on the aspect ‘nature of models’, one compares a model with its original in order to specify to what extent the model accords with the original. Within this aspect, comparable to parts of the description of the aspect ‘kinds of models’ by Grosslight et al. (1991), a model may be seen as a replication (level I), as an idealised representation (level II), and as a theoretical reconstruction (level III) of its original. The perspectives described are similar to those defined by Schwarz et al. (2009) as part of a meta-modelling knowledge. According to the theoretical framework by Upmeier zu Belzen and Krüger (2010), a scientific understanding of the nature of models is attributed to students who are able to understand the perspective described in level III for this aspect.

Research on Students’ Understanding of the Nature of Models

Several studies have been carried out in order to investigate students’ understanding of models and modelling in science (e.g., Chittleborough et al. 2005; Gobert et al. 2011; Grosslight et al. 1991; Grünkorn et al. 2014; Krell 2013; Patzke et al. 2015; Sins et al. 2009; Schwarz et al. 2009; Schwarz and White 2005; Terzer 2013; Treagust et al. 2002; Trier et al. 2014; Willard and Roseman 2010). These studies differ in at least three main aspects. First, the underlying theory dictates the design of the assessment and guides the interpretation of the data. While Grosslight et al. (1991) describe general levels of understanding models — comparable to a person’s epistemological views on science, ranging from naive-realistic to constructivist — Upmeier zu Belzen and Krüger (2010) propose aspect-dependent levels of understanding models. Second, some of the authors discuss the influence of the students’ age on their understanding of models. Findings suggest that the age of students has an influence on their understanding of models, with older students answering more elaborately (e.g., Chittleborough et al. 2005; Grosslight et al. 1991; Krell et al. 2014b; Patzke et al. 2015; Terzer 2013; Willard and Roseman 2010). Al-Balushi (2011) sums up that “the drop in the number of students across grade levels who believe that there is a 1:1 correspondence between models and reality […] was well documented in the literature” (p. 575). Third, different methods were used to assess students’ understanding. For example, Grosslight et al. (1991), Schwarz and White (2005), and Trier et al. (2014) performed interviews. Gobert et al. (2011), Grünkorn et al. (2014), and Sins et al. (2009) used open-ended tasks. In addition to these rather open forms of assessment, students’ understanding has been identified using closed-ended tasks. For example, Krell (2013) used forced choice tasks, Terzer (2013) used multiple choice tasks and Treagust et al. (2002) used the ‘Student Understanding of Models in Science’ (SUMS) test, which includes 27 Likert-type items. Some researchers combined methods, using both closed- and open-ended tasks (Chittleborough et al. 2005) or open-ended tasks and interviews (Schwarz and White 2005).

Even when researchers share the same theoretical background and draw samples from the same population of students, their findings diverge depending on the instrument used. While Grünkorn et al. (2014; open-ended tasks) found the majority of students (N = 1177) to understand models as replications of an original (level I 75 %; level II 20 %; level III 5 %), Krell (2013; forced choice tasks) found that students (N = 1216) were able to recognise that models are idealised representations (level I 15 %; level II 52 %; level III 33 %). Closed-ended formats are more likely to address a wider range of perspectives than open formats, as students can choose from given perspectives (Katz et al. 2000). In open formats, students may not state some implicit perspectives due to an aversion to writing or because they believe they have answered the task sufficiently, depending on the operator used in the task (Anastasi 1976).

Most researchers agree that, in general, when students comment on the relationship between a model and the corresponding phenomenon, they focus on representational aspects rather than on the model and its underlying hypotheses (e.g., Upmeier zu Belzen and Krüger 2010). All of the abovementioned studies investigated students’ understandings of models and their use in science with different assessment methods. While interviews were used for fundamental research into students’ conceptions, most of the paper-pencil instruments were employed in large-scale assessments. Kauertz et al. (2012) point out that often “the results of large-scale assessments cannot be related to individuals or even classes since the individuals’ measurement errors are out of scale” (p. 47). Therefore, the instruments are informative for educational administration (e.g., Pant et al. 2013) but cannot be employed by teachers when seeking to obtain direct individual feedback on their students. Consequently, an efficient instrument for an individual diagnosis of students’ understanding is needed (Hartig et al. 2008). We developed a diagnostic instrument which is intended to comply with the demand to support teachers in such a way. It consists of a set of forced choice tasks and can be used directly by the teacher, allowing for an efficient diagnosis of students’ understanding of the nature of biological models (Gogolin and Krüger 2015). Compared to interviews or open-ended questionnaires, an instrument containing forced choice tasks is easily administered and interpreted. The closed-ended format of forced choice tasks allows for direct feedback to the teacher on a student’s level of understanding of the nature of models.

Validity

As students’ understanding of models cannot be observed directly, their thinking must be judged indirectly from their performance or their score on tasks. Depending on a student’s score, it may be assumed that the student has a more or less elaborate understanding of the matter. On the basis of the scores, decisions about the students’ further support in the classroom are made (AERA et al. 2014). For a diagnosis of students’ understanding to be valuable, the validity of a proposed interpretation of the scores obtained from an instrument is of utmost importance. The Standards for Educational and Psychological Testing define validity as “the degree to which evidence and theory support the interpretation of test scores for proposed uses of tests” (AERA et al. 2014, p. 11). They stress that validity is the most fundamental consideration in developing and evaluating tests, and a property of the meaning of the test scores rather than a characteristic of the instrument itself (Messick 1995). Accordingly, the scores obtained from an instrument should lead to valid interpretations. For example, if an instrument is designed to measure students’ understandings of models, a student with a rather naive understanding should have a low score while a student with an elaborate understanding should have a high score.

Sources of evidence for the validity of a proposed interpretation of test scores are manifold (AERA et al. 2014). According to the Standards for Educational and Psychological Testing, researchers may inter alia provide (1) evidence based on the test content, including whether the test content adequately represents the content domain or whether construct underrepresentation and construct-irrelevant variance disadvantage certain subgroups of test takers (Messick 1995). In order to generate this evidence, the development of tasks may involve a theory-based construction of the tasks, experts judging the representativeness of the chosen set of tasks, and a pilot test of the instrument. (2) Evidence based on response processes of the test takers may provide evidence concerning the fit between construct and response. Cognitive psychologists agree that think-aloud protocols can be seen as a window to conscious experience (e.g., Embretson and Gorin 2001) and can be used to investigate the cognitive processes underlying students’ responses to tasks (Leighton 2004). This contributes to answering questions about differences in the meaning of scores across relevant subgroups of test takers. (3) Evidence based on relations to other variables describes the “relationship between test scores and other measures intended to assess the same or similar constructs” (AERA et al. 2014, p. 16). The convergence of different independent methods measuring the same trait indicates convergent validity (Campbell and Fiske 1959), though evidence needs to be discussed with regard to the task format (Martinez 1999; Traub and MacRury 1990).

Research Questions and Hypotheses

Knowing about students’ understanding of models is of interest to science teachers seeking to improve their teaching about models and modelling (e.g., Oh and Oh 2011). Teachers need an instrument that provides them with direct feedback on their students’ understanding in order to determine specific starting points for effective interventions (Hartig et al. 2008). Existing instruments cannot provide this service (Kauertz et al. 2012). This article gathers evidence regarding the validity of a diagnostic instrument (forced choice tasks; Gogolin and Krüger 2015) in order to empirically evaluate whether this instrument will be able to serve its intended use of diagnosing students’ understanding of the nature of models. We examined evidence that is based on relations to other variables and evidence that is based on students’ response processes (AERA et al. 2014). The different empirical approaches we took to gain evidence for validity are reflected in the research questions:

  • RQ1—evidence for validity based on relations to other variables:

To what extent do students have a consistent understanding of the nature of models when being assessed with different diagnostic methods, including the forced choice tasks?

  • RQ2—evidence for validity based on students’ response processes:

To what extent are the students’ interpretations of the diagnostic forced choice tasks consistent with the theoretically intended content of the tasks?

The first research question addresses the demand for providing evidence for validity based on relations to other variables. We expect the measurements of students’ understanding of models, obtained with different diagnostic assessment methods, to converge, which would indicate convergent validity (Campbell and Fiske 1959).

The second research question refers to evidence for validity based on response processes of the test takers. In order to be suitable for a valid diagnosis of students’ understanding, the theoretical content of the tasks needs to be understood by the students (Leighton 2004). Therefore, we expect the students’ interpretation of the theoretical content of a task to match its theoretically intended content.

Given the previous research in the field, we expect the age of the students to have an effect on the scores. The Standards for Educational and Psychological Testing (AERA et al. 2014) request test developers to consider evidence for validity across relevant subgroups of test takers. Consequently, this aspect has methodically been taken into consideration and will be discussed.

Methods

As mentioned earlier, the research questions reflect different methodological approaches to gathering empirical evidence for validity. The presentation of the methods is therefore divided into two parts. The forced choice tasks (Gogolin and Krüger 2015), which are central to both parts, are described first.

Assessment Methods

The diagnostic instrument uses the format of forced choice tasks. In the forced choice tasks, respondents must choose their most preferred out of three given options and thereby indicate “which of the [options] included in the item is most indicative of his or her behavior” (McCloy et al. 2005, p. 225). Students’ responses are interpreted as indicators of their understanding of models (Gogolin and Krüger 2015). The fact that none of the answer options is wrong distinguishes forced choice from multiple choice tasks. Forced choice tasks have the advantage of giving direct feedback on the understanding while avoiding the tied judgments that often occur with Likert-type rating tasks (Böckenholt 2004). Because a student’s cross can be directly interpreted as a level, the tasks allow for an efficient diagnosis of students’ understanding.

The tasks are based on the theoretical framework of model competence by Upmeier zu Belzen and Krüger (2010) and on the written responses of students to open-ended tasks, also based on the aforementioned theoretical framework (Grünkorn et al. 2014). Following Lee et al.’s (2015) claim that it is important to develop more context-based instruments because models and their roles vary across different topics or phenomena, contextualised tasks, including a variety of biological models, were developed (Gogolin and Krüger 2015). For each of eight different biological models, six answer options (two for each of the three levels, Table 1) were constructed. Three answer options (one per level) can be put together to form one forced choice task (Gogolin and Krüger 2015). Combining one answer option per level in all possible ways results in a total of 2 × 2 × 2 = 8 different forced choice tasks for one model. In each task, a task stem shows a biological model with a short description. Due to the observation that students distinguish between school models and models in science (e.g., Gobert et al. 2011), we included the phrase ‘the [model] which was made by scientists’ (Fig. 1).

Fig. 1
figure 1

Forced choice task for the aspect ‘nature of models’

The forced choice tasks were integrated into a paper-pencil questionnaire. To control item-sequencing effects (Eid and Schmidt 2014; Rector Federer et al. 2014), the six answer options for each model were randomly combined into eight forced choice tasks per model and integrated into different test booklets. Consequently, each student saw a fraction of the available tasks in a particular combination, thus “ensuring sufficient content coverage across the population while reducing the assessment burden for any one student” (multi-matrix design; Gonzalez and Rutkowski 2010, p. 126). The combinations led to 16 different test booklets, each containing six of the eight models with two forced choice tasks for each model, resulting in a total of 12 forced choice tasks to be answered by each student.
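
To make the construction principle concrete, the following Python sketch illustrates how two answer options per level combine into eight forced choice tasks per model and how booklets of six models with two tasks each could be assembled. It is a minimal illustration with invented option labels and model names, not the authors’ actual generation procedure.

```python
# Illustrative sketch only, not the authors' actual procedure: combining two answer
# options per level into forced choice tasks and assembling multi-matrix test booklets.
from itertools import product
import random

LEVELS = ("I", "II", "III")

def build_tasks(options_per_level):
    """options_per_level maps each level to its two answer options.
    Picking one option per level yields 2 x 2 x 2 = 8 possible tasks per model."""
    return [dict(zip(LEVELS, combo))
            for combo in product(*(options_per_level[lvl] for lvl in LEVELS))]

# Hypothetical answer options for one model (e.g., a water cycle model)
water_cycle_options = {
    "I":   ["replication option A", "replication option B"],
    "II":  ["idealised option A", "idealised option B"],
    "III": ["theoretical option A", "theoretical option B"],
}

tasks = build_tasks(water_cycle_options)
assert len(tasks) == 8  # eight possible forced choice tasks per model

# Multi-matrix idea: each booklet contains six of the eight models,
# with two tasks per contained model, i.e. 12 tasks per student.
models = [f"model_{i}" for i in range(1, 9)]

def assemble_booklet(rng):
    chosen_models = rng.sample(models, 6)
    return {model: rng.sample(range(8), 2) for model in chosen_models}

booklet = assemble_booklet(random.Random(0))  # maps model -> indices of its two tasks
```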

In order to investigate the first research question, we used two other methods to diagnose students’ understanding of the nature of models, one of them being a paper-pencil questionnaire containing two open-ended tasks and the other being fully structured interviews. The wordings of the questions in the open-ended tasks and the interviews were similar to the questions in the forced choice tasks (Table 2).

Table 2 Questions used in the three diagnostic tools. The brackets [model], [original] are replaced by a specific model (e.g. water cycle; Fig. 1) in the forced choice tasks

The decision to include two questions in both the open-ended tasks and the interviews stemmed from the wish to assess a broader spectrum of students’ perspectives. The decision to use abstract tasks, which do not refer to the specific models that are used in the forced choice tasks, was made for reasons of test economy: the number of tasks the students had to handle was to be kept at a minimum.

Data Collection—Evidence for Validity Based on Relations to Other Variables (RQ 1)

The data for the investigation of the first research question, which will help discuss evidence for validity for the forced choice tasks based on relations to other variables, stem from the three different diagnostic assessment methods and were collected at secondary schools in Berlin (Germany), where a total of 448 students from grades seven to twelve agreed to participate in the study. All of these students answered one questionnaire with the open-ended tasks as well as one with the forced choice tasks. A subsample (n = 194) also expressed their views in fully structured interviews (Table 3).

Table 3 Demographic data of the sample

In all classes, the survey started with standardised instructions in order to inform the students about the procedure and to assure objectivity. Subsequently, all students were handed the first questionnaire booklet containing information about the project’s aim and the open-ended task format as well as the two open-ended tasks themselves. Anonymity was maintained by the creation of a password. There was no time limit on the completion of the tasks but the students were asked to raise their hand when they were done. When doing so, one person out of a team of trained helpers approached the student and asked him or her for a short interview. In the fully structured interview, the student was tape recorded while answering the two questions. The interviewer did not interrogate any further. Afterwards, the student returned to the classroom and worked on the second questionnaire with the forced choice tasks. Students from different grades (7th–12th graders) were included to represent relevant subgroups of test takers.

Data Analysis—Evidence for Validity Based on Relations to Other Variables (RQ 1)

While for the forced choice tasks a student’s cross could directly be transformed into a level for the particular task, the data from the open-ended tasks and the interviews had to be transcribed before being analysed. We chose the procedure of an evaluative content analysis (Kuckartz 2012), in which the material is assessed on the basis of already existing categories and scaled data are generated (Mayring 2010). Accordingly, the students’ responses were compared with the descriptions of the aspect ‘nature of models’ in the theoretical framework (Table 1; Grünkorn et al. 2014; Upmeier zu Belzen and Krüger 2010). A level was assigned to the students’ responses to the two open-ended tasks and the interview questions by two independent raters based on a coding scheme containing coding rules, a description of the aspect ‘nature of models’, and examples of student answers. If more than one level was detected in a student’s response, only the highest level was scored in each of the tasks. Cohen’s kappa (k) was computed to measure the level of agreement between raters (interrater reliability; Wirtz and Caspar 2002). The interrater reliability was very good for all tasks (k interrater > .75). In the end, each student had 16 data points (two open-ended tasks, two interview questions, 12 forced choice tasks) expressing his or her level of understanding of the nature of models.
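
As an illustration of the agreement statistic used here, the following minimal Python sketch shows how Cohen’s kappa could be computed from two raters’ level assignments; the codings are invented, and scikit-learn’s cohen_kappa_score is merely one readily available implementation, not the software used in the original analysis.

```python
# Illustrative only: Cohen's kappa for two raters' level codings of the same responses
# (the codings below are invented for demonstration purposes).
from sklearn.metrics import cohen_kappa_score

rater_1 = ["I", "II", "II", "III", "I", "II", "II", "I", "III", "II"]
rater_2 = ["I", "II", "II", "II",  "I", "II", "II", "I", "III", "II"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")  # values above .75 were interpreted as very good agreement
```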

The grades were combined in pairs (7/8, 9/10, 11/12) following the structure of the German educational documents (‘Doppeljahrgang’, Engl. double year). According to the Standards for Educational and Psychological Testing (AERA et al. 2014), evidence for validity needs to be ensured separately for each subgroup.

For a comprehensive analysis of the data, we performed a Rasch analysis with the software ACER ConQuest 3 (Wu et al. 2007). Applying item response theory made it possible to estimate person and item parameters on the same scale, despite the incomplete set of data we had as a result of the multi-matrix design (Adams and Wu 2002; Bond and Fox 2001). Because the different models presented in the forced choice tasks may have an effect on the difficulty of the tasks, it is important to employ a method that explains the single-item response of a person on an item (Rost and Walter 2006). As we had tasks with three ordinal item responses, we estimated a partial credit model (PCM; Masters 1982), which allows describing responses on a multi-point scale (Embretson and Reise 2000). This has been shown to be appropriate for the analysis of forced choice data (Krell 2012).
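
For reference, the partial credit model specifies the probability that person $n$ with ability parameter $\theta_n$ responds in category $x$ of item $i$ as a function of the item’s step difficulties $\delta_{ij}$; the formula, following common presentations of Masters (1982), is added here only to make the scaling model explicit:

$$P(X_{ni} = x) = \frac{\exp \sum_{j=0}^{x} (\theta_n - \delta_{ij})}{\sum_{k=0}^{m_i} \exp \sum_{j=0}^{k} (\theta_n - \delta_{ij})}, \qquad x = 0, 1, \ldots, m_i,$$

where $m_i$ is the highest response category of item $i$ (here $m_i = 2$, corresponding to levels I to III) and the sum for $j = 0$ is defined as zero.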

In order to answer the first research question concerning differences between the three diagnostic assessment methods, we modelled the methods as three latent dimensions. Students’ understanding was inferred for each of the three dimensions using Warm’s (1989) weighted likelihood estimator (WLE). By using WLEs, the categorical data from the students’ answers were converted so that distances can be interpreted as linear interval measures (Wilson et al. 2008). These may be preferred over ordinal raw scores for tests in classical test theory, many of which assume linearity (Eggert and Bögeholz 2010). In order to investigate whether the students answered differently depending on the diagnostic method, the WLEs were analysed using methods of classical test theory. Different subgroups of students (depending on their grade) were analysed separately. A two-way mixed analysis of variance (ANOVA), with the diagnostic assessment methods constituting a repeated-measures independent variable and the grade constituting a between-group independent variable, allowed us to check whether there is an interaction effect between the methods and the grades.
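
A minimal sketch of such an analysis is given below. It assumes the WLEs are arranged in long format with hypothetical column names (‘student’, ‘grade’, ‘method’, ‘wle’) and uses pingouin’s mixed_anova and scipy’s ttest_rel as one possible implementation; it is not the analysis script actually used in this study.

```python
# Illustrative only: two-way mixed ANOVA on the WLE person estimates.
# Column names and the CSV file are hypothetical; this is not the original analysis script.
import pandas as pd
import pingouin as pg
from scipy.stats import ttest_rel

# Long format: one row per student x assessment method (OE, I, FC)
df = pd.read_csv("wle_long_format.csv")  # columns: student, grade, method, wle

aov = pg.mixed_anova(data=df, dv="wle", within="method",
                     subject="student", between="grade")
print(aov)  # F values, p values and effect sizes for method, grade and their interaction

# Dependent t test between two methods within one grade subgroup (example: OE vs FC in 11/12)
wide = (df[df["grade"] == "11/12"]
        .pivot(index="student", columns="method", values="wle"))
t, p = ttest_rel(wide["OE"], wide["FC"])
print(f"OE vs FC in grades 11/12: t = {t:.2f}, p = {p:.3f}")
```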

Data Collection—Evidence for Validity Based on Students’ Response Processes (RQ 2)

In a second study, we performed think-aloud protocols with another 30 students from Berlin secondary schools (grades nine to twelve; seven students from grades nine and eleven; eight students from grades ten and twelve), in order to collect evidence for validity based on response processes. Every student thought aloud while answering eight forced choice tasks (four different models with two tasks for each) out of a pool of 16 forced choice tasks (eight different models with two tasks for each). This resulted in a total of 254 answers. Think-aloud protocols provide information on how students perceive assessment tasks, which cognitive processes are being triggered, and which strategies are being used to answer the tasks (Leighton 2004). The study was performed according to the instructions from Van Someren et al. (1994). To start with, the student was given an explanation about the purpose of the research and the process of the think-aloud protocol. Afterwards, the student practiced thinking aloud by answering the question of how many windows there are in his or her home. In a third step, the interviewer explained the format of the forced choice tasks and then gave the instruction: “Perform the tasks and keep saying out loud what comes to your mind”. While the student was working on the tasks, the interviewer stayed in the background, except if the student stopped talking. Then the interviewer encouraged the student, “Keep on talking”. The sessions were tape recorded, including the instructions and the practicing phase in order to check whether the procedure was performed correctly.

Data Analysis—Evidence for Validity Based on Students’ Response Processes (RQ 2)

The data from the think-aloud protocols were transcribed and analysed by two independent raters. The analysis was performed only for those excerpts of the protocols where the student referred to the one answer option (out of the three) that they finally chose in each forced choice task. Inevitably, this procedure results in an uneven number of think-aloud units for different answer options. When analysing the protocols, the raters compared the students’ responses with the descriptions of the three levels of the aspect ‘nature of models’ in the theoretical framework (Table 1; Grünkorn et al. 2014; Upmeier zu Belzen and Krüger 2010). If the framework covered the student’s response, category I, II or III was assigned as the interpreted level of the answer option. If a deductive assignment was not possible, categories were inductively added to the coding manual. Again, Cohen’s kappa (k) was computed. It indicated a very good level of agreement between raters (k interrater > .75; Wirtz and Caspar 2002).

In the next step, the theoretically intended level of the answer options chosen by the students was compared with the students’ interpreted level of the respective options using a crosstable. The levels were treated as nominal data with no intrinsic order because we were solely interested in the fit between theoretical intention and interpretation. Consequently, we chose Cramer’s V as a chi-square-based measure of association. Additionally, we computed Cohen’s kappa (k) as a more conservative measure of agreement. In contrast to its purpose as a measure of interrater reliability, it is here used to “describe the degree of agreement between an assigner’s judgment about objects and the known criterion status of these objects” (Brennan and Prediger 1981, p. 689). The analysis was performed separately for different grades, representing relevant subgroups.
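
The following minimal Python sketch shows how both measures could be computed from the paired intended and interpreted levels; the codings are invented for illustration, scipy’s chi2_contingency supplies the chi-square statistic from which Cramer’s V is derived, and scikit-learn’s cohen_kappa_score yields kappa.

```python
# Illustrative only: Cramer's V and Cohen's kappa for intended vs. interpreted levels
# (the paired codings below are invented for demonstration purposes).
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

intended    = ["III", "II", "II", "I", "III", "II", "I", "III", "II", "II"]
interpreted = ["II",  "II", "II", "I", "III", "II", "II", "III", "II", "I"]

crosstab = pd.crosstab(pd.Series(intended, name="intended"),
                       pd.Series(interpreted, name="interpreted"))

chi2, p, dof, expected = chi2_contingency(crosstab)
n = crosstab.to_numpy().sum()
k = min(crosstab.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))               # chi-square-based measure of association

kappa = cohen_kappa_score(intended, interpreted)  # more conservative measure of agreement
print(f"Cramer's V = {cramers_v:.2f}, Cohen's kappa = {kappa:.2f}")
```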

Findings

Evidence for Validity Based on Relations to Other Variables (RQ 1)

Keeping in mind that the analysis for the first research question is based on the person and item parameters estimated by the PCM, the appropriateness of the model’s assumptions has to be reported first (Bond and Fox 2001). The three-dimensional PCM produced satisfactory infit values for the items (0.92 < wMNSQ < 1.12), and the t values were within an acceptable range between −0.7 and +1.0 (Adams and Wu 2002). Furthermore, the threshold parameters of all items were ordered and increasing in difficulty, an observation which is in line with the theoretical assumption that more complex levels of understanding are represented by higher step difficulties. Therefore, we assume the three-dimensional PCM to fit the data on students’ understanding of models (Rost 2004; Kauertz and Fischer 2006).

Reliability indices were calculated for both students and items. While the item separation reliability of 0.925 indicates a spread of items from easy to difficult, resulting in an excellent separation of items (Bond and Fox 2001), the person reliability (EAP/PV) is similar for all three assessment methods (forced choice tasks 0.508, open-ended tasks 0.550, interviews 0.538).

Figure 2 shows boxplots of the WLEs for the three assessment methods with which students’ level of understanding of the nature of models was assessed. WLEs are given on a logit scale ranging from negative values for decreasing understanding to positive values for increasing understanding. The grades (double years 7/8, 9/10, 11/12) were taken into account as relevant subgroups.

Fig. 2
figure 2

Boxplots of the WLEs for the three assessment methods and the three grade levels (7/8, 9/10, 11/12). n = 185 for open-ended (OE) tasks, interviews (I) and forced choice (FC) tasks

Compared to the forced choice tasks, students in all grades answered at a lower level in the open-ended tasks and the interviews. While the understanding of the nature of models measured with the open-ended tasks and the interviews increased with grade, the forced choice tasks did not measure any obvious difference between the grades. The means of the WLEs actually decrease slightly with increasing grade (Mean WLE 7/8: .169; Mean WLE 9/10: .096; Mean WLE 11/12: .094). In relation to the first research question, the descriptive data show that the three diagnostic assessment methods give similar results in grades 11/12. The two-way mixed ANOVA revealed a significant method-by-grade interaction, F(2, 216) = 7.36, p = .00, which accounts for 12 % (η² = .120) of the total variance in the score. This indicates that the comparison of the diagnostic methods differs significantly depending on the grade, with a medium effect size. Additionally, we performed dependent t tests comparing the person abilities across the three methods separately for the grades (Table 4).

Table 4 Comparison of person abilities across the three diagnostic assessment methods (dependent t tests)

In grades 7/8 and 9/10, the students are attributed a lower ability for the open-ended tasks and the interviews than for the forced choice tasks. These differences are statistically significant and represent medium to large effect sizes (Table 4). In grades 11/12, the students do not answer significantly differently across the open-ended tasks, the interviews and the forced choice tasks.

Evidence for Validity Based on Students’ Response Processes (RQ 2)

The second research question required a comparison between the intended level of the answer option chosen by the students in the forced choice tasks—based on the framework by Upmeier zu Belzen and Krüger (2010), this may be level I, II or III—and the interpreted level of the same answer option by the student. Table 5 shows the matches and mismatches in level for the answer options that the students chose during the think-aloud protocols. Again, the results are shown by grade, representing different subgroups.

Table 5 Crosstable of intended and interpreted level in the forced choice tasks, per grade assessed by the use of think-aloud protocols

First of all, for grades ten, eleven and twelve, about 50 % of the students chose answer options at level II in the forced choice tasks, whereas students from grade nine preferred answers at level III in the forced choice tasks. The levels that were identified in the think-aloud protocols show that nearly all students expressed ideas at level II. In comparison to the other grades, students from grade nine expressed more ideas at level I, whereas students from grade twelve expressed most ideas at level III.

The numbers in the diagonals of Table 5 show the cases in which the students understood an answer option according to the intended theory. The fact that there are quite a few combinations where students interpreted an answer option at a level different from the intended one shows that there were difficulties in understanding. For example, in the ninth grade, nine out of 17 students misinterpreted the answer option at level III as level II, and seven students in grade ten misinterpreted an answer option at level I as level II. With this in mind, it may be pointed out that a misinterpretation from a higher into a lower level is more severe with regard to the interpretation of the results. A student who is being overrated would not be identified as needing further support, even though he or she actually requires that support.

Table 6 presents some of the qualitative data on which the numbers are based, in order to elicit what causes students to misinterpret the answer options.

Table 6 Answer options for the water cycle task and students’ interpretations

All of the students who misinterpreted level III as level II or I understood the part of the level III answer option that refers to the hypothetical nature of the model as a consequence of a lack of information. According to the students, this lack may result from their own or other people’s ignorance or from the scarce representation of information in the model. In their interpretations of level III (Table 6), the students often ignored the second part of the sentence, which gives the reason for the hypothetical nature of the model, and reinterpreted the ‘may take place’ in the first part of the sentence. The number of misinterpretations decreases with higher grades, resulting in increasing measures of Cramer’s V and Cohen’s kappa (Table 7).

Table 7 Measures of association and agreement between intended and interpreted level for the forced choice tasks per grade

Both Cramer’s V and Cohen’s kappa show a good consistency between the intended and the interpreted level in grades eleven and twelve. Unlike Cohen’s kappa, the less conservative Cramer’s V indicates a large effect for grades nine and ten as well.

Discussion

The aim of this paper was to present evidence of validity in an empirical evaluation of the meaning and consequences of the measurement with forced choice tasks. The diagnostic instrument is to be used by biology teachers in order to gain knowledge about how their students understand the nature of models. According to the Standards for Educational and Psychological Testing, “the process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations” (AERA et al. 2014, p. 11). The evidence of validity, which was gained in the two approaches presented in this paper, will be judged in relation to the findings of other researchers. Furthermore, methodological constraints and the need for further investigations will be discussed.

Students’ Understanding of the Nature of Models Across Grades

The comparison of answers of students from different grades reveals that, with regard to the aspect ‘nature of models’, the students in higher grades show a more elaborate understanding when assessed with the open-ended tasks or in the interviews. Other researchers have also identified age as a factor that may influence students’ understanding of models (Chittleborough et al. 2005; Grosslight et al. 1991; Krell et al. 2014b; Willard and Roseman 2010). The ongoing analysis of the results of a longitudinally designed study (Patzke et al. 2015) indicates that increasing grade may have a positive but small effect on students’ understanding of models.

Evidence for Validity Based on Relations to Other Variables (RQ 1)

In order to provide a piece of evidence for the validity of the interpretations drawn on the basis of the forced choice instrument, we expected the different diagnostic assessment methods for students’ understanding of models to converge in the measurement of this trait (Campbell and Fiske 1959). We found indications of a convergence of measurements for the subgroup of students from grades eleven and twelve. The observation that the forced choice instrument meets this previously defined expectation can be interpreted as a piece of evidence for the assumption that it can serve its intended use of diagnosing students’ understanding of the nature of models. Still, this conclusion holds only for the subgroup of students from grades eleven and twelve and needs to be discussed with caution. Eid and Diener (2006) emphasise that, in addition to the traditional aim of reaching maximum convergent validity, an analysis of method influences can “tell a more interesting story of the construct under consideration” (p. 6). Following this proposition, we take a look at possible reasons for differences between the assessment methods.

Martinez (1999) argues that different task formats require different cognitive abilities. Likewise, Traub and MacRury (1990) argue that open- and closed-ended tasks do not assess the same cognitive abilities. Considering that it was the same students who were assessed with all three diagnostic methods, the differences in understanding may indeed be a result of the different question formats (open-ended, interview, forced choice). While open-ended tasks reduce measurement error through random guessing (Bridgeman 1992) and eliminate unintended corrective feedback (Katz et al. 2000), their scoring is prone to a higher level of subjectivity (Bennett 1993). Anastasi (1976) refers to the interaction between student and teacher and points out that the former may adopt an approach to writing in which obscure ideas are presented in a focused style in order to be more accessible to the teacher. This may result in the students mainly expressing ideas at levels I and II while refraining from verbalising the more difficult to grasp ideas at level III.

Additionally, the differences across the diagnostic methods may be a consequence of some of the dissimilarities between the tasks. Hoyt (2000) distinguishes between diagnostically irrelevant method-specific and common factors of assessment methods, which both reduce construct validity but with different implications for convergence. While method-specific factors reduce convergence, common factors between methods increase it. Thus, the common characteristics among the open-ended tasks and the interviews may account for some of the observed convergence in this study. Furthermore, the questions in the open-ended tasks and the interviews are decontextualised in nature. They refer to models in biology but not to specific models as was the case in the forced choice tasks. Considering the fact that the aim of this study was not to perform fundamental research into the differences between specific methods (Martinez 1999; Traub and MacRury 1990) but to collect empirical evidence for validity, we did not systematically vary the task stems. Krell et al. (2012, 2015) point out that research findings about students’ understandings of models are, to some extent, ambiguous as a result of decontextualised and contextualised tasks. In the study by Krell et al. (2015), the students showed a less elaborate understanding of models in the decontextualised tasks. Sins et al. (2009) argue that “in specific contexts, students may entertain advanced epistemologies without being able to articulate the underlying viewpoints in general” (p. 1208). Furthermore, findings indicate that students express different understandings when tasks refer to different models (Al-Balushi 2011; Krell et al. 2012, 2014a). The context-specificity may provide an explanation for the abovementioned person reliability (EAP/PV) values in this study. As models are context-specific tools, students’ epistemological ideas are likely to differ between models (e.g., Schwarz 2002). Several researchers point out that there is a need to investigate to what extent students’ understanding of models is model-specific (e.g., Krell et al. 2014a; Leach et al. 2000; Schwarz et al. 2008). A detailed analysis of the influence of the different models used in the forced choice tasks on the interpretation of the scores from the diagnostic instrument is yet to be conducted and is expected to provide further insight into issues of validity for the intended use of the diagnostic instrument.

Grade also had an influence on the diagnostic result. The students in higher grades showed a more elaborate understanding than those in lower grades when assessed with the open-ended tasks or in the interviews. The forced choice tasks did not track these differences between the grades. Furthermore, it has to be stated that in all grades, the students answered at a higher level in the forced choice tasks than in the other two methods, this difference not being significant for grades 11/12.

We propose two hypotheses that may explain these results. First, perhaps the students in lower grades are not able to express their understanding of the nature of models through the mode of an open-ended question or an interview due to a lack of writing ability or due to the thought being too implicit (Anastasi 1976). The forced choice tasks may offer younger students the chance to identify their thinking without having to verbalise it. Second, closed-ended task formats do not give insights into the reasoning behind the answer (Odom and Barrow 1995), and it is therefore possible that the students chose answer options without understanding the concepts being offered to them in the forced choice tasks. They would be more likely to guess or reinterpret and pick a random answer resulting in a scattered response pattern across the three levels. This may explain the slight decrease in the WLEs for the forced choice tasks in higher grades. If students stop guessing and start intentionally selecting the perspectives that most represent what they think, the distribution is likely to change and there will be fewer answers at level III. In this case, the forced choice tasks and consequently the diagnostic instrument would not be suitable to diagnose students’ understanding in lower grades.

The analysis of the think-aloud protocols provides data that offers insights into these hypotheses by eliciting reasons for the students’ decisions in the forced choice tasks.

Evidence for Validity Based on Response Processes (RQ 2)

The students’ expressions of their thoughts in the think-aloud protocols could be related to the theory, showing that students mainly saw models as idealised representations of an original, embodying merely some of the attributes of the original. As students frequently use models in the science classroom as a means to replace or simplify an original in order to describe and explain biological phenomena (e.g., Danusso et al. 2010) and therefore learn science (Hodson 2014) with the help of the models, it is understandable that students think that models should explain how the original works as exactly and efficiently as possible.

Before drawing conclusions about evidence of validity concerning the correspondence between the theoretical content of the forced choice answer options and the students’ interpretations of the latter, the methodological decision to perform concurrent think-aloud protocols needs to be evaluated. Nisbett and Wilson (1977) claim “that people may have little ability to report accurately on their cognitive processes” (p. 246), and that they merely report their naive theories about what they think they are doing. Ericsson and Simon (1998) refer to the distinction between inner speech and social speech by Vygotsky (1962) and point out that “verbal descriptions of one’s thinking directed at another individual are likely to alter the course of thinking” (p. 181). In contrast to descriptions and explanations of thinking, concurrent think-aloud protocols provide a way to verbalise one’s thoughts without reactive influences on one’s thinking (Ericsson and Simon 1998). The think-aloud protocols in this study were performed according to guidelines for this method (Van Someren et al. 1994), and the findings we have obtained show incoherent discourses, an observation which seems consistent with Vygotsky’s (1962) comment that “inner speech appears disconnected and incomplete” (p. 138). Although we believe that the think-aloud protocols do not reflect everything the students thought while dealing with the forced choice tasks, we are confident that they provide a fruitful source of information about whether the students interpret the tasks as theoretically intended. The students’ interpretations of the forced choice tasks are being used “to verify the actual set of interconnected knowledge and skills that students use to respond to […] items” (Leighton 2004, p. 7). It has to be pointed out here that this argument for validity is provided not for the instrument as a whole but for single forced choice tasks.

The students’ interpretations can be used to generate explanations for the findings from the first research question. We hypothesised that students in lower grades do not adequately understand the answer options in the forced choice tasks and are consequently more likely to either guess or to pick an answer option that is being misunderstood or reinterpreted. This may result in the selection of a level III answer option although the same students express views at lower levels when assessed using the open-ended tasks or the interviews.

Indeed, the misinterpretations of the answer options decrease with higher grades. This trend is shown statistically by the increasing values of Cramer’s V and kappa. Both measures show a good fit between the intended and interpreted level for grades eleven and twelve. In the lower grades, there are 13 misinterpretations in grade nine compared with 18 misinterpretations in grade ten, which leads to the decrease in values for the tenth grade (Table 5). These results may seem inconsistent with the trend at first but can be explained by an interpretation that takes into account the intended use of the diagnostic instrument and is therefore essential when judging validity. For this kind of interpretation, the direction in which students misinterpret an answer option is of special concern. A misinterpretation in the direction of a lower level would ultimately result in an overrating of the student in the diagnosis. Based on the diagnostic result from the forced choice instrument, the teacher would not foster this student’s understanding (when, in reality, this is needed) but engage him or her in challenging tasks at level III, a mistake which we consider more severe than an underrating of a student. Hence, we should pay special attention to the cases where students interpret an answer option in the direction of a lower level. Within these cases (grade 9: 13; grade 10: 9; grade 11: 7; grade 12: 4; Table 5), the misinterpretations decrease from grade nine to grade ten. The interpretations of an answer option in the direction of a higher level in grade ten may result from the student’s understanding that a model can never be just a copy of its original, because in that case it would be the original and lose its purpose as a model (Table 6; see also Giere 1988).

Reasons for the difficulties of interpretation at level III could be identified with the help of the analysis of the protocols. The students most often misinterpreted the part of the forced choice tasks that refers to the model being a theoretical reconstruction and therefore containing hypotheses. Students tend to confuse a lack of information about some parts of the original or a lack of information concerning their own knowledge about the original with the model being the hypothesis of researchers about the original. They seem to not have a profound understanding of what defines a scientific hypothesis. The think-aloud protocols underline the observation of other researchers who point out that students see hypotheses mainly as “(educated) guesses” (e.g., McComas 2013). Meisert (2007) puts emphasis on the point that in inquiry-based approaches of learning science, hypotheses are a means to integrate students’ preconceptions into the lesson. Clearly, these often lack a theoretical basis and are mere guesses instead of scientific hypotheses. As the precondition for an experiment or an observation, students’ hypotheses can be seen as their mental models about a phenomenon (Meisert 2007). Considering that experiments and observations are frequently used in the school context to confirm scientifically accepted knowledge (Welzel et al. 1998), it can be suspected that students see their hypotheses (mental models) as a representation of their own lack of information. Gibbs and Lawson (1992) link students’ uneducated understanding of the term hypothesis to the way it is treated in textbooks. They stress that “hypotheses are not merely educated guesses based upon collected information as some textbook authors, and perhaps most high school teachers, would lead you to believe” (p. 139). In a recent study, Ubben et al. (2015) demonstrated that the material teachers use in schools, in order to foster their students’ understanding, is not adapted for establishing modelling as a scientific practice. Finally, Gobert et al. (2011) argue that, “as typical science instruction does not represent the real world of science and scientific practices, it is not surprising that students have naive views of the nature of science, of scientific inquiry, and the nature of models” (p. 678). The argument that science teachers themselves have a rather limited understanding of the nature of models and use models primarily to demonstrate established knowledge (Crawford and Cullin 2005; Van Driel and Verloop 1999, 2002) may as well be used to explain the students’ misinterpretations of the level III answer options. The findings from the think-aloud protocols could be useful as points of reference for interventions with teachers.

Conclusion

Supporting students to understand the nature of models as part of an understanding of the nature of science can be seen as an important goal of science education. There are successful attempts that describe how teachers engage their students in model-based inquiries and help them understand the nature of models (e.g., Schwarz and White 2005; Schwarz et al. 2009). However, the teachers who participated in these studies received instructional support from the researchers. In order to support teachers without the presence of researchers, we developed a forced choice task-based diagnostic instrument that is intended to give teachers direct feedback on their students’ understandings in order to create starting points for individualised interventions.

The good understanding of the forced choice tasks in the think-aloud protocols as well as the convergence of the forced choice tasks with measurements using the open-ended tasks and the interviews for grades eleven and twelve give reason to believe that an efficient and, at the same time, valid diagnosis using forced choice tasks is feasible for this group. Here, the scoring on the basis of the forced choice tasks provides diagnostic information that is comparable to the diagnosis with open-ended tasks or interviews without the time-consuming and potentially biased analysis of the student data (Bennett 1993). It is important to stress that diagnostic implications should only be deduced from the assessment with the forced choice tasks for grades eleven and twelve. In lower grades, other formats (e.g., open-ended tasks or interviews) should be used to assess students’ understanding of the nature of models. On the other hand, it may not be of paramount interest to diagnose younger students with regard to individualised interventions, given the observation that in the lower grades nearly all of the students have a naive understanding of the nature of models and should consequently be supported to understand models as theoretical reconstructions (level III).