1 Introduction

Scientific reasoning is defined as intentional knowledge seeking (Kuhn, 2002), and it comprises various components, such as experimentation skills (how to design an informative experiment?), data interpretation skills (how to make sense of patterns of covariation data or confounded data?), or nature of science understanding (what is it that scientists do and what kind of questions do they ask?) (Koerber et al., 2005). Scientific reasoning is an important twenty-first century skill (Trilling & Fadel, 2009). In modern knowledge societies, mature scientific reasoning skills are necessary in many professional occupations, and they also allow citizens to make well-informed decisions with respect to socio-scientific issues, such as climate change or health crises (Ratcliffe & Grace, 2003; Sadler, 2004). While early developmental work on children’s and adolescents’ scientific reasoning focused on students’ shortcomings (Inhelder & Piaget, 1958), a growing body of research shows that children as young as 6 years perform better than chance on many scientific reasoning tasks (Koerber & Osterhaus, 2019, 2021).

In early elementary school, for instance, children reliably differentiate a conclusive from an inconclusive test of a hypothesis (Sodian et al., 1991). They can also reason about the informativeness of different kinds of evidence (Köksal et al., 2021). In kindergarten (children aged 4 to 6 years), children reveal an emergent understanding of the nature of science (Samarapungavan et al., 2008), and they successfully select informative interventions that allow them to draw appropriate causal inferences (Lapidow & Walker, 2020). Kindergarteners also select unconfounded experiments when they are presented with conflicting evidence (van Schijndel et al., 2015), and 4-year-olds show a rudimentary command of the control-of-variables strategy (i.e., vary one variable at a time while keeping all others constant) (van der Graaf et al., 2015). The ability to successfully interpret data also emerges in kindergarten, when children begin to draw correct inferences from simple covariation data, such as perfect and unconfounded patterns of covariation data (Koerber et al., 2005; Piekny & Maehler, 2013).

Most studies of early scientific reasoning are conducted with Western samples, and studies in Asian countries are rare. The rationale of the present study is therefore to investigate the reliability and convergent validity of a scientific reasoning inventory that was originally developed in Germany, the Science-K Inventory (SKI) (Koerber & Osterhaus, 2019), when applied to a Chinese sample of 6- and 7-year-olds. In addition, we compare the performance of an urban Chinese sample with the performance of a rural Chinese sample, we ask whether there are gender differences in the scientific reasoning of Chinese 6- and 7-year-olds, and we investigate whether subcomponents of scientific reasoning are related.

The Science-K Inventory (SKI) is a closed-response instrument that comprises 30 items on children’s experimentation skills, data interpretation, and nature of science (NoS) understanding. Items on experimentation skills tap children’s ability to differentiate a conclusive from an inconclusive test of a hypothesis (Sodian et al., 1991), as well as their mastery of the control-of-variables strategy (i.e., vary one thing at a time, keep nonfocal variables constant). Children who can differentiate conclusive from inconclusive evidence, for instance, understand that, when trying to find out if someone is good at doing puzzles, this person should piece together a puzzle with many pieces (conclusive evidence) and not just a few pieces (inconclusive evidence). Similarly, when trying to find out if cacao powder dissolves better in warm or cold milk, children with a command of the control-of-variables strategy understand that one should compare how well it dissolves in equal servings of warm and cold milk, rather than comparing how well it dissolves in a large glass of warm milk and a small glass of cold milk. Items on data interpretation measure children’s ability to make sense of simple patterns of covariation data and to understand that one cannot draw valid inferences with respect to a single variable from confounded data. For instance, a runner could observe that she runs faster when wearing both her new running shoes and her new running suit. Children who understand that confounded data do not permit valid inferences recognize that this pattern of data does not reveal whether it is the shoes or the suit that influences how fast the runner runs. Finally, NoS items tap children’s understanding of what scientists do (they try to find out something about the world) and which types of questions they ask (questions whose answers provide explanations).

The SKI was validated in a German study. Koerber and Osterhaus (2019) administered the SKI to 227 six-year-olds in kindergarten. The administration was completed during three individual interview sessions, each lasting approximately 20 min. During these interviews, trained researchers guided the children through all multiple-choice questions and recorded their answers. A scale analysis of the data obtained in this study showed that the SKI is a reliable instrument (Cronbach’s α = 0.78), and 6-year-olds performed significantly better than chance, with a mean performance of 42.5% correct (Koerber & Osterhaus, 2019). In line with the many studies that have associated scientific reasoning with children’s language skills (Koerber et al., 2017; Osterhaus et al., 2017; van de Sande et al., 2019; van der Graaf et al., 2018), performance on the SKI was also related to language skills, with correlation coefficients across studies ranging between 0.36 (Koerber & Osterhaus, 2021) and 0.41 (Koerber & Osterhaus, 2019).

1.1 This study

In the present study, we investigate the reliability and convergent validity of a shortened 10-item scale of the SKI that was translated into Mandarin. The 10 items of this shortened Chinese version of the Science-K Inventory (SC-SKI) were selected based on data from the Koerber and Osterhaus (2019) study that was conducted in Germany. In particular, we selected an item pool that would cover the broad aspects of scientific reasoning while simultaneously resulting in a reliable and balanced scale. To address the convergent validity of this shortened Chinese scale, we measured children’s language skills (i.e., their vocabulary understanding). Previous work (e.g., Koerber et al., 2017; Mayer et al., 2014; Osterhaus et al., 2017; van de Sande et al., 2019; van der Graaf et al., 2018) has revealed substantial associations between scientific reasoning and young children’s language skills, which are well expected and indicative of the broad association between scientific reasoning and (verbal) reasoning in general. Among children’s language skills, vocabulary understanding may be a particularly relevant aspect, especially for NoS. NoS requires that children understand science-specific terminology, including an understanding of what it means to ‘investigate’ something or to ‘make an assumption’ (Osterhaus et al., 2017).

To investigate the ability of the SC-SKI to detect expected performance differences, we compared the performance of an urban sample to a rural sample from Hunan Province, China. Because of the rural–urban disparity in schooling that prevails in China (e.g., Zhang, 2017), we reasoned that the children from urban areas should outperform children from more rural areas, which is a finding that, if it holds, will lend support to the usefulness of the SC-SKI and its ability to detect meaningful individual differences.

In addition, we assessed whether there are gender differences in this sample of young Chinese elementary school students. Some previous work (e.g., Lazonder et al., 2020) has observed such differences in older elementary school children, with males outperforming females, whereas other studies did not find such differences in elementary school (e.g., Osterhaus et al., 2017). Previous work has also identified significant associations between scientific reasoning subcomponents. For instance, Osterhaus et al. (2017) report a significant factor correlation of 0.54 for elementary school children’s experimentation skills and their NoS. In the present study, we therefore assess the correlation between subcomponents, asking whether significant associations emerge in this sample of Chinese elementary school children.

1.2 Aims and objectives

The present study had five main aims: (1) to address the reliability of the SC-SKI, (2) to show its convergent validity by investigating the association between children’s performance on the SC-SKI and their language skills, (3) to investigate differences in performance between an urban and a rural Chinese sample, (4) to investigate whether there are gender differences in scientific reasoning performance, and (5) to investigate whether there are significant associations between scientific reasoning subcomponents.

2 Methods

2.1 Participants

The participants were 69 first-year elementary school students (31 females, 38 males) from two schools in Hunan Province, China: one located in an urban area (n = 53) and one located in a rural area (n = 16). All children were aged 6 to 7 years; there were 43 six-year-olds and 26 seven-year-olds (M = 6.38, SD = 0.49). All children were at the end of their first semester. First-year curricula were similar across schools, and the children received instruction in Chinese, mathematics, and English. Science-related courses were not offered at either school. Parental and teachers’ informed consent and child assent were obtained for all participants.

2.2 Materials

Scientific reasoning Scientific reasoning was assessed using ten closed-response items from the shortened Chinese version of the Science–K(indergarten) Inventory (SC-SKI) (Koerber & Osterhaus, 2019). The items were chosen based on pilot studies from Germany (Koerber & Osterhaus, 2019, 2021), and they were translated into Mandarin (see Appendix). The SKI was translated by one of the researchers; a back-translation was done to confirm the accuracy of the translation.

The full version of the SKI (Koerber & Osterhaus, 2019) is a 30-item instrument developed to assess emerging scientific reasoning abilities in kindergarten and early elementary school. The items are administered in individual interviews, and the children are assessed for their abilities in experimentation and data interpretation, and their understanding of the nature of science. The shortened Chinese SC-SKI comprises 4 items on experimentation, 3 items on data interpretation, and 3 items on nature of science understanding. Items on experimentation assessed children’s ability to differentiate a conclusive from an inconclusive test (items Exp-1 and Exp-2), as well as children’s understanding of the control-of-variables strategy (items Exp-3 and Exp-4). Items on data interpretation tapped children’s ability to understand that confounded data patterns do not allow conclusions to be drawn with respect to a hypothesis (items Dat-1 to Dat-3). Items on nature of science understanding tested children’s understanding of what scientists do (item NoS-1) and what kinds of questions they ask (items NoS-2 and NoS-3). All items were presented with three answer options, and no corrective feedback was given.

Full credit (1 point) was given when the children selected the correct answer. All other (wrong) answers were awarded 0 points.
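The scoring rule above can be illustrated with a minimal Python sketch. It is not the authors' scoring script: the answer key and the responses are made up, the function name `score_child` is hypothetical, and treating a missing response as 0 points is an assumption. Only the item labels and the 4/3/3 split across subscales follow the text.

```python
# Item labels follow the naming used in the text (Exp-1 ... NoS-3).
SUBSCALES = {
    "experimentation": ["Exp-1", "Exp-2", "Exp-3", "Exp-4"],
    "data_interpretation": ["Dat-1", "Dat-2", "Dat-3"],
    "nature_of_science": ["NoS-1", "NoS-2", "NoS-3"],
}


def score_child(responses, answer_key):
    """Return percent correct overall and per subscale.

    Both arguments map item labels to an answer option. A correct answer
    earns 1 point; anything else, including a missing response
    (assumption), earns 0 points.
    """
    points = {item: int(responses.get(item) == key)
              for item, key in answer_key.items()}
    result = {"total": 100.0 * sum(points.values()) / len(points)}
    for name, items in SUBSCALES.items():
        result[name] = 100.0 * sum(points[i] for i in items) / len(items)
    return result


# Made-up answer key and responses, for illustration only.
answer_key = {item: "A" for items in SUBSCALES.values() for item in items}
responses = {**answer_key, "Dat-1": "B", "Dat-2": "C", "NoS-3": "B"}

for scale, percent in score_child(responses, answer_key).items():
    print(f"{scale}: {percent:.1f}% correct")
```

Reporting percent correct per subscale alongside the total keeps the three components comparable even though they contain different numbers of items.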

Language (vocabulary) Children’s language skills were assessed using 10 items from the vocabulary test of the Wechsler Preschool and Primary Scale of Intelligence (Wechsler, 2003) that were translated by the researchers: shoe, bike, hat, nail, gasoline, donkey, seesaw, to participate, diamond, to hate. The researchers read each word aloud and asked the children to explain what it meant (e.g., ‘Could you explain the word shoe to me?’ or ‘What is a shoe?’). When the children were hesitant to respond or when they gave an incomplete answer, the experimenters repeated the question or elaborated and asked the children to tell them more about the word (e.g., ‘Please tell me more about a shoe.’). All sessions were recorded for subsequent coding.

For nouns, full credit (2 points) was given if the children provided a correct synonym, a main function of the object, the main characteristic of the object (or several characteristics of the object), or a correct classification that is stated in the Xinhua Dictionary (11th edition). For verbs, full credit (2 points) was given if the children provided a precise description of the activity. Partial credit (1 point) was given if the children provided a correct but simple explanation or a synonym that was not the same as the word given (e.g., poultry-pigeon), if they explained it using an unconventional function (e.g., knife-to kill a person), provided a secondary characteristic, mentioned the word when trying to explain the word, or used an action to represent the word. No credit (0 points) was given for answers that simply restated the question or that were wrong. Children also received 0 points when they answered in dialect. The test was discontinued after the children had answered 5 consecutive items incorrectly. The composite scores reported are average item scores on the scale from 0 to 2 points.
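The discontinue rule and the composite score can be sketched as follows. This is an illustration under stated assumptions rather than the authors' coding procedure: the function name `vocabulary_composite` is hypothetical, items that were never administered after discontinuation are assumed to count as 0, and the composite is assumed to be the mean over all 10 items. The item scores are made up.

```python
def vocabulary_composite(item_scores, stop_after=5):
    """Mean item score (0 to 2) under a discontinue rule.

    item_scores: coded scores (0, 1, or 2) in order of administration.
    Once `stop_after` consecutive items have been scored 0, testing stops
    and all remaining items count as 0 (assumption).
    """
    consecutive_wrong = 0
    counted = []
    stopped = False
    for score in item_scores:
        if stopped:
            counted.append(0)  # item not administered
            continue
        counted.append(score)
        consecutive_wrong = consecutive_wrong + 1 if score == 0 else 0
        if consecutive_wrong >= stop_after:
            stopped = True
    return sum(counted) / len(counted)


# Made-up codings for two children, for illustration only.
print(vocabulary_composite([2, 2, 1, 2, 2, 1, 2, 0, 1, 2]))
print(vocabulary_composite([2, 0, 0, 0, 0, 0, 2, 2, 2, 2]))
```

In the second call, the five zeros in a row trigger the rule, so the last four scores are ignored even though they are listed.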

2.3 Procedure

A trained researcher conducted the individual interviews. Interviews were conducted online; data collection took place between December 1, 2020, and January 1, 2021. Scientific reasoning and language skills were assessed during two separate sessions. In the urban school, the two sessions took place on two separate days; in the rural school, both sessions were conducted on the same day. Researchers did not provide corrective feedback during or after a session. The order in which the three components of scientific reasoning (experimentation, data interpretation, and nature of science understanding) were assessed was counterbalanced across participants.

2.4 Analysis plan

The data were analyzed using IBM® SPSS® Statistics version 27 and R 4.0.4. Average scientific reasoning performance was computed as percent correct (on the SC-SKI), and we used a t-test to test whether children’s average performance significantly differed from chance (here 33.3%). Item difficulty and discrimination were calculated using the ‘sjPlot’ package for R (Lüdecke, 2021), and reliability (McDonald’s ωt) was computed using the ‘psych’ package for R (Revelle, 2022). In particular, a factor analysis was performed and omega total was computed for the general factor, as well as for three subfactors. Performance differences between groups (males vs females, urban vs rural sample) were assessed using t-tests (SPSS). Correlations, including those between subcomponents of scientific reasoning (i.e., experimentation and data interpretation skills, NoS), were calculated based on composite scores (SPSS).
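Two of the steps in this plan, the test against chance and the item analysis, can be sketched in plain Python. This is not the authors' code, which relied on SPSS and the R packages named above. The response matrix is made up, the function names are hypothetical, and item discrimination is computed here as the corrected item-total correlation, which is one common definition and an assumption about what the original analysis reported.

```python
from math import sqrt
from statistics import mean, stdev

CHANCE = 100.0 / 3  # three answer options per item


def one_sample_t(scores, mu=CHANCE):
    """Return (t, df) for a one-sample t-test of `scores` against `mu`."""
    n = len(scores)
    t = (mean(scores) - mu) / (stdev(scores) / sqrt(n))
    return t, n - 1


def pearson_r(x, y):
    """Pearson correlation; NaN if either variable has no variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def item_analysis(matrix):
    """Return (difficulty, discrimination) for each item.

    matrix: one row per child, one 0/1 entry per item.
    difficulty: proportion of children answering the item correctly.
    discrimination: correlation between the item and the sum of all
    other items (corrected item-total correlation).
    """
    results = []
    for j in range(len(matrix[0])):
        item = [row[j] for row in matrix]
        rest = [sum(row) - row[j] for row in matrix]
        results.append((mean(item), pearson_r(item, rest)))
    return results


# Made-up responses of six children to four items, for illustration only.
matrix = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
percent_correct = [100.0 * sum(row) / len(row) for row in matrix]
t, df = one_sample_t(percent_correct)
print(f"t({df}) = {t:.3f}")
for j, (difficulty, discrimination) in enumerate(item_analysis(matrix), 1):
    print(f"item {j}: difficulty = {difficulty:.2f}, "
          f"discrimination = {discrimination:.2f}")
```

Correlating each item with the sum of the remaining items, rather than with the full total, avoids inflating discrimination by counting the item on both sides of the correlation.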

3 Results

3.1 Core ability

The core performance data are given in Table 1. Dat-1 and Dat-2 (interpreting confounded data), as well as NoS-3 (what questions do scientists ask?), were difficult: neither the urban nor the rural sample achieved an average of more than 25% correct on these items. The mean performance (in percent correct) across all items was 53.4% (SD = 30.3) in the urban school; it was 28.0% (SD = 23.1) in the rural school. In the urban school, more than 75% of the children gave a correct answer to items Exp-3 and Exp-4 (both assessing children’s understanding of the control-of-variables strategy) and item NoS-1 (what do scientists do?). In the rural school, more than 75% of the children gave a correct answer only to item NoS-1. Average performance across groups differed significantly from chance [t(68) = 6.653, p < 0.05]. However, this was not true for the children from the rural area, who did not perform significantly better than expected based on guessing [t(15) = − 1.559, p > 0.05]. Descriptively, there was a gender difference, with boys (M = 49.0%, SD = 31.3) achieving a higher average performance than girls (M = 46.5%, SD = 25.1). However, this descriptive difference was nonsignificant [t(9) = 0.663, p > 0.05], a finding that is in line with prior work showing no gender differences in early scientific reasoning (Koerber & Osterhaus, 2019; Koerber et al., 2015). The reliability of the SC-SKI was good, with McDonald’s ωt = 0.60 for the entire test, and ωt being 0.43, 0.56, and 0.62 for the three factors. Item discrimination and difficulty indices are given in Table 1.

Table 1 Percent Correct per Item in the Urban and Rural Samples

The mean score (on a scale from 0 to 2 points) for the vocabulary test was 1.439 (SD = 0.488). There was no difference in performance between boys (M = 1.444, SD = 0.19) and girls (M = 1.44, SD = 0.13) [t(9) = − 0.045, p > 0.05]. Children from the urban sample (M = 1.64, SD = 0.10) outperformed children from the rural area (M = 0.77, SD = 0.34), with children from the rural area providing less accurate explanations [t(9) = 10.214, p < 0.05].

3.2 Correlational analysis

The correlation between the composite scientific reasoning score and the vocabulary score was significant and of substantial magnitude, with r = 0.54, p < 0.05. However, not all three components of scientific reasoning were significantly correlated with language skills (see Table 2). The correlation between nature of science understanding and language skills was nonsignificant (r = 0.11, p = 0.40), as were the correlations between data interpretation and experimentation skills (r = 0.15, p = 0.20) and between data interpretation and nature of science understanding (r = 0.02, p = 0.80).
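As a rough check on how such coefficients map onto significance, a Pearson correlation can be converted into a t statistic with n − 2 degrees of freedom. The sketch below applies this standard conversion to coefficients reported in this paper; it assumes that all 69 children entered each correlation, which the text does not state explicitly, and the function name `t_from_r` is hypothetical.

```python
from math import sqrt


def t_from_r(r, n):
    """t statistic (df = n - 2) for testing a Pearson r against zero."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


N = 69  # assumption: all participants contributed to each correlation

reported = [
    ("SC-SKI total and vocabulary", 0.54),
    ("experimentation and NoS", 0.25),
    ("NoS and vocabulary", 0.11),
]
for label, r in reported:
    print(f"{label}: r = {r:.2f}, t({N - 2}) = {t_from_r(r, N):.2f}")
```

With 67 degrees of freedom, the two-tailed critical value at the 0.05 level is roughly 2.0, so under this assumption the first two coefficients clear the threshold and the third does not.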

Table 2 Correlation between Scientific Reasoning Components and Language Skills

4 Discussion

What are the psychometric properties of the SC-SKI, a 10-item scientific reasoning test for Mandarin-speaking children in early elementary school? This was the main question of the present study, which found that the SC-SKI shows good reliability, as well as convergent validity and the ability to detect expected performance differences between young children in an urban and a rural sample.

To investigate convergent validity, we studied the association between children’s performance on the SC-SKI and language skills, which are two constructs that have been firmly associated across many studies (Koerber et al., 2017; Mayer et al., 2014; Osterhaus et al., 2017; van de Sande et al., 2019; van der Graaf et al., 2018). In line with earlier findings with 6-year-olds that documented correlation coefficients between 0.36 (Koerber & Osterhaus, 2021) and 0.41 (Koerber & Osterhaus, 2019), we found a correlation of 0.54 between the SC-SKI and a Mandarin vocabulary test. Finding this well expected association, which points to the close association between scientific reasoning and general (verbal) reasoning, as well as to the need for children to master science-specific vocabulary, is evidence of the convergent validity of the SC-SKI.

To investigate whether the SC-SKI is able to detect meaningful and expected performance differences between young children from an urban and a rural school, we compared the performance of these two samples. Because of the disparities in school performance between urban and rural areas in China (e.g., Zhang, 2017), we reasoned that the children from the urban school should perform better than children from the rural area if the instrument was valid. This was indeed the case: Children from the urban area performed significantly better than children from the rural area, whose performance did not exceed chance level. This finding supports the usefulness of the SC-SKI and its ability to detect meaningful individual differences, and at the same time, it shows the importance of fostering scientific reasoning skills from early on, which is likely to happen more frequently in the urban than in the rural school.

The present study also identified some areas for improvement of the SC-SKI. In particular, item discrimination indices were low for two data interpretation and two NoS items, suggesting that some of the items included in the SC-SKI need to be improved. It is worth noting, however, that item discrimination was good overall, in particular for items assessing children’s experimentation skills. A potential explanation for the poor discrimination of some of the data interpretation items, which assessed children’s ability to recognize confounded data and to understand that no conclusions can be drawn from this type of data, lies in their relative difficulty: Especially in the rural sample, only a few children (< 10%) solved these items correctly. This finding is in line with prior research showing that this particular aspect of data interpretation is rather challenging for young children (Osterhaus et al., 2020) and substantially more difficult than the interpretation of simple and conclusive patterns of covariation data (see Koerber et al., 2005; Piekny & Maehler, 2013).

Although the reliability of the SC-SKI was good, significant correlations did not emerge between all scientific reasoning subcomponents (i.e., experimentation and data interpretation skills, NoS). While we found a significant correlation between experimentation skills and NoS (0.25), children’s data interpretation skills were uncorrelated with any other scientific reasoning subcomponent. The significant association between experimentation and NoS confirms earlier findings from studies with German elementary school children (Osterhaus et al., 2017), which showed a substantial association between these constructs. The finding that children’s data interpretation skills were uncorrelated with any other scientific reasoning subcomponent can best be explained by the floor effect in data interpretation skills: Many children struggled with these items, and the rural sample in particular performed poorly. Future studies that address the relation between subcomponents of scientific reasoning should therefore include children from a broader ability spectrum, assessing the scientific reasoning skills of older children who have already developed more advanced data interpretation skills.

For educators, these findings have several implications: First, when educators want to foster children’s scientific reasoning skills or make use of them in the classroom (e.g., in inquiry-based learning), they need to make sure that they select the subcomponents that are appropriate given the children’s ability level. In particular, data interpretation skills (understanding that confounded data patterns make it impossible to draw valid inferences) seem hard to acquire for young children, and hence elementary school teachers should focus on experimentation when they want to engage young children (i.e., first or second grade students) in scientific reasoning activities. Second, finding that subcomponents do not fully cohere suggests that skills should not be fostered in isolation, but educators should highlight what these different aspects of scientific reasoning have in common so that it will be easier for children to transfer their skills from one subcomponent to the other.

In line with previous work (Koerber & Osterhaus, 2019; Koerber et al., 2015), we did not observe any gender differences in scientific reasoning performance. Some researchers (e.g., Lazonder et al., 2020) have observed such differences in older elementary school children. It may well be that gender differences, if they exist, emerge later in development, once children are confronted more with stereotypical images of scientists (who may often be depicted as males), which may result in a stronger identification with science topics in boys than in girls. Future research should address this question, and cross-cultural studies may be particularly helpful in this respect, as cultures differ in how they portray science and scientists. The availability of a validated measure of scientific reasoning, such as the SC-SKI, is a necessary prerequisite for this type of research.

There are two shortcomings of the present study: First, we studied a relatively small sample of children. Future work must study larger and more representative groups of children, including those from different provinces and age groups. Second, our urban and rural samples were not equal in size. Future work should draw larger samples from rural schools and address the specific learning histories of the children in both urban and rural schools to account for variation between classrooms. The groundwork for such research is laid here, and the availability of the SC-SKI can be expected to promote such research.

5 Conclusion

The shortened Chinese version of the Science-K(indergarten) Inventory (SC-SKI) is overall a reliable and valid instrument to measure the scientific reasoning skills of Chinese 6- and 7-year-olds. Although further item improvements are necessary for some of the items, the SC-SKI is a valuable instrument and point of departure to foster cross-cultural research on young children’s scientific reasoning, which is an important twenty-first century skill.