1 Introduction

Tremendous technological changes are shaping our society and ways of working (Harteis et al., 2020; Ifenthaler et al., 2021; Kirschner & Stoyanov, 2020). In this context, computational thinking (CT) is regarded as a key 21st-century skill (Voogt et al., 2015; Wing, 2006; Yadav et al., 2016). The significance of CT in the 21st century may be evident (Barr et al., 2011; Buitrago Flórez et al., 2017; Wing, 2008): computing and computer technology pervade every field of study and workplace. CT aims at enabling humans to use these resources for solving problems.

In her seminal paper, Wing conceptualizes CT as “solving problems, designing systems, and understanding human behavior, by drawing on the concepts fundamental to computer science” (Wing, 2006, p. 33). As the theoretical basis for CT, the framework of Brennan and Resnick (2012) is often utilized. It comprises three dimensions (Brennan & Resnick, 2012, pp. 3–11). Computational concepts are common in many programming languages, but not restricted to them: sequences, loops, events, parallelism, conditionals, operators, and data. Computational practices are the processes in which students engage: being incremental and iterative, testing and debugging, reusing and remixing, and abstracting and modularizing. Computational perspectives are shifts in students’ points of view on themselves, their relationships to others, and the digital world around them: expressing, connecting, and questioning.

In CT research, the relationship between CT and programming is often discussed. Israel et al. (2015) regard the use of computers to model ideas and programming as an integral part of CT. Buitrago Flórez et al. (2017), as well as Lye and Koh (2014), argue that by means of programming, several core facets of CT can be addressed. Shute et al. (2017) concluded that there is a close relationship between CT and programming skills due to similar underlying cognitive processes. Hsu et al. (2018), based on their review of the literature, reported that programming is widely used to teach CT. Grover et al. (2016) maintained that programming has a positive influence on the experience of CT. Scherer et al. (2019), based on a meta-analysis, concluded that CT can be taught through programming. However, using professional programming languages like Java can be extremely difficult for students due to complex syntax, and it may be preferable to use visual programming languages (Lye & Koh, 2014; Repenning, 2017). Scratch, developed by the Massachusetts Institute of Technology Media Lab (https://scratch.mit.edu), is such a visual programming language that is heavily used as an instructional tool (Hsu et al., 2018).

The importance of CT assessment is regularly stressed (Grover & Pea, 2013; Ilic et al., 2018; Shute et al., 2017; Tang et al., 2020; Weintrop et al., 2021). However, it has to be highlighted that assessment is not an end in itself, but it should contribute to promoting student learning (Pellegrino et al., 2016). When assessing complex skills such as CT, the structure as well as the levels of the construct have to be considered (Seufert et al., 2021). For instance, the DigComp 2.1 framework that addresses digital competencies comprises five dimensions (structure), e.g., information literacy, as well as eight proficiency levels ranging from foundation to highly specialized (Carretero et al., 2017). In terms of CT, research about proficiency levels is in its infancy. The importance of modeling proficiency levels for a better understanding of this construct, however, is stressed by the 2018 International Computer and Information Literacy Study (ICILS) (Fraillon et al., 2019). Without proficiency levels, test results are difficult to comprehend and to communicate. Findings such as ‘the student belongs to the top 10% of all test takers’ or ‘the student answered 67% of the questions correctly’ are not very helpful for the purpose of fostering student learning. A proficiency level model could allow for a more vivid interpretation of the test results by relying on items with clearly specified characteristics that students are systematically able to master (AERA et al., 2014). This also makes it possible to set operationalized learning goals. For example, the goal could be that all students in a class are systematically able to use functions. Students who do not reach the corresponding proficiency level could receive specific guidance and support.

Due to the multifaceted nature of CT, it is unlikely that a single instrument is sufficient to comprehensively capture CT (Polat et al., 2021; Román-González et al., 2019). Rather, a system of various assessments may be necessary. Using multidimensional approaches could also reveal different CT profiles. For instance, some students may perform well in the realm of computational concepts, but at the same time poorly in the area of computational perspectives. By solely focusing on a specific facet of CT, students with specific strength and weakness profiles may not be identified. Knowledge about CT profiles, i.e., different types of computational thinkers, however, could be utilized for personalized guidance and support (Hofmans et al., 2020). Although such person-centered assessments are widespread in other research fields (e.g., Lohr et al., 2021; Meyer & Morin, 2016; Scherer et al., 2021; Tondeur et al., 2019), they are not common in CT research. The benefits of such techniques, however, are acknowledged in CT research (Román-González et al., 2019).

2 Theoretical Perspectives

2.1 CT Assessment Instruments

Several authors have reviewed CT assessment instruments (Kong, 2019; Román-González et al., 2019; Shute et al., 2017; Tang et al., 2020). Román-González et al. (2019) developed a useful classification of assessment instruments (Israel-Fishelson & Hershkovitz, 2022). Following this classification, diagnostic tools that aim at capturing students’ CT proficiency could be the most suitable basis for forming proficiency levels. Diagnostic tools are performance tests and do not require specific prior knowledge, e.g., of a specific programming language. Hence, they can be used to evaluate learning gains by comparing pre- and post-test results, i.e., to determine whether students have reached a higher proficiency level after instruction.

Fig. 1 Wright Map of CTt and proficiency levels

A drawback of many diagnostic tools is that they are not freely available (ICILS 2018: Fraillon et al., 2019; Fairy Assessment: Werner et al., 2012; Basic Programming Abilities: Mühling et al., 2015). Furthermore, several tools focus on the pre-secondary level (Chen et al., 2017; Kong & Wang, 2021; Relkin et al., 2021; Seiter & Foreman, 2013); therefore, the covered proficiency spectrum is limited. Among the freely available diagnostic tools, the Computational Thinking Test (CTt) may be especially suitable for the purpose of forming proficiency levels. The CTt (Román-González, 2015) is a performance test for secondary students using the framework of Brennan and Resnick (2012) as a theoretical background. It defines CT as “the ability to formulate and solve problems by relying on the fundamental concepts of computing, and using logic-syntax of programming languages: basic sequences, loops, iteration, conditionals, functions and variables” (Román-González et al., 2017, p. 681). Sample items can be found in Figs. 2 and 3. Due to the use of a visual programming language, the CTt is applicable in a broad range of instructional settings. No knowledge of a specific programming language, e.g., Java, is necessary to perform the CTt, which makes it a very flexible instrument. The CTt comprises 28 selected response items and can be taken online; the target group should be able to complete the test in less than 45 min. The CTt is claimed to be unidimensional although different cognitive operations are involved when performing the items. This claim is based on the notion of Fischer (1973) that the items of a unidimensional construct may be linearly decomposed into cognitive operations; this may also be the case for CT (Mühling et al., 2015). According to the definition of the CTt, these cognitive operations could be sequences, loops, conditionals, functions, and variables. They correspond with the computational concepts dimension of the Brennan and Resnick (2012) framework.

Fig. 2 Item 6, containing sequencing

Fig. 3 Item 8, containing sequencing and conditionals

Román-González et al. (2017) validated the CTt using a sample of 1,251 Spanish secondary students (5th to 10th grade) and classical test theory. The reliability of the test is sufficiently high (Cronbach’s alpha = 0.79). Chan et al. (2020) provided evidence for Rasch scalability of the CTt based on a sample of 153 upper-secondary students from Singapore. The CTt is increasingly used in research projects for assessing CT learning (e.g., Guggemos, 2021; Brackmann et al., 2017; Hooshyar et al., 2021; Rose et al., 2019; Zhao & Shute, 2019). In light of this, relying on the CTt may be in line with the call for using standardized instruments to ensure comparability across studies (Shute et al., 2017).

Despite its suitability for assessing computational concepts and, to some degree, computational practices, the CTt also has disadvantages: it neglects computational perspectives (Román-González et al., 2017). For capturing such perspectives, perception-attitude scales may be suitable (Román-González et al., 2019). They capture self-efficacy beliefs by means of self-assessment. In general, complementing a performance test with a self-assessment instrument may be beneficial to obtain a comprehensive picture of a construct (Rosman et al., 2015). Again, we aim at using standardized instruments and those with a specific focus on CT. This excludes computer attitude scales, e.g., Denner et al. (2014), Ericson and McKlin (2012), and Yadav et al. (2014). For the same reason, we do not consider generic self-efficacy or motivation scales. A viable option to capture computational perspectives may be the Computational Thinking Scales (CTS) (Polat et al., 2021). The CTS was developed by Korkmaz et al. (2017) and is a standardized self-assessment instrument for capturing CT (Durak & Saritepeci, 2018; Israel-Fishelson & Hershkovitz, 2022; Shute et al., 2017). The authors utilize the International Society for Technology in Education (ISTE, 2015) framework of CT, namely, the five dimensions of creativity, algorithmic thinking, cooperativity, critical thinking, and problem solving. Descriptions of these dimensions can be found in Table 1.

Table 1 CT dimensions (Korkmaz et al., 2017)

The CTS consists of 29 self-assessment questions. It has been validated by means of confirmatory factor analysis using a sample of 580 Turkish undergraduate students. Fit-values are decent (Korkmaz et al., 2017, p. 565): CFI = 0.95, RMSEA = 0.06.

Overall, the CTt and the CTS may be complementary assessment tools that can provide a comprehensive picture of students’ CT ability (Polat et al., 2021). The CTt, as a unidimensional performance test, seems to be suitable to form proficiency levels, especially because the cognitive operations are documented. The CTS, as a multidimensional self-assessment instrument, can be used to identify CT profiles.

2.2 The Present Study

The current study aims at reaching a better understanding of CT as a construct. To this end, we contribute to more useful interpretations of CT assessment findings. Concerning performance tests, the importance of proficiency levels is stressed (Fraillon et al., 2019; OECD, 2017). However, the 2018 ICILS refrained from developing a proficiency level model due to the small number of CT test items in the study (Fraillon et al., 2020); only proficiency regions were described. When forming proficiency levels, referring to the cognitive operations involved when performing the items could be advantageous from a construct validity point of view (Embretson & Daniel, 2008). For instance, it would not be meaningful to form CT proficiency levels based on the text complexity of the items. In this regard, we can take advantage of the fact that for the CTt, the cognitive operations necessary for performing the items are documented. As the cognitive operations correspond with the computational practices of the Brennan and Resnick (2012) framework, this could be a sound theoretical basis. The paper at hand demonstrates that the cognitive operations can predict the difficulty of CT test items and can be utilized to form meaningful proficiency levels (Hartig et al., 2012). Our first research question is:

RQ1 What CT proficiency levels can be identified based on the CTt?

As already highlighted, the purpose of an assessment is to facilitate student learning. In the case of CT as a multifaceted construct, it is unlikely that only one CT profile will be observed. Rather, different types of computational thinkers can be expected. For teachers, knowledge about CT profiles could be helpful in reducing complexity. Students in the same profile could benefit from similar treatment (Hofmans et al., 2020); teachers may design their instructional measures around the identified profiles. Despite their usefulness for a better understanding of CT, to our knowledge, such person-centered methods of assessment have not yet been used in CT research. The CTS may be suitable for identifying latent profiles as it comprises five dimensions. Our second research question is:

RQ2 What CT profiles can be identified based on the CTS?

The CTt and CTS are regarded as complementary instruments that can offer a comprehensive picture of student CT (Polat et al., 2021; Román-González et al., 2019). Against this backdrop, we consider a third research question:

RQ3 How are the CTt and CTS results related to each other?

Since Polat et al. (2021) recently addressed the same research question, we will compare their findings with those of the present study.

3 Method

3.1 Sample and Adaption of Instruments

Our sample comprises 202 upper-secondary students from German-speaking Switzerland. They all attended the 11th (second to last) grade at a Kantonsschule (high school), which is the most demanding school type in Switzerland. Data were collected at the beginning of the school year 2018/19. The CTt, CTS, and context questions were administered using Unipark. Teachers supervised the students and ensured an adequate test environment, e.g., by preventing students from copying from their neighbors. The intended test time was 90 min. Ninety-five percent of the students were able to finish the test within this time; teachers allowed every student to complete the work. On average, the students were 17.23 years old (SD = 0.85 years) and 56% were female. They had experienced, on average, 2.89 h (SD = 1.20 h) of computer science instruction in the past; students reported TigerJython (https://www.tigerjython.ch/engl/) as the most commonly used learning environment. TigerJython addresses important computational concepts such as sequencing, conditionals, functions, and loops. Overall, 77% of the students claim to be able to program, e.g., in Java or Python. To evaluate test motivation, we can draw on two items from the context questionnaire: ‘When performing the tasks, I disengaged’ and ‘My mind was elsewhere when I was performing the tasks’ (Prenzel et al., 1998). Cronbach’s alpha equals 0.78. On average, the students disagreed concerning a lack of test motivation: M = 3.36 and SD = 1.66, based on a seven-point rating scale. This is consistent with the absence of missing data; the omission of items can act as a proxy for a lack of test motivation (Ulitzsch et al., 2020). We also checked for multivariate outliers using Mahalanobis distances (Leys et al., 2019); no student is classified as an outlier at the 1% (or 5%) significance level.
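As an illustration of this outlier screening, the following R sketch compares Mahalanobis distances against a chi-square cutoff; the data frame name ct_scores is an assumption for illustration, not the exact code used in the study.

```r
# Hypothetical sketch of the outlier screening; 'ct_scores' is an assumed data
# frame holding the variables considered for the Mahalanobis distances.
d2 <- mahalanobis(ct_scores,
                  center = colMeans(ct_scores),
                  cov    = cov(ct_scores))

# mahalanobis() returns squared distances; under multivariate normality they
# follow a chi-square distribution with df = number of variables.
cutoff   <- qchisq(0.99, df = ncol(ct_scores))   # 1% significance level
outliers <- which(d2 > cutoff)
```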

Since the CTt was designed for 5th to 10th grade students, we replaced the five easiest items with five equivalent but more difficult ones. Equivalent means comparable in the environmental interface (canvas vs. maze), answer style, and required task. To this end, we drew on the initial pool of forty CTt items, for which experts had gauged item difficulty (Román-González, 2015). The replacement of very easy items may be advantageous as such items have no evidentiary value (Köhler & Hartig, 2017). A pre-test indicated that almost all the students within the target group (11th grade) would master the easiest five items. Figure 4 depicts an item that was integrated within our German CTt version in comparison to the original version. The item numbering in this paper always refers to our German version; the numbering allocation of the original CTt version can be found in Table 2. To adapt the CTS from English to German, we applied a back-translation approach (Maneesriwongul & Dixon, 2004).

Fig. 4 Item 28, containing sequencing, functions, and variables

Table 2 Item difficulty, fit, and DIF-effects of CTt (N = 202)

3.2 RQ1: Forming Proficiency Levels

3.2.1 Psychometric Test Validation

Before forming proficiency levels, a psychometric test validation is necessary. If the CTt were Rasch scalable, this would imply specific objectivity: students (proficiency) and items (difficulty) can be located on a common logit scale. This allows for a criterion-referenced test interpretation (Hartig & Frey, 2013). If the proficiency of a person equals the difficulty of an item (same location on the logit scale), the expected probability of a correct response will be 50%. The proficiency of students can be described by referring to items that they are expected to master with a specified probability. Specific objectivity would be violated, for instance, if some items were more difficult for males than for females.
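To make this relationship explicit, the Rasch model can be written in its standard textbook form (our own notation, not quoted from the cited sources):

```latex
P(X_{vi} = 1 \mid \theta_v, \beta_i) = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)}
```

Here, θ_v denotes the proficiency of student v and β_i the difficulty of item i, both on the logit scale. For θ_v = β_i the expected probability of a correct response equals 0.5; the larger the positive difference θ_v − β_i, the higher the probability of success.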

For assessing Rasch scalability, we draw on the framework of Bühner (2011, p. 547). First, we carry out Andersen’s likelihood ratio test (LRT) (Andersen, 1973) using the R-package ‘eRm 1.0–1’ (Mair & Hatzinger, 2007). A significant LRT would indicate that the items work differently in specific subgroups, i.e., different parameter estimates for the difficulty of the items are obtained. In order to perform the LRT, the students in the sample have to be split up into subgroups. We use the median of the CTt raw score, gender (male vs. female), age (above and below average), computer literacy (above and below average), and ability to program (yes vs. no) as split criteria (Chan et al., 2020; Guggemos et al., 2019). Computer literacy is captured by the dimension practical computer knowledge of the INCOBI-R (Richter et al., 2010), and ability to program via student self-reporting. In the case of a significant LRT, the next step is to check which items work differently in the subgroups. For instance, students who are able to program may have a systematic advantage in answering specific items (DIF-effect). Based on DIF-analyses (‘TAM 3.5–19’ package in R; Robitzsch et al., 2020), we may exclude items that systematically advantage or disadvantage specific subgroups. Test fairness (absence of DIF) is an important characteristic of an assessment instrument (AERA et al., 2014). In our case, it is of particular importance as it is the prerequisite for locating all students on one logit scale. In line with Penfield and Algina (2006, pp. 307–308), a DIF of less than 0.43 logits may be negligible, between 0.43 and 0.64 logits moderate, and above 0.64 logits large.
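A minimal sketch of this step with the ‘eRm’ package might look as follows; the object names (ctt_responses, gender) are assumptions for illustration, and the item-wise Wald tests shown here are one way to locate DIF, whereas the study itself relied on the ‘TAM’ package for the DIF analyses.

```r
library(eRm)

# Fit the dichotomous Rasch model to the 0/1-coded CTt response matrix.
rm_fit <- RM(ctt_responses)

# Andersen's likelihood ratio test with two exemplary split criteria:
# the raw-score median and gender (a vector defining two subgroups).
lrt_median <- LRtest(rm_fit, splitcr = "median")
lrt_gender <- LRtest(rm_fit, splitcr = gender)

# Item-wise Wald tests indicate which items work differently in the
# two gender subgroups, i.e., which items drive a significant LRT.
wald_gender <- Waldtest(rm_fit, splitcr = gender)

lrt_gender
wald_gender
```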

The LRT and DIF-analyses rely on pre-specified split criteria. However, there may also be latent subpopulations of individuals for which the CTt works differently or who show deviant response behavior. Such latent subpopulations can be identified with a mixed Rasch analysis (‘mixRasch’ 1.1 package in R; Willse, 2011). An example of a latent subgroup could be students who are guessing in order to solve the selected response CTt items. If the mixed Rasch analysis reveals a one-class solution, this would be evidence for the overall fairness of the CTt. To identify the optimal number of latent classes, information criteria are used. They consider goodness of fit and penalize model complexity. We compare models with 1 to 6 latent classes and select the one with the lowest Akaike’s information criterion (AIC), as recommended by Bühner (2011, p. 547).

Besides person homogeneity, unidimensionality is necessary to justify the allocation of students and all items on a common logit scale. The CTt is designed to be unidimensional. Hence, we do not assume any meaningful factors other than CT to drive students’ response behavior. To check for unidimensionality, we relied on confirmatory factor analysis (‘lavaan 0.6–7’ package in R; Rosseel, 2012). Since the data are ordinal (correct/incorrect), we applied a WLSMV-estimator (Li, 2016). A chi-square test acts as a global fit test. Furthermore, we rely on CFI, TLI, RMSEA, and SRMR as fit measures. Cut-off values for a decent fit may be: CFI and TLI > 0.95, RMSEA < 0.08, and SRMR < 0.11 (Bühner, 2011, pp. 425–427). Poor fit measures would indicate omitted factors that drive the response behavior. For example, if different answer styles were used (see Figs. 3 and 4), this could, besides CT, explain the response behavior.
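A minimal sketch of this one-factor model in lavaan could look as follows; the item column names (i01–i26) and the data frame name are assumptions, not the original variable labels.

```r
library(lavaan)

# One-factor model: all 26 retained CTt items load on a single CT factor.
item_names <- paste0("i", sprintf("%02d", 1:26))
model_1f   <- paste("CT =~", paste(item_names, collapse = " + "))

# The items are dichotomous, hence declared as ordered and estimated with WLSMV.
fit_1f <- cfa(model_1f,
              data      = ctt_data,
              ordered   = item_names,
              estimator = "WLSMV")

fitMeasures(fit_1f, c("chisq.scaled", "df.scaled", "pvalue.scaled",
                      "cfi.scaled", "tli.scaled", "rmsea.scaled", "srmr"))
```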

The linear logistic test model, as an extension of the Rasch model (Fischer, 1973), allows us to assess whether the cognitive operations involved in the CTt, e.g., sequencing, can explain a substantial proportion of item difficulty. This may be the prerequisite for forming proficiency levels based on the cognitive operations (AERA et al., 2014). A proportion of explained variance (R2) of 26% might be the minimum that justifies the use of the cognitive operations for forming proficiency levels (Hartig et al., 2012). We utilize the ‘eRm 1.0–1’ package in R to estimate the linear logistic test model.
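In the linear logistic test model, the Rasch item difficulties are linearly decomposed into contributions of the cognitive operations (standard notation following Fischer, 1973; the symbols below are our own illustration):

```latex
\beta_i = \sum_{j=1}^{J} q_{ij}\,\eta_j + c, \qquad
R^2 = \operatorname{Cor}\!\left(\beta_i,\ \hat{\beta}_i^{\,\mathrm{LLTM}}\right)^{2}
```

Here, q_ij ∈ {0, 1} indicates whether item i requires cognitive operation j (sequencing, conditionals, functions, variables), η_j is the difficulty contribution of operation j, and c is a normalization constant. R², the squared correlation between the freely estimated and the LLTM-reconstructed item difficulties, quantifies the share of difficulty variance the cognitive operations can explain (Hartig et al., 2012).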

After having checked Rasch scalability, we examine whether the items meet the cut-off values applied in the PISA studies (OECD, 2017, pp. 131–134; OECD, 2015, pp. 148–151). The deviation from the item discrimination implied by the Rasch model is evaluated by means of the weighted mean square error (wMNSQ = infit). Discrimination, along these lines, means separating students in terms of their CT proficiency. For example, if all students were able to master a certain item, this item would have no discriminatory power. The wMNSQ should lie between 0.8 and 1.2; however, wMNSQ values up to 1.33 might be acceptable (Wilson, 2005, p. 129). Items above the upper limit discriminate too weakly, whereas items below the lower limit discriminate too strongly. The point-biserial correlation is a measure of item discrimination in classical test theory and should be above 0.30. The percentage of correct answers should fall between 20 and 90%. No more than 10% missing data should be present.
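For reference, the infit statistic for a dichotomous item i can be written in its usual form (a standard definition, not quoted from the OECD reports):

```latex
\mathrm{wMNSQ}_i \;=\; \frac{\sum_{v} \left(x_{vi} - P_{vi}\right)^{2}}{\sum_{v} P_{vi}\left(1 - P_{vi}\right)}
```

where x_vi is the observed response (0/1) of student v to item i and P_vi the probability of a correct response predicted by the Rasch model. The residuals are thus weighted by the model variance P_vi(1 − P_vi), so that values close to 1 indicate model-conforming discrimination.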

3.2.2 Building Proficiency Levels

For a criterion-referenced interpretation of the CTt results, we form proficiency levels utilizing the approach in the PISA studies (OECD, 2017, pp. 276–287). To this end, we split up the continuum of CT. We choose a width of 1.0 logits for the proficiency levels and a response probability of 62%. This means that students at the bottom of a proficiency level are expected to solve items at the bottom of that level with a probability of 62%, and at the top of the level with a probability of 38%. We opted for a width of 1.0 logits for the proficiency levels, instead of 0.8 as used in the PISA studies, because the manifestation of the cognitive operations indicates a width of one logit, which is permitted (OECD, 2017, p. 281). For every proficiency level, we provide an anchor item. These items are located about 0.5 Logit below the start of the respective level on the Wright map, corresponding with a response probability of about 62%.
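The link between the chosen response probability of 62% and the placement of the anchor items follows directly from the Rasch model (our own illustration of the logic described in OECD, 2017):

```latex
P = 0.62 \;\Longleftrightarrow\; \theta_v - \beta_i = \ln\!\left(\frac{0.62}{0.38}\right) \approx 0.49 \text{ logits}
```

Hence, an anchor item whose difficulty lies about 0.5 logits below a student’s proficiency is solved by that student with a probability of about 62%. Moving one logit further up the item scale, i.e., the width of a proficiency level, reverses the sign of the difference (≈ −0.5 logits) and yields a solution probability of roughly 1/(1 + e^{0.5}) ≈ 0.38, which is the 38% mentioned above.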

3.3 RQ2: Identifying CT Profiles

3.3.1 Psychometric Test Validation

Before identifying latent profiles, we have to evaluate the psychometric properties of the CTS. Since we measured the items on a seven-point rating scale, ranging from ‘not true at all’ to ‘entirely true’, we utilize confirmatory factor analysis with an MLR-estimator (Robitzsch, 2020). To assess the overall model fit, we use CFI, TLI, RMSEA, and SRMR. Moreover, we check for convergent and discriminant validity relying on the average variances extracted and the heterotrait–monotrait ratio. An average variance extracted greater than 0.5, and a heterotrait–monotrait ratio smaller than 0.85, indicate sufficient convergent and discriminant validity (Hair et al., 2019). An average variance extracted above 0.5 implies that more than 50% of the item variance can be explained by the corresponding factor and less than 50% is error variance. A heterotrait–monotrait ratio below 0.85 indicates that the items used capture empirically distinguishable constructs. Revelle’s omega total (ω) acts as a measure of internal consistency reliability because it is superior to Cronbach’s alpha (α) (McNeish, 2018). Since α is widely used, however, we report it along with ω.
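A sketch of this validation step is shown below. The model syntax, item labels, and data frame name are assumptions for illustration, and the semTools helper functions are one possible way to obtain AVE, HTMT, and omega; the paper does not specify the exact functions used.

```r
library(lavaan)
library(semTools)

# Five-factor CFA for the CTS (three illustrative items per dimension;
# the item labels are placeholders, not the original questionnaire codes).
cts_model <- '
  creativity    =~ cr1 + cr2 + cr3
  algorithmic   =~ at1 + at2 + at3
  cooperativity =~ co1 + co2 + co3
  critical      =~ cn1 + cn2 + cn3
  problem       =~ ps1 + ps2 + ps3
'

fit_cts <- cfa(cts_model, data = cts_data, estimator = "MLR")

fitMeasures(fit_cts, c("cfi.robust", "tli.robust", "rmsea.robust", "srmr"))
reliability(fit_cts)              # returns alpha, omega, and avevar (AVE) per factor
htmt(cts_model, data = cts_data)  # heterotrait-monotrait ratios between factors
```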

Analogous to the DIF-analyses for the CTt, we have to assess the measurement invariance of the CTS, i.e., a similar meaning of the constructs among subgroups, e.g., among males and females. To check for measurement invariance, we apply the approach of van de Schoot et al. (2012). Since our aim is to build latent classes based on manifest means, we have to demonstrate full uniqueness measurement invariance. We use a likelihood ratio test to compare the unrestricted model, where all parameters are freely estimated, with a model where loadings, intercepts, and error variances across groups are restricted to be equal. In line with the DIF-analyses for the CTt, we form groups based on gender, age, computer literacy, and ability to program.
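In lavaan, such a comparison could be sketched as follows, reusing the hypothetical cts_model and data from above; the grouping variable name is an assumption.

```r
# Unrestricted (configural) model: all parameters estimated freely per group.
fit_free <- cfa(cts_model, data = cts_data, estimator = "MLR",
                group = "gender")

# Full uniqueness invariance: loadings, intercepts, and residual (error)
# variances constrained to be equal across the groups.
fit_strict <- cfa(cts_model, data = cts_data, estimator = "MLR",
                  group = "gender",
                  group.equal = c("loadings", "intercepts", "residuals"))

# Likelihood ratio (chi-square difference) test of the two nested models.
lavTestLRT(fit_free, fit_strict)
```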

3.3.2 Identifying Latent Profiles

We use the ‘tidyLPA 1.0.8’ R-package in combination with Mplus 8 to identify student CT profiles by means of a latent profile analysis (LPA) (Hallquist & Wiley, 2018; Rosenberg et al., 2018). We apply an MLR-estimator (Scherer et al., 2021); missing data are not present. In light of our rather small sample size, we have to restrict variances to be equal across profiles and the covariances among the variables to be zero in order to achieve convergence (Meyer & Morin, 2016). The critical step in the LPA is to identify an appropriate number of profiles. This decision might be based on information criteria and likelihood ratio tests, as well as on conceptual deliberations (Scherer et al., 2021). Against this backdrop, we first assessed different class solutions. Following Morin and Marsh (2015) and Hofmans et al. (2020), we report the information criteria AIC, CAIC, BIC, and aBIC, as well as the bootstrap likelihood ratio test (BLRT). Since these criteria may point to different optimal numbers of profiles, we also utilize the approach of Akogul and Erisoglu (2017), in which information criteria are weighted to determine the optimal number of latent profiles (from an empirical point of view). The number of constructs in the LPA could be a reasonable maximum for the number of latent classes (Tondeur et al., 2019), i.e., if four constructs are considered in the LPA, four profiles could be the maximum. The identified solution should have a sufficiently high precision of classification, indicated by an entropy greater than 0.7. However, the entropy should not be used as a model selection criterion (Sarstedt et al., 2011). To demonstrate the robustness of the findings, we conduct a replication of the LPA with 100 bootstrap samples of 150 students from our overall sample of 202 students (Vanslambrouck et al., 2019). Besides this, the profiles should be substantially different from each other, which can be checked by means of a MANOVA (Tondeur et al., 2019). Afterwards, we evaluate whether this approach yields a meaningful solution. The profiles should be of reasonable size and show substantial shape differences, i.e., specific strength and weakness profiles that not only differ in levels but also in their pattern (Morin & Marsh, 2015).
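The following sketch shows how such an LPA could be specified with tidyLPA; the variable names are assumptions, and the Mplus backend requires a local Mplus installation.

```r
library(tidyLPA)
library(dplyr)

# Manifest means of the four retained CTS dimensions (assumed column names).
cts_means <- cts_data %>%
  select(creativity, algorithmic, cooperativity, critical)

# LPA with 1 to 5 profiles: equal variances across profiles and covariances
# fixed to zero (the most parsimonious tidyLPA model), estimated via Mplus.
lpa_fits <- estimate_profiles(cts_means,
                              n_profiles  = 1:5,
                              variances   = "equal",
                              covariances = "zero",
                              package     = "MplusAutomation")

get_fit(lpa_fits)   # AIC, BIC, sample-adjusted BIC, entropy, BLRT p-value, etc.
```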

4 Results

4.1 RQ1: CT Proficiency Levels

4.1.1 Psychometric Validity of the CTt (German Version)

Of the 28 items, the students in the sample answered on average 18.45 items correctly (SD = 5.71, median = 19, min = 6, max = 28). Concerning Rasch scalability, the LRT yielded mixed results. The LRT was not significant for the split criteria gender (χ2 = 36, df = 27, p = 0.11), age (χ2 = 16, df = 26, p = 0.94), and computer literacy (χ2 = 30, df = 26, p = 0.26). However, utilizing the median of the CTt score and ability to program as split criteria yielded significant results: χ2 = 77, df = 27, p < 0.01 and χ2 = 48, df = 25, p < 0.01, respectively. Four items caused this overall DIF-effect. Item 1 was far too easy for the students in our sample (−4.80 logits) and therefore has no discriminatory power. Item 10 may have caused problems due to a different response format. The provided answer ‘Options A and C are correct’ might have confused students: many high-achieving students selected Option A. Items 1 and 10 can be found in Appendix 1. For items 11 and 20, we could not find a reason on the content level. Moreover, their DIF-effects are only slight to moderate: 0.58 and 0.51 logits, respectively. Against this background and considering content validity in terms of alignment with the framework of Brennan and Resnick (2012), we decided to exclude items 1 and 10 from the test and retain items 11 and 20. All further analyses were carried out without items 1 and 10, i.e., with 26 items.

The mixed Rasch analysis revealed a one-class solution; the AIC is lower in comparison to any multiclass solution (e.g., AIC for one class = 4944, AIC for two classes = 4982, and AIC for three classes = 5024).

The assumption of unidimensionality (item homogeneity) of the CTt is justified. The CFA with the 26 items loading on a single factor showed a decent fit: χ2(199) = 341 (p = 0.049), CFI = 0.964, TLI = 0.961, RMSEA = 0.026 (90% CI [0.000, 0.039]), SRMR = 0.063.

The linear logistic test model indicated the following cognitive operations as predictors of CTt item difficulty: sequencing: 1.82 Logit, 95% CI [1.46, 2.18], conditionals: 0.24 Logit, 95% CI [0.09, 0.39], functions: 0.77 Logit, 95% CI [0.54, 0.99], and variables: 2.02 Logit, 95% CI [1.77, 2.02]. Overall, these characteristics can explain 62.8% of the difficulty variance of the 26 items, which is well above the minimum acceptable value of 26%.

Concerning the cut-off values from the PISA studies, in general, all items show good values. The wMNSQ lies between 0.89 and 1.15 with the exception of item 18. This item has a wMNSQ of 1.22, which is slightly above the cut-off value of 1.2, but below 1.33. All point-biserial correlations are higher than 0.30. The percentage of correct answers for all items lies between 25 and 90%. Every student fully processed the items; missing values are not present. Table 2 summarizes the item characteristics.

As the Wright map (see Fig. 1) indicates, the items are slightly too easy for the students in our sample. Nevertheless, the EAP/PV reliability equals 0.85 and the WLE reliability 0.81, which is sufficiently high for research purposes. However, if more difficult items were used, we could expect an even higher reliability.

4.1.2 Proficiency Levels

The proficiency levels are illustrated in Fig. 1. All anchor items, in bold, have negligible DIF-effects (< 0.41 Logit).

Level I (− 1.75 ≤ Logit <  − 0.75) and below: Level I is characterized by tasks using a flow structure without further elements, like conditionals or variables. Item 6 is an anchor item for this level (see Fig. 2). Of the students in the sample, 7.4% do not achieve level I; hence, they are systematically unable to perform sequencing tasks.

For younger students, it could be necessary to insert a level below level I, which contains simple flow structures. Students in our sample, however, solved tasks like the excluded item 1 (see Appendix 1) in almost all cases. Of the students in the sample, 29.7% are on level I, i.e., they are able to systematically solve sequencing tasks.

Level II (− 0.75 ≤ Logit < 0.25): Level II comprises tasks that include conditionals and/or functions. These are core elements of CT and important in many domains. For instance, they are necessary for an effective use of spreadsheet software like Microsoft Excel. Item 8 is an anchor item for this level (see Fig. 3). In comparison to item 6, the increase in difficulty may be attributed to the use of conditionals. Of the students in the sample, 22.8% are on level II.

Level III (0.25 ≤ Logit): On level III, items typically include the use of variables, which is also a core concept of CT (and programming). Item 28 is an anchor for level III (see Fig. 4). Of the students in the sample, 40.1% are on level III; hence, they are systematically able to cope with sequencing, conditionals, functions, and variables.

We cannot set a fourth level because we do not have meaningful item characteristics that justify building such a level, i.e., no suitable anchor items are available.

4.2 RQ2: CT Latent Profiles

4.2.1 Psychometric Validity of the CTS (German Version)

In our sample, the fit values of the initial version of the CTS with 29 questions indicated room for improvement: YB-χ2(340) = 657 (p < 0.001), CFI = 0.881, TLI = 0.868, RMSEA = 0.073, SRMR = 0.095. This is mainly due to cross-loadings. For instance, the first question of algorithmic thinking also loads significantly on critical thinking and creativity. Discriminant validity is not ensured. Based on a content review, we selected three items for each of the five dimensions. This approach yielded a decent fit: YB-χ2(80) = 85 (p = 0.341), CFI = 0.997, TLI = 0.996, RMSEA = 0.018 (90% CI [0.000, 0.047]), SRMR = 0.040. Convergent and discriminant validity are fulfilled. The average variance extracted is greater than 0.543 for all five constructs. The heterotrait–monotrait ratio is smaller than 0.706 for all combinations of constructs. The five dimensions are reliably measured (α and ω > 0.77). The characteristics of the refined version of the CTS can be found in Table 3; the questions can be seen in Appendix 2.

Table 3 Characteristics of used CTS items and constructs (N = 202)

Full uniqueness measurement invariance is ensured for the four considered subgroups: gender: Δχ2(105) = 121.84, p = 0.125; age: Δχ2(105) = 116.98, p = 0.200; computer literacy: Δχ2(105) = 122.92, p = 0.112; and ability to program: Δχ2(105) = 128.94, p = 0.056. Hence, it may be justified to use manifest means for the LPA.

4.2.2 Latent Profiles

The descriptive statistics in Table 3 show that, on average, the students assess their creative thinking, algorithmic thinking, cooperativity, critical thinking, and problem solving above the neutral scale mean (= 4). The highest latent correlation appears between creativity and critical thinking (ρ = 0.71, p < 0.001). However, there are also small and statistically insignificant correlations, e.g., between algorithmic thinking and cooperativity (ρ = 0.15, p = 0.079).

When identifying latent profiles, the dimension problem solving was problematic. The likely reason is the reverse coding of the corresponding items; see Appendix 2. Including this dimension in the LPA yielded spurious profiles, e.g., a profile with students who score very low in problem solving and very high in all other dimensions. Since considering this construct could bias the findings, we removed it from the further analysis. This may also be suitable from a conceptual point of view. Creativity, algorithmic thinking, critical thinking, and cooperativity might all be necessary during the course of problem solving; problem solving might be a construct of a different nature.

Table 4 depicts the information criteria, the BLRT results, and the entropies. Based on the information criteria AIC and aBIC, a five-profile solution would be optimal; CAIC, BIC, and BLRT, as well as the analytic hierarchy process of Akogul and Erisoglu (2017), point to four profiles. Moreover, the entropy for this solution is sufficiently high (0.763). A replication of the LPA with 100 bootstrap samples of 150 students lends support to a four-class solution: in 2% of the cases, three profiles are optimal; in 30% of the cases, four profiles; and in 68% of the cases, five profiles. An inspection of the five-class solutions, however, revealed problematic profiles, e.g., profiles with only two students. Since four profiles may be the maximum from a conceptual point of view and three or fewer profiles seem to be insufficient, four latent profiles could be the appropriate number.

Table 4 Information criteria, entropies and BLRT results for one to five latent profiles

The MANOVA yielded significantly different means among the four profiles: F(12, 516) = 69.90, Wilks’ Λ = 0.078, p < 0.001, η2 = 0.573. In other words, 57.3% of the variance can be explained by profile membership. These findings and Fig. 5 point to sufficiently distinct profiles.

Fig. 5 Description of the four latent profiles based on CTS (N = 202)

The identified profiles may be meaningful from a conceptual point of view:

  • Profile 1 (creative thinkers with a focus on collaboration): This profile comprises 77 (38.1%) students and is the largest one. Students in this profile perceive their creative thinking to be well above the neutral scale mean. Moreover, they also assess their cooperativity and critical thinking, on average, as above the neutral scale mean. However, their perceived algorithmic thinking skills are rather low.

  • Profile 2 (low-level computational thinkers): These 35 (17.3%) students, on average, self-assess all dimensions of CT as below the scale mean, i.e., they consider themselves as not being capable of performing sufficiently well in the four CT dimensions.

  • Profile 3 (computational thinkers with low cooperativity): This profile consists of 25 (12.4%) students, which is the smallest profile. Overall, the students self-assess their creative and critical thinking as very high. Their algorithmic thinking is, on average, slightly above the scale mean. However, these students report very low cooperativity.

  • Profile 4 (high-level computational thinkers): Sixty-five (32.2%) students belong to this profile and report high levels across all four CT dimensions.

4.3 RQ3: Relationship Between CTt and CTS Results

The latent correlations between the CTt and the five dimensions of the CTS are as follows: creativity 0.271 (p = 0.002), algorithmic thinking 0.309 (p < 0.001), cooperativity −0.003 (p = 0.956), critical thinking 0.408 (p < 0.001), and problem solving 0.154 (p = 0.085). Considering all CTS dimensions as independent variables and the CTt as the dependent variable in a latent regression, only algorithmic thinking is statistically significant (b = 0.319 logits, p < 0.001). Figure 6 presents the relationship between the three proficiency levels (CTt) and the four latent profiles (CTS). The notches in the boxplots represent the 95% confidence intervals. As can be seen, students in profiles 1 and 2 score significantly lower in the CTt than students in profiles 3 and 4. The median proficiency in profiles 1 and 2 is around the threshold of proficiency level II, and in profiles 3 and 4, around the threshold of proficiency level III.

Fig. 6 CTt proficiency in logits and proficiency levels (1, 2, 3) grouped by latent profile membership (CTS). Notches indicate 95% confidence intervals

5 Discussion, Limitations, and Outlook

5.1 Discussion

The aim of this paper was to contribute to a more vivid interpretation of CT assessment findings. To this end, we relied on a performance test (CTt) and a self-assessment instrument (CTS). Both are freely available and internationally accepted standardized instruments (Shute et al., 2017).

In terms of the CTt (RQ1), the main advantage is its reliance on the well-established framework of Brennan and Resnick (2012). Furthermore, the CTt details the cognitive operations that underlie the test items. This may be the basis for a theoretically founded interpretation of the test results. Utilizing IRT, we were able to demonstrate Rasch scalability, which implies the specific objectivity of the test. This is the basis for a criterion-referenced test interpretation. The presented proficiency level model for CT is an additional step in capturing the construct of CT: it adds a model for describing item difficulty and cognitive demand. A proficiency level model could help to communicate the test results. A statement like ‘Student A is able to solve tasks that contain sequencing and conditionals’ is more informative than ‘Student A solved 16 out of 26 items correctly’. It could also help teachers to better tailor their instruction to specific student needs. For instance, a student on proficiency level II might not be able to correctly solve items with variables, but is able to do so as regards items with conditionals and functions. This insight can be utilized for scaffolding processes. Before designing instructional measures, it may be important to evaluate students’ prior knowledge (Bransford et al., 2000). The CTt can carry out this purpose as it does not require specific prior knowledge, such as a programming language.

Concerning the CTS (RQ2), our confirmatory factor analysis indicated, in line with Korkmaz et al. (2017), that five dimensions can be identified. Due to substantial cross-loadings, we had to exclude items from the original version, which contained 29 items. However, the author of this initial version also reduced the scale to 20 items in a later study (Korkmaz & Bai, 2019). By means of an LPA, we identified four profiles.

Concerning specific strengths and weaknesses, profiles 1 and 3 are of special interest. Students in profile 1 report a weakness in algorithmic thinking. Algorithmic thinking can be regarded as the link between CT and computer science (Doleck et al., 2017) and may be a technical skill. To foster students’ algorithmic thinking skills, it may be preferable to rely on visual programming languages because professional programming languages can be regarded as difficult or boring (Lye & Koh, 2014; Repenning, 2017). Scratch could be a suitable option for this purpose. Grover et al. (2015) demonstrated the effectiveness of Scratch in fostering algorithmic thinking. Moreover, scalable game design (Repenning, 2018) could be a viable option to improve students’ algorithmic thinking. The advantage is that students start with a project instead of first being confronted with code and syntax. Students can bring in their creative and critical thinking skills and afterwards learn technical aspects, which keeps them in the zone of proximal flow (Repenning et al., 2015). Besides this, teachers could form groups where students in profile 1 collaboratively learn with students from other profiles, especially students in profile 4 who have high levels of perceived algorithmic thinking skills. Students in profile 1 can bring in their creativity and critical thinking and benefit from the other students’ high algorithmic thinking skills.

Profile 3 is remarkable because these students report low cooperativity and, at the same time, high creative and critical thinking. Colloquially, these students might be referred to as nerds. For these students it could be important to increase their cooperativity, which is regarded as a core 21st-century skill (van Laar et al., 2017). Again, Scratch could be a promising approach because it relies on a social learning paradigm and allows members of the community to learn from each other through the opportunity to share and extend projects (Jiang et al., 2021; Repenning et al., 2015; Resnick & Rusk, 2020; Shute et al., 2017). Students in profile 3 could become aware of the benefits of collaboration when working on Scratch projects. Teachers may put a special focus on students in this profile in order to integrate them into the group and facilitate teamwork.

Overall, 67.8% of the students in our sample might require specific support. Students in CT profile 1 may need help with increasing their algorithmic thinking skills. Students in profile 2 score low in all four CTS dimensions. Students in profile 3 seem to have a deficit in cooperativity. Only students in profile 4 report high values in all four CTS dimensions. Teachers could make use of these students to support fellow students with deficits in one or more CT dimensions. In general, collaboration seems to be conducive to fostering CT (Denner et al., 2014).

Concerning the relationship between the CTt and CTS results (RQ3), the correlations found in our study are well in line with the correlations reported by Polat et al. (2021), indicating that the findings might be robust across various populations. Moreover, we show that when considering all CTS dimensions as independent variables and the CTt as the dependent variable in a latent regression, only algorithmic thinking is a statistically significant predictor. Hence, the CTt may primarily be related to the algorithmic thinking dimension of the ISTE framework. This can also be seen from Figs. 5 and 6. Students in profiles 3 and 4, with the highest reported algorithmic thinking, achieve significantly higher CTt scores than students in profiles 1 and 2, and reach higher proficiency levels.

Since the CTt and CTS rely on different methods, i.e., performance test and self-assessment, it is likely that constructs of a different nature are captured. Self-assessments might capture self-efficacy beliefs (Scherer et al., 2017), which play an important role in predicting (intended) behavior (Fishbein & Ajzen, 2010). For instance, students’ perceived cooperativity may be a good predictor for their actual collaboration in CT projects. Moreover, self-assessments are very cost-efficient (Scherer et al., 2017). The CTS requires less than five minutes of test time. Concerning the identification of latent profiles, self-assessment instruments seem to be an established method (e.g., Scherer et al., 2021). Overall, self-assessments may not be inferior to performance tests.

If the purpose is to investigate the nomological net of CT, however, performance tests may be more suitable than self-assessments. Polat et al. (2021) investigated the relationship of CT with the external variables of mathematics and information technologies course achievement. The correlations are substantially higher when the CTt instead of the CTS is used. Moreover, forming proficiency levels based on self-assessments may not be a viable option.

Overall, the CTt and CTS may well complement each other. If we had only used the CTt, the remarkable CT profiles 1 and 3 would not have been detected; students in these profiles may need specific attention.

5.2 Limitations

Our study is not without limitations. A general limitation is that our sample is narrow in scope as it comprises only students from German-speaking Switzerland and from one type of school (high school). Concerning RQ1 and the use of the CTt, one disadvantage is its reliance on dichotomous selected-response items. It may not be able to capture higher-level computational concepts. The formed proficiency levels can be interpreted in a meaningful way because they are linked to previously specified characteristics, e.g., sequencing. However, for high-achieving students, items that cover more advanced computational concepts would be necessary. We will come back to this point in the outlook section below.

Concerning RQ2 and the use of the CTS, we had to exclude items from the original version due to a lack of discriminant validity. On the one hand, this exclusion contributes to the psychometric validity of the instrument and content validity may still be achieved; on the other hand, however, results based on the full version of the CTS could be difficult to compare with our refined version. Besides this, we did not consider the CTS dimension problem solving in the LPA. The reason for this was that all the items that operationalize problem solving are reversely coded; consideration of the dimension in the LPA yielded spurious profiles. Moreover, due to our rather small sample size, we had to use a parsimonious LPA model where equal variances across the profiles and covariances of zero are assumed. These assumptions are restrictive (Scherer et al., 2021). Moreover, we used manifest means as the basis for the LPA. This may be unavoidable due to our sample size but it neglects measurement error (Meyer & Morin, 2016). As a robustness check, we also used regression factor scores instead of manifest means; it yielded similar results (see Appendix 3). Against the backdrop of our cross-sectional sample of high school students, it is doubtful whether the identified four profiles can be replicated in samples with younger students and be consistent across time (Meyer & Morin, 2016). Moreover, the identification of the profiles was exploratory, which may be inherent to the LPA but not ideal from a theoretical point of view (Hofmans et al., 2020). Based on our simulation approach, however, we may conclude that four profiles could be a reasonable minimum.

5.3 Outlook

The usefulness of the CTt for assessing students with a CT proficiency equal to or higher than that of the students in our sample could be increased. To this end, it would be necessary to set further meaningful proficiency levels. In order to anchor these proficiency levels, items of greater difficulty should be constructed on a criterion basis. Such a criterion could be the CT concept of diffusion (Repenning, 2017, pp. 18–19). Diffusion goes beyond conditionals, functions, and variables. It is a kind of artificial intelligence and can, for instance, be used to move objects dynamically. As can be seen from the tasks depicted in Figs. 2 and 3, instructions are required to lead Pacman to the ghost. However, if Pacman should autonomously find the ghost, utilizing diffusion is a viable option (see Repenning, 2006). It can be claimed that the ghost has a ‘scent’ that spreads in the maze. This process is modeled by means of diffusion equations. Pacman can then use a hill-climbing approach to find the ghost: it checks the scent concentration in all four neighboring fields and moves to the field with the highest concentration. This is an efficient approach because it takes obstacles like walls into account. To test students’ understanding of this concept, they could be asked how long it will take, depending on the speed of diffusion, to detect the ghost. They may also predict the path Pacman will choose to reach the ghost or implement the hill-climbing approach using visual code blocks. We think including diffusion in the CTt is promising because it is an important concept in many domains (Repenning, 2006). Examples are the diffusion of heat (physics) or osmosis (biology).
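To illustrate the underlying mechanics, the following toy R sketch (our own illustration, not part of the CTt or of Repenning’s implementation) diffuses the ghost’s ‘scent’ over a small grid maze and lets Pacman take one hill-climbing step towards the highest neighboring concentration:

```r
# Toy grid: 0 = free cell, NA = wall; the ghost emits the scent, Pacman climbs it.
maze <- matrix(0, nrow = 5, ncol = 5)
maze[2, 2:4] <- NA                      # a wall segment
ghost  <- c(1, 5)
pacman <- c(5, 1)

scent <- matrix(0, nrow = 5, ncol = 5)
scent[ghost[1], ghost[2]] <- 1          # scent source at the ghost's position

# Discrete diffusion: repeatedly average each free cell with its free neighbors.
for (step in 1:200) {
  updated <- scent
  for (i in 1:5) for (j in 1:5) {
    if (is.na(maze[i, j])) next
    nb <- rbind(c(i - 1, j), c(i + 1, j), c(i, j - 1), c(i, j + 1))
    nb <- nb[nb[, 1] >= 1 & nb[, 1] <= 5 & nb[, 2] >= 1 & nb[, 2] <= 5, , drop = FALSE]
    nb <- nb[!is.na(maze[nb]), , drop = FALSE]
    updated[i, j] <- mean(c(scent[i, j], scent[nb]))
  }
  updated[ghost[1], ghost[2]] <- 1      # the source keeps emitting
  scent <- updated
}

# Hill climbing: Pacman moves to the free neighboring cell with the highest scent.
nb <- rbind(pacman + c(-1, 0), pacman + c(1, 0), pacman + c(0, -1), pacman + c(0, 1))
nb <- nb[nb[, 1] >= 1 & nb[, 1] <= 5 & nb[, 2] >= 1 & nb[, 2] <= 5, , drop = FALSE]
nb <- nb[!is.na(maze[nb]), , drop = FALSE]
next_cell <- nb[which.max(scent[nb]), ]
next_cell
```

Because wall cells carry no scent, the concentration gradient automatically routes Pacman around obstacles, which is the property referred to above.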

In future research, it could be promising to use performance tests to capture the five dimensions covered by the CTS and form proficiency levels. Suitable tests are available and could be adapted for CT: for creativity (Israel-Fishelson & Hershkovitz, 2022, Appendix C), for algorithmic thinking (Román-González et al., 2017), for cooperativity (Salas et al., 2017), for (complex) problem solving (Greiff et al., 2013), for collaborative problem solving (Stadler et al., 2020), and for critical thinking (Ennis, 1993). If a self-assessment instrument should be used, the computing attitudes survey (Dorn & Tew, 2015) could be a viable alternative to the CTS.

6 Conclusion

This paper contributes to a better understanding of computational thinking as a construct. Based on a proficiency level model, distinct levels of computational thinking performance can be described, i.e., characteristics of computational thinking tasks that students on a specific level are systematically able to master but which cannot be mastered by students on a lower level. We formed proficiency levels based on the Computational Thinking Test, which covers the cognitive operations (computational concepts) of sequencing, conditionals, functions, and variables. Moreover, we identified latent profiles based on the Computational Thinking Scale. Our findings indicated that 67.8% of the students may need specific guidance and support. Students in the first profile (38.1%) reported deficits in algorithmic thinking while students in the second profile (17.3%) self-assessed themselves as low in all four dimensions: creative thinking, algorithmic thinking, cooperativity, and critical thinking. The third profile (12.4%) comprises students who were reluctant to cooperate. Knowledge about the identified proficiency levels and the four distinctive computational thinking profiles could help teachers offer person-centered guidance and support to their students.