
1 Introduction

Unlike traditional unidimensional science standards, the Next Generation Science Standards (NGSS; NGSS Lead States, 2013) emphasize three distinct dimensions: Disciplinary Core Ideas (DCIs), Science and Engineering Practices (SEPs), and Crosscutting Concepts (CCCs). These dimensions are combined to form performance expectations that reflect the inherent complexity of scientific understanding and reasoning. The complexity of the standards and the new task types they require pose significant challenges for psychometric modeling (Gorin & Mislevy, 2013).

The explicit multidimensionality of the construct defined by the NGSS affects the choice of measurement models for an NGSS assessment. In addition, performance tasks designed to measure the NGSS elicit responses that are more closely aligned with the targeted reasoning and higher-order cognitive skills. These tasks often include contextualized, multidimensional items intended to measure real-world problem-solving skills, which may violate the assumptions of traditional psychometric models (Martineau, 2017). The psychometric challenges introduced by the NGSS therefore require appropriate models to assess the dimensionality of the assessment and to estimate item and person parameters.

The goal of this study is to identify an appropriate measurement model for an NGSS-aligned state summative science assessment. The assessment was recently created to align with the state’s college- and career-ready science standards, which are designed around the NGSS’ three-dimensional science learning. Because of the multidimensional nature of the assessment, the study investigated which measurement model is best supported by learning theories, captures the patterns in the data, and is feasible to use in an operational setting. The following sections describe the science assessment and its pilot administration, the dimensionality analyses and results, and a discussion of the findings.

2 Science Pilot Overview

This study was conducted with data from a pilot test of a new state science assessment administered in Grades 5 and 8 in Spring 2019. The assessment is based on performance tasks, which are phenomena-based scenarios with multiple items designed to elicit responses that show students’ understanding of the DCIs, SEPs, and CCCs. Each item measures at least two of the three dimensions. A variety of technology-enhanced item types are used that allow students to show their thinking more fully. For example, the drag-and-drop item type requires students to drag and drop response options into groups; within each group, students can rank the options by dragging and dropping them into place.

Each grade-level pilot test had two test forms (Form A and Form B), each consisting of two tasks with several items per task. The two Grade 5 forms had 11 and 14 items, respectively, and the two Grade 8 forms had 17 and 18 items, respectively. All items were scored dichotomously. The pilot test was intentionally short to reduce the time students spent away from the classroom.

The student sample for this study was a convenience sample based on schools’ availability and willingness to participate. Table 1 presents the total number of students who took the test by grade and form. The student sample’s demographic information (including sex and ethnicity) presented in Table 2 suggests that the sample had demographic characteristics similar to the state’s general student population at these two grade levels. The differences in percentages between the sample and the general population are all smaller than 5%. In addition, because the two forms at each grade were randomly administered to students within the same school, students were comparable across the forms in terms of their demographics.

Table 1 Pilot sample
Table 2 Demographic information: Pilot sample vs. general population of the state

3 Dimensionality Analysis

3.1 Description of Four Datasets and Three IRT Models

Four datasets were used in the analyses, one for each form and grade. Table 3 provides the number of students who took the form, the number of tasks and items, and the total score points for each form.

Table 3 Study datasets

Three IRT models based on the content specifications were fit to the data to compare model fit and investigate the dimensionality of the assessment: 1) a unidimensional IRT model, 2) a three-dimensional IRT model, and 3) a testlet model. Figure 1 provides a graphic illustration of each model. All analyses were conducted using the R mirt package (Chalmers, 2012).

Fig. 1 Graphic illustrations of IRT Models 1, 2, and 3

3.2 Unidimensional IRT Model (Model 1)

First, unidimensional models were fit to the data. Three unidimensional models were examined to determine the best fit: the Rasch one-parameter logistic (1PL; Rasch, 1960), two-parameter logistic (2PL; Birnbaum, 1968), and three-parameter logistic (3PL; Lord, 1980) models. The equations for the three models are presented below.

$$ P\left(U_{ij}=1 \mid \theta_j, b_i\right)=\frac{e^{\theta_j-b_i}}{1+e^{\theta_j-b_i}} $$
(1PL)
$$ P\left(U_{ij}=1 \mid \theta_j, a_i, b_i\right)=\frac{e^{a_i\left(\theta_j-b_i\right)}}{1+e^{a_i\left(\theta_j-b_i\right)}} $$
(2PL)
$$ P\left(U_{ij}=1 \mid \theta_j, a_i, b_i, c_i\right)=c_i+\left(1-c_i\right)\frac{e^{a_i\left(\theta_j-b_i\right)}}{1+e^{a_i\left(\theta_j-b_i\right)}} $$
(3PL)

where $\theta_j$ is the ability of person $j$, and $b_i$, $a_i$, and $c_i$ are the difficulty, discrimination, and guessing parameters of item $i$, respectively.
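To make the three item response functions concrete, the sketch below implements them directly in R; the parameter values in the usage example are illustrative only and are not estimates from the pilot data.

```r
# Item response functions for the 1PL, 2PL, and 3PL models.
# theta: person ability; b: difficulty; a: discrimination; c: guessing.
p_1pl <- function(theta, b) {
  exp(theta - b) / (1 + exp(theta - b))
}

p_2pl <- function(theta, a, b) {
  exp(a * (theta - b)) / (1 + exp(a * (theta - b)))
}

p_3pl <- function(theta, a, b, c) {
  c + (1 - c) * p_2pl(theta, a, b)
}

# Illustrative values: an average student (theta = 0) on an item with
# b = 0.5, a = 1.2, c = 0.2.
p_1pl(0, 0.5)            # ~0.38
p_2pl(0, 1.2, 0.5)       # ~0.35
p_3pl(0, 1.2, 0.5, 0.2)  # ~0.48
```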

To evaluate model fit, Akaike’s Information Criterion (AIC; Akaike, 1973) and the Bayesian Information Criterion (BIC; Schwarz, 1978) were consulted; the better-fitting model is the one with the lower AIC or BIC value. BIC penalizes model complexity more heavily than AIC, which can lead the two criteria to prefer different models. Table 4 presents the model-fit results for the Rasch, 2PL, and 3PL models for each test form, with the lowest AIC and BIC values for each dataset bolded. Although the 3PL model fit best for two of the four forms as indicated by the lowest AIC and BIC values, it failed to converge for Grade 8 Form B, and the BIC value indicated that the 2PL model fit the Grade 5 Form A data better than the 3PL model. Lack of convergence is an indication that the data do not fit the model well because there are too many poorly fitting observations. The 2PL model generally fit much better than the 1PL model and, although it fit slightly worse than the 3PL model in some cases, it did not have the 3PL model’s convergence problem. The 2PL model was therefore selected as Model 1 for the study analyses.

Table 4 Model-fit comparison between unidimensional 1PL, 2PL, and 3PL models
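The following sketch shows how this comparison can be run with the mirt package; `resp` is a placeholder for one form’s dichotomous response matrix (students in rows, items in columns), and the object names are illustrative.

```r
library(mirt)

# Fit the three unidimensional models to one form's response matrix.
fit_1pl <- mirt(resp, model = 1, itemtype = "Rasch", verbose = FALSE)
fit_2pl <- mirt(resp, model = 1, itemtype = "2PL",   verbose = FALSE)
fit_3pl <- mirt(resp, model = 1, itemtype = "3PL",   verbose = FALSE)

# Compare information criteria; lower AIC/BIC indicates better relative fit.
data.frame(
  model = c("1PL", "2PL", "3PL"),
  AIC = c(extract.mirt(fit_1pl, "AIC"), extract.mirt(fit_2pl, "AIC"),
          extract.mirt(fit_3pl, "AIC")),
  BIC = c(extract.mirt(fit_1pl, "BIC"), extract.mirt(fit_2pl, "BIC"),
          extract.mirt(fit_3pl, "BIC"))
)
```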

3.3 Three-Dimensional IRT Model (Model 2)

Second, a three-dimensional IRT model (Model 2) was fit to the data. This model specifies DCIs, SEPs, and CCCs as the underlying dimensions and is the multidimensional extension of the 2PL model (Reckase, 2009). The form of the model is given by

$$ P\left(U_{ij}=1 \mid \boldsymbol{\theta}_j, \boldsymbol{a}_i, d_i\right)=\frac{e^{\boldsymbol{a}_i \boldsymbol{\theta}_j^{\prime}+d_i}}{1+e^{\boldsymbol{a}_i \boldsymbol{\theta}_j^{\prime}+d_i}} $$

where $\boldsymbol{a}_i$ is a $1 \times m$ vector of item discrimination parameters and $\boldsymbol{\theta}_j$ is a $1 \times m$ vector of person coordinates, with $m$ indicating the number of dimensions in the coordinate space (here, $m = 3$). The intercept term, $d_i$, is a scalar. The exponent of $e$ in this model can be expanded to show how the elements of the $\boldsymbol{a}_i$ and $\boldsymbol{\theta}_j$ vectors interact:

$$ \boldsymbol{a}_i \boldsymbol{\theta}_j^{\prime}+d_i=a_{i1}\theta_{j1}+a_{i2}\theta_{j2}+\dots +a_{im}\theta_{jm}+d_i $$

The latent traits in this three-dimensional model were allowed to correlate because students’ abilities in these dimensions are expected to be related to some extent. The empirical results also suggested that the model fit the data better when the latent traits were correlated than when they were constrained to be independent.
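A minimal sketch of Model 2 in mirt is shown below. The item-to-dimension assignments (“1-14”) are placeholders; in the study, each item loads only on the DCI, SEP, and CCC dimensions it was written to measure according to the content specifications.

```r
library(mirt)

# Three correlated dimensions; the COV line frees the latent-trait correlations.
spec <- mirt.model("
  DCI = 1-14
  SEP = 1-14
  CCC = 1-14
  COV = DCI*SEP, DCI*CCC, SEP*CCC
")

# Quasi-Monte Carlo EM estimation is one option for three or more dimensions.
fit_3d <- mirt(resp, model = spec, itemtype = "2PL", method = "QMCEM",
               verbose = FALSE)

coef(fit_3d, simplify = TRUE)$cov  # estimated covariances among DCI, SEP, CCC
```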

3.4 Testlet Model (Model 3)

Third, a 2PL testlet model (Bradlow et al., 1999) was fit to the data. Because the pilot test is composed of testlet-based items, which may violate the local independence assumption of IRT models, the testlet model was used to examine the testlet effect. The testlet model assumes a single primary dimension (i.e., general knowledge and abilities in science) plus several specific dimensions corresponding to the testlets (i.e., tasks), which are uncorrelated with one another after accounting for the primary dimension. In a testlet model, an item’s slope on its specific dimension is constrained to equal its slope on the general dimension (Cai, 2010). The 2PL testlet model is given as

$$ P_j\left(\theta_i\right)=\frac{1}{1+e^{-a_j\left(\theta_i-b_j-\gamma_{id(j)}\right)}} $$

where $P_j(\theta_i)$ is the probability of a correct response to item $j$ for examinee $i$, $\theta_i$ is examinee $i$’s latent ability, $a_j$ and $b_j$ are the item discrimination and difficulty parameters, and $\gamma_{id(j)}$ is a person-specific effect for testlet $d(j)$, the testlet containing item $j$, assumed to follow a normal distribution $N\left(0, \sigma^2_{\gamma_{d(j)}}\right)$.
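One way to fit the 2PL testlet model in mirt is as a bifactor model with each item’s specific-factor slope constrained to equal its general-factor slope, as described above (Cai, 2010). The sketch below assumes a 14-item form with two seven-item tasks; the task membership vector and object names are placeholders.

```r
library(mirt)

# Which task (testlet) each item belongs to; placeholder assignment.
task <- c(rep(1, 7), rep(2, 7))

# Parameter table of the unconstrained bifactor model, used to locate each
# item's general slope (a1) and its specific-factor slope (a2 or a3).
pars <- bfactor(resp, model = task, pars = "values")

# Build one equality constraint per item: specific slope = general slope.
testlet_constraints <- lapply(seq_along(task), function(i) {
  rows <- pars[pars$item == colnames(resp)[i] &
                 pars$name %in% c("a1", paste0("a", task[i] + 1)), ]
  rows$parnum
})

fit_testlet <- bfactor(resp, model = task, constrain = testlet_constraints,
                       verbose = FALSE)
coef(fit_testlet, simplify = TRUE)$items
```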

3.5 IRT Model-Fit Comparisons

Model fit was then compared among Models 1, 2, and 3, each applied to the four datasets. Table 5 presents the model-fit comparison results, with the lowest AIC and BIC statistics for each dataset bolded. With the exception of the BIC statistic for Grade 5 Form A, all AIC and BIC statistics suggest that Model 2 fits best. Overall, Model 2 (the three-dimensional IRT model) provides the best fit across all four datasets.

Table 5 Model-fit comparison between Models 1, 2, and 3
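Assuming the fitted objects from the earlier sketches (fit_2pl, fit_3d, and fit_testlet), the cross-model comparison summarized in Table 5 can be assembled for one form along the following lines.

```r
library(mirt)

# Information criteria for Models 1-3 on one form; lower values indicate
# better relative fit.
data.frame(
  model = c("Model 1: unidimensional 2PL",
            "Model 2: three-dimensional 2PL",
            "Model 3: 2PL testlet"),
  AIC = c(extract.mirt(fit_2pl, "AIC"),
          extract.mirt(fit_3d, "AIC"),
          extract.mirt(fit_testlet, "AIC")),
  BIC = c(extract.mirt(fit_2pl, "BIC"),
          extract.mirt(fit_3d, "BIC"),
          extract.mirt(fit_testlet, "BIC"))
)
```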

3.6 Item Fit Statistics

Overall, the three-dimensional IRT model (Model 2) fit the data better than the other two models. To further examine the fit of the three-dimensional model, the chi-square-based item-level fit index (S-X²; Orlando & Thissen, 2000, 2003) was evaluated to determine whether the model fits the data well at the individual item level. Item fit statistics from the unidimensional 2PL model were used as a baseline for comparison. The results of these item-level goodness-of-fit tests suggest that more items show poor fit (i.e., p-value < 0.05) under the three-dimensional model than under the unidimensional model. For example, four items on Grade 8 Form B showed poor fit to the unidimensional model, whereas under the three-dimensional model these four items and five additional items showed poor fit. Similar patterns were found for the other forms.

All four items that did not fit the unidimensional model well were technology-enhanced items requiring students to enter a short response scored as either correct or incorrect. It is possible that students rely on different abilities to respond to these items than to the multiple-choice items. A closer review of the items by content experts is needed to identify the potential causes of the item misfit.
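The item-level flags can be obtained from mirt’s itemfit function, as sketched below for the unidimensional and three-dimensional fits from the earlier sketches; the p.S_X2 column name reflects the package’s output and may differ across versions.

```r
library(mirt)

# S-X2 item fit under the unidimensional 2PL and three-dimensional models.
fit_stats_uni <- itemfit(fit_2pl, fit_stats = "S_X2")
fit_stats_3d  <- itemfit(fit_3d,  fit_stats = "S_X2")

# Items flagged for poor fit (S-X2 p-value below .05) under each model.
fit_stats_uni[fit_stats_uni$p.S_X2 < 0.05, ]
fit_stats_3d[fit_stats_3d$p.S_X2 < 0.05, ]
```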

3.7 Local Dependency Among Items Within a Task

Although the testlet model fit slightly better than the unidimensional model, the extent to which the local independence assumption was violated was examined using Yen’s Q3 index, a widely used local dependence statistic. Residual correlations greater than 0.20 indicate a degree of local dependence that test developers should examine (Chen & Thissen, 1997). Among the 435 item pairs across forms, only two pairs had a residual correlation greater than 0.20, suggesting that local item independence generally holds for all forms.
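A sketch of the Q3 check for one fitted model is shown below; residual correlations above 0.20 would be flagged for review.

```r
library(mirt)

# Yen's Q3 residual correlation matrix for the unidimensional 2PL fit.
q3 <- residuals(fit_2pl, type = "Q3")

# Largest residual correlation and any item pairs exceeding the 0.20 threshold.
max(abs(q3[lower.tri(q3)]))
which(abs(q3) > 0.20 & lower.tri(q3), arr.ind = TRUE)
```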

4 Discussion

In general, based on the pilot test data, the model-fit statistics suggest that the three-dimensional IRT model aligned with the DCI, SEP, and CCC dimensions (Model 2) provides slightly better overall fit than the unidimensional model (Model 1) and the testlet model (Model 3). However, the three-dimensional model’s fit at the item level is poorer than that of the unidimensional model. Another issue to consider is that the NGSS dimensions may not be conceptualized in the same manner that test-score dimensionality has traditionally been conceptualized, which may create confusion (Martineau, 2017). The “dimensionality” referred to in the NGSS may be better described as “complex” performance (Dunbar et al., 1991), which involves knowledge and skills across a number of domains or subjects.

Local independence is a fundamental assumption of unidimensional models, and fitting a unidimensional model in the presence of local dependencies may result in biased item parameters and standard errors of measurement (Yen, 1993). The American Institutes for Research (AIR) applied a Rasch testlet model (Wang & Wilson, 2005) to calibrate NGSS-aligned science assessments for multiple states (Rijmen, 2018). For the new science assessment used in this study, however, the local independence assumption generally holds, and the testlet model provides only slightly better fit than the unidimensional model.

It is important to note that the data used in this study were collected from a pilot test, so the quality of some items may be low; such items may affect the model-fit results. Students’ low motivation on the pilot test may also have affected data quality, and the relatively short test length compared to a regular state assessment limited the number of items administered for each dimension. All of these factors may cause the structure of the pilot data to differ from the structure of data from the operational assessment. It will therefore be worthwhile to repeat the dimensionality analysis with operational test data to identify the most appropriate measurement model for the assessment.

Unidimensional IRT models are widely used in testing programs. In contrast, multidimensional IRT (MIRT) models are rarely implemented in state testing programs because of their complexity: they require large sample sizes to obtain accurate parameter estimates and much longer estimation times, both of which pose challenges in an operational setting. The sample size of the operational test will be much larger than that of this pilot study, and applying a multidimensional model will substantially increase computation time. Implementing MIRT models operationally will likely be a new practice for most vendors working with states, so the need for more complex measurement models requires further evaluation. In future studies, data from the operational test will be used to further evaluate the need for MIRT models and to examine the robustness of the unidimensional model under various test conditions.