Abstract
To precisely assess the cognitive abilities and achievement of students, competence tests often combine different item types. In the National Educational Panel Study (NEPS), test instruments likewise comprise items with different response formats, mainly simple multiple-choice (MC) items, in which one of four answer options is correct, and complex multiple-choice (CMC) items, which consist of several dichotomous yes/no subtasks. The subtasks of a CMC item are usually aggregated into a polytomous variable and analyzed with a partial credit model. When developing an appropriate scaling model for the NEPS competence tests, several questions arose concerning these response formats: How should the response categories of polytomous CMC variables be scored, and how should the different item formats be weighted? To examine which aggregation of response categories and which item format weighting best model the two response formats, different aggregation and weighting procedures were analyzed on NEPS data, and their appropriateness was evaluated using item fit and test fit indices. The results suggest that a differentiated scoring without aggregating the categories of CMC items discriminates best between persons. Moreover, for the NEPS competence data, an item format weighting of one point for MC items and half a point for each CMC subtask yields the best item fit for both MC and CMC items. In this paper, we summarize important results of the research on implementing different response formats conducted in the NEPS.
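The scoring scheme the abstract reports as best fitting — a differentiated polytomous score for CMC items combined with a weighting of one point per MC item and half a point per CMC subtask — can be sketched as follows. This is an illustrative reconstruction, not NEPS code; the function names and response encoding are assumptions for the example.

```python
# Illustrative sketch (not NEPS code) of the reported scoring scheme:
# MC items score one point, CMC items keep a differentiated (polytomous)
# count of solved subtasks, weighted at half a point per subtask.

def score_mc(response: int, key: int) -> float:
    """Simple MC item: one point if the chosen option matches the key."""
    return 1.0 if response == key else 0.0

def score_cmc(responses, keys, weight_per_subtask=0.5):
    """CMC item: the dichotomous yes/no subtasks are kept as a
    differentiated count (no aggregation of categories) and each
    correct subtask contributes half a point."""
    n_correct = sum(r == k for r, k in zip(responses, keys))
    return n_correct * weight_per_subtask

# Example: one MC item answered correctly (option 2 of 4) plus a CMC
# item with four subtasks, three of them solved:
total = score_mc(2, 2) + score_cmc([1, 0, 1, 1], [1, 1, 1, 1])
# total == 1.0 + 1.5 == 2.5 weighted score points
```

In a partial credit model, such weights enter the scaling model as category scores rather than being applied to raw sum scores; the sketch only shows the arithmetic of the weighting rule itself.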
Copyright information
© 2016 Springer Fachmedien Wiesbaden
About this chapter
Cite this chapter
Haberkorn, K., Pohl, S., Carstensen, C. (2016). Scoring of Complex Multiple Choice Items in NEPS Competence Tests. In: Blossfeld, HP., von Maurice, J., Bayer, M., Skopek, J. (eds) Methodological Issues of Longitudinal Surveys. Springer VS, Wiesbaden. https://doi.org/10.1007/978-3-658-11994-2_29
DOI: https://doi.org/10.1007/978-3-658-11994-2_29
Publisher Name: Springer VS, Wiesbaden
Print ISBN: 978-3-658-11992-8
Online ISBN: 978-3-658-11994-2
eBook Packages: Education (R0)