UvA-DARE (Digital Academic Repository) Contributions to ... · administered according to a branched...

UvA-DARE is a service provided by the library of the University of Amsterdam (http://dare.uva.nl)

UvA-DARE (Digital Academic Repository)

Contributions to latent variable modeling in educational measurement

Zwitser, R.J.

Citation for published version (APA): Zwitser, R. J. (2015). Contributions to latent variable modeling in educational measurement.

General rights: It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons).

Disclaimer/Complaints regulations: If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.

Download date: 10 Nov 2020




References published chapters

Chapters 2, 3, and 4 have been submitted or accepted for publication as, respectively:

Zwitser, R.J. & Maris, G. (in press). Conditional Statistical Inference with Multistage Testing Designs. Psychometrika.

Zwitser, R.J. & Maris, G. (conditionally accepted). Ordering Individuals with Sum Scores: the Introduction of the Nonparametric Rasch Model. Psychometrika.

Zwitser, R.J., Glaser, S.S.F. & Maris, G. (submitted). Monitoring Countries in a Changing World. A New Look at DIF in International Surveys.

Interest of co-authors:

• G. Maris is the supervisor of this thesis

• S.S.F. Glaser is a master’s student who has contributed to the analyses described in Section 4.4.3


Summary of ‘Contributions to Latent Variable Modeling in Educational Measurement’

One of the prominent questions in educational measurement is how to summarise scored item responses into a final score, such that the final score reflects the construct that is supposed to be measured. In answering this question, latent variable models play an important role. This thesis considers a couple of questions regarding the use of latent variable models in scoring item responses.

Chapter 1 is an introduction to the rest of the thesis. It is explained that there are different views on what a construct is. Furthermore, this chapter introduces some general terms and theory, after which a broad overview of the thesis is given.

The core of the thesis consists of Chapters 2 to 4, where three distinct research projects are described. The first project, described in Chapter 2, is about conditional likelihood inference from multistage testing designs. In adaptive testing, the scoring of individual test takers is usually done via estimates of person parameters. To obtain unbiased estimates, the item parameter estimates must also be unbiased. This chapter shows how to obtain item parameter estimates from multistage testing designs based on the conditional likelihood method. Besides this technical result, some more general issues related to adaptive testing, item parameter estimation, and model fit are discussed. It is explained that simple measurement models are more likely to fit data obtained from adaptive designs than data obtained from linear designs. This is illustrated with simulated data, as well as with real data taken from the Dutch Entreetoets.
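The conditional likelihood idea can be illustrated for the simplest case only: a complete (linear) design under the Rasch model. The sketch below is not the multistage method of Chapter 2, which additionally has to account for the routing design. The function names and the item difficulties are hypothetical and chosen for illustration.

```python
import math
from itertools import product


def elementary_symmetric(eps):
    """Return [gamma_0, ..., gamma_k], the elementary symmetric functions of eps."""
    gamma = [1.0]
    for e in eps:
        new = gamma + [0.0]
        # Adding one item: gamma_s(new) = gamma_s(old) + e * gamma_{s-1}(old)
        for s in range(1, len(new)):
            new[s] += e * gamma[s - 1]
        gamma = new
    return gamma


def conditional_prob(x, delta):
    """P(X = x | X_+ = sum(x)) under the Rasch model; the person parameter has cancelled."""
    eps = [math.exp(-d) for d in delta]
    gamma = elementary_symmetric(eps)
    numerator = math.prod(e for e, xi in zip(eps, x) if xi == 1)
    return numerator / gamma[sum(x)]


def conditional_loglik(data, delta):
    """Conditional log-likelihood of the item difficulties for a complete design."""
    return sum(math.log(conditional_prob(x, delta)) for x in data)


# Hypothetical difficulties for three items (illustration only).
delta = [-0.5, 0.0, 1.0]
patterns_with_score_2 = [x for x in product([0, 1], repeat=3) if sum(x) == 2]
total = sum(conditional_prob(x, delta) for x in patterns_with_score_2)
print(total)  # probabilities within one score group should sum to 1, up to rounding
```

Because the person parameter does not appear in `conditional_prob`, maximising `conditional_loglik` over `delta` would estimate the item parameters without any assumption about the ability distribution, which is the appeal of the conditional approach in this simplified setting.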

In Chapter 3, the item response theory (IRT)-based justification of the use of the sum score is considered. Two IRT models are well known for their relationship between the sum score and the person parameter: the parametric Rasch Model (RM), in which the sum score is a sufficient statistic for the person parameter, and the nonparametric Monotone Homogeneity Model (MHM), in which the latent trait is stochastically ordered by the sum score. It is illustrated that there is a theoretical gap between the two: the RM enables scoring individuals by means of the sum score, while the MHM enables ordering groups by means of the sum score. To fill the gap, the concept of ordinal sufficiency is defined, and the nonparametric Rasch Model is introduced as a less restrictive nonparametric alternative that enables ordering individuals by means of the sum score.
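The two properties contrasted in this paragraph can be written out as follows. This is a sketch in notation assumed here for illustration ($\theta$ for the person parameter, $\delta_i$ for the difficulty of item $i$, and $X_+$ for the sum score over $k$ dichotomous items); the summary itself does not fix any symbols.

```latex
% Rasch Model (RM): the likelihood factorises, so the sum score X_+ is sufficient for theta.
P(\mathbf{X}=\mathbf{x}\mid\theta)
  = \prod_{i=1}^{k}\frac{\exp\{x_i(\theta-\delta_i)\}}{1+\exp(\theta-\delta_i)}
  = \frac{\exp(x_+\theta)}{\prod_{i=1}^{k}\{1+\exp(\theta-\delta_i)\}}
    \cdot \exp\Bigl(-\sum_{i=1}^{k} x_i\delta_i\Bigr)

% Monotone Homogeneity Model (MHM): the latent trait is stochastically ordered by X_+.
P(\theta > t \mid X_+ = s) \le P(\theta > t \mid X_+ = s+1)
  \quad \text{for all } t \text{ and } 0 \le s < k
```

In the first display the data enter the $\theta$-dependent factor only through $x_+$, which is what justifies scoring an individual by the sum score. The second display only orders the distributions of $\theta$ across score groups, which is the gap the paragraph describes.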

The final project, in Chapter 4, is about differential item functioning (DIF) in international surveys. Usually, DIF is considered as a threat to validity, and as a phenomenon that hinders the comparison of performances between countries. However, in the approach described in Chapter 4, DIF is not considered as a threat, but as an interesting survey outcome reflecting qualitative differences between countries. To obtain comparable scores in a context with DIF, it is proposed not to take the person parameter estimates as a basis for comparison. Instead, it is proposed to define the construct as a market basket of items, and to take (a summary statistic of) the item responses as the basis for comparisons. Since survey data are usually incomplete, the latent variable models (probably different models in different countries) are used to describe the distribution of the item responses in the market basket. This approach is illustrated with data from the PISA cycle of 2006.
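A minimal sketch of the market-basket idea follows, under assumptions that are not taken from the thesis: each country gets its own Rasch model with country-specific item difficulties (that is, DIF is allowed), abilities are normally distributed within a country, and the summary statistic is the expected sum score on the basket. All names and numbers are hypothetical.

```python
import math


def rasch_prob(theta, delta):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def expected_basket_score(deltas, mu, sigma, n_points=201, width=6.0):
    """Expected sum score on the basket when theta ~ Normal(mu, sigma^2).

    Uses a plain equally spaced grid over mu +/- width * sigma with
    normal-density weights that are normalised to sum to one.
    """
    step = 2.0 * width * sigma / (n_points - 1)
    total, norm = 0.0, 0.0
    for k in range(n_points):
        theta = mu - width * sigma + k * step
        weight = math.exp(-0.5 * ((theta - mu) / sigma) ** 2)
        total += weight * sum(rasch_prob(theta, d) for d in deltas)
        norm += weight
    return total / norm


# Hypothetical countries: same three basket items, different difficulties (DIF)
# and different ability distributions.
score_a = expected_basket_score([-1.0, 0.0, 1.0], mu=0.0, sigma=1.0)
score_b = expected_basket_score([-0.5, -0.5, 1.0], mu=0.2, sigma=0.9)
print(score_a, score_b)
```

The comparison is made on the scale of the basket score rather than on the person parameters, so the two countries do not need to share item parameters for the scores to be comparable.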

Chapter 5 is a general discussion. Three issues related to Chapters 2 to 4 are raised. The first is about an optimal adaptive test design for high-stakes testing. It is argued that this is not a computerized adaptive test (CAT) with an infinitely large and calibrated item bank. Instead, a multistage test can lead to more efficient results. The second is about what to do when the sum score is not ordinal sufficient for the person parameter. It is argued that, especially in high-stakes testing, one should look for a coarser statistic that is ordinal sufficient. The third issue is an elaboration on the positive aspects of DIF.


Samenvatting (Dutch summary) of ‘Contributions to Latent Variable Modeling in Educational Measurement’

A prominent question in educational measurement is how the scores on individual test items should be summarised into a final score, such that the final score represents the construct, that is, the ability to be measured. Latent variable models play an important role in answering this question. This thesis considers a number of issues concerning the use of latent variable models and the determination of final scores.

Chapter 1 is an introduction. It first describes that there are different views on what a construct actually is. In addition, a number of general terms and the necessary theory are introduced. Finally, an overview of the rest of the thesis is given.

Chapters 2 to 4 form the core of the thesis. These chapters describe three separate research projects. The first project, described in Chapter 2, is about conditional likelihood¹ inference in multistage testing. In adaptive testing, scores are usually assigned via estimates of the person parameters. To obtain unbiased estimators, the item parameters must also be unbiased. This chapter shows how, in multistage testing, the item parameters can be estimated with the conditional likelihood method. Besides this technical result, a number of general themes related to adaptive testing, item parameter estimation, and model fit are discussed. It is explained that simple measurement models are likely to fit data from adaptive tests better than data from linear tests. This is illustrated both with simulated data and with data from the Dutch Entreetoets.

¹ For some words in the Dutch summary the English term is used, because these terms are also used in Dutch jargon.

Chapter 3 is about justifying the use of the sum score by means of item response theory (IRT). Two IRT models are known for the relationship between the sum score and the person parameter: the parametric Rasch Model (RM), in which the sum score is a sufficient statistic for the person parameter, and the nonparametric Monotone Homogeneity Model (MHM), in which the latent trait is stochastically ordered by the sum score. Chapter 3 argues that the RM justifies scoring individuals on the basis of the sum score, whereas the MHM justifies ordering groups on the basis of the sum score. This leaves room for a third model. To introduce this third model, the concept of ordinal sufficiency is defined first. The model that is then introduced is the nonparametric Rasch model. This is a less restrictive model than the RM, and it justifies ordering individuals on the basis of the sum score.

The final project, described in Chapter 4, is about differential item functioning (DIF) in international educational surveys. DIF is usually seen as a threat to validity and as something that complicates the comparison of the performance of countries. However, in the method described in Chapter 4, DIF is not seen as a threat but as an interesting result that reflects qualitative differences between countries. To arrive at comparable scores in a context with DIF, it is proposed not to base the comparison on the person parameters. Instead, it is proposed to define the construct as a large collection of test items (the market basket), and to base comparisons on a summary statistic of this market basket. Since survey data are usually incomplete, latent variable models (possibly different models in different countries) are used to estimate the distribution of the item scores in the market basket. This approach is illustrated with PISA data from 2006.

Chapter 5 is a general discussion. Three issues related to Chapters 2 to 4 are discussed further. The first is the question of what an optimal adaptive test is for high-stakes testing. It is argued that this is not a computerized adaptive test (CAT) with an infinitely large and calibrated item bank. Instead, a multistage test can lead to more efficient results. The second is about what to do when the sum score is not ordinal sufficient for the person parameter. It is argued that, in the case of high-stakes testing, one should perhaps look for a statistic coarser than the sum score that is ordinal sufficient. The third is an elaboration on the positive aspects of DIF.


Acknowledgements (Dankwoord)

I was able to write this thesis while I was employed at Cito. I would like to thank Cito for all the facilities provided. A number of people have made a special contribution over the past years. I would like to thank them personally here.

Gunter, I have greatly appreciated your supervision. Through your creativity, approachability, availability, stubbornness, and clear style of explaining, I have learned an enormous amount over the past five years. Thank you for all your dedication and trust.

Anton, thank you for my place at POK, for your lessons in acting pragmatically and tactically, and for your encouragement to rewrite Chapter 4 completely. In the end, this chapter has become a good deal better for it.

Bas, thank you for your involvement in a broad sense. Already during my graduation internship you, as my office mate, were curious about what I was working on. And at a later stage, when the internship research developed into Chapter 3 of this thesis, your nuanced feedback on the somewhat too unnuanced manuscript was a welcome addition.

Timo, thank you for the many moments when you were available for questions and for your feedback on various parts of the manuscript.

Han, thank you for your supervision in the final months. Thanks in part to your involvement, I managed to bring things to completion during a busy period.

Saskia, Matthieu, and Maarten, residents of the ‘prijzenkast’ (trophy cabinet), thank you for the many conversations, coffee breaks, silly jokes, and moral support. I cherish many fond memories of B5.46. I also thank my other colleagues at Cito, and in particular those at POK, for the pleasant time, the good atmosphere, and the constant willingness to think along.

Michiel and Matthieu, thank you for your friendship and involvement over the past years. I am glad that you are my paranymphs.

Mom and Dad, from childhood on you have responded positively to my curiosity. My student years involved some detours, but there too I was given the opportunity to find my way. I am happy with this thesis as the result. Thank you for all your encouragement.

Janet and Thijmen, I am glad that you are able to put work and science so well into perspective. Thank you for everything you give me.

Robert Zwitser

December 2014


Contributions to Latent Variable Modeling in Educational Measurement

Robert J. Zwitser
