international large-scale assessments – current results on student competencies in germany


INTERNATIONAL LARGE-SCALE ASSESSMENTS – CURRENT RESULTS ON STUDENT COMPETENCIES IN GERMANY

1 Programme for International Student Assessment 2015

2 Trends in International Mathematics and Science Study (TIMSS) 2015

2.1 TIMSS Science Assessment Framework
2.2 TIMSS 2015 International Benchmarks in Science

3 Proficiency Gains in the Longitudinal PISA 2012 – 2013 Assessment

3.1 Measurement Requirements for Assessing Competency Development
3.2 Average Proficiency Gains


INTERNATIONAL LARGE-SCALE ASSESSMENTS


With its special expertise in educational assessment and educational measurement, the IPN plays an important national role in the international large-scale assessments Programme for International Student Assessment (PISA) and Trends in International Mathematics and Science Study (TIMSS). The IPN is part of the Centre for International Student Assessment (ZIB, headed by the Technical University of Munich), which is responsible for the national project management of PISA, and of the national TIMSS consortium (headed by the Institute for Education Research and School Development (IFS) at the Technical University of Dortmund). Whereas PISA assesses 15-year-old students’ competencies in reading, mathematics, and science every three years, TIMSS focusses on fourth and eighth graders’ achievement in mathematics and science every four years. Germany, however, only participates in TIMSS with a nationally representative sample of Grade 4 students. In 2015, both studies coincided, and the eagerly awaited reports published in 2016 provided information on how student competencies changed between 2000 and 2015 (PISA) and between 2007 and 2015 (TIMSS). Researchers at the IPN were involved in the reporting of science and mathematics competencies in PISA and of science competencies in TIMSS. Furthermore, IPN researchers played a major role in the longitudinal PISA 2012 – 2013 study, in which ninth graders who were part of the German PISA 2012 sample were tested again one year later (i.e., at the end of Grade 10). The repeated measurement resulted in a data set that allows analyzing proficiency gains over a period of one year and relating them to individual, family, and classroom factors.

The following three sections present central findings of PISA 2015, TIMSS 2015, and the longitudinal PISA 2012 – 2013 study and thereby inform about the IPN’s activities in large-scale assessments.

1 Programme for International Student Assessment 2015

Introduction

The international large-scale assessment PISA was launched and is headed by the Organisation for Economic Co-operation and Development (OECD). PISA tests are developed to assess the competencies deemed necessary for fully participating in modern knowledge-based societies; success in professional life, and in life more generally, hinges on these key competencies. Each PISA survey assesses students’ competencies in reading, mathematics, and science. However, in every survey one domain is assessed in more depth as the major domain. As in PISA 2006, this was science in PISA 2015. The recent survey thus concludes the second complete PISA cycle, which started in PISA 2009 with reading and continued in PISA 2012 with mathematics as the major domain. The findings presented in this report focus on science as the major domain in PISA 2015.

Objectives

Equipping students with the competencies necessary to meet modern societies’ demands is a major objective of compulsory schooling. Educational systems are therefore subjected to critical scrutiny; reforms need to be based on reliable information. PISA assesses students’ cognitive outcomes near the end of compulsory education. It can thus provide educational systems with meaningful information on strengths, weaknesses, and possible methods of improvement. A second major function of PISA is related to the monitoring of trends. As data from an increasing number of PISA surveys becomes available, changes in average performance over time may be investigated and used to evaluate the effects of reforms within the educational system. The relevance of both functions for monitoring educational outcomes is confirmed by the still increasing number of countries participating in the PISA surveys – the tests of PISA 2015 were administered in 72 countries.

Changes Introduced in PISA 2015

For PISA 2015, the OECD decided to introduce several changes to the design of the study. Some of these changes concern the framework of the underlying construct of scientific literacy, which was revised and expanded to some extent. The most significant changes, however, are related to the administration of the tests and the analysis of the data. For the first time, the tests in PISA 2015 were completely computer-based. This change of the assessment mode from paper-and-pencil to computer allowed the use of innovative test formats (e.g., interactive tasks) and analysis procedures (see Method section). The rationale for this decision was the perceived need to react to developments and improvements, especially in the fields of computer-based testing and psychometrics, in order to ensure the quality of the PISA assessment in the future.

However, introducing such changes in an assessment like PISA inevitably has implications for the comparability of results across surveys. Whenever several changes are implemented at the same time, it is almost impossible to separate their effects. Every international large-scale assessment will thus reach points in its history where the benefits of introducing changes that improve the quality of the data, or enable a more comprehensive or detailed assessment of the underlying construct, have to be carefully balanced against ensuring comparability with historical data. In the case of PISA 2015, the decision was made that the benefits of the changes justified some restrictions with respect to the comparability of results and the analysis of trends across surveys. This has to be kept in mind, especially when interpreting the PISA 2015 trend results.

Method

The countries’ samples represent the respective full target populations of 15-year-olds enrolled in school. In Germany, N = 6 504 students from 253 schools participated in PISA 2015.


The PISA 2015 computer-based science test comprised a total of 184 items organized in 56 units (a unit consists of a stimulus and several items related to a specific science topic, such as acid rain). Twenty-six of these units with a total of 99 items had been developed for the current assessment; five of these units (24 items) were interactive and 21 units (75 items) were non-interactive. The remaining 30 units (85 items) consisted of so-called link items, which had already been used in previous PISA surveys and ensure comparability of score estimates across surveys. However, as the PISA 2015 science assessment was completely computer-based for the first time, the link items had to be transferred from a paper-based to a computer-based format. Such mode changes might affect item difficulties and discriminations. To analyze the possible effects of the mode change, the international consortium for PISA 2015 carried out a mode effect study in the context of the PISA 2015 field trial. At the international level, that is, across all countries participating in PISA 2015, no mode effect was observed. However, this does not eliminate the possibility of effects in specific countries that are at least partly related to the mode change. If, for example, computers are not regularly used in school in a given country, a computer-based test might be more difficult for students there than a paper-and-pencil test.

For Germany, the results of the field trial indicate that, on average, item difficulty increased when an item was administered on the computer (effect size Cohen’s d = −0.23, corresponding to approximately 20 points on the international PISA scale). One possible reason could indeed be differences in the extent of computer use: Results from the student questionnaire in PISA 2015 indicate that, compared with the OECD average, 15-year-old students in Germany use computers less, both within and out of school, than their peers in other countries. More detailed analyses of these effects will be presented in subsequent publications.

To ensure reasonable testing times, each individual student in PISA works on only a subset of items. Specifically, the items were allocated to 66 so-called basic booklets according to a balanced incomplete block design. Estimates (plausible values) based on item response theory were obtained for all students’ domain-specific competencies, located on the same scale regardless of which particular items each student had worked on. Compared to earlier surveys, however, PISA 2015 used a different scaling model.


A two-parameter model (2-PL) estimating item difficulties and item discrimination parameters was used instead of the traditional Rasch model (1-PL), which only provides a difficulty parameter for each item.

To relate certain score points to what students are able to do, the continuous scale is segmented into a specified number of so-called levels of proficiency. The levels of proficiency allow a substantive interpretation of students’ competencies as assessed by the PISA tests. To put it simply, the levels of proficiency describe the kinds of demands students on each level are able to meet. In science, six levels of proficiency are distinguished. In contrast to earlier surveys, the lowest level of proficiency is further differentiated in PISA 2015 into Levels Ia and Ib to allow for a better description of the abilities of low-performing students.
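The difference between the two scaling models can be illustrated with a minimal sketch. This is generic item response theory code with invented parameter values, not the operational PISA scaling procedure: with the discrimination parameter fixed at a common value, the 2-PL model reduces to the Rasch (1-PL) model.

```python
import math

def irt_prob(theta, b, a=1.0):
    """Probability of a correct response under a 2-PL model.
    theta: student proficiency, b: item difficulty, a: item discrimination.
    With a fixed a = 1.0 for all items this is the Rasch (1-PL) model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Two hypothetical items of equal difficulty but different discrimination.
# The Rasch model would treat them identically; the 2-PL model lets the
# more discriminating item separate students more sharply.
flat_item  = dict(b=0.0, a=0.5)   # weakly discriminating
steep_item = dict(b=0.0, a=2.0)   # strongly discriminating

for theta in (-1.0, 0.0, 1.0):
    p_flat = irt_prob(theta, **flat_item)
    p_steep = irt_prob(theta, **steep_item)
    print(f"theta={theta:+.1f}  P(flat)={p_flat:.2f}  P(steep)={p_steep:.2f}")
```

At the item difficulty (theta = 0) both items yield a 50 percent success probability, but away from it the strongly discriminating item distinguishes more clearly between weaker and stronger students, which is exactly the information the 2-PL model exploits.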

Results

For clarity, the results presented in this report are confined to the 35 OECD countries. On average, 15-year-old students in Germany score 509 points on the overall science scale. Germany is thus one of 18 countries whose mean science score is significantly higher than the OECD average of 493 points (see Figure 1). The highest levels of science competence are reached by 15-year-olds in Japan, Estonia, Finland, and Canada. These countries also have the greatest proportions of top-performing students (at Levels V or VI; between 12.4 % in Canada and 15.3 % in Japan) and the smallest proportions of low-performing students (below or at Level Ia; between 8.8 % in Estonia and 11.5 % in Canada). In Germany, 10.6 percent of the students are proficient at Level V or VI, as compared with 7.7 percent on OECD average. At the lower end of the scale, 17.0 percent of students in Germany are proficient below or at Level Ia. The respective proportion across OECD countries is 21.2 percent.

Gender Differences

Looking at gender differences in science performance, in PISA 2015 boys significantly outperform girls on the OECD average (with a difference of four score points) and in 15 OECD countries. For the first time, Germany is among these countries, with a difference of 10 score points. An even larger difference of 14 score points can be found in the top-performing country Japan, whereas in only four countries (Finland, Latvia, Greece, and Slovenia) do girls reach significantly higher mean values. The biggest advantage in favor of the girls can be found in Finland, with 19 score points. In the majority of OECD countries, girls are underrepresented in the group of high performers. In 26 countries, the proportion of

Schiepe-Tiska, A., Rönnebeck, S., Schöps, K., Neumann, K., Schmidtner, S., Parchmann, I., & Prenzel, M. (2016). Naturwissenschaftliche Kompetenz in PISA 2015 – Ergebnisse des internationalen Vergleichs mit einem modifizierten Testansatz [Science competence in PISA 2015 – Results of the international comparison with a modified approach to testing]. In K. Reiss, C. Sälzer, A. Schiepe-Tiska, E. Klieme, & O. Köller (Eds.), PISA 2015. Eine Studie zwischen Kontinuität und Innovation (pp. 45–98). Münster, Germany: Waxmann.


OECD country       M    SD
Japan              538   93
Estonia            534   89
Finland            531   96
Canada             528   92
Korea              516   95
New Zealand        513  104
Slovenia           513   95
Australia          510  102
United Kingdom     509  100
Germany            509   99
Netherlands        509  101
Switzerland        506  100
Ireland            503   89
Belgium            502  100
Denmark            502   90
Poland             501   91
Portugal           501   92
Norway             498   96
United States      496   99
Austria            495   97
France             495  102
Sweden             493  102
OECD average       493   94
Czech Republic     493   95
Spain              493   88
Latvia             490   82
Luxembourg         483  100
Italy              481   91
Hungary            477   96
Iceland            473   91
Israel             467  106
Slovakia           461   99
Greece             455   92
Chile              447   86
Turkey             425   79
Mexico             416   71

Figure 1. Mean and distribution of science performance in PISA 2015 (OECD countries only). The original figure additionally showed, for each country, the percentiles (5 %, 10 %, 25 %, 75 %, 90 %, 95 %), the mean value with its confidence interval (± 2 SE), and whether the country scores significantly above or below the OECD average.


girls at Proficiency Level V or VI is significantly lower than the proportion of boys. This is also true for Germany. At the other end of the scale, however, Germany is, next to Chile, one of only two OECD countries where the proportion of girls at the lowest levels of proficiency (Level Ia and below) is higher than the proportion of boys. Looking at the gender differences in Germany in a bit more detail, one finds that the biggest differences between boys and girls pertain to items requiring the scientific explanation of phenomena and the application of content knowledge. No significant differences can be observed for items requiring the interpretation of data and evidence, the evaluation of scientific enquiry, and the application of procedural or epistemic knowledge.

School-Type Specific Analyses in Germany

Next to the international comparison, the PISA dataset also allows for some detailed analyses of the situation in Germany. Figure 2 shows the mean values for 15-year-old students at different school types in Germany. In contrast to earlier surveys, PISA 2015 only differentiates between the Gymnasium (the academic track offering the highest level of secondary education in Germany) and non-Gymnasium schools. In recent years, many schools offering lower secondary education – with the exception of the Gymnasium – have been transformed into new or differently named school types by combining study programs. A detailed analysis of the non-Gymnasium schools is thus no longer meaningful.

As expected, the mean values reached at the Gymnasium and at non-Gymnasium schools differ considerably. Whereas students at the Gymnasium reach a mean value of 585 score points, the mean value of 476 score points at the non-Gymnasium schools is more than one standard deviation, or almost one and a half proficiency levels, lower. Figure 2 also allows looking at the overlaps in the competence distributions.
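The size of this gap can be checked with a quick back-of-the-envelope calculation. The mean values and the standard deviation for Germany are taken from this report; the roughly 75-point width of a science proficiency level is an approximation, not a figure stated here.

```python
# Worked check of the school-type gap reported above.
gymnasium_mean = 585      # mean science score at the Gymnasium
non_gymnasium_mean = 476  # mean science score at non-Gymnasium schools
germany_sd = 99           # SD for Germany (Figure 1)
level_width = 75          # approximate width of one proficiency level (assumption)

gap = gymnasium_mean - non_gymnasium_mean
print(f"Gap: {gap} points = {gap / germany_sd:.2f} SD "
      f"= {gap / level_width:.2f} proficiency levels")
```

The 109-point difference indeed comes out at slightly more than one national standard deviation and close to one and a half proficiency levels, consistent with the statement above.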

School type             M     Educational participation
Non-Gymnasium schools   476   62 %
Gymnasium               585   33 %
Total sample            509   100 %

Figure 2. Mean and distribution of science performance according to school type in Germany. The original figure additionally showed the distributions across the levels of proficiency (Ib to VI) as well as the percentiles (5 %, 10 %, 25 %, 75 %, 90 %, 95 %) and the mean values with confidence intervals (± 2 SE).


At the non-Gymnasium schools, most students are at Proficiency Levels II and III; at the Gymnasium, the majority of students are proficient at Level IV. As in previous surveys, however, 10 percent of students at non-Gymnasium schools reach a competence level that is above the mean value of the Gymnasium. Conversely, at the Gymnasium, 10 percent of the students reach a competence level that is below the mean value of the non-Gymnasium schools.

Trends in Science Performance

As previously stated, the OECD implemented a number of changes in PISA 2015 to accommodate new developments. Such modifications, however, limit the comparability of results, especially with respect to changes and developments in science performance during the last decade. This has to be kept in mind when interpreting the trend results.

Figure 3 shows the changes in science performance between PISA 2006, when science was the major domain for the first time, and PISA 2015. No statistically significant changes were observed in 23 countries or on the OECD average; next to Germany, this group of countries includes, for example, Japan, Estonia, Canada, Korea, the United Kingdom, and Switzerland. Ten countries achieved significantly lower mean values in PISA 2015 compared to PISA 2006, first and foremost Finland with a decline of 32 score points. In only two countries – Portugal and Norway – can a significant improvement be observed; especially in Portugal, the increase of 27 score points is notable.

Trends have also been analyzed for the Gymnasium and non-Gymnasium schools in Germany. The results can be regarded as an indicator of how successful the differentiated educational system is in providing students with a sound base of scientific literacy while at the same time supporting gifted students and promoting excellence. For the non-Gymnasium schools, no significant change was found. At the Gymnasium, however, mean science performance decreased significantly from PISA 2006 to PISA 2015, by 13 score points.

Summary and Discussion

Overall, the PISA 2015 science assessment yielded positive results for Germany. The mean science performance of 15-year-old students as well as the proportion of top performers in Germany are above the OECD average. Nevertheless, countries like Japan or Canada show that there is still room for improvement. In contrast to earlier surveys, a small gender effect was observed in favor of the boys. Because this


Figure 3. Mean values of science performance in PISA 2006 and PISA 2015 (OECD countries only). The original figure additionally displayed the difference 2015 – 2006 for each country as a bar chart.

OECD country       2015 M (SE)   2006 M (SE)
Japan              538 (3.0)     531 (3.4)
Estonia            534 (2.1)     531 (2.5)
Finland            531 (2.4)     563 (2.0)
Canada             528 (2.1)     534 (2.0)
Korea              516 (3.1)     522 (3.4)
New Zealand        513 (2.4)     530 (2.7)
Slovenia           513 (1.3)     519 (1.1)
Australia          510 (1.5)     527 (2.3)
United Kingdom     509 (2.6)     515 (2.3)
Germany            509 (2.7)     516 (3.8)
Netherlands        509 (2.3)     525 (2.7)
Switzerland        506 (2.9)     512 (3.2)
Ireland            503 (2.4)     508 (3.2)
Belgium            502 (2.3)     510 (2.5)
Denmark            502 (2.4)     496 (3.1)
Poland             501 (2.5)     498 (2.3)
Portugal           501 (2.4)     474 (3.0)
Norway             498 (2.3)     487 (3.1)
United States      496 (3.2)     489 (4.2)
Austria            495 (2.4)     511 (3.9)
France             495 (2.1)     495 (3.4)
Sweden             493 (3.6)     503 (2.4)
OECD average       493 (0.4)     498 (0.5)
Czech Republic     493 (2.3)     513 (3.5)
Spain              493 (2.1)     488 (2.6)
Latvia             490 (1.6)     490 (3.0)
Luxembourg         483 (1.1)     486 (1.1)
Italy              481 (2.5)     475 (2.0)
Hungary            477 (2.4)     504 (2.7)
Iceland            473 (1.7)     491 (1.6)
Israel             467 (3.4)     454 (3.7)
Slovakia           461 (2.6)     488 (2.6)
Greece             455 (3.9)     473 (3.2)
Chile              447 (2.4)     438 (4.3)
Turkey             425 (3.9)     424 (3.8)
Mexico             416 (2.1)     410 (2.7)


difference is quite small, no substantial practical implications should be derived from this finding. The analysis of trends between PISA 2006 and PISA 2015 yields no significant changes in overall science performance. This indicates that a stable level of science competence has been reached and established in Germany. Countries like Finland, New Zealand, or the Netherlands, where a significant decrease in science performance can be observed, show that this should not be taken for granted. Across all OECD countries, however, the science assessment in PISA 2015 tends to result in lower mean values, which might indicate effects of the change in the assessment mode. Trend analyses from a national perspective show a significant decrease in science performance for students at the Gymnasium. Again, this finding has to be viewed against the background of the changes introduced by the OECD in the conception, administration, and analyses of the study. Nevertheless, it seems that Germany is less effective in supporting its top-performing students than other countries and thus does not use the potential of its differentiated educational system to full capacity. Ensuring economic and social welfare requires both scientific literacy for all students and the promotion of highly competent students as potential candidates for scientific or engineering careers. The Förderstrategie für leistungsstarke Schülerinnen und Schüler (Strategy to Foster Highly Competent Students), adopted by the KMK in 2015, can be seen as an important step towards fostering high performers. It thus complements earlier efforts to support students at the lower end of the competence distribution and underlines the importance of addressing both goals of science education.

ipn research group // Aiso Heinze, Olaf Köller, Oliver Lüdtke, Knut Neumann, Ilka Parchmann, Alexander Robitzsch, Silke Rönnebeck, Katrin Schöps

funded by // Standing Conference of the Ministers of Education and Cultural Affairs of the Länder in the Federal Republic of Germany (KMK), Federal Ministry of Education and Research (BMBF)

duration // 2013 – 2016

cooperation // German Institute for International Educational Research (DIPF); TUM School of Education

homepage // www.pisa.tum.de/home


2 Trends in International Mathematics and Science Study (TIMSS) 2015

Science is important in nearly all aspects of life and society, from maintaining and improving health to understanding and solving local, regional, and global environmental issues. Students’ understanding of science should develop throughout their schooling, starting in the early years of preschool and elementary education. The study of science in the elementary and secondary grades provides a critical foundation for students’ future careers and success in life. With this in mind, TIMSS aims to provide countries with information on how students succeed in gaining science knowledge and skills in the early grades as well as in their later school career. TIMSS started in the 1990s and assesses science and mathematics achievement at the fourth and eighth grade levels in countries around the world. Germany participated in the study on fourth graders three times, namely in 2007, 2011, and 2015, allowing for the estimation of trends in science achievement. The study is conducted by the International Association for the Evaluation of Educational Achievement (IEA) and is coordinated by the International Study Center at Boston College, USA. In Germany, TIMSS 2015 is under the direction of the Institute for Education Research and School Development (IFS) at the TU Dortmund. A group of researchers from the IPN is responsible for the science assessment in Germany.

2.1 TIMSS Science Assessment Framework

The TIMSS 2015 science assessment is based on a comprehensive framework originally developed by Bloom and similar to that used in TIMSS 2011. This framework is organized around two dimensions: a content dimension specifying the domains or subject matter to be assessed within science, and a cognitive dimension specifying the domains or thinking processes to be assessed. For the fourth grade assessment there are three content domains: life science, physical science including chemistry, and earth science. Life science includes the topics

· characteristics and life processes of organisms,
· life cycles, reproduction, and heredity,
· organisms, environment, and their interactions,
· ecosystems, and
· health.

Physical science comprises

· classification and properties of matter and changes in matter,
· forms of energy and energy transfer, and
· forces and motion.

And earth science is represented by the content areas

· earth’s structure, physical characteristics, and resources,
· earth’s processes, cycles, and history, and
· earth in the solar system.

The three cognitive dimensions encompass the range of cognitive processes involved in learning science. Here, the TIMSS framework distinguishes knowing (e.g., recognizing, defining, describing, and providing examples), applying (e.g., comparing/contrasting/classifying, relating, using models, interpreting information, explaining), and reasoning (e.g., formulating questions/predicting, designing investigations, evaluating, drawing conclusions, generalizing, justifying).

The TIMSS assessment is designed to ensure a broad coverage of science content and cognitive processes. In 2015, the TIMSS science test also comprises science practices. Practices are fundamental to scientific inquiry in all science disciplines. The TIMSS framework focusses on five practices:

· asking questions based on observations,
· generating evidence,
· working with data,
· answering a research question, and
· making an argument from evidence.

Students of science must become proficient in these practices to develop an understanding of how the scientific enterprise is conducted. Therefore, increasing emphasis has been placed on science practices and science inquiry in many countries’ – including Germany’s – current science curricula, standards, and frameworks. In TIMSS, science practices are not assessed in isolation, but instead in the context of one of the content domains, and draw upon the range of thinking processes specified in the cognitive domains.

2.2 TIMSS 2015 International Benchmarks in Science

In addition to overall performance, TIMSS reports student achievement at four points on the scale, known as international benchmarks. These enable descriptions of proficiency levels in terms of what students reaching each international benchmark know and can do. The international benchmarks range from the reproduction of basic facts (low) to beginning scientific reasoning (advanced). More specifically, students at the highest level (advanced international benchmark) show some understanding of all three content domains (e.g., ecosystems, physical and chemical changes, and geographic processes) and of the process of scientific inquiry. They are developing the skills to interpret the results of a simple experiment, to draw conclusions from diagrams, and to evaluate and support an argument. In contrast, students at the lowest level (low international benchmark) show only basic knowledge of the life and physical sciences (e.g., simple facts on human health or physical properties), can only interpret simple diagrams or tables, and provide short written responses to questions. These levels are cumulative skill sets, meaning that a student who is classified at the highest benchmark can also demonstrate all the skills that exemplify the lower benchmarks.

Research Questions

Concerning science achievement in Germany, TIMSS 2015 addressed the following research questions:


1. What is the mean science achievement of fourth graders in Germany as compared to those of other countries participating in TIMSS 2015?

2. What are the results for fourth graders in Germany in the content domains (life science, physical science, and earth science) and the cognitive domains (knowing, applying, and reasoning)?

3. How large are the groups of fourth graders in Germany who achieve the lowest and the highest international benchmarks?

4. Which trends are visible over time in science between 2007 and 2015?

Method

In TIMSS 2015, 47 countries from around the world participated in the assessments at Grade Level 4. Figure 4 shows a map of the world identifying the TIMSS 2015 countries. For the sake of comparability across countries and across assessments, the main data collection took place in spring/summer 2015.

Within each country, nationally representative samples of approximately 4 000 students from 150 – 200 schools participated in TIMSS 2015. One class from each school took part in the assessment. For TIMSS 2015 in Germany, the number of participating schools was 204, and the number of students within these schools was 3 984.

TIMSS aims to assess core curriculum areas in science, requiring the administration of a large set of test items. The test in TIMSS 2015 comprised 168 items. As such, each student saw only a subset of the total number of test items. Therefore, a matrix-sampling approach was used that involves packaging the entire assessment pool of mathematics and science items into a set of 14 booklets, with each student completing just one booklet. IRT scaling was used to estimate achievement on the overall science scale. Test items were relatively evenly split between two response formats – multiple-choice and constructed-response.
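The idea behind such a matrix-sampling (booklet) design can be sketched as follows. The numbers of blocks and booklets here are invented for illustration and do not reproduce the actual TIMSS 2015 design; the point is that each student works on only a fraction of the pool while every item block still appears in several booklets.

```python
import itertools

# Hypothetical design for illustration only.
n_blocks = 7                              # clusters ("blocks") of items
blocks = [f"B{i}" for i in range(1, n_blocks + 1)]

# Each booklet pairs two blocks. Taking all pairs means every block
# co-occurs with every other block exactly once, and every block appears
# in n_blocks - 1 booklets, while each student sees only 2/7 of the pool.
booklets = list(itertools.combinations(blocks, 2))

print(f"{len(booklets)} booklets from {n_blocks} blocks")
for i, (first, second) in enumerate(booklets, start=1):
    print(f"Booklet {i:2d}: {first} + {second}")

# Verify the balance property: each block appears equally often.
appearances = {b: sum(b in bk for bk in booklets) for b in blocks}
assert all(count == n_blocks - 1 for count in appearances.values())
```

Because every block co-occurs with every other block, responses from different booklets can be linked, which is what allows IRT scaling to place all students on one common scale despite nobody answering the full item pool.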

Figure 4. Countries participating in TIMSS 2015 (Grade 4). Source: IEA, Trends in International Mathematics and Science Study © TIMSS 2015.


Table 1 shows the percentage of test items allocated to each dimension, as specified by the TIMSS framework. Life science accounted for the largest share of items (44 percent), followed by physical science (36 percent), with earth science having the smallest proportion (20 percent). Among the cognitive domains, knowing and applying were each allocated about 40 percent of the test content, with the remaining 21 percent devoted to reasoning. The percentages allocated to each dimension are quite similar to those in TIMSS 2007 and 2011.

Table 1. Proportion of Items per Content and Cognitive Domain

Content domain                       Proportion    Cognitive domain    Proportion
Life science                         44 %          Knowing             40 %
Physical science (incl. chemistry)   36 %          Applying            39 %
Earth science                        20 %          Reasoning           21 %

In order to investigate trends in science achievement, 57 percent of the TIMSS 2015 items stemmed from the 2011 and 2007 assessments; the other 43 percent were newly constructed for the 2015 assessment. TIMSS went to great lengths to ensure that comparisons of student achievement across countries would be as fair and equitable as possible. Although the tests were developed to represent an agreed-upon framework and were intended to have as much in common across countries as possible, the match between the TIMSS 2015 assessment and the science curriculum inevitably varied somewhat across countries. Therefore, a test-curriculum matching analysis (TCMA) was conducted to investigate the extent to which the TIMSS 2015 science assessment was relevant to each country's curriculum. In each participating country, expert groups examined each test item and indicated whether it was covered in their country's intended fourth grade curriculum. For Germany, the TCMA showed that 90 percent of the items were classified as appropriate for fourth grade science classrooms, considerably more than in TIMSS 2007 and TIMSS 2011, where about 80 percent of the items were deemed appropriate.

Results

Overall Performance in Science

Figure 5 summarizes the mean score and position of each participating country relative to Germany. The countries are arranged according to their mean overall science achievement. As in previous cycles of TIMSS, Singapore and the Republic of Korea significantly outperformed all other countries, with means of 590 and 589, respectively. The next

Steffensky, M., Kleickmann, T., Kasper, D., & Köller, O. (2016). Naturwissenschaftliche Kompetenzen im internationalen Vergleich: Testkonzeption und Ergebnisse [An international comparison of science competencies: Test conception and results]. In H. Wendt, W. Bos, C. Selter, O. Köller, K. Schwippert, & D. Kasper (Eds.), TIMSS 2015: Mathematische und naturwissenschaftliche Kompetenzen von Grundschulkindern in Deutschland im internationalen Vergleich (pp. 133–183). Münster, Germany: Waxmann.


Figure 5. Science achievement in the 47 participating countries. Country averages are shown with standard errors in parentheses; a country is classified as above or below the TIMSS scale centerpoint of 500 when the difference is statistically significant (95 % confidence interval for the average, ± 2 SE).

Country averages significantly below the centerpoint: Kuwait 337 (6.2), Morocco 352 (4.7), Saudi Arabia 390 (4.9), Indonesia 397 (4.8), Iran (Islamic Rep. of) 421 (4.0), Oman 431 (3.1), Qatar 436 (4.1), United Arab Emirates 451 (2.8), Georgia 451 (3.7), Bahrain 459 (2.6), Chile 478 (2.7), Cyprus 481 (2.6), Turkey 483 (3.3), France 487 (2.7).

Country averages significantly above the centerpoint: New Zealand 506 (2.7), Portugal 508 (2.2), Belgium (Flemish) 512 (2.3), Italy 516 (2.6), Netherlands 517 (2.7), Spain 518 (2.6), Northern Ireland 520 (2.2), Slovak Republic 520 (2.6), Australia 524 (2.9), Serbia 525 (3.7), Canada 525 (2.6), Denmark 527 (2.1), Lithuania 528 (2.5), Germany 528 (2.4), Ireland 529 (2.4), Croatia 533 (2.1), Czech Republic 534 (2.4), Bulgaria 536 (5.9), England 536 (2.4), Norway (Grade 5) 538 (2.6), Sweden 540 (3.6), Hungary 542 (3.3), Slovenia 543 (2.4), United States 546 (2.2), Poland 547 (2.4), Kazakhstan 550 (4.4), Finland 554 (2.3), Chinese Taipei 555 (1.8), Hong Kong SAR 557 (2.9), Russian Federation 567 (3.2), Japan 569 (1.8), Korea (Rep. of) 589 (2.0), Singapore 590 (3.7).

Source: IEA, Trends in International Mathematics and Science Study © TIMSS 2015.
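The significance classification in Figure 5 can be approximated from the reported means and standard errors using the ± 2 SE rule stated in the figure legend. This is a simplified sketch; the operational TIMSS significance tests are based on jackknife repeated replication rather than this plain confidence-interval criterion.

```python
# Approximate the significance flags in Figure 5: a country's average
# differs from the TIMSS scale centerpoint (500) when the 95%
# confidence interval (mean +/- 2 SE, as in the figure legend)
# excludes 500. Simplified sketch only.

CENTERPOINT = 500

def flag(mean, se, centerpoint=CENTERPOINT):
    lo, hi = mean - 2 * se, mean + 2 * se
    if lo > centerpoint:
        return "above"
    if hi < centerpoint:
        return "below"
    return "not significant"

# A few values read off Figure 5:
assert flag(528, 2.4) == "above"   # Germany
assert flag(337, 6.2) == "below"   # Kuwait
assert flag(506, 2.7) == "above"   # New Zealand: 506 - 5.4 = 500.6 > 500
```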


highest countries were Japan and the Russian Federation, which scored significantly higher than all countries other than the top two. Among the lowest achieving countries, with means below 400 scale points, were Indonesia, Saudi Arabia, Morocco, and Kuwait. Germany is shown in 19th position in the TIMSS science table with an overall score of 528, which is significantly above the international average of 505. An analysis of overall achievement in Germany based solely on the items that fit the intended curriculum (i.e., excluding the 10 % of items the TCMA classified as not appropriate) showed only very small differences from the overall results. Germany's mean score was significantly lower than that of 15 countries, including the U.S., Sweden, Hungary, and England. It did not differ significantly from the means of 10 countries (including Ireland, Denmark, Canada, and Australia). Within the OECD and the EU countries, the mean level of science achievement (527 and 525 scale points, respectively) equaled the level shown by students in Germany. Twenty-two other countries, including the Netherlands, New Zealand, and France, were significantly outperformed by Germany. The achievement difference between Germany's highest and lowest performing five percent of students was 229 scale points, which is similar to the variation in achievement seen in other EU and OECD countries.

Performance in Content and Cognitive Domains

Germany’s performance in all three content ar­eas was relatively consistent, with no domain scores differing significantly from the overall mean of 528. Descriptively, students in Ger­many were relatively weaker in earth science (i.e., a 10­point difference to the overall science score), whereas in life science they displayed no differences to the overall score and only small

Page 18: international large-scale assessments - IPN Startseite€¦ · 2015 (PISA) and between 2007 and 2015 (TIMSS). Researchers at the IPN were involved in the reporting of science and

9293

2 Trends in International Mathematics and Science Study (TIMSS) 2015

relative strengths in physical science (532 scale points). Compared with the means of the contents domain subscales in the EU and OECD coun­tries, Germany performed significantly better on the physical science sub­scale (10 and 9 scale points, respectively). No differences were found for both other content domains.The mean scores of the three cognitive domain subscales (knowing, ap­plying, and reasoning) were also not significantly different to the national mean. Within the OECD and the EU countries, the mean level of knowing and applying subscales equaled the level shown by students in Germany. However, students in Germany outperformed the EU countries on the rea­soning subscale; no differences were found within the OECD countries.

Performance at the International Benchmarks

In Germany, 40 percent of students reached the two highest international benchmarks (8 % reaching the advanced benchmark and 33 % the high international benchmark). Germany was in a group of countries, including Australia, Canada, and the Czech Republic, with relatively low proportions of students performing at the highest level (less than 10 %) and less than 50 percent reaching the high benchmark. Average results for the EU and OECD countries are similar. The difference between Germany's profile and those of the highest scoring countries is clear: In the top two performing countries (Republic of Korea and Singapore), at least 29 percent of students reached the advanced benchmark and at least 71 percent reached the high benchmark. Generally, the highest scoring countries had higher proportions of students reaching the highest levels.
At the other end of the achievement range, 18 percent of the students in Germany performed at the low benchmark, and a further 4 percent failed to reach even this level. Again, these results are similar to those in other EU and OECD countries. Among the high performing countries in science, the respective figures varied from 10 percent (Singapore) to 4 percent (Republic of Korea).

Science Achievement Development From 2007 to 2015

Comparing Germany’s performance over three cycles of TIMSS, we find no overall change in the mean science score. Comparing this trend with findings among the seven countries performing similar to Germa­ny in TIMSS 2007 provides a somewhat mixed picture (Figure 6). Two of these participants (Sweden and Kazakhstan) improved their scores in 2015 but others, like Italy, Slovenia, and the Netherlands showed some

Page 19: international large-scale assessments - IPN Startseite€¦ · 2015 (PISA) and between 2007 and 2015 (TIMSS). Researchers at the IPN were involved in the reporting of science and

INTERNATIONAL LARGE-SCALE ASSESSMENTS

decreases over time, and Australia maintained its previous level of achievement. The differ­ence between attainment between the highest and lowest performing students is smaller in 2015 than in 2007 (230 and 254 scale points, respectively). As was the case for the over­all scale findings TIMSS 2015 neither showed changes on the three content domains nor on the three cognitive domains since TIMSS 2007. With regard to Germany’s proportions of stu­dents at the different proficiency levels some sig­

nificant, but rather small changes occurred since 2007. A positive note is that less students failed to reach the low international benchmark in 2015 (4.0 compared with 5.8 %). Furthermore, the percentage at the intermediate benchmark increased significantly: there were now 38.9 percent reaching this benchmark, instead of 35.4 percent in 2007. However, it is also notable that there has been a significant decrease on the ad­vanced international benchmark from 9.6 to 7.6 percent.

Figure 6. Science achievement in TIMSS 2007 and TIMSS 2015. Blue bars indicate statistically significant differences. Source: IEA, Trends in International Mathematics and Science Study © TIMSS 2015.


Participants    2007: M (SE)    2015: M (SE)    Difference: M15 – M07 (SE)

Georgia 418 (4.6) 451 (3.7) 34 (5.9)

Slovenia 518 (1.9) 543 (2.4) 24 (3.1)

Japan 548 (2.1) 569 (1.8) 21 (2.7)

Russian Federation 546 (4.8) 567 (3.2) 21 (5.7)

Czech Republic 515 (3.1) 534 (2.4) 19 (3.9)

Lithuania 514 (2.4) 530 (2.7) 16 (3.6)

Sweden 525 (2.9) 540 (3.6) 15 (4.6)

Denmark 517 (2.9) 527 (2.1) 10 (3.5)

USA 539 (2.7) 546 (2.2) 7 (3.5)

Hungary 536 (3.3) 542 (3.3) 6 (4.7)

Singapore 587 (4.1) 590 (3.7) 4 (5.5)

Hong Kong 554 (3.5) 557 (2.9) 2 (4.6)

New Zealand 504 (2.6) 506 (2.7) 1 (3.7)

Germany 528 (2.4) 528 (2.4) 1 (3.4)

Taiwan 557 (2.0) 555 (1.8) –1 (2.7)

Australia 527 (3.3) 524 (2.9) –4 (4.4)

Slovakia 526 (4.8) 520 (2.6) –5 (5.4)

United Kingdom 542 (2.9) 536 (2.4) –6 (3.7)

Netherlands 523 (2.6) 517 (2.7) –6 (3.7)

Iran 436 (4.3) 421 (4.0) –15 (5.9)

Italy 535 (3.2) 516 (2.6) –19 (4.1)

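A rough version of the significance test behind the trend comparison in Figure 6 can be sketched from the reported cycle means and standard errors. Note that the operational TIMSS trend test additionally includes a linking error between cycles, which this sketch omits, so it is an approximation only.

```python
# Sketch of judging whether a 2007 -> 2015 change is significant from
# the two cycle means and standard errors. For independent samples the
# SE of the difference is sqrt(se07**2 + se15**2); the operational
# TIMSS computation adds a linking error on top of this.
import math

def trend_is_significant(m07, se07, m15, se15):
    se_diff = math.sqrt(se07**2 + se15**2)
    z = (m15 - m07) / se_diff
    return abs(z) > 1.96   # two-sided 5 % level

# Values from Figure 6:
assert trend_is_significant(418, 4.6, 451, 3.7)       # Georgia: clear gain
assert not trend_is_significant(528, 2.4, 528, 2.4)   # Germany: no change
```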


ipn research group // Thilo Kleickmann, Olaf Köller, Mirjam Steffensky

funded by // Federal Ministry of Education and Research (BMBF)

duration // 2013 – 2016

cooperation // Institute for Education Research and School Development (IFS); TU Dortmund University

homepage // timss.bc.edu/


Summary

Fourth grade students in Germany performed in the upper half of the science rankings in all three study cycles in which Germany participated. The 2007 – 2015 comparison shows stable performance at the level of the participating OECD countries. Yet two aspects seem notable: First, a positive trend over time is visible in many other countries, and second, the gap between Germany and top performing countries such as the Republic of Korea, Singapore, and the Russian Federation is large. Students in Germany displayed no significant strengths or weaknesses in the content or cognitive domains in TIMSS 2015, which indicates that students receive learning opportunities to develop broad knowledge in science. The mean of 532 for physical science was the highest domain score, and it was significantly higher than the corresponding means of the OECD and EU countries, as was already the case in TIMSS 2007. In both study cycles the mean for earth science was the lowest domain score, but still not significantly lower than the overall science score. In Germany, 40 percent of students reached at least the high benchmark in the most recent assessment; these students can be considered well prepared for science learning in secondary school. A similar distribution of students across the proficiency levels was found in TIMSS 2007. A remaining serious issue is the proportion of students reaching only the low benchmark, and even more so those who failed to reach it. Results over time indicate a small improvement from TIMSS 2007 to TIMSS 2015, but still about 20 percent of students in Germany show only very basic science knowledge. These students have unfavorable prerequisites for following science instruction in secondary school.


3 Proficiency Gains in the Longitudinal PISA 2012 – 2013 Assessment

The longitudinal PISA 2012 – 2013 assessment aims to provide data that fosters a greater understanding of how students' proficiencies in the domains of mathematics, reading, and science develop at the end of lower secondary school (from Grade 9 to Grade 10). The study is an extension of PISA 2012 in Germany. Students from 222 schools participated in the regular assessment while in Grade 9. Of the overall German PISA 2012 sample, n = 6 584 students attended a school track that offered schooling at least up to Grade 10 (the exception being mainly the lowest track, the so-called Hauptschule, which often ends after ninth grade). We retested n = 4 610 of these ninth graders about one year later, when they were in Grade 10 (70.0 %). The longitudinal student sample is representative of the respective grade-level German student population excluding the Hauptschule. Hence, the longitudinal extension of PISA 2012 provides an excellent opportunity to study students' proficiency development in the last phase of lower secondary education. The study allows examining proficiency gains assessed on the basis of the international PISA tests (mathematics, reading, and science), as well as of tests assessing the accomplishment of the German national educational standards (GES) for the end of lower secondary education (mathematics, physics, chemistry, and biology), which were developed by the Institute for Educational Quality Improvement (IQB). The comparison of the results given by the two test conceptualizations is interesting insofar as the two sets of instruments are based on different rationales. The tests developed within the PISA framework are based on the literacy concept, meaning that they target students' ability to apply domain-specific knowledge to everyday problems (following the idea of mathematical and scientific literacy). The tests developed on the basis of the GES framework are more closely aligned with the school curriculum. Hence, these tests can be conceived of as more direct measures of the returns of schooling.
The longitudinal PISA 2012 – 2013 study allows examining a variety of important research questions. Among these are (a) the description of average proficiency gains, (b) the comparison between proficiency gains assessed by the tests based on the PISA and the GES frameworks, respectively, and (c) the investigation of whether proficiency development differs between school tracks. In addition, the study provides data that is used in numerous research projects studying the effects of individual (e.g., motivation) and contextual variables (e.g., classroom climate) on students' development of domain-specific proficiencies. Of course, the validity of the results depends on the adequacy of the psychometric approaches used for assessing students' proficiency development. At the IPN we have invested great effort in examining whether the requirements for repeated measures are met in the longitudinal PISA 2012 – 2013 study. Furthermore, based on our results, we have proposed a new item response theory (IRT) model that accounts for declines in students' test-taking effort while working on a test. In the next section, we therefore first summarize key results of the investigation of measurement aspects and then present key results regarding proficiency gains.

3.1 Measurement Requirements for Assessing Competency Development

The validity of results derived from longitudinal studies requires test scores to be comparable over time. In the psychometric literature this requirement is termed (longitudinal) measurement invariance. Violations of measurement invariance can occur for several reasons. For example, test items might be impacted differently by changes in the learning opportunities provided at school: in each grade, the mathematics curriculum emphasizes different subdomains to different degrees (e.g., quadratic equations in Grades 8 and 9). As a consequence, the relative difficulties of items belonging to different subdomains could change over time, such that proficiency gains could not be safely generalized across the content of the whole test. Other violations of measurement invariance could stem from changes in students' test-taking behavior. In the low-stakes tests typical of large-scale studies of student achievement, most students show a decline in their test-taking effort while working on the test. This effect becomes visible in the so-called item position effect: the probability of solving a test item decreases the closer the item is positioned towards the end of the test.
Item position effects are likely to lead to violations of longitudinal measurement invariance for two reasons. First, in many studies the test forms used at the different measurement occasions are changed such that items targeting one domain (e.g., reading) are moved to a different position in the test. As a consequence, test forms, which typically comprise multiple domains, are more strongly affected by position effects when the items belonging to a specific domain have, on average, been moved closer to the end of the test. Second, in longitudinal studies, increases in the prevalence of unmotivated test-taking behavior can occur even when the test design is held constant over time. In both scenarios, the items' properties (e.g., their difficulties) change over time as a function of their position in the test.
In the case of the longitudinal PISA 2012 – 2013 study, meaningful violations of measurement invariance that depend on item characteristics (e.g., due to changes in curricular emphases) do not appear likely. The tests developed in the PISA framework exhibit an adequate level of measurement invariance with respect to a variety of country-specific school curricula. Because the cross-country differences in learning opportunities are far larger than the within-country curricular changes from Grade 9 to 10, meaningful violations of longitudinal measurement invariance are unlikely. For the tests developed on the basis of the GES framework we did not expect such violations either, because the content of these tests represents the German school curricula in the last phase of lower secondary education. However, violations of measurement invariance that depend on the items' positions in the test (i.e., position effects) are plausible. They are to be expected when the average position of items is changed across measurement occasions. Furthermore, some previous studies provide evidence of increases in unmotivated test-taking behavior when students are retested with low-stakes tests, meaning that the strength of position effects could increase even when the same measurement design is used both times.

Nagy, G., Lüdtke, O., & Köller, O. (in press). Modeling test context effects in longitudinal achievement data: Examining position effects in the longitudinal PISA 2012 assessment. Psychological Test and Assessment Modeling.


In order to examine violations of measurement invariance that depend on item positions in the test, we made use of the rotated booklet design employed in the longitudinal PISA 2012 – 2013 assessment. In this design, groups of items (i.e., item clusters) are placed at different positions in different test forms (i.e., booklets); each of the 13 booklets comprises 4 clusters. This design allows estimating position effects by comparing item difficulties across booklets. If booklet effects in line with position effects are detected, they can be accounted for by defining a reference position that is used to define students' proficiency levels at all points in time.
The items based on the PISA framework were administered at various positions at both measurement occasions. For each domain, a subsample of students responded to some items presented at the first position in the test. This allows defining the metric of the proficiency variables with respect to the first position in the test at both occasions of measurement, thereby controlling for position effects and their changes across time.
This approach was, however, not feasible for the GES tests because, in the assessment design employed at the second measurement occasion, some domains were not assessed at the first position. In the case of the GES tests, all booklets administered at the first measurement occasion instead focused on a single domain (either mathematics, physics, chemistry, or biology). Hence, the proficiency estimates at the first occasion closely correspond to a situation in which students were given the items at an average position of the test (i.e., the middle position). In order to account for the imbalance of the test design at the second measurement occasion, we estimated the changes in proficiency levels expected for an average position by using linear approximations. Note that this approach accounts for the imbalance of the assessment designs, but does not control for the fact that the prevalence of reduced test-taking effort could change over time (i.e., position effects could become stronger).
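The core idea of reading position effects out of the rotated design can be sketched as follows: because the same item cluster is administered at different positions in different booklets, its solution rate can be compared across positions. The solution rates below are invented, and the real analyses used IRT models rather than this simple logit comparison.

```python
# Illustrative sketch of estimating a position effect from a rotated
# booklet design. One cluster's (invented) solution rates at the four
# within-booklet positions are converted to logits and expressed
# relative to the first position, which serves as the reference
# position mentioned in the text.
import math

def logit(p):
    return math.log(p / (1 - p))

# Invented solution rates of one item cluster at positions 1..4:
p_correct = {1: 0.62, 2: 0.58, 3: 0.53, 4: 0.47}

# Position effect in the logit metric relative to the first position:
effects = {pos: logit(p) - logit(p_correct[1]) for pos, p in p_correct.items()}
# effects[1] is 0.0 by construction; later positions come out
# increasingly negative, mirroring the pattern shown in Figure 7.
```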

Results

As expected, our analyses did not reveal any indication of meaningful violations of the longitudinal invariance assumption that depend on the characteristics of the items. Item parameters (i.e., item difficulties) estimated in several IRT calibration runs exhibited high concordance between measurement occasions and school tracks, for the PISA as well as for the GES tests. However, the analyses based on the newly proposed IRT models provided evidence for strong position effects in the tests based on the PISA framework. The pattern of results is illustrated in Figure 7. Almost all booklet effects were negative and became

Nagy, G., Haag, N., Lüdtke, O., & Köller, O. (in press). Längsschnittskalierung der Tests zur Überprüfung des Erreichens der Bildungsstandards der Sekundarstufe I im PISA-Längsschnitt 2012/2013 [Longitudinal IRT scaling of tests to verify achievement of the educational standards for secondary Level I in the PISA longitudinal assessment 2012/2013]. Zeitschrift für Erziehungswissenschaft.

Figure 7. Test context effects at the level of item clusters, by item cluster position (x-axis), shown separately for mathematics, science, and reading in nonacademic and academic track schools at the first (Grade 9) and second (Grade 10) measurement occasions. Test context effects assessed in one booklet are connected by a dotted line. Effects are presented in the logit metric (y-axis).


more extreme the later a cluster was included in a booklet. This pattern indicates that the identified booklet effects reflect position effects. The strength of the position effects varied between domains: mathematics was least and reading most strongly impacted. Students from nonacademic track schools were more strongly impacted by position effects than students from academic track schools, and the position effects in nonacademic track schools became stronger at the second assessment, whereas in academic track schools these effects remained quite constant over time.
In the case of the GES tests, proficiency estimates for the first measurement occasion in 2012 were provided by the IQB (mathematics, physics, chemistry, and biology). Here, the impact of position effects was examined by comparing estimates of proficiency change across booklets. If position effects operate at the second measurement occasion, gains are expected to become smaller the later the item clusters are, on average, presented in the test. With the exception of mathematics, we found strong evidence of monotone score declines by position, with students from nonacademic tracks again being more strongly impacted. These results are summarized in Figure 8.
Taken together, our findings provide a strong indication that the tests employed in the longitudinal PISA 2012 – 2013 study are affected by declines in students' test-taking effort. Furthermore, these effects became stronger at the second measurement occasion in the case of the PISA instruments; such changes could not be evaluated in the case of the GES tests.

3.2 Average Proficiency Gains

The main purpose of the longitudinal PISA 2012 – 2013 study was to estimate students' proficiency gains in the last phase of lower secondary education (Grades 9 to 10). In this section, we report key results pertaining to students' proficiency gains estimated on the basis of the PISA tests (mathematics, reading, and science) and the GES tests (mathematics, physics, chemistry, and biology). Moreover, we considered the biasing effect of the declines in test-taking effort by reporting results derived from conventional cross-time comparisons ignoring position effects, as well as results from analyses in which position effects were controlled for.
Figure 9 presents the change estimates derived on the basis of the PISA tests. As shown there, ignoring position effects resulted in negative change estimates for reading in the full sample and for science among nonacademic track students. These counterintuitive results disappeared when position effects were controlled for. The results indicated small proficiency gains in mathematics and larger gains in science for students from the academic track, whereas students from the nonacademic tracks did

Nagy, G., Lüdtke, O., Köller, O., & Heine, J.-H. (in press). IRT-Skalierung der Tests im PISA-Längsschnitt 2012/2013: Auswirkungen von Testkontexteffekten auf die Zuwachsschätzung [IRT scaling of the tests in the PISA longitudinal assessment 2012/2013: Impact of test context effects on growth estimates]. Zeitschrift für Erziehungswissenschaft.

Figure 8. Average score declines (y-axis) by average position of item clusters in test booklets (x-axis) in the GES tests (mathematics global scale; physics, chemistry, and biology, each for content knowledge and scientific inquiry). Results for nonacademic (blue circles) and academic track students (yellow circles), with linear approximations.

Page 27: international large-scale assessments - IPN Startseite€¦ · 2015 (PISA) and between 2007 and 2015 (TIMSS). Researchers at the IPN were involved in the reporting of science and

INTERNATIONAL LARGE-SCALE ASSESSMENTS

not show meaningful increases in any domain assessed by the PISA test.

Results derived on the basis of the GES tests indicated overall meaningful increases in students' proficiencies. Nevertheless, the findings provided some indication that the change estimates are affected by position effects. Figure 10 shows the estimated proficiency gains when position effects are ignored and when they are accounted for. The adjusted estimates correspond to the change scores expected for booklets in which the test items are presented at an average (i.e., middle) position in the test. Taken together, the GES tests indicated increases in all domains in the full sample. However, proficiency gains were overestimated in domains where the majority of students received the test material at the beginning of the test (e.g., chemistry content knowledge), and underestimated where most students were confronted with the respective domain-specific test items in the later parts of the test (e.g., biology scientific inquiry). Average proficiency gains in the full sample ranged between 0.20 and 0.48 units on the logit metric (adjusted for position effects), indicating relatively large effect sizes. Increases were lower in the nonacademic tracks than in the academic track (differences ranging from −0.10 to −0.34 units on the logit metric), with the exception of mathematics, in which gains of similar size were observed in both groups.

[Figure 9: two panels (Ignoring Position Effects; Controlling for Position Effects) showing proficiency gains on the logit metric for Mathematics, Reading, and Science, overall and by school track (nonacademic, academic).]
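As a rough aid for reading the logit metric, the following worked example applies a Rasch-type response function; the 0.4-logit gain is an illustrative value from the reported range, not a study estimate:

```python
import math

# Worked example: what a gain on the logit metric means under a Rasch-type
# model, where P(correct) = 1 / (1 + exp(-(theta - b))). A 0.4-logit gain
# (an illustrative value within the reported 0.20-0.48 range) is applied to
# an item the student initially solved with probability .50.

def p_correct(theta: float, b: float) -> float:
    """Probability of solving an item of difficulty b at ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

before = p_correct(0.0, 0.0)  # item matched to initial ability: p = .50
after = p_correct(0.4, 0.0)   # same item after a 0.4-logit gain
print(f"p before: {before:.2f}, p after: {after:.2f}")  # .50 -> about .60
```

In other words, a gain in the middle of the reported range raises the success probability on a previously fifty-fifty item to roughly 60 percent.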

Summary and Conclusion

In summary, the results regarding students' development of domain-specific proficiencies are contaminated by position effects reflecting declines in test-taking effort. Our results show that ignoring position effects can lead to counterintuitive findings, as indicated by the negative change estimates in the case of the PISA tests. When position effects were taken into account, the results obtained with the PISA and GES tests pointed to the importance of considering the framework on which the tests used for assessing students' proficiencies are based. Proficiency gains defined with respect to the literacy concept used in the PISA assessments were smaller than gains on tests reflecting the school curricula, as in the case of the GES tests. Beyond these differences, both test conceptions yielded a similar pattern of school track differences in proficiency gains. In the case of mathematics, no school track differences were found on the basis of any measure, whereas in the case of science both the PISA and the GES tests indicated larger proficiency gains for academic track students.

Figure 9. Estimates of average proficiency gains for the tests based on the PISA framework by school track. Results for analyses ignoring and accounting for position effects. Math. = Mathematics, Read. = Reading, Scie. = Natural Sciences.



The longitudinal PISA 2012 – 2013 study exemplifies important challenges in assessing students' proficiency development that are typically neglected in applied research settings. Position effects, which can serve as indicators of students' test-taking persistence, are, however, only one kind of systematic bias that may occur in longitudinal studies of student achievement. Other biases could stem from random guessing behavior and from practice effects due to repeated testing. The IPN will therefore continue to follow research investigating such biasing factors.

ipn research group // Olaf Köller, Oliver Lüdtke, Gabriel Nagy

funded by // Standing Conference of the Ministers of Education and Cultural Affairs of the Länder in the Federal Republic of Germany (KMK), Federal Ministry of Education and Research (BMBF)

duration // 2014 – 2016

cooperation // Centre for International Student Assessment


[Figure 10: two panels (Ignoring Position Effects; Controlling for Position Effects) showing proficiency gains on the logit metric for Math., Phys. CK, Phys. SI, Chem. CK, Chem. SI, Biol. CK, and Biol. SI, overall and by school track (nonacademic, academic).]

Figure 10. Estimates of average proficiency gains for the tests based on the GES framework by school track. Results for analyses ignoring and accounting for position effects. CK = content knowledge, SI = scientific inquiry, Math. = Mathematics, Phys. = Physics, Chem. = Chemistry, Biol. = Biology.