
Implementation of Large-Scale Education Assessments, First Edition. Edited by Petra Lietz, John C. Cresswell, Keith F. Rust and Raymond J. Adams. © 2017 John Wiley & Sons Ltd. Published 2017 by John Wiley & Sons Ltd.

1 Implementation of Large‐Scale Education Assessments
Petra Lietz, John C. Cresswell, Keith F. Rust and Raymond J. Adams

1.1 Introduction

The 60 years that followed a study of mathematics in 12 countries conducted by the International Association for the Evaluation of Educational Achievement (IEA) in 1964 have seen a proliferation of large‐scale assessments (LSAs) in education. In a recent systematic review of the impact of LSAs on education policy (Best et al., 2013), it was estimated that LSAs in education are now being undertaken in about 70% of the countries in the world.

The Programme for International Student Assessment (PISA) conducted by the Organisation for Economic Co‐operation and Development (OECD) was implemented in 75 countries in 2015 with around 510 000 participating students and their schools. Similarly, the Trends in International Mathematics and Science Study (TIMSS), conducted by the IEA, collected information from schools and students in 59 countries in 2015.


This book is about the implementation of LSAs in schools, which can be considered to involve 13 key areas. These start with the explication of policy goals and issues, assessment frameworks, test and questionnaire designs, item development, translation and linguistic control as well as sampling. They also cover field operations, technical standards, data collection, coding and management as well as quality assurance measures. Finally, test and questionnaire data have to be scaled and analysed while a database is produced and accompanied by dissemination and the reporting of results. While much of the book has been written from a central coordinating and management perspective, two chapters illustrate the actual implementation of LSAs and highlight the project teams and infrastructure required for participation in such assessments. Figure 1.2 in the concluding section of this chapter provides details regarding where each of these 13 key areas is covered in the chapters of this book.

Participation in these studies, on a continuing basis, is now widespread, as is indicated in Appendix 1.A. Furthermore, their results have become integral to the general public discussion of educational progress and international comparisons in a wide range of countries, with the impact of LSAs on education policy being demonstrated (e.g. Baker & LeTendre, 2005; Best et al., 2013; Breakspear, 2012; Gilmore, 2005). Therefore, it seems timely to bring together in one place the collective knowledge of those who routinely conduct these studies, with the aim of informing users of the results as to how such studies are conducted and providing a handbook for future practitioners of current and prospective studies.

While the emphasis throughout the book is on the practical implementation of LSAs, it is grounded in theories of psychometrics, statistics, quality improvement and survey communication. The chapters of this book seek to cover in one place almost every aspect of the design, implementation and analysis of LSAs (see Figure 1.2), with perhaps greater emphasis on the aspects of implementation than can be found elsewhere. This emphasis is intended to complement other recent texts with related content but which have a greater focus on the analysis of data from LSAs (e.g. Rutkowski, von Davier & Rutkowski, 2013).

This introductory chapter first provides some context in terms of the development of international, regional and national assessments and the policy context in which they occur. Then, the purposes for countries to undertake such assessments, particularly with a view to evidence‐based policymaking in education, are discussed. This is followed by a description of the content of the book. The chapter finishes with considerations as to where LSAs might be headed and what is likely to shape their development.


1.2 International, Regional and National Assessment Programmes in Education

The IEA first started a programme of large‐scale evaluation studies in education with a pilot study, conducted in 1959–1961, to explore the feasibility of such an endeavour (Foshay et al., 1962). After the feasibility study had shown that international comparative studies in education were indeed possible, the first content area to be tested was mathematics, with the First International Mathematics Study conducted in 12 countries in 1962–1967 (Husén, 1967; Postlethwaite, 1967), followed by the content areas of the six subject surveys, namely, civic education, English as a foreign language, French as a foreign language, literature education, reading comprehension and science, conducted in 18 countries in 1970–1971. Since then, as can be seen in Appendix 1.A, participation in international studies of education has grown considerably, with 59 and 75 countries and economies, respectively, participating in the latest administrations of TIMSS by the IEA in 2015 and PISA by the OECD in 2015.

In addition to international studies conducted by the IEA since the late 1950s and by the OECD since 2000, three assessment programmes with a regional focus have been designed and implemented, commencing in the mid‐1990s. First, the Conference of Education Ministers of Countries Using French as the Language of Communication (Conférence des ministres de l'Education des États et gouvernements de la Francophonie – CONFEMEN) conducts the Programme d'Analyse des Systèmes Educatifs de la CONFEMEN (PASEC). Since its first data collection in 1991, assessments have been undertaken in over 20 francophone countries, not only in Africa but also in other parts of the world (e.g. Cambodia, Laos and Vietnam). Second, the Southern and Eastern African Consortium for Monitoring Educational Quality (SACMEQ), with the support of the UNESCO International Institute for Educational Planning (IIEP) in Paris, has undertaken four data collections since 1995, with the latest assessment in 2012–2014 (SACMEQ IV) involving 15 countries in southern and eastern Africa. Third, the Latin American Laboratory for Assessment of the Quality of Education (LLECE is the Spanish acronym), with the assistance of UNESCO's Regional Bureau for Education in Latin America and the Caribbean (OREALC), has undertaken three rounds of data collection since 1997, with 15 countries participating in the Third Regional Comparative and Explanatory Study (TERCE) in 2013. First steps towards an assessment in the Asia‐Pacific region are currently being undertaken through the Southeast Asian Primary Learning Metrics (SEA‐PLM) initiative.

In terms of LSAs of student learning, a distinction is made here between LSAs that are intended to be representative of an entire education system,


which may measure and monitor learning outcomes for various subgroups (e.g. by gender or socio‐economic background), and large‐scale examinations that are usually national in scope and which report or certify individual students' achievement (Kellaghan, Greaney & Murray, 2009). Certifying examinations may be used by education systems to attest achievement at the end of primary or secondary education, for example, or education systems may use examinations to select students and allocate placements for further or specialised study, such as university entrance or scholarship examinations. The focus of this book is on the implementation of LSAs of student learning that are representative of education systems, particularly international assessments that compare education systems and student learning across participating countries.

Parallel to the growth in international assessments, the number of countries around the world administering national assessments in any year has also increased – from 28 in 1995 to 57 in 2006 (Benavot & Tanner, 2007). For economically developing countries in the period from 1959 to 2009, Kamens and Benavot (2011) reported the highest number of national assessments in one year as 37 in 1999. Also in the 1990s, most of the countries in Central and South America introduced national assessments (e.g. Argentina, Bolivia, Brazil, Colombia, Dominican Republic, Ecuador, El Salvador, Guatemala, Paraguay, Peru, Uruguay and Venezuela) through the Partnership for Educational Revitalization in the Americas (PREAL) (Ferrer, 2006), although some introduced them earlier (e.g. Chile in 1982 and Costa Rica in 1986).

International, regional and national assessment programmes can all be considered as LSAs in education. While this book focuses mainly on international assessment programmes conducted in primary and secondary education, it also contains examples and illustrations from regional and national assessments where appropriate.

1.3 Purposes of LSAs in Education

Data from LSAs provide information regarding the extent to which students of a particular age or grade in an education system are learning what is expected in terms of certain content and skills. In addition, they assess differences in achievement levels by subgroups such as gender or region and factors that are correlated with different levels of achievement. Thus, a general purpose of participation in LSAs is to obtain information on a system's educational outcomes and – if questionnaires are administered to obtain background information from students, teachers, parents and/or


schools – the associated factors, which, in turn, can assist policymakers and other stakeholders in the education system in making policy and resourcing decisions for improvement (Anderson, Chiu & Yore, 2010; Benavot & Tanner, 2007; Braun, Kanjee & Bettinger, 2006; Grek, 2009; Postlethwaite & Kellaghan, 2008). This approach to education policymaking, based on evidence, including data from LSAs, has been adopted around the world, with Wiseman (2010, p. 2) stating that it is 'the most frequently reported method used by politicians and policymakers', which he argues can be considered a global norm for educational governance.

More specifically, Wiseman (2010) has put forward three main purposes for evidence‐based policymaking, namely, measuring and ensuring quality, ensuring equity and accountability. To fulfil the purpose of measuring quality, comparisons of performance across countries and over time tend to be undertaken. To provide indicators of equity, the performance of subgroups in terms of gender, socio‐economic status, school type or region tends to be compared. Accountability refers to the use of assessment results to monitor and report achievement, sometimes publicly, in order to press schools and other stakeholders to improve practice and meet defined curricular and performance standards. In addition, assessments conducted for accountability purposes may use assessment data to implement resource allocation policies (e.g. staff remuneration and contracts). Accountability is more frequently an associated goal of national assessment programmes than of international assessment programmes.

To explicate further the way in which information from LSAs is used in education policymaking, models of the policy cycle are frequently put forward (e.g. Bridgman & Davis, 2004; Haddad & Demsky, 1995; Sutcliffe & Court, 2005). While most models include between six and eight stages, they seem to share four stages, namely, agenda setting, policy formulation, policy implementation and monitoring and evaluation. Agenda setting is the awareness of and priority given to an issue or problem, whereas policy formulation refers to the analytical and political ways in which options and strategies are constructed. Policy implementation covers the forms and nature of policy administration and activities in the field. In the final step, monitoring and evaluation involves an appraisal of the extent to which implemented policies have achieved the intended aims and objectives. A model showing these four steps is shown in Figure 1.1.

Regardless of their purpose, data from LSAs are reported mainly through international, regional and national reports. However, these data are also used quite extensively in secondary data analyses (e.g. Hansen, Gustafsson & Rosén, 2014; Howie & Plomp, 2006; Owens, 2013), as well as meta‐analyses


(e.g. Else‐Quest, Hyde & Linn, 2010; Lietz, 2006) which frequently lead to policy recommendations.

While recommendations are widespread, examples of the actual impact of these assessments on education policy are often provided in a more anecdotal or case study fashion (see Figazollo, 2009; Hanushek & Woessmann, 2010; McKinsey & Company, 2010) or by the main initiators of these assessments (e.g. Husén, 1967). Moreover, surveys have been conducted to ascertain the policy impact of these assessments. As these surveys have frequently been commissioned or initiated by the organisation responsible for the assessment (e.g. Breakspear, 2012 for the OECD; Gilmore, 2005 for the IEA), a certain positive predisposition regarding the effectiveness of the link between assessment and policy could be assumed. Similarly, surveys of and interviews with staff in ministries and entities that participate in such assessments (e.g. UNESCO, 2013), and that rely on future funding to continue their participation, are likely to report positively on the effects of assessment results on education policymaking.

Two systematic reviews that were conducted recently (Best et al., 2013; Tobin et al., 2015) took a different approach by systematically locating and analysing available evidence of links between LSA programmes and education policy. In other words, these reviews did not include reports or articles that resulted in policy recommendations or surveys of participating entities' perceived impact of assessments on policy but looked for actual evidence of

[Figure 1.1 shows four stages in a cycle: agenda setting → policy formulation → policy implementation → monitoring and policy evaluation.]

Figure 1.1 Simplified model of the policy cycle (Source: Sutcliffe and Court (2005). Reproduced with permission from the Overseas Development Institute)


an assessment–policy link. In the review that focused on such a link in economically developing countries between 1990 and 2011 (Best et al., 2013), of 1325 uniquely identified materials only 54 were considered to provide such evidence. In the review that focused on all countries in the Asia‐Pacific between 1990 and 2013 (Tobin et al., 2015), 68 of the 1301 uniquely identified materials showed evidence of such a link.

Results of these systematic reviews revealed some interesting insights into the use of LSAs as follows:

• Just under half of the assessment programmes in the review were national in coverage, followed by one‐third international programmes, while approximately one‐fifth were regional assessment programmes and only a few were subnational assessment programmes.

• Of the regional assessment programmes SACMEQ featured most often, followed by LLECE/SERCE and PASEC.

• Of the international assessments, PISA featured most often, followed by TIMSS and the Progress in International Reading Literacy Study (PIRLS).

• LSA programmes were most often intended to measure and ensure educational quality. Assessment programmes were less often used for the policy goals of equity or accountability for specific education matters.

• The most frequent education policies impacted upon by the use of assessment data were system‐level policies regarding (i) curriculum standards and reform, (ii) performance standards and (iii) assessment policies.

• The most common facilitators for assessment data to be used in policymaking, regardless of the type of assessment programme, were media and public opinion as well as appropriate and ongoing dissemination to stakeholders.

• Materials which explicitly noted no impact on the policy process outlined barriers to the use of assessment data, which were thematically grouped as problems relating to (i) the (low) quality of an assessment programme and analyses, (ii) financial constraints, (iii) weak assessment bodies and fragmented government agencies and (iv) low technical capacity of assessment staff.

• The high quality of the assessment programme was frequently seen as a facilitator to the use of regional assessment data, while the lack of quality was often regarded as a barrier to the use of subnational and national assessments. In international assessments, quality emerged as both a facilitator and a barrier. The high quality of an assessment programme was seen as a facilitator in so far as the results were credible, robust and not questioned by stakeholders. It was also regarded as a barrier in that


the requirement of having to adhere to the internationally defined high‐quality standards was frequently a challenge to participating countries.

As the chapters throughout this book demonstrate, for assessment programmes to be of high quality, much effort, expertise, time and financial resources are required. While developing and maintaining the necessary funding and expertise continues to be a challenge, ultimately, the highest quality standards are required if information from LSAs is to be taken seriously by policymakers and other users of these data. Such high technical quality, combined with the ongoing integration of assessments into policy processes and an ongoing and varied media and communication strategy, will increase the usefulness of evidence from LSAs for various stakeholders (Tobin et al., 2015).

1.3.1 Trend as a Specific Purpose of LSAs in Education

One‐off or cross‐sectional assessments can provide information about an outcome of interest at one point in time. This is of some interest in the comparative sense, as participating systems can look at each other's performance on the outcome and see what they can learn from those systems that (i) perform at a higher level, (ii) manage to produce greater homogeneity between the highest and lowest achievers or, preferably, (iii) do both. These comparisons, however, are made across cultures, and it is frequently questioned which cultures or countries it is appropriate or reasonable to compare (e.g. Goldstein & Thomas, 2008). The relatively higher achievement of many Asian countries in PISA and TIMSS compared to other countries is often argued to be a consequence of differences in basic tenets and resulting dispositions, beliefs and behaviours across countries. Thus, various authors (e.g. Bracey, 2005; Leung, 2006; Minkov, 2011; Stankov, 2010) demonstrate cultural differences across societies regarding, for example, the value of education, students' effort or respect for teachers, which makes it difficult to see how countries can learn from each other to improve outcomes. Therefore, assessments that enable comparisons over time within countries are often considered to be more meaningful.

In England, the follow‐up study of the Plowden National Survey of 1964 was undertaken 4 years later in 1968 and was reported by Peaker (1967, 1971). This study followed the same students over the two occasions. Similarly, in Australia, a longitudinal Study of School Performance was carried out in 1975 with 10‐ and 14‐year‐old students in the fields of literacy and numeracy, with a subsample of students followed up 4 years later in 1979 (Bourke et al., 1981; Keeves & Bourke, 1976; Williams et al., 1980).


Both of these studies were longitudinal in nature, which is relatively rare in the realm of LSAs. LSAs tend instead to use repeated cross‐sectional assessments as a means to gauge changes over time across comparable cohorts, rather than looking at growth within a cohort by following the same individuals over time. The most substantial and continuing programme of this latter type of assessment of national scope is the National Assessment of Educational Progress (NAEP) in the United States. It was initiated in 1969 in order to assess achievement at the levels of Grade 4, Grade 8 and Grade 12 in reading, mathematics and science (see, e.g. Jones & Olkin, 2004; Tyler, 1985).

The main international assessments are cross‐sectional in nature and are repeated at regular intervals, with PIRLS conducted every 5 years, PISA every 3 years and TIMSS every 4 years. As the target population (e.g. 15‐year‐olds or Grade 4 students) remains the same on each occasion, this enables the monitoring of student outcomes for the target population over time. Notably, the importance of providing trend information was reflected in IEA's change in what 'TIMSS' stood for. In the 1995 assessment, the 'T' stood for 'third', and this was maintained in 1999, when the study was called the 'Third International Mathematics and Science Study Repeat'. By the time of the 2003 assessment, however, the 'T' stood for 'Trends in International Mathematics and Science Study'.

Now that PISA has assessed all major domains (i.e. reading, mathematics and science) twice, the attention paid to the results within each country increasingly focuses on national trends, both overall and for population subgroups, rather than on cross‐national comparisons. It is no longer news that Korean students substantially outperform US students in mathematics. Indeed, if an implementation of PISA were suddenly to show this not to be the case, the results would not be believed, even though a different cohort is assessed each time. Generally, participating countries are most interested in whether or not there is evidence of improvement over time, both since the prior assessment and over the longer term. Such comparisons within a country over time are of great interest since they are not affected by the possible unique effects of culture which can be seen as problematic for cross‐country comparisons.

Increasingly, countries that participate in PISA supplement their samples with additional students, not in a way that will appreciably improve the precision of comparisons with other countries but in ways that will improve the precision of trend measurements for key demographic groups within the country, such as ethnic or language minorities or students of lower socio‐economic status. Of course, this does not preclude the occasional instances of political leaders who vow to show improvements in education over time through a rise in the rankings of the PISA or TIMSS 'league tables' (e.g. Ferrari, 2012).


1.4 Key Areas for the Implementation of LSAs in Education

As emphasised at the beginning of this introduction and found in the systematic reviews, for LSAs to be robust and useful, they need to be of high quality and technically sound, have a comprehensive communication strategy and be useful for education policy. To achieve this aim, 13 key areas need to be considered in the implementation of LSAs (see Figure 1.2).

While Figure 1.2 illustrates where these key areas are discussed in the chapters of this book, a brief summary of the content of each chapter is given below.

Chapter 2 – Test Design and Objectives

Given that all LSAs have to address the 13 elements of a robust assessment programme, why and how do these assessments differ from one another in practice? The answer to this question lies in the way that an assessment's purpose and guiding principles shape decisions about who and what should be assessed. In this chapter, Dara Ramalingam outlines the key features of a selection of LSAs to illustrate the way in which their different purposes and assessment frameworks have led to key differences in decisions about test content, target population and sampling.

Chapter 3 – Test Development

All educational assessments that seek to provide accurate information about the test takers' knowledge, skills and understanding in the domain of interest share a number of common characteristics. These include tasks which elicit responses that contribute to building a sense of the test takers' capacity in the domain. This also means that the tests draw on knowledge and understanding that are intrinsic to the domain and are not likely to be more or less difficult for any individual or group because of knowledge or skills that are irrelevant to the domain. The tests must be in a format that is suited to the kind of questions being asked, provide coverage of the area of learning that is under investigation and be practically manageable. Juliette Mendelovits describes the additional challenges for LSAs to comply with these general 'best practice' characteristics, as international LSAs start with the development of frameworks that guide the development of tests that are subsequently administered to many thousands of students in diverse countries, cultures and contexts.

Chapter 4 – Design, Development and Implementation of Contextual Questionnaires in LSAs

In order to be relevant to education policy and practice, LSAs routinely collect contextual information through questionnaires to enable the


[Figure 1.2 shows 13 key areas arranged around the central label 'A robust assessment programme': policy goals and issues; test design; assessment framework; high‐quality items; sample design; linguistic quality control; technical standards; standardised operations; data management; scaling methodology; data analysis; reporting and dissemination; and project team and infrastructure. Each area is annotated with the chapter or chapters of this book (Chapters 1–17) in which it is covered.]

Figure 1.2 Key areas of a robust assessment programme


examination of factors that are linked to differences in student performance. In addition, information obtained by contextual questionnaires is used independently of performance data to generate indicators of non‐cognitive learning outcomes, such as students' attitudes towards reading, mathematics self‐efficacy and interest in science, or indicators about teacher education and satisfaction as well as the application of instructional strategies. In this chapter, Petra Lietz not only gives an overview of the content of questionnaires for students, parents, teachers and schools in LSAs but also discusses and illustrates the questionnaire design process, from questionnaire framework development to issues such as question order and length as well as question and response formats.

Chapter 5 – Sample Design, Weighting and Calculation of Sampling Variance

Since the goal of LSAs as we have characterised them is to measure the achievement of populations and specified subgroups rather than that of individual students and schools, it is neither necessary, nor in most cases feasible, to assess all students in the target population within each participating country. Hence, the selection of an appropriate sample of students, generally from a sample of schools, is a key technical requirement for these studies. In this chapter, Keith Rust, Sheila Krawchuk and Christian Monseur describe the steps involved in selecting such samples and their rationale. Given that a complex, stratified multistage sample is selected in most instances, those analysing the data must use appropriate methods of inference that take into account the effects of the sample design on the sample configuration. Furthermore, some degree of school and student nonresponse is bound to occur, and methods are needed in an effort to mitigate any bias that such nonresponse might introduce.
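The logic of such two‐stage samples can be sketched in a few lines of code. This is a purely illustrative, simplified example: it assumes schools are selected with equal probability and a fixed number of students is then sampled within each selected school. The function name and the design are assumptions for exposition, not the actual PISA or TIMSS procedures (which use probability‐proportional‐to‐size school selection and nonresponse adjustments, as discussed in Chapter 5).

```python
# Hypothetical sketch: student base weights in a two-stage
# equal-probability sample (NOT the actual PISA/TIMSS procedure).

def base_weight(n_schools_pop, n_schools_sampled,
                school_size, n_students_sampled):
    """Student base weight = inverse of the overall selection probability."""
    p_school = n_schools_sampled / n_schools_pop    # stage 1: school is selected
    p_student = n_students_sampled / school_size    # stage 2: student within school
    return 1.0 / (p_school * p_student)

# Example: 40 of 400 schools are sampled; 20 students from a school of 100.
w = base_weight(400, 40, 100, 20)
print(w)  # 50.0 -> each sampled student "represents" 50 students
```

In this toy design the weight is constant (the sample is self‐weighting); in real LSAs, unequal selection probabilities and nonresponse adjustments make the weights vary, which is one reason design‐based variance estimation is needed.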

Chapter 6 – Translation and Cultural Appropriateness of Survey Material in LSAs

Cross‐linguistic, cross‐national and cross‐cultural equivalence is a fundamental requirement of LSAs in education which seek to make comparisons across many different settings. While procedures for the translation, adaptation, verification and finalisation of survey materials  –  also called 'localisation' – will not completely prevent language or culturally induced bias, they aim to minimise the possibility of them occurring. In this chapter, Steve Dept, Andrea Ferrari and Béatrice Halleux discuss the strengths and weaknesses of various approaches to the localisation of materials in different LSAs and single out practices that are more likely than others to yield satisfactory outcomes.


Chapter 7 – Quality Assurance

Quality assurance measures cover all aspects from test development to database production, as John Cresswell explains in this chapter. To ensure comparability of the results across students, schools and countries, much work has gone into standardising cross‐national assessments. The term 'standardised', in this context, refers not only to the scaling and scoring of the tests but also to the consistency in the design, content and administration of the tests (deLandshere, 1997). This extent of standardisation is illustrated by the PISA technical standards, which for the administration in 2012 (NPM(1003)9a) covered three broad sets of standards: one concerning data, the second management and the third national involvement. Data standards covered target population and sampling, language of testing, field trial participation, adaptation and translation of tests, implementation of national options, quality monitoring, printing, response coding and data submission. Management standards covered communication, notification of international and national options, schedule for material submission, drawing of samples, data management and archiving of materials. National standards covered feedback regarding appropriate mechanisms for promoting school participation and dissemination of results among all national stakeholders.

Chapter 8 – Processing Responses to Open‐ended Survey Questions

In this chapter, Ross Turner discusses the challenges associated with the consistent assessment of responses that students generate when answering questions other than multiple‐choice items. The methods described take into account the increased difficulty of this task when carried out in an international setting. Examples are given of the detailed sets of guidelines which are needed to code the responses and the processes involved in developing and implementing these guidelines.

Chapter 9 – Computer‐based Delivery of Cognitive Assessment and Questionnaires

As digital technologies have advanced in the twenty‐first century, the demand for using these technologies in large‐scale educational assessment has increased. Maurice Walker focuses in this chapter on the substantive and logistical rationales for adopting or incorporating a computer‐based approach to student assessment. He outlines assessment architecture and important item design options with the view that well‐planned computer‐based assessment (CBA) should be a coherent, accessible, stimulating and intuitive experience for the test taker. Throughout the chapter, examples illustrate the differing degrees of diffusion of digital infrastructure into the schools of countries that participate in LSAs. The chapter also discusses the impact of these infrastructure issues on the choices of whether and how to undertake CBAs.

Chapter 10 – Data Management Procedures

Falk Brese and Mark Cockle discuss in this chapter the data management procedures needed to minimise error that might be introduced by any of the processes involved in converting responses from students, teachers, parents and school principals to electronic data. The chapter presents the various aspects of data management in international LSAs that need to be taken into account to meet this goal.

Chapter 11 – Test Implementation in the Field: The Case of PASEC

Oswald Koussihouèdé describes the implementation of one of the regional assessments – PASEC – which is undertaken in francophone countries in Africa and Asia. He describes the significant changes which have recently been made to this assessment programme in an attempt to better describe the strengths and weaknesses of the student populations of the participating countries and to ensure that the assessment is being implemented using the latest methodology.

Chapter 12 – Test Implementation in the Field: The Experience of Chile in International LSAs

Chile has participated in international LSAs undertaken by the IEA, OECD and UNESCO since 1998. Ema Lagos first explains the context in which these assessments have occurred, both in terms of the education system as well as political circumstances. She then provides a comprehensive picture of all the tasks that need to be undertaken by a participating country, from input into instrument and item development, sampling, the preparation of test materials and manuals and the conduct of field operations to the coding, entry, management and analysis of data and the reporting of results.

Chapter 13 – Why LSAs Use Scaling and Item Response Theory (IRT)

As raw scores obtained from the instruments used in assessments are not amenable to statistical analysis or to the provision of valid and reliable comparisons across students, schools, states or countries and over time, most LSAs use item response models in the scaling of cognitive data. Raymond Adams and Alla Berezner describe and illustrate three reasons for using IRT in this chapter: IRT models (i) support the process of test development and construct validation, (ii) facilitate the use of tests consisting of a number of rotated test forms within one assessment to increase content coverage and (iii) enable the maintenance of scales that are comparable across countries and over time when used in conjunction with multiple imputation methodology.

Chapter 14 – Describing Learning Growth

To enhance the utility of the scales used to report the results of learning outcomes in LSAs, it has become common practice to attach substantive descriptions to the scale scores. These descriptions typically emerge from one of two main approaches. One is a strictly criterion‐based approach which identifies what students in a particular population are expected to know and be able to do at various points along the proficiency scale. The other is to describe observed growth in proficiency in the population of interest without implying particular desired performance expectations. In this chapter, Ross Turner and Raymond Adams introduce some of the methodologies that are used to implement these two broad approaches. They also discuss some of the issues surrounding scale construction and provide examples which illustrate how these descriptions are used to convey information about learning growth.

Chapter 15 – Scaling of Questionnaire Data in International LSAs

Validity and precision of measures are essential not only for the performance tests in LSAs but also for the information obtained from context questionnaires about factors that are considered to be linked to performance. Wolfram Schulz describes in this chapter the different methodologies available for reviewing item dimensionality, cross‐national measurement equivalence and the scaling of context questionnaire data, and the extent to which they have been used across international studies. As home background has been found to be related to student learning outcomes in various ways, special attention is given to a discussion of different ways of obtaining home‐ or family‐related indicators to measure students' socio‐economic background. He also summarises the different approaches to scaling questionnaire data across international studies and discusses future perspectives for the scaling of context questionnaire data.

Chapter 16 – Database Production for Large‐Scale Educational Assessments

One of the deliverables in LSAs is a database which is frequently made publicly accessible for further research and analyses. In this chapter, Alla Berezner and Eveline Gebhardt describe procedures and issues related to the construction of such a database, which is challenging given the complex sampling procedures and rotated booklet design used to collect data from thousands of students, their parents, teachers and schools. The issues discussed relate not only to the database itself but also to its documentation, for example, through compendia which include a set of tables showing descriptive statistics for every item in the cognitive tests and the questionnaires or through encyclopedias which provide information about the participating education systems. Finally, the chapter discusses the need for databases to be accompanied by user guides aimed at providing information for the appropriate use of the data in subsequent analyses, which, in turn, is often supported by hands‐on training of analysts who use these databases.

Chapter 17 – Dissemination and Reporting

In this chapter, John Cresswell emphasises that dissemination and reporting are an essential part of the assessment process, providing the information necessary for policymakers to make informed decisions that bring about improved student learning outcomes in their countries. He also points out many ways of reporting on an assessment which go beyond a detailed written report of the results. Publication of assessment frameworks, sample items and questionnaires is shown to provide valuable information about assessments, as is the provision of a high‐quality database that can be used for further analyses to enable evidence‐based decision‐making by education practitioners and policymakers.

1.5 Summary and Outlook

The agreed importance and value of LSAs is demonstrated by the rise in their popularity over the past 60 years. LSAs have matured from being innovative research activities implemented by far‐sighted researchers who were university based and resource limited into technically sophisticated education monitoring activities that form part of the routine processes of education policy and practice in many parts of the world.

As illustrated in this volume, LSAs have motivated, or indeed themselves undertaken, important methodological advances in areas such as scaling, sampling, instrument development and validation, and translation, as well as technical and quality assurance standards.

They have promoted the development of networks of researchers and the sharing of information between countries. They have (i) encouraged countries to monitor their education systems; (ii) supported international and national endeavours to objectively benchmark, monitor and evaluate educational outcomes in terms of both quality and equity; and (iii) allowed national expectations for educational inputs, processes and outcomes to be reviewed from an understanding of what is being expected and achieved elsewhere.

There are also important future roles and challenges for LSAs. To date, the majority of LSAs of the type discussed in this volume have been undertaken in and by high‐income economies, with some participation by upper‐middle‐income countries and limited participation by lower‐middle‐income and low‐income countries. It is our expectation, however, that this will soon change in response to the emphasis of the United Nations' Sustainable Development Goals (SDGs) on the improvement of educational outcomes. The SDGs have placed quality at the heart of the global education development agenda for the coming 15 years but, not unexpectedly, they stop short of providing a definition of 'quality'.

An important challenge for LSAs in the future will be to support the development of a common understanding of what is meant by quality within the context of the SDGs. While a single international study is unlikely  –  and probably not a good idea given the different contexts and resulting objectives for LSAs (see Chapter 2) – there would appear to be considerable merit in LSAs building on their methods for scale development (see Chapters 13, 14 and 15) to prepare a reporting framework that could be used internationally for monitoring progress towards the SDGs.

With an expected increase in participation in LSAs by lower‐middle‐income and low‐income countries, the need will grow for the development of instrumentation and delivery mechanisms that are applicable in more diverse contexts. This, in turn, will have an impact on the kinds of skills that are assessed, the levels at which the assessments are targeted and the focus and nature of contextual information.

The increase in the breadth of countries taking up LSAs is accompanied by a clear interest in adding diversity to the content assessed in LSAs. While LSAs have already addressed content as diverse as collaborative problem solving, writing, ICT literacy as well as civics and citizenship, it has been reading, mathematics and, to a lesser extent, science that have been at the core of the largest LSAs. Increasingly, however, a wider array of general capabilities such as innovation, critical thinking, analytic reasoning, creativity or social and emotional skills are being identified as the desirable target outcomes of a comprehensive curriculum which has the individual child and their preparedness for lifelong learning at the centre. The improved definition and assessment of these concepts and outcomes is a core challenge for LSAs of the future. The inclusion of more diverse assessment domains will also ensure that LSAs remain relevant to the full array of desired educational outcomes and are not seen to have a narrowing influence on curriculum but instead represent tools for curriculum innovation and reform.

A final challenge for LSAs is responding to the wide recognition that truly longitudinal designs – where the same individuals and the ways in which their learning is facilitated are assessed over time  –  would improve our capacity to examine the impact of factors such as teaching and school leadership practices and the availability of instructional resources on educational outcomes.

In summary, much has been achieved by LSAs. The world of education has benefited from their existence and from the foresight of their initiators. Their work, however, is far from done. The recognition of the importance of learning for all continues to gain momentum, and education systems throughout the world are searching for ways to identify effective mechanisms for ensuring breadth and depth of learning for all. The chapters in this book provide readers with a guide to the essential elements for the implementation of LSAs that are of high quality and relevance. As such, this book is a contribution to LSAs having, and continuing to play, a leading role in the endeavour to provide high‐quality learning for all.

Appendix 1.A

International and Regional Assessments: Initial Studies and Latest Rounds of Assessments

Initial studies: IEA pilot study; FIMS; Six‐subject survey.
Latest round of assessments: PASEC 2015; PIRLS 2016; SACMEQ III 2011; TERCE 2013; TIMSS 2015; PISA 2015.
A '1' indicates a country's participation in the corresponding assessment.

Asia‐Pacific
Australia 1 1 1 1 1
China–Hong Kong SAR 1 1 1
China–Macao SAR 1 1
China (four provinces) 1
Chinese Taipei 1 1 1
India 1
Indonesia 1 1


Japan 1 1 1 1
Republic of Korea 1 1
Malaysia 1 1
New Zealand 1 1 1 1
Singapore 1 1 1
Thailand 1 1 1
Vietnam 1

Europe
Albania 1
Armenia 1
Austria 1 1
Azerbaijan 1
Belgium 1 1 1 1 1 1
Bulgaria 1 1 1
Croatia 1 1
Cyprus 1 1
Czech Republic 1 1 1
Denmark 1 1 1
England 1 1 1 1 1
Estonia 1
Finland 1 1 1 1 1 1
France 1 1 1 1 1 1
Georgia 1 1
Germany 1 1 1 1 1 1
Greece 1
Hungary 1 1 1
Iceland 1
Ireland 1 1 1
Israel 1 1 1 1 1
Italy 1 1 1 1
Republic of Kazakhstan 1 1 1
Kosovo 1
Latvia 1 1
Lithuania 1 1 1
Luxembourg 1
Macedonia
Malta 1 1
Moldova 1


Republic of Montenegro 1
Netherlands 1 1 1 1 1
Northern Ireland 1 1
Norway 1 1 1
Poland 1 1 1 1 1
Portugal 1 1 1
Romania 1
Russian Federation 1 1 1
Scotland 1 1 1
Republic of Serbia 1
Slovak Republic 1 1 1
Slovenia 1 1 1
Spain 1 1 1
Sweden 1 1 1 1 1 1
Switzerland 1 1
Turkey 1 1
Ukraine
Wales 1
Yugoslavia 1

Latin America and Caribbean
Argentina 1 1 1 1
Brazil 1 1
Chile 1 1 1 1 1
Colombia 1 1
Costa Rica 1 1
Dominican Republic 1 1
Ecuador 1
Guatemala 1
Honduras 1
Nicaragua 1
Panama 1
Paraguay 1
Peru 1 1
Trinidad and Tobago 1


Uruguay 1 1

Middle East and North Africa
Algeria 1
Bahrain 1 1
Egypt 1 1
Iran 1 1 1
Jordan 1 1
Kuwait 1 1
Lebanon 1 1
Morocco 1 1
Oman 1 1
Palestine 1
Qatar 1 1 1
Saudi Arabia 1 1
Syria
Tunisia 1
United Arab Emirates 1 1 1
Yemen

North America
Canada 1 1 1
Mexico 1 1
United States 1 1 1 1 1 1

Sub‐Saharan Africa
Benin 1
Botswana 1 1 1
Burkina Faso 1
Burundi 1
Cameroon (Francophone and Anglophone) 1
Chad 1
Congo 1
Ghana
Ivory Coast 1
Kenya 1
Lesotho 1
Madagascar 1


Malawi 1
Mauritius 1
Mozambique 1
Namibia 1
Niger 1
Senegal 1
Seychelles 1
South Africa 1 1 1
Swaziland 1
Tanzania (Mainland) 1
Tanzania (Zanzibar) 1
Togo 1
Uganda 1
Zambia 1
Zimbabwe 1
Total 12 12 18 11 51 14 15 59 75

References

Anderson, J. O., Chiu, M. H. & Yore, L. D. (2010). First cycle of PISA (2000–2006) – International perspectives on successes and challenges: Research and policy directions. International Journal of Science and Mathematics Education, 8(3), 373–388.

Baker, D. P. & LeTendre, G. K. (2005). National Differences, Global Similarities: World Culture and the Future of Schooling. Stanford University Press, Stanford, CA.

Benavot, A. & Tanner, E. (2007). The growth of national learning assessments in the world, 1995–2006. Paper commissioned for the EFA Global Monitoring Report 2008, Education for All by 2015: Will We Make It? (pp. 1–17). UNESCO, Paris.

Best, M., Knight, P., Lietz, P., Lockwood, C., Nugroho D. & Tobin, M. (2013). The impact of national and international assessment programmes on education policy, particularly policies regarding resource allocation and teaching and learning practices in developing countries. Final report. EPPI‐Centre, Social Science Research Unit, Institute of Education, University of London, London. Available at: http://eppi.ioe.ac.uk/cms/Default.aspx?tabid=3418 (accessed 15 July 2016).


Bourke, S. F., Mills, J. M., Stanyon, J. & Holzer, F. (1981). Performance in Literacy and Numeracy, 1980. AGPS, Canberra.

Bracey, G. W. (2005). Put out over PISA. Phi Delta Kappan Magazine, 86(10), 797–798.

Braun, H., Kanjee, A. & Bettinger, E. (2006). Improving Education, Through Assessment, Innovation, and Evaluation. American Academy of Arts and Sciences, Cambridge, MA.

Breakspear, S. (2012). The policy impact of PISA. OECD Education Working Paper 71. OECD Publishing, Paris.

Bridgman, P. & Davis, G. (2004). The Australian Policy Handbook. Allen & Unwin, Crows Nest.

Else‐Quest, N. M., Hyde, J. S. & Linn, M. C. (2010). Cross‐national patterns of gender differences in mathematics: A meta‐analysis. Psychological Bulletin, 136(1), 103–127.

Ferrari, J. (2012, September 3). Labor's 'top five' goals for schools. The Australian. Available at: http://www.theaustralian.com.au/national‐affairs/education/labors‐top‐five‐goal‐for‐schools/story‐fn59nlz9‐1226463502869 (accessed 15 July 2016).

Ferrer, J. G. (2006). Educational Assessment Systems in Latin America: Current Practice and Future Challenges. Preal, Washington, DC.

Figazollo, L. (2009). Impact of PISA 2006 on the education policy debate. Available at: http://download.ei‐ie.org/docs/IRISDocuments/Research%20Website%20Documents/2009‐00036‐01‐E.pdf (accessed 15 July 2016).

Foshay, A. W., Thorndike, R. L., Hotyat, F., Pidgeon, D. A. & Walker, D. A. (1962). Educational Achievements of Thirteen‐Year‐Olds in Twelve Countries: Results of an International Research Project, 1959–1961. UNESCO Institute for Education, Hamburg.

Gilmore, A. (2005). The Impact of PIRLS (2001) and TIMSS (2003) in Low and Middle‐Income Countries: An Evaluation of the Value of World Bank Support for International Surveys of Reading Literacy (PIRLS) and Mathematics and Science (TIMSS). International Association for the Evaluation of Educational Achievement (IEA), Amsterdam.

Goldstein, H. & Thomas, S. M. (2008). Reflections on the international comparative surveys debate. Assessment in Education: Principles, Policy and Practice, 15(3), 215–222.

Grek, S. (2009). Governing by numbers: The PISA ‘effect’ in Europe. Journal of Education Policy, 24(1), 23–37.

Haddad, W. D. & Demsky, T. (1995). Education Policy‐Planning Process: An Applied Framework. Fundamentals of Educational Planning (Vol. 51). United Nations Educational, Scientific, and Cultural Organization, International Institute for Educational Planning, Paris.

Hansen, K. Y., Gustafsson, J. E. & Rosén, M. (2014). Northern Lights on TIMSS and PIRLS 2011: Differences and Similarities in the Nordic Countries. Norden, Norway.

Hanushek, E. A. & Woessmann, L. (2010). The High Cost of Low Educational Performance: The Long‐Run Economic Impact of Improving PISA Outcomes. OECD Publishing, Paris.

Howie, S. & Plomp, T. (2006). Contexts of Learning Mathematics and Science: Lessons Learned from TIMSS. Routledge, London/New York.


Husén, T. (Ed.) (1967). International Study of Achievement in Mathematics: A Comparison of Twelve Countries (Vols. 1–2). Almqvist & Wiksell, Stockholm.

Jones, L. & Olkin, I. (Eds.) (2004). The Nation’s Report Card: Evolution and Perspectives. Phi Delta Kappan, Bloomington, IN.

Kamens, D. H. & Benavot, A. (2011). National, regional and international learning assessments: Trends among developing countries, 1960–2009. Globalisation, Societies and Education, 9(2), 285–300.

Keeves, J. P. & Bourke, S. F. (1976). Australian Studies in School Performance. Volume I: Literacy and Numeracy in Australian Schools: A First Report. ERDC Report No. 8. AGPS, Canberra.

Kellaghan, T., Greaney, V. & Murray, T. S. (2009). Using the Results of a National Assessment of Educational Achievement. National Assessments of Educational Achievement (Vol. 5). World Bank, Washington, DC.

deLandshere, G. (1997). History of educational research. In: Keeves, J. P. (Ed.) Educational Research, Methodology and Measurement: An International Handbook, 2nd edn (pp. 8–16). Pergamon Press, Oxford.

Leung, K. S. F. (2006). Mathematics education in East Asia and the West: Does culture matter? In: Mathematics Education in Different Cultural Traditions – A Comparative Study of East Asia and the West (pp. 21–46). Springer, New York.

Lietz, P. (2006). Issues in the change in gender differences in reading achievement in cross‐national research studies since 1992: A meta‐analytic view. International Education Journal, 7(2), 127–149.

McKinsey & Company (2010). How the world's most improved school systems keep getting better. Available at: http://mckinseyonsociety.com/how‐the‐worlds‐most‐improved‐school‐systems‐keep‐getting‐better/ (accessed 15 July 2016).

Minkov, M. (2011). Cultural Differences in a Globalizing World. Emerald Group Publishing, Bingley.

NPM(1003)9a. (2010). PISA 2012 Technical Standards. Paper presented at the PISA 2012 National Project Manager Meeting, Hong Kong, March 2010.

Owens, T. L. (2013). Thinking beyond league tables: A review of key PISA research questions. In: Meyer H.‐D. & Benavot A. (Eds.) PISA, Power, and Policy: The Emergence of Global Educational Governance (pp. 27–49). Oxford Studies in Comparative Education, Southampton/Oxford. Available at: http://www.academia.edu/3707750/Thinking_Beyond_League_Tables_a_review_of_key_PISA_research_questions (accessed 15 July 2016).

Peaker, G. F. (1967). The regression analyses of the national survey. In: Central Advisory Council for Education (England) Children and their Primary Schools (Plowden Report) (Vol. 2, Appendix IV, pp. 179–221). HMSO, London.

Peaker, G. F. (1971). The Plowden Children Four Years Later. NFER, London.

Postlethwaite, T. N. (1967). School Organization and Student Achievement: A Study Based on Achievement in Mathematics in Twelve Countries. Almqvist & Wiksell, Stockholm.


Postlethwaite, T. N. & Kellaghan, T. (2008). National Assessments of Education Achievement. Jointly published by IIEP, Paris, France and IAE, Brussels, Belgium. Available at: http://unesdoc.unesco.org/images/0018/001817/181753e.pdf (accessed 15 July 2016).

Rutkowski, L., von Davier, M. & Rutkowski, D. (Eds.) (2013). Handbook of International Large‐Scale Assessment: Background, Technical Issues, and Methods of Data Analysis. CRC Press, Boca Raton, FL.

Stankov, L. (2010). Unforgiving Confucian culture: A breeding ground for high academic achievement, test anxiety and self‐doubt? Learning and Individual Differences, 20(6), 555–563.

Sutcliffe, S. & Court, J. (2005). Evidence‐Based Policymaking: What Is It? How Does It Work? What Relevance for Developing Countries? Overseas Development Institute, London.

Tobin, M., Lietz, P., Nugroho, D., Vivekanandan, R. & Nyamkhuu, T. (2015). Using Large‐Scale Assessments of Students’ Learning to Inform Education Policy: Insights from the Asia‐Pacific Region. ACER/UNESCO, Melbourne/Bangkok.

Tyler, R. W. (1985). National assessment of educational progress (NAEP). In: Husén, T. & Postlethwaite, T. N. (Eds.) International Encyclopedia of Education. Pergamon Press, Oxford.

UNESCO. (2013). The use of student assessment for policy and learning improvement. Education Policy and Reform Unit (EPR), Education Policy Research Working Document No. UNESCO Bangkok, Bangkok, Thailand. Available at: http://unesdoc.unesco.org/images/0022/002206/220693e.pdf (accessed 15 July 2016).

Williams, T., Batten, M., Girling‐Butcher, S. & Clancy, J. (1980). School and Work in Prospect: 14‐Year‐Olds in Australia. ACER Research Monograph (Vol. 10). Australian Council for Educational Research, Hawthorn. Available at: https://archive.org/stream/ERIC_ED198302#page/n0/mode/2up (accessed 15 July 2016).

Wiseman, A. W. (2010). The uses of evidence for educational policymaking: Global contexts and international trends. In: Luke, A., Kelly, G. J. & Green, J. (Eds.) Review of Research in Education (Vol. 34, pp. 1–24). American Educational Research Association, Washington, DC.
