Transcript of Language Testing (Leeds)
Language Testing
Liu Jianda
Syllabus
It is expected that, by the end of this module, participants should be able to do the following:
Understand the general considerations that must be addressed in the development of new tests or the selection of existing language tests;
Make their own judgements and decisions about either selecting an existing language test or developing a new language test;
Familiarise themselves with the fundamental issues, approaches, and methods used in measurement and evaluation;
Design, develop, evaluate and use language tests in ways that are appropriate for a given purpose, context, and group of test takers;
Understand the future development of language testing and the application of IT to computerized language testing.
In order to achieve these objectives, the module gives participants the opportunity to develop the following skills:
writing test items
collecting test data and conducting item analysis
evaluating language tests with regard to validity and reliability
This is done by considering a wide range of issues and topics related to language testing. These include the following:
General concepts in language testing and evaluation
Evaluation of a language test: reliability and validity
Communicative approach to language testing
Design of a language test
Item writing and item analysis
Interpreting test results
Item response theory and its applications
Computerized language testing and its future development
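Two of the topics listed here, item analysis and reliability evaluation, rest on simple statistics. A minimal sketch of the two basic item-analysis indices, item facility and item discrimination, using an invented 0/1 response matrix (rows are test takers, columns are items):

```python
# Minimal item analysis on a small 0/1 response matrix
# (rows = test takers, columns = items); the data are invented.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]

n_persons = len(responses)
n_items = len(responses[0])
totals = [sum(row) for row in responses]

# Item facility: proportion of test takers answering the item correctly.
facility = [sum(row[i] for row in responses) / n_persons for i in range(n_items)]

# Item discrimination: correlation between an item's scores and the
# total scores (point-biserial), via a plain Pearson formula.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

discrimination = [
    pearson([row[i] for row in responses], totals) for i in range(n_items)
]
```

A very easy or very hard item (facility near 1 or 0) or an item with low or negative discrimination would be flagged for revision.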
Class Schedule
1 Basic concepts in language testing
2 Test validation: reliability and validity (1)
3 Test validation: reliability and validity (2)
4 Test construction (1)
5 Test construction (2)
6 Test construction (3)
7 Test construction (4)
8 Test construction (5)
9 Test construction (6)
10 Rasch analysis (1)
11 Rasch analysis (2)
12 Language testing and modern technology
Assessment
One 5,000-6,000 word paper on language testing
Collaborative work: You will be divided into groups of four to complete the development of a test paper. Each of you will be responsible for one part of the test paper, but each part should contribute equally to the whole. Therefore, besides developing your own part, you need to come together to discuss the whole test paper in terms of reliability and validity.
Course books
Bachman, L. F. & Palmer, A. (1996). Language Testing in Practice. Oxford: Oxford University Press.
Brown, J. D. (1996). Testing in Language Programs. Upper Saddle River, NJ: Prentice Hall Regents.
Li, X. (1997). The Science and Art of Language Testing. Changsha: Hunan Educational Press.
McNamara, T. (1996). Measuring Second Language Performance. London; New York: Longman.
Website: http://www.clal.org.cn/personal/testing/Leeds
Session 1
Basic concepts in language testing
1. A short history of language testing
Spolsky (1978) classified the development of language testing into three periods, or trends:
the prescientific period
the psychometric/structuralist period
the integrative/sociolinguistic period
The prescientific period
grammar-translation approaches to language teaching
translation and free composition tests
difficult to score objectively
no statistical techniques applied to validate the tests
simple, but unfair to students
The psychometric-structuralist period
audio-lingual and related teaching methods
objectivity, reliability, and validity of tests considered
measure discrete structure points
multiple-choice format (standardized tests)
follow scientific principles; have trained linguists and language testers
The integrative-sociolinguistic period
communicative competence
Chomsky's (1965) distinction of competence and performance. Competence: an ideal speaker-listener's knowledge of the rules of the language; performance: the actual use of language in concrete situations.
Hymes's (1972) proposal of communicative competence: the ability of native speakers to use their language in ways that are not only linguistically accurate but also socially appropriate.
Canale & Swain's (1980) framework of communicative competence:
Grammatical competence: mastery of the language code, such as morphology, lexis, syntax, semantics, phonology;
Sociolinguistic competence: mastery of appropriate language use in different sociolinguistic contexts;
Discourse competence: mastery of how to achieve coherence and cohesion in spoken and written communication;
Strategic competence: mastery of communication strategies used to compensate for breakdowns in communication and to enhance the effectiveness of communication.
Bachman's (1990) framework of communicative language ability:
Language competence (subsuming Canale & Swain's grammatical, sociolinguistic, and discourse competence):
organizational competence: grammatical competence, textual competence
pragmatic competence: illocutionary competence, sociolinguistic competence
Strategic competence: performs assessment, planning, and execution functions in determining the most effective means of achieving a communicative goal
Psychophysiological mechanisms: characterize the channel (auditory, visual) and mode (receptive, productive)
Oller's (1979) pragmatic proficiency test:
temporally and sequentially consistent with the real-world occurrences of language forms
linking to a meaningful extralinguistic context familiar to the testees
Clark's (1978) direct assessment: approximating to the greatest extent the testing context to the real world
Cloze test and dictation (Yang, 2002b)
Communicative testing, or testing communicatively
Performance tests (Brown, Hudson, Norris, & Bonk, 2002; Norris, 1998):
not discrete-point in nature
integrating two or more of the language skills of listening, speaking, reading, and writing, and other aspects like cohesion and coherence, suprasegmentals, paralinguistics, kinesics, pragmatics, and culture
task-based: essays, interviews, extensive reading tasks
Performance Tests
Three characteristics:
The task should:
be based on needs analysis (What criteria should be used? What content and context? How should experts be used?)
be as authentic as possible, with the goal of measuring real-world activities
sometimes have collaborative elements that stimulate communicative interactions
be contextualized and complex
integrate skills with content
be appropriate in terms of number, timing, and frequency of assessment
be generally non-intrusive, that is, aligned with the daily actions in the language classroom
Raters should be appropriate in terms of:
number of raters
overall expertise
familiarity and training in use of the scale
The rating scale should be based on appropriate:
categories of language learning and development
breadth of information regarding learner performance abilities
standards that are both authentic and clear to students
To enhance the reliability and validity of decisions as well as accountability, performance assessments should be combined with other methods for gathering information (e.g. self-assessments, portfolios, conferences, classroom behaviors, and so forth).
Development graph (Li, 1997: 5)
2. Theoretical issues
Language testing is concerned with both content and methodology.
Development since 1990
Communicative language testing (Weir, 1990)
Reliability and validity
Social functions of language testing
Ethical language testing
Washback (impact) (Qi, 2002; Wall, 1997)
impact: effects of tests on individuals, policies or practices within the classroom, the school, the educational system or society as a whole
washback: effects of tests on language teaching and learning
Ways of investigating washback:
analyses of test results
teachers' and students' accounts of what takes place in the classroom (questionnaires and interviews)
classroom observation
Ethics of test use: use with care (Spolsky, 1981: 20); codes of practice
Professionalization of the field: training of professionals; development of standards of practice and mechanisms for their implementation and enforcement
Critical language testing: situating language testing in society
Factors affecting the performance of examinees (all contributing to the TEST SCORE):
Communicative language ability
Test method facets
Personal attributes
Random factors
Development since 1990
Testing interlanguage pragmatic knowledge
currently at the research level
focus on method validation
web-based test by Roever
Computerized language testing
Item banking
Computer-assisted language testing
Computerized adaptive language testing:
test items adapted for individuals
test ends when the examinee's ability is determined
test time much shorter
Web-based testing
PhonePass testing
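The adaptive-testing idea, selecting each item near the examinee's current ability estimate and stopping once the estimate settles, can be illustrated with a toy loop (the item bank, the deterministic simulated examinee, and the halving-step update are all invented for illustration, not a production CAT algorithm):

```python
# Toy computerized adaptive test under the Rasch model.
bank = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]  # item difficulties
true_theta = 0.7

def answers_correctly(b):
    # Deterministic simulated examinee: answers correctly iff the item
    # is not harder than his or her true ability.
    return b <= true_theta

theta = 0.0        # start the ability estimate at the mean
step = 1.0         # step halves after each item (binary-search style)
administered = []

for _ in range(6):
    # Select the unused item whose difficulty is closest to the current
    # estimate -- where Rasch item information p*(1-p) is highest.
    b = min((d for d in bank if d not in administered),
            key=lambda d: abs(d - theta))
    administered.append(b)
    theta += step if answers_correctly(b) else -step
    step /= 2.0
```

After six items the estimate has already homed in near the true ability, which is why adaptive tests can be much shorter than fixed-form tests.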
Development since 1990
Language testing and second language acquisition (Bachman & Cohen, 1998):
help to define the construct of language ability
use findings of language testing to test hypotheses in SLA
provide SLA researchers with testing methods and standards of testing
Development of research methodology
Factor analysis
The main applications of factor analytic techniques are: (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, that is, to classify variables. Therefore, factor analysis is applied as a data reduction or structure detection method.
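The data-reduction idea can be sketched by eigen-decomposing a correlation matrix (here an invented one in which variables 1-2 and variables 3-4 form two clusters) and keeping only the large eigenvalues; this is the principal-components view of the problem, a common starting point for factor extraction rather than a full factor analysis:

```python
import numpy as np

# Invented correlation matrix: two clusters of variables.
R = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
])

# Eigen-decompose and sort eigenvalues from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Kaiser criterion: retain factors with eigenvalue > 1.
n_factors = int(np.sum(eigenvalues > 1.0))
```

With this matrix two eigenvalues exceed 1, so four observed variables reduce to two underlying factors, matching the two clusters built into the data.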
Generalizability theory (Bachman, 1997; Bachman, Lynch, & Mason, 1995)
Estimating the relative effects of different factors on test scores (facets)
The most generalizable indicator of an individual's language ability is the universe score. However, in the real world we can only obtain scores from a limited sample of measures, so we need to estimate the dependability of a given observed score as an estimate of the universe score.
Two stages are involved in applying G-theory to test development
G-study
The purpose is to estimate the effects of the various facets in the measurement procedure (usually conducted in pretesting).
Main effects, e.g. persons (differences in individuals' speaking ability), raters (differences in severity among raters), tasks (differences in difficulty of tasks).
Two-way interactions:
task x rater: different raters are rating the different tasks differently
person x task: some tasks are differentially difficult for different groups of test takers (source of bias)
person x rater: some raters score the performance of different groups of test takers differently (indication of rater bias)
Two stages are involved in applying G-theory to test development
D-study
The purpose is to design an optimal measure for the interpretations or decisions that are to be made on the basis of the test scores (estimation of dependability).
The generalizability coefficient (G coefficient) provides an estimate of the proportion of an individual's observed score that can be attributed to his or her universe score, taking into consideration the effects of the different conditions of measurement specified in the universe of generalization. It is appropriate for norm-referenced tests.
For criterion-referenced tests, use the phi coefficient (computed with GENOVA).
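For a simple fully crossed persons-x-raters design, the variance components and both coefficients can be computed by hand from the two-way ANOVA mean squares. A sketch with invented scores (5 persons rated by 3 raters), not the GENOVA procedure itself:

```python
import numpy as np

# Invented scores: rows = persons, columns = raters.
scores = np.array([
    [4.0, 5.0, 4.0],
    [2.0, 3.0, 2.0],
    [5.0, 5.0, 5.0],
    [3.0, 3.0, 2.0],
    [4.0, 4.0, 3.0],
])
n_p, n_r = scores.shape

grand = scores.mean()
person_means = scores.mean(axis=1)
rater_means = scores.mean(axis=0)

# Sums of squares for the two-way design (no replication).
ss_p = n_r * ((person_means - grand) ** 2).sum()
ss_r = n_p * ((rater_means - grand) ** 2).sum()
ss_pr = ((scores - grand) ** 2).sum() - ss_p - ss_r   # residual/interaction

ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

# Variance components from the expected mean squares.
var_p = max((ms_p - ms_pr) / n_r, 0.0)   # person (universe score) variance
var_r = max((ms_r - ms_pr) / n_p, 0.0)   # rater severity variance
var_pr = ms_pr                           # interaction/error

# Relative (norm-referenced) G coefficient and absolute
# (criterion-referenced) phi coefficient for an average over n_r raters.
g_coef = var_p / (var_p + var_pr / n_r)
phi_coef = var_p / (var_p + (var_r + var_pr) / n_r)
```

Phi is never larger than G because it also counts rater severity as error, which is exactly why it suits absolute, criterion-referenced decisions.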
Item response theory (Rasch model)
It enables us to estimate the statistical properties of items and the abilities of test takers so that these are not dependent upon a particular group of test takers or a particular form of a test. It is widely used in large-scale standardized tests.
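The Rasch model itself is a one-line formula: the probability of a correct answer depends only on the difference between the test taker's ability (theta) and the item's difficulty (b), both on a logit scale. A minimal sketch:

```python
import math

def rasch_p(theta, b):
    # Probability of a correct response under the Rasch model:
    # P = 1 / (1 + exp(-(theta - b))).
    return 1.0 / (1.0 + math.exp(-(theta - b)))

p_equal = rasch_p(0.0, 0.0)    # ability equals difficulty
p_easy = rasch_p(1.0, -1.0)    # able person, easy item
p_hard = rasch_p(-1.0, 1.0)    # weak person, hard item
```

When ability equals difficulty the probability is exactly 0.5, and because only the difference theta - b matters, item and person parameters can be placed on one common scale regardless of which sample they were estimated from.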
Structural equation model (Antony John Kunnan, 1998)
A combination of multiple regression, path analysis and factor analysis
Attempts to explain a correlation or a covariance data matrix derived from a set of observed variables; latent variables are responsible for the covariance among the measured variables.
Basic procedures in SEM (example from Purpura, 1998):
Examine the relationships between strategy use and second language test performance.
Design two questionnaires for cognitive strategies and metacognitive strategies (40 items).
Ask respondents to answer the questionnaires.
Have respondents take a foreign language test.
Cluster the 40 items to measure several variables.
Compute the reliability of the variables.
Conduct factor analysis to identify factors.
Conduct SEM analysis (AMOS, EQS, LISREL).
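SEM software such as AMOS, EQS, or LISREL estimates the measurement and structural models simultaneously. As a rough stand-in for the logic, the two-step version (score the latent strategy factor from the questionnaire items, then regress test performance on it) can be sketched with simulated data; everything here, including the data-generating model, is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: 100 respondents; a latent "strategy use" variable
# drives three questionnaire items and also the language test score.
n = 100
strategy = rng.normal(size=n)
items = np.column_stack([strategy + 0.5 * rng.normal(size=n) for _ in range(3)])
test_score = 0.8 * strategy + 0.3 * rng.normal(size=n)

# Step 1 (measurement model stand-in): score the latent factor as the
# first principal component of the standardized items.
z = (items - items.mean(axis=0)) / items.std(axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
factor_scores = z @ vt[0]
# SVD sign is arbitrary; align the factor with the first item.
if float(z[:, 0] @ factor_scores) < 0:
    factor_scores = -factor_scores

# Step 2 (structural model stand-in): the standardized path coefficient
# from strategy use to test performance is the correlation between the
# factor scores and the test scores.
fs = (factor_scores - factor_scores.mean()) / factor_scores.std()
ts = (test_score - test_score.mean()) / test_score.std()
path = float(fs @ ts) / n
```

Because the simulated latent variable drives both the items and the test score, the recovered path coefficient comes out strongly positive, which is the kind of structural relationship Purpura's SEM analysis tests for directly.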
Qualitative method
Verbal report (think-aloud, introspective)
Observation
Questionnaires and interviews
Discourse analysis
3. Classification of language tests
According to families:
Norm-referenced tests
Criterion-referenced tests
Norm-referenced tests
Measure global language abilities (e.g. listening, reading, speaking, writing)
Score on a test is interpreted relative to the scores of all other students who took the test
Normal distribution
Normal distribution demo: http://stat-www.berkeley.edu/~stark/Java/NormHiLite.htm
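Relative interpretation means converting a raw score to a position in the score distribution. A minimal sketch, assuming an invented test with mean 500 and standard deviation 100:

```python
import math

# Norm-referenced interpretation: place a raw score relative to the
# group by converting it to a z-score, then to a percentile under the
# normal distribution (mean and SD are assumed for illustration).
mean, sd = 500.0, 100.0
raw = 650.0

z = (raw - mean) / sd                                   # 1.5 SDs above the mean
percentile = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))   # standard normal CDF
```

A score of 650 is thus not "650 points of ability" but roughly the 93rd percentile: it outranks about 93% of the norm group.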
Norm-referenced tests
Students know the format of the test but do not know what specific content or skill will be tested
A few relatively long subtests with a variety of question contents
Criterion-referenced tests
Measure well-defined and fairly specific objectives
Interpretation of scores is considered absolute, without referring to other students' scores
Distribution of scores need not be normal
Students know in advance what types of questions, tasks, and content to expect on the test
A series of short, well-defined subtests with similar question contents
According to decision purposes
Proficiency tests Placement tests Achievement tests Diagnostic tests
Proficiency tests
Test students' general levels of language proficiency
The test must provide scores that form a wide distribution, so that interpretations of the differences among students will be as fair as possible
Can dramatically affect students' lives, so slipshod decision making in this area would be particularly unprofessional
Placement tests
Group students of similar ability levels (homogeneous ability levels)
Help decide what each student’s appropriate level will be within a specific program
Right tests for right purposes
Achievement tests
About the amount of learning that students have done
The decision may involve who will advance to the next level of study or which students should graduate
Must be designed with specific reference to a particular course
Criterion-referenced, conducted at the end of the program
Used to make decisions about students' levels of learning; meanwhile, can be used to bring about curriculum changes and to test those changes continually against program realities
Diagnostic tests
Aimed at fostering achievement by promoting the strengths and eliminating the weaknesses of individual students
Require more detailed information about the very specific areas in which students have strengths and weaknesses
Criterion-referenced, conducted at the beginning or in the middle of a language course
A test can be diagnostic at the beginning or in the middle of a course but an achievement test at the end
Perhaps the most effective use of a diagnostic test is to report the performance level on each objective (as a percentage) to each student so that he or she can decide how and where to invest time and energy most profitably
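Such a per-objective report is straightforward to compute once each item is mapped to an objective. A minimal sketch with an invented item-to-objective mapping and invented 0/1 scores:

```python
# Diagnostic reporting: percentage correct on the items testing each
# objective (objectives, item mapping, and scores are invented).
objectives = {
    "verb tense":        [1, 1, 0, 1],   # 0/1 scores on that objective's items
    "article use":       [0, 1, 0, 0],
    "reading main idea": [1, 1, 1, 1],
}

report = {
    name: round(100.0 * sum(scores) / len(scores))
    for name, scores in objectives.items()
}
```

The resulting profile (e.g. 25% on article use versus 100% on reading main idea) tells the student exactly where to invest study time, which a single total score cannot do.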
Formative assessment vs. summative assessment
Formative: a judgment of an ongoing program used to provide information for program review, identification of the effectiveness of the instructional process, and the assessment of the teaching process
Summative: a terminal evaluation employed in the general assessment of the degree to which the larger outcomes have been obtained over a substantial part of or all of a course. It is used in determining whether or not the learner has achieved the ultimate objectives for instruction which were set up in advance of the instruction.
Public examinations vs. classroom tests
Purpose: proficiency vs. achievement (placement, diagnostic)
Format: standardized vs. open (objective vs. subjective)
Scale: large-scale vs. small-scale (self-assessment)
Scores: normality, backwash