Transcript of Language Testing (Leeds)
Language Testing
Liu Jianda
Syllabus
It is expected that, by the end of this module, participants should be able to do the following:
Understand the general considerations that must be addressed in the development of new tests or the selection of existing language tests;
Make their own judgements and decisions about either selecting an existing language test or developing a new language test;
Familiarise themselves with the fundamental issues, approaches, and methods used in measurement and evaluation;
Design, develop, evaluate and use language tests in ways that are appropriate for a given purpose, context, and group of test takers;
Understand the future development of language testing and the application of IT to computerized language testing.
In order to achieve these objectives, the module gives participants the opportunity to develop the following skills:
writing test items
collecting test data and conducting item analysis
evaluating language tests with regard to validity and reliability
This is done by considering a wide range of issues and topics related to language testing. These include the following:
General concepts in language testing and evaluation
Evaluation of a language test: reliability and validity
Communicative approach to language testing
Design of a language test
Item writing and item analysis
Interpreting test results
Item response theory and its applications
Computerized language testing and its future development
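Two of the topics listed here, item analysis and reliability evaluation, rest on simple statistics. A minimal sketch of the two basic item-analysis indices, item facility and item discrimination, using an invented 0/1 response matrix (rows are test takers, columns are items):

```python
# Minimal item analysis on a small 0/1 response matrix
# (rows = test takers, columns = items); the data are invented.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]

n_persons = len(responses)
n_items = len(responses[0])
totals = [sum(row) for row in responses]

# Item facility: proportion of test takers answering the item correctly.
facility = [sum(row[i] for row in responses) / n_persons for i in range(n_items)]

# Item discrimination: correlation between an item's scores and the
# total scores (point-biserial), via a plain Pearson formula.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

discrimination = [
    pearson([row[i] for row in responses], totals) for i in range(n_items)
]
```

A very easy or very hard item (facility near 1 or 0) or an item with low or negative discrimination would be flagged for revision.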
Class Schedule
1 Basic concepts in language testing
2 Test validation: reliability and validity (1)
3 Test validation: reliability and validity (2)
4 Test construction (1)
5 Test construction (2)
6 Test construction (3)
7 Test construction (4)
8 Test construction (5)
9 Test construction (6)
10 Rasch analysis (1)
11 Rasch analysis (2)
12 Language testing and modern technology
Assessment
One 5,000-6,000 word paper on language testing
Collaborative work: You will be divided into groups of four to complete the development of a test paper. Each of you will be responsible for one part of the test paper, but each part should contribute equally to the whole. Therefore, besides developing your own part, you need to come together to discuss the whole test paper in terms of reliability and validity.
Course books
Bachman, L. F. & Palmer, A. (1996). Language Testing in Practice. Oxford: Oxford University Press.
Brown, J. D. (1996). Testing in Language Programs. Upper Saddle River, NJ: Prentice Hall Regents.
Li, X. (1997). The Science and Art of Language Testing. Changsha: Hunan Educational Press.
McNamara, T. (1996). Measuring Second Language Performance. London; New York: Longman.
Website: http://www.clal.org.cn/personal/testing/Leeds
Session 1
Basic concepts in language testing
1. A short history of language testing
Spolsky (1978) classified the development of language testing into three periods, or trends:
the prescientific period
the psychometric/structuralist period
the integrative/sociolinguistic period
The prescientific period
grammar-translation approaches to language teaching
translation and free composition tests
difficult to score objectively
no statistical techniques applied to validate the tests
simple, but unfair to students
The psychometric-structuralist period
audio-lingual and related teaching methods
objectivity, reliability, and validity of tests considered
measure discrete structure points
multiple-choice format (standardized tests)
follow scientific principles; have trained linguists and language testers
The integrative-sociolinguistic period
communicative competence
Chomsky's (1965) distinction of competence and performance. Competence: an ideal speaker-listener's knowledge of the rules of the language; performance: the actual use of language in concrete situations.
Hymes's (1972) proposal of communicative competence: the ability of native speakers to use their language in ways that are not only linguistically accurate but also socially appropriate.
Canale & Swain's (1980) framework of communicative competence:
Grammatical competence: mastery of the language code, such as morphology, lexis, syntax, semantics, phonology;
Sociolinguistic competence: mastery of appropriate language use in different sociolinguistic contexts;
Discourse competence: mastery of how to achieve coherence and cohesion in spoken and written communication;
Strategic competence: mastery of communication strategies used to compensate for breakdowns in communication and to enhance the effectiveness of communication.
Bachman's (1990) framework of communicative language ability:
Language competence (subsuming Canale & Swain's grammatical, sociolinguistic, and discourse competence):
organizational competence: grammatical competence, textual competence
pragmatic competence: illocutionary competence, sociolinguistic competence
Strategic competence: performs assessment, planning, and execution functions in determining the most effective means of achieving a communicative goal
Psychophysiological mechanisms: characterize the channel (auditory, visual) and mode (receptive, productive)
Oller's (1979) pragmatic proficiency test:
temporally and sequentially consistent with the real-world occurrences of language forms
linking to a meaningful extralinguistic context familiar to the testees
Clark's (1978) direct assessment: approximating to the greatest extent the testing context to the real world
Cloze test and dictation (Yang, 2002b)
Communicative testing, or testing communicatively
Performance tests (Brown, Hudson, Norris, & Bonk, 2002; Norris, 1998):
not discrete-point in nature
integrating two or more of the language skills of listening, speaking, reading, and writing, and other aspects like cohesion and coherence, suprasegmentals, paralinguistics, kinesics, pragmatics, and culture
task-based: essays, interviews, extensive reading tasks
Performance Tests
Three characteristics:
The task should:
be based on needs analysis (What criteria should be used? What content and context? How should experts be used?)
be as authentic as possible, with the goal of measuring real-world activities
sometimes have collaborative elements that stimulate communicative interactions
be contextualized and complex
integrate skills with content
be appropriate in terms of number, timing, and frequency of assessment
be generally non-intrusive, that is, aligned with the daily actions in the language classroom
Raters should be appropriate in terms of:
number of raters
overall expertise
familiarity and training in use of the scale
The rating scale should be based on appropriate:
categories of language learning and development
breadth of information regarding learner performance abilities
standards that are both authentic and clear to students
To enhance the reliability and validity of decisions as well as accountability, performance assessments should be combined with other methods for gathering information (e.g. self-assessments, portfolios, conferences, classroom behaviors, and so forth).
Development graph (Li, 1997: 5)
2. Theoretical issues
Language testing is concerned with both content and methodology.
Development since 1990
Communicative language testing (Weir, 1990)
Reliability and validity
Social functions of language testing
Ethical language testing
Washback (impact) (Qi, 2002; Wall, 1997)
impact: effects of tests on individuals, policies or practices within the classroom, the school, the educational system or society as a whole
washback: effects of tests on language teaching and learning
Ways of investigating washback:
analyses of test results
teachers' and students' accounts of what takes place in the classroom (questionnaires and interviews)
classroom observation
Ethics of test use: use with care (Spolsky, 1981: 20); codes of practice
Professionalization of the field: training of professionals; development of standards of practice and mechanisms for their implementation and enforcement
Critical language testing: situating language testing in society
Factors affecting the performance of examinees (all contributing to the TEST SCORE):
Communicative language ability
Test method facets
Personal attributes
Random factors
Development since 1990
Testing interlanguage pragmatic knowledge
currently at the research level
focus on method validation
web-based test by Roever
Computerized language testing
Item banking
Computer-assisted language testing
Computerized adaptive language testing:
test items adapted for individuals
test ends when the examinee's ability is determined
test time much shorter
Web-based testing
PhonePass testing
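The adaptive-testing idea, selecting each item near the examinee's current ability estimate and stopping once the estimate settles, can be illustrated with a toy loop (the item bank, the deterministic simulated examinee, and the halving-step update are all invented for illustration, not a production CAT algorithm):

```python
# Toy computerized adaptive test under the Rasch model.
bank = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]  # item difficulties
true_theta = 0.7

def answers_correctly(b):
    # Deterministic simulated examinee: answers correctly iff the item
    # is not harder than his or her true ability.
    return b <= true_theta

theta = 0.0        # start the ability estimate at the mean
step = 1.0         # step halves after each item (binary-search style)
administered = []

for _ in range(6):
    # Select the unused item whose difficulty is closest to the current
    # estimate -- where Rasch item information p*(1-p) is highest.
    b = min((d for d in bank if d not in administered),
            key=lambda d: abs(d - theta))
    administered.append(b)
    theta += step if answers_correctly(b) else -step
    step /= 2.0
```

After six items the estimate has already homed in near the true ability, which is why adaptive tests can be much shorter than fixed-form tests.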
Development since 1990
Language testing and second language acquisition (Bachman & Cohen, 1998):
help to define the construct of language ability
use findings of language testing to test hypotheses in SLA
provide SLA researchers with testing methods and standards of testing
Development of research methodology
Factor analysis
The main applications of factor analytic techniques are: (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, that is, to classify variables. Therefore, factor analysis is applied as a data reduction or structure detection method.
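The data-reduction idea can be sketched by eigen-decomposing a correlation matrix (here an invented one in which variables 1-2 and variables 3-4 form two clusters) and keeping only the large eigenvalues; this is the principal-components view of the problem, a common starting point for factor extraction rather than a full factor analysis:

```python
import numpy as np

# Invented correlation matrix: two clusters of variables.
R = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
])

# Eigen-decompose and sort eigenvalues from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Kaiser criterion: retain factors with eigenvalue > 1.
n_factors = int(np.sum(eigenvalues > 1.0))
```

With this matrix two eigenvalues exceed 1, so four observed variables reduce to two underlying factors, matching the two clusters built into the data.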
Generalizability theory (Bachman, 1997; Bachman, Lynch, & Mason, 1995)
Estimating the relative effects of different factors on test scores (facets)
The most generalizable indicator of an individual's language ability is the universe score. However, in the real world we can only obtain scores from a limited sample of measures, so we need to estimate the dependability of a given observed score as an estimate of the universe score.
Two stages are involved in applying G-theory to test development
G-study
The purpose is to estimate the effects of the various facets in the measurement procedure (usually conducted in pretesting).
Main effects, e.g. persons (differences in individuals' speaking ability), raters (differences in severity among raters), tasks (differences in difficulty of tasks).
Two-way interactions:
task x rater: different raters are rating the different tasks differently
person x task: some tasks are differentially difficult for different groups of test takers (source of bias)
person x rater: some raters score the performance of different groups of test takers differently (indication of rater bias)
Two stages are involved in applying G-theory to test development
D-study
The purpose is to design an optimal measure for the interpretations or decisions that are to be made on the basis of the test scores (estimation of dependability).
The generalizability coefficient (G coefficient) provides an estimate of the proportion of an individual's observed score that can be attributed to his or her universe score, taking into consideration the effects of the different conditions of measurement specified in the universe of generalization. It is appropriate for norm-referenced tests.
For criterion-referenced tests, use the phi coefficient (computed with GENOVA).
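For a simple fully crossed persons-x-raters design, the variance components and both coefficients can be computed by hand from the two-way ANOVA mean squares. A sketch with invented scores (5 persons rated by 3 raters), not the GENOVA procedure itself:

```python
import numpy as np

# Invented scores: rows = persons, columns = raters.
scores = np.array([
    [4.0, 5.0, 4.0],
    [2.0, 3.0, 2.0],
    [5.0, 5.0, 5.0],
    [3.0, 3.0, 2.0],
    [4.0, 4.0, 3.0],
])
n_p, n_r = scores.shape

grand = scores.mean()
person_means = scores.mean(axis=1)
rater_means = scores.mean(axis=0)

# Sums of squares for the two-way design (no replication).
ss_p = n_r * ((person_means - grand) ** 2).sum()
ss_r = n_p * ((rater_means - grand) ** 2).sum()
ss_pr = ((scores - grand) ** 2).sum() - ss_p - ss_r   # residual/interaction

ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

# Variance components from the expected mean squares.
var_p = max((ms_p - ms_pr) / n_r, 0.0)   # person (universe score) variance
var_r = max((ms_r - ms_pr) / n_p, 0.0)   # rater severity variance
var_pr = ms_pr                           # interaction/error

# Relative (norm-referenced) G coefficient and absolute
# (criterion-referenced) phi coefficient for an average over n_r raters.
g_coef = var_p / (var_p + var_pr / n_r)
phi_coef = var_p / (var_p + (var_r + var_pr) / n_r)
```

Phi is never larger than G because it also counts rater severity as error, which is exactly why it suits absolute, criterion-referenced decisions.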
Item response theory (Rasch model)
It enables us to estimate the statistical properties of items and the abilities of test takers so that these are not dependent upon a particular group of test takers or a particular form of a test. It is widely used in large-scale standardized tests.
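The Rasch model itself is a one-line formula: the probability of a correct answer depends only on the difference between the test taker's ability (theta) and the item's difficulty (b), both on a logit scale. A minimal sketch:

```python
import math

def rasch_p(theta, b):
    # Probability of a correct response under the Rasch model:
    # P = 1 / (1 + exp(-(theta - b))).
    return 1.0 / (1.0 + math.exp(-(theta - b)))

p_equal = rasch_p(0.0, 0.0)    # ability equals difficulty
p_easy = rasch_p(1.0, -1.0)    # able person, easy item
p_hard = rasch_p(-1.0, 1.0)    # weak person, hard item
```

When ability equals difficulty the probability is exactly 0.5, and because only the difference theta - b matters, item and person parameters can be placed on one common scale regardless of which sample they were estimated from.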
Structural equation model (Antony John Kunnan, 1998)
A combination of multiple regression, path analysis and factor analysis
Attempts to explain a correlation or a covariance data matrix derived from a set of observed variables; latent variables are responsible for the covariance among the measured variables.
Basic procedures in SEM (example from Purpura, 1998):
Examine the relationships between strategy use and second language test performance.
Design two questionnaires for cognitive strategies and metacognitive strategies (40 items).
Ask respondents to answer the questionnaires.
Have respondents take a foreign language test.
Cluster the 40 items to measure several variables.
Compute the reliability of the variables.
Conduct factor analysis to identify factors.
Conduct SEM analysis (AMOS, EQS, LISREL).
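SEM software such as AMOS, EQS, or LISREL estimates the measurement and structural models simultaneously. As a rough stand-in for the logic, the two-step version (score the latent strategy factor from the questionnaire items, then regress test performance on it) can be sketched with simulated data; everything here, including the data-generating model, is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: 100 respondents; a latent "strategy use" variable
# drives three questionnaire items and also the language test score.
n = 100
strategy = rng.normal(size=n)
items = np.column_stack([strategy + 0.5 * rng.normal(size=n) for _ in range(3)])
test_score = 0.8 * strategy + 0.3 * rng.normal(size=n)

# Step 1 (measurement model stand-in): score the latent factor as the
# first principal component of the standardized items.
z = (items - items.mean(axis=0)) / items.std(axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
factor_scores = z @ vt[0]
# SVD sign is arbitrary; align the factor with the first item.
if float(z[:, 0] @ factor_scores) < 0:
    factor_scores = -factor_scores

# Step 2 (structural model stand-in): the standardized path coefficient
# from strategy use to test performance is the correlation between the
# factor scores and the test scores.
fs = (factor_scores - factor_scores.mean()) / factor_scores.std()
ts = (test_score - test_score.mean()) / test_score.std()
path = float(fs @ ts) / n
```

Because the simulated latent variable drives both the items and the test score, the recovered path coefficient comes out strongly positive, which is the kind of structural relationship Purpura's SEM analysis tests for directly.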
Qualitative method
Verbal report (think-aloud, introspective)
Observation
Questionnaires and interviews
Discourse analysis
3. Classification of language tests
According to families:
Norm-referenced tests
Criterion-referenced tests
Norm-referenced tests
Measure global language abilities (e.g. listening, reading, speaking, writing)
Score on a test is interpreted relative to the scores of all other students who took the test
Normal distribution
Normal distribution demo: http://stat-www.berkeley.edu/~stark/Java/NormHiLite.htm
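Relative interpretation means converting a raw score to a position in the score distribution. A minimal sketch, assuming an invented test with mean 500 and standard deviation 100:

```python
import math

# Norm-referenced interpretation: place a raw score relative to the
# group by converting it to a z-score, then to a percentile under the
# normal distribution (mean and SD are assumed for illustration).
mean, sd = 500.0, 100.0
raw = 650.0

z = (raw - mean) / sd                                   # 1.5 SDs above the mean
percentile = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))   # standard normal CDF
```

A score of 650 is thus not "650 points of ability" but roughly the 93rd percentile: it outranks about 93% of the norm group.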
Norm-referenced tests
Students know the format of the test but do not know what specific content or skill will be tested
A few relatively long subtests with a variety of question contents
Criterion-referenced tests
Measure well-defined and fairly specific objectives
Interpretation of scores is considered absolute, without referring to other students' scores
Distribution of scores need not be normal
Students know in advance what types of questions, tasks, and content to expect on the test
A series of short, well-defined subtests with similar question contents
According to decision purposes
Proficiency tests Placement tests Achievement tests Diagnostic tests
Proficiency tests
Test students' general levels of language proficiency
The test must provide scores that form a wide distribution, so that interpretations of the differences among students will be as fair as possible
Can dramatically affect students' lives, so slipshod decision making in this area would be particularly unprofessional
Placement tests
Group students of similar ability levels (homogeneous ability levels)
Help decide what each student’s appropriate level will be within a specific program
Right tests for right purposes
Achievement tests
About the amount of learning that students have done
The decision may involve who will advance to the next level of study or which students should graduate
Must be designed with specific reference to a particular course
Criterion-referenced, conducted at the end of the program
Used to make decisions about students' levels of learning; meanwhile, can be used to bring about curriculum changes and to test those changes continually against program realities
Diagnostic tests
Aimed at fostering achievement by promoting the strengths and eliminating the weaknesses of individual students
Require more detailed information about the very specific areas in which students have strengths and weaknesses
Criterion-referenced, conducted at the beginning or in the middle of a language course
A test can be diagnostic at the beginning or in the middle of a course but an achievement test at the end
Perhaps the most effective use of a diagnostic test is to report the performance level on each objective (as a percentage) to each student so that he or she can decide how and where to invest time and energy most profitably
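Such a per-objective report is straightforward to compute once each item is mapped to an objective. A minimal sketch with an invented item-to-objective mapping and invented 0/1 scores:

```python
# Diagnostic reporting: percentage correct on the items testing each
# objective (objectives, item mapping, and scores are invented).
objectives = {
    "verb tense":        [1, 1, 0, 1],   # 0/1 scores on that objective's items
    "article use":       [0, 1, 0, 0],
    "reading main idea": [1, 1, 1, 1],
}

report = {
    name: round(100.0 * sum(scores) / len(scores))
    for name, scores in objectives.items()
}
```

The resulting profile (e.g. 25% on article use versus 100% on reading main idea) tells the student exactly where to invest study time, which a single total score cannot do.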
Formative assessment vs. summative assessment
Formative: a judgment of an ongoing program used to provide information for program review, identification of the effectiveness of the instructional process, and the assessment of the teaching process
Summative: a terminal evaluation employed in the general assessment of the degree to which the larger outcomes have been obtained over a substantial part of or all of a course. It is used in determining whether or not the learner has achieved the ultimate objectives for instruction which were set up in advance of the instruction.
Public examinations vs. classroom tests
Purpose: proficiency vs. achievement (placement, diagnostic)
Format: standardized vs. open (objective vs. subjective)
Scale: large-scale vs. small-scale (self-assessment)
Scores: normality, backwash