CHAPTER 9
Assessment

9.2 The Framework as a resource for assessment

9.2.1 The specification of the content of tests and examinations

The description of ‘Language Use and the Language User’ in Chapter 4, and in particular section 4.4 on ‘Communicative Language Activities’, can be consulted when drawing up a task specification for a communicative assessment. It is increasingly recognised that valid assessment requires the sampling of a range of relevant types of discourse. For example, in relation to the testing of speaking, a recently developed test illustrates this point. First, there is a simulated Conversation which functions as a warm-up; then there is an Informal Discussion of topical issues in which the candidate declares an interest. This is followed by a Transaction phase, which takes the form either of a face-to-face or a simulated telephone information-seeking activity. This is followed by a Production phase, based upon a written Report, in which the candidate gives a Description of his/her academic field and plans. Finally there is a Goal-orientated Co-operation, a consensus task between candidates.

To summarise, the Framework categories for communicative activities employed are:

            Interaction                     Production
            (Spontaneous, short turns)      (Prepared, long turns)

Spoken:     Conversation                    Description of his/her academic field
            Informal discussion
            Goal-orientated co-operation

Written:                                    Report/Description of his/her academic field

In constructing the detail of the task specifications, the user may wish to consult section 4.1 on ‘the context of language use’ (domains, conditions and constraints, mental context), section 4.6 on ‘Texts’, and Chapter 7 on ‘Tasks and their Role in Language Teaching’, specifically section 7.3 on ‘Task difficulty’.

Section 5.2 on ‘Communicative language competences’ will inform the construction of the test items, or phases of a spoken test, in order to elicit evidence of the relevant linguistic, sociolinguistic and pragmatic competences. The set of content specifications at Threshold Level produced by the Council of Europe for over 20 European languages (see Bibliography items listed on p. 200) and at Waystage and Vantage Level for English, plus their equivalents when developed for other languages and levels, can be seen as ancillary to the main Framework document. They offer examples of a further layer of detail to inform test construction for Levels A1, A2, B1 and B2.

9.2.2 The criteria for the attainment of a learning objective

The scales provide a source for the development of rating scales for the assessment of the attainment of a particular learning objective, and the descriptors may assist in the formulation of criteria. The objective may be a broad level of general language proficiency, expressed as a Common Reference Level (e.g. B1). It may on the other hand be a specific constellation of activities, skills and competences as discussed in section 6.1.4 on ‘Partial Competences and Variation in Objectives in relation to the Framework’. Such a modular objective might be profiled on a grid of categories by levels, such as that presented in Table 2.

In discussing the use of descriptors it is essential to make a distinction between:

1. Descriptors of communicative activities, which are located in Chapter 4.
2. Descriptors of aspects of proficiency related to particular competences, which are located in Chapter 5.

The former are very suitable for teacher- or self-assessment with regard to real-world tasks. Such teacher- or self-assessments are made on the basis of a detailed picture of the learner’s language ability built up during the course concerned. They are attractive because they can help to focus both learners and teachers on an action-oriented approach.

However, it is not usually advisable to include descriptors of communicative activities in the criteria for an assessor to rate performance in a particular speaking or writing test if one is interested in reporting results in terms of a level of proficiency attained. This is because, to report on proficiency, the assessment should not be primarily concerned with any one particular performance, but should rather seek to judge the generalisable competences evidenced by that performance. There may of course be sound educational reasons for focusing on success at completing a given activity, especially with younger Basic Users (Levels A1 and A2). Such results will be less generalisable, but generalisability of results is not usually the focus of attention in the earlier stages of language learning.

This reinforces the fact that assessments can have many different functions. What is appropriate for one assessment purpose may be inappropriate for another.

9.2.2.1 Descriptors of communicative activities

Descriptors of communicative activities (Chapter 4) can be used in three separate ways in relation to the attainment of objectives.

1. Construction: As discussed in section 9.2.1 above, scales for communicative activities help in the definition of a specification for the design of assessment tasks.
2. Reporting: Scales for communicative activities can also be very useful for reporting results. Users of the products of the educational system, such as employers, are often interested in the overall outcomes rather than in a detailed profile of competence.
3. Self- or teacher-assessment: Finally, descriptors for communicative activities can be used for self- and teacher-assessment in various ways, of which the following are some examples:

• Checklist: For continuous assessment or for summative assessment at the end of a course. The descriptors at a particular level can be listed. Alternatively, the content of descriptors can be ‘exploded’. For example the descriptor Can ask for and provide personal information might be exploded into the implicit constituent parts I can introduce myself; I can say where I live; I can say my address in French; I can say how old I am, etc. and I can ask someone what their name is; I can ask someone where they live; I can ask someone how old they are, etc. (A sketch of such an exploded checklist follows at the end of this section.)

• Grid: For continuous or summative assessment, rating a profile onto a grid of selected categories (e.g. Conversation; Discussion; Exchanging Information) defined at different levels (B1+, B2, B2+).

The use of descriptors in this way has become more common in the last 10 years. Experience has shown that the consistency with which teachers and learners can interpret descriptors is enhanced if the descriptors describe not only WHAT the learner can do, but also HOW WELL they do it.
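The ‘exploded’ checklist lends itself to a simple data structure. Here is a minimal sketch in Python, using only the example descriptor quoted above; the function name and the tick-off format are invented for illustration:

    # One Chapter 4 descriptor 'exploded' into its implicit constituent
    # 'I can ...' statements, as listed in the text above.
    CHECKLIST = {
        "Can ask for and provide personal information": [
            "I can introduce myself",
            "I can say where I live",
            "I can say my address in French",
            "I can say how old I am",
            "I can ask someone what their name is",
            "I can ask someone where they live",
            "I can ask someone how old they are",
        ],
    }

    def unticked(descriptor, ticks):
        """Return the constituent statements not yet ticked off.

        ticks: the set of statements the learner or teacher has confirmed.
        """
        return [s for s in CHECKLIST[descriptor] if s not in ticks]

    # A learner who has ticked two statements still has five to work on:
    print(unticked("Can ask for and provide personal information",
                   {"I can introduce myself", "I can say where I live"}))

Either continuous or summative use then reduces to recording which statements have been ticked, by whom, and when.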
9.2.2.2 Descriptors of aspects of proficiency related to particular competences

Descriptors of aspects of proficiency can be used in two main ways in relation to the attainment of objectives.

1. Self- or teacher-assessment: Provided the descriptors are positive, independent statements, they can be included in checklists for self- and teacher-assessment. However, it is a weakness of the majority of existing scales that the descriptors are often negatively worded at lower levels and norm-referenced around the middle of the scale. They also often make purely verbal distinctions between levels by replacing one or two words in adjacent descriptions, which then have little meaning outside the co-text of the scale. Appendix A discusses ways of developing descriptors that avoid these problems.

2. Performance assessment: A more obvious use for scales of descriptors on aspects of competence from Chapter 5 is to offer starting points for the development of assessment criteria. By guiding personal, non-systematic impressions into considered judgements, such descriptors can help develop a shared frame of reference among the group of assessors concerned.

There are basically three ways in which descriptors can be presented for use as assessment criteria:

• Firstly, descriptors can be presented as a scale – often combining descriptors for different categories into one holistic paragraph per level. This is a very common approach.
• Secondly, they can be presented as a checklist, usually with one checklist per relevant level, often with descriptors grouped under headings, i.e. under categories. Checklists are less usual for live assessment.
• Thirdly, they can be presented as a grid of selected categories, in effect as a set of parallel scales for separate categories. This approach makes it possible to give a diagnostic profile. However, there are limits to the number of categories that assessors can cope with.

There are two distinctly different ways in which one can provide a grid of sub-scales:

Proficiency Scale: by providing a profile grid defining the relevant levels for certain categories, for example from Levels A2 to B2. Assessment is then made directly onto those levels, possibly using further refinements like a second digit or pluses to give greater differentiation if desired. Thus even though the performance test was aimed at Level B1, and even if none of the learners had reached Level B2, it would still be possible for stronger learners to be credited with B1+, B1++ or B1.8.

Examination Rating Scale: by selecting or defining a descriptor for each relevant category which describes the desired pass standard or norm for a particular module or examination for that category. That descriptor is then named ‘Pass’ or ‘3’ and the scale is norm-referenced around that standard (a very weak performance = ‘1’, an excellent performance = ‘5’). The formulations of ‘1’ and ‘5’ might be other descriptors drawn or adapted from the adjacent levels on the scale from the appropriate section of Chapter 5, or the descriptor may be formulated in relation to the wording of the descriptor defined as ‘3’.
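To make the contrast concrete, here is a minimal sketch in Python; the category names, the set of levels and the averaging rule for the pass decision are invented for illustration, not taken from the Framework:

    # Proficiency Scale: each category is rated directly onto levels,
    # with '+' refinements, yielding a diagnostic profile.
    LEVELS = ["A2", "A2+", "B1", "B1+", "B2"]

    def proficiency_profile(ratings):
        """ratings: category -> level drawn from LEVELS."""
        assert all(level in LEVELS for level in ratings.values())
        return ratings

    # Examination Rating Scale: the descriptor named 'Pass' sits at '3' and
    # the 1-5 scale is norm-referenced around that standard.
    PASS_MARK = 3

    def exam_result(marks):
        """marks: category -> mark on 1-5; one possible aggregation rule."""
        return "Pass" if sum(marks.values()) / len(marks) >= PASS_MARK else "Fail"

    print(proficiency_profile({"Range": "B1", "Accuracy": "B1+", "Fluency": "B2"}))
    print(exam_result({"Range": 3, "Accuracy": 4, "Fluency": 2}))   # -> Pass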
9.2.3 Describing the levels of proficiency in tests and examinations to aid comparison

The scales for the Common Reference Levels are intended to facilitate the description of the level of proficiency attained in existing qualifications – and so aid comparison between systems. The measurement literature recognises five classic ways of linking separate assessments: (1) equating; (2) calibrating; (3) statistical moderation; (4) benchmarking; and (5) social moderation.

The first three methods are traditional: (1) producing alternative versions of the same test (equating), (2) linking the results from different tests to a common scale (calibrating), and (3) correcting for the difficulty of test papers or the severity of examiners (statistical moderation).

The last two methods involve building up a common understanding through discussion (social moderation) and the comparison of work samples in relation to standardised definitions and examples (benchmarking). Supporting this process of building a common understanding is one of the aims of the Framework. This is the reason why the scales of descriptors to be used for this purpose have been standardised with a rigorous development methodology. In education this approach is increasingly described as standards-oriented assessment. It is generally acknowledged that the development of a standards-oriented approach takes time, as partners acquire a feel for the meaning of the standards through the process of exemplification and exchange of opinions.

It can be argued that this approach is potentially the strongest method of linking, because it involves the development and validation of a common view of the construct. The fundamental reason why it is difficult to link language assessments, despite the statistical wizardry of traditional techniques, is that the assessments generally test radically different things even when they are intending to cover the same domains. This is partly due to (a) under-conceptualisation and under-operationalisation of the construct, and partly due to (b) related interference from the method of testing.

The Framework offers a principled attempt to provide a solution to the first and underlying problem in relation to modern language learning in a European context. Chapters 4 to 7 elaborate a descriptive scheme which tries to conceptualise language use, competences and the processes of teaching and learning in a practical way that will help partners to operationalise the communicative language ability we wish to promote.

The scales of descriptors make up a conceptual grid which can be used to:

a) relate national and institutional frameworks to each other, through the medium of the Common Framework;
b) map the objectives of particular examinations and course modules using the categories and levels of the scales.

Appendix A provides readers with an overview of methods to develop scales of descriptors and relate them to the Framework scale. The User Guide for Examiners produced by ALTE (Document CC-Lang (96) 10 rev) provides detailed advice on operationalising constructs in tests and avoiding unnecessary distortion through test method effects.

9.3 Types of assessment

A number of important distinctions can be made in relation to assessment. The following list is by no means exhaustive. There is no significance to whether one term in the distinction is placed on the left or on the right.

Table 7. Types of assessment

 1  Achievement assessment       Proficiency assessment
 2  Norm-referencing (NR)        Criterion-referencing (CR)
 3  Mastery learning CR          Continuum CR
 4  Continuous assessment        Fixed assessment points
 5  Formative assessment         Summative assessment
 6  Direct assessment            Indirect assessment
 7  Performance assessment       Knowledge assessment
 8  Subjective assessment        Objective assessment
 9  Checklist rating             Performance rating
10  Impression                   Guided judgement
11  Holistic assessment          Analytic assessment
12  Series assessment            Category assessment
13  Assessment by others         Self-assessment

9.3.1 Achievement assessment/proficiency assessment

Achievement assessment is the assessment of the achievement of specific objectives – assessment of what has been taught. It therefore relates to the week’s/term’s work, the course book, the syllabus. Achievement assessment is oriented to the course. It represents an internal perspective.

Proficiency assessment, on the other hand, is assessment of what someone can do/knows in relation to the application of the subject in the real world. It represents an external perspective.

Teachers have a natural tendency to be more interested in achievement assessment in order to get feedback for teaching. Employers, educational administrators and adult learners tend to be more interested in proficiency assessment: assessment of outcomes, what the person can now do. The advantage of an achievement approach is that it is close to the learner’s experience. The advantage of a proficiency approach is that it helps everyone to see where they stand; results are transparent.

In communicative testing in a needs-oriented teaching and learning context, one can argue that the distinction between achievement (oriented to the content of the course) and proficiency (oriented to the continuum of real world ability) should ideally be small. To the extent that an achievement assessment tests practical language use in relevant situations and aims to offer a balanced picture of emerging competence, it has a proficiency angle. To the extent that a proficiency assessment consists of language and communicative tasks based on a transparent, relevant syllabus, giving the learner the opportunity to show what they have achieved, that test has an achievement element.

The scales of illustrative descriptors relate to proficiency assessment: the continuum of real world ability. The importance of achievement testing as a reinforcement to learning is discussed in Chapter 6.

9.3.2 Norm-referencing (NR)/criterion-referencing (CR)

Norm-referencing is the placement of learners in rank order: their assessment and ranking in relation to their peers.

Criterion-referencing is a reaction against norm-referencing in which the learner is assessed purely in terms of his/her ability in the subject, irrespective of the ability of his/her peers.

Norm-referencing can be undertaken in relation to the class (you are 18th), the demographic cohort (you are 21,567th; you are in the top 14%), or the group of learners taking a test. In the latter case, raw test scores may be adjusted to give a ‘fair’ result by plotting the distribution curve of the test results onto the curve from previous years, in order to maintain a standard and ensure that the same percentage of learners are given ‘A’ grades every year, irrespective of the difficulty of the test or the ability of the pupils. A common use of norm-referenced assessment is in placement tests to form classes.
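The curve-plotting adjustment just described reduces to a quota rule over ranks. A minimal sketch in Python, with invented quotas standing in for ‘the curve from previous years’:

    # Norm-referenced grading: the same share of candidates receives each
    # grade every year, irrespective of the difficulty of the test.
    def curve_grades(scores, quotas=(("A", 0.10), ("B", 0.25), ("C", 0.40), ("D", 0.25))):
        """scores: candidate -> raw score; quotas: (grade, share) pairs summing to 1."""
        ranked = sorted(scores, key=scores.get, reverse=True)
        grades, start = {}, 0
        for grade, share in quotas:
            cut = start + round(share * len(ranked))
            for candidate in ranked[start:cut]:
                grades[candidate] = grade
            start = cut
        for candidate in ranked[start:]:   # rounding remainder gets the lowest grade
            grades[candidate] = quotas[-1][0]
        return grades

    print(curve_grades({"ana": 71, "ben": 58, "cem": 83, "dee": 47, "eva": 64,
                        "fil": 90, "gus": 52, "hal": 66, "ida": 75, "jon": 61}))

The raw scores matter only for the ranking: an identical distribution of grades would result from a much harder or a much easier paper.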
Criterion-referencing implies the mapping of the continuum of proficiency (vertical) and the range of relevant domains (horizontal), so that individual results on a test can be situated in relation to the total criterion space. This involves (a) the definition of the relevant domain(s) covered by the particular test/module, and (b) the identification of ‘cut-off points’: the score(s) on the test deemed necessary to meet the proficiency standard set.

The scales of illustrative descriptors are made up of criterion statements for categories in the descriptive scheme. The Common Reference Levels present a set of common standards.

9.3.3 Mastery CR/continuum CR

The mastery criterion-referencing approach is one in which a single ‘minimum competence standard’ or ‘cut-off point’ is set to divide learners into ‘masters’ and ‘non-masters’, with no degrees of quality in the achievement of the objective being recognised.

The continuum criterion-referencing approach is an approach in which an individual ability is referenced to a defined continuum of all relevant degrees of ability in the area in question.

There are in fact many approaches to CR, but most of them can be identified as primarily a ‘mastery learning’ or a ‘continuum’ interpretation. Much confusion is caused by the misidentification of criterion-referencing exclusively with the mastery approach.

The mastery approach is an achievement approach related to the content of the course/module. It puts less emphasis on situating that module (and so achievement in it) on the continuum of proficiency.

The alternative to the mastery approach is to reference results from each test to the relevant continuum of proficiency, usually with a series of grades. In this approach, that continuum is the ‘criterion’, the external reality which ensures that the test results mean something.
Referencing to this external criterion can be undertaken with a scalar analysis (e.g. the Rasch model) to relate results from all the tests to each other and so report results directly onto a common scale.
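To illustrate what such a scalar analysis does, here is a minimal sketch in Python of the dichotomous Rasch model fitted by plain gradient ascent. The response data are invented, and real calibrations use dedicated IRT software that handles what this sketch ignores (perfect scores, convergence checks, fit statistics):

    import math

    def rasch_fit(responses, n_iter=1000, lr=0.02):
        """Estimate person abilities theta and item difficulties b such that
        P(person p answers item j correctly) = 1 / (1 + exp(b[j] - theta[p])).

        responses: person -> {item: 1 or 0}. Items shared between two tests
        ('anchor items') are what pull both tests onto one common scale.
        """
        persons = list(responses)
        items = sorted({j for r in responses.values() for j in r})
        theta = {p: 0.0 for p in persons}
        b = {j: 0.0 for j in items}
        for _ in range(n_iter):
            for p in persons:
                for j, x in responses[p].items():
                    prob = 1.0 / (1.0 + math.exp(b[j] - theta[p]))
                    theta[p] += lr * (x - prob)   # log-likelihood gradient
                    b[j] -= lr * (x - prob)
            # Fix the scale's origin: shift both sets of parameters so that
            # mean item difficulty is 0 (probabilities are unchanged).
            shift = sum(b.values()) / len(b)
            b = {j: b[j] - shift for j in items}
            theta = {p: theta[p] - shift for p in persons}
        return theta, b

    # Two short tests linked by one shared anchor item, 'q3':
    data = {"ana": {"q1": 1, "q2": 1, "q3": 1},
            "ben": {"q1": 0, "q2": 1, "q3": 0},
            "cem": {"q3": 1, "q4": 0, "q5": 1},
            "dee": {"q3": 0, "q4": 1, "q5": 0}}
    theta, _ = rasch_fit(data)
    print(sorted(theta, key=theta.get))   # all four candidates on one scale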

The Framework can be exploited with a mastery or a continuum approach. The scale of levels used in a continuum approach can be matched to the Common Reference Levels; the objective to be mastered in a mastery approach can be mapped onto the conceptual grid of categories and levels offered by the Framework.

9.3.4 Continuous assessment/fixed point assessment

Continuous assessment is assessment by the teacher, and possibly by the learner, of class performances, pieces of work and projects throughout the course. The final grade thus reflects the whole course/year/semester.

Fixed point assessment is when grades are awarded and decisions made on the basis of an examination or other assessment which takes place on a particular day, usually the end of the course or before the beginning of a course. What has happened beforehand is irrelevant; it is what the person can do now that is decisive.

Assessment is often seen as something outside the course which takes place at fixed points in order to make decisions. Continuous assessment implies assessment which is integrated into the course and which contributes in some cumulative way to the assessment at the end of the course. Apart from marking homework and occasional or regular short achievement tests to reinforce learning, continuous assessment may take the form of checklists/grids completed by teachers and/or learners, assessment in a series of focused tasks, formal assessment of coursework, and/or the establishment of a portfolio of samples of work, possibly in differing stages of drafting and/or at different stages in the course.

Both approaches have advantages and disadvantages. Fixed point assessment ensures that people can still do things that might have been on the syllabus two years ago, but it leads to examination traumas and favours certain types of learners. Continuous assessment allows more account to be taken of creativity and different strengths, but is very much dependent on the teacher’s capacity to be objective. It can, if taken to an extreme, turn life into one long never-ending test for the learner and a bureaucratic nightmare for the teacher.

Checklists of criterion statements describing ability with regard to communicative activities (Chapter 4) can be useful for continuous assessment. Rating scales developed in relation to the descriptors for aspects of competence (Chapter 5) can be used to award grades in fixed point assessment.

9.3.5 Formative assessment/summative assessment

Formative assessment is an ongoing process of gathering information on the extent of learning, and on strengths and weaknesses, which the teacher can feed back into course planning and into the actual feedback given to learners. Formative assessment is often used in a very broad sense so as to include non-quantifiable information from questionnaires and consultations.

Summative assessment sums up attainment at the end of the course with a grade. It is not necessarily proficiency assessment; indeed a lot of summative assessment is norm-referenced, fixed-point, achievement assessment.

The strength of formative assessment is that it aims to improve learning. The weakness of formative assessment is inherent in the metaphor of feedback. Feedback only works if the recipient is in a position (a) to notice, i.e. is attentive, motivated and familiar with the form in which the information is coming; (b) to receive, i.e. is not swamped with information and has a way of recording, organising and personalising it; (c) to interpret, i.e. has sufficient pre-knowledge and awareness to understand the point at issue, and does not take counterproductive action; and (d) to integrate the information, i.e. has the time, orientation and relevant resources to reflect on, integrate and so remember the new information. This implies self-direction, which implies training towards self-direction, monitoring one’s own learning, and developing ways of acting on feedback.

Such learner training or awareness raising has been called évaluation formatrice. A variety of techniques may be used for this awareness training. A basic principle is to compare impression (e.g. what you say you can do on a checklist) with reality (e.g. actually listening to material of the type mentioned in the checklist and seeing if you do understand it). DIALANG relates self-assessment to test performance in this way.
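That comparison can be made mechanical. A minimal sketch in Python, with invented descriptors and results (this shows the principle only, not DIALANG’s actual procedure):

    # Compare checklist impression with observed task success; descriptors
    # where the two disagree are the ones worth discussing or re-testing.
    def discrepancies(self_ratings, task_results):
        """Both map descriptor -> True/False; returns descriptors to revisit."""
        return [d for d, claimed in self_ratings.items()
                if d in task_results and claimed != task_results[d]]

    claimed = {"understand announcements": True, "follow a news broadcast": True}
    observed = {"understand announcements": True, "follow a news broadcast": False}
    print(discrepancies(claimed, observed))   # -> ['follow a news broadcast']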

Another important technique is discussing samples of work – both neutral examples and samples from learners – and encouraging learners to develop a personalised metalanguage on aspects of quality. They can then use this metalanguage to monitor their work for strengths and weaknesses and to formulate a self-directed learning contract.

Most formative or diagnostic assessment operates at a very detailed level of the particular language points or skills recently taught or soon to be covered. For diagnostic assessment the lists of exponents given in section 5.2 are still too generalised to be of practical use; one would need to refer to the particular specification which was relevant (Waystage, Threshold, etc.). Grids consisting of descriptors defining different aspects of competence at different levels (Chapter 4) can, however, be useful to give formative feedback from a speaking assessment.

The Common Reference Levels would appear to be most relevant to summative assessment. However, as the DIALANG Project demonstrates, feedback from even a summative assessment can be diagnostic and so formative.

9.3.6 Direct assessment/indirect assessment

Direct assessment is assessing what the candidate is actually doing. For example, a small group are discussing something; the assessor observes, compares with a criteria grid, matches the performances to the most appropriate categories on the grid, and gives an assessment.

Indirect assessment, on the other hand, uses a test, usually on paper, which often assesses enabling skills.

Direct assessment is effectively limited to speaking, writing and listening in interaction, since you can never see receptive activity directly. Reading can, for example, only be assessed indirectly, by requiring learners to demonstrate evidence of understanding by ticking boxes, finishing sentences, answering questions, etc. Linguistic range and control can be assessed either directly, through judging the match to criteria, or indirectly, by interpreting and generalising from the responses to test questions. A classic direct test is an interview; a classic indirect test is a cloze.

Descriptors defining different aspects of competence at different levels in Chapter 5 can be used to develop assessment criteria for direct tests. The parameters in Chapter 4 can inform the selection of themes, texts and test tasks for direct tests of the productive skills and indirect tests of listening and reading. The parameters of Chapter 5 can in addition inform the identification of key linguistic competences to include in an indirect test of language knowledge, and of key pragmatic, sociolinguistic and linguistic competences to focus on in the formulation of test questions for item-based tests of the four skills.

9.3.7 Performance assessment/knowledge assessment

Performance assessment requires the learner to provide a sample of language in speech or writing in a direct test.

Knowledge assessment requires the learner to answer questions, which can be of a range of different item types, in order to provide evidence of the extent of their linguistic knowledge and control.

Unfortunately one can never test competences directly. All one ever has to go on is a range of performances, from which one seeks to generalise about proficiency. Proficiency can be seen as competence put to use.
In this sense, therefore, all tests assess only performance, though one may seek to draw inferences as to the underlying competences from this evidence.

However, an interview requires more of a ‘performance’ than filling gaps in sentences, and gap-filling in turn requires more ‘performance’ than multiple choice. In this sense the word ‘performance’ is being used to mean the production of language. But the word ‘performance’ is used in a more restricted sense in the expression ‘performance tests’. Here the word is taken to mean a relevant performance in a (relatively) authentic and often work- or study-related situation. In a slightly looser use of the term ‘performance assessment’, oral assessment procedures could be said to be performance tests in that they generalise about proficiency from performances in a range of discourse styles considered to be relevant to the learning context and needs of the learners. Some tests balance the performance assessment with an assessment of knowledge of the language as a system; others do not.

This distinction is very similar to the one between direct and indirect tests. The Framework can be exploited in a similar way. The Council of Europe specifications for different levels (Waystage, Threshold Level, Vantage Level) offer in addition appropriate detail on target language knowledge in the languages for which they are available.

9.3.8 Subjective assessment/objective assessment

Subjective assessment is a judgement by an assessor. What is normally meant by this is the judgement of the quality of a performance.

Objective assessment is assessment in which subjectivity is removed. What is normally meant by this is an indirect test in which the items have only one right answer, e.g. multiple choice.

However, the issue of subjectivity/objectivity is considerably more complex.

An indirect test is often described as an ‘objective test’ when the marker consults a definitive key to decide whether to accept or reject an answer, and then counts correct responses to give the result. Some test types take this process a stage further by only having one possible answer to each question (e.g. multiple choice, and c-tests, which were developed from cloze for this reason), and machine marking is often adopted to eliminate marker error. In fact the objectivity of tests described as ‘objective’ in this way is somewhat over-stated, since someone decided to restrict the assessment to techniques offering more control over the test situation (itself a subjective decision others may disagree with). Someone then wrote the test specification, and someone else may have written the item as an attempt to operationalise a particular point in the specification. Finally, someone selected the item from all the other possible items for this test. Since all those decisions involve an element of subjectivity, such tests are perhaps better described as objectively scored tests.
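In an objectively scored test, marking reduces to key lookup and counting. A minimal sketch in Python, with an invented three-item key:

    # Marking with a definitive key: accept or reject each answer, count correct.
    KEY = {"q1": "b", "q2": "a", "q3": "d"}

    def mark(answers):
        """answers: item -> the candidate's choice; returns the number correct."""
        return sum(1 for item, right in KEY.items() if answers.get(item) == right)

    print(mark({"q1": "b", "q2": "c", "q3": "d"}))   # -> 2

Note that every subjective decision has already been taken upstream, in the specification and the item selection; only the scoring step is objective.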
In direct performance assessment, grades are generally awarded on the basis of a judgement. That means that the decision as to how well the learner performs is made subjectively, taking relevant factors into account and referring to any guidelines or criteria and to experience. The advantage of a subjective approach is that language and communication are very complex, do not lend themselves to atomisation and are greater than the sum of their parts. It is very often difficult to establish what exactly a test item is testing. Therefore to target test items on specific aspects of competence or performance is a lot less straightforward than it sounds.

Yet, in order to be fair, all assessment should be as objective as possible. The effects of the personal value judgements involved in subjective decisions about the selection of content and the quality of performance should be reduced as far as possible, particularly where summative assessment is concerned. This is because test results are very often used by third parties to make decisions about the future of the persons who have been assessed.

Subjectivity in assessment can be reduced, and validity and reliability thus increased, by taking steps like the following:

• developing a specification for the content of the assessment, for example based upon a framework of reference common to the context involved
• using pooled judgements to select content and/or to rate performances
• adopting standard procedures governing how the assessments should be carried out
• providing definitive marking keys for indirect tests, and basing judgements in direct tests on specific defined criteria
• requiring multiple judgements and/or weighting of different factors
• undertaking appropriate training in relation to assessment guidelines
• checking the quality of the assessment (validity, reliability) by analysing assessment data

As discussed at the beginning of this chapter, the first step towards reducing the subjectivity of judgements made at all stages in the assessment process is to build a common understanding of the construct involved, a common frame of reference. The Framework seeks to offer such a basis for the specification of the content and a source for the development of specific defined criteria for direct tests.

9.3.9 Rating on a scale/rating on a checklist

Rating on a scale: judging that a person is at a particular level or band on a scale made up of a number of such levels or bands.

Rating on a checklist: judging a person in relation to a list of points deemed to be relevant for a particular level or module.

In ‘rating on a scale’ the emphasis is on placing the person rated on a series of bands. The emphasis is vertical: how far up the scale does he/she come? The meaning of the different bands/levels should be made clear by scale descriptors. There may be several scales for different categories, and these may be presented on the same page as a grid or on different pages. There may be a definition for each band/level, for alternate ones, or for the top, bottom and middle.

The alternative is a checklist, on which the emphasis is on showing that relevant ground has been covered, i.e. the emphasis is horizontal: how much of the content of the module has he/she successfully accomplished? The checklist may be presented as a list of points like a questionnaire. It may on the other hand be presented as a wheel, or in some other shape. The response may be Yes/No, or it may be more differentiated, with a series of steps (e.g. 0–4), preferably with the steps identified with labels and with definitions explaining how the labels should be interpreted.

Because the illustrative descriptors constitute independent criterion statements which have been calibrated to the levels concerned, they can be used as a source to produce both a checklist for a particular level, as in some versions of the Language Portfolio, and rating scales or grids covering all relevant levels, as presented in Chapter 3: for self-assessment in Table 2 and for examiner assessment in Table 3.

9.3.10 Impression/guided judgement

Impression: fully subjective judgement made on the basis of experience of the learner’s performance in class, without reference to specific criteria in relation to a specific assessment.

Guided judgement: judgement in which individual assessor subjectivity is reduced by complementing impression with conscious assessment in relation to specific criteria.

An ‘impression’ here means a rating made purely on the basis of a teacher’s or learner’s experience of performance in class, homework, etc. Many forms of subjective rating, especially those used in continuous assessment, involve rating an impression on the basis of reflection or memory, possibly focused by conscious observation of the person concerned over a period of time. Very many school systems operate on this basis.

The term ‘guided judgement’ is here used to describe the situation in which that impression is guided into a considered judgement through an assessment approach. Such an approach implies (a) an assessment activity with some form of procedure, and/or (b) a set of defined criteria which distinguish between the different scores or grades, and (c) some form of standardisation training. The advantage of the guided approach to judging is that if a common framework of reference for the group of assessors concerned is established in this way, the consistency of judgements can be radically increased. This is especially the case if ‘benchmarks’ are provided in the form of samples of performance and fixed links to other systems. The importance of such guidance is underlined by the fact that research in a number of disciplines has repeatedly shown that, with untrained judgements, the differences in the severity of the assessors can account for nearly as much of the differences in the assessment of learners as does their actual ability, leaving results almost purely to chance.

The scales of descriptors for the Common Reference Levels can be exploited to provide a set of defined criteria as described in (b) above, or to map the standards represented by existing criteria in terms of the common levels. In the future, benchmark samples of performance at different Common Reference Levels may be provided to assist in standardisation training.

9.3.11 Holistic/analytic

Holistic assessment is making a global synthetic judgement. Different aspects are weighted intuitively by the assessor.

Analytic assessment is looking at different aspects separately.

There are two ways in which this distinction can be made: (a) in terms of what is looked for; (b) in terms of how a band, grade or score is arrived at. Systems sometimes combine an analytic approach at one level with a holistic approach at another.

a) What to assess: some approaches assess a global category like ‘speaking’ or ‘interaction’, assigning one score or grade. Others, more analytic, require the assessor to assign separate results to a number of independent aspects of performance. Yet other approaches require the assessor to note a global impression, analyse by different categories and then come to a considered holistic judgement. The advantage of the separate categories of an analytic approach is that they encourage the assessor to observe closely. They provide a metalanguage for negotiation between assessors, and for feedback to learners. The disadvantage is that a wealth of evidence suggests that assessors cannot easily keep the categories separate from a holistic judgement. They also suffer cognitive overload when presented with more than four or five categories.

b) Calculating the result: some approaches holistically match observed performance to descriptors on a rating scale, whether the scale is holistic (one global scale) or analytic (3–6 categories in a grid). Such approaches involve no arithmetic. Results are reported either as a single number or as a ‘telephone number’ across categories. Other, more analytical approaches require giving a certain mark for a number of different points and then adding them up to give a score, which may then be converted into a grade. It is characteristic of this approach that the categories are weighted, i.e. the categories do not each account for an equal number of points. (A sketch of such weighted marking follows this section.)

Tables 2 and 3 in Chapter 3 provide self-assessment and examiner assessment examples respectively of analytic scales of criteria (i.e. grids) used with a holistic rating strategy (i.e. match what you can deduce from the performance to the definitions, and make a judgement).
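The weighted, arithmetical variant described in (b) can be stated in a few lines. A minimal sketch in Python, with invented categories, weights and pass mark:

    # Analytic marking with weighted categories: each category receives a
    # mark, the weighted marks are summed, and the total converts to a grade.
    WEIGHTS = {"range": 0.3, "accuracy": 0.3, "fluency": 0.2, "interaction": 0.2}

    def weighted_result(marks, pass_mark=3.0):
        """marks: category -> mark on a 0-5 scale; returns (score, grade)."""
        score = round(sum(WEIGHTS[c] * m for c, m in marks.items()), 2)
        return score, ("pass" if score >= pass_mark else "fail")

    print(weighted_result({"range": 4, "accuracy": 3, "fluency": 5, "interaction": 4}))
    # -> (3.9, 'pass')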
9.3.12 Series assessment/category assessment

Category assessment involves a single assessment task (which may well have different phases to generate different discourse, as discussed in section 9.2.1) in which performance is judged in relation to the categories in an assessment grid: the analytic approach outlined in section 9.3.11.

Series assessment involves a series of isolated assessment tasks (often roleplays with other learners or the teacher), which are rated with a simple holistic grade on a labelled scale of e.g. 0–3 or 1–4.

A series assessment is one way of coping with the tendency in category assessments for results on one category to affect those on another. At lower levels the emphasis tends to be on task achievement; the aim is to fill out a checklist of what the learner can do on the basis of teacher/learner assessment of actual performances rather than simple impression. At higher levels, tasks may be designed to show particular aspects of proficiency in the performance. Results are reported as a profile.

The scales for different categories of language competence juxtaposed with the text in Chapter 5 offer a source for the development of the criteria for a category assessment. Since assessors can only cope with a small number of categories, compromises have to be made in the process. The elaboration of relevant types of communicative activities in section 4.4 and the list of different types of functional competence outlined in section 5.2.3.2 may inform the identification of suitable tasks for a series assessment.

9.3.13 Assessment by others/self-assessment

Assessment by others: judgements by the teacher or examiner.

Self-assessment: judgements about your own proficiency.

Learners can be involved in many of the assessment techniques outlined above. Research suggests that, provided ‘high stakes’ (e.g. whether or not you will be accepted for a course) are not involved, self-assessment can be an effective complement to tests and teacher assessment. Accuracy in self-assessment is increased (a) when assessment is in relation to clear descriptors defining standards of proficiency and/or (b) when assessment is related to a specific experience. This experience may itself even be a test activity. It is also probably made more accurate when learners receive some training. Such structured self-assessment can achieve correlations to teachers’ assessments and tests equal to the correlation (level of concurrent validation) commonly reported between teachers themselves, between tests, and between teacher assessment and tests.

The main potential for self-assessment, however, is in its use as a tool for motivation and awareness raising: helping learners to appreciate their strengths, recognise their weaknesses and orient their learning more effectively.

Self-assessment and examiner versions of rating grids are presented in Table 2 and Table 3 in Chapter 3. The most striking distinction between the two – apart from the purely surface formulation as I can do . . . or Can do . . . – is that whereas Table 2 focuses on communicative activities, Table 3 focuses on generic aspects of competence apparent in any spoken performance. However, a slightly simplified self-assessment version of Table 3 can easily be imagined. Experience suggests that at least adult learners are capable of making such qualitative judgements about their competence.

Users of the Framework may wish to consider and where appropriate state:
• which of the types of assessment listed above are:
  • more relevant to the needs of the learner in their system
  • more appropriate and feasible in the pedagogic culture of their system
  • more rewarding in terms of teacher development through ‘washback’ effect
• the way in which the assessment of achievement (school-oriented; learning-oriented) and the assessment of proficiency (real world-oriented; outcome-oriented) are balanced and complemented in their system, and the extent to which communicative performance is assessed as well as linguistic knowledge
• the extent to which the results of learning are assessed in relation to defined standards and criteria (criterion-referencing) and the extent to which grades and evaluations are assigned on the basis of the class a learner is in (norm-referencing)
• the extent to which teachers are:
  • informed about standards (e.g. common descriptors, samples of performance)
  • encouraged to become aware of a range of assessment techniques
  • trained in techniques and interpretation
• the extent to which it is desirable and feasible to develop an integrated approach to continuous assessment of coursework and fixed point assessment in relation to related standards and criteria definitions
• the extent to which it is desirable and feasible to involve learners in self-assessment in relation to defined descriptors of tasks and aspects of proficiency at different levels, and the operationalisation of those descriptors in – for example – series assessment
• the relevance of the specifications and scales provided in the Framework to their context, and the way in which they might be complemented or elaborated

9.4 Feasible assessment and a metasystem

The scales interspersed in Chapters 4 and 5 present an example of a set of categories related to, but simplified from, the more comprehensive descriptive scheme presented in the text of Chapters 4 and 5. It is not the intention that anyone should, in a practical assessment approach, use all the scales at all the levels. Assessors find it difficult to cope with a large number of categories and, in addition, the full range of levels presented may not be appropriate in the context concerned. Rather, the set of scales is intended as a reference tool.

Whatever approach is being adopted, any practical assessment system needs to reduce the number of possible categories to a feasible number. Received wisdom is that more than 4 or 5 categories starts to cause cognitive overload and that 7 categories is psychologically an upper limit. Thus choices have to be made. In relation to oral assessment, if interaction strategies are considered a qualitative aspect of communication relevant in oral assessment, then the illustrative scales contain 14 qualitative categories relevant to oral assessment:

Turntaking strategies
Co-operating strategies
Asking for clarification
Fluency
Flexibility
Coherence
Thematic development
Precision
Sociolinguistic competence
General range
Vocabulary range
Grammatical accuracy
Vocabulary control
Phonological control

It is obvious that, whilst descriptors on many of these features could possibly be included in a general checklist, 14 categories are far too many for an assessment of any performance. In any practical approach, therefore, such a list of categories would be approached selectively. Features need to be combined, renamed and reduced into a smaller set of assessment criteria appropriate to the needs of the learners concerned, to the requirements of the assessment task concerned and to the style of the pedagogic culture concerned. The resultant criteria might be equally weighted, or alternatively certain factors considered more crucial to the task at hand might be more heavily weighted, as the sketch below illustrates.
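The reduction itself is straightforward to represent. A minimal sketch in Python; the grouping is invented for illustration (it happens to resemble Example 3 below), and the equal weights could be replaced by task-specific ones:

    # Four operational criteria, each combining and renaming categories from
    # the list of 14 above.
    CRITERIA = {
        "Range":       ["General range", "Vocabulary range"],
        "Accuracy":    ["Grammatical accuracy", "Vocabulary control"],
        "Delivery":    ["Fluency", "Phonological control"],
        "Interaction": ["Turntaking strategies", "Co-operating strategies"],
    }
    WEIGHTS = {name: 1 / len(CRITERIA) for name in CRITERIA}   # equal weighting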

The following four examples show ways in which this can be done. The first three examples are brief notes on the way categories are used as test criteria in existing assessment approaches. The fourth example shows how descriptors in scales in the Framework were merged and reformulated in order to provide an assessment grid for a particular purpose on a particular occasion.

Example 1:
Cambridge Certificate in Advanced English (CAE), Paper 5: Criteria for Assessment (1991)

Test criteria               Illustrative scales            Other categories
Fluency                     Fluency
Accuracy and range          General range
                            Vocabulary range
                            Grammatical accuracy
                            Vocabulary control
Pronunciation               Phonological control
Task achievement            Coherence                      Task success
                            Sociolinguistic appropriacy    Need for interlocutor support
Interactive communication   Turntaking strategies          Extent and ease of maintaining
                            Co-operative strategies          contribution
                            Thematic development

Note on other categories: In the illustrative scales, statements about task success are found in relation to the kind of activity concerned under Communicative Activities. Extent and ease of contribution is included under Fluency in those scales. An attempt to write and calibrate descriptors on Need for Interlocutor Support to include in the illustrative set of scales was unsuccessful.

Example 2:
International Certificate Conference (ICC): Certificate in English for Business Purposes, Test 2: Business Conversation (1987)

Test criteria                     Illustrative scales            Other categories
Scale 1 (not named)               Sociolinguistic appropriacy    Task success
                                  Grammatical accuracy
                                  Vocabulary control
Scale 2 (Use of discourse         Turntaking strategies
  features to initiate and        Co-operative strategies
  maintain flow of                Sociolinguistic appropriacy
  conversation)

Example 3:
Eurocentres – Small Group Interaction Assessment (RADIO) (1987)

Test criteria    Illustrative scales              Other categories
Range            General range
                 Vocabulary range
Accuracy         Grammatical accuracy
                 Vocabulary control
                 Socio-linguistic appropriacy
Delivery         Fluency
                 Phonological control
Interaction      Turntaking strategies
                 Co-operating strategies

Example 4:
Swiss National Research Council: Assessment of Video Performances

Context: The illustrative descriptors were scaled in a research project in Switzerland as explained in Appendix A. At the conclusion of the research project, teachers who had participated were invited to a conference to present the results and to launch experimentation in Switzerland with the European Language Portfolio. At the conference, two of the subjects of discussion were (a) the need to relate continuous assessment and self-assessment checklists to an overall framework, and (b) the ways in which the descriptors scaled in the project could be exploited in different ways in assessment. As part of this process of discussion, videos of some of the learners in the survey were rated onto the assessment grid presented as Table 3 in Chapter 3. It presents a selection from the illustrative descriptors in a merged, edited form.

Test criteria    Illustrative scales              Other categories
Range            General range
                 Vocabulary range
Accuracy         Grammatical accuracy
                 Vocabulary control
Fluency          Fluency
Interaction      Global interaction
                 Turntaking
                 Co-operating
Coherence        Coherence

Different systems with different learners in different contexts simplify, select and combine features in different ways for different kinds of assessment. Indeed, rather than being too long, the list of 14 categories is probably unable to accommodate all the variants people choose, and would need to be expanded to be fully comprehensive.

Users of the Framework may wish to consider and where appropriate state:
• the way in which theoretical categories are simplified into operational approaches in their system;
• the extent to which the main factors used as assessment criteria in their system can be situated in the set of categories introduced in Chapter 5 for which sample scales are provided in the Appendix, given further local elaboration to take account of specific domains of use.

Appendix A: developing proficiency descriptors

This appendix discusses technical aspects of describing levels of language attainment. Criteria for descriptor formulation are discussed. Methodologies for scale development are then listed, and an annotated bibliography is provided.

Descriptor formulation

Experience of scaling in language testing, the theory of scaling in the wider field of applied psychology, and the preferences of teachers when involved in consultation processes (e.g. UK graded objectives schemes, the Swiss project) suggest the following set of guidelines for developing descriptors:

• Positiveness: It is a common characteristic of assessor-orientated proficiency scales and of examination rating scales that the entries at lower levels are negatively worded. It is more difficult to formulate proficiency at low levels in terms of what the learner can do rather than in terms of what they can’t do. But if levels of proficiency are to serve as objectives rather than just as an instrument for screening candidates, then positive formulation is desirable. It is sometimes possible to formulate the same point either positively or negatively, e.g. in relation to range of language (see Table A1).

An added complication in avoiding negative formulation is that there are some features of communicative language proficiency which are not additive: the less there is, the better. The most obvious example is what is sometimes called Independence: the extent to which the learner is dependent on (a) speech adjustment on the part of the interlocutor, (b) the chance to ask for clarification and (c) the chance to get help with formulating what he/she wants to say. Often these points can be dealt with in provisos attached to positively worded descriptors, for example:

  Can generally understand clear, standard speech on familiar matters directed at him/her, provided he/she can ask for repetition or reformulation from time to time.

  Can understand what is said clearly, slowly and directly to him/her in simple everyday conversation; can be made to understand, if the speaker can take the trouble.

or:

  Can interact with reasonable ease in structured situations and short conversations, provided the other person helps if necessary.

• Definiteness: Descriptors should describe concrete tasks and/or concrete degrees of skill in performing tasks. There are two points here. Firstly, the descriptor should avoid vagueness, like, for example, ‘Can use a range of appropriate strategies’. What is meant by strategy? Appropriate to what? How should we interpret ‘range’? The problem with vague descriptors is that they can read quite nicely, but an apparent ease of acceptance can mask the fact that everyone is interpreting them differently. Secondly, since the 1940s, it has been a principle that distinctions between steps on a scale should not be dependent on replacing a qualifier like ‘some’ or ‘a few’ with ‘many’ or ‘most’, or on replacing ‘fairly broad’ with ‘very broad’, or ‘moderate’ with ‘good’, at the next level up. Distinctions should be real, not word-processed, and this may mean gaps where meaningful, concrete distinctions cannot be made.

• Clarity: Descriptors should be transparent, not jargon-ridden. Apart from the barrier to understanding, it is sometimes the case that when jargon is stripped away, an apparently impressive descriptor can turn out to be saying very little. Secondly, descriptors should be written in simple syntax with an explicit, logical structure.

• Brevity: One school of thought is associated with holistic scales, particularly those used in America and Australia. These try to produce a lengthy paragraph which comprehensively covers what are felt to be the major features. Such scales achieve ‘definiteness’ by a very comprehensive listing which is intended to transmit a detailed portrait of what raters can recognise as a typical learner at the level concerned, and they are as a result very rich sources of description. There are two disadvantages to such an approach, however. Firstly, no individual is actually ‘typical’: detailed features co-occur in different ways. Secondly, a descriptor which is longer than a two-clause sentence cannot realistically be referred to during the assessment process. Teachers consistently seem to prefer short descriptors. In the project which produced the illustrative descriptors, teachers tended to reject or split descriptors longer than about 25 words (approximately two lines of normal type).

• Independence: There are two further advantages of short descriptors. Firstly, they are more likely to describe a behaviour about which one can say ‘Yes, this person can do this’. Consequently shorter, concrete descriptors can be used as independent criterion statements in checklists or questionnaires for teacher continuous assessment and/or self-assessment. This kind of independent integrity is a signal that the descriptor could serve as an objective, rather than having meaning only relative to the formulation of other descriptors on the scale. This opens up a range of opportunities for exploitation in different forms of assessment (see Chapter 9).
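When drafting descriptors in quantity, guidelines like these can be partly mechanised. A minimal sketch in Python: the 25-word limit comes from the project finding quoted above, while the list of negative markers is invented, so this is a rough drafting aid rather than a validation method:

    # Screen draft descriptors for Brevity (reject or split over ~25 words)
    # and Positiveness (avoid negative wording).
    NEGATIVE_MARKERS = ("cannot", "can't", "unable", "fails to", "lacks")

    def screen(descriptor):
        """Return guideline warnings for one draft descriptor."""
        warnings = []
        if len(descriptor.split()) > 25:
            warnings.append("brevity: over 25 words, likely to be rejected or split")
        if any(marker in descriptor.lower() for marker in NEGATIVE_MARKERS):
            warnings.append("positiveness: negatively worded")
        return warnings

    print(screen("Cannot understand extended speech."))
    # -> ['positiveness: negatively worded']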
Scale development methodologies

The existence of a series of levels presupposes that certain things can be placed at one level rather than another, and that descriptions of a particular degree of skill belong to one level rather than another. This implies a form of scaling, consistently applied. There are a number of possible ways in which descriptions of language proficiency can be assigned to different levels. The available methods can be categorised in three groups: intuitive methods, qualitative methods and quantitative methods. Most existing scales of language proficiency and other sets of levels have been developed through one of the three intuitive methods in the first group. The best approaches combine all three approaches in a complementary and cumulative process. Qualitative methods require the intuitive preparation and selection of material and intuitive interpretation of results. Quantitative methods should quantify qualitatively pre-tested material, and will require intuitive interpretation of results. Therefore, in developing the Common Reference Levels, a combination of intuitive, qualitative and quantitative approaches was used.

If qualitative and quantitative methods are used, then there are two possible starting points: descriptors or performance samples.

Users of the Framework may wish to consider and where appropriate state:
• which of the criteria listed are most relevant, and what other criteria are used explicitly or implicitly in their context;
• to what extent it is desirable and feasible that formulations in their system meet criteria such as those listed.