JLTA Journal, vol. 21, pp. 3–20, 2018. Copyright © 2018 Japan Language Testing Association. DOI: 10.20622/jltajournal.21.0_3. Print ISSN 2189-5341, Online ISSN 2189-9746.

Task-Based Language Assessment: Aligning Designs With Intended Uses and Consequences

John M. NORRIS

Educational Testing Service

Abstract

Constructed-response tasks have captured the attention of testers and educators for some time (e.g., Cureton, 1951), because they present goal-oriented, contextualized challenges that prompt examinees to deploy cognitive skills and domain-related knowledge in authentic performances. Such performances present a distinct advantage when teaching, learning, and assessment focus on what learners can do rather than merely emphasizing what they know (Wiggins, 1998). Over the past several decades, communicative performance tasks have come to play a crucial role in language assessments on a variety of levels, from classroom-based tests, to professional certifications, to large-scale language proficiency exams (Norris, 2009, 2016). However, the use of such tasks for assessment purposes remains contentious, and numerous language testing alternatives are available at potentially lower cost and degree of effort. In order to facilitate decisions about when and why to adopt task-based designs for language assessment, I first outline the relationship between assessment designs and their intended uses and consequences. I then introduce two high-stakes examples of language assessment circumstances (job certification and admissions testing) that suggest a need for task-based designs, and I review the corresponding fit of several assessments currently in use for these purposes. In relation to these purposes, I also suggest some of the positive consequences of task-based designs for language learners, teachers, and society, and I point to the dangers of using assessments that do not incorporate communicative tasks or do so inappropriately. I conclude by highlighting other circumstances that call for task-based designs, and I suggest how advances in technology may help to address associated challenges.

Keywords: task-based language assessment, performance assessment, test design, validity, intended uses, consequences

A rich tradition of innovation in language testing has led to the development of numerous approaches to, and methods for, assessing language learning, proficiency, and other dimensions of the language ability construct. From ‘tried-and-true’ multiple choice tests of grammatical knowledge or listening comprehension, to performance assessments in writing
and speaking, to reduced redundancy tests like cloze and elicited imitation, an impressive array of possibilities is available for gauging what second language (L2) learners know and can do in a given target language. Rapid advances in computer-based testing, drawing in particular on the combination of natural language processing theories with automated speech/writing recognition and evaluation technologies, have also pushed language testing into new territory, including mobile delivery, automated scoring, and increasingly ‘smart’ machine-mediated interaction with test takers. As a result of these and other innovations, test users today are presented with not only a wide array of possibilities for measuring and interpreting L2 ability but also the challenge of how to choose among them to best meet their needs.

One persistent trend in language testing has to do with the desire to incorporate communication tasks into assessments, such that learners must demonstrate the extent to which they can actually use the L2 to get things done. Task-based language assessments (TBLA), according to Norris (2016), involve: “the elicitation and evaluation of language use (across all modalities) for expressing and interpreting meaning, within a well-defined communicative context (and audience), for a clear purpose, toward a valued goal or outcome” (p. 232). Although authentic communication tasks have come to play a role in classroom- and standards-based assessments, professional certification exams, and even standardized proficiency tests, their adoption and use remains somewhat contentious. TBLA can present a variety of potential challenges, such as: selecting which tasks to include in an assessment, replicating authentic contexts and performance demands, scoring task performances and outcomes reliably, providing all test takers with an equally fair assessment, and generalizing about learners’ L2 abilities across tasks and domains of language use. In light of other possibilities for language testing, it may be perceived that these kinds of challenges or costs outweigh any tangible benefits of engaging in TBLA. In this paper, I provide a basic rationale for adopting TBLA despite such challenges. I begin by outlining an approach for helping test developers and test users make decisions about which test designs provide a fitting match for the intended uses and consequences of their assessments. I then reflect on some of the major contributions that are made by TBLA within language education and society, and I highlight some of the ways in which apparent challenges may actually provide critical opportunities for enhancing the positive impact (and avoiding negative consequences) of language assessment. I conclude by indicating future work that will help to align task-based language assessment designs with fitting uses and consequences.

Language Assessment Design, Use, and Consequence

At their most basic, language tests provide information about what L2 learners know and can do in the target language. The nature of that information depends on what dimensions of language knowledge and ability are targeted by design, ranging from discrete tests of lexical meanings or grammatical rules, to indirect tests of linguistic enabling skills (e.g., phonemic awareness, meta-linguistic judgments), to performance tests of productive
communication abilities (Norris & Ortega, 2012). Underlying all language tests—implicitly if not explicitly—is some notion of a construct, that is, the particular scope and focus of language knowledge or ability about which information is gathered from test takers as they complete test tasks, and about which interpretations are made. We interpret that learners understand frequent vocabulary words, can comprehend the main idea of a lecture, are effective presenters, have a certain level of L2 proficiency, and so on, based on a limited set of behaviors that the test has asked them to engage in. Language tests are considered accurate to the extent that the behaviors they elicit can be shown to provide reasonably trustworthy estimations of the constructs they target; in other words, a given test should measure what it is intended to measure. Historically, and in some recent perspectives (e.g., Borsboom, Cramer, Kievit, Scholten, & Franić, 2009), the primary if not exclusive criterion by which test validity should be determined has to do with its accuracy as a measuring device of a particular construct. In this view, language test designs are good to the extent that they measure a construct accurately.

However, as all language teachers know, tests function as much more than mere measuring devices when they are put into use for particular purposes in educational, professional, and other societal contexts. Language assessment involves the use of information provided by tests or similar procedures for making decisions and taking actions, and these in turn lead to consequences for a variety of stakeholders. The reality that tests are inevitably put to use for particular purposes and with very real outcomes has led to an expanded notion of assessment validity and associated expectations for what constitutes good test design. Building on seminal contributions by Cronbach (1980) and Messick (1989), among others, scholars working at the front lines of applied measurement have come to a revised conceptualization of validity that posits effectiveness at bringing about certain intended uses and consequences as the fundamental starting point for determining assessment quality. Kane’s (2006, 2012) argument-based framework, the requirements of a ‘theory of action’ for educational assessments (e.g., Bennett, 2010), and other approaches to understanding and evaluating assessment validity all emphasize the need to (a) design assessment systems so that they enable accurate interpretations about test takers, that (b) lead to appropriate decisions and actions, which (c) stand a high likelihood of resulting in positive impacts and mitigating potential negative consequences. Much more than measurement accuracy, the bottom line in terms of good assessment design has to do with its effectiveness for supporting particular intended uses and consequences.

A focus on designing assessments with intended uses and consequences in mind has achieved some consensus within the language testing community as well. Kane’s argument-based approach has been widely adopted and applied to language test design and validity evaluation in recent years (e.g., Chapelle, 2012; Chapelle, Enright, & Jamieson, 2008), and Bachman’s (2005) notion of an assessment use argument has encouraged language testers to incorporate intended uses and consequences into all phases of test design, delivery, and evaluation. A particular challenge for language assessment in this regard has to do with
the wide array of actual uses to which language tests are put in various settings. Language tests are used to make decisions about individual language learners/users (selection, admission, placement, advancement, certification, licensure, etc.), to inform teaching and learning (feedback, grading, diagnosis, achievement, motivating learners, etc.), and to evaluate programs (demonstrating outcomes, improving curriculum and instruction, aligning programs with societal expectations, etc.), to name a few common educational uses for assessment. Determining which among many possible language test designs—targeting a variety of possible constructs of language knowledge and ability in specific ways—might best support such diverse intended uses requires systematic attention to the interaction of a handful of key features in the assessment setting.

As I have outlined previously (Norris, 2000, 2008), a basic starting point for aligning language test designs with intended uses has to do with establishing answers to several critical questions. First of all, who are the intended users of the assessment, and what stakes do they have on the line as assessment information is gathered and acted upon? Second, what information does the test or other assessment procedure need to provide, at what levels of accuracy and specificity, within what constraints on data collection, such that intended interpretations can be supported? Third, how will distinct assessment users put test information and interpretations to use in making what kinds of decisions and/or taking what kinds of actions? Finally, how should these test-informed interpretations and uses lead to specific impacts on intended users, test takers, and other potential stakeholders in the assessment setting, and what else might happen as a result of assessment? In Norris (2008; see also Byrnes, Maxim, & Norris, 2010; Norris & Pfeiffer, 2003) I explored how quite distinct language test designs were identified and put to use as a result of the educators within a German language program systematically answering these questions. For example, whereas a reduced redundancy test (the C-test format) was selected for use as an efficient placement test, the need to demonstrate language learning outcomes according to a commonly understood metric led to the adoption of a standardized oral proficiency interview format, and the intended use of assessments within the classroom for formative/summative purposes indicated the need for a locally devised task-based writing assessment.

As this example demonstrates, applying a heuristic that interrogates key features of an assessment—in terms of intended users, interpretations, uses, and consequences—provides a critical and encompassing foundation for gauging the effectiveness of various possible language test designs in meeting specific needs for assessment. Of interest in the rest of this paper, then, is the consideration of circumstances for language assessment use that would seem to call for task-based designs.

Contributions of Task-Based Language Assessment

Clearly, there is an intimate relationship between the language assessment designs we adopt, the uses to which those assessments will be put, and the positive or negative consequences that ensue. TBLA designs offer numerous potential benefits for assessment
users (e.g., learners, teachers, decision makers of various kinds), but they also come with associated costs (e.g., development and delivery) and other challenges. Of interest here are answers to key questions about this relationship, including: (a) under which test use circumstances would task-based assessment designs seem to be called for? (b) when would it be inappropriate not to adopt some form of TBLA? and (c) how do different kinds of TBLA designs respond to specific intended uses while moderating potential challenges? In this section, I sketch out several circumstances for intended assessment use that seem to call for TBLA, and I illustrate how specific assessment designs provide better or worse fit to these intended uses and the desired consequences for assessment.

Job-Specific Language Assessment for Professional Certification

Perhaps the most clear-cut case for adopting a task-based design has to do with the use of assessments for certifying the level of language abilities or proficiency required for participating in specific professions or job types (Douglas, 2000; McNamara, 1996). Jobs for which language certification is required tend to be those in which the demands for communication are frequent and have high stakes involved (e.g., in the medical professions, aviation, translation/interpretation, law enforcement, call centers), and where language must be used in specific ways to accomplish critical job functions. The intended users of such assessments are typically the professional organizations, governmental agencies, or employers responsible for ensuring that certified individuals meet the minimum competencies expected of a given profession (e.g., the International Civil Aviation Organization). These users are also responsible for defining exactly what those minimum competencies may be, frequently in the form of published standards that delineate both the communication demands of the job (often in the form of essential task types) as well as the corresponding levels of language proficiency required for accomplishing them. Decisions made on the basis of language certification assessments indicate whether or not an individual has demonstrated sufficient capabilities in the areas covered by the standards such that successful job performance in the target language will not be hindered; thus, these assessments play at least a partial role in determining who is authorized, allowed, or licensed to participate in certain job types. Decisions made on the basis of certification assessments also have consequences, for test takers and for those responsible for certifying them, but perhaps most critically for the individuals or groups being served by the given profession. Intended consequences include providing access to the profession for linguistically qualified individuals, maintaining the reputation and value of professional organizations or agencies, and ensuring accountability of professions and professionals to their clients. Unintended consequences may range to the severe, when professionals are not linguistically able to perform their jobs, including potentially the endangerment of clients in professions where successful target-language communication is critical for ensuring their health and well-being.

Key features of an assessment design that meets these intended uses typically include: (a) sufficient sampling of the knowledge/skill competencies identified to be critical by the
profession; (b) high-fidelity replication or simulation of performance demands associated with the profession; and (c) determination and description of levels of ability deemed adequate for competent participation in the profession (see discussion in Luecht, 2016). The Canadian English Language Benchmark Assessment for Nurses (CELBAN) is a good example of how a task-based assessment design meets the demands of intended assessment uses and consequences for a job-specific language certification (see details in Centre for Canadian Language Benchmarks, 2002, 2003, 2004). The purpose of this assessment is to determine whether otherwise qualified internationally-educated nurses who speak English as a second language command a threshold of proficiency sufficient to accomplish the range of communication tasks typical of the nursing profession. The starting point for test development was a needs analysis of the communication demands encountered by nurses across a range of healthcare contexts in Canada. Results pointed not only to the language functions that nurses perform, but also the content that must be conveyed, the context of communication (including situations and participants), and the level of proficiency required for accomplishing a given task according to the Canadian Language Benchmarks (CLB; Centre for Canadian Language Benchmarks, 2000, 2012). Target tasks from this domain were then representatively sampled into a four-section assessment covering each of the four skills. Critically, each section features items and test delivery formats that replicate specific task performance contexts, participants, and criteria for accomplishment. For example, on the listening section, examinees must comprehend main ideas and details presented in video or audio recordings that contain rich contextual information, such as location (medical office, patient home, etc.) and interlocutors (patients, doctors, family members). The reading section features input from authentic sources, such as patient charts, health care articles, and doctor’s notes, emphasizing the need to comprehend specific lexico-grammatical forms in context. The writing section requires examinees to produce job-specific language accurately in filling out a form and writing a patient report. Finally, in the speaking section, the examinee interacts with two assessors in role-play and interview tasks that simulate typical interactions with patients and other health-care professionals (e.g., taking a patient history by asking a series of questions).

Without doubt, the cost and logistical challenge of developing and delivering a job-specific task-based assessment like the CELBAN is considerable—so why bother doing so, especially if other generic tests of relevant language proficiency levels are already available? For example, in Canada, the Canadian English Language Proficiency Index Program (CELPIP) and the International English Language Testing System™ (IELTS) are tests recognized by the Canadian government to indicate a test taker’s language proficiency according to the CLB. Why not simply set a high cut-score on one of these tests and assume that it reflects sufficient language ability for nursing communication purposes? According to developers of the CELBAN, the use of other tests like these was perceived to be inappropriate in that they “were not based on a Target Language Use (TLU) analysis for nursing, and the language demands (content and context) of the tests did not represent the
nursing profession” (Lewis & Kingdon, 2016, p. 70). In fact, the original impetus for the assessment came from a national survey of nursing profession stakeholders, the results of which strongly indicated that “existing assessment instruments […] were too general to adequately evaluate the ability of internationally-educated nurses to communicate effectively in the profession in Canada” (Centre for Canadian Language Benchmarks, 2002, p. 1). Similarly, another major purpose underlying the assessment was to encourage positive washback on instruction and learning, such that nurses in training “have the opportunity to develop the levels of communicative language proficiency […] needed to communicate effectively in health care in Canada” (Lewis & Kingdon, 2016, p. 74).

The core validity issue for this type of assessment is the need to ensure that professionals can use the target language in specific communication contexts in order to accomplish well-defined goals. Fitting assessments hold examinees to a standard that aligns with the expectations of a profession and its clientele, and they guide language learning towards associated communicative demands. Task-based designs, anchored in the actual discourse practices and standards of the profession, provide a much more defensible basis for certifying the job-specific language abilities of candidates than do general proficiency language tests or tests designed for assessing language use in other domains. Along these lines, in reviewing assessments used for certifying English abilities in the aviation profession, Alderson (2010) identified the widespread use of IELTS for various purposes as particularly problematic, observing that:

[T]he IELTS test was not developed for the purposes of licensing pilots or air traffic controllers, but as evidence of proficiency in English for admission to tertiary institutions where instruction takes place in English. The IELTS test is, however, also widely (mis)used for purposes of immigration and for some forms of professional recognition, including medical councils. (p. 56)

In pursuit of better fitting assessments, professional organizations, certifying agencies, and employers have pushed increasingly for the development of task-based language assessments for job-specific language certification (e.g., Elder et al., 2012; Lockwood, 2015). In some cases, there is a heightened sense of urgency in doing so due to the considerable importance of language ability on the job and the potential negative impact of not ensuring that professionals can, in fact, utilize language in specific ways. As Alderson (2010) argued, for example, “The consequences of inadequate language tests being made available to license pilots, air traffic controllers and other aviation personnel are almost too frightening to contemplate” (p. 63).

Language Proficiency Assessment for Higher Education Admissions

Assessing language proficiency to determine readiness for university-level academic study presents another interesting, if contested, case for the adoption of task-based designs.

Test users here include in particular university admissions officers and others tasked with the job of deciding which applicants should be admitted to programs of study, and, according to Eckes (in press), “One of the key questions confronting higher education admission committees is whether or not the candidates possess the level of language skills and proficiencies required for success in the academic study they apply for” (n.p.). On the opposite side of the assessment, test takers are also important stakeholders—they choose among available tests in order to present a particular representation (generally whatever seems the most favorable) of their language ability, in order to maximize their chances of gaining admission. In addition, admissions assessments are sometimes adopted as a means of educational reform. As Ockey (2017) points out, “Two important aims of a university English entrance exam are to promote effective English education and to identify the most capable English users for university studies” (p. 4). Thus, government agencies or policy makers may choose to stipulate specific assessment types or characteristics that they believe reflect the desired outcomes of language education; given the high-stakes nature of admissions decisions, the reasoning follows that such a policy decision will wash back on the teaching and learning of the target language in order to effect changes (e.g., the decision to accept only four-skills English tests for admissions purposes in Japanese universities is intended to improve the teaching of productive skills in Japanese schooling). In the case of higher education admissions testing, then, language proficiency assessments are used to (a) identify those individuals who are deemed linguistically capable of succeeding in academic programs of study, (b) demonstrate the scope and depth of academic language ability possessed by an individual student, and (c) stipulate the types and levels of language ability valued by educational institutions and policy makers.

Decisions about which assessment designs should be used for these purposes, as well as the decisions about individual test takers made on their basis, lead to high-impact consequences. Applicants’ life choices may be affected as they are justly or inappropriately admitted to or denied university study on the basis of language test scores (e.g., Deygers, Van den Branden, & Van Gorp, 2017). Universities have a vested interest in ensuring student success, and they can ill afford to admit students who are not linguistically capable of meeting the various demands of academic study in a given target language, only to see them fail. Language education providers, including public and private schools as well as the test preparation industry, also pay close attention to these high-stakes assessments, and the orientation of their teaching efforts is determined in some or substantial part by the ways in which the language proficiency of their students or clients will be assessed.

Given the high-stakes uses and potential positive and negative consequences of admissions assessment, test users and developers would be well-advised to proceed with deliberation and caution in determining what assessment types best meet a range of needs. In this regard, it is not coincidental that the most high-profile English proficiency assessments of this sort, IELTS and the Test of English as a Foreign Language™ Internet-Based Test (TOEFL iBT®), have both adopted robust, albeit distinct, task-based designs. According to its
developers (IELTS, 2012), “IELTS is a task-based test covering the four skills” (p. 5). More specifically oriented towards academic settings, the original test specification documents for the redesign of the TOEFL (Jamieson, Jones, Kirsch, Mosenthal, & Taylor, 2000) emphasized that “The test will measure examinees’ English language proficiency in situations and tasks reflective of university life” (p. 11). This commitment to a task-based approach has considerable implications for test development and delivery, in order to support the claim that the assessment provides a trustworthy indication of proficiency in the English higher education domain. The TOEFL iBT provides a useful case in point of how that commitment to TBLA has been translated into practice.

As detailed in Chapelle et al. (2008), and summarized briefly here, task-based design considerations permeated the TOEFL iBT development process. As a first step, extensive domain analyses were conducted in order to identify the listening, reading, writing, and speaking tasks typical of North American universities; these analyses included observations not only of the types of tasks that characterize classroom discourse but also other language use settings (including social and navigational contexts). A practical challenge issuing from such a thorough academic domain analysis has to do with the potentially widely varying tasks and associated content that may be at play within any given discipline or course of study, as well as the distinct task types expected of undergraduate versus graduate students. Administering tests of highly specific tasks—as is typical of professional certification language assessments—is logistically untenable and presents a likely source of bias against examinees who may be more or less familiar with the content and discourse practices of any specific academic domain. In selecting test tasks for the TOEFL iBT, then, results of domain and associated corpus analyses were used to identify prototypical task types, as well as linguistic and content specifications, that would present examinees with legitimate language performance demands deemed representative of academic language use across (rather than unique to) multiple disciplines and levels of study.

These prototypical task types were then converted into a series of innovative test tasks designed to simulate performances in the university setting. The decision to deliver the assessment in a computer-based format also facilitated the provision of important dimensions of the task context not possible in a paper-based format (inclusion of audiovisual realia, graphical and aesthetic design elements, innovative response formats within the computer screen, etc.), resulting in greater fidelity to the actual demands of language use in situ. For reading and listening sections, examinees must comprehend distinct aspects of authentic input that is reflective of the university setting, including extended texts adopted from actual academic content, classroom lectures on subjects identified in the domain analysis, and conversations among students, faculty, and other participants in the environment. For writing, examinees must demonstrate the ability to respond to input effectively, by first reading a passage and listening to a lecture on the same topic, and then discussing it comprehensively in a well-structured essay. They also write an impromptu essay to express and support their opinion on a common topic. For speaking, examinees are presented with an array of task
types selected to probe different types of performance ability: tasks that require an expression of opinion or argumentation in relation to common topics; tasks that require the integration of content from both academic reading and listening input, followed by a content-responsible discussion of the topic; and tasks that require a response to input from conversations or lectures. The introduction of integrated task types (combining multiple skills) was a key innovation in the TOEFL iBT intended to better reflect the ways in which language is actually used in university settings. Similarly, the decision to allow examinees to take notes and refer to them during performance is in keeping with common academic discourse practices.

Additional aspects of test design and delivery reflect the task-based orientation of the TOEFL iBT, including the ways in which productive skills tasks are scored by multiple human raters, the provision of detailed rubrics and section-score-level descriptors that make clear the nature of proficient task performances and associated language abilities, and the considerable volume of validity research that has investigated the relationship between TOEFL iBT test tasks, examinee performances, and the actual demands of language use among students in university settings (e.g., Plakans, 2010; Weigle, 2010). The complexities of developing and then delivering a robust task-based assessment of this sort—in a secure and unbiased manner to millions of test takers across the globe—are daunting, and the performance expectations on the near-four-hour assessment are not to be underestimated by test takers. Given that a variety of other language proficiency tests are also available and targeted for use as admissions assessments, why should test developers go to such lengths, and why might test takers and test score users value this task-based approach to assessing English for academic purposes?

In response to this question, it is perhaps most elucidating to contrast the TOEFL iBT approach to testing EAP with a very different assessment design adopted by another test provider. The Duolingo English Test (DET) has been much touted and heavily marketed recently by its developers as an alternative to the TOEFL iBT and other EAP admissions tests; it is specifically recommended for use as a university language admissions assessment (Brenzel & Settles, 2017). The DET is a smartphone-delivered assessment that requires approximately 20 minutes to complete. The test consists of four item types: (a) a vocabulary section where the examinee must identify whether each word is an actual word in English or not; (b) a listening section, or dictation, where examinees hear a series of sentences and try to type exactly what they heard; (c) a sentence completion section, or cloze, where examinees select the appropriate word to complete a series of sentences; and (d) a read aloud section where the examinee reads a series of written sentences aloud. In stark contrast with the task-based EAP orientation and design of the TOEFL iBT, test ‘tasks’ in the DET provide at best an extremely limited indication of an examinee’s language proficiency, doing so on the basis of inauthentic exercises that bear no resemblance to how English is used in academic settings and make no effort to reflect academic content or context, discourse practices, or associated linguistic challenges. In an objective review of the DET by two language testing experts, Wagner and Kunnan (2015) summarized this major problem as follows: “This gap
between the language (and task) characteristics in the target language use domain (of academic tasks similar to tasks in university courses) and the characteristics of the language and tasks used in the test is a fundamental shortcoming of the test” (p. 326).

What problems might ensue if test users were to adopt an assessment like the DET, an assessment that does not test productive skills at all, that does not test examinees’ abilities to process or produce extended texts (beyond the word or sentence level) at all, a test that explicitly does not attempt to represent the task types found in academic discourse communities at all? First, given the dramatic under-representation of a comprehensive EAP construct, there is no test design basis whatsoever to support the interpretation of test scores as indicators of examinees’ abilities to actually perform the types of tasks encountered in academic settings—the DET clearly does not assess that at all. By contrast, it is precisely this quality of the TOEFL iBT that was highlighted as a major innovation by Alderson (2009) in his objective review of that assessment:

On the whole, TOEFL iBT has achieved this, with a clearer focus on the academic environment, based on research into the language of academic tasks, careful prototyping, trialling, revisions, more trialling, and so on. The inclusion of a compulsory speaking section, the integration of skills in numerous tasks, the use of longer written and spoken texts which are more obviously authentic and academic in nature, and have less obvious bias towards a North American setting, a radical reduction of focus on grammar: all these innovations and more are welcome. (p. 627)

It is also worth emphasizing that no degree of correlation between the DET and other EAP assessments, like the TOEFL iBT or IELTS, can resolve this problem of the DET’s construct under-representation. As all students of language testing learn in their first course, correlation does not imply causality. Moderate, or even high, correlations between two measurement scores never mean that they measure the same phenomenon, nor that they provide equivalent scope or depth of information about a given test taker (for numerous examples of incorrect interpretations based on correlations, see Vigen, 2015). For a straightforward example of this problem, if we examine the correlation between average weight and height in the entire human population, results indicate a very strong relationship, on the order of r = .95. Height and weight are, on average, closely related. However, stating that a person “weighs six feet” or is “eighty kilograms tall” is absurd. If we want to know a given individual’s weight, we need to use a device that measures weight, not height. Similarly, if we want to know a given individual’s ability to accomplish communication tasks typical of an English academic environment, we should use an assessment that actually measures that.
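
A minimal simulation sketch makes the same point numerically; the variable names and parameter values below are illustrative assumptions, not data from any real population study or from any of the assessments discussed here.

```python
# Illustrative sketch only: simulated 'height' and 'weight' scores that correlate
# strongly, yet reading one construct off the other still produces sizable errors
# for individuals. All parameter values are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
height_cm = rng.normal(170, 9, n)                               # one construct
weight_kg = 70 + 0.9 * (height_cm - 170) + rng.normal(0, 8, n)  # related but distinct construct

r = np.corrcoef(height_cm, weight_kg)[0, 1]
print(f"correlation r = {r:.2f}")                               # strong (about .70 with these values)

# Predict weight from height with a simple linear fit, then check the individual-level error.
slope, intercept = np.polyfit(height_cm, weight_kg, 1)
residuals = weight_kg - (slope * height_cm + intercept)
print(f"typical error when 'measuring' weight via height: {residuals.std():.1f} kg")
```

Even with a strong correlation, the individual-level error remains large; by the same logic, a correlation with scores from a task-based EAP assessment does not make a non-task-based score a measure of academic task performance.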

Second, the potential negative consequences for test takers and university admissions decision makers are substantial if they adopt a test like the DET: without a clear, direct, and empirically substantiated relationship between the test design and the use of English in
academic settings, the potential for mismatch between test takers’ actual language abilities and the demands of EAP tasks is considerable. Test takers and decision makers face the fundamental uncertainty of how a score on a test like the DET reflects any of the actual language use expectations in the academic environment, and the ensuing reality that inaccurate decisions will be made at the expense of both the test takers and the institutions. It is for this reason that Wagner and Kunnan (2015) concluded “the DET seems woefully inadequate as a measure of a test taker’s academic English proficiency or for high-stakes university admissions purposes” (p. 330).

Third, where assessments are adopted with specific kinds of washback in mind—as is increasingly presumed to be a responsibility of test providers—selecting a non-task-based test like the DET could result in negative impact on a variety of stakeholders. As noted by Wagner and Kunnan (2015), the DET design conveys an outdated representation of language ability as a set of discrete, indirect capacities, rather than the ability to marshal one’s linguistic repertoires into functional communicative competence. This overt focus—perhaps reflecting an outright disdain for contemporary notions of language ability—has the potential to mislead language learners and teachers by directing them away from pedagogies that support the development of language ability for actual use. Preparing for the test items found on the DET will also look considerably different from preparing for a legitimately task-based EAP assessment, and will be fundamentally at odds with recommended practices for developing language competence (see, e.g., Norris, Davis, & Timpe-Laughlin, 2017). By contrast, one of the express intentions of the design of TOEFL iBT was to impact language teaching in positive ways, specifically by encouraging language instruction that balanced attention to the four skills and that encouraged a focus on language use for actual communicative purposes as these are realized in academic environments. As noted by Alderson (2009), multi-year investigations of the washback effect of TOEFL iBT have indicated precisely that intended consequence in a variety of contexts, with instruction focusing more on productive skills, integrated task types, and other features of language use tasks within EAP settings (e.g., Wall & Horák, 2006, 2008).

Making Good Design Decisions Through Task-Based Language Assessment

As illustrated by these two examples of high-stakes assessment uses and associated consequences, TBLA designs align in convincing ways with certain purposes for language assessment, once these are considered thoroughly and deliberately. Where assessments are required to support interpretations about L2 users’ abilities to communicate effectively in specific contexts, to accomplish specific goals, to integrate their language knowledge and skills in successful performance—these uses for assessment call for TBLA. Where assessments are intended to (or simply will, as a result of their importance) guide or otherwise influence language education, such that learners actually develop the ability to use the L2 in meaningful ways, TBLA is needed. Fundamentally, where assessments should reflect widely held values for language acquisition—as a critical type of human capital, the development of
which enables access to participation in modern society—TBLA is essential. It is no coincidence that major representations of the valued outcomes of language learning, such as the Common European Framework of Reference (Council of Europe, 2001) and the Canadian Language Benchmarks (Centre for Canadian Language Benchmarks, 2012), feature tasks prominently as the unit of analysis for encapsulating language proficiency. Assessments that do not likewise feature communication tasks in their designs will inevitably fall short when the intent is to reflect such values (and no amount of supposed ‘alignment’ to the CEFR scale can resolve that problem).

Given this link between tasks and contemporary values for language learning, it should also be apparent that TBLA designs can and should play a prominent role in assessments used for classroom and educational program purposes. While details are beyond the scope of this paper, it is worth noting the variety of needs met by TBLA designs in relation to teaching and learning, including: (a) achievement testing that emphasizes learning outcomes in the form of ability to use the target language to accomplish real-world goals (e.g., Fischer, Chouissa, Dugovičová, & Virkkunen-Fullenwider, 2011); (b) formative assessment that seeks opportunities to raise awareness and provide rich feedback (to learners, teachers) about features of learner language development en route to communicative competence (e.g., Byrnes, 2002; Weaver, 2013); and (c) alignment of assessment practices with curriculum and instruction, where all of these educational components support a common target of L2 ability for use (e.g., Adair-Hauck, Glisan, Koda, Swender, & Sandrock, 2006; Byrnes et al., 2010). Also of interest is the notion that certain aspects of language proficiency development are likely only amenable to assessment via task-based designs, for example learners’ abilities to integrate sociopragmatic and pragmalinguistic knowledge in accomplishing highly context-dependent target tasks (see Timpe-Laughlin, 2018).

Of course, designing task-based assessments that can effectively respond to such an array of intended uses and consequences is not without its challenges. A major concern for implementing TBLA within classrooms and programs has to do with the logistics of collecting extended task performances from all learners, rating or scoring those performances on the basis of meaningful criteria, and providing feedback in ways that will guide learning. Technological developments and affordances may offer some solutions to these challenges, as our capacities for creating meaningful task performance environments increase. For example, complex interactive scenarios may be designed and delivered via computer, such that learners are: (a) virtually embedded within a simulated environment that calls for language use to accomplish meaningful tasks; (b) provided with authentic input of various kinds to supply essential communicative context; (c) connected with interlocutors, to provide audiences for communication; (d) supported in their performances with scaffolds (e.g., guiding questions, replay of audio/video); and (e) recorded as they accomplish tasks (e.g., video, audio, writing, on-screen behaviors; see one example in Wolf, Lopez, Oh, & Tsutagawa, 2017). Providing such technology-mediated task-based assessments may alleviate a heavy burden from teachers, who otherwise would typically assume responsibility for all of these steps; teachers, then, can turn their attention to observing performances, grading, providing feedback, and pursuing
related learning-oriented activities. It is also the case that certain aspects of task-based language performance are becoming increasingly amenable to automated scoring and feedback by the computer. Advances in natural language processing, speech recognition, and writing evaluation already enable the use of automated scoring of writing and speaking tasks for certain assessment purposes (e.g., Burstein, Tetreault, & Madnani, 2013; Chen et al., 2018), and automated formative feedback related to linguistic features of task performances is certainly feasible (e.g., Hegelheimer & Heift, 2017). Of course, there will continue to be a good argument for maintaining a human dimension to task-based assessments for certain purposes, especially in the case of scoring, where communication with other humans is the targeted construct and where humans remain best capable of discerning learners’ abilities to do so (e.g., in the case of TOEFL iBT speaking and writing tasks).
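
To make the scenario components (a) through (e) above more concrete, the sketch below shows one hypothetical way such a technology-mediated task might be specified as a simple data structure; the field names and example values are assumptions for illustration, not a description of any existing assessment system.

```python
# Hypothetical sketch of a specification for a technology-mediated scenario task.
# Field names and example values are illustrative assumptions only.
from dataclasses import dataclass
from typing import List

@dataclass
class ScenarioTask:
    goal: str                     # the valued outcome the learner must accomplish
    setting: str                  # (a) simulated environment that frames language use
    input_materials: List[str]    # (b) authentic input supplying communicative context
    interlocutors: List[str]      # (c) audiences for communication
    scaffolds: List[str]          # (d) supports such as guiding questions or replay
    captured_evidence: List[str]  # (e) recordings collected for scoring and feedback

# Example instance, loosely inspired by the nursing context discussed earlier.
handover_task = ScenarioTask(
    goal="brief an incoming nurse on a patient's current status",
    setting="hospital ward at shift change (simulated)",
    input_materials=["patient chart", "voicemail from the attending physician"],
    interlocutors=["incoming nurse (simulated interlocutor)"],
    scaffolds=["guiding questions", "one replay of the voicemail"],
    captured_evidence=["audio recording of the spoken handover", "completed handover form"],
)
print(handover_task.goal)
```

A specification of this kind could then drive delivery, recording, and routing of performances to human raters or automated scoring, in line with the division of labor described above.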

Conclusion

To be clear, TBLA is not appropriate for all intended uses of language assessment, and developing and delivering effective TBLA can be demanding. There may be many circumstances where the costs and challenges of engaging in task-based assessments outweigh apparent benefits, and other alternatives in assessment are at times better suited to the specific purposes of test users. The recommendation here is not that all language testing should be task-based; rather, it is that language assessment designs should be adopted in recognition of the full scope of intended uses and consequences that define their purposes in the first place. Often, in language education settings that reflect contemporary societal values attributed to language learning, or in those contexts where decisions need to be made about what learners can do in the target language, task-based designs do seem best suited to enabling the interpretations, uses, and consequences sought. It also seems apparent that, in certain circumstances, adopting non-task-based designs may pose the possibility of real harm to test takers, to score users, and to others. Both test developers and test users are responsible for critically evaluating the alignment, or lack thereof, between the assessments they design or adopt, the ways in which those assessments are actually used, and the real consequences that ensue.

Acknowledgements

This paper is a version of a plenary address delivered at the Japan Language Testing Association annual conference in September 2017. I am indebted to JLTA for the invitation to share my ideas about task-based language assessment at their annual conference. I also appreciate the support provided by Emiko Kaneko to make my attendance at the conference possible.

References

Adair-Hauck, B., Glisan, E., Koda, K., Swender, E., & Sandrock, P. (2006). The Integrated Performance Assessment (IPA): Connecting assessment to instruction and learning. Foreign Language Annals, 39, 359–382. doi:10.1111/j.1944-9720.2006.tb02894.x

Alderson, J. C. (2009). Test review: Test of English as a Foreign Language™: Internet-based Test (TOEFL iBT®). Language Testing, 26, 621–631. doi:10.1177/0265532209346371

Alderson, J. C. (2010). A survey of aviation English tests. Language Testing, 27, 51–72. doi:10.1177/0265532209347196

Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2, 1–34. doi:10.1207/s15434311laq0201_1

Bennett, R. E. (2010). Cognitively Based Assessment of, for, and as Learning: A preliminary theory of action for summative and formative assessment. Measurement: Interdisciplinary Research and Perspectives, 8, 70–91. doi:10.1080/15366367.2010.508686

Borsboom, D., Cramer, A. O. J., Kievit, R. A., Scholten, A. Z., & Franić, S. (2009). The end of construct validity. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 135–170). Charlotte, NC: IAP Information Age Publishing.

Brenzel, J., & Settles, B. (2017). The Duolingo English Test: Design, validity, and value. DET Whitepaper. Retrieved from https://englishtest.duolingo.com/resources

Burstein, J., Tetreault, J., & Madnani, N. (2013). The e-rater automated essay scoring system. In M. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 55–67). New York, NY: Routledge.

Byrnes, H. (2002). The role of task and task-based assessment in a content-oriented collegiate foreign language curriculum. Language Testing, 19, 419–437. doi:10.1191/0265532202lt238oa

Byrnes, H., Maxim, H., & Norris, J. M. (2010). Realizing advanced FL writing development in collegiate education: Curricular design, pedagogy, assessment. The Modern Language Journal, Monograph. Cambridge, MA: Wiley-Blackwell.

Centre for Canadian Language Benchmarks. (2000). CLB 2000: Theoretical framework. Ottawa, Canada: Author.

Centre for Canadian Language Benchmarks. (2002). Phase I: Benchmarking the English language demands of the nursing profession across Canada. Ottawa, Canada: Author. Retrieved from http://blogs.rrc.ca/ar/the-canadian-english-language-benchmark-assessment-for-nurses-celban

Centre for Canadian Language Benchmarks. (2003). Phase II: The development of CELBAN (Canadian English Language Benchmark Assessment for Nurses): A nursing-specific language assessment tool. Ottawa, Canada: Author. Retrieved from http://blogs.rrc.ca/ar/the-canadian-english-language-benchmark-assessment-for-nurses-celban

Centre for Canadian Language Benchmarks. (2004). Phase III: Implementation of CELBAN January–June 2004. Final report. Ottawa, Canada: Author. Retrieved from http://blogs.rrc.ca/ar/the-canadian-english-language-benchmark-assessment-for-nurses-celban

Centre for Canadian Language Benchmarks. (2012). Canadian language benchmarks: English as a second language for adults. Ottawa, Canada: Author. Retrieved from http://www.cic.gc.ca/english/pdf/pub/language-benchmarks.pdf

Chapelle, C. A. (2012). Validity argument for language assessment: The framework is simple…. Language Testing, 29, 19–27. doi:10.1177/0265532211417211

Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (Eds.). (2008). Building a validity argument for the Test of English as a Foreign Language. New York, NY: Routledge.

Chen, L., Zechner, K., Yoon, S. Y., Evanini, K., Wang, X., Loukina, A., ... Mundkowsky, R. (2018). Automated scoring of nonnative speech using the SpeechRater℠ v. 5.0 engine. ETS Research Report Series, RR-18-10.

Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge, UK: Cambridge University Press.

Cronbach, L. J. (1980). Validity on parole: How can we go straight? In W. B. Schrader (Ed.), New directions for testing and measurement: Measuring achievement, progress over a decade: No. 5 (pp. 99–108). San Francisco, CA: Jossey-Bass.

Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (pp. 621–694). Washington, DC: American Council on Education.

Deygers, B., Van den Branden, K., & Van Gorp, K. (2017). University entrance language tests: A matter of justice. Language Testing. Advance online publication. doi:10.1177/0265532217706196

Douglas, D. (2000). Assessing languages for specific purposes. Cambridge, UK: Cambridge University Press.

Eckes, T. (in press). Language proficiency assessments in college admissions. In M. E. Oliveri & C. Wendler (Eds.), Higher education admission and placement practices: An international perspective. Cambridge, UK: Cambridge University Press.

Elder, C., Pill, J., Woodward-Kron, R., McNamara, T., Manias, E., Webb, G., & McColl, G. (2012). Health professionals’ views of communication: Implications for assessing performance on a health-specific English language test. TESOL Quarterly, 46, 409–419. doi:10.1002/tesq.26

Fischer, J., Chouissa, C., Dugovičová, S., & Virkkunen-Fullenwider, A. (2011). Guidelines for task-based university language testing. Graz, Austria: European Centre for Modern Languages.

Hegelheimer, V., & Heift, T. (2017). Computer-assisted corrective feedback and language learning. In H. Nassaji & E. Kartchava (Eds.), Corrective feedback in second language teaching and learning (pp. 67–81). New York, NY: Routledge.

International English Language Testing System (IELTS). (2012). IELTS guide for teachers. Manchester, England: British Council.

Jamieson, J., Jones, S., Kirsch, I., Mosenthal, P., & Taylor, C. (2000). TOEFL 2000 framework. Princeton, NJ: Educational Testing Service.

Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: Greenwood Publishing.

Kane, M. (2012). Validating score interpretations and uses. Language Testing, 29, 3–17. doi:10.1177/0265532211417210

Lewis, C., & Kingdon, B. (2016). CELBAN™: A ten-year retrospective. TESL Canada Journal, 33, 69–82. doi:10.18806/tesl.v33i2.1238

Lockwood, J. (2015). Language for specific purpose (LSP) performance assessment in Asian call centres: Strong and weak definitions. Language Testing in Asia, 5(3). doi:10.1186/s40468-014-0009-6

Luecht, R. (2016). Professional certification and licensure examinations. In A. Rupp & J. Leighton (Eds.), The Wiley handbook of cognition and assessment: Frameworks, methodologies, and applications (pp. 446–471). Malden, MA: John Wiley & Sons.

McNamara, T. (1996). Measuring second language performance. New York, NY: Longman.

Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11. doi:10.2307/1175249

Norris, J. M. (2000). Purposeful language assessment. English Teaching Forum, 38(1), 18–23.

Norris, J. M. (2008). Validity evaluation in language assessment. New York, NY: Peter Lang.

Norris, J. M. (2009). Task-based teaching and testing. In M. Long & C. Doughty (Eds.), Handbook of language teaching (pp. 578–594). Malden, MA: Wiley-Blackwell.

Norris, J. M. (2016). Current uses for task-based language assessment. Annual Review of Applied Linguistics, 36, 230–244. doi:10.1017/S0267190516000027

Norris, J. M., Davis, J., & Timpe-Laughlin, V. (2017). Second language educational experiences for adult learners. New York, NY: Routledge.

Norris, J. M., & Ortega, L. (2012). Assessing learner knowledge. In S. M. Gass & A. Mackey (Eds.), The Routledge handbook of second language acquisition (pp. 573–589). New York, NY: Routledge.

Norris, J. M., & Pfeiffer, P. (2003). Exploring the use and usefulness of ACTFL Guidelines oral proficiency ratings in college foreign language departments. Foreign Language Annals, 36, 572–581. doi:10.1111/j.1944-9720.2003.tb02147.x

Ockey, G. J. (2017). Approaches and challenges to assessing oral communication on Japanese entrance exams. JLTA Journal, 20, 3–14. doi:10.20622/jltajournal.20.0_3

Plakans, L. (2010). Independent vs. integrated writing tasks: A comparison of task representation. TESOL Quarterly, 44, 185–194. doi:10.5054/tq.2010.215251

Timpe-Laughlin, V. (2018). Pragmatics in task-based language assessment: Opportunities and challenges. In N. Taguchi & Y. Kim (Eds.), Task-based approaches to teaching and assessing pragmatics (pp. 287–304). Amsterdam, The Netherlands: John Benjamins.

Vigen, T. (2015). Spurious correlations. New York, NY: Hachette Books.

Wagner, E., & Kunnan, A. J. (2015). The Duolingo English Test. Language Assessment Quarterly, 12, 320–331. doi:10.1080/15434303.2015.1061530

Wall, D., & Horák, T. (2006). The impact of changes in the TOEFL examination on teaching and learning in Central and Eastern Europe: Phase 1, the baseline study. ETS Research Report Series, MS-34.

Wall, D., & Horák, T. (2008). The impact of changes in the TOEFL examination on teaching and learning in Central and Eastern Europe: Phase 2, Coping with change. ETS Research Report Series, TOEFLiBT-05.

Weaver, C. (2013). Incorporating a formative assessment cycle into task-based language teaching. In A. Shehadeh & C. Coombe (Eds.), Researching and implementing task-based language learning and teaching in EFL contexts (pp. 287–312). Amsterdam, The Netherlands: John Benjamins.

Weigle, S. C. (2010). Validation of automated scores of TOEFL iBT tasks against non-test indicators of writing ability. Language Testing, 27, 335–353. doi:10.1177/0265532210364406

Wiggins, G. (1998). Educative assessment: Designing assessments to inform and improve student performance. San Francisco, CA: Jossey-Bass.

Wolf, M. K., Lopez, A., Oh, S., & Tsutagawa, F. S. (2017). Comparing the performance of young English language learners and native English speakers on speaking assessment tasks. In M. Wolf & Y. Butler (Eds.), English language proficiency assessments for young learners (pp. 171–190). New York, NY: Routledge.