
Language Assessment:  Opportunities and Challenges

Lyle F. Bachman

UCLA

[email protected]

In the past three decades the field of language testing has matured in both the breadth of research questions it addresses and in the range of research methods at its disposal for addressing these questions and issues.  We still grapple with the nature of language ability and the validity of the inferences we make on the basis of assessment results.  However, the field is also addressing the difficult questions about how and why language assessments are used, the societal values that underlie such use, the consequences of assessment use, and the ethical responsibilities of test developers and users.  Largely because of the increasing worldwide demand for accountability in K-12 education, where there are huge and growing numbers of students whose native language is not the language of instruction, and the growing need in the United States for individuals with high levels of foreign language proficiency, the greatest challenges that language assessment as a field now faces are in the arenas in which language tests are used to make decisions about individuals and institutions.

The immediate and long-term prospects for language assessment are filled with opportunities and challenges.  There is a huge demand worldwide for greater involvement of individuals with expertise in language testing in the areas of classroom and accountability assessment.  The assessment requirements of No Child Left Behind (NCLB) in the United States, for example, have created increased demand for useful assessments.  Recent initiatives of the United States government to increase our nation’s capacity in foreign languages will also require useful assessments of foreign languages, particularly the less commonly taught languages.  Similar demands for the involvement of individuals with expertise in language assessment can be found in countries around the globe.  Turning these challenges into accomplishments will depend upon the willingness and capability of language testers to apply the knowledge and skills acquired over the past half century to the urgent practical assessment needs of our education systems and societies.

In this paper, I will list very briefly some ways in which I believe the field of language testing has matured over the past 30 years, and what some of the issues of continuing concern are.  I will then briefly describe an “assessment use argument” as a conceptual framework for problematizing many of these issues and for providing a principled basis for bringing together the rich diversity of research approaches at our disposal in order to investigate them empirically.  I will then mention some of the challenges and opportunities that face language testers in the 21st century.

 

The Past 30 Years:  Some Kudos

 

In an overview article 7 years ago, Bachman (2000) noted several areas of development in the field of language testing:  a widening scope of inquiry, greater methodological diversity and sophistication, and advances in practice.  As I believe these areas are still relevant, I will simply list these areas with some of Bachman’s references, as well as some more recent ones.

Widening Scope of Inquiry

Language assessment research has widened its scope of inquiry in a number of ways.  It has broadened its view of language ability, has come to recognize the variety and complexity of factors other than language ability that affect test performance, has engaged in a deepening conversation with researchers in SLA, has taken seriously the consequences of assessment use and issues of ethics and professionalism, and has become more deeply involved in issues of language assessment in schools and classrooms.

·         Nature of language ability:  Move from a dominant view of language ability/proficiency as a unitary or global ability (e.g., Lowe, 1985; Oller, 1979) to a view that language ability is multicomponential (e.g., Bachman & Palmer, 1996; Canale, 1983; Oller, 1983).  (See also the references under Issues of Continuing Concern.)

·         Factors that affect performance on language tests:  Increased interest in and understanding of factors other than language ability (e.g., Anderson, Bachman, Cohen, & Perkins, 1991; Clapham, 1996; Cohen, 2007; Kobayashi, 2002; Lumley & O'Sullivan, 2005; Sasaki, 1996; Song & Cheng, 2006)

·         Closer contact with SLA research issues (e.g., Bachman, 1989; Bachman & Cohen, 1998; Douglas & Selinker, 1985; Kunnan & Lakshamanan, 2006; Wigglesworth, 2001)

·         Impact of language assessment on instructional practice (“washback”) (e.g., Alderson & Wall, 1993, 1996; Cheng, 1997; Cheng, Watanabe, & Curtis, 2004; Green, 2007; Wall, 1993, p. 299)

·         Issues of ethics and professionalism in language testing:  Move from little consideration of ethical issues to a concern for such issues as central to the field (e.g., Boyd & Davies, 2002; Davies, 1997; Shohamy, 1997a, 1997b; Stansfield, 1993).  (See also the references under Issues of Continuing Concern.)

·         Increased involvement with K-12 and classroom language assessment:  Move from virtually no interest in school-based or classroom assessment to a growing interest and body of research and practice in this area (e.g., Ke, 2006; Leung, 2004; Rea-Dickins, 2000, 2004)

 

Greater Methodological Diversity and Sophistication

Language testing researchers now routinely employ both quantitative and qualitative methodologies, both in the development of practical language assessments and in basic research.  Some methodological approaches that were either nonexistent or barely used 30 years ago have become standard, mainstream tools for language assessment research and practice.  In addition, language assessment researchers are increasingly finding that the use of mixed methods can greatly enhance the relevance and significance of our research.


·         Quantitative approaches:  Criterion-referenced measurement (e.g., Brown, 1989; Brown & Hudson, 2002; Hudson, 1991; Lynch & Davidson, 1994); Generalizability theory (e.g., Bachman, Lynch, & Mason, 1995; Bolus, Hinofotis, & Bailey, 1982; Kunnan, 1992; Schoonen, 2005; Stansfield & Kenyon, 1992); Item-response theory (e.g., Bonk & Ockey, 2003; Choi & Bachman, 1992; Henning, 1984, 1992; McNamara, 1990; O'Loughlin, 2002; Weigle, 1994); Structural equation modeling (e.g., Bachman & Palmer, 1981; Choi, Kim, & Boo, 2003; Kunnan, 1998; Shin, 2005; Xi, 2005)

·         Qualitative approaches:  Conversation/discourse analysis (e.g., Brown, 2003; Huhta, Kalaja, & Pitkanen-Huhta, 2006; Lazaraton, 1996, 2002; Swain, 2001; van Lier, 1989); Verbal protocol analysis (e.g., Buck, 1991; Cohen, 1984; Lumley, 2002; Uiterwijk & Vallen, 2005)

·         Mixed methods (e.g., Anderson et al., 1991; Brown, 2003; Clapham, 1996; North, 2000; O'Loughlin, 2001; Sasaki, 1996; Uiterwijk & Vallen, 2005; Weigle, 1994)

 

Advances in Practice

The past 30 years have also seen advances in language assessment practice in several areas.

·         Cross-cultural pragmatics (e.g., Hudson, 1993; papers in Hudson & Brown, 2001; Hudson, Detmer, & Brown, 1992; Roever, 2006; Yamashita, 1996)

·         Languages for specific purposes (e.g., Douglas, 2000; Hamp-Lyons & Lumley, 2001; Skehan, 1984; Weir, 1983)

·         Vocabulary (e.g., Laufer & Nation, 1999; Meara & Buxton, 1987; Read, 1993, 2000; Read & Chapelle, 2001)

·         Computer/web-based language assessment (e.g., Alderson & Windeatt, 1991; Chalhoub-Deville, 1999; Chapelle, 1997; Chapelle & Douglas, 2006; Hicks, 1986)

 

Issues of Continuing Concern

Nature of Language Ability

One major area of inquiry continues to be the nature of language ability.  The dominant view in the field continues to be that language ability consists of a number of interrelated areas, such as grammatical knowledge, textual knowledge, and pragmatic knowledge, and that these areas of language knowledge are managed by a set of metacognitive strategies that also determine how language ability is realized in language use or the situated negotiation of meaning (Bachman, 1990; Bachman & Palmer, 1996; Chapelle, 1998, 2006).  Recently, however, researchers who focus more closely on the nature of the interactions in language use have argued that the view of language ability as solely a cognitive attribute of language users ignores the essentially social nature of the interactions that take place in discourse.  These researchers argue that language ability resides in the contextualized interactions or discursive practices that characterize language use (e.g., Chalhoub-Deville, 1995, 2003; Chalhoub-Deville & Deville, 2005; McNamara, 1997, 2003; Young, 2000).  In a critical review of this debate, Bachman (2007) identified three different approaches to defining language ability: (a) ability-focused, (b) task-focused, and (c) interaction-focused.  He concluded that the theoretical issues raised by these different approaches to defining the construct—language ability—present challenging questions both for empirical research in language testing and for practical test design, development, and use.  For language testing research, these questions imply the need for a much broader methodological approach, involving both so-called quantitative and qualitative perspectives.  For language testing practice, they imply that a focus on ability, task, or interaction, to the exclusion of the others, will lead to weaknesses in the assessment itself, or to limitations on the uses for which the assessment is appropriate.

A closely related issue is that of the extent to which language ability includes topical knowledge.  The effect of test takers’ topical or content knowledge on language test performance is well documented in the language assessment literature (e.g., Alderson & Urquhart, 1985; Clapham, 1996; Douglas & Selinker, 1993; Pappajohn, 1999), and the dominant view has been that this is a source of bias in language tests.[1]  That is, it is either generally assumed or specifically stated, in designing a language test and interpreting scores from such a test, that “language knowledge” or “language ability” is what we want to assess, and not test takers’ content knowledge.  An alternative, or perhaps complementary, view has been articulated in the area of languages for specific purposes (LSP) assessment.  According to this view, what we want to assess is what Douglas (2000) has called “specific purpose language ability,” which is a combination of language ability and background knowledge.  Davies (2001) has argued that LSP assessment has no theoretical basis, but can be justified largely on pragmatic grounds.  Bachman and Palmer (1996) argued that whether one includes topical knowledge as part of the construct to be assessed in a language test is essentially a function of the specific purpose for which the test is intended and the levels of topical knowledge that the test developer can assume test takers have.

 

Uses of Language Assessments

Although validity and validation continue to be a major area of focus in language assessment research (e.g., Bachman, 2005; Chapelle et al., 2004), this is no longer the sole, or even the dominant, concern of the field.  Language testers are investigating the difficult questions about how and why language assessments are used, the ethical responsibilities of test developers and users (e.g., Bishop, 2004; Boyd & Davies, 2002; Davies, 2004; McNamara, 1998, 2001), fairness in language assessment (e.g., Elder, 1997; Kunnan, 2000, 2004), the impact and consequences of assessment use (e.g., Hawkey, 2006; Shohamy, 2001), particularly on instructional practice (e.g., Alderson & Wall, 1993; Bailey, 1996; Cheng, 1997; Cheng, Watanabe, & Curtis, 2004; Qi, 2005; Wall, 1996, 2005), and the societal values and larger sociocultural contexts that underlie such use (e.g., McNamara & Roever, 2006).  What I find extremely encouraging is that these two strands of research and concern are coming together in a growing body of research that investigates both the validity of score interpretations and the consequences of assessment use (e.g., Bachman, 2005, 2006; papers in Kunnan, 2000; Reath, 2004).

Differing Epistemologies


McNamara (2006) argued that two distinct epistemologies, “quantitative” and “qualitative,” have evolved in the field, and that the vigorous debate these have spurred is healthy for the field.  This debate reflects the larger historical debate that has engaged researchers in applied linguistics, education, and the social sciences for decades.  Bachman (2006) pointed out that many characterizations of these differences are overly simplistic and described them not as holistic methodologies, but in terms of several different dimensions.

An ongoing critical examination of the epistemological foundations of our research approaches is, as McNamara and Bachman have argued, essential to the vitality of our field.  (See, for example, the papers in Chalhoub-Deville, Chapelle, & Duff, 2006.)  To facilitate such a critical discourse, I believe that we need an epistemology that provides a principled approach to addressing our concerns with both validity and consequences, using whatever research approaches and tools are appropriate and at our disposal.  As noted previously, the arsenal of methodological approaches to language assessment research is considerable.  What, until recently, has been lacking is a principled basis for linking our concerns with validity and consequences in a way that provides a rationale for combining qualitative and quantitative approaches to research.  In my view, an “assessment use argument” (AUA), as described by Bachman (2005) and Bachman and Palmer (in press), provides such a basis.

Assessment Use Argument

Drawing on argument-based approaches to validity in educational measurement (e.g., Kane, 2001; Kane, Crooks, & Cohen, 1999; Mislevy, Steinberg, & Almond, 2002), Bachman (2005, 2006) has described what he calls an “assessment use argument” as a conceptual framework for linking inferences from assessment performance to interpretation and use.  Bachman and Palmer (in press) elaborate on this, describing an AUA as a series of data-claim links, based on Toulmin’s (2003) structure of practical reasoning.  An AUA explicitly states the interpretations and decisions that are to be based on assessment performance as well as the consequences of using an assessment and of the decisions that are made.  Bachman and Palmer argue that an AUA provides an overarching inferential framework to guide the design and development of language assessments and the interpretation and use of language assessment results.

An AUA consists of a series of claims which can be illustrated as in Figure 1:

 

Figure 1.


The arrows between the rectangles go both ways to illustrate that the claims, which may also be stated as questions, serve as a guide both for test development and for the interpretation and use of assessment results.  In using an AUA for designing and developing an assessment, the developer would first ask what the consequences of using the assessment might be, and the extent to which these will be beneficial to stakeholders.  Then he or she would consider the decisions to be made and whether these are sensitive to existing community values[2] and are equitable with respect to different groups of stakeholders.  Then the developer would consider the interpretations that are needed to make the intended decisions, and the extent to which these will be:

meaningful with respect to a general theory of language ability or a particular learning syllabus,

impartial to all groups of test takers,

generalizable to the intended target language use domain,

relevant to the decision to be made, and

sufficient for the decision to be made. (Bachman & Palmer, in press)

Finally, the test developer would consider how the assessment results (i.e., scores or descriptions) will be reported, and how to ensure that these are consistent across different aspects of the measurement procedure (e.g., items, tasks, raters, forms).

In interpreting the results of an assessment, the assessment user would consider the inferences that are based on test takers’ performance.  He or she would consider the consistency of the assessment report, the meaningfulness, impartiality, generalizability, relevance, and sufficiency of the interpretation, and so on.
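Although Bachman describes the AUA in purely conceptual terms, the two-way chain of claims may be easier to see when it is written out as a simple data structure.  The sketch below (in Python) is my own illustration, not part of Bachman and Palmer’s framework: the class name, field names, and example claim statements are hypothetical, and they merely paraphrase the qualities listed above.

```python
# A loose, illustrative rendering of the AUA claim chain described above.
# Names and example statements are hypothetical, not Bachman & Palmer's own.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    label: str                                           # e.g., "Consequences", "Decisions"
    statement: str                                       # the claim, which may also be phrased as a question
    warrants: List[str] = field(default_factory=list)    # propositions justifying the inference to this claim
    backing: List[str] = field(default_factory=list)     # evidence supporting those warrants

# The ordered chain of claims in an AUA, from consequences down to assessment reports.
aua_chain = [
    Claim("Consequences", "Using the assessment will be beneficial to stakeholders."),
    Claim("Decisions", "Decisions are sensitive to community values and equitable across stakeholder groups."),
    Claim("Interpretations", "Interpretations are meaningful, impartial, generalizable, relevant, and sufficient."),
    Claim("Assessment reports", "Scores or descriptions are consistent across items, tasks, raters, and forms."),
]

# In design, the developer works from consequences toward assessment reports;
# in interpretation and use, the user reasons from reports back toward consequences.
design_order = aua_chain
interpretation_order = list(reversed(aua_chain))

for claim in design_order:
    print(f"{claim.label}: {claim.statement}")
```

Reading the same chain in one direction for design and in the other for interpretation and use is simply another way of picturing the two-way arrows in Figure 1.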

While the claims of an AUA constitute the conceptualization that is needed either to design an assessment or to interpret and use the results of an assessment, these claims need to be supported in order to justify using the assessment for a particular purpose.  This support is provided in the form of warrants, which are propositions that we use to justify the inference from one claim to the next (Bachman, 2005).  A warrant to support an inference from a score to an interpretation, for example, might be that the ratings derived from observing test takers’ performance are consistent both across different raters and across multiple ratings by the same rater.  Warrants supporting an inference from an interpretation to a decision might consist of the following, for example:

·         Relevant legal requirements and existing community values are carefully considered in the decisions that are made (Values warrant).

·         Stakeholders who are at equivalent levels on the construct to be assessed, as indicated by the interpretations of their assessment reports, have equivalent chances of being classified in the same group (Equitability warrant). (Bachman & Palmer, in press)

Warrants, in turn, must be supported by backing, which may consist of evidence from empirical research, documentation, regulations, laws, and community or societal values.  The backing for the consistency of ratings, for example, might include classical inter- and intrarater reliability estimates or variance components and dependability estimates from a generalizability study (a minimal computational sketch of such estimates follows the examples below).  Backing for the warrants of values and equitability, for example, might consist of:

·         Laws, regulations, policy, surveys of and focus group meetings with stakeholders.

·         Decision rules described in the assessment specifications; standard setting procedures for setting cut scores; studies of the relationship between assessment performance and classification decisions (Bachman & Palmer, in press).
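To make the kind of backing mentioned above for the consistency of ratings concrete, the sketch below (again, an illustration of my own rather than anything prescribed by Bachman and Palmer) computes two classical indices of rating consistency from a small set of invented ratings: the correlation between two raters’ scores and Cronbach’s alpha across all raters.  In an actual justification, such figures, or variance components and dependability indices from a generalizability study, would stand alongside the documentary and policy evidence listed above.

```python
# Illustrative only: classical consistency estimates that might serve as backing
# for a warrant about rating consistency.  The ratings below are hypothetical.
import numpy as np

def interrater_r(x, y):
    """Pearson correlation between two raters' scores for the same test takers."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def cronbach_alpha(ratings):
    """Consistency across raters; rows are test takers, columns are raters."""
    r = np.asarray(ratings, dtype=float)
    k = r.shape[1]                               # number of raters
    rater_vars = r.var(axis=0, ddof=1).sum()     # sum of per-rater score variances
    total_var = r.sum(axis=1).var(ddof=1)        # variance of test takers' summed scores
    return float(k / (k - 1) * (1 - rater_vars / total_var))

# Hypothetical ratings of ten test takers by three raters on a 1-6 scale.
ratings = np.array([
    [4, 4, 5], [3, 3, 3], [5, 5, 4], [2, 3, 2], [6, 5, 6],
    [4, 5, 5], [3, 2, 3], [5, 4, 5], [2, 2, 3], [4, 4, 4],
])

print("Rater 1 vs. rater 2 correlation:", round(interrater_r(ratings[:, 0], ratings[:, 1]), 2))
print("Cronbach's alpha across raters:", round(cronbach_alpha(ratings), 2))
```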

Justifying Assessment Use

Bachman and Palmer (in press) describe justification as the process of providing a rationale and evidence to justify the use of a particular assessment.  Since it is the use of a specific assessment that needs to be justified, justification is inherently local.  In other words, the AUA for a particular assessment provides a “local theory” that makes explicit claims about the roles of consequences, decisions, interpretations, and assessment reports in the assessment, and identifies the evidence that needs to be collected to support these claims.  The purpose of an AUA is thus not to falsify some general theory of language ability or a particular approach to designing language tests.  Rather, its purpose is to provide, and empirically support, a coherent argument that convinces stakeholders that using the assessment will help promote the intended beneficial consequences.  The AUA also identifies appropriate methodologies for collecting evidence and thus embraces a multiplicity of methodological approaches, both quantitative and qualitative.[3]

The Future: Challenges and Opportunities

I believe that the greatest challenges language assessment as a field faces are not in the cerebral spheres of validity theory, postmodern critical social theory, and moral philosophy.  Nor are they to be found in sophisticated statistical and measurement models or in ever-refined approaches to naturalistic observation.  Rather, the challenges we, as language testers, face are in the arenas where language tests are being used to make decisions about individuals and institutions. There is a huge demand worldwide for greater involvement of individuals with expertise in language testing in the areas of classroom and accountability assessment.  Although classroom language assessment is one of the most exciting areas in our field (e.g., Broadfoot, 2005; Rea-Dickins, 2000, 2004), this is still not considered “mainstream language testing” by many.  In the past quarter century, language testers have been only marginally involved in issues of accountability assessment for K-12 and adult education.  This has been and continues to be the case worldwide, where the “action” in large-scale accountability assessment has been the domain of psychometricians and educators, with language testers providing occasional advice from the periphery.  Finally, while the vast majority of published research in language testing over the past half-century has focused on learners/users of English as a second or foreign language, there is a growing body of research and experience, again worldwide, in the assessment of languages other than English, as these are being learned both as second and foreign languages.

The assessment demands of NCLB (United States Congress, 2001, 2002) in the United States have greatly increased the pressure on states to develop more useful assessments, both for accountability and in the service of classroom language learning.  In neither area, in my view, have language testers been adequately involved.  Of particular concern to language testers and other applied linguists should be issues of assessing the English language development and academic achievement of English Language Learners (ELLs) (Hakuta & Beatty, 2000; Koenig, 2002; Solano-Flores & Trumbull, 2003).  Similar concerns and issues surround the assessment of ELLs in adult schools (e.g., Mislevy & Knowles, 2002; U.S. Department of Education, 2001).

Recent initiatives on the part of the United States government to increase the nation’s capacity in foreign languages are also creating increased demand for useful assessments of foreign languages, particularly the less commonly taught languages (U.S. Department of Defense, 2005; U.S. Department of Education, U.S. Department of State, U.S. Department of Defense, & Office of the Director of National Intelligence, 2006).  As increasing amounts of government resources at all levels—federal, state, and local—are likely to go into foreign language instruction in the coming years, there will most likely be a concomitant need for greater accountability (O'Connell & Norwood, 2007).  In K-12 education, there is already an accountability mechanism in place, through NCLB, and it can be expected, for better or worse, that as the federal government invests more heavily in language instruction at this level, an accountability mechanism will be required, and this will necessitate the development of assessments of foreign language proficiency that meet accepted professional standards for validity and impact.

Conclusion

The immediate and long-term prospects for language testing as a field are filled with opportunities and challenges.  Turning these opportunities and challenges into accomplishments will depend upon the willingness and capability of language testers to apply the knowledge and skills acquired over the past half century to the urgent practical assessment needs of our education system, from kindergarten to adult school, and of our society.  It will also depend upon our willingness to leave the comfortable confines of the academy and join our colleagues in education and measurement to toil in the fields of practice.  I believe that language testers have a unique combination of knowledge and skills, as well as a growing understanding of the issues involved in addressing the validity of interpretations and the consequences of test use.  If we can but apply this expertise to the practical problems of assessment in our education systems and society, we will be in a position to provide leadership and contribute greatly to making our meritocracy fair and equitable.

References

Alderson, J. C., & Urquhart, A. H. (1985). The effect of students' academic discipline on their performance on ESP reading tests. Language Testing, 2, 192-204.

Alderson, J. C., & Wall, D. (1993). Does washback exist? Applied Linguistics, 14, 115-129.

Alderson, J. C., & Wall, D. (Eds.). (1996). Washback [Special issue]. Language Testing, 13.

Alderson, J. C., & Windeatt, S. (1991). Computers and innovation in language testing. In J. C. Alderson & B. North (Eds.), Language testing in the 1990s: The communicative legacy (pp. 226-236).  London: Macmillan.

Anderson, N., Bachman, L. F., Cohen, A. D., & Perkins, K. (1991). An exploratory study into the construct validity of a reading comprehension test: Triangulation of data sources. Language Testing, 8, 41-66.

Bachman, L. F. (1989). Language testing-SLA interfaces. Annual Review of Applied Linguistics, 9, 193-209.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L. F. (2000). Modern language testing at the turn of the century:  Assuring that what we count counts. Language Testing, 17, 1-42.

Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2, 1-34.


Bachman, L. F. (2006a). Generalizability:  A journey into the nature of empirical research in applied linguistics. In M. Chalhoub-Deville, C. A. Chapelle, & P. Duff (Eds.), Inference and generalizability in applied linguistics:  Multiple perspectives (pp. 165-207).  Amsterdam: Benjamins.

Bachman, L. F. (2006b, April). Linking interpretation and use in educational assessments. Paper presented at the National Council on Measurement in Education, San Francisco.

Bachman, L. F. (2006c). A research use argument:  An alternative paradigm for empirical research in applied linguistics? Paper presented at the Annual Meeting of the American Association for Applied Linguistics, Ottawa, Canada.

Bachman, L. F. (2007). What is the construct?  The dialectic of abilities and contexts in defining constructs in language assessment. In J. Fox, M. Wesche, & D. Bayliss (Eds.), What are we measuring?  Language testing reconsidered. Ottawa: University of Ottawa Press.

Bachman, L. F., & Cohen, A. D. (1998). Language testing-SLA interfaces:  An update. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research. New York: Cambridge University Press.

Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and rater judgments in a performance test of foreign language speaking. Language Testing, 12, 238-257.

Bachman, L. F., & Palmer, A. S. (1981). The construct validation of the FSI oral interview. Language Learning, 31, 67-86.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.

Bachman, L. F., & Palmer, A. S. (in press). Language assessment in  practice (2nd ed.).  Oxford: Oxford University Press.

Bailey, K. M. (1996). Working for washback:  A review of the washback concept in language testing. Language Testing, 13, 257-279.

Bishop, S. (2004). Thinking about a professional ethics. Language Assessment Quarterly, 1, 109-122.

Bolus, R. E., Hinofotis, F. B., & Bailey, K. M. (1982). An introduction to generalizability theory in second language research. Language Learning, 32, 245-258.

Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the L2 group oral discussion task. Language Testing, 20, 89-110.

Boyd, K., & Davies, A. (2002). Doctors' orders for language testers:  The origin and purpose of ethical codes. Language Testing, 19, 296-322.

Broadfoot, P. M. (2005). Dark alleys and blind bends:  Testing the language of learning. Language Testing, 22, 123-141.


Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20, 1-25.

Brown, J. D. (1989). Improving ESL placement tests using two perspectives. TESOL Quarterly, 23, 65-83.

Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. New York: Cambridge University Press.

Buck, G. (1991). The testing of listening comprehension:  An introspective study. Language Testing, 8, 67-91.

Canale, M. (1983). On some dimensions of language proficiency. In J. W. Oller (Ed.), Issues in language testing research. Rowley, MA: Newbury House.

Chalhoub-Deville, M. (1995). A contextualized approach to describing oral language proficiency. Language Learning, 45, 251-281.

Chalhoub-Deville, M. (1999). Issues in computer-adaptive testing of reading proficiency. New York: University of  Cambridge Local Examinations Syndicate and Cambridge University Press.

Chalhoub-Deville, M. (2003). Second language interaction:  Current perspectives and future trends. Language Testing, 20, 369-383.

Chalhoub-Deville, M., Chapelle, C. A., & Duff, P. (Eds.). (2006). Inference and generalizability in applied linguistics:  Multiple perspectives. Amsterdam: Benjamins.

Chalhoub-Deville, M., & Deville, C. (2005). A look back at and forward to what language testers measure. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (pp. 815-831).  Mahwah, NJ: Erlbaum.

Chapelle, C. A. (1997). Conceptual foundations for the design of computer-assisted language tests. In A. Huhta, V. Kohonen, L. Kurki-Suonio, & S. Luoma (Eds.), Current developments and alternatives in language assessment:  Proceedings of LTRC 96 (pp. 520-525).  Jyvaskyla: University of Jyvaskyla.

Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research (pp. 32-70).  New York: Cambridge University Press.

Chapelle, C. A., & Douglas, D. (2006). Assessing language ability by computer. New York: Cambridge University Press.

Chapelle, C. A. (2006). L2 vocabulary acquisition theory:  The role of  inference, dependability and generalizability in assessment. In M. Chalhoub-Deville, C. A. Chapelle, & P. A. Duff (Eds.), Inference and generalizability in applied linguistics:  Multiple perspectives (pp. 47-64).  Amsterdam: Benjamins.


Cheng, L. (1997a). How does washback influence teaching?  Implications for Hong Kong. Language and Education, 11, 38-54.

Cheng, L. (1997b). The washback effect of public examination change on classroom teaching:  An impact study of the 1996 Hong Kong certificate of education in English on classroom teaching of English in Hong Kong secondary schools. Doctoral Dissertation, University of Hong Kong, Hong Kong.

Cheng, L., Watanabe, Y., & Curtis, A. (Eds.). (2004). Washback in language testing:  Research contexts and methods. Mahwah, NJ: Lawrence Erlbaum.

Choi, I.-C., & Bachman, L. F. (1992). An investigation into the adequacy of three IRT models for data from two EFL reading tests. Language Testing, 9, 51-78.

Choi, I.-C., Kim, S. K., & Boo, J. (2003). Comparability of a paper-based language test and a computer-based language test. Language Testing, 20, 295-320.

Clapham, C. (1996). The development of IELTS:  A study of the effect of background knowledge on reading comprehension (Vol. 4).  New York: University of Cambridge Local Examinations Syndicate/Cambridge University Press.

Cohen, A. D. (1984). On taking tests:  What the students report. Language Testing, 1, 70-81.

Cohen, A. D. (2007). The coming of age of research on test-taking strategies. Language Assessment Quarterly, 3, 307-331.

Davies, A. (Ed.). (1997). Ethics in language testing. [Special issue] Language Testing, 14.

Davies, A. (2001). The logic of testing languages for specific purposes. Language Testing, 18, 133-148.

Davies, A. (Ed.). (2004). The ethics of language assessment [Special issue]. Language Assessment Quarterly, 1.

Douglas, D. (2000). Assessing language for specific purposes:  Theory and practice. New York: Cambridge University Press.

Douglas, D., & Selinker, L. (1985). Principles for language tests within the 'discourse domains' theory of interlanguage:  Research, test construction and interpretation. Language Testing, 2, 205-226.

Douglas, D., & Selinker, L. (1993). Performance on a general versus a field-specific test of speaking proficiency by international teaching assistants. In D. Douglas & C. A. Chapelle (Eds.), A new decade of language testing research (pp. 235-256).  Arlington, VA: TESOL.

Elder, C. (1997). What does test bias have to do with fairness? Language Testing, 14, 261-277.


Green, A. (2007). Watching for washback:  Observing the influence of the international English language testing system academic writing test in the classroom. Language Assessment Quarterly, 3, 333-368.

Hakuta, K., & Beatty, A. (Eds.). (2000). Testing English-language learners in U. S. Schools. Washington, DC: National Academy Press.

Hamp-Lyons, L., & Lumley, T. (Eds.). (2001). Assessing language for specific purposes [Special issue]. Language Testing, 18.

Hawkey, R. (2006). Impact theory and practice. New York: Cambridge University Press.

Henning, G. (1984). The advantages of latent trait measurement in language testing. Language Testing, 1, 123-134.

Henning, G. (1992). Dimensionality and the construct validity of language tests. Language Testing, 9, 1-11.

Hicks, M. M. (1986). Computerized multilevel ESL testing, a rapid screening methodology. In C. W. Stansfield (Ed.), Technology and language testing (pp. 79-90).  Arlington, VA: TESOL.

Hudson, T. (1991). Relationships among IRT item discrimination and item fit indices in criterion-referenced language testing. Language Testing, 8, 160-181.

Hudson, T. (1993). Testing the specificity of ESP reading skills. In D. Douglas & C. A. Chapelle (Eds.), A new decade of language testing research (pp. 58-82).  Arlington, VA: Teachers of English to Speakers of Other Languages.

Hudson, T., & Brown, J. D. (Eds.). (2001). A focus on language test development. Honolulu, HI: Second Language Teaching & Curriculum Center, University of Hawaii at Manoa.

Hudson, T., Detmer, E., & Brown, J. D. (1992). A framework for testing cross-cultural pragmatics. Honolulu, HI: Second Language Teaching & Curriculum Center, University of Hawaii at Manoa.

Huhta, A., Kalaja, P., & Pitkanen-Huhta, A. (2006). Discursive construction of a high-stakes test:  The many faces of a test-taker. Language Testing, 23, 326-350.

Kane, M. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38, 319-342.

Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational measurement:  Issues and practice, 18, 5-17.

Ke, C. (2006). A model of formative task-based language assessment for Chinese as a foreign language. Language Assessment Quarterly, 3, 207-227.

Kobayashi, M. (2002). Method effects on reading comprehension test performance:  Text organization and response format. Language Testing, 19, 193-220.


Koenig, J. A. (Ed.). (2002). Reporting test results for students with disabilities and English-language learners. Washington, DC: National Academy Press.

Kunnan, A. J. (1992). An investigation of a criterion-referenced test using g-theory, and factor and cluster analysis. Language Testing, 9, 30-49.

Kunnan, A. J. (2000). Fairness and justice for all. In A. J. Kunnan (Ed.), Fairness and validation in language assessment (pp. 1-14).  New York: Cambridge University Press.

Kunnan, A. J. (2004). Test fairness. In M. Milanovic & C. Weir (Eds.), European language testing in a global context (pp. 27-48).  New York: Cambridge University Press.

Kunnan, A. J. (Ed.). (2000). Fairness and validation in language assessment:  Selected papers from the 19th language testing research colloquium, Orlando. New York: Cambridge University Press.

Kunnan, A. J. (Ed.). (1998).  Structural equation modeling. [Special issue] Language Testing, 15.

Kunnan, A. J., & Lakshamanan, U. (2006). Language assessment and language acquisition:  A cross-linguistics perspective. [Special issue] Language Assessment Quarterly, 3.

Laufer, B., & Nation, P. (1999). A vocabulary-size test of controlled productive ability. Language Testing, 16, 33-51.

Lazaraton, A. (1996). Interlocutor support in oral proficiency interviews:  The case of case. Language Testing, 13, 151-172.

Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests. New York: Cambridge University Press.

Leung, C. (2004). Developing formative teacher assessment:  Knowledge, practice and change. Language Assessment Quarterly, 1, 5-18.

Lowe, P., Jr. (1985). The ILR proficiency scale as a synthesizing research principle:  The view from the mountain. In C. J. James (Ed.), Foreign language proficiency in the classroom and beyond. Lincolnwood, IL: National Textbook.

Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19, 246-276.

Lumley, T., & O'Sullivan, B. (2005). The effect of test-taker gender, audience and topic on task performance in tape-mediated assessment of speaking. Language Testing, 22, 415-437.

Lynch, B. K., & Davidson, F. (1994). Criterion-referenced language test development:  Linking curricula, teachers and tests. TESOL Quarterly, 28, 727-743.

McNamara, T. (1990). Item response theory and the validation of an ESP test for health professionals. Language Testing, 7, 52-76.


McNamara, T. (1997). 'Interaction' in second language performance assessment:  Whose performance? Applied Linguistics, 18, 446-466.

McNamara, T. (1998). Policy and social considerations in language assessment. Annual Review of Applied Linguistics, 18, 304-319.

McNamara, T. (2001). Language assessment as social practice:  Challenges for research. Language Testing, 18, 333-349.

McNamara, T. (2003). Looking back, looking forward:  Rethinking Bachman. Language Testing, 20, 466-473.

McNamara, T. (2006). Validity and values:  Inferences and generalizability in language testing. In M. Chalhoub-Deville, C. A. Chapelle, & P. A. Duff (Eds.), Inference and generalizability in applied linguistics:  Multiple perspectives. Amsterdam: Benjamins.

McNamara, T., & Roever, K. (2006). Language testing:  The social dimension. Oxford: Blackwell.

Meara, P., & Buxton, B. (1987). An alternative to multiple-choice vocabulary tests. Language Testing, 4, 142-151.

Mislevy, R. J., & Knowles, K. (Eds.). (2002). Performance assessments for adult education:  Exploring measurement issues. Washington, DC: National Academy Press.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2002). Design and analysis in task-based language assessment. Language Testing, 19, 477-496.

North, B. (2000). The development of a common framework scale of language proficiency: Vol. 8. Bern: Peter Lang.

O'Connell, M. E., & Norwood, J. L. (Eds.). (2007). International education and foreign languages:  Keys to securing America’s future. Washington, DC: National Academies Press.

O'Loughlin, K. (2001). The equivalence of direct and semi-direct speaking tests. New York: Cambridge University Press.

O'Loughlin, K. (2002). The impact of gender on oral proficiency testing. Language Testing, 19, 169-192.

Oller, J. W., Jr. (1979). Language tests at school. London: Longman.

Oller, J. W., Jr. (1983). A consensus for the eighties? In J. W. Oller (Ed.), Issues in language testing research (pp. 351-356.).  Rowley, MA: Newbury House.

Pappajohn, D. (1999). The effect of topic variation in performance testing:  The case of the chemistry teach test for international teaching assistants. Language Testing, 16, 52-81.


Qi, L. (2005). Stakeholders' conflicting aims undermine the washback function of a high-stakes test. Language Testing, 22, 142-173.

Rea-Dickins, P. (Ed.). (2000). Assessing young learners. [Special issue] Language Testing, 17.

Rea-Dickins, P. (Ed.). (2004). Exploring diversity in teacher assessment. [Special issue] Language Testing, 21.

Read, J. (1993). The development of a new measure of L2 vocabulary knowledge. Language Testing, 10, 355-371.

Read, J. (2000). Assessing vocabulary. New York: Cambridge University Press.

Read, J., & Chapelle, C. A. (2001). A framework for second language vocabulary assessment. Language Testing, 18, 1-32.

Reath, A. (2004). Language analysis in the context of the asylum process:  Procedures, validity and consequences. Language Assessment Quarterly, 1, 209-233.

Roever, K. (2006). Validation of a web-based test of ESL pragmalinguistics. Language Testing, 23, 229-256.

Sasaki, M. (1996). Second language proficiency, foreign language aptitude, and intelligence:  Quantitative and qualitative analyses. Bern: Peter Lang.

Schoonen, R. (2005). Generalizability of writing scores:  An application of structural equation modeling. Language Testing, 22, 1-30.

Shin, S.-K. (2005). Did they take the same test?  Examinee language proficiency and the structure of language tests. Language Testing, 22, 31-57.

Shohamy, E. (1997a). Critical language testing and beyond. Paper presented at the AAAL Annual Conference, Orlando, FL.

Shohamy, E. (1997b). Testing methods, testing consequences:  Are they ethical? Language Testing, 14, 340-349.

Shohamy, E. (2001). The power of tests:  A critical perspective on the uses of language tests. London: Pearson.

Skehan, P. (1984). Issues in the testing of English for specific purposes. Language Testing, 1, 202-220.

Solano-Flores, G., & Trumbull, E. (2003). Examining language in context:  The need for new research and practice paradigms in the testing of English-language learners. Educational Researcher, 32, 3-13.

Song, X., & Cheng, L. (2006). Language learner strategy use and test performance of Chinese learners of English. Language Assessment Quarterly, 3, 243-266.


Stansfield, C. W. (1993). Ethics, standards and professionalism in language testing. Issues in Applied Linguistics, 4, 15-30.

Stansfield, C. W., & Kenyon, D. M. (1992). Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview. System, 20, 347-364.

Swain, M. (2001). Examining dialogue:  Another approach to content specification and validating inferences drawn from test scores. Language Testing, 18, 275-302.

Toulmin, S. E. (2003). The uses of argument (Updated ed.).  New York: Cambridge University Press.

U.S. Department of Defense. (2005). Defense language transformation roadmap. Arlington, VA.

U.S. Department of Education. (2001). Measures and methods for the national reporting system for adult education:  Implementation guidelines. Washington, D.C.: Author.

U.S. Department of Education, U.S. Department of State, U.S. Department of Defense, & Office of the Director of National Intelligence. (2006). National security language initiative. Retrieved from http://www.ed.gov/about/inits/ed/competitiveness/nsli/nsli.pdf

Uiterwijk, H., & Vallen, T. (2005). Linguistic sources of item bias for second generation immigrants in Dutch tests. Language Testing, 22, 211-234.

United States Congress. (2001). H.R. 1, No Child Left Behind Act of 2001.

United States Congress. (2002). Public law 107-110, No Child Left Behind Act of 2001.

van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils:  Oral proficiency interviews as conversation. TESOL Quarterly, 23, 489-508.

Wall, D. (1996). Introducing new tests into traditional systems:  Insights from general education and from innovation theory. Language Testing, 13, 334-354.

Wall, D. (2005). The impact of high-stakes examinations on classroom teaching:  A case study using insights from testing and innovation theory. New York: Cambridge University Press.

Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11, 197-223.

Weir, C. J. (1983). The Associated Examining Board's test in English for academic purposes:  An exercise in content validation. In A. Hughes & D. Porter (Eds.), Current developments in language testing (pp. 147-153).  London: Academic Press.

Wigglesworth, G. (2001). Influences on performance in task-based oral assessments. In M. Bygate, P. Skehan & M. Swain (Eds.), Researching pedagogic tasks:  Second language learning, teaching and testing (pp. 186-209).  Harlow, Essex: Pearson Education Ltd.


Xi, X. (2005). Do visual chunks and planning impact performance on the graph description task in the SPEAK exam? Language Testing, 22, 463-508.

Yamashita, S. O. (1996). Six measures of JSL pragmatics. Honolulu, HI: Second Language Teaching & Curriculum Center, University of Hawaii at Manoa.

Young, R. F. (2000). Interactional competence: Challenges for validity. A joint symposium, interdisciplinary interfaces with language testing, of the Language Testing Research Colloquium and the American Association for Applied Linguistics. Vancouver, B.C.

Notes

[1] Interestingly, in the research on the assessment of educational achievement, it is the reverse:  Content knowledge related to academic courses is the construct of interest, whereas language ability is typically treated as a source of bias or measurement error.

[2] I recognize that one of the thrusts of critical applied linguistics, as well as so-called critical language testing, is that existing community values may themselves be inequitable and hence need to be constantly scrutinized, particularly by those who will be affected by the decisions that are made.

[3] Elsewhere (Bachman, 2006a, 2006b, 2006c) I have argued that a similar conceptual framework, which I call a “research use argument,” can inform empirical research in applied linguistics.