Chapter Two Language Testing: An Historical...
-
Upload
nguyendung -
Category
Documents
-
view
221 -
download
3
Transcript of Chapter Two Language Testing: An Historical...
1
Chapter Two
Language Testing: An Historical Overview
2.1. Introduction
The present chapter intends to focus on an historical overview of language
testing with a review of the available literature. In addition, this chapter
will provide a brief survey of the types, approaches, and characteristics of
the available English language tests.
2.2. Historical Overview of Language Testing
Researches in the area of language testing over the last few decades have
contributed a lot in making a test more and more practical, reliable and valid.
The present state of language testing is the result of a consistent theoretical
advancement - though slow and even lacking in operation and application
especially in ESL / EFL context - over the last one century in the shape of
experimentations and innovations in language testing. The historical
overview of the developments in this area is best viewed in the nine
historical phases categorized by Spolsky (2000:536-552) in his review
article, entitled “Language Testing in The Modern Language Journal”,
where he reviews the articles published in the Modern Language Journal
2
over eighty years, reflecting upon the nature, views, principles and practices
of testing in these phases.
2.2.1. The Early Years:
This phase is most prominently known for the ‘aural and oral test’ that was
designed and implemented by the committee of the Association of Modern
Language Teachers for university admission tests in French, German, and
Spanish. This test was based on data collected from 1000 schools, where the
“majority of respondents agreed that a test was the only fair way to
recognize and encourage the widespread direct method of teaching, which
emphasized speaking the language” (Spolsky, 2000:537). Carroll (1954)
considers this test as one of the earliest calls for objective psychological
testing. This article, according to Spolsky (2000:537), was significant in
three ways – first, because this aural-oral test concentrated on the spoken
language rather than written language and ‘recognized clearly the curricular
impact (washback) of a test’. Secondly, in order to meet the feasibility
criteria of a test, we find ‘ease triumphing over principle’. Thirdly, objective
tests became widely acceptable for uniformity.
The Committee on Resolutions and Investigations of the Association of
Modern Language Teachers of the Middle States and Maryland, which
became a basis for the innovations in language testing 30 years later by the
3
Armed Services Training Program (ASTP) and the Audio-Lingual Method
after that, observed in 1917 that ‘no actual oral test is included in this
(entrance examinations for French, Spanish and German) examination’
(Committee on Resolutions and Investigations, 1917:252, in Fulcher,
2003:02). However, it further claims that ‘no candidate could pass it, who
had not attained abundant oral, as well as aural training’ (Committee on
Resolutions and Investigations, 1917:252, in Fulcher, 2003:02).
The story in the United Kingdom is found to be different. ‘While testing
speaking in the United States was frequently seen as ‘desirable but not
feasible’, the University of Cambridge Local Examinations Syndicate
(UCLES) was not hampered by concerns over reliability or measurement
theory. A sub-test of spoken English was included in the Certificate of
Proficiency in English at its introduction in 1913, (Roach, 1945, in Fulcher,
2003:05).
2.2.2. The Inter-war Years:
These years have been significant in so many ways in the area of
experimentations, innovations and theoretical advancements of language
testing. Tests of oral and aural skills started increasing in number. Spolsky
(2000:538) lists such references as Thorndike (1921), Decker (1925),
4
Henmon et al. (1929), Lundeberg (1929), Tharp (1930), Clarke (1931), and
Rogers & Clarke (1933).
Wood (1927:96), in his pioneering work on language tests for New York
Schools, acknowledged the importance of using ‘conversational materials’
for assessing students’ speaking abilities, but this test remained ‘paper and
pencil test’ because speaking tests would be ‘subjective’ and logistically
difficult to put into operation.
Seeking objectivity became the key intention of the test developers in this
phase. Hence they concentrated on ‘new-type’ multiple choice tests as
reliable, objective measures of language ability (Fulcher, 2003:02). The first
true speaking test used in North America was The College Board’s English
Competence Examination, introduced in 1930 for overseas students applying
to study at US colleges and universities (College Entrance Examination
Board, 1939, in Fulcher 2003:02). The College Entrance Examination Board
was given the task of providing this test by the American Association of
Collegiate Registrars. It was used effectively only until 1935 due to financial
reasons.
Some other works in this direction are Broom & Kaulfers (1927), Cheydleur
(1928) and Brinsmade (1928). Seibert & Goddard (1935) developed scoring
key to attain objectivity in language testing. Smith & Cambell (1942) too
5
developed tests and a scoring key to relieve fears about multiple-choice
tests.
Different types of tests were developed by Cheydleur (1931, 1932b),
Kaulfers (1933), Haden & Stalnaker (1934), Kurath & Stalnaker (1936),
Stalnaker & Kurath (1935), Feder & Cochran (1936), Fife (1937), and
Kaulfers (1937).
Prognosis or Aptitude testing, initiated by Kaulfers (1930) became a
significant concern in this phase. Spolsky (2002:539-541) list various studies
in this regard. They are Cheydleur (1932a), Richardson (1933), Symonds
(1930a, 1930b), Henmon (1934), Wagner & Strabel (1935), Michel (1934,
1936), Matheus (1937), Tallent (1938), Kaulfers (1939), Spoerl (1939),
Maronpot (1939) and Gabbert (1941).
The goal of instruction became important for test developments. In this
regard Spolsky (2000:541) mentions mainly two studies by Conant (1934)
and Frantz (1939).
This phase has also been significant in witnessing the first detailed
proficiency scale for language tests in Sammartino (1938).
6
“Unlike in the United States, the primary purpose of an examination in the
United Kingdom was to support the syllabus and encourage good teaching
and learning” (Brereton 1944, in Fulcher, 2003:06).
2.2.3. The World War II Experience:
Due to the scientific advancements like planes and radios, the focus in
language teaching shifted from reading to aural-oral abilities of language
learners. Similar to Sammartinos (1938), Kauflers (1944) developed a
performance scale for measuring aural comprehension. Spolsky (2000:543)
identifies Roach (1945) as working on the same lines in England. Since the
Second World War, testing second language speaking attained priority over
other skills.
“When it was realized that the American service personnel were ill equipped
to communicate, that their inability to speak in a second language might be
‘a serious handicap to safety and comfort’, the push for tests of second
language speaking began in earnest. From the needs of the military, the
practice soon spread to colleges and universities” (Fulcher, 2003: xi).
Fulcher (2003:06) further notes:
“the Second World War was a watershed in the history of testing speaking. … The Army Specialized Training Programe (ASTP) was created in 1942 in order to address the communication problems of American service personnel through the delivery of language
7
programmes that focused on speaking in the fields of engineering, medical, and area studies / language”.
Even Angiolillo (1947:32) identifies ASTP as the first language instruction
programme with the specific aim ‘to impart to the trainee a command of the
colloquial spoken form of a language and to give the trainee a sound
knowledge of the area in which the language is used.’
In the UK too the testing of speaking had a military focus, though slow in
comparison to the US. Fulcher (2003:08) informs:
“much of the language teaching that went on in the United Kingdom, and those part of the world over which the Allies had control, was done by the Army Education Corps. Much of this was teaching English to Allied Forces in Europe.”
2.2.4. The Early Post War Period:
This period contributed couple of more tests in this phase. Bottke & Milligan
(1945) and Bovee & Froelich (1946), for instance, suggested some new
aptitude tests; Clapp (1947) meditated on placement tests; Coutant (1948)
and Kaulfers (1948) made special observations about language teaching in
California; and Miller (1948) designed a test of Spanish-American culture.
In this period, there had been very reluctant language testing research as
such. As language teaching was not a distinct discipline, language testing
followed the general principles of testing available in humanities or social
sciences. Classroom tests were conducted by the teachers themselves which
8
were the results of Grammar Translation or Reading-oriented methods that
were in use at that time.
2.2.5. The 1950s:
This decade witnessed an addition of some more tests, mainly concerning
aspects of oral communication. They are Raymond (1951), Furness (1955),
Eley (1956), Borg & Goodman (1956), Buechel (1957), Harding (1958),
Mueller (1959), Valley (1959), Ayer (1960), and Sadnavitch & Popham
(1961).
This decade was dominated by the Foreign Service Institute (FSI) testing
unit in the Army services, though political and bureaucratic problems held
up the implementation work from 1952 to 1956. The FSI was again given
the responsibility in 1956 to provide evidence of foreign language
proficiency for all Foreign Service Personnel. In 1958 the FSI testing unit
further developed the 1952 scales by adding a checklist of five factors for
raters, each to be measured on the six-point scale. These five factors were
accent, comprehension, fluency, grammar and vocabulary. This can be
called to be the first step towards developing multiple trait rating.
9
2.2.6. The 1960s:
Spolsky (2000:544) reports that ‘interest in aptitude reemerged in the
postwar period and took on a new urgency …’. This resulted in such studies
as Salomon (1954), Eterno (1961), Blickenstaff (1963), Mendeloff (1963),
Guerra, Abrahamson, and Newmark (1964), Di Giulio (1967), McCarthur
(1965), and Shane (1971).
Due to the success of the FSI rating, the method was adopted and even
adapted by the Defence Language Institute (DLI), the Central Intelligence
Agency (CIA), and the Peace Corps. Quinones (Fulcher 2003:10). It is
further informed that in 1968, partly as a result of experiences of language
needs during the Vietnam War, these diverse agencies came together to
produce a standardized version, today known as Interagency Language
Roundtable (ILR).
From the early 1950s through the late 1960s, when contrastive analysis was
a thriving discipline and structuralism and behaviorism combined to make
language teaching scientific, testing focused on specific language elements
such as phonological, grammatical and lexical contrasts between the two
languages. Tests during this decade stressed mastery of discrete linguistic
skills. That is why this approach came to be known as the ‘discrete point’
approach to language testing as against the ‘integrative’ approach. Such
10
discrete item tests were constructed by breaking down the language into its
component parts relating to the four skills of listening, speaking, reading and
writing and various hierarchical units of language in phonology, graphology,
morphology, lexis and syntax.
2.2.7. The 1970s:
Researchers in this phase recommended the case for the cloze test. Such
studies as Oller (1972), Stubbs & Tucker (1974), Brier, Clausing, Senko,
and Purcell (1978), Brown (1980), Caulfield & Smith (1981), Gaies,
Gradman, and Spolsky (1977), and Lange & Clausing (1981) raised various
issues pertaining to the cloze test.
Liskin-Gasparro (1984b:22) provides information regarding the substantial
use of the FSI by the Peace Corps. The new testing system for academics
and teachers outside the control of the government agencies was introduced.
In the 1970s, as a consequence, it was adopted by many universities and
states for the purpose of bilingual teacher certification. Barnwell (1987:36)
identified three reasons for the popularity of the FSI approach in the
educational context. These are:
i. as a direct test of speaking ability, having high validity, ii. high inter rater reliability, and iii. the growing interest in the notional/functional approach to teaching
English.
11
Due to these reasons the relationship between teaching and testing was also
realized. Carroll’s (1967) experimentation with German, Russian, French,
and Spanish students in the USA became the focus of considerable attention
by 1979, when it was replicated by the Educational Testing Services (ETS)
(Liskin-Gasparro 1984b:27).
Besides this in 1979 the President’s Commission on Foreign Language and
International Studies recommended the setting up of a National Criteria and
Assessment Program to develop language tests and to assess language
learning in the US.
This decade has also been able to raise various diversified issues such as
teachers’ competence to test (Taylor, 1972), differences between men and
women response in tests (Jacobson & Imhoof, 1974), similarities in first and
second language learning (Politzer, 1974; Oller 1976a), computerized testing
(Boyle, Smith, and Eckert, 1976), repetition tasks (Bell, Doyle, and Talbott
(1977), enforced latency period for tests (Meredith, 1978), aural
comprehension test (Madsen 1979), and repetition and dictation (Natalicio,
1979).
12
2.2.8. The 1980s:
The recommendations of the President’s Commission of 1979 gave the
ACTFL (The American Council on the Teaching of Foreign Languages) the
role of producing National Criteria, and the ACTFL Provisional Proficiency
Guidelines appeared in 1982 (ACTFL, 1982), the complete Guidelines in
1986 (ACTFL, 1986), and the revised Guidelines in 1999 (ACTFL, 1999).
These guidelines gave birth to the Proficiency Movement.
The 1980s argued the need for a ‘common yardstick’ to develop listening
and reading tests (Woodford, 1980). It focused on direct proficiency testing,
communicative skills in classroom tests, and the computerized testing
(Clark, 1983). It witnessed more samples of cloze tests (Shohamy, 1982;
Heilenman, 1983; Bensoussan and Ramzar, 1984), and continued the debate
over prognostic testing (Curtin, Avner, and Smith 1983).
This decade has been remarkable for the publication of the ACTFL (the
American Council on the Teaching of Foreign Languages) Proficiency
Guidelines in 1982 which provided a common yardstick, but it had to
undergo certain attacks, from Savignon (1985), Lantolf and Frawley (1985),
Schulz (1986) and Kramsch (1986). This Guideline actually talked of a shift
in testing target from the ‘grammar-based achievement testing’ to the
‘communicative-based and oral proficiency’. Lowe (1986) came forward to
13
defend the Guidelines, but suggested that there is a need for further studies
and research in this direction. The debate took a positive shape with the
contributions of such researchers as Hieke (1985) and Byrnes (1987).
Besides these guidelines, testing situation in this decade was involved with
some other issues too. These issues involved such aspects as scoring criteria
(Currall and Kirk, 1986), relationship between intelligence and language
ability (Boyle, 1987; Cooper, 1987), the relationship between students’
overall grade and language ability (Spurling, 1987), and the goal of
developing a ‘common metric’ (Lambert, 1987). Premised on ACTFL
Proficiency Guidelines for foreign language reading, Lee & Musumeci
(1988) constructed a test based on text-type, reading skill, and task-based
performance.
From the late 1960s to the present time, the dissatisfaction with structuralism
and behaviorism led to linguistic research on communicative competence
and the context of language use. According to Spolsky (1978) and Jones
(1977), there has been a large discrepancy between language teaching and
testing. While the aim of teaching was to develop practical communication,
language testing stressed on the mastery of linguistic elements. However,
recently when the goal of teaching has become communicative, language
tests have started testing communication and other pragmatic skills (Morrow
14
1977, Canale and Swain 1980, Carroll 1980). Testers have come to realize
that the whole of the communicative event is considerably greater than the
sum of its linguistic elements (Clark 1983). As against the ‘discrete’ point
approach, ‘integrative’ approach emerged with its emphasis on
communication, authenticity and context. Oller (1976, 1978) criticized the
‘discrete’ approach and said that language competence is a unified set of
interacting abilities which can not be separated and tested and claimed that it
required integration rather than testing of discrete items of grammar, reading
and vocabulary. Following this line of thinking, two types of integrative tests
emerged. ‘Cloze’ tests are reading passages mutilated by the deletion of
every nth word. The subject is required to fill up the blanks with appropriate
words, which tests the subjects’ competence in a language. The ‘dictation’
tests are short passages read to the students by the teacher: the passage is
read at a normal speed and the students listen. In the second reading it is
broken into phrases and the students write what they listen. Lastly, the work
of the students is checked by themselves while the teacher reads the passage
at a normal speed. Dictation and cloze tests are appropriate integrative tests
(Savignon 1982, Oller 1979, Streiff 1975). At present, when the aim of
teaching is to develop communicative competence, the tests are largely
performance based which requires the learners to use language naturally.
15
There is a call for direct testing which tests the learners in a variety of
language functions.
2.2.9. The 1990s:
This decade witnessed further studies concerning some recent issues
pertaining to language testing. It becomes evident from various works like
Heining-Boynton (1990) who prepared an inventory for evaluating language
teaching programmes in elementary schools; Meredith (1990) on the use of
Oral Proficiency Interview (OPI); Anderson (1991) on the individuality of
proficiency tests; Dunkel (1991) on using computer to test listening
comprehension; Abraham & Chapelle (1992) on Cloze tests; Shohamy,
Gordon, and Kraemer (1992) on the usefulness of writing scales and
intensive training for raters; Stansfield & Kenyon (1992) on developing a
tape-recorded version of the interview; Stansfield, Scott, & Kenyon (1992)
on the development of a test for translators.
The historical overview of language testing is also done by researchers in
terms of three stages in the recent history of language testing: the Pre-
scientific, the Psychometric-Structuralist, and the Psycholinguistic-
Sociolinguistic, as specified by Spolsky (1975). These phases also reflect the
intentions and theoretical basis for language testing before the emergence of
Communicative testing.
16
i) The Pre-Scientific Era: Although Spolsky (1975) traces this era from
the Chinese civil service examinations more than two thousand years ago; he
considers its present form from the 18th century Cambridge Tripos onwards.
It was characterized by the use of essays, open-ended examinations, oral
examining. Testing in this era did not rely on linguistic theory.
ii) The Psychometric-Structuralist Era: This era refers to the joint
venture of the psychometrics and the structural linguistics. While
psychometrics suggested ways on the objective and reliable methods of
testing, the structuralists identified elements of language to be tested.
Language testing in this era was dominated by criteria for the establishment
of educational measuring instruments developed within the traditions of
psychometrics (Alderson & Hughes, 1981:05).
The first person to claim the need was Robert Lado, who was responsible for
the discrete point approach. According to this approach the target language
was broken down into small testable segments using contrastive analysis. A
test in each test item was intended to give information about the candidates’
ability to handle that particular point of language.
Morrow (1979:144-145) calls the approach of this era ‘atomistic’ due to the
assumption that ‘knowledge of the elements of a language is equivalent to
knowledge of the language’. Though this approach provided easily
17
quantifiable data for testing, it carried a major drawback as pointed out by
Morrow (1979). He stated that the knowledge of discrete elements is
worthless unless the user can synthesize those elements according to the
linguistic demands of the situation. Oller (1979:212) rightly claims that “the
whole is greater than the sum of its parts …”
iii) The Psycholinguistic-Sociolinguistic Era: This era began with the
emergence of the global integrative testing. Oller’s (1979, cited in Miyata-
Boddy & Langham, 2000:76) argument on the utility of global integrative
testing provided a closer measure of the ability to combine language skills in
the way they are used for actual language use rather than discrete point
testing. However, even this view has been refuted by Bachman (1990),
Alderson (1981) and Morrow (1979). This era believed that “the whole is
greater than the sum of its parts …” (Weir, 1990, in Miyata-Boddy &
Langham, 2000:76).
About the developments in this era, Wesche (1983:41) says that “in the past
decades or so, linguistic science has paid increasing attention to the
communicative – or message transmission – aspect of language as opposed
to an earlier almost exclusive emphasis on grammatical forms and the ways
in which they could be combined to form grammatical sentences. … More
recently, attention has increasingly been given to how the communicative
18
ability of second language speakers can be tested.” More discussion on this
aspect and the other aspects of this era will be done in the next chapter.
So far through the historical overview discussed above, an attempt has been
made to survey under what circumstances, how and when the language tests
have developed. But a discussion on historical overview remains unjustified,
if the various kinds and approaches to tests and testing are not mentioned.
Therefore, the following section of this chapter will deal with the types,
approaches, Criteria, and test techniques language tests that have taken
shape so far, since its infancy.
2.3. Types of Language Tests
Over the years along the research developments various kinds of tests have
been formulated in different situations for different purposes. Some major
ones are being mentioned discussed below:
2.3.1 Proficiency Test:
Proficiency tests are intended to measure one’s overall, general ability in a
language in order to assess whether the learners are proficient enough to
handle the language in a given situation. It is not content or objective
specific. These tests are conducted, for example, to test whether a student’s
English is good enough to follow a course of study at a British university.
19
That is, these tests take into account the level of proficiency and the kind of
English needed to follow courses in particular subject areas. Cambridge
examinations (First Certificate Examinations and Proficiency Examination)
and the Oxford EFL examination (Preliminary and Higher) are a few
examples of other types of Proficiency tests which are more general in their
target. They try to establish whether candidates have reached a certain
standard with respect to certain specified abilities.
There are other proficiency tests too, which do not have any occupation or
course of study in mind. Here proficiency in a language is perceived in a
more general term. Existing tests like the Cambridge First Certificate in
English Examination (FCE) and the Cambridge Certificate of Proficiency in
English Examination (CPE) are appropriate examples for this type of test.
The function of such tests is to show whether students have reached a certain
standard with respect to a set of specified abilities. The examining bodies of
such tests are independent of teaching institutions. The FCE and CPE, as
mentioned above, are linked to levels in the ALTE (Association of Language
Testers in Europe) framework, which draws heavily on the work of Council
of Europe.
2.3.2 Placement Test:
20
These tests are conducted at the beginning of a course of study in order to
group the learners, for instance, to elementary, intermediate or advanced
levels. Placements tests are used to find the initial level of proficiency of
learners who have enrolled for a particular course. These tests are generally
situation based. A Placement test, therefore, designed and conducted at one
institution cannot be used in totality at another institution. This test is
generally tailor-made, an in-house product. Well et al (1994) and Fulcher
(1997) provide a detailed discussion on the issues in placement test
development.
2.3.3 Progress Test:
These tests are also termed as Mid term or Mid year tests because they are
conducted during the continuation of a course of study. Such tests help the
teachers and learners in assessing how far they have been able to meet the
teaching and learning targets, respectively. The results of these tests help in
modifying the teaching methodology or materials, if there is a need for it.
Some institutions conduct two or three progress tests spread over one year.
Ideally tests should reflect a clear progression towards the final achievement
based on course objectives.
2.3.4 Achievement Test:
21
The Progress test can also be understood in terms of ‘Achievement’ test.
This test is closely related to the curriculum and they establish exactly what
a learner has learned in a given teaching context. That is why; most teachers
are involved in either framing or evaluating this test. The achievement test
can be categorized into two types: ‘Final Achievement Test’ and ‘Progress
Achievement Test’.
‘Final Achievement Tests’ are actually final tests conducted at the end of a
semester or an academic year after course completion. The content of this
test is related to the content of the course as specified in the syllabus. That is
why it is referred to as the syllabus-content approach. The major
disadvantage of this test is that if the course and materials are not designed
properly, the results will have a negative wash back. That is, it will not
indicate successful achievement of course objectives. However, the results
of these tests help the teachers in correcting or modifying their strategies or
materials only to be used in the next academic session. The feedback is not
given to the learners who take this test. The amendments felt to be made are
experimented on the fresh learners who join the course.
There is an alternative approach too in this test, where the test is directed
more on the course objective rather than on the course content. This helps in
22
making the course designers explicit about objectives; hence this test is able
to capture the learners’ achievements in terms of these objectives.
‘Progress Achievement Test’ is practically synonymous with ‘Progress
Test’, as discussed in brief above. This test is intended to measure the
consistent progress of learners during their academic session. Even here it is
intended to make the test focused on the short term course objectives for a
better performance and positive wash back. This is because low scores
obtained prove to be discouraging for students. Such tests as sessionals,
tutorials, quizzes, weekly assignments, and weekly tests are examples of this
type of test.
2.3.5 Diagnostic Tests:
Diagnostic tests are meant to identify the learners’ strengths and weaknesses.
Unlike achievement tests, diagnostic tests not only provide an impression of
the level of proficiency, but they also establish what a learner has or has not
mastered. These tests provide data which can be used to adapt a teaching
programme through its follow-up techniques. They intend to establish what
further teaching is required. Such tests are extremely useful for
individualized instruction or self-instruction. This test can be conducted at
the beginning or during the course to determine proficiency related to
vocabulary, grammatical accuracy, or linguistic appropriacy. But conducting
23
and administering such a test is difficult from various practical points of
views.
Besides, not many diagnostic tests have been designed so far. The
availability of computers have come up as a help, and we have witnessed the
development of a web-based DIALANG, which offers diagnostic test in
fourteen European languages, each having five modules : reading, writing,
listening, grammatical structures, and vocabulary.
The above types of tests reflect the very purpose and use of tests in actual
practice. Over decades of language testing, the following sets / binaries of
approaches to language testing emerged, based on which the tests were
prepared:
i) Direct versus Indirect Testing
The ‘direct’ versus ‘indirect’ continuum refers to the degree to which a test
task approaches the actual criterion performance. As the term suggests,
‘Direct’ testing means a direct assessment of the learners in the skills
concerned. That is, if we want to test the ability of composition among the
learners, we get them to produce a composition or if we have to evaluate the
pronunciation, we ask them to speak. In other words, a ‘direct test’ requires
the candidate to perform the skill that is being tested. Direct test is generally
24
better conducted on the productive skills, like reading and writing, in an
authentic situation. Though the tests cannot be called to be authentic
situation in the real sense of the term, still the situation can be brought very
close to real life. The example of a direct test is a continuous piece of
writing within a limited time frame, with an unknown topic.
In this approach the positive points are that we are clear about the test target
to be assessed; hence it becomes easier to develop the authentic situation, to
assess and interpret learners’ performance, and to get a positive backwash
effect.
Unlike the ‘direct test’, the ‘Indirect test’ attempts to measure the abilities
which underlie the skills in which we are interested (Hughes 1996:15) like
the multiple choice tests of grammar and usage. One good example of
indirect testing is Lado’s (1961) paper – pencil test for testing pronunciation
ability through the identification of rhyming pairs of words. Another
example comes from grammar based test indirectly measuring writing
ability, for instance in TOEFL and IELTS, where learners are required to
identify erroneous or inappropriate expressions in formal Standard English.
The problem with indirect testing is with regard to construct and content
validity whether the test reflects what is being tested and what is actually
being covered in the test.
25
This approach to testing lacks in conveying the aim of the test hence creates
problems at the level of marking and evaluation, especially when the
evaluators / assessors are different from the test designer. Here learners’
performance on the test itself does not confirm the learners’ better
performance in the overall skill itself. That is, this approach to testing does
not necessarily promise a positive backwash effect. However, the ‘indirect’
testing is best suited for the diagnostic purposes.
Today ‘direct’ testing has attained an upper hand over ‘indirect’ testing. It is
important to mention here the existence of ‘Semi-Direct’ testing too. One
appropriate example is the ‘speaking test’ based on tape-recorded stimuli, to
which learners have to respond.
ii) Discrete point versus Integrative Testing
The term ‘discrete point’ refers to the testing of one element at a time. That
means one needs to construct separate tests for various language items.
Opposed to this, the ‘integrative’ tests combine various aspects of language
in one wholesome / integrated test. We can take, for example, the case of
writing a composition where the form, content, use of vocabulary, proper
use of grammar, etc can be checked.
26
Discrete point testing (such as diagnostic tests of grammar items) is
generally ‘indirect’. It has traditionally been referred to as tests of grammar
in which single sentences are the maximum stimulus unit presented. These
tests also use separate tasks for or subsets to assess the ‘four skills’. In other
words, this type of test selects particular levels of language code
organization (e.g. phonology, syntax) to test grammatical points (e.g. speech
sound discrimination, correct choice of preposition), while dealing with only
one skill (reading, writing, listening or speaking).
Integrative testing, with an exception of the cloze procedures, is basically
‘direct’ in nature. It involves higher level of language organization in a test;
hence it comes to be integrative by involving language in discourse, rather
than at the word, sentence or syllable level. Such tests integrate various
skills for assessment in one activity. These tests evaluate language
proficiency of learners keeping in mind their communicative competence
and performance. Studies by Carroll (1961) and Oller (1979) are remarkable
with regard to this pair of approach to language testing.
iii) Norm-referenced versus Criterion-referenced Testing
‘Norm-referenced’ testing is meant to evaluate and place the learners in
terms of other learners. That is, it tells us the learners’ ranking (like top ten
or bottom five students) as conducted by TOEFL, IELTS, or The Michigan
27
Test of English Language Proficiency. It intends to make comparisons
between individuals. It is, for instance, necessary in a situation, where a
fixed quota of the best achievers, has to be selected.
The ‘criterion-referenced’ testing, on the other hand, intends to tell what a
learner can actually do in the target language. That is, it suggests how
proficient a learner is in performing certain skills / functions through the
target language. This test does not rank-order testees. The simple aim is to
find out whether a learner has mastered a set of specific objectives. Here the
tasks are generally set and the performances are evaluated. That means
learners are supposed to measure their progress in relation to meaningful
criteria. Tests conducted by Interagency Language Roundtable (ILR) and
Berkshire Certificate of Proficiency in German Level 1, for instance, have
developed certain criteria for the language skills to be reflected for being
determined as proficient in the respective skills. Some other examples are
ACTFL Oral Proficiency Interview, the FBI Listening Summary Translation
Exam (Scott et al, 1996), the Canadian Academic English Language (CAEL)
Assessment (Jennings et al, 1999).
A meaningful discussion of this binary of approaches can be found in the
studies by Popham (1978), Ebel (1978), Skehan (1984), Hudson and Lynch
(1984), Hughes (1986), and Brown and Hudson (2000).
28
iv) Objective testing versus Subjective testing
The main distinction between the types of testing is in terms of methods of
scoring. If a test provides a scoring scale to be followed by the scorer, the
test is claimed to be ‘objective’, otherwise it is ‘subjective’. The ‘subjective’
testing or scoring is actually a sort of ‘impressionistic’ scoring, which varies
from the evaluation of a composition task to the scoring of short answers in
response to a reading comprehension passage.
Objectivity in testing is given priority for better reliability and greater
agreement between various scores / evaluators of the same test. This means,
it is better if a test is constructed with its specifications and scoring scale.
However, in EFL and ESL countries, in general, such things exist only in
principle.
v) Computer adaptive testing
Since the last quarter of the 21st century, a consistent effort (successful in
some instances) is going on to develop computer adaptive language tests.
This type of testing mainly helps in collecting information on learners’
ability. It also helps in conducting and aligning the difficulty levels of tests
as per the ability of the learners. That is, to begin with all learners are given
an average level test, where the learners’ abilities are identified and they are
29
further tested accordingly. For instance, those who respond correctly (in the
introductory average-level test) are presented with more difficult items,
while those who respond incorrectly are presented with easier items.
With the introduction of such practices as e-learning, e-content generation
and on-line teaching and testing, the tests are required to be adapted as per
the new challenges of testing. Chalhoub-Deville (1999), Chalhoub-Deville
and Deville (1999), and Fulcher (2000) discuss issues relating to the role of
computers adaptation for language testing.
vi) Communicative language testing
The next chapter will deal with a detailed description of this approach. In a
nut shell, this approach actually came up as a reaction almost revolutionizing
the earlier practices with regard to the matter and manner of the language
tests. It suggests the need to test learners’ ability to communicate. In order to
do so, the test designers had to replace the aspects of grammatical items and
linguistic competence by communicative grammar, functions and notions of
the target language and the communicative competence. Although it has its
own limitations, it had a considerable influence. Such studies as Morrow
(1979), Brumfit and Johnson (1979), Canale and Swain (1980), Alderson
and Hughes (1981, part 1), Hughes and Porter (1983), Davies (1988), and
30
Weir (1990) are most representative contributions with regard to
Communicative language testing.
2.4 Characteristics of language tests:
So far we saw that language testing over the last one century has been
approached differently and has produced various test types. These tests
carried certain typical features of their own. But they all were expected to
carry the following three characteristics: namely, validity, reliability, and
practicality. That is, any test that we use must be appropriate in terms of
objectives, dependable in the evidence it provides, and applicable to our
particular situation. The development of the above concepts is a typical
feature of the ‘psychometric’ phase that emerged after the ‘pre-scientific’
days.
i) Validity:
If a test is found to be based upon a sound analysis of the skill(s) intended to
be measured, and if there is sufficient evidence that test scores correlate
fairly highly with actual ability in the skills area being tested, then we may
call that test to be valid. In simpler words, a test is said to be valid if it
measures accurately what is intended to measure.
31
By definition, though the concept of validity seems to be quite simple, the
researchers have often found it complex and difficult to separate it from the
concept of reliability. General discussion on validity and ways of measuring
it can be studied in Messick (1989), Bradshaw (1990), Wall et al (1994),
Cumming and Berwick (1996), Fulcher (1997), Anastasi and Urbina (1997),
and Nitko (2001).
In recent times, the term ‘construct validity’ has been increasingly used in
place of the general notion of validity. This is so because language tests are
generally created in order to measure some theoretical constructs in the form
of language items like reading ability, fluency in speaking, accuracy in
writing, written composition, control of grammar and so on. This term was
first used in the context of psychological tests of personality to measure
psychological constructs. Bachman and Palmer (1981) is one of the early
attempts to introduce construct validity to language testing. Cripper and
Davies (1988) and Hughes, Porter and Weir (1988) made further significant
contributions on these lines.
However, it is not enough to say that a test needs to have construct validity.
Rather it is required to reflect the construct validity in the form of some
evidences, like ‘Content validity’, ‘Criterion-related validity’, and ‘face
validity’.
32
The notion of validity is reflected in the following forms:
a) Content Validity: If a test is designed to measure mastery of a
specific skill or the content of a particular course of study, it is expected that
the test is based upon a careful analysis of that skill or the outline of the
course. That is, a test must take into consideration the language item (a skill
or some grammatical structure), the aspects of this skill to be tested, and its
extent / proportion as per the level of the learners (intermediate level or
advanced level, for instance). A mere inclusion of a language item from a
course of study is not enough to evidence the ‘content validity’. In other
words, it is important for the test designers to consult the test specification, if
at all it exists. The availability of ‘Content validity’ can be judged on the
basis of the proximity of ‘test specification’ and ‘test content’. ‘Content
validity’ is directly proportionate to the backwash effect and the accurate
measure of what the test is intended to measure. The higher is the ‘content
validity’, the more accurate (higher construct validity) will be the test and
the backwash effect.
It has been often observed that the test designers ignore the ‘content validity’
and give preference to simpler and easier content over the important content.
b) Criterion-related validity: This evidence of construct validity is
often termed as ‘empirical validity’, because here the correlation between
33
the test scores is established by comparing them to independent, outside
criterion such as marks attained at the end of a course or instructors’ or
supervisors’ ratings. If there is a high correlation between the two sets of test
scores, that means the ‘criterion-related validity’ is also high. This validity is
of two types:
Concurrent validity: It is established when both the test and the
criterion are administered at about the same time. This can be exemplified
through the ratings of the two types (in terms of duration, coverage, or
mode, etc) of achievement tests for the same content. If the comparison
between the two tests shows a higher agreement, then both the test are
considered to carry higher concurrent validity.
Predictive Validity: This refers to the degree to which a test can
predict a candidates’ future performance. This can be well understood in
terms of a proficiency test, which is given to predict a learner’s ability to
take up a graduate course at a British or American university.
c) Face Validity: A test is considered to carry ‘face validity’ if it is
confirmed that it measures what it is supposed to measure. A pen-pencil test
to measure a learner’s phonetic / pronunciation ability, for example, lacks in
‘face validity’. Face validity actually depends on the content and criterion-
34
related validities. That is if a test lacks in content, it automatically promises
to lack in face validity.
ii) Reliability: The term ‘reliability’ refers to the stability of test scores.
That is, a test is considered to be reliable only if the test scores do not vary
significantly due to the change in such external factors like day, date and
place, or due to the evaluation conducted by two or more sets of independent
examiners, or even due to the two parallel forms of the same test.
In order to attain reliability one needs to ‘construct’, ‘administer’, and
‘score’ tests in such a way that the scores obtained on a test on a particular
occasion are similar to the scores obtained by the same students with the
same ability, but at a different time. The more similar the scores, the more
reliable the test is supposed to be. Reliability is perceived in terms of ‘Test
Reliability’ and ‘Scorer Reliability’.
Test Reliability: Test reliability is attained by conducting sampling of
tasks on varieties of students. Sometimes test reliability is affected by the
change in administration or the lack of motivation among students.
Scorer Reliability: It concerns the stability or consistency with which
test performances are evaluated. Scorer reliability is often affected in
subjective tests or impressionistic evaluations on such test items like
35
composition task, or free response tests, etc. To achieve this reliability, an
attempt is often made to develop a scoring scale to be followed by all
evaluators. Scorer reliability is nearly perfect in the case of multiple choice
test, if the options to the test are carefully drawn.
Studies by Lado (1961), Feldt and Brennan (1989), Anastasi and Urbina
(1997), Nitko (2001), and Brown and Hudson (2002) are some
representative contributions with regard to reliability of tests.
iii) Practicality: A test may be very valid and reliable and yet may not be
practicable in its application. The first and foremost aspect is the cost
effectiveness of the test. Besides, economy of time is also significant. Aspect
of time is to be managed keeping in mind the level of students. Hence this is
also related to the difficulty level of the test, length and breath of the task. In
addition, ease of administration and scoring also becomes important for
practicality of the test.
2.5 Summing up:
The present chapter has mainly tried to capture the advancements made in
the field of language testing over the last one century. In the process it also
takes into consideration its overall development in terms of its types,
approaches and characteristics.
36
References:
Abraham, R. G. & Chapelle, C. A. 1992. The meaning of cloze test scores: An item difficulty perspective. MLJ, 76, 468-479.
Alderson, J. C. (1981) Report of the discussion on communicative language testing. In J. C. Alderson and A. Hughes (eds.). Issues in Language Testing. ELT Documents 111. London: The British Council.
Alderson and Hughes (eds.) 1981.Issues in Language Testing. ELT Documents 111.London; The British Council.
Anastasi, A. and S. Urbina. 1997. Psychological Testing (7th Edition). Upper Saddle River, N.J.: Prentice Hall. In Hughes, A. 1989. Testing For Language Teachers. Cambridge: Cambridge University Press.
Anderson, N. J. 1991. Individual differences in strategy use in second language reading and testing. MLJ, 75, 460-472.
Angiolillo, P. 1947. Armed Forces Foreign Language Teaching. New York: Vanni. p. 32.
Ayer, G. W. 1960. An auditory discrimination test based on Spanish, MLJ, 44, 227-230.
Bachman, L. F. (1990) Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Barnwell, D. 1987. Oral Proficiency Testing in the United States. British Journal of language teaching. 25 (1): 35-42.
Bell T. O., Doyle, V. & Talbott, B. L. 1977. Review essay: The Mat-Sea-Cal oral proficiency tests: A review of. MLJ, 61, 136-138.
Bensoussan, M. & Ramzar, R. 1984. Testing EFL reading comprehension using a multiple choice rational cloze. MLJ, 68, 230-239.
Blickenstaff, C. B. 1963. Musical talents and foreign language learning ability. MLJ, 47, 359-363.
Borg, W. R. & Goodman, J. S. 1956. Development of an individual test of English for foreign students. MLJ, 40, 240-244.
Bottke, K. G. & Milligan, E. E. 1945. Test of aural and oral aptitude for foreign language study. MLJ, 29, 705-709.
Bovee, A. G. & Froelich, G. J. 1946. Some observations on the relationship between mental ability and achievement in French. MLJ, 30, 333-336.
Boyle, J. P. 1987. Intelligence, reasoning and language proficiency. MLJ, 71, 277-288. Boyle, T. A., Smith, W. F., & Eckert, R. G. 1976. Computer mediated testing: A
branched programachievement test. MLJ, 60, 428-440.
Bradshaw, J. 1990. Test-Takers’ Reactions to Placement Test. Language Testing 7: 13-30.
37
Brereton, J. L. 1944. In Glenn Fulcher. 2003. Testing Second Language Speaking. Edinburgh: Pearson Education Limited.
Brier, E. J., Clausing, G., Senko, D., & Purcell, E. 1978. A look at cloze testing across languages and levels. MLJ, 62, 23-26.
Brinsmade, C. 1928. Concerning the College Board examinations in Modern languages. MLJ, 8, 87-100, 212-227.
Broom, E. & Kaulfers, W. V. 1927. Two types of objective tests in Spanish. MLJ, 11, 517-521.
Brown, J. D. 1980. Relative merits of four methods for scoring cloze tests, MLJ, 64, 311-317.
Brown, J. D. and T. Hudson. 2002. Criterion-referenced Language Testing.Cambridge: Cambridge University Press.
Brumfit, C. J. & K. Johnson. 1979. The Communicative Approach to Language Teaching. Oxford: Oxford University Press/ ELBS.
Buechel, E. H. 1957. Grades and ratings in language proficiency evaluations. MLJ, 41, 41-47.
Byrnes, H. 1987. Proficiency as a framework for search in second language acquisition. MLJ, 71, 44-49.
Canale, M. and M. Swain. 1980. Theoretical Bases of Communicativer Approaches to Second Language Teaching and Testing. Applied Linguistics 1: 1-47
Carroll, J. B. 1954. Notes on the achievement of measurement in foreign languages. Unpublished manuscript, cited in Spolsky, B. 2000. Language Testing in the Modern Language Journal, 84, 536-552.
Carroll, J. B. 1961. Fundamental Considerations in Testing for English Language Proficiency of Foreign Students. In H. B. Allen and R. N. Campbell (eds.) 1972. Teaching English as a Second Language: A Book of Readings.New York: McGraw Hill.
Caulfield, J. & Smith, W.C. 1981. The reduced redundancy test and the cloze procedure as measures of global language proficiency. MLJ, 65, 54-58.
Chalhoub-Deville M. & C. Deville. 1999. Computer Adaptive Testing in Second Language Contexts. Annual Review of Applied Linguistics. 19:273-99.
Cheydleur, F. D. 1928. Results and significance of the new type of modern language tests. MLJ,12, 513-531.
Cheydleur, F. D. 1931. The use of placement testsin modern languages at the university of Wisconsin. MLJ, 15, 262-280.
Cheydleur, F. D. 1932a. Mortality of modern language students: Its causes and prevention. MLJ, 17, 104-136.
Cheydleur, F. D. 1932b. The relationship between functional and theoretical grammar. MLJ, 16, 310-334.
38
Clapp, H. L. 1947. Meditations on a placement programme or when should a foreign language be studied?. MLJ, 31, 203-207
Clark, J. L. D. 1983. Language testing: Past and current status – Directions for the future. MLJ, 67, 431-443.
Clarke, F. M. 1931. Results of the Bryn Mawr Test in French administered in New York City high schools.Bulletins of High Points. 13, 4-13.
College Entrance Examination Board. 1939. Thirty Ninth Annual Report of the Secretary. New York: College Entrance examination Board. In Fulcher, Glenn. 2003. Testing Second Language Speaking. Edinburgh: Pearson Education Limited. P. 2.
Committee on Resolutions and Investigations. 1917. Report of Committee on resolutions and Investigations appointed by the association of Modern Langugae Teachers of the Middle States and Maryland. In Fulcher, Glenn. 2003. Testing Second Language Speaking. Edinburgh: Pearson Education Limited. P. 2.
Conant, J. B. 1934. Notes and news: President Conant speaks again. MLJ,19, 465-466. Cooper, T. C. 1987. Foreign language study and SAT – verbal scores. MLJ, 32, 596-
599. Coutant, V. 1948. Evaluation in Foreign language teaching.MLJ, 32, 595-599.
Cripper, C. and A. Davies. 1988. ELTS Validation Project Report. Cambridge: The British Council and Cambridge local Examinations Syndicate. In Hughes, A. 1989. Testing for Language Teachers. Cambridgr; Cambridge University Press.
Cumming, A. and R. Berwick. (eds.).1996. Validation in Language Testing.Clevedon: Multilingual Matters. In Hughes, A. 1989. Testing for Language Teachers. Cambridgr; Cambridge University Press.
Currall, S. C. & Kirk, R. E. 1986. Predicting success in intensive foreign language courses.MLJ, 70, 107-113.
Curtin, C., Avner, A., & Smith, L. A. 1983. The Pimsleur Battery as a predictor of student performance. MLJ. 67, 33-40.
Davies, A. 1988. Communicative Language Testing. In Hughes, A. 1989. Testing for Language Teachers. Cambridgr; Cambridge University Press.
Decker, W. C. 1925. Oral and aural tests as integral parts of Regents’ examinations. MLJ, 9, 369-371.
Di Giulio, E. 1967. Testing in the language laboratory. MLJ, 51, 103-104. Dunkel, P. A. 1991. Computerized testing of non-participatory L2 listening
comprehension proficiency: An ESL prototype development effort. MLJ, 75, 64-73. Ebel, R. L. 1978. The Case for Norm-referenced Measurements. Educational
Researcher 7(11):3-5. In Hughes, A. 1989. Testing for Language Teachers. Cambridgr; Cambridge University Press.
Eley, E. G. 1956. Testing the language arts. MLJ, 40, 310-315.
39
Eterno, J. A. 1961. Foreign language pronunciation and musical aptitude. MLJ, 45, 168-170.
Feder, D. D. & Cochran, G. 1936. Comprehension maturity tests: A new departure in measuring reading ability. MLJ. 20, 201-208.
Feldt, L. S. and R. L. Brennan. 1989. Reliability. In Linn (ed.). 1989. In Hughes, A. 1989. Testing for Language Teachers. Cambridgr; Cambridge University Press.
Fife, R. H. 1937. Reading tests in French and German. MLJ, 22, 56-57. Frantz, A. I. 1939. The reading knowledge test in the foreign languages: A survey.
MLJ, 23, 440-446. Fulcher, G. 1997. An English Language Placement Tests: issues in Reliability and
Validity. Language Testing. 14: 113-138. Fulcher, G. 2000. Computers in Language Testing. In Brett and Motteram (eds.) 2000.
Fulcher, Glenn. 2003. Testing Second Language Speaking. Edinburgh: Pearson Education Limited.
Furness, E. L. 1955. A plea for a broader program of testing. MLJ, 39, 255-259. Gabbert, T. A. 1941. Predicting success in an intermediate junior college reading
course in Spanish. MLJ, 8, 637-641. Gaies, S., Gradman, H., & Spolsky, B. 1977. Towards the measurement of functional
proficiency: Contextualization of the noise test. TESOL Quarterly, 11, 51-57. Guerra, E. L., Abrahamson, D. A., & Newmark, M. 1964. The New York City foreign
language oral ability rating scale. MLJ, 48, 486-489. Haden, E. & Stalnaker, J. M. 1934. A new type of comprehensive foreign language test.
MLJ, 19, 81-92. Harding, F. D. J. 1958. Tests in selection of language students. MLJ, 42, 120-122.
Heilenman, L. K. 1983. The use of a cloze procedure in foreign language placement. MLJ, 67, 121-126.
Heining-Boynton, A. L. 1990. The development and testing of the FLES program evaluation inventory. MLJ, 74, 432-439.
Henmon, V. A. C., Bohan, J. E., Brigham, C. C., Hopkins, L. T., Rice, G. A., Symonds, P. M. Todd, J. W., & Van Tassel, R. J. (eds.). 1929. Prognosis tests in the modern foreign languages: Report prepared for the Modern Foreign Language Study and the Canadian Committee on Modern Languages (Vol. 16). New York: Macmillan Company.
Henmon, V. A. C. 1934. Recent developments in the study of foreign language problems. MLJ, 19, 187-201.
Hieke, A. E. 1985. A componential approach to oral fluency evaluation. MLJ,I 69, 135-142.
40
Hudson, T. & B. Lynch. 1984. A Criterion-referenced Approach to ESL Achievement Testing . Language Testing 1: 171-201.
Hughes, A. and D. Porter. (Eds.). 1983. Current Development in Language Testing. London: Academic Press.
Hughes, A. 1986. A Pragmatic Approach to Criterion-referenced Foreign Language Testing. In Hughes, A. 1989. Testing for Language Teachers. Cambridge: Cambridge University Press.
Hughes, A., D. Porter and C. Weir. (Eds.).1988. Validating the ELTS Test: A Critical Review. Cambridge: The British Council and University of Cambridge Local Examination Syndicate. In Hughes, A. 1989. Testing for Language Teachers. Cambridgr; Cambridge University Press.
Hughes, A. 2003. Testing for Language Teachers. Cambridge: Cambridge University Press.
Jacobson, M. & Imhoof, M. 1974. Predicting success in learning a second language. MLJ, 58, 329-336.
Kaulfers, W. V. 1930. Why prognose in the foreign languages? MLJ, 14, 296-301.
Kaulfers, W. V. 1933. Practical techniques for testing comprehension in extensive reading. MLJ 17, 321-327.
Kaulfers, W. V. 1937. Objective tests and exercises in French Pronunciation. MLJ, 22, 186-188.
Kaulfers, W. V. 1939. Prognosis and its alternatives in relation to the guidance of students. German Quarterly, 12, 81-84.
Kauflers, W. V. 1944. War time development in modern language achievement testing. MLJ, 28, 136-150.
Kaulfers, W. V. 1948. Post war language teaching in California Colleges. MLJ, 32, 368-372.
Kramsch, C. J. 1986. From language proficiency to international competence. MLJ, 70, 366-372.
Kurath, W. & Stalnaker, J. M. 1936. Two German vocabulary tests. MLJ, 21, 95-102. Lado, R. 1961. Language Testing. London. Longman.
Lambert, R. D. 1987. The case for a national foreign language center: An editorial. MLJ, 71, 1-11
Lange, D. L. & Clausing, G. 1981. An examination of two methods of generating and scoring Cloze tests with students of German on three levels. MLJ, 65, 254-261.
Lantolf, J. P. & Frawley, W. 1985. Oral proficiency testing: A critical analysis. MLJ, 69, 228-234.
Lee, & Musumeci. 1988. In Spolsky, B. 2000. The Modern Language Journal, Vol. 84, No. 4, Special Issue: A Century of Language Teaching and Research: Looking Back and Looking Ahead, Part 1 (Winter, 2000), pp. 536-552. Blackwell Publishing
41
on behalf of the National Federation of Modern Language Teachers Associations. URL: http://www.jstor.org/stable/330305. Accessed on 07/12/2009 00:43.
Liskin-Gasparro, J. E. 1984b. The ACTFL Proficiency Guidelines: A Historical Perspective. P. 22. In Fulcher, Glenn. 2003. Testing Second Language Speaking. Edinburgh: Pearson Education Limited.
Lowe, J. P. 1986. Proficiency: Panacea, framework, process? A reply to Kramsch, Schulz, and particularly to Bachman and Savignon. MLJ, 70, 391-397.
Lundeberg, O. K. 1929. Recent developments in audition-speech tests. MLJ, 144, 193-202.
Madsen, H. S. 1979. An indirect measure of listening comprehension. MLJ, 63, 429-435.
Maronpot, R. P. 1939. Discovering and salvaging modern language risks. MLJ, 23, 595-598.
Matheus, J. F. 1937. Corelation between psychological test scores, language test scores, and semester grades. MLJ, 22, 104-106.
McCarthur, J. 1965. Measurement, interpretation, and evaluation – TV FLES: Item analysis in test instruments. MLJ, 49, 217-219.
Mendeloff, H. 1963. Aural placement by television. MLJ, 47, 110-113.
Meredith, R. A. 1978. Improved oral test scores through delayed response. MLJ, 62, 321-327.
Meredith, R. A. 1990. The oral proficiency interview in real life: sharpening the scale. MLJ, 74, 288-296.
Messick, S. 1989. Validity. In Hughes, A. 1989. Testing for Language Teachers. Cambridgr; Cambridge University Press.
Michel, S. V. 1934. Prognosis in the Modern Foreign Languages. Unpublished master’s thesis, University of Minesota, Minneapolis. Cited in Spolsky 2000.
Michel, S. V. 1936. Prognosis in German. MLJ. 20, 275-287. Miller, M. M. 1948. Test on Spanish and Spanish American Life and Culture. MLJ, 32,
140-144. Miyata-Boddy, Nick & Langham, Clive S. 2000. Communicative language testing - an
attainable goal?. Tokyo: The British Council. Morrow, K. 1979. Communicative Language Testing: Revolution or Evolution? In C. J.
Brumfit and K. Johnson. (eds.). 1979. The Communicative Approach to Language Teaching. Oxford: Oxford University Press/ELBS.
Mueller, T. 1959. Auding Tests. MLJ, 43, 185-187. Natalicio, D. S. 1979. Repetition and dictation as language testing techniques. MLJ, 63,
165-176.
42
Nitko, A. J. 2001. Educational assessment of Students (3rd Edition). Upper Saddle River, N. J.: Prentice Hall.
Oller, J. W. 1972. Scoring methods and difficulty levels for cloze tests of proficiency in English as a second language. MLJ, 56, 151-158.
Oller, J. W. 1976a. Review Essay: The measurement of Bilingualism: A review of. Modern Language Review, 60, 399-400.
Oller. J. W. 1979. language Tests at School: A Pragmatic Approach. London: Longman.
Politzer, R. L. 1974. Devrlopmrntal sentence scoring as a method of measuring second language acquisition. MLJ, 58, 245-250.
Popham, W. J. 1978. The Case for Criterion-referenced Measurements. Educational Researcher 7 (11): 6-10.
Raymond, J. 1951. A controlled association exercise in Spanish. MLJ, 35, 280-289. Richardson, H. D. 1933. Discovering aptitude for the foreign languages. MLJ, 18, 160-
170. Roach, J. O. 1945. Some problems of oral examinations in modern languages: An
experimental approach based on the Cambridge examinations in English for foreign students, being a report circulated to oral examiners and local examiners for those examinations. Cambridge, England: Local Examination Syndicate.
Rogers, A. L. & Clarke, F. M. 1933. Report on Bryn Mawr Test of ability to understand Spoken French. MLJ, 17, 241-248.
Sadnavitch, J. M. & Popham, W. J. 1961. Measurement of Spanish achievement in the elementary schools. MLJ, 45, 297-299.
Salomon, E. 1954. A Generation of Prognosis testing. MLJ, 38, 299-303.
Sammartino, P. 1938. A language achievement scale. MLJ, 22, 429-432. Savignon, S. J. 1985. Evaluation of communicative competence: The ACTFL
Provisional Proficiency Guidelines. MLJ, 69, 129-134. Schulz, R. A. 1986. From achievement to proficiency through classroom instruction:
some caveats. MLJ, 70, 373-379. Seibert, L. C. & Goddard, E. R. 1935. A more objective method of scoring
composition. MLJ, 20, 143-150. Shane, A. M. 1971. An evaluation of the existing college norms for the MLA-
Cooperative Russian Test and its efficacy as placement examination. MLJ, 55, 93-99.
Shohamy, E. 1982. Affective considerations in language testing. MLJ, 66, 13-17. Shohamy, E., Gordon, C. M., & Kraemer, R. 1992. The effect of raters’ background
and training on the reliability of direct writing tests. MLJ, 76, 27-33.
43
Smith, F. P. & Cambell, H. 1942. Objective achievement testing in French recognisition versus recall tests. MLJ, 76, 27-33.
Spoerl, D. T. 1939. A study of some of the possible factors involved in foreign language learning. MLJ, 23, 428-431.
Spolsky, B. 1975. Language Testing: Art or Science? In Moller, Alan. 1981. “Reaction to the Morrow Paper”. In ELT Documents 111 – Issues in Language Testing. London: The British Council.
Spolsky, B. 2000. The Modern Language Journal, Vol. 84, No. 4, Special Issue: A Century of Language Teaching and Research: Looking Back and Looking Ahead, Part 1 (Winter, 2000), pp. 536-552. Blackwell Publishing on behalf of the National Federation of Modern Language Teachers Associations. URL: http://www.jstor.org/stable/330305. Accessed on 07/12/2009 00:43.
Spurling, S. 1987. The fair use of an English language admissions test. MLJ, 71, 410-421.
Stalnaker, J. M. & Kurath, W. A. 1935. A comparision of two two types of foreign language vocabulary test. Journal of Educational Psychology, 26, 435-442.
Stansfield, C. W. & Kenyon, D. M. 1992. The development and validation of a simulated oral proficiency interview. MLJ, 76, 129-141.
Stansfield, C. W., Scott, M. L. & Kenyon, D. M. 1992. The measurement of Translation ability. MLJ, 75, 455-467.
Stubbs, J. B. & Tucker, G. R. 1974. The cloze test as a measure of English proficiency. MLJ, 58, 239-241.
Symonds, P. M. 1930a. A foreign language prognosis test. Teachers’ College Record. 31, 540-546.
Symonds, P. M. 1930b. Foreign Language Prognosis Test. New York: Teachers College. Columbia University.
Tallent, E. R. E. 1938. Three coefficient of correlations that concern modern foreign languages. MLJ, 22, 591-594.
Taylor, J. S. 1972. Would achievement testing of students solve the problem of incompetent language teachers? MLJ, 56, 360-364.
Tharp, J. B. 1930. Effect of oral-aural ability on scholastic ability. MLJ, 15, 10-26. Thorndike, E. L. ed. 1921. The Teachers’ Word Book. New York: Teacher’s College,
Columbia University. Valley, J. R. 1959. College actions on CEEB advanced placement language
examination candidates. MLJ, 43, 261-263. Wagner, M. E. & Strabel, E. 1935. Predicting success and failure in college ancient and
modern foreign languages. MLJ, 19, 285-293. Wall, D., C. Clapham and J. C. Alderson.1994. Evaluating a Placement Test. Language
Testing 11:321-344.
44
Weir, C. J. 1990. Communicative Language Testing. Hemel Hempstead: Prentice Hall. Wesche, M. Bingham. 1983. Communicative Testing in a Second Language. The
Modern Language Journal, 67, 41-55. Woodford, P. E. 1980. Foreign language testing. MLJ, 64, 97-102.