DESIGNING LANGUAGE TEST
Testing is a universal feature of social life. Throughout history people
have been put to the test to prove their capabilities or to establish their credentials;
this is the stuff of Homeric epic, of Arthurian legend.
There are many reasons for developing a critical understanding of the
principles and practice of language assessment. Obviously you will need to do so
if you are actually responsible for language test development and claim expertise
in this field. But many other people working in the field of language study more
generally will want to be able to participate as necessary in the discourse of this
field, for a number of reasons.
First, language tests play a powerful role in many people’s lives, acting as
gateways at important transitional moments in education, in employment, and in
moving from one country to another. Since language tests are devices for the
institutional control of individuals, it is clearly important that they should be
understood, and subjected to scrutiny. Secondly, you may be working with
language tests in your professional life as a teacher or administrator, teaching to a
test, administering tests, or relying on information from tests to make decisions on
the placement of students on particular courses.
Finally, if you are conducting research in language study you may need to
have measures of the language proficiency of your subjects. For this you need
either to choose an appropriate existing language test or design your own.
Thus, an understanding of language testing is relevant both for those
actually involved in creating language tests, and also more generally for those
involved in using tests or the information they provide, in practical and research
contexts.
TEST TYPES
Team System testing tools provide several test types that you can use for
your specific software testing purposes. This section describes those test types. It
also describes how to create and customize tests of each type, because these
processes are specific to the type of test.
However, common among the various test types are many test-related
tasks such as managing tests and working with test results. These common tasks
are described in the section Testing Tools Tasks.
In This Section
Working with Unit Tests
Provides links to topics that describe unit tests and how to create them.
Working with Web Tests
Describes how to create, edit, run, and view Web tests.
Working with Load Tests
Describes the uses of load tests, how to edit and run them, how to collect
and store load test performance data, and how to analyze load test runs.
Working with Manual Tests
Describes how to create and run manual tests, the only non-automated test
type.
Working with Generic Tests
Describes how to create and run generic tests. Generic tests wrap external
programs and tests that were not originally developed for use in the Team
System testing tools.
Working with Ordered Tests
Describes how to create ordered tests, which contain other tests that are
meant to be run in a specified order.
TEST SPECIFICATION
The test case specifications should be developed from the test plan and are
the second phase of the test development life cycle. The test specification should
explain "how" to implement the test cases described in the test plan.
Test Specification Items
Each test specification should contain the following items:
Case No.: The test case number should be a three-part identifier of the
following form: c.s.t, where c is the chapter number, s is the section
number, and t is the test case number.
Title: is the title of the test.
ProgName: is the program name containing the test.
Author: is the person who wrote the test specification.
Date: is the date of the last revision to the test case.
Background: (Objectives, Assumptions, References, Success Criteria):
Describes in words how to conduct the test.
Expected Error(s): Describes any errors expected.
Reference(s): Lists reference documentation used to design the
specification.
Data: (Tx Data, Predicted Rx Data): Describes the data flows between the
Implementation Under Test (IUT) and the test engine.
Script: (Pseudo Code for Coding Tests): Pseudo code (or real code) used to
conduct the test.
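The items above can be captured as a simple record. The following is a minimal sketch, assuming the c.s.t parts are all numeric; the names `CaseSpec` and `parse_case_no` and the sample values are hypothetical and are not part of any particular test tool.

```python
from dataclasses import dataclass, field
from datetime import date


def parse_case_no(case_no: str) -> tuple:
    """Split a 'c.s.t' identifier into (chapter, section, test case)."""
    parts = case_no.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected 'c.s.t' with numeric parts, got {case_no!r}")
    chapter, section, test = (int(p) for p in parts)
    return chapter, section, test


@dataclass
class CaseSpec:
    """One test specification, with one field per item listed above."""
    case_no: str
    title: str
    prog_name: str
    author: str
    revised: date            # date of the last revision
    background: str          # objectives, assumptions, success criteria
    expected_errors: list = field(default_factory=list)
    references: list = field(default_factory=list)
    data: str = ""           # Tx data and predicted Rx data
    script: str = ""         # pseudo code or real code

    def __post_init__(self) -> None:
        parse_case_no(self.case_no)  # reject malformed identifiers early


# Hypothetical example values, for illustration only.
spec = CaseSpec(
    case_no="3.2.7",
    title="Reject oversized frame",
    prog_name="frame_tests",
    author="A. Tester",
    revised=date(2024, 1, 15),
    background="Send a frame above the size limit and expect a rejection.",
    expected_errors=["frame too large"],
)
print(parse_case_no(spec.case_no))
```

Validating the identifier when the record is created keeps malformed case numbers out of the specification set before any test is run.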
TEST ITEMS
The table below presents both pros and cons for various test item types.
Your selection of item types should be based on the types of outcomes you are
trying to assess (see analysis of your learning situation). Certain item types such
as true/false, supplied response, and matching, work well for assessing lower-
order outcomes (i.e., knowledge or comprehension goals), while other item types
such as essays, performance assessments, and some multiple choice questions, are
better for assessing higher-order outcomes (i.e., analysis, synthesis, or evaluation
goals). The italicized bullets below will help you determine the types of outcomes
the various items assess.
With your objectives in hand, it may be useful to create a test blueprint
that specifies your outcomes and the types of items you plan to use to assess those
outcomes. Further, test items are often weighted by difficulty. On your test
blueprint, you may wish to assign lower point values to items that assess lower-
order skills (knowledge, comprehension) and higher point values to items that
assess higher-order skills (synthesis, evaluation).
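A test blueprint of this kind can be sketched as a small table in code. The example below is only an illustration: the outcome levels come from the text, while the item counts and point values are hypothetical.

```python
# Hypothetical blueprint rows:
# (outcome level, item type, number of items, points per item)
blueprint = [
    ("knowledge", "true/false", 10, 1),
    ("comprehension", "matching", 5, 2),
    ("analysis", "multiple choice", 5, 4),
    ("evaluation", "essay", 1, 20),
]


def points_by_level(rows):
    """Total the points that each outcome level contributes to the test."""
    totals = {}
    for level, _item_type, count, points in rows:
        totals[level] = totals.get(level, 0) + count * points
    return totals


totals = points_by_level(blueprint)
total_points = sum(totals.values())
for level, pts in totals.items():
    print(f"{level:<14} {pts:>3} points ({pts / total_points:.0%} of the test)")
```

Reading the shares per level makes it easy to check that higher-order outcomes carry the weight you intended before any items are written.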
Multiple Choice (see tips for writing multiple choice questions below)
Pros: more answer options (4-5) reduce the chance of guessing that an item is correct; many items can aid in student comparison and reduce ambiguity; greatest flexibility in type of outcome assessed: knowledge goals, application goals, analysis goals, etc.
Cons: reading time increased with more answers; reduces the number of questions that can be presented; difficult to write four or five reasonable choices; takes more time to write questions
True/False (see tips for writing true/false questions below)
Pros: can present many items at once; easy to score; used to assess popular misconceptions, cause-effect reactions
Cons: most difficult question to write objectively; ambiguous terms can confuse many; few answer options (2) increase the chance of guessing that an item is correct; need many items to overcome this effect
Matching
Pros: efficient; used to assess student understanding of associations, relationships, definitions
Cons: difficult to assess higher-order outcomes (i.e., analysis, synthesis, evaluation goals)
Interpretive Exercise (the above three item types are often criticized for assessing only lower-order skills; the interpretive exercise is a way to assess higher-order skills with multiple choice, T/F, and matching items)
Pros: a variation on multiple choice, true/false, or matching, the interpretive exercise presents a new map, short reading, or other introductory material that the student must analyze; tests student ability to apply and transfer prior knowledge to new material; useful for assessing higher-order skills such as applications, analysis, synthesis, and evaluation
Cons: hard to design, must locate appropriate introductory material; students with good reading skills are often at an advantage
Supplied Response
Pros: chances of guessing reduced; measures knowledge and fact outcomes well, terminology, formulas
Cons: scoring is not objective; can cause difficulty for computer scoring
Essay
Pros: less construction time, easier to write; encourages more appropriate study habits; measures higher-order outcomes (i.e., analysis, synthesis, or evaluation goals), creative thinking, writing ability
Cons: more grading time, hard to score; can yield great variety of responses; not efficient to test large bodies of content; if you give the student the choice of three or four essay options, you can find out what they know, but not what they don't know
Performance Assessments (includes essays above, along with speeches, demonstrations, presentations, etc.)
Pros: measures higher-order outcomes (i.e., analysis, synthesis, or evaluation goals)
Cons: labor and time-intensive; need to obtain inter-rater reliability when using more than one rater
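The guessing effect noted for multiple choice and true/false items can be made concrete with a short calculation. This is a sketch under a simplifying assumption that is not from the source: a student guesses blindly and independently on every item, so the number of correct answers follows a binomial distribution. The function name `p_at_least` is hypothetical.

```python
from math import comb


def p_at_least(n_items, n_options, min_correct):
    """Probability of getting at least min_correct of n_items right
    by blind guessing among n_options equally likely answers."""
    p = 1 / n_options
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )


# Chance of reaching 7 of 10 by guessing alone.
tf_short = p_at_least(10, 2, 7)   # true/false, 2 options
mc_short = p_at_least(10, 4, 7)   # multiple choice, 4 options
# Same 70% threshold on a longer true/false test.
tf_long = p_at_least(40, 2, 28)

print(f"10 T/F items: {tf_short:.4f}")
print(f"10 MC items:  {mc_short:.4f}")
print(f"40 T/F items: {tf_long:.4f}")
```

Under these assumptions, both adding answer options and adding items lower the chance that guessing alone reaches the threshold, which is the trade-off described in the pros and cons above.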
The tips below cover two popular item types: multiple choice questions and
true/false questions.
Tips for Writing Multiple Choice Questions
Avoid responses that are interrelated. One answer should not be similar to others.
Avoid negatively stated items: "Which of the following is not a method of food irradiation?" It is easy to miss the negative word "not." If you use negatives, bold-face the negative qualifier to ensure people see it.
Avoid making your correct response different from the other responses, grammatically, in length, or otherwise.
Avoid the use of "none of the above." When a student guesses "none of the above," you still do not know if they know the correct answer.
Avoid repeating words in the question stem in your responses. For example, if you use the word "purpose" in the question stem, do not use that same word in only one of the answers, as it will lead people to select that specific response.
Use plausible, realistic responses.
Avoid making the correct answer longer than the other responses.
Create grammatically parallel items to avoid giving away the correct response. For example, if you have four responses, do not start three of them with verbs and one of them with a noun.
Always place the "term" in your question stem and the "definition" as one of the response options.
Tips for Writing True/False Questions
Do not use definitive words such as "only," "none," and "always," that lead people to choose false, or uncertain words such as "might," "can," or "may," that lead people to choose true.
Do not write negatively stated items, as they are confusing to interpret: "Thomas Jefferson did not write the Declaration of Independence." True or False?
People have a tendency to choose "true," so design at least 60% of your T/F items to be "false" to further minimize guessing effects.
Use precise words (100, 20%, half), rather than vague or qualitative language (young, small, many).
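Some of the true/false tips are mechanical enough to check automatically. The following is a rough sketch, not a substitute for reading the items: it only matches the exact words quoted in the tips, and the function name `review_items` and the sample items are hypothetical.

```python
import re

DEFINITIVE = {"only", "none", "always"}   # words the tips say cue "false"
UNCERTAIN = {"might", "can", "may"}       # words the tips say cue "true"
NEGATIVE = {"not"}                        # negatively stated items


def review_items(items):
    """items: list of (statement, answer) pairs; answer is True or False.
    Returns a list of warning strings based on the tips above."""
    warnings = []
    checks = (("definitive", DEFINITIVE), ("uncertain", UNCERTAIN), ("negative", NEGATIVE))
    for i, (statement, _answer) in enumerate(items, start=1):
        words = set(re.findall(r"[a-z']+", statement.lower()))
        for label, vocab in checks:
            hits = sorted(words & vocab)
            if hits:
                warnings.append(f"item {i}: {label} word(s) {', '.join(hits)}")
    false_share = sum(1 for _s, answer in items if answer is False) / len(items)
    if false_share < 0.6:
        warnings.append(f"only {false_share:.0%} of items are false; aim for at least 60%")
    return warnings


sample = [
    ("Thomas Jefferson did not write the Declaration of Independence.", False),
    ("Water always boils at 100 degrees Celsius.", False),
    ("A week has seven days.", True),
]
for warning in review_items(sample):
    print(warning)
```

A check like this only flags candidates for revision; whether a flagged word actually gives the answer away still needs a human reading.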
SCORING
Scoring rubrics have become a common method for evaluating student work in
both the K-12 and the college classrooms. The purpose of this paper is to describe
the different types of scoring rubrics, explain why scoring rubrics are useful and
provide a process for developing scoring rubrics. This paper concludes with a
description of resources that contain examples of the different types of scoring
rubrics and further guidance in the development process.
What is a scoring rubric?
Scoring rubrics are descriptive scoring schemes that are developed by teachers or
other evaluators to guide the analysis of the products or processes of students'
efforts (Brookhart, 1999). Scoring rubrics are typically employed when a
judgement of quality is required and may be used to evaluate a broad range of
subjects and activities. One common use of scoring rubrics is to guide the
evaluation of writing samples. Judgements concerning the quality of a given
writing sample may vary depending upon the criteria established by the individual
evaluator. One evaluator may weight the evaluation heavily toward
linguistic structure, while another evaluator may be more interested in the
persuasiveness of the argument. A high-quality essay is likely to have a
combination of these and other factors. By developing a pre-defined scheme for
the evaluation process, evaluators can make the assessment of an essay
more objective.
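A pre-defined scheme of this kind can be sketched as a weighted set of criteria. In the example below the two criteria are the ones named above, while the equal weights, the four-level scale, and the name `rubric_score` are assumptions made for illustration.

```python
# Hypothetical analytic rubric: criterion -> weight (weights sum to 1).
RUBRIC = {
    "linguistic structure": 0.5,
    "persuasiveness of the argument": 0.5,
}
MAX_LEVEL = 4  # assumed scale: each criterion is rated from 1 to 4


def rubric_score(ratings, rubric=RUBRIC, max_level=MAX_LEVEL):
    """Combine per-criterion ratings into one score between 0 and 1."""
    if set(ratings) != set(rubric):
        raise ValueError("ratings must cover exactly the rubric criteria")
    for criterion, level in ratings.items():
        if not 1 <= level <= max_level:
            raise ValueError(f"{criterion}: level {level} is outside 1..{max_level}")
    weighted = sum(rubric[c] * ratings[c] for c in rubric)
    return weighted / max_level


essay = {"linguistic structure": 4, "persuasiveness of the argument": 2}
print(rubric_score(essay))
```

Because the criteria and weights are fixed before any essay is read, two evaluators who agree on the per-criterion ratings will arrive at the same overall score.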
GRADING
Grading is a major concern to both new and experienced instructors. Some
are quite strict at the beginning to prove that they are not pushovers. Others, who
may know their students personally, are quite lenient. Grades cause a lot of stress
for undergraduates; this concern often seems to inhibit enthusiasm for learning for
its own sake (“Do we have to know this for the exam?”), but grades are a fact of
life. They need not be counterproductive educationally if students know what to
expect.
Grades reflect personal philosophy and human psychology, as well as
efforts to measure intellectual progress with objective criteria. Whatever your
personal philosophy about grades, their importance to your students means that
you must make a constant effort to be fair and reasonable and to maintain grading
standards you can defend if challenged.
College courses are supposed to change students; that is, in some way the
students should be different after taking your course. In the grading process you
have to quantify what they learned and give them feedback, according to
some metric, on how much they learned.
The following four philosophies of grading are from instructors at FSU.
Which is closest to your philosophy?
Philosophy 1
Grades are indicators of relative knowledge and skill; that is, a student’s
performance can and should be compared to the performance of other students in
that course. The standard to be used for the grade is the mean or average score of
the class on a test, paper, or project. The grade distribution can be objectively set
by determining the percentage of A’s, B’s, C’s, and D’s that will be awarded.
Outliers (really high or really low) can be awarded grades as the instructor sees fit.
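Grading against a preset distribution can be sketched as ranking the class and handing out grades by quota. The quotas (20% A, 30% B, 30% C, 20% D), the scores, and the name `curve_grades` below are hypothetical, and the sketch ignores ties and outliers, which an instructor would handle by judgment.

```python
def curve_grades(scores, quotas):
    """scores: dict of student -> score.
    quotas: list of (grade, percent) pairs, best grade first, summing to 100.
    Returns a dict of student -> grade assigned by rank."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    grades = {}
    start = 0
    cumulative = 0
    for grade, percent in quotas:
        cumulative += percent
        end = round(cumulative * n / 100)
        for student in ranked[start:end]:
            grades[student] = grade
        start = end
    for student in ranked[start:]:      # rounding leftovers get the last grade
        grades[student] = quotas[-1][0]
    return grades


raw = [95, 91, 88, 85, 80, 76, 72, 68, 60, 55]
scores = {f"s{i}": score for i, score in enumerate(raw, start=1)}
quotas = [("A", 20), ("B", 30), ("C", 30), ("D", 20)]
grades = curve_grades(scores, quotas)
print(grades)
```

Note that under this philosophy a student's grade depends on the rest of the class: the same score of 85 could earn a different grade in a stronger or weaker group.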
Philosophy 2
Grades are based on preset expectations or criteria. In theory, every
student in the course could get an A if each met the preset expectations. The
grades are usually expressed as the percentage of success achieved (e.g., 90% and
above is an A, 80–90% is a B, 70–80% is a C, 60–70% is a D, and below 60% is an F).
Pluses and minuses can be worked into this range.
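The preset cutoffs translate directly into a lookup. This is a minimal sketch using the percentages given above; it assumes that a score exactly on a boundary (for example 80%) earns the higher grade, which the text leaves open, and the name `letter_grade` is hypothetical.

```python
# Cutoffs from the text, highest first; boundary scores go to the higher grade
# (an assumption, since the stated ranges share their endpoints).
CUTOFFS = ((90, "A"), (80, "B"), (70, "C"), (60, "D"))


def letter_grade(percent):
    """Map a percentage score to a letter grade using preset criteria."""
    if not 0 <= percent <= 100:
        raise ValueError(f"percent must be between 0 and 100, got {percent}")
    for cutoff, letter in CUTOFFS:
        if percent >= cutoff:
            return letter
    return "F"


for score in (95, 80, 79.5, 59):
    print(score, letter_grade(score))
```

Unlike grading on a curve, nothing here depends on how other students performed, so every student could in principle earn an A.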
Philosophy 3
Students come into the course with an A, and it is theirs to lose through
poor performance or absence, late papers, etc. With this philosophy the teacher
takes away points, rather than adding them.
Philosophy 4
Grades are subjective assessments of how a student is performing
according to his or her potential. Students who plan to major in a subject should
be graded harder than students just taking a course out of general interest.
Therefore, the standard set depends upon student variables and should not be set
in stone.
Instructor’s Personal Philosophy
Florida State University does not have a suggested grading philosophy.
Such decisions remain with the instructors and their departments. However, the
grading system employed ought to be defensible in terms of alignment with the
course objectives, the teaching materials and methods, and departmental policies,
if any.
The grading system as well as the actual evaluation are closely tied to an
instructor’s own personal philosophy regarding teaching. Consistent with this, it
may be useful to consider in advance the factors that will influence instructors’
evaluation of students.
Some instructors make use of the threat of unannounced quizzes to motivate
students, while others do not.
Some instructors weigh content more heavily than style. It has been suggested
that lower (or higher) evaluations should be used as a tool to motivate
students.
Other instructors may use tests diagnostically, administering them during the
semester without grades and using them to plan future class activities. Extra
credit options are sometimes offered when requested by students.
Some instructors negotiate with students about the method(s) of evaluation,
while others do not. Class participation may be valued more highly in some
classes than in others.
These and other issues directly affect the instructor’s evaluation of students’
performance.
As personal preference is so much a part of the grading and evaluating of
students, a thoughtful examination of one’s own personal philosophy concerning
these issues will be very useful.