DESIGNING LANGUAGE TEST

Testing is a universal feature of social life. Throughout history people

have been put to the test to prove their capabilities or to establish their credentials;

this is the stuff of Homeric epic, of Arthurian legend.

There are many reasons for developing a critical understanding of the

principles and practice of language assessment. Obviously you will need to do so

if you are actually responsible for language test development and claim expertise

in this field. But many other people working in the field of language study more

generally will want to be able to participate as necessary in the discourse of this

field, for a number of reasons.

First, language tests play a powerful role in many people’s lives, acting as

gateways at important transitional moments in education, in employment, and in

moving from one country to another. Since language tests are devices for the

institutional control of individuals, it is clearly important that they should be

understood, and subjected to scrutiny. Secondly, you may be working with

language tests in your professional life as a teacher or administrator, teaching to a

test, administering tests, or relying on information from tests to make decisions on

the placement of students on particular courses.

Finally, if you are conducting research in language study you may need to

have measures of the language proficiency of your subjects. For this you need

either to choose an appropriate existing language test or design your own.

Thus, an understanding of language testing is relevant both for those

actually involved in creating language tests, and also more generally for those

involved in using tests or the information they provide, in practical and research

contexts.

TEST TYPES

Team System testing tools provide several test types that you can use for

your specific software testing purposes. This section describes those test types. It

also describes how to create and customize tests of each type, because these

processes are specific to the type of test.

However, common among the various test types are many test-related

tasks such as managing tests and working with test results. These common tasks

are described in the section Testing Tools Tasks.

In This Section

Working with Unit Tests

Provides links to topics that describe unit tests and how to create them.

Working with Web Tests

Describes how to create, edit, run, and view Web tests.

Working with Load Tests

Describes the uses of load tests, how to edit and run them, how to collect

and store load test performance data, and how to analyze load test runs.

Working with Manual Tests

Describes how to create and run manual tests, the only non-automated test

type.

Working with Generic Tests

Describes how to create and run generic tests. Generic tests wrap external

programs and tests that were not originally developed for use in the Team

System testing tools.

Working with Ordered Tests

Describes how to create ordered tests, which contain other tests that are

meant to be run in a specified order.

http://msdn.microsoft.com/en-us/library/ms182629(v=VS.80).aspx







TEST SPECIFICATION

The test case specifications should be developed from the test plan and are

the second phase of the test development life cycle. The test specification should

explain "how" to implement the test cases described in the test plan.

Test Specification Items

Each test specification should contain the following items:

Case No.: The test case number should be a three-digit identifier of the

following form: c.s.t, where: c- is the chapter number, s- is the section

number, and t- is the test case number.

Title: is the title of the test.

ProgName: is the program name containing the test.

Author: is the person who wrote the test specification.

Date: is the date of the last revision to the test case.

Background: (Objectives, Assumptions, References, Success Criteria):

Describes in words how to conduct the test.

Expected Error(s): Describes any errors expected.

Reference(s): Lists reference documentation used to design the

specification.

Data: (Tx Data, Predicted Rx Data): Describes the data flows between the

Implementation Under Test (IUT) and the test engine.

Script: (Pseudo Code for Coding Tests): Pseudo code (or real code) used to

conduct the test.
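The items listed above can be sketched as a simple record type. This is only an illustration: the class name `CaseSpec`, the field names, and the sample values are hypothetical, and the check on the case number assumes that each of the three parts of c.s.t is written as one or more digits.

```python
import re
from dataclasses import dataclass, field

# Assumed shape of the "c.s.t" case number: chapter.section.test, digits only.
CASE_NO_PATTERN = re.compile(r"\d+\.\d+\.\d+")


@dataclass
class CaseSpec:
    """One test specification, with one field per item in the list above."""
    case_no: str       # "c.s.t": chapter, section, test case number
    title: str
    prog_name: str     # program containing the test
    author: str
    date: str          # date of the last revision, kept as plain text here
    background: str    # objectives, assumptions, references, success criteria
    expected_errors: list = field(default_factory=list)
    references: list = field(default_factory=list)
    data: str = ""     # Tx data and predicted Rx data
    script: str = ""   # pseudo code or real code for the test

    def __post_init__(self):
        # Reject case numbers that do not follow the c.s.t form.
        if not CASE_NO_PATTERN.fullmatch(self.case_no):
            raise ValueError(f"case_no must look like c.s.t, got {self.case_no!r}")

    def parts(self):
        """Return (chapter, section, test) as integers."""
        chapter, section, test = (int(p) for p in self.case_no.split("."))
        return chapter, section, test


# Hypothetical example values, for illustration only.
spec = CaseSpec(
    case_no="3.2.7",
    title="Reject malformed frame",
    prog_name="frame_tests",
    author="A. Tester",
    date="last revised: unknown",
    background="Send a malformed frame and check that the IUT rejects it.",
)
print(spec.parts())
```

Validating the case number when the record is created keeps malformed identifiers out of the specification set early, which is one possible design choice among several.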

TEST ITEMS

The table below presents both pros and cons for various test item types.

Your selection of item types should be based on the types of outcomes you are

trying to assess (see analysis of your learning situation). Certain item types such

as true/false, supplied response, and matching, work well for assessing lower-

order outcomes (i.e., knowledge or comprehension goals), while other item types

such as essays, performance assessments, and some multiple choice questions, are

better for assessing higher-order outcomes (i.e., analysis, synthesis, or evaluation

goals). The table entries below that name the outcomes an item type assesses

will help you determine which outcomes the various items are suited to.

With your objectives in hand, it may be useful to create a test blueprint

that specifies your outcomes and the types of items you plan to use to assess those

outcomes. Further, test items are often weighted by difficulty. On your test

blueprint, you may wish to assign lower point values to items that assess lower-

order skills (knowledge, comprehension) and higher point values to items that

assess higher-order skills (synthesis, evaluation).
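A test blueprint of this kind can be sketched as a small table of rows. The outcome levels and item types come from the text above; the item counts and point values are hypothetical and only show how lower-order items might carry fewer points than higher-order ones.

```python
# Hypothetical blueprint rows:
# (outcome level, item type, number of items, points per item)
blueprint = [
    ("knowledge",     "true/false",      10, 1),
    ("comprehension", "matching",         5, 1),
    ("analysis",      "multiple choice",  5, 2),
    ("synthesis",     "essay",            2, 5),
]


def points_by_level(rows):
    """Sum the points available for each outcome level."""
    totals = {}
    for level, _item_type, count, points in rows:
        totals[level] = totals.get(level, 0) + count * points
    return totals


totals = points_by_level(blueprint)
total_points = sum(totals.values())

# Show how much of the test each outcome level accounts for.
for level, pts in totals.items():
    print(f"{level}: {pts} points ({pts / total_points:.0%} of the test)")
```

Looking at the share of points per outcome level is a quick way to check that the weighting of the test matches the objectives you set out to assess.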

Multiple Choice (see tips for writing multiple choice questions below)

Pros:
  more answer options (4-5) reduce the chance of guessing that an item is correct
  many items can aid in student comparison and reduce ambiguity
  greatest flexibility in type of outcome assessed: knowledge goals, application goals, analysis goals, etc.

Cons:
  reading time increased with more answers
  reduces the number of questions that can be presented
  difficult to write four or five reasonable choices
  takes more time to write questions

True/False (see tips for writing true/false questions below)

Pros:
  can present many items at once
  easy to score
  used to assess popular misconceptions, cause-effect reactions

Cons:
  most difficult question to write objectively
  ambiguous terms can confuse many
  few answer options (2) increase the chance of guessing that an item is correct; need many items to overcome this effect

Matching

Pros:
  efficient
  used to assess student understanding of associations, relationships, definitions

Cons:
  difficult to assess higher-order outcomes (i.e., analysis, synthesis, evaluation goals)

Interpretive Exercise (the above three item types are often criticized for assessing only lower-order skills; the interpretive exercise is a way to assess higher-order skills with multiple choice, T/F, and matching items)

Pros:
  a variation on multiple choice, true/false, or matching, the interpretive exercise presents a new map, short reading, or other introductory material that the student must analyze
  tests student ability to apply and transfer prior knowledge to new material
  useful for assessing higher-order skills such as applications, analysis, synthesis, and evaluation

Cons:
  hard to design, must locate appropriate introductory material
  students with good reading skills are often at an advantage

Supplied Response

Pros:
  chances of guessing reduced
  measures knowledge and fact outcomes well, terminology, formulas

Cons:
  scoring is not objective
  can cause difficulty for computer scoring

Essay

Pros:
  less construction time, easier to write
  encourages more appropriate study habits
  measures higher-order outcomes (i.e., analysis, synthesis, or evaluation goals), creative thinking, writing ability

Cons:
  more grading time, hard to score
  can yield great variety of responses
  not efficient to test large bodies of content
  if you give the student the choice of three or four essay options, you can find out what they know, but not what they don't know

Performance Assessments (includes essays above, along with speeches, demonstrations, presentations, etc.)

Pros:
  measures higher-order outcomes (i.e., analysis, synthesis, or evaluation goals)

Cons:
  labor and time-intensive
  need to obtain inter-rater reliability when using more than one rater

http://www.edtech.vt.edu/edtech/id/assess/blueprint.html

http://www.edtech.vt.edu/edtech/id/assess/analysis.html
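The effect of the number of answer options on guessing, noted above for multiple choice and true/false items, can be made concrete with a short calculation. This is a minimal sketch that assumes every item is answered by blind, independent guessing with equal chances for each option (a binomial model); real test takers rarely guess this way, and the function name is hypothetical.

```python
from math import comb


def chance_of_guessing_at_least(num_items, num_options, min_correct):
    """Probability of getting at least min_correct items right by blind guessing.

    Assumes each item is guessed independently, with probability
    1 / num_options of being correct (binomial model).
    """
    p = 1 / num_options
    return sum(
        comb(num_items, k) * p**k * (1 - p) ** (num_items - k)
        for k in range(min_correct, num_items + 1)
    )


# Chance of reaching 6 out of 10 by guessing alone, for different item formats.
for options in (2, 4, 5):
    chance = chance_of_guessing_at_least(10, options, 6)
    print(f"{options} options: {chance:.3f}")
```

Under these assumptions, the chance of reaching a given score by guessing falls as the number of options per item rises, and it also falls as more items are added, which is why true/false tests need many items.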

The table below presents tips for designing two popular item types: multiple

choice questions and true/false questions.

Tips for Writing Multiple Choice Questions

  Avoid responses that are interrelated. One answer should not be similar to others.

  Avoid negatively stated items: "Which of the following is not a method of food irradiation?" It is easy to miss the negative word "not." If you use negatives, bold-face the negative qualifier to ensure people see it.

  Avoid making your correct response different from the other responses, grammatically, in length, or otherwise.

  Avoid the use of "none of the above." When a student guesses "none of the above," you still do not know if they know the correct answer.

  Avoid repeating words in the question stem in your responses. For example, if you use the word "purpose" in the question stem, do not use that same word in only one of the answers, as it will lead people to select that specific response.

  Use plausible, realistic responses.

  Create grammatically parallel items to avoid giving away the correct response. For example, if you have four responses, do not start three of them with verbs and one of them with a noun.

  Always place the "term" in your question stem and the "definition" as one of the response options.

Tips for Writing True/False Questions

  Do not use definitive words such as "only," "none," and "always," that lead people to choose false, or uncertain words such as "might," "can," or "may," that lead people to choose true.

  Do not write negatively stated items, as they are confusing to interpret: "Thomas Jefferson did not write the Declaration of Independence." True or False?

  People have a tendency to choose "true," so design at least 60% of your T/F items to be "false" to further minimize guessing effects.

  Use precise words (100, 20%, half), rather than vague or qualitative language (young, small, many).

  Avoid making the correct answer longer than [...]

SCORING

Scoring rubrics have become a common method for evaluating student work in

both the K-12 and the college classrooms. The purpose of this paper is to describe

the different types of scoring rubrics, explain why scoring rubrics are useful and

provide a process for developing scoring rubrics. This paper concludes with a

description of resources that contain examples of the different types of scoring

rubrics and further guidance in the development process.

What is a scoring rubric?

Scoring rubrics are descriptive scoring schemes that are developed by teachers or

other evaluators to guide the analysis of the products or processes of students'

efforts (Brookhart, 1999). Scoring rubrics are typically employed when a

judgement of quality is required and may be used to evaluate a broad range of

subjects and activities. One common use of scoring rubrics is to guide the

evaluation of writing samples. Judgements concerning the quality of a given

writing sample may vary depending upon the criteria established by the individual

evaluator. One evaluator may weight linguistic structure heavily, while

another evaluator may be more interested in the persuasiveness of the

argument. A high-quality essay is likely to have a combination of these

and other factors. Developing a pre-defined scheme for the evaluation

process makes the evaluation of an essay less subjective.
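One way to picture such a pre-defined scheme is as a set of criteria with fixed weights. The two criteria below come from the example in the text; the weights, the 1 to 4 rating scale, and the function name are hypothetical, and a real rubric would also describe what each level means for each criterion.

```python
# Hypothetical analytic rubric: criterion -> weight (weights sum to 1).
RUBRIC = {
    "linguistic structure": 0.4,
    "persuasiveness": 0.6,
}
MAX_LEVEL = 4  # assumed rating scale: 1 (weakest) to 4 (strongest)


def rubric_score(ratings, rubric=RUBRIC, max_level=MAX_LEVEL):
    """Combine per-criterion ratings into one score between 0 and 1."""
    if set(ratings) != set(rubric):
        raise ValueError("ratings must cover exactly the rubric criteria")
    for criterion, level in ratings.items():
        if not 1 <= level <= max_level:
            raise ValueError(f"rating for {criterion!r} must be 1..{max_level}")
    weighted = sum(rubric[c] * ratings[c] for c in rubric)
    return weighted / max_level


# Two evaluators who apply the same rubric weigh the criteria the same way,
# even if their ratings on an individual criterion differ.
print(rubric_score({"linguistic structure": 4, "persuasiveness": 2}))
print(rubric_score({"linguistic structure": 3, "persuasiveness": 3}))
```

Because the weights are fixed in advance, differences between evaluators come only from their ratings on each criterion, not from how much importance each evaluator privately attaches to it.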

GRADING

Grading is a major concern to both new and experienced instructors. Some

are quite strict at the beginning to prove that they are not pushovers. Others, who

may know their students personally, are quite lenient. Grades cause a lot of stress

for undergraduates; this concern often seems to inhibit enthusiasm for learning for

its own sake (“Do we have to know this for the exam?”), but grades are a fact of

life. They need not be counterproductive educationally if students know what to

expect.

Grades reflect personal philosophy and human psychology, as well as

efforts to measure intellectual progress with objective criteria. Whatever your

personal philosophy about grades, their importance to your students means that

you must make a constant effort to be fair and reasonable and to maintain grading

standards you can defend if challenged.

College courses are supposed to change students; that is, in some way the

students should be different after taking your course. In the grading process you

have to quantify what they learned and give them feedback, according to

some metric, on how much they learned.

The following four philosophies of grading are from instructors at FSU.

Which is closest to your philosophy?

Philosophy 1

Grades are indicators of relative knowledge and skill; that is, a student’s

performance can and should be compared to the performance of other students in

that course. The standard to be used for the grade is the mean or average score of

the class on a test, paper, or project. The grade distribution can be objectively set

by determining the percentage of A’s, B’s, C’s, and D’s that will be awarded.

Outliers (very high or very low scores) can be awarded grades as the instructor sees fit.
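Grading by a preset distribution, as in Philosophy 1, can be sketched as ranking the class and handing out letters by share. The 20/30/30/20 split, the function name, and the sample scores are hypothetical; ties are broken by input order, and outliers are not treated specially here.

```python
def curve_grades(scores, distribution=(("A", 20), ("B", 30), ("C", 30), ("D", 20))):
    """Assign letter grades by rank, using preset percentages of the class.

    scores: dict mapping student name -> numeric score.
    distribution: (letter, percent of class) pairs, best grade first.
    """
    if sum(percent for _letter, percent in distribution) != 100:
        raise ValueError("distribution percentages must sum to 100")

    # Highest score first; equal scores keep their input order.
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)

    grades = {}
    start = 0
    cumulative = 0
    for letter, percent in distribution:
        cumulative += percent
        end = round(cumulative * n / 100)  # rank position where this letter stops
        for name in ranked[start:end]:
            grades[name] = letter
        start = end
    return grades


# Hypothetical class of ten students with scores 100, 90, ..., 10.
scores = {f"student{i}": 100 - 10 * i for i in range(10)}
grades = curve_grades(scores)
print(grades)
```

Note that under this scheme a student's grade depends on how classmates performed, not only on the student's own score.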

Philosophy 2

Grades are based on preset expectations or criteria. In theory, every

student in the course could get an A if each met the preset expectations. The

grades are usually expressed as the percentage of success achieved (e.g., 90% and

above is an A, 80–90% is a B, 70–80% is a C, 60–70% is a D, and below 60% is an F).

Pluses and minuses can be worked into this range.
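The preset cutoffs in Philosophy 2 translate directly into a lookup. The sketch below assumes that a score falling exactly on a boundary (for example, 80%) earns the higher grade, which the text leaves open; the function name is hypothetical, and pluses and minuses are left out.

```python
def letter_grade(percent):
    """Map a percentage score to a letter grade using preset cutoffs."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Cutoffs from the text; a boundary score is assumed to earn the higher grade.
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if percent >= cutoff:
            return letter
    return "F"


for score in (95, 80, 79.5, 60, 42):
    print(score, letter_grade(score))
```

Unlike grading on a curve, every student could earn an A under this scheme, because each score is compared with the cutoffs rather than with other students.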

Philosophy 3

Students come into the course with an A, and it is theirs to lose through

poor performance or absence, late papers, etc. With this philosophy the teacher

takes away points, rather than adding them.

Philosophy 4

Grades are subjective assessments of how a student is performing

according to his or her potential. Students who plan to major in a subject should

be graded more strictly than students just taking a course out of general interest.

Therefore, the standard set depends upon student variables and should not be set

in stone.

Instructor’s Personal Philosophy

Florida State University does not have a suggested grading philosophy.

Such decisions remain with the instructors and their departments. However, the

grading system employed ought to be defensible in terms of alignment with the

course objectives, the teaching materials and methods, and departmental policies,

if any.

The grading system as well as the actual evaluation are closely tied to an

instructor’s own personal philosophy regarding teaching. Consistent with this, it

may be useful to consider in advance the factors that will influence instructors’

evaluation of students.

Some instructors make use of the threat of unannounced quizzes to motivate

students, while others do not.

Some instructors weigh content more heavily than style. It has been suggested

that lower (or higher) evaluations should be used as a tool to motivate

students.

Other instructors may use tests diagnostically, administering them during the

semester without grades and using them to plan future class activities. Extra

credit options are sometimes offered when requested by students.

Some instructors negotiate with students about the method(s) of evaluation,

while others do not. Class participation may be valued more highly in some

classes than in others.

These and other issues directly affect the instructor’s evaluation of students’

performance.

As personal preference is so much a part of the grading and evaluating of

students, a thoughtful examination of one’s own personal philosophy concerning

these issues will be very useful.
