Benchmark Advisory Test (BAT) Update BILC Conference Athens, Greece Dr. Ray Clifford and Dr. Martha Herzog 22-26 June 2008

Benchmark Advisory Test (BAT) Update

BILC Conference

Athens, Greece

Dr. Ray Clifford and Dr. Martha Herzog

22-26 June 2008

This Update Has Four Parts

• Why we began the BAT project.

• The role of proficiency standards.

• Why the BAT follows a construct-based, evidence-centered test design.

• When the BAT will be available.

Why we began the BAT project.

• 2006: A survey was conducted on the desirability of a BILC-sponsored “benchmark” test with advisory ratings.

Participation in the Survey

• 16 countries responded to the survey:

Austria Bulgaria Canada

Denmark Estonia Finland

Germany Hungary Italy

Latvia Lithuania Poland

Romania Spain Sweden

Turkey

Survey Results

1. Would your country use a Benchmark Test if one were available?

Definitely yes: 8

Probably yes: 5

Perhaps: 2

Most likely not: 0

Definitely not: 1

Survey Results

2. Does your country use “plus levels” when assigning STANAG ratings?

Definitely yes: 3

Probably yes: 0

Perhaps: 1

Most likely not: 1

Definitely not: 11

Survey Results

3. Would you like to have plus levels incorporated into a Benchmark Test?

Definitely yes: 5

Probably yes: 5

Perhaps: 2

Most likely not: 2

Definitely not: 2

Conclusions

• A “benchmark” test would be welcomed by most countries.

• (The scores should be advisory in nature.)

• Providing “plus” level ratings would allow greater fidelity in making comparisons.

• BILC should proceed with plans to:

– Develop a benchmark STANAG test of reading comprehension.

– Explore internet delivery options.

The Role of Proficiency Standards

Dr. Martha Herzog

BILC

Athens, Greece

22-26 June 2008

TRAINING IN TEST DESIGN

• LANGUAGE TESTING SEMINAR

– 20 Iterations

– 265 participants

– 38 nations

– 4 NATO officers

– Facilitators from 10 nations

BENCHMARK TESTS

• Tests of all four skills

• Measures Level 1 through Level 3

STANDARDS

• All standards have three components:

– Content

– Tasks

– Accuracy

TEAMWORK

• The Working Group functions as a team

– 13 members from 8 nations

– Contributions from many other nations

Summary

• STANDARDS

• TRAINING IN TEST DESIGN

• TEAMWORK

• TECHNOLOGY

Why does the BAT follow a construct-based, evidence-centered test design?

Because a construct-based testing (CBT), evidence-centered design (ECD) approach solves a major problem encountered when testing proficiency in the receptive skills, i.e., in testing Reading and Listening.

In contrast to traditional test development procedures, CBT allows direct (rather than indirect) application of the STANAG 6001 Proficiency Scales to the development and scoring of Reading and Listening Proficiency Tests.

Test Development Procedures: Norm-Referenced Tests

• Create a table of test specifications.

• Train item writers in item-writing techniques.

• Develop items.

• Test the items for difficulty, discrimination, and reliability by administering them to several hundred learners.

• Use statistics to eliminate “bad” items.

• Administer the resulting test.

• Report results compared to other students.
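The item-trialing step in this list can be sketched in a few lines. This is only an illustration, assuming the two most common classical indices: difficulty as the proportion of examinees answering correctly, and discrimination as the correlation between an item score and the score on the rest of the test. The response matrix is invented for the example, the function names are hypothetical, and reliability is left out.

```python
from math import sqrt

# Hypothetical trial data: one row per examinee, one column per item (1 = correct).
RESPONSES = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]


def item_difficulty(responses, item):
    """Proportion of examinees who answered the item correctly."""
    return sum(row[item] for row in responses) / len(responses)


def item_discrimination(responses, item):
    """Pearson correlation between the item score and the rest-of-test score."""
    xs = [row[item] for row in responses]
    ys = [sum(row) - row[item] for row in responses]
    n = len(responses)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined without variance; reported as 0 here
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    for i in range(len(RESPONSES[0])):
        print(i, round(item_difficulty(RESPONSES, i), 2),
              round(item_discrimination(RESPONSES, i), 2))
```

In this made-up data the last item correlates negatively with the rest of the test, which is the kind of “bad” item the statistics are used to eliminate.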

Test Development Procedures: Norm-Referenced Tests (cont.)

• Each test administration yields a total score.

• However, setting “cut” scores or “passing” scores on norm-referenced tests is a major challenge.

• And relating scores on norm-referenced tests to a polytomous set of criteria (such as levels in the STANAG 6001 or other proficiency scales) is even more problematic.

A Traditional Method of Setting Cut Scores

[Figure: the test to be calibrated, scored on a scale of 0 to 100, is administered to groups of “known” ability: a Level 1 Group, a Level 2 Group, and a Level 3 Group.]

The Results One Hopes For:

[Figure: scores on a scale of 0 to 100 for the Level 1, Level 2, and Level 3 groups of “known” ability, with distinct “cut” scores between the scores of the calibration groups.]

The Results One Always Gets:

[Figure: scores on a scale of 0 to 100 for the Level 1, Level 2, and Level 3 groups of “known” ability, showing bands of overlapping test scores (marked “???”) between adjacent groups.]

No matter where the cut scores are set, they are wrong for someone.

[Figure: overlapping score bands (marked “???”) on a scale of 0 to 100 for the Level 1, Level 2, and Level 3 groups of “known” ability. Where in the overlapping range should the cut score be set?]

Why is this “overlap” in scores always present?

• A single test score on a multi-level test…

– Gives equal credit for every right answer regardless of its proficiency level.

– Camouflages by-level abilities.

– Is a “compensatory” score.

• Proficiency abilities…

– Are by definition “non-compensatory”.

– Require demonstration of sustained ability at each level.

A Better Test Design: Construct-Based Proficiency Testing

• Uses a “floor” and “ceiling” approach similar to that used in Speaking and Writing tests.

• The proficiency rating is assigned based on two separate scores:

– A “floor” proficiency level of sustained ability across a range of tasks and contexts specific to that level.

– A “ceiling” proficiency level of non-sustained ability at the next higher proficiency level.

Therefore Construct-Based Testing

• Tests each proficiency level separately.

– Three tests for levels 1 through 3.

– Or three subtests within a longer test.

• Rates each level-specific test separately.

• Applies the “floor” and “ceiling” criteria used in rating productive skills using a scale such as:

– Sustained (consistent evidence) = 70% to 100%

– Developing (present, inconsistent) = 55% to 65%

– Emerging (some limited evidence) = 40% to 50%

– Random (no visible evidence) = 0% to 35%
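The example scale above can be written as a small lookup. This is a sketch, not the BAT scoring code: the function name is hypothetical, and treating each band's lower bound as a "greater than or equal" threshold (so that a score such as 67% falls into Developing) is an assumption, since the slide lists scores only in 5-point steps.

```python
# Lower bounds taken from the example scale on the slide.
# Assumption: a score belongs to the highest band whose lower bound it reaches.
BAND_FLOORS = [
    (70, "Sustained"),
    (55, "Developing"),
    (40, "Emerging"),
    (0, "Random"),
]


def classify_band(percent_correct):
    """Return the evidence band for a score on one level-specific test."""
    if not 0 <= percent_correct <= 100:
        raise ValueError("percent_correct must be between 0 and 100")
    for floor, band in BAND_FLOORS:
        if percent_correct >= floor:
            return band
    raise AssertionError("unreachable: the last lower bound is 0")


if __name__ == "__main__":
    for score in (85, 60, 45, 20):
        print(score, classify_band(score))
```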

Does it make a difference?

Consider the following example.

A Total Score (where 195 = Level 1) Versus Construct-Based Scoring

| Learner | Level 1 Results | Level 2 Results | Level 3 Results | Total Score | True Level |
|---|---|---|---|---|---|
| Alice | | | | 195 | ? |
| Bob | | | | 195 | ? |
| Carol | | | | 195 | ? |

| Learner | Results @ Level 1 | Results @ Level 2 | Results @ Level 3 | Total Score | True Level |
|---|---|---|---|---|---|
| Alice | 85 Sustained | 70 Sustained | 40 Emerging | 195 | 2 (Barely) |
| Bob | 90 Sustained | 85 Sustained | 20 Random | 195 | 2 (Clearly) |
| Carol | 90 Sustained | 60 Developing | 45 Emerging | 195 | 1 (+ developing ability @ 2) |
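The Alice, Bob, and Carol example can be reproduced with a short sketch of non-compensatory "floor and ceiling" rating. The thresholds come from the example scale in this deck; the function names, the use of 0 for "below Level 1", and the rule that the floor is the highest level with sustained scores at that level and every level beneath it are assumptions made for illustration.

```python
SUSTAINED_MIN = 70   # lower bounds from the example scale in this deck
DEVELOPING_MIN = 55
EMERGING_MIN = 40


def band(percent_correct):
    """Evidence band for one level-specific score (lower bounds are >= thresholds)."""
    if percent_correct >= SUSTAINED_MIN:
        return "Sustained"
    if percent_correct >= DEVELOPING_MIN:
        return "Developing"
    if percent_correct >= EMERGING_MIN:
        return "Emerging"
    return "Random"


def rate(scores):
    """Return (floor_level, band_at_next_level) for scores keyed by level 1, 2, 3.

    The floor is the highest level reached without any gap in sustained
    performance; 0 means that Level 1 itself was not sustained (an assumption).
    """
    floor = 0
    for level in sorted(scores):
        if scores[level] < SUSTAINED_MIN:
            break
        floor = level
    next_level = floor + 1
    ceiling = band(scores[next_level]) if next_level in scores else "not tested"
    return floor, ceiling


LEARNERS = {
    "Alice": {1: 85, 2: 70, 3: 40},
    "Bob": {1: 90, 2: 85, 3: 20},
    "Carol": {1: 90, 2: 60, 3: 45},
}

if __name__ == "__main__":
    for name, scores in LEARNERS.items():
        print(name, sum(scores.values()), rate(scores))
```

All three learners have the same compensatory total of 195, yet the level-by-level rating separates them: Alice and Bob sustain Level 2, while Carol sustains only Level 1 and is Developing at Level 2.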

Scores on Construct-Based Tests are: valid, easily explained, and informative!

But how is a CBT developed?

Test Development Procedures: Construct-Based Proficiency Tests

1. Define each proficiency level as a construct to be tested.

2. Follow a construct-based, evidence-centered test design.

3. Train item writers

– In the proficiency scales.

– In matching text types to the tasks in the scales.

– In item writing.

Test Development Procedures: Construct-Based Proficiency Tests

4. Develop items that exactly match all of the specifications for each level in the proficiency scale, with...

– Examinee task aligned with the author’s [or the speaker’s] purpose.

– Level-appropriate topics and contexts.

Test Development Procedures: Construct-Based Proficiency Tests

5. Use “alignment”, “bracketing”, and “modified Angoff” item review and quality control procedures.

– A specifications review to ensure alignment of author purpose, text type, and reader task.

– A bracketing review to check the adequacy of the item’s response options for test takers at higher and at lower proficiency levels.

– Modified Angoff ratings of item difficulty for “at-level” test takers to set passing levels.
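The Angoff step can be illustrated with the basic averaging form of the method: each reviewer estimates, item by item, the probability that a minimally qualified at-level test taker answers correctly, and the passing level is the mean of the reviewers' expected totals. The slides do not say which modification the BAT team used, so this is a generic sketch; the ratings below are invented and the function name is hypothetical.

```python
from statistics import mean

# Hypothetical ratings: one row per reviewer, one column per item.
# Each value is the estimated probability that a minimally qualified
# at-level test taker answers that item correctly.
RATINGS = [
    [0.8, 0.7, 0.6, 0.8],
    [0.7, 0.7, 0.7, 0.7],
    [0.9, 0.6, 0.7, 0.8],
]


def angoff_cut_score(ratings):
    """Return the passing level as a percentage of the items on the test."""
    n_items = len(ratings[0])
    if any(len(reviewer) != n_items for reviewer in ratings):
        raise ValueError("every reviewer must rate every item")
    # Expected number of items correct, according to each reviewer.
    expected_totals = [sum(reviewer) for reviewer in ratings]
    return 100 * mean(expected_totals) / n_items


if __name__ == "__main__":
    print(round(angoff_cut_score(RATINGS), 1))
```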

Test Development Procedures: Construct-Based Proficiency Tests

6. Use data from the Angoff reviews to define “sustained ability” for each level of the test.

7. Assemble the “good” items into level-specific tests or subtests.

8. Do validation testing.

9. Use statistical analyses to confirm reviewer ratings.

Test Development Procedures: Construct-Based Proficiency Tests

10. Replace items that do not “cluster” or act like the other items at each level.

11. Score and report results for each level using “sustained” proficiency criteria.

12. Continue to build the item databases to enable:

– Random selection of test items for multiple forms.

– Computer adaptive testing.
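Random selection of items for multiple forms could look like the following sketch. The item bank layout, the item identifiers, and the function name are all hypothetical; the only idea taken from the slides is that each form draws its items separately for each proficiency level from a level-specific pool.

```python
import random

# Hypothetical item bank: proficiency level -> identifiers of reviewed items.
ITEM_BANK = {
    1: [f"L1-{i:03d}" for i in range(1, 41)],
    2: [f"L2-{i:03d}" for i in range(1, 41)],
    3: [f"L3-{i:03d}" for i in range(1, 41)],
}


def assemble_form(item_bank, items_per_level, seed=None):
    """Draw items_per_level items at random, without repeats, for each level."""
    rng = random.Random(seed)  # a fixed seed makes a form reproducible
    form = {}
    for level in sorted(item_bank):
        pool = item_bank[level]
        if len(pool) < items_per_level:
            raise ValueError(f"not enough items in the bank at level {level}")
        form[level] = rng.sample(pool, items_per_level)
    return form


if __name__ == "__main__":
    form_a = assemble_form(ITEM_BANK, items_per_level=20, seed=1)
    form_b = assemble_form(ITEM_BANK, items_per_level=20, seed=2)
    print(form_a[1][:3], form_b[1][:3])
```

Keeping the draw separate for each level preserves the level-specific subtests that construct-based scoring depends on.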

What do the results of a CBT Reading proficiency test look like?

Here are some initial results from the BAT English Reading Proficiency Test.

NATO - Reading Proficiency Profiles

[Figure: reading proficiency profiles on a scale of Random, Emerging, Developing, and Sustained, shown for results on the Level 1 Test, the Level 2 Test, and the Level 3 Test.]

When will BAT be available?

• Funds have been set aside for the administration and scoring of 200 free advisory tests.

– All four skills will be tested.

– BAT Reading, Listening, and Writing tests will be online tests.

– The Speaking test will be conducted over the telephone.

• These tests are to be used in test norming or calibration studies.

When will BAT be available?

• We anticipate the following timeline:

– About October, 2008. Directions on how to apply will be sent out.

– About November, 2008. Applications will be submitted.

– About December, 2008. Applications will be reviewed and decisions made about how the 200 tests will be allocated.

– Between February and June. The first round of advisory testing will be conducted.

When will BAT be available?

• More specific information will be sent out after consultation with ACT.

Are there any questions?

? ? ?