Benchmark Advisory Test (BAT) Update
BILC Conference
Athens, Greece
Dr. Ray Clifford and Dr. Martha Herzog
22-26 June 2008
This Update Has Four Parts
• Why we began the BAT project.
• The role of proficiency standards.
• Why the BAT follows a construct-based, evidence-centered test design.
• When the BAT will be available.
Why we began the BAT project.
• 2006: A survey was conducted on the desirability of a BILC-sponsored "benchmark" test with advisory ratings.
Participation in the Survey
• 16 countries responded to the survey:
Austria Bulgaria Canada
Denmark Estonia Finland
Germany Hungary Italy
Latvia Lithuania Poland
Romania Spain Sweden
Turkey
Survey Results
1. Would your country use a Benchmark Test if one were available?
Definitely yes: 8
Probably yes: 5
Perhaps: 2
Most likely not: 0
Definitely not: 1
Survey Results
2. Does your country use “plus levels” when assigning STANAG ratings?
Definitely yes: 3
Probably yes: 0
Perhaps: 1
Most likely not: 1
Definitely not: 11
Survey Results
3. Would you like to have plus levels incorporated into a Benchmark Test?
Definitely yes: 5
Probably yes: 5
Perhaps: 2
Most likely not: 2
Definitely not: 2
Conclusions
• A “benchmark” test would be welcomed by most countries.
• (The scores should be advisory in nature.)
• Providing “plus” level ratings would allow greater fidelity in making comparisons.
• BILC should proceed with plans to:
– Develop a benchmark STANAG test of reading comprehension.
– Explore internet delivery options.
The Role of Proficiency
Standards
Dr. Martha Herzog
BILC
Athens, Greece
22-26 June 2008
TRAINING IN TEST DESIGN
• LANGUAGE TESTING SEMINAR
– 20 iterations
– 265 participants
– 38 nations
– 4 NATO officers
– Facilitators from 10 nations
BENCHMARK TESTS
• Tests of all four skills
• Measure Levels 1 through 3
STANDARDS
• All standards have three components:
– Content
– Tasks
– Accuracy
TEAMWORK
• The Working Group functions as a team
– 13 members from 8 nations
– Contributions from many other nations
Summary
• STANDARDS
• TRAINING IN TEST DESIGN
• TEAMWORK
• TECHNOLOGY
Why does the BAT follow a construct-based, evidence-centered test design?
Because a CBT, ECD design solves a major problem encountered when testing proficiency in the receptive skills, i.e., in testing Reading and Listening.
In contrast to traditional test-development procedures, CBT allows direct (rather than indirect) application of the STANAG 6001 Proficiency Scales to the development and scoring of Reading and Listening proficiency tests.
Test Development Procedures: Norm-Referenced Tests
• Create a table of test specifications.
• Train item writers in item-writing techniques.
• Develop items.
• Test the items for difficulty, discrimination, and reliability by administering them to several hundred learners.
• Use statistics to eliminate “bad” items.
• Administer the resulting test.
• Report results compared to other students.
Test Development Procedures: Norm-Referenced Tests (cont.)
• Each test administration yields a total score.
• However, setting “cut” scores or “passing” scores on norm-referenced tests is a major challenge.
• And relating scores on norm-referenced tests to a polytomous set of criteria (such as levels in the STANAG 6001 or other proficiency scales) is even more problematic.
A Traditional Method of Setting Cut Scores
[Chart: a test to be calibrated is administered to groups of "known" ability (Level 1, Level 2, and Level 3 groups); scores are plotted on a 0–100 scale.]
The Results One Hopes For:
[Chart: distinct "cut" scores between the scores of the calibration groups of "known" ability, on a 0–100 scale.]
The Results One Always Gets:
[Chart: bands of overlapping test scores across the groups of "known" ability, on a 0–100 scale.]
No matter where the cut scores are set, they are wrong for someone.
[Chart: where in the overlapping range should the cut score be set?]
Why is this "overlap" in scores always present?
• A single test score on a multi-level test…
– Gives equal credit for every right answer, regardless of its proficiency level.
– Camouflages by-level abilities.
– Is a "compensatory" score.
• Proficiency abilities…
– Are by definition "non-compensatory".
– Require demonstration of sustained ability at each level.
A Better Test Design: Construct-Based Proficiency Testing
• Uses a "floor" and "ceiling" approach similar to that used in Speaking and Writing tests.
• The proficiency rating is assigned based on two separate scores:
– A "floor" proficiency level of sustained ability across a range of tasks and contexts specific to that level.
– A "ceiling" proficiency level of non-sustained ability at the next higher proficiency level.
Therefore Construct-Based Testing
• Tests each proficiency level separately.
– Three tests for Levels 1 through 3.
– Or three subtests within a longer test.
• Rates each level-specific test separately.
• Applies the "floor" and "ceiling" criteria used in rating productive skills, using a scale such as:
– Sustained (consistent evidence) = 70% to 100%
– Developing (present, inconsistent) = 55% to 65%
– Emerging (some limited evidence) = 40% to 50%
– Random (no visible evidence) = 0% to 35%
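The band scale above can be sketched as a simple classifier. This is a minimal illustration, not BAT scoring code; the slide's bands leave gaps between them (e.g. 66–69%), which this sketch resolves by treating each band's lower bound as its threshold.

```python
def band(score_pct: float) -> str:
    """Map a level-specific score (percent correct) to a rating band
    from the illustrative scale on the slide."""
    if score_pct >= 70:
        return "Sustained"    # consistent evidence (70% to 100%)
    if score_pct >= 55:
        return "Developing"   # present, but inconsistent (55% to 65%)
    if score_pct >= 40:
        return "Emerging"     # some limited evidence (40% to 50%)
    return "Random"           # no visible evidence (0% to 35%)
```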
Does it make a difference?
Consider the following example.
A Total Score (where 195 = Level 1) Versus Construct-Based Scoring

Learner | Results @ Level 1 | Results @ Level 2 | Results @ Level 3 | Total Score | True Level
--------|-------------------|-------------------|-------------------|-------------|------------------------------
Alice   | 85 (Sustained)    | 70 (Sustained)    | 40 (Emerging)     | 195         | 2 (Barely)
Bob     | 90 (Sustained)    | 85 (Sustained)    | 20 (Random)       | 195         | 2 (Clearly)
Carol   | 90 (Sustained)    | 60 (Developing)   | 45 (Emerging)     | 195         | 1 (+ developing ability @ 2)
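The "floor" logic behind this example can be sketched as follows. It is an illustration of the sustained-ability rule implied by the slides, assuming the 70% "Sustained" threshold from the earlier scale; the `advisory_rating` helper is hypothetical, and ceiling/plus-level reporting is omitted.

```python
SUSTAINED = 70  # "sustained ability" threshold from the illustrative scale

def advisory_rating(scores_by_level: dict) -> int:
    """Return the "floor": the highest level at which the score is
    Sustained at that level and at every level below it."""
    floor = 0
    for level in (1, 2, 3):
        if scores_by_level.get(level, 0) >= SUSTAINED:
            floor = level
        else:
            break
    return floor

# The three learners from the example, all with a total score of 195:
alice = advisory_rating({1: 85, 2: 70, 3: 40})  # floor = 2
bob = advisory_rating({1: 90, 2: 85, 3: 20})    # floor = 2
carol = advisory_rating({1: 90, 2: 60, 3: 45})  # floor = 1
```

The same total score thus resolves into three different by-level profiles, which is exactly what a single compensatory score camouflages.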
Scores on Construct-Based Tests are:
valid, easily explained, and informative!
But how is a CBT developed?
Test Development Procedures: Construct-Based Proficiency Tests
1. Define each proficiency level as a construct to be tested.
2. Follow a construct-based, evidence-centered test design.
3. Train item writers:
– In the proficiency scales.
– In matching text types to the tasks in the scales.
– In item writing.
Test Development Procedures: Construct-Based Proficiency Tests
4. Develop items that exactly match all of the specifications for each level in the proficiency scale, with…
– Examinee task aligned with the author's [or the speaker's] purpose.
– Level-appropriate topics and contexts.
Test Development Procedures: Construct-Based Proficiency Tests
5. Use "alignment", "bracketing", and "modified Angoff" item review and quality-control procedures.
– A specifications review to ensure alignment of author purpose, text type, and reader task.
– A bracketing review to check the adequacy of the item's response options for test takers at higher and at lower proficiency levels.
– Modified Angoff ratings of item difficulty for "at-level" test takers to set passing levels.
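In a modified Angoff review, each judge estimates the probability that a minimally competent "at-level" test taker answers each item correctly, and the passing score is derived from the averaged estimates. A minimal sketch of that arithmetic (the function name and data shape are illustrative, not taken from the BAT procedures):

```python
def angoff_cut_score(judge_estimates):
    """judge_estimates: one inner list per item, holding each judge's
    estimated probability (0-1) that a minimally "at-level" test taker
    answers that item correctly. Returns the suggested passing score
    as a percentage of items correct."""
    item_means = [sum(js) / len(js) for js in judge_estimates]
    return 100 * sum(item_means) / len(item_means)
```

For example, two judges rating three items at (0.8, 0.7), (0.6, 0.6), and (0.9, 0.8) yield item means of 0.75, 0.60, and 0.85, for a suggested cut score of about 73%.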
Test Development Procedures: Construct-Based Proficiency Tests
6. Use data from the Angoff reviews to define “sustained ability” for each level of the test.
7. Assemble the “good” items into level-specific tests or subtests.
8. Do validation testing.
9. Use statistical analyses to confirm reviewer ratings.
Test Development Procedures: Construct-Based Proficiency Tests
10. Replace items that do not "cluster", i.e., act like the other items at each level.
11. Score and report results for each level using "sustained" proficiency criteria.
12. Continue to build the item databases to enable:
– Random selection of test items for multiple forms.
– Computer-adaptive testing.
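The random form selection in step 12 could be sketched like this. The `assemble_form` helper and the bank layout are hypothetical; a real item bank would carry item metadata and difficulty data, not bare ids.

```python
import random

def assemble_form(item_bank, items_per_level, seed=None):
    """Draw a random sample of item ids per level from the bank,
    so that multiple parallel test forms can be generated.

    item_bank: dict mapping level -> list of item ids.
    """
    rng = random.Random(seed)  # seeding makes a form reproducible
    return {level: sorted(rng.sample(items, items_per_level))
            for level, items in item_bank.items()}

bank = {1: list(range(50)), 2: list(range(50, 100)), 3: list(range(100, 150))}
form = assemble_form(bank, 10, seed=1)  # 10 items per level-specific subtest
```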
What do the results of a CBT Reading proficiency test look like?
Here are some initial results from the BAT English Reading Proficiency Test.
NATO – Reading Proficiency Profiles
[Chart: each profile shows a rating (Sustained, Developing, Emerging, or Random) on the Level 1, Level 2, and Level 3 tests.]
When will BAT be available?
• Funds have been set aside for administering and scoring of 200 free advisory tests.
– All four skills will be tested.
– BAT Reading, Listening, and Writing tests will be online tests.
– The Speaking test will be conducted over the telephone.
• These tests are to be used in test norming or calibration studies.
When will BAT be available?
• We anticipate the following timeline:
– About October 2008: Directions on how to apply will be sent out.
– About November 2008: Applications will be submitted.
– About December 2008: Applications will be reviewed and decisions made about how the 200 tests will be allocated.
– Between February and June: The first round of advisory testing will be conducted.
When will BAT be available?
• More specific information will be sent out after consultation with ACT.
Are there any questions?