© Pacific Metrics 2016
AUTOMATED TRAIT SCORES FOR TOEFL & GRE WRITING
Sandip Sinharay and Yigal Attali
Pacific Metrics Corporation & ETS
Few wish to assess others, fewer still wish to be assessed, but everyone wants to see the scores.
Paul W. Holland (2001)
Few wish to assess others, fewer still wish to be assessed, but everyone wants to see the subscores.
• In the context of essay scoring, subscores usually refer to analytic scores, for example, those produced by the 6+1 writing model (Education Northwest).
• Analytic scores are expensive.
ANALYTIC SCORES
• Given the popularity of automated essay scoring, a question of increasing interest: “Is it viable to report automated subscores or trait scores?”
• Focus here will be on e-rater V.2 (Attali & Burstein, 2006).
AUTOMATED TRAIT SCORES
• Attali (2007) and Attali and Powers (2008) found three factors underlying the e-rater non-content features for TOEFL CBT and a developmental writing scale:
– word choice (W)
– grammatical conventions (G)
– fluency & organization (F)
PREVIOUS RESEARCH ON AUTOMATED TRAIT SCORES
• We explore reporting of four automated trait scores for TOEFL iBT Writing and GRE Writing:
– W
– G
– F
– Content (C)
GOAL OF THE STUDY
Three important questions
• What should the trait scores be based on?
• How to compute the trait scores?
• Do the trait scores have added value?
GOAL OF THE STUDY
• Features of e-rater V.2 used in this study: Vocabulary, Word length, Grammar, Usage, Mechanics, Col/prep, Organization, Essay length, Style, Value cosine, & Pattern cosine.
AUTOMATED TRAIT SCORES
• TOEFL iBT Writing: integrated task and independent task
• GRE Writing: argument task and issue task
• In some computations, the reliability was computed from the correlation between the two tasks
TOEFL AND GRE
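As a minimal sketch of the cross-task reliability mentioned above, the snippet below computes a Pearson correlation between scores on two tasks. The function name `pearson` and the score lists are hypothetical illustrations, not data from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical trait scores for the same five examinees on two tasks
# (e.g. TOEFL independent and integrated); not data from the study.
task_1 = [2.0, 3.5, 3.0, 4.5, 5.0]
task_2 = [2.5, 3.0, 3.5, 4.0, 5.5]

# The cross-task correlation serves as the reliability estimate.
cross_task_reliability = pearson(task_1, task_2)
print(round(cross_task_reliability, 2))
```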
• Weights based on regression of human score on the features
• Weights based on cross-task reliability of the features: weight proportional to √r/(1-r)
• Weights based on factor analysis
THREE TYPES OF WEIGHTS ON THE FEATURES
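The reliability-based weighting can be sketched as follows. This assumes the slide's expression is read as sqrt(r) / (1 - r), with r the cross-task reliability of a feature; the other possible reading, sqrt(r / (1 - r)), would give different weights. The function name, the rescaling to sum 1, and the example reliabilities are all hypothetical.

```python
import math

def reliability_weights(reliabilities):
    """Weights proportional to sqrt(r) / (1 - r), rescaled to sum to 1.

    ASSUMPTION: the slide's formula is read as sqrt(r) / (1 - r).
    The rescaling to sum 1 is an illustrative choice, not from the slides.
    """
    raw = {}
    for name, r in reliabilities.items():
        if not 0.0 <= r < 1.0:
            raise ValueError(f"reliability for {name} must be in [0, 1)")
        raw[name] = math.sqrt(r) / (1.0 - r)
    total = sum(raw.values())
    if total == 0.0:
        raise ValueError("at least one reliability must be positive")
    return {name: w / total for name, w in raw.items()}

# Hypothetical cross-task reliabilities for three features.
example = {"feature_a": 0.25, "feature_b": 0.49, "feature_c": 0.64}
weights = reliability_weights(example)
for name, w in weights.items():
    print(name, round(w, 3))
```

Under this reading, more reliable features receive disproportionately larger weights, because the denominator shrinks as r approaches 1.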
• Weights based on regression were occasionally negative.
• The other two sets of weights were similar; the reliability-based weights are used henceforth.
THREE TYPES OF WEIGHTS
• Vocabulary: 1.3, Word length: 1.3
• Grammar: 1.3, Usage: 1.3, Mechanics: 1.8, Col/prep: 0.8
• Organization: 1.6, Essay length: 2.9, Style: 0.7
• Value cosine: 1.2, Pattern cosine: 1.3
RELIABILITY-BASED WEIGHTS FOR TOEFL IBT
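To illustrate how weights like these could be turned into trait scores, here is a minimal sketch. It assumes that each bullet above corresponds to one trait (W, G, F, C, in that order) and that a trait score is a weighted average of standardized feature values; neither assumption is stated on the slides. The function name and the essay values are hypothetical.

```python
# Reliability-based weights for TOEFL iBT, as listed on the slide.
# ASSUMPTION: the grouping into W, G, F, C follows the slide's bullets.
TOEFL_WEIGHTS = {
    "W": {"Vocabulary": 1.3, "Word length": 1.3},
    "G": {"Grammar": 1.3, "Usage": 1.3, "Mechanics": 1.8, "Col/prep": 0.8},
    "F": {"Organization": 1.6, "Essay length": 2.9, "Style": 0.7},
    "C": {"Value cosine": 1.2, "Pattern cosine": 1.3},
}

def trait_score(trait, z_features):
    """Weighted average of standardized feature values for one trait.

    ASSUMPTION: features are already standardized (z-scores) and are
    combined linearly; the slides do not spell out this step.
    """
    weights = TOEFL_WEIGHTS[trait]
    total_weight = sum(weights.values())
    weighted_sum = sum(w * z_features[name] for name, w in weights.items())
    return weighted_sum / total_weight

# Hypothetical standardized feature values for a single essay.
essay = {
    "Vocabulary": 0.4, "Word length": -0.2,
    "Grammar": 0.1, "Usage": 0.3, "Mechanics": -0.5, "Col/prep": 0.0,
    "Organization": 0.8, "Essay length": 1.1, "Style": -0.3,
    "Value cosine": 0.2, "Pattern cosine": 0.6,
}
scores = {trait: trait_score(trait, essay) for trait in TOEFL_WEIGHTS}
print({trait: round(value, 2) for trait, value in scores.items()})
```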
• Reliabilities (based on cross-task correlation)
– 0.29-0.71 for TOEFL (0.29 for C, ≥ 0.62 for the other three)
– 0.51-0.72 for GRE
RELIABILITY OF THE TRAIT SCORES
• Haberman’s criterion for added value of subscores:
ADDED VALUE OF THE TRAIT SCORES
[Diagram: Test Form 1 and Test Form 2, each with Subscore 1, Subscore 2, Subscore 3, …, and Total score]
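One way to read the two-form layout is as a comparison of predictors: a subscore adds value when the subscore on one form predicts the same subscore on the other form better than the total score does. The sketch below applies that comparison to cross-form correlations. It is an illustrative simplification, not Haberman's PRMSE computation itself, and the function names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def has_added_value(sub_form1, total_form1, sub_form2):
    """True if the form-1 subscore predicts the form-2 subscore better
    than the form-1 total score does (compared via correlations).

    ASSUMPTION: this correlation comparison stands in for comparing
    the PRMSE of the subscore with the PRMSE of the total score.
    """
    r_sub = pearson(sub_form1, sub_form2)
    r_total = pearson(total_form1, sub_form2)
    return r_sub > r_total

# Hypothetical scores for six examinees on two forms.
sub_1 = [1.0, 2.0, 2.5, 3.5, 4.0, 5.0]
total_1 = [3.0, 2.0, 4.0, 3.0, 5.0, 4.0]
sub_2 = [1.5, 2.0, 3.0, 3.0, 4.5, 5.0]
print(has_added_value(sub_1, total_1, sub_2))
```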
• The three trait scores (W, G, and F) other than C had added value
VALUE OF THE TRAIT SCORES
[Figure: TOEFL independent task (W, G, F, C, e-rater score) linked to TOEFL integrated task (W, G, F, C, e-rater score); values shown: .62, .08, .35, .15]
• Wainer et al. (2001) and Haberman (2008) suggested “augmented subscores” where a subscore is improved by borrowing strength from other subscores.
AUGMENTED SUBSCORES
For example, for the TOEFL independent task, Augmented W = 0.60 W – 0.01 G + 0.02 F + 0.20 C
AUGMENTED TRAIT SCORES
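The augmented score above is a linear combination of the four trait scores. A minimal sketch, using the coefficients from the slide, is shown below; the trait values for the example essay are hypothetical, and the sketch assumes the trait scores are on a common centered scale, since the slide shows no intercept.

```python
# Coefficients for Augmented W on the TOEFL independent task (from the slide).
AUGMENTED_W_COEFFS = {"W": 0.60, "G": -0.01, "F": 0.02, "C": 0.20}

def augmented_w(traits):
    """Augmented W as a weighted sum of the four trait scores.

    ASSUMPTION: trait scores share a common centered scale, so no
    intercept term is needed (none appears on the slide).
    """
    return sum(coef * traits[name] for name, coef in AUGMENTED_W_COEFFS.items())

# Hypothetical trait scores for one essay; not data from the study.
essay_traits = {"W": 1.0, "G": 0.5, "F": -0.5, "C": 2.0}
print(round(augmented_w(essay_traits), 3))
```

The W coefficient dominates, with C contributing the next largest share, so the augmented score borrows most of its extra information from the content trait.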
• Reliabilities (based on cross-task correlation) of augmented trait scores are 0.58-0.72 for TOEFL and 0.55-0.76 for GRE
• Reliability of augmented C score much larger than that of the (unaugmented) C score
RELIABILITY OF THE TRAIT SCORES
• Haberman (2008) suggested that an augmented subscore has added value if its PRMSE (proportional reduction in mean squared error) is substantially larger than that of the corresponding unaugmented subscore.
• All augmented trait scores have added value using that criterion.
ADDED VALUE OF AUGMENTED TRAIT SCORES
• We explored the reporting of four trait scores for TOEFL iBT Writing and GRE Writing.
• Three of them have added value
• The fourth has added value after augmentation
CONCLUSIONS
• The added value of three trait scores provides further evidence of the validity of e-rater scores for measuring writing skill.
• It should be examined whether the trait scores benefit student learning, teacher instruction, and program decisions.
CONCLUSIONS