Karthik Raman, Thorsten Joachims Cornell University.

21
Methods for Ordinal Peer Grading Karthik Raman, Thorsten Joachims Cornell University

Transcript of Karthik Raman, Thorsten Joachims Cornell University.

Page 1: Karthik Raman, Thorsten Joachims Cornell University.

Methods for Ordinal Peer GradingKarthik Raman, Thorsten Joachims

Cornell University

Page 2: Karthik Raman, Thorsten Joachims Cornell University.

2

Evaluation at Scale is challenging

MCQs & Other Auto-graded questions are not a good test of understanding and fall short of conventional testing. Limits kinds of courses offered.

• MOOCs: Promise to revolutionize education with unprecedented access

• Need to rethink conventional evaluation logistics:• Small-scale classes (10-15 students) : Instructors

evaluate students themselves• Medium-scale classes (20-200 students) : TAs take

over grading process.• MOOCs (10000+ students) : ??

Image Courtesy: PHDComics

Page 3: Karthik Raman, Thorsten Joachims Cornell University.

3

Peer Grading to the Rescue

Cardinal Peer Grading [Piech et. al. 13] require cardinal labels for each assignment.

Each peer grader g provides cardinal score for every assignment d they grade. E.g.: Likert Scale, Letter grade

• Students grade each other (anonymously)!• Overcomes limitations of instructor/TA evaluation:

• Number of “graders” scales with number of students!

8/10=

Page 4: Karthik Raman, Thorsten Joachims Cornell University.

4

Challenge: Cardinal Feedback is difficult to provide and less reliable

Students are not trained graders: Need to make feedback process simple!

Broad evidence showing ordinal feedback is easier to provide and more reliable than cardinal feedback: Project X is better than Project Y vs. Project X is a B+. Grading scales may be non-linear: Problematic for cardinal grades

• 30000$• 45000$• 90000$• 100000$

>

<??

Page 5: Karthik Raman, Thorsten Joachims Cornell University.

5

Ordinal Peer Grading

Each grader g provides ordering of assignments they grade: σ(g).

GOAL: Determine true ordering and grader reliabilities

>

>

..

..> >

… …

P

P

Q

Q

Q

R

R

..

..

Page 6: Karthik Raman, Thorsten Joachims Cornell University.

6

Related Problems Rank Aggregation/Metasearch Problem:

Given input rankings, determine overall ranking. Performance is usually top-heavy (optimize top k results). Estimating grader reliability is not a goal. Ties and data sparsity are not an issue.

Social choice/Voting systems: Aggregate preferences from individuals. Do not model reliabilities and are again top-heavy. Also typically assume complete rankings.

Crowdsourcing: Estimating worker (grader) reliability is key component. Scale of items (assignments) much larger than typical Crowdsourcing task. Also tends to be top-heavy.

Page 7: Karthik Raman, Thorsten Joachims Cornell University.

7

Mallows Model and Variants

GENERATIVE MODEL:

is the Kendall-Tau distance between orderings (# of differing pairs).

OPTIMIZATION: NP-hard. Greedy algorithm provides good approximation.

WITH GRADER RELIABILITY:

Variant with score-weighted objective (MALS) also used.

A P Q R Z… ……… …….. …

PQ R

ORDERINGS PROBABILITY

(P,Q,R) x

(P,R,Q); (Q,P,R)

x/e

(R,P,Q); (Q,R,P)

x/e2

(R,Q,P) x/e3

Page 8: Karthik Raman, Thorsten Joachims Cornell University.

8

Bradley-Terry Model & Variants

GENERATIVE MODEL:

Decomposes as pairwise preferences using logistic distribution of (true) score differences.

OPTIMIZATION: Alternating minimization to compute MLE scores (and grader reliabilities) using SGD subroutine.

GRADER RELIABILITY:

Variants studied include Plackett-Luce model (PL) and Thurstone model (THUR).

Page 9: Karthik Raman, Thorsten Joachims Cornell University.

9

Experimental Validation: New Peer Grading Dataset

Data collected in classroom during Fall 2013: First real large evaluation of machine-learning based peer-grading techniques.

Used two-stages: Project Posters (PO) and Final-Reports (FR) Students provided cardinal grades (10-point scale): 10-Perfect,8-Good,5-Borderline,3-Deficient

Also performed conventional grading: TA and instructor grading.

Statistic Poster ReportNumber of Assignments 42 44

Number of Peer Reviewers 148 153

Total Peer Reviews 996 586

Number of TA Reviewers 7 9

Total TA Reviews 78 88

Page 10: Karthik Raman, Thorsten Joachims Cornell University.

10

How well do OPG methods do w.r.t. Instructor Grades?

Kendall-Tau error measure (lower is better).

As good as cardinal methods (despite using less information).

TAs had error of 22.0 ± 16.0 (Posters) and 22.2 ± 6.8 (Report).

Cardinal

Poster Report15

20

25

30

35

20.8

31.9

21.8 21.420.6

23.7

16.2

29.7

19.8

31.9

19.8

33.0

25.5

30.2

SCAVG

NCS

MAL

MALBC

MALS

BT

THUR

PL

Page 11: Karthik Raman, Thorsten Joachims Cornell University.

11

Benefit of grader reliability Added lazy graders. Can we identify them?

Does significantly better than cardinal methods and simple heuristics.

Better for posters due to more data.

Poster Report15

25

35

45

55

65

75

85

95 100.0

72.4

100.0

80.0

54.6 55.551.7

46.949.4 48.8

37.6

49.5

NCS+G

MAL+G

MALBC+G

MALS+G

BT+G

THUR+G

PL+G

AC

CU

RA

CY

OF I

DEN

TIF

ICA

TIO

N

Cardinal

Page 12: Karthik Raman, Thorsten Joachims Cornell University.

12

What did Students think about it?

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Did students find the peer feedback helpful?

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Did students find peer grading valuable?

Very Yes, but could be better SomewhatNo Other

Page 13: Karthik Raman, Thorsten Joachims Cornell University.

13

Summary

Benefits of ordinal peer grading for large classes.

Adapted rank aggregation methods for performing ordinal peer grading.

Using data from an actual classroom, peer grading found to be a viable alternate to TA grading.

Students found it helpful and valuable

Page 14: Karthik Raman, Thorsten Joachims Cornell University.

14

Thank you!

CODE found at peergrading.org where a Peer Evaluation Web SERVICE is available.

DATA is also available

Page 15: Karthik Raman, Thorsten Joachims Cornell University.

15

Extra Slides

Page 16: Karthik Raman, Thorsten Joachims Cornell University.

16

Statistics of Grades provided

Page 17: Karthik Raman, Thorsten Joachims Cornell University.

17

Self-consistency

Self-consistent

Page 18: Karthik Raman, Thorsten Joachims Cornell University.

18

Other Results Consistent results even on sampling graders.

Scales well (performance and time) with number of reviewers and ordering size.

Page 19: Karthik Raman, Thorsten Joachims Cornell University.

19

Cardinal Peer Grading [Piech et. al. 2013]

Assume generative model (NCS) for graders:

True Score of assignment d

Grader g’s reliability (precision)

Grader g’s bias

Grader g’s grade for assignment d: Normally distributed

Can use MLE/Gibbs sampling/Variational inference ..

Page 20: Karthik Raman, Thorsten Joachims Cornell University.

20

Our Approach to Ordinal Peer Grading Proposed/Adapted different rank-aggregation methods for the OPG

problem:

Mallows model (MAL).

Score-weighted Mallows (MALS).

Bradley-Terry model (BT).

Thurstone model (THUR).

Plackett-Luce model (PL).

Ordering-based distributions

Pairwise-Preference based distributions

Extension of BT for orderings.

Page 21: Karthik Raman, Thorsten Joachims Cornell University.

21

How well do OPG methods do w.r.t. TA Grades?

Inter-TA Kendall-Tau: 47.5 ± 21.0 (Posters) and 34.0 ± 13.8 (Report).

Poster Report15

20

25

30

35

40

23.5

31.3

27.529.3

25.2

30.7

22.2

34.3

23.9

31.4

24.2

31.5

27.3

31.9

SCAVG

NCS

MAL

MALBC

MALS

BT

THUR

PL

Cardinal