Crowdscale Shared Task Challenge 2013

Crowdscale Shared Task Challenge 2013 Qiang Liu (UC Irvine), Jian Peng (MIT CSAIL), Alexander Ihler (UC Irvine)


Page 1: Crowdscale  Shared Task Challenge  2013

Crowdscale Shared Task Challenge 2013

Qiang Liu (UC Irvine), Jian Peng (MIT CSAIL), Alexander Ihler (UC Irvine)

Page 2: Crowdscale  Shared Task Challenge  2013

Crowdsourcing

• Collect data and knowledge at large scale

Experts: Time-consuming & expensive

Crowdsourcing: Combine many non-experts

Page 3: Crowdscale  Shared Task Challenge  2013

Crowdsourcing for Labeling

• Goal: estimate true z_i from noisy labels {L_ij}.

[Figure: tasks and workers]

Page 4: Crowdscale  Shared Task Challenge  2013

Baseline Methods

• Majority Voting – All the workers have the same performance
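A minimal sketch of majority voting, assuming the labels arrive as (task, worker, label) triples. The function name, the triple layout, and the tie-breaking rule are illustrative assumptions, not taken from the slides.

```python
from collections import Counter, defaultdict


def majority_vote(triples):
    """Return {task_id: most frequent label} for (task_id, worker_id, label) triples.

    Every worker's vote counts equally, which matches the assumption that
    all workers have the same performance.
    """
    votes = defaultdict(Counter)
    for task, _worker, label in triples:
        votes[task][label] += 1
    # ASSUMPTION: ties are broken by the smallest label value, an arbitrary
    # but deterministic choice (labels of one task must be mutually comparable).
    return {
        task: min(counts, key=lambda lab: (-counts[lab], lab))
        for task, counts in votes.items()
    }
```

For example, three votes (1, 1, 0) on one task give label 1.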

Page 5: Crowdscale  Shared Task Challenge  2013

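Baseline Methods

• Majority Voting – All the workers have the same performance

• Two-coin Model (Dawid & Skene 79)
  – Each worker characterized by a confusion matrix
  – Learned by expectation maximization (EM)

Confusion matrix of worker j (rows: true answer; columns: worker j’s answer):

\[
\begin{pmatrix}
a^{j}_{11} & a^{j}_{12} & a^{j}_{13} & a^{j}_{14} \\
a^{j}_{21} & a^{j}_{22} & a^{j}_{23} & a^{j}_{24} \\
a^{j}_{31} & a^{j}_{32} & a^{j}_{33} & a^{j}_{34} \\
a^{j}_{41} & a^{j}_{42} & a^{j}_{43} & a^{j}_{44}
\end{pmatrix}
\]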
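The slide gives no formulas for the EM updates. The usual form for a confusion-matrix model is sketched below. The symbols \pi_k (class prior), q_i(k) (posterior of task i), N_i (workers who labeled task i), N_j (tasks labeled by worker j), and N (number of tasks) are notation assumed here, not from the slides.

```latex
% Model: worker j reports label l on a task whose true label is k
P(L_{ij} = l \mid z_i = k) = a^{j}_{kl}, \qquad \sum_{l} a^{j}_{kl} = 1

% E-step: posterior over the true label of task i
q_i(k) \;\propto\; \pi_k \prod_{j \in N_i} a^{j}_{k L_{ij}}

% M-step: re-estimate the confusion matrices and the class prior
a^{j}_{kl} = \frac{\sum_{i \in N_j} q_i(k)\,\mathbf{1}[L_{ij} = l]}{\sum_{i \in N_j} q_i(k)},
\qquad
\pi_k = \frac{1}{N} \sum_{i} q_i(k)
```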

Page 6: Crowdscale  Shared Task Challenge  2013

Baseline Methods

• Majority Voting – All the workers have the same performance

• Two-coin Model (Dawid & Skene 79)
  – Each worker characterized by a confusion matrix
  – Learned by expectation maximization (EM)

• One-coin Model
  – Each worker characterized by an accuracy parameter

Confusion matrix of worker j (rows: true answer; columns: worker j’s answer):

\[
\begin{pmatrix}
a_j & b_j & b_j & b_j \\
b_j & a_j & b_j & b_j \\
b_j & b_j & a_j & b_j \\
b_j & b_j & b_j & a_j
\end{pmatrix}
\]
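A rough sketch of EM for the one-coin model, under stated assumptions: labels are integers 0..K-1, the class prior is uniform, each row of the matrix sums to one so that b_j = (1 - a_j)/(K - 1), and the posteriors are initialised from vote fractions. The function name and data layout are hypothetical.

```python
import math
from collections import defaultdict


def one_coin_em(triples, num_classes, iters=50, eps=1e-6):
    """EM for the one-coin model on (task, worker, label) triples.

    Returns (post, acc): post[task] is a list of K class probabilities,
    acc[worker] is the estimated accuracy a_j.
    """
    K = num_classes
    by_task = defaultdict(list)
    for task, worker, label in triples:
        by_task[task].append((worker, label))

    # Initialise posteriors with per-task vote fractions (soft majority vote).
    post = {}
    for task, obs in by_task.items():
        q = [0.0] * K
        for _worker, label in obs:
            q[label] += 1.0 / len(obs)
        post[task] = q

    acc = {}
    for _ in range(iters):
        # M-step: a_j = expected fraction of worker j's labels that are correct.
        hit, cnt = defaultdict(float), defaultdict(int)
        for task, obs in by_task.items():
            for worker, label in obs:
                hit[worker] += post[task][label]
                cnt[worker] += 1
        # Clamp away from 0 and 1 so the logarithms below stay finite.
        acc = {w: min(max(hit[w] / cnt[w], eps), 1 - eps) for w in cnt}

        # E-step: posterior over the true label (ASSUMPTION: uniform class prior).
        for task, obs in by_task.items():
            logq = [0.0] * K
            for worker, label in obs:
                a = acc[worker]
                b = (1 - a) / (K - 1)  # off-diagonal entry b_j
                for k in range(K):
                    logq[k] += math.log(a if k == label else b)
            top = max(logq)
            unnorm = [math.exp(v - top) for v in logq]
            total = sum(unnorm)
            post[task] = [u / total for u in unnorm]
    return post, acc
```

The predicted label of a task is the argmax of its posterior. Workers who often disagree with the consensus end up with a lower a_j and therefore less influence.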

Page 7: Crowdscale  Shared Task Challenge  2013

Baseline Methods

• Majority Voting – All the workers have the same performance

• Two-coin Model (Dawid & Skene 79)
  – Each worker characterized by a confusion matrix
  – Learned by expectation maximization (EM)

• One-coin Model
  – Each worker characterized by an accuracy parameter

• Other methods:
  – GLAD [Whitehill et al 09], Belief propagation [Liu et al 12], Minimax entropy [Zhou et al 12] …

Page 8: Crowdscale  Shared Task Challenge  2013

In Practice …

• Model Selection

• Standard models may not work
  – Special structures on the classes
  – Unbalanced labels

Page 9: Crowdscale  Shared Task Challenge  2013

Two Datasets

• Google Fact Judgment Dataset
  – 42,624 queries; 57 trained raters; 576 gold queries
  – Answers: {No, Yes, Skip}

• CrowdFlower Sentiment Judgment Dataset
  – 98,980 questions; 1,960 workers; 300 gold queries
  – Answers: 0 (Negative), 1 (Neutral), 2 (Positive), 3 (not related), 4 (I can’t tell)

• Special classes “skip”, “I can’t tell”
  – Ambiguity of queries

Page 10: Crowdscale  Shared Task Challenge  2013

Evaluation Metric

• Averaged Recall:

• Special classes “skip”, “I can’t tell”
  – Included in the evaluation?
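The formula for averaged recall is not present in the transcript. The sketch below assumes the common macro-averaged definition: compute recall separately for each class that appears in the gold data, then take the unweighted mean. The function name and the dict-based inputs are illustrative assumptions.

```python
from collections import defaultdict


def averaged_recall(gold, pred):
    """Unweighted mean of per-class recall.

    gold, pred: dicts mapping task_id -> label. Only tasks with a gold
    label are scored; a task missing from pred counts as an error.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for task, g in gold.items():
        total[g] += 1
        if pred.get(task) == g:
            correct[g] += 1
    # Each class contributes equally, regardless of how many instances it has.
    return sum(correct[c] / total[c] for c in total) / len(total)
```

Whether the special classes are scored is simply a question of whether their gold instances are left in `gold`.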

Page 11: Crowdscale  Shared Task Challenge  2013

Important Properties

• Unbalanced labels (on the gold data)

Google Data (gold label counts):
  No: 19
  Yes: 531
  Skip: 26

CrowdFlower Data (gold label counts):
  0 (Negative): 57
  1 (Neutral): 70
  2 (Positive): 72
  3 (not related): 92
  4 (I can’t tell): 9 (only 9 instances in the reference data)

Page 12: Crowdscale  Shared Task Challenge  2013

Evaluation Metric

[Figure: gold label counts in the Google data: No 19, Yes 531, Skip 26]

• The importance of minority classes is up-weighted.
  – Class “Skip” is 531/26 ≈ 20 times more important than Class “Yes”
  – Overfitting!

• Difficult to predict minority classes
  – E.g., only 9 “I can’t tell” in the gold data, difficult to generalize
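The factor of about 20 can be worked out as follows, assuming averaged recall is the unweighted mean of per-class recalls. Here K is the number of classes, n_k the number of gold instances of class k, and \hat{z}_i the predicted label (notation assumed for illustration).

```latex
% Averaged recall over K classes
R = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i:\, z_i = k} \mathbf{1}[\hat{z}_i = z_i]

% Each gold instance of class k therefore carries weight
w_k = \frac{1}{K\, n_k}

% Google gold data: n_{\text{yes}} = 531,\; n_{\text{skip}} = 26
\frac{w_{\text{skip}}}{w_{\text{yes}}} = \frac{n_{\text{yes}}}{n_{\text{skip}}} = \frac{531}{26} \approx 20.4
```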

Page 13: Crowdscale  Shared Task Challenge  2013

Google Fact Judgment Dataset

• Model selection (MV, one/two-coin EM):
  – Majority vote is the best
  – 57 “trained” workers
  – High and uniform accuracies

• But not good enough …

[Figure: histogram of workers’ accuracies; x-axis: workers’ accuracies, y-axis: # of workers]

Page 14: Crowdscale  Shared Task Challenge  2013

Google Fact Judgment Dataset

• Our Algorithm:

For each query i
  1. Count the percentages of labels submitted by the raters: c_i(yes), c_i(no), c_i(skip)
  2. If c_i(yes) > 0.4, then label_i = yes
     Else if c_i(no) > 0.8, then label_i = no
     Otherwise label_i = skip
End

Return {label_i}
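The thresholding rule above can be sketched in a few lines. The 0.4 and 0.8 thresholds are from the slide. The function name, the (query, rater, label) triple layout, and the lowercase label strings are assumptions for illustration.

```python
from collections import Counter, defaultdict


def threshold_labels(triples, yes_thresh=0.4, no_thresh=0.8):
    """Apply the per-query thresholds to (query_id, rater_id, label) triples.

    ASSUMPTION: labels are the strings "yes", "no", and "skip".
    """
    votes = defaultdict(Counter)
    for query, _rater, label in triples:
        votes[query][label] += 1

    labels = {}
    for query, counts in votes.items():
        n = sum(counts.values())
        # The two conditions cannot both hold, since the fractions sum to 1.
        if counts["yes"] / n > yes_thresh:
            labels[query] = "yes"
        elif counts["no"] / n > no_thresh:
            labels[query] = "no"
        else:
            labels[query] = "skip"
    return labels
```

Compared with majority voting, the asymmetric thresholds make "no" harder to predict and push borderline queries into "skip", which fits the heavier weight that averaged recall gives to the rare classes.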

Page 15: Crowdscale  Shared Task Challenge  2013

CrowdFlower Sentiment Judgment Dataset

• Model selection:
  – One-coin EM is best

[Figure: histogram of workers’ accuracies; x-axis: workers’ accuracies, y-axis: # of workers]

• Overall confusion matrix (classes 0–4):

         0    1    2    3    4
    0  256   47   14   24   27
    1   22  280   26   35   22
    2   11   43  308   30    9
    3    9   22    6  456   14
    4    7   16   13    6   17

Page 16: Crowdscale  Shared Task Challenge  2013

CrowdFlower Sentiment Judgment Dataset

• Model selection:
  – One-coin EM is best

[Figure: histogram of workers’ accuracies; x-axis: workers’ accuracies, y-axis: # of workers]

• Overall confusion matrix (classes 0–4):

         0    1    2    3    4
    0  256   47   14   24   27
    1   22  280   26   35   22
    2   11   43  308   30    9
    3    9   22    6  456   14
    4    7   16   13    6   17

Removing Class 4 may improve performance

Page 17: Crowdscale  Shared Task Challenge  2013

CrowdFlower Sentiment Judgment Dataset

• Our algorithm:

1. Remove all class 4 labels from the data, then run one-coin EM to get posterior distributions on the remaining classes: […]
2. If c_i(4) > 0.5 or entropy( […] ) > log(4) – 0.27, then
     […]
   endif

Return […]
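The assignment inside the "then" branch and the returned quantity are missing from the transcript. The sketch below makes two loudly labeled assumptions: a question that triggers the condition is assigned class 4, and every other question gets the argmax of its posterior. It also assumes natural logarithms, so that log(4) is the maximum entropy over four classes. The function and parameter names are hypothetical.

```python
import math


def postprocess(posterior, frac_class4, frac_thresh=0.5, ent_margin=0.27):
    """Decide the final label for one question.

    posterior:   probabilities over classes 0-3, e.g. from one-coin EM run
                 on the data with all class-4 labels removed.
    frac_class4: fraction of the raw worker labels on this question that
                 were class 4, i.e. c_i(4).
    """
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    entropy = -sum(p * math.log(p) for p in posterior if p > 0)
    if frac_class4 > frac_thresh or entropy > math.log(4) - ent_margin:
        # ASSUMPTION: the slide's lost "then" branch assigns class 4.
        return 4
    # ASSUMPTION: otherwise predict the most probable remaining class.
    return max(range(len(posterior)), key=lambda k: posterior[k])
```

Under this reading, class 4 is predicted either when most workers said "I can’t tell" or when the posterior over the other four classes is close to uniform.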

Page 18: Crowdscale  Shared Task Challenge  2013

Thank You