Crowdscale Shared Task Challenge 2013

Qiang Liu (UC Irvine), Jian Peng (MIT CSAIL), Alexander Ihler (UC Irvine)

Crowdsourcing

• Collect data and knowledge at large scale

Experts: Time-consuming & expensive

Crowdsourcing: Combine many non-experts

Crowdsourcing for Labeling

• Goal: estimate the true label z_i of each task i from the noisy labels {L_ij} given by workers j.

[Figure: tasks and workers]

Baseline Methods

• Majority Voting
– Assumes all the workers have the same performance


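• Two-coin Model (Dawid & Skene 79)
– Each worker characterized by a confusion matrix
– Learned by expectation maximization (EM)

Confusion matrix of worker j (rows: true answer; columns: worker j's answer):

$$
\begin{pmatrix}
a^j_{11} & a^j_{12} & a^j_{13} & a^j_{14} \\
a^j_{21} & a^j_{22} & a^j_{23} & a^j_{24} \\
a^j_{31} & a^j_{32} & a^j_{33} & a^j_{34} \\
a^j_{41} & a^j_{42} & a^j_{43} & a^j_{44}
\end{pmatrix}
$$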
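For reference, the majority-voting baseline on this slide can be sketched in a few lines. This is only an illustration: the input layout (a dict from task id to the list of labels it received) and the tie-breaking rule are assumptions, not part of the slides.

```python
from collections import Counter


def majority_vote(labels_by_task):
    """Return the most frequent label for each task.

    labels_by_task: dict mapping a task id to the list of labels it received
    (hypothetical input layout, chosen for this sketch).
    """
    result = {}
    for task, labels in labels_by_task.items():
        counts = Counter(labels)
        top = max(counts.values())
        # Assumption: ties are broken by taking the smallest tied label,
        # so that the output is deterministic.
        result[task] = min(label for label, c in counts.items() if c == top)
    return result
```

Every worker's vote counts equally here, which is exactly the "same performance" assumption of this baseline.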

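• One-coin Model
– Each worker characterized by an accuracy parameter

Confusion matrix of worker j (rows: true answer; columns: worker j's answer):

$$
\begin{pmatrix}
a_j & b_j & b_j & b_j \\
b_j & a_j & b_j & b_j \\
b_j & b_j & a_j & b_j \\
b_j & b_j & b_j & a_j
\end{pmatrix}
$$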
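A minimal sketch of EM for the one-coin model, under stated assumptions: rows of the matrix sum to one, so b_j = (1 - a_j) / (K - 1) for K classes; the class prior is taken to be uniform; and the input is a list of (task, worker, label) triples. The function name and data layout are hypothetical, chosen only for this illustration.

```python
import math


def one_coin_em(labels, num_classes, iters=50):
    """EM for a one-coin worker model with a uniform class prior.

    labels: list of (task, worker, label) triples, label in range(num_classes).
    Returns (posteriors per task, estimated accuracy a_j per worker).
    """
    K = num_classes
    tasks = sorted({t for t, _, _ in labels})
    workers = sorted({w for _, w, _ in labels})

    # Initialise the posteriors with vote fractions (soft majority voting).
    post = {t: [0.0] * K for t in tasks}
    for t, _, l in labels:
        post[t][l] += 1.0
    for t in tasks:
        total = sum(post[t])
        post[t] = [p / total for p in post[t]]

    acc = {}
    for _ in range(iters):
        # M-step: a_j = expected fraction of worker j's labels that are correct.
        hit = {w: 0.0 for w in workers}
        n = {w: 0 for w in workers}
        for t, w, l in labels:
            hit[w] += post[t][l]
            n[w] += 1
        # Clip away from 0 and 1 so the logarithms below stay finite.
        acc = {w: min(max(hit[w] / n[w], 1e-6), 1 - 1e-6) for w in workers}

        # E-step: posterior over the true class of each task (log domain).
        logp = {t: [0.0] * K for t in tasks}
        for t, w, l in labels:
            a = acc[w]
            b = (1 - a) / (K - 1)
            for k in range(K):
                logp[t][k] += math.log(a if k == l else b)
        for t in tasks:
            m = max(logp[t])
            e = [math.exp(x - m) for x in logp[t]]
            s = sum(e)
            post[t] = [x / s for x in e]
    return post, acc
```

The estimated label of a task is the argmax of its posterior; workers with higher a_j pull the posterior more strongly toward their answers.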

• Other methods:
– GLAD [Whitehill et al 09], Belief propagation [Liu et al 12], Minimax entropy [Zhou et al 12] …

In Practice …

• Model Selection

• Standard models may not work
– Special structures on the classes
– Unbalanced labels

Two Datasets

• Google Fact Judgment Dataset
– 42,624 queries; 57 trained raters; 576 gold queries
– Answers: {No, Yes, Skip}

• CrowdFlower Sentiment Judgment Dataset
– 98,980 questions; 1,960 workers; 300 gold queries
– Answers: 0 (Negative), 1 (Neutral), 2 (Positive), 3 (not related), 4 (I can’t tell)

• Special classes “skip”, “I can’t tell”
– Ambiguity of queries

Evaluation Metric

• Averaged Recall: the recall of each class, averaged over all classes

• Special classes “skip”, “I can’t tell”
– Included in the evaluation?
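A minimal sketch of the metric, assuming "averaged recall" means the unweighted mean of the per-class recalls. The `classes` argument is a hypothetical knob that makes the open question above concrete: pass a restricted list to leave the special classes out of the average.

```python
def averaged_recall(gold, pred, classes=None):
    """Unweighted mean of per-class recalls.

    gold, pred: equal-length sequences of labels.
    classes: classes to average over (default: every class present in gold).
    """
    if classes is None:
        classes = sorted(set(gold))
    recalls = []
    for c in classes:
        idx = [i for i, g in enumerate(gold) if g == c]
        if not idx:
            # A class with no gold instances has no defined recall; skip it.
            continue
        recalls.append(sum(pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```

For example, with gold labels `["no", "yes", "yes", "yes", "skip"]` and a predictor that always answers `"yes"`, the per-class recalls are 0, 1 and 0, so the averaged recall is 1/3 even though the plain accuracy is 3/5.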

Important Properties

• Unbalanced labels (on the gold data)

Google Data

No    Yes    Skip
19    531    26

CrowdFlower Data

0 (Negative)        57
1 (Neutral)         70
2 (Positive)        72
3 (not related)     92
4 (I can’t tell)     9    (only 9 instances in the reference data)

Evaluation Metric

Gold label counts (Google data):

No    Yes    Skip
19    531    26

• The importance of minority classes is up-weighted.

Each instance of class “Skip” is 531/26 ≈ 20 times more important than an instance of class “Yes”.
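The factor of about 20 can be worked out directly, assuming averaged recall is the unweighted mean of the three per-class recalls (here $\mathrm{TP}_c$ denotes the number of correctly predicted gold instances of class $c$, a symbol introduced only for this derivation):

```latex
\[
\text{Averaged recall}
  = \frac{1}{3}\left(
      \frac{\mathrm{TP}_{\text{No}}}{19}
    + \frac{\mathrm{TP}_{\text{Yes}}}{531}
    + \frac{\mathrm{TP}_{\text{Skip}}}{26}
    \right)
\]
\[
\text{one correct ``Skip''} \;\to\; \frac{1}{3 \cdot 26} \approx 0.0128,
\qquad
\text{one correct ``Yes''} \;\to\; \frac{1}{3 \cdot 531} \approx 0.00063,
\qquad
\frac{1/(3 \cdot 26)}{1/(3 \cdot 531)} = \frac{531}{26} \approx 20.4
\]
```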

Overfitting!

• Difficult to predict minority classes
– E.g., only 9 “I can’t tell” instances in the gold data; difficult to generalize

Google Fact Judgment Dataset

• Model selection (MV, one/two-coin EM):
– Majority vote is the best
– 57 “trained” workers
– High and uniform accuracies

• But not good enough …

[Figure: histogram of workers’ accuracies; x-axis: workers’ accuracies (tick at 0.7), y-axis: # of workers]

Google Fact Judgment Dataset

• Our Algorithm:

For each query i
1. Count the percentages of labels submitted by the raters: ci(yes), ci(no), ci(skip)
2. Assign the label:
   if ci(yes) > 0.4, labeli = yes
   if ci(no) > 0.8, labeli = no
   otherwise, labeli = skip
End
Return {labeli}
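The thresholding rule above can be sketched as follows. The label strings `"yes"`, `"no"`, `"skip"`, the function names, and the handling of a query with no votes are assumptions made for this illustration; the thresholds 0.4 and 0.8 are the ones on the slide.

```python
from collections import Counter


def label_query(votes, yes_threshold=0.4, no_threshold=0.8):
    """Apply the slide's rule to the list of rater votes for one query."""
    if not votes:
        # Assumption: a query with no votes falls through to "skip".
        return "skip"
    counts = Counter(votes)
    n = len(votes)
    c_yes = counts["yes"] / n
    c_no = counts["no"] / n
    # The two conditions cannot hold together, since 0.4 + 0.8 > 1.
    if c_yes > yes_threshold:
        return "yes"
    if c_no > no_threshold:
        return "no"
    return "skip"


def label_all(votes_by_query):
    """Label every query; votes_by_query maps query id -> list of votes."""
    return {q: label_query(v) for q, v in votes_by_query.items()}
```

The asymmetric thresholds make "yes" easy to assign and "no" hard, so uncertain queries fall into the minority class "skip", which the averaged-recall metric weights heavily.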

CrowdFlower Sentiment Judgment Dataset

• Model selection:
– One-coin EM is best

[Figure: histogram; y-axis: # of workers]

• Overall confusion matrix:

        0     1     2     3     4
0     256    47    14    24    27
1      22   280    26    35    22
2      11    43   308    30     9
3       9    22     6   456    14
4       7    16    13     6    17

Removing Class 4 may improve performance

CrowdFlower Sentiment Judgment Dataset

• Our algorithm:

1. Remove all class 4 labels from the data, then run one-coin EM to get posterior distributions on the remaining classes.

2. If ci(4) > 0.5 or the entropy of the posterior distribution > log(4) – 0.27, then

endif
Return
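A hedged sketch of step 2. The action taken when the test fires is not legible on the slide; this sketch ASSUMES it assigns class 4 and otherwise returns the most probable of the remaining classes. It also assumes natural logarithms and a posterior given as a list of four probabilities over classes 0 to 3. The function and parameter names are hypothetical.

```python
import math


def final_label(posterior, c4, c4_threshold=0.5, entropy_margin=0.27):
    """Decide the final label of one question.

    posterior: probabilities over classes 0-3 from one-coin EM (class 4 removed).
    c4: fraction of the question's original votes that were class 4.
    """
    # Shannon entropy in nats; log(4) is its maximum over four classes.
    entropy = -sum(p * math.log(p) for p in posterior if p > 0)
    if c4 > c4_threshold or entropy > math.log(len(posterior)) - entropy_margin:
        return 4  # ASSUMPTION: ambiguous questions are labeled "I can't tell"
    # ASSUMPTION: otherwise take the most probable remaining class.
    return max(range(len(posterior)), key=lambda k: posterior[k])
```

Under these assumptions, a question ends up in class 4 either because most workers said so directly or because the posterior over the other classes is close to uniform.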

Thank You
