Crowdscale Shared Task Challenge 2013
Qiang Liu (UC Irvine), Jian Peng (MIT CSAIL), Alexander Ihler (UC Irvine)
Crowdsourcing
• Collect data and knowledge at large scale
Experts: Time-consuming & expensive
Crowdsourcing: Combine many non-experts
Crowdsourcing for Labeling
• Goal: estimate the true labels zi from the noisy labels {Lij}.
[Figure: tasks and the workers who label them]
Baseline Methods
• Majority Voting – All the workers have the same performance
• Two-coin Model (Dawid & Skene 79)
  – Each worker characterized by a confusion matrix
  – Learned by expectation maximization (EM)
Confusion matrix of worker j (rows: true answer; columns: worker j’s answer):
aj11 aj12 aj13 aj14
aj21 aj22 aj23 aj24
aj31 aj32 aj33 aj34
aj41 aj42 aj43 aj44
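The majority-voting baseline listed above can be sketched in a few lines of Python. This is only an illustration: the input layout (a dict from task id to the list of worker labels) and the tie-breaking rule (smallest label wins) are assumptions made here, not details from the slides.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label per task.

    labels: dict mapping a task id to the list of labels its workers gave
            (hypothetical input layout, chosen for this sketch).
    """
    result = {}
    for task, votes in labels.items():
        counts = Counter(votes)
        best = max(counts.values())
        # Tie-breaking by smallest label is an assumption of this sketch.
        result[task] = min(l for l, c in counts.items() if c == best)
    return result

votes = {"t1": [1, 1, 0], "t2": [2, 0, 2, 2], "t3": [0, 1]}
print(majority_vote(votes))
```

Every vote counts equally here, which is exactly the "all workers have the same performance" assumption of the baseline.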
• One-coin Model
  – Each worker characterized by an accuracy parameter
Confusion matrix of worker j (rows: true answer; columns: worker j’s answer):
aj bj bj bj
bj aj bj bj
bj bj aj bj
bj bj bj aj
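A minimal EM sketch for the one-coin model follows. Since each row of the confusion matrix must sum to one, the off-diagonal entry is bj = (1 - aj) / (K - 1) for K classes. Everything else is an assumption of this sketch rather than something stated on the slides: the (task, worker, label) input format, the uniform class prior, the initial accuracy of 0.7, and the fixed iteration count.

```python
import math

def one_coin_em(labels, num_classes, iters=50):
    """EM for the one-coin model (sketch).

    labels: list of (task, worker, label) triples with label in range(num_classes).
    Returns (posterior per task, estimated accuracy per worker).
    """
    tasks = sorted({t for t, _, _ in labels})
    workers = sorted({w for _, w, _ in labels})
    acc = {w: 0.7 for w in workers}  # assumed starting point
    post = {}
    for _ in range(iters):
        # E-step: posterior over the true class of each task (uniform prior assumed).
        log_post = {t: [0.0] * num_classes for t in tasks}
        for t, w, l in labels:
            a = min(max(acc[w], 1e-6), 1 - 1e-6)  # keep the logs finite
            b = (1 - a) / (num_classes - 1)       # off-diagonal entry b_j
            for k in range(num_classes):
                log_post[t][k] += math.log(a if k == l else b)
        for t in tasks:
            m = max(log_post[t])
            unnorm = [math.exp(v - m) for v in log_post[t]]
            s = sum(unnorm)
            post[t] = [u / s for u in unnorm]
        # M-step: accuracy = expected fraction of a worker's labels that are correct.
        hit = {w: 0.0 for w in workers}
        cnt = {w: 0 for w in workers}
        for t, w, l in labels:
            hit[w] += post[t][l]
            cnt[w] += 1
        acc = {w: hit[w] / cnt[w] for w in workers}
    return post, acc

# Toy data: workers A, B, C always answer correctly, worker D is always off by one.
truth = [0, 1, 2, 1]
data = []
for t, z in enumerate(truth):
    data += [(t, w, z) for w in "ABC"] + [(t, "D", (z + 1) % 3)]
post, acc = one_coin_em(data, num_classes=3)
print([max(range(3), key=lambda k: post[t][k]) for t in range(len(truth))])
print(acc)
```

The two-coin model would replace the single accuracy per worker with a full K × K matrix updated in the same M-step.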
• Other methods:
  – GLAD [Whitehill et al 09], Belief propagation [Liu et al 12], Minimax entropy [Zhou et al 12] …
In Practice …
• Model Selection
• Standard models may not work
  – Special structures on the classes
  – Unbalanced labels
Two Datasets
• Google Fact Judgment Dataset
  – 42,624 queries; 57 trained raters; 576 gold queries
  – Answers: {No, Yes, Skip}
• CrowdFlower Sentiment Judgment Dataset
  – 98,980 questions; 1,960 workers; 300 gold queries
  – Answers: 0 (Negative), 1 (Neutral), 2 (Positive), 3 (not related), 4 (I can’t tell)
• Special classes “skip”, “I can’t tell”
  – Ambiguity of queries
Evaluation Metric
• Averaged recall: the mean of the per-class recalls
• Special classes “skip”, “I can’t tell”
  – Included in the evaluation?
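The metric can be sketched as follows, assuming averaged recall means the unweighted mean of per-class recalls (often called macro-averaged recall). The function name and the choice to skip classes that never occur in the gold data are assumptions of this sketch.

```python
def averaged_recall(gold, pred, classes):
    """Unweighted mean of per-class recalls.

    gold, pred: equal-length sequences of true and predicted labels.
    classes: the labels to average over; classes absent from `gold`
             are skipped (an assumption of this sketch).
    """
    recalls = []
    for k in classes:
        idx = [i for i, g in enumerate(gold) if g == k]
        if not idx:
            continue
        recalls.append(sum(pred[i] == k for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

gold = ["yes"] * 4 + ["no"] * 2 + ["skip"]
pred = ["yes", "yes", "yes", "no", "no", "no", "yes"]
# Per-class recalls: yes 3/4, no 2/2, skip 0/1.
print(averaged_recall(gold, pred, ["no", "yes", "skip"]))
```

In this toy example plain accuracy is 5/7, while a single missed "skip" item pulls the averaged recall down by a full third of the score.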
Important Properties
• Unbalanced labels (on the gold data)
Google Data:
No    Yes   Skip
19    531   26
CrowdFlower Data:
0 (Negative)   1 (Neutral)   2 (Positive)   3 (not related)   4 (I can’t tell)
57             70            72             92                9
Only 9 instances of class 4 in the reference data
Evaluation Metric
Google gold label counts: No 19, Yes 531, Skip 26
• The importance of minority classes is up-weighted.
  – Class “Skip” is 531/26 ≈ 20 times more important than class “Yes”
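The factor of about 20 can be derived directly, assuming averaged recall is the unweighted mean of per-class recalls over K classes. The symbols n_k (number of gold instances of class k), \hat z_i (predicted label), and w_k (weight of one gold instance of class k) are introduced here for illustration.

```latex
\[
\text{AvgRecall}
  = \frac{1}{K}\sum_{k=1}^{K}\frac{\#\{i : z_i = k,\ \hat z_i = k\}}{n_k}
\quad\Longrightarrow\quad
w_k = \frac{1}{K\,n_k},
\]
\[
\frac{w_{\text{Skip}}}{w_{\text{Yes}}}
  = \frac{1/(3\cdot 26)}{1/(3\cdot 531)}
  = \frac{531}{26} \approx 20.4 .
\]
```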
Overfitting!
• Difficult to predict minority classes
  – E.g., only 9 “I can’t tell” in the gold data, difficult to generalize
Google Fact Judgment Dataset
• Model selection (MV, one/two-coin EM):
  – Majority vote is the best
  – 57 “trained” workers
  – High and uniform accuracies
• But not good enough …
[Figure: histogram of workers’ accuracies (value 0.7 marked); x-axis: workers’ accuracies, y-axis: # of workers]
Google Fact Judgment Dataset
• Our Algorithm:
For each query i
  1. Compute the fraction of each label submitted by the raters: ci(yes), ci(no), ci(skip)
  2. If ci(yes) > 0.4: labeli = yes
     If ci(no) > 0.8: labeli = no
     Otherwise: labeli = skip
End
Return {labeli}
(The two conditions cannot both hold, since 0.4 + 0.8 > 1.)
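The thresholding rule above can be sketched as below. The thresholds 0.4 and 0.8 come from the slide; the function name, the string labels, and the handling of a query with no votes are assumptions of this sketch.

```python
from collections import Counter

def label_query(votes, yes_thresh=0.4, no_thresh=0.8):
    """Label one query from its raters' votes ("yes", "no" or "skip")."""
    if not votes:
        return "skip"  # assumption: a query without votes is treated as skip
    counts = Counter(votes)
    c_yes = counts["yes"] / len(votes)
    c_no = counts["no"] / len(votes)
    if c_yes > yes_thresh:
        return "yes"
    if c_no > no_thresh:
        return "no"
    return "skip"

queries = {
    "q1": ["yes", "yes", "yes", "no", "no"],
    "q2": ["no", "no", "no", "no", "no"],
    "q3": ["yes", "yes", "no", "skip", "no"],
}
print({q: label_query(v) for q, v in queries.items()})
```

The asymmetric thresholds make "yes" easy to assign and "no" hard, so more queries fall through to the rare "skip" class that the averaged-recall metric weights heavily.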
CrowdFlower Sentiment Judgment Dataset
• Model selection:
  – One-coin EM is best
[Figure: histogram of workers’ accuracies; x-axis: workers’ accuracies, y-axis: # of workers]
• Overall confusion matrix (classes 0–4 on both axes):
      0    1    2    3    4
0   256   47   14   24   27
1    22  280   26   35   22
2    11   43  308   30    9
3     9   22    6  456   14
4     7   16   13    6   17
Removing Class 4 may improve performance
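CrowdFlower Sentiment Judgment Dataset
• Our algorithm:
  1. Remove all class 4 labels from the data and run one-coin EM to get posterior distributions over the remaining classes.
  2. If ci(4) > 0.5 or entropy(posterior of query i) > log(4) – 0.27, then [label assignment missing in the transcript] endif
Return [output missing in the transcript]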
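A hedged sketch of step 2 follows. The transcript keeps the condition but not what happens when it fires, so two things here are assumptions and not confirmed by the slides: that a triggered condition assigns class 4 ("I can’t tell"), and that otherwise the most probable class under the posterior is returned. Natural logarithms are also assumed, and the function and parameter names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats (natural log assumed)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def final_label(posterior, frac_class4, frac_thresh=0.5, margin=0.27):
    """Decide the label of one query.

    posterior: probabilities over classes 0..3 from one-coin EM run
               without the class 4 labels.
    frac_class4: c_i(4), the fraction of the query's votes that were class 4.
    """
    if frac_class4 > frac_thresh or entropy(posterior) > math.log(4) - margin:
        return 4  # ASSUMPTION: an uncertain query is labeled "I can't tell"
    # ASSUMPTION: otherwise take the most probable remaining class.
    return max(range(len(posterior)), key=lambda k: posterior[k])

print(final_label([0.97, 0.01, 0.01, 0.01], frac_class4=0.0))
print(final_label([0.25, 0.25, 0.25, 0.25], frac_class4=0.0))
print(final_label([0.01, 0.97, 0.01, 0.01], frac_class4=0.6))
```

Since log(4) is the largest entropy a distribution over four classes can have, the entropy test fires only when the posterior is close to uniform.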
Thank You