Rewarding Crowdsourced Workers
Panos Ipeirotis
New York Universityand
Joint work with: Jing Wang, Foster Provost, Josh Attenberg, and Victor Sheng;
Twitter: @ipeirotis
“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.com
Example: Build an Web Page Classifier
Need a large number of labeled sites for training Get people to look at sites and label them as:
G (general audience) PG (parental guidance) R (restricted) X (porn)
Cost/Speed Statistics Undergrad intern: 200 websites/hr, cost:
$15/hr Mechanical Turk: 2500 websites/hr, cost:
$12/hr
Cost/Speed Statistics Undergrad intern: 200 websites/hr, cost:
$15/hr Mechanical Turk: 2500 websites/hr, cost:
$12/hr
Challenges
We do not know the true category for the objects– Available only after (costly) manual inspection
We do not know quality of the workers
We want to label objects with true categories We want (need?) to know the quality of the workers
1. Initialize “correct” label for each object (e.g., use majority vote)
2. Estimate error rates for workers (using “correct” labels)
3. Estimate “correct” labels (using error rates, weight worker votes according to quality)
4. Go to Step 2 and iterate until convergence
1. Initialize “correct” label for each object (e.g., use majority vote)
2. Estimate error rates for workers (using “correct” labels)
3. Estimate “correct” labels (using error rates, weight worker votes according to quality)
4. Go to Step 2 and iterate until convergence
Expectation Maximization Estimation
Iterative process to estimate worker error rates
Challenge: From Confusion Matrixes to Quality Scores
How to check if a worker is a spammer
using the confusion matrix?(hint: error rate not enough)
How to check if a worker is a spammer
using the confusion matrix?(hint: error rate not enough)
Confusion matrix for spammer worker
P[X → X]=0.847% P[X → G]=99.153%
P[G → X]=0.053% P[G → G]=99.947%
Confusion matrix for good worker P[X → X]=99.847% P[X →
G]=0.153% P[G → X]=4.053% P[G →
G]=95.947%
Challenge 1: Spammers are lazy and smart!
Confusion matrix for spammer P[X → X]=0% P[X → G]=100%
P[G → X]=0% P[G → G]=100%
Confusion matrix for good worker
P[X → X]=80% P[X → G]=20%
P[G → X]=20%P[G → G]=80% Spammers figure out how to fly under the radar…
In reality, we have 85% G sites and 15% X sites
Error rate of spammer = 0% * 85% + 100% * 15% = 15% Error rate of good worker = 85% * 20% + 85% * 20% = 20%
False negatives: Spam workers pass as legitimate
Challenge 2: Humans are biased!
Error rates for legitimate (but biased) employeeP[G → G]=20.0% P[G → P]=80.0%P[G → R]=0.0% P[G → X]=0.0%P[P → G]=0.0% P[P → P]=0.0% P[P → R]=100.0% P[P →
X]=0.0%P[R → G]=0.0% P[R → P]=0.0% P[R → R]=100.0% P[R →
X]=0.0%P[X → G]=0.0% P[X → P]=0.0% P[X → R]=0.0% P[X → X]=100.0%
Error rates for legitimate (but biased) employeeP[G → G]=20.0% P[G → P]=80.0%P[G → R]=0.0% P[G → X]=0.0%P[P → G]=0.0% P[P → P]=0.0% P[P → R]=100.0% P[P →
X]=0.0%P[R → G]=0.0% P[R → P]=0.0% P[R → R]=100.0% P[R →
X]=0.0%P[X → G]=0.0% P[X → P]=0.0% P[X → R]=0.0% P[X → X]=100.0%
We have 85% G sites, 5% P sites, 5% R sites, 5% X sites
Error rate of spammer (all G) = 0% * 85% + 100% * 15%
= 15% Error rate of biased worker = 80% * 85% + 100% * 5% =
73%
False positives: Legitimate workers appear to be spammers
(important note: bias is not just a matter of “ordered” classes)
Solution: Fix bias first, compute error rate afterwards
When biased worker says G, it is 100% G When biased worker says P, it is 100% G When biased worker says R, it is 50% P, 50% R When biased worker says X, it is 100% X
Small ambiguity for “R-rated” votes but other than that, fine!
Error Rates for legitimate (but biased) employeeP[G → G]=20.0% P[G → P]=80.0%P[G → R]=0.0% P[G → X]=0.0%P[P → G]=0.0% P[P → P]=0.0% P[P → R]=100.0% P[P →
X]=0.0%P[R → G]=0.0% P[R → P]=0.0% P[R → R]=100.0% P[R →
X]=0.0%P[X → G]=0.0% P[X → P]=0.0% P[X → R]=0.0% P[X → X]=100.0%
Error Rates for legitimate (but biased) employeeP[G → G]=20.0% P[G → P]=80.0%P[G → R]=0.0% P[G → X]=0.0%P[P → G]=0.0% P[P → P]=0.0% P[P → R]=100.0% P[P →
X]=0.0%P[R → G]=0.0% P[R → P]=0.0% P[R → R]=100.0% P[R →
X]=0.0%P[X → G]=0.0% P[X → P]=0.0% P[X → R]=0.0% P[X → X]=100.0%
When spammer says G, it is 25% G, 25% P, 25% R, 25% X When spammer says P, it is 25% G, 25% P, 25% R, 25% X When spammer says R, it is 25% G, 25% P, 25% R, 25% X When spammer says X, it is 25% G, 25% P, 25% R, 25% X[note: assume equal priors]
The results are highly ambiguous. No information provided!
Error Rates for spammerP[G → G]=100.0% P[G → P]=0.0% P[G → R]=0.0% P[G → X]=0.0%P[P → G]=100.0% P[P → P]=0.0% P[P → R]=0.0% P[P → X]=0.0%P[R → G]=100.0% P[R → P]=0.0% P[R → R]=0.0% P[R → X]=0.0%P[X → G]=100.0% P[X → P]=0.0% P[X → R]=0.0% P[X → X]=0.0%
Error Rates for spammerP[G → G]=100.0% P[G → P]=0.0% P[G → R]=0.0% P[G → X]=0.0%P[P → G]=100.0% P[P → P]=0.0% P[P → R]=0.0% P[P → X]=0.0%P[R → G]=100.0% P[R → P]=0.0% P[R → R]=0.0% P[R → X]=0.0%P[X → G]=100.0% P[X → P]=0.0% P[X → R]=0.0% P[X → X]=0.0%
Solution: Fix bias first, compute error rate afterwards
[***Assume misclassification cost equal to 1, solution generalizes to arbitrary misclassification costs across categories]
• High cost: probability spread across classes
• Low cost: probability mass concentrated in one class
Assigned Label Corresponding “Soft” Label Label Cost
Spammer: G <G: 25%, P: 25%, R: 25%, X: 25%> 0.75
Good worker: P <G: 100%, P: 0%, R: 0%, X: 0%> 0.0
Expected Misclassification Cost
Quality Score
• A spammer is a worker who assigns labels randomly, regardless of what the true class is.
• Quality score useful for ranking workers
• Unaffected by systematic biases
• Scalar, so no need to examine confusion matrices
)P(
)Worker(1)Worker(
riorCost
CostreQualitySco
Quality Score: A scalar measure of quality
Quality Score
Challenges
• Thresholding has wrong incentive structure:
• Decent (but still useful) workers remain unused
• If you are above the threshold, no need to improve
• Uncertainty: The quality score is not really a fixed number
• Fluctuations in payment are puzzling for workers
• Best to have only increases in payment
Question: How to pay workers?
Two Types of Workers Divide workers into two groups
– Qualified Workers• The quality satisfies the target
quality
– Unqualified Workers• The qualify fails to meet the target
quality levels
– p: the price paid to all qualified workers
– fW(w): the pdf distribution of worker reservation
wage
– FW(w): the cdf distribution of worker reservation
wage
– R: the fixed price paid by external client for
each qualified object
A Simple Pricing Model for Qualified Workers
15fW(w): LogNormal(3,1), Selling price R=50
Example
fW(w)
Revenue
Optimal Worker Salary = 21
The Value of Unqualified Workers
Number of Workers 1 3 5 7 9 11
Classification Cost 0.300
0.216
0.163 0.126 0.099 0.079
We need ~9 workers to achieve the required quality
Binary classification (1:1) Worker confusion matrix
“Accept” classification cost <=0.1
Value: 1/9
– p: the price paid to all qualified workers
– fW(w): the pdf distribution of worker reservation
wage
FW(w): the cdf distribution of worker reservation
wage
– Adjust for the presence of “unqualified”
workers, and each unqualified worker “counts”
as 1/k of a qualified one
– R: the fixed price paid by external client for each qualified
object
A Pricing Model for Workers
Optimal Prices
p* p*/3
p*/9
Quality Score
Challenges
• Thresholding has wrong incentive structure:
• Decent (but still useful) workers remain unused
• If you are above the threshold, no need to improve
• Uncertainty: The quality score is not really a fixed number
• Fluctuations in payment are puzzling for workers
• Best to have only increases in payment
Question: How to pay workers?
Bayesian Estimates for Uncertainty
P[0 → 0]=Beta(2,1) P[0 → 1]=Beta(1,2)
P[1 → 0]=Beta(1,2) P[0 → 0]=Beta(2,1)
P[0 → 0]=Beta(101,1) P[0 → 1]=Beta(1,101)
P[1 → 0]=Beta(1,101) P[0 → 0]=Beta(101,1)
Worker A Worker B
Example of the piece-rate payment of a worker
Real-Time Payment and Reimbursement
Fair Paymen
t
# Tasks 10 20 30 40 Infinity
Piece-rate Payment (cents)
11 18 21 23 40
Example of the piece-rate payment of a worker
10
Real-Time Payment and Reimbursement
Fair Paymen
tPotenti
al “Bonus
”
# Tasks 10 20 30 40 Infinity
Piece-rate Payment (cents)
11 18 21 23 40
Payment
Number of Tasks
Piece
-rate
Paym
ent
10 20
Real-Time Payment and Reimbursement
Fair Paymen
tPotenti
al “Bonus
”
# Tasks 10 20 30 40 Infinity
Piece-rate Payment (cents)
11 18 21 23 40
Example of the piece-rate payment of a worker
Payment
Payment
Reimbursement
Number of Tasks
Piece
-rate
Paym
ent
10 20
Real-Time Payment and Reimbursement
Fair Paymen
t
30
Potential
“Bonus”
# Tasks 10 20 30 40 Infinity
Piece-rate Payment (cents)
11 18 21 23 40
Example of the piece-rate payment of a worker
Payment
Payment
Reimbursement Payment
Reimbursement
Piece
-rate
Paym
ent
10 20
Real-Time Payment and Reimbursement
Fair Paymen
t
30 40
Potential
“Bonus”
# Tasks 10 20 30 40 Infinity
Piece-rate Payment (cents)
11 18 21 23 40
Example of the piece-rate payment of a worker
Payment
Payment
Reimbursement Payment
Reimbursement
Payment
Reimbursement
Piece
-rate
Paym
ent
Synthetic Experiment Setup
N=10,000 tasks
R=200 cents
Cost <= 0.01
Labeling Process: Workers– Arrival frequency: every 600 seconds– Number of each arrival: 10 workers– Submitting speed: 30 seconds per task
The evaluation criterion is Unit Time Profit:
(Synthetic) Experimental Setup
Scatter Plot of Confusion Matrix, Reservation Wage
Qualify level and reservation wage independently distributed
Qualify level and reservation wage positively correlated
Experimental Results
No Correlation Positive Correlation0
5
10
15
20
25
30
35
Quality-Based PriceUniform Price: 4.4(30%-quan-tile)Uniform Price: 5.7(40%-quan-tile)Uniform Price: 7.4(50%-quan-tile)Uniform Price: 9.5(60%-quan-tile)Uniform Price: 12.5(70%-quantile)
24.6%
159.6%
Avera
ge P
rofit
per
Seco
nd
29
Workers reacting to bad rewards/scores
Score-based feedback leads to strange interactions:
The “angry, has-been-burnt-too-many-times” worker: “F*** YOU! I am doing everything correctly and you know
it! Stop trying to reject me with your stupid ‘scores’!”
The overachiever worker: “What am I doing wrong?? My score is 92% and I want to
have 100%”
30
National Academy of Sciences Dec 2010 “Frontiers of Science” conference
Your workers behave like my mice!
An unexpected connection…
31
Your workers behave like my mice!
Eh?
32
Your workers want to use only their motor skills, not their cognitive skills
33
The Biology Fundamentals
Brain functions are biologically expensive (20% of total energy consumption in humans)
Motor skills are more energy efficient than cognitive skills (e.g., walking)
Brain tends to delegate easy tasks to part of the neural system that handles motor skills
34
An unexpected connection at theNAS “Frontiers of Science” conf.
Your workers want to use only their motor skills, not their cognitive skills
Makes sense
35
An unexpected connection at theNAS “Frontiers of Science” conf.
And here is how I train my mice to behave…
36
The Mice Experiment
Cognitive
Solve mazeFind pellet
MotorPush lever three times
Pellet drops
37
How to Train the Mice?
Confuse motor skills!Reward cognition!
I should try this the moment that I get back to my room
38
Punishing Worker’s Motor Skills
Punish bad answers with frustration of motor skills (e.g., add delays between tasks)– “Loading image, please wait…”– “Image did not load, press here to reload”– “404 error. Return the HIT and accept again”
→Make this probabilistic to keep feedback implicit
39
Rewarding (?) Cognitive Effort
Reward good answers by rewarding the cognitive part of the brain – Introduce variety– Introduce novelty– Give new tasks fast– Show score improvements faster (but not the opposite)
– Show optimistic score estimates
40
41
Experiments
Web page classification Image tagging Email & URL collection
42
Experimental Summary (I)
Spammer workers quickly abandon– No need to display scores, or ban– Low quality submissions from ~60% to ~3%– Half-life of low-quality from 100+ HITs to less than 5
Good workers unaffected– No significant effect on participation of workers with
good performance– Lifetime of participants unaffected– Longer response times (after removing the
“intervention delays”; that was puzzling)
43
Experimental Summary (II)
Remember, scheme was for training the mice…
15%-20% of the spammers start submitting good work!
????
44
Two key questions
Why response time was slower for some good workers?
Why some low quality workers start working well?
????
45
System 1: “Automatic” actions
System 2:“Intelligent” actions
46
System 1 Tasks
47
System 2 Tasks
48
Status: Usage of System 1
(“Automatic”)
Status: Usage of System 2
(“Intelligent”)
Not Performing Well?Disrupt and Engage
System 2
Performing Well?Check if System 1
can Handle, remove System 2 stimuli
Performing Well?
Not Performing Well?Hell/Slow ban
Out
Thanks!
Q & A?
Top Related