Crowdsourcing for HCI Research with Amazon Mechanical Turk
![Page 1: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/1.jpg)
Crowdsourcing for Human Computer Interaction Research
Ed H. Chi, Research Scientist, Google (work done while at [Xerox] PARC with Aniket Kittur)
![Page 2: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/2.jpg)
User studies
• Getting input from users is important in HCI – surveys – rapid prototyping – usability tests – cognitive walkthroughs – performance measures – quantitative ratings
![Page 3: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/3.jpg)
User studies
• Getting input from users is expensive – Time costs – Monetary costs
• Often have to trade off costs with sample size
![Page 4: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/4.jpg)
Online solutions
• Online user surveys • Remote usability testing • Online experiments • But still have difficulties
– Rely on practitioner for recruiting participants – Limited pool of participants
![Page 5: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/5.jpg)
Crowdsourcing
• Make tasks available for anyone online to complete
• Quickly access a large user pool, collect data, and compensate users
• Example: NASA Clickworkers
– 100k+ volunteers identified Mars craters from space photographs
– Aggregate results “virtually indistinguishable” from expert geologists
[Figure: experts vs. crowds]
http://clickworkers.arc.nasa.gov
![Page 6: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/6.jpg)
Amazon’s Mechanical Turk
• Market for “human intelligence tasks” • Typically short, objective tasks
– Tag an image – Find a webpage – Evaluate relevance of search results
• Users complete tasks for a few pennies each
![Page 7: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/7.jpg)
Example task
![Page 8: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/8.jpg)
Using Mechanical Turk for user studies
| | Traditional user studies | Mechanical Turk |
| --- | --- | --- |
| Task complexity | Complex, long | Simple, short |
| Task subjectivity | Subjective, opinions | Objective, verifiable |
| User information | Targeted demographics, high interactivity | Unknown demographics, limited interactivity |
Can Mechanical Turk be useful for user studies?
![Page 9: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/9.jpg)
Task
• Assess quality of Wikipedia articles • Started with ratings from expert Wikipedians
– 14 articles (e.g., “Germany”, “Noam Chomsky”) – 7-point scale
• Can we get matching ratings with Mechanical Turk?
![Page 10: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/10.jpg)
Experiment 1
• Rate articles on 7-point scales: – Well written – Factually accurate – Overall quality
• Free-text input: – What improvements does the article need?
• Paid $0.05 each
![Page 11: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/11.jpg)
Experiment 1: Good news
• 58 users made 210 ratings (15 per article) – $10.50 total
• Fast results – 44% within a day, 100% within two days – Many completed within minutes
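The cost figure on this slide follows from simple arithmetic. A minimal sketch using the slide's own numbers (14 articles, 15 ratings per article, $0.05 per rating); the variable names are illustrative, and any platform fees, which the slides do not mention, are left out:

```python
from decimal import Decimal

ARTICLES = 14                      # from the Task slide
RATINGS_PER_ARTICLE = 15           # from this slide
PAY_PER_RATING = Decimal("0.05")   # dollars per rating, from Experiment 1

# Total number of ratings collected and total payout to workers.
total_ratings = ARTICLES * RATINGS_PER_ARTICLE
total_cost = total_ratings * PAY_PER_RATING

print(f"{total_ratings} ratings, ${total_cost} total")
```

Using `Decimal` rather than floats keeps the currency arithmetic exact.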
![Page 12: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/12.jpg)
Experiment 1: Bad news
• Correlation between turkers and Wikipedians only marginally significant (r=.50, p=.07)
• Worse, 59% potentially invalid responses
• Nearly 75% of these done by only 8 users
| | Experiment 1 |
| --- | --- |
| Invalid comments | 49% |
| <1 min responses | 31% |
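The correlation reported on this slide compares turker ratings with expert ratings across articles. The slides do not show how it was computed; the sketch below assumes turker ratings are averaged per article and then correlated with the expert rating using Pearson's r. The article names and ratings are made up for illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data on a 7-point scale (not the study's ratings).
turker_ratings = {
    "article_a": [5, 6, 4],
    "article_b": [3, 2, 4],
    "article_c": [6, 7, 5],
    "article_d": [4, 4, 4],
}
expert_ratings = {"article_a": 5, "article_b": 2, "article_c": 7, "article_d": 5}

# Average the turker ratings per article, then correlate with the experts.
articles = sorted(expert_ratings)
turker_means = [mean(turker_ratings[a]) for a in articles]
experts = [expert_ratings[a] for a in articles]

r = pearson_r(turker_means, experts)
print(f"r = {r:.2f}")
```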
![Page 13: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/13.jpg)
Not a good start
• Summary of Experiment 1: – Only marginal correlation with experts. – Heavy gaming of the system by a minority
• Possible Response: – Can make sure these gamers are not rewarded – Ban them from doing your HITs in the future – Create a reputation system [Dolores Labs]
• Can we change how we collect user input?
![Page 14: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/14.jpg)
Design changes
• Use verifiable questions to signal monitoring – “How many sections does the article have?” – “How many images does the article have?” – “How many references does the article have?”
![Page 15: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/15.jpg)
Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith answers
– “Provide 4-6 keywords that would give someone a good summary of the contents of the article”
![Page 16: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/16.jpg)
Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith answers
• Make verifiable answers useful for completing task
– Used tasks similar to how Wikipedians described evaluating quality (organization, presentation, references)
![Page 17: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/17.jpg)
Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith answers
• Make verifiable answers useful for completing task
• Put verifiable tasks before subjective responses
– First do objective tasks and summarization
– Only then evaluate subjective quality
– Ecological validity?
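One way to use the verifiable questions is as an automatic screen on incoming responses. The sketch below is an assumption, not the procedure used in the study: the field names, the one-minute threshold (echoing the "<1 min responses" measure), and the tolerance on the section count are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Response:
    worker_id: str
    seconds_on_task: float
    reported_sections: int   # answer to "How many sections does the article have?"
    keywords: list           # answer to the 4-6 keywords prompt
    quality_rating: int      # subjective rating on the 7-point scale


def is_suspect(resp, true_sections, min_seconds=60, tolerance=1):
    """Flag a response as potentially invalid (hypothetical thresholds)."""
    too_fast = resp.seconds_on_task < min_seconds
    wrong_count = abs(resp.reported_sections - true_sections) > tolerance
    # Count distinct, non-empty keywords, ignoring case and whitespace.
    distinct = {k.strip().lower() for k in resp.keywords if k.strip()}
    bad_keywords = not (4 <= len(distinct) <= 6)
    out_of_range = not (1 <= resp.quality_rating <= 7)
    return too_fast or wrong_count or bad_keywords or out_of_range


careful = Response("w1", 240.0, 9, ["history", "economy", "politics", "culture"], 6)
rushed = Response("w2", 20.0, 3, ["good", "good"], 7)

print(is_suspect(careful, true_sections=9))
print(is_suspect(rushed, true_sections=9))
```

Flagged responses would still need a human look; a wrong section count can be an honest miscount.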
![Page 18: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/18.jpg)
Experiment 2: Results
• 124 users provided 277 ratings (~20 per article)
• Significant positive correlation with Wikipedians (r=.66, p=.01)
• Smaller proportion malicious responses
• Increased time on task

| | Experiment 1 | Experiment 2 |
| --- | --- | --- |
| Invalid comments | 49% | 3% |
| <1 min responses | 31% | 7% |
| Median time | 1:30 | 4:06 |
![Page 19: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/19.jpg)
Generalizing to other user studies
• Combine objective and subjective questions
– Rapid prototyping: ask verifiable questions about content/design of prototype before subjective evaluation
– User surveys: ask common-knowledge questions before asking for opinions
![Page 20: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/20.jpg)
Limitations of Mechanical Turk
• No control of users’ environment
– Potential for different browsers, physical distractions
– General problem with online experimentation
• Not designed for user studies
– Difficult to do between-subjects design
– Involves some programming
• Users
– Uncertainty about user demographics, expertise
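The between-subjects difficulty noted above is partly a bookkeeping problem: a returning worker must always land in the same condition. One possible approach, sketched here as an assumption rather than anything the slides describe, is to hash the worker ID. The condition names and the salt are hypothetical.

```python
import hashlib

# Hypothetical condition labels for a two-group study.
CONDITIONS = ["control", "treatment"]


def assign_condition(worker_id, conditions=CONDITIONS, salt="study-1"):
    """Map a worker ID to one condition, the same way on every call."""
    digest = hashlib.sha256(f"{salt}:{worker_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to an index.
    index = int.from_bytes(digest[:8], "big") % len(conditions)
    return conditions[index]


for worker in ["WORKER_A", "WORKER_B", "WORKER_C"]:
    print(worker, assign_condition(worker))
```

Hashing needs no stored state, but group sizes are only balanced on average; a study that needs exact balance would have to keep an assignment table instead.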
![Page 21: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/21.jpg)
Quick Summary
1. Use verifiable questions to signal monitoring
2. Make malicious answers as high cost as good-faith answers
3. Make verifiable answers useful for completing task
4. Put verifiable tasks before subjective responses
• Mechanical Turk offers the practitioner a way to access a large user pool and quickly collect data at low cost
• Good results require careful task design
![Page 22: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/22.jpg)
Crowdsourcing for HCI Research
• Does my interface/visualization work? – WikiDashboard: transparency visualization for Wikipedia – J. Heer’s work at Stanford looking at perceptual effects
• Coding of large amounts of user data – What is a question on Twitter? (Sharoda Paul at PARC)
• Decompose tasks into smaller tasks – Digital Taylorism – Frederick Winslow Taylor (1856-1915), 1911 book 'The Principles of Scientific Management'
• Incentive mechanisms – Intrinsic vs. Extrinsic rewards – Games vs. Pay
![Page 23: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/23.jpg)
• @edchi • [email protected] • http://edchi.net
![Page 24: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/24.jpg)
What would make you trust Wikipedia more?
![Page 25: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/25.jpg)
What is Wikipedia?
“Wikipedia is the best thing ever. Anyone in the world can write anything they want about any subject, so you know you’re getting the best possible information.” – Steve Carell, The Office
![Page 26: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/26.jpg)
What would make you trust Wikipedia more?
Nothing
![Page 27: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/27.jpg)
What would make you trust Wikipedia more?
“Wikipedia, just by its nature, is impossible to trust completely. I don't think this can necessarily be changed.”
![Page 28: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/28.jpg)
WikiDashboard
• Transparency of social dynamics can reduce conflict and coordination issues
• Attribution encourages contribution
– WikiDashboard: Social dashboard for wikis
– Prototype system: http://wikidashboard.parc.com
• Visualization for every wiki page showing edit history timeline and top individual editors
• Can drill down into activity history for specific editors and view edits to see changes side-by-side
Citation: Suh et al. CHI 2008 Proceedings
Crowdsourcing Meetup (Stanford 2011)
![Page 29: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/29.jpg)
Hillary Clinton
![Page 30: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/30.jpg)
Top Editor: Wasted Time R
![Page 31: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/31.jpg)
Surfacing information
• Numerous studies mining Wikipedia revision history to surface trust-relevant information
– Adler & Alfaro, 2007; Dondio et al., 2006; Kittur et al., 2007; Viegas et al., 2004; Zeng et al., 2006
• But how much impact can this have on user perceptions in a system which is inherently mutable?
Suh, Chi, Kittur, & Pendleton, CHI 2008
![Page 32: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/32.jpg)
Hypotheses
1. Visualization will impact perceptions of trust
2. Compared to baseline, visualization will impact trust both positively and negatively
3. Visualization should have most impact when high uncertainty about article
• Low quality
• High controversy
![Page 33: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/33.jpg)
Design
• 3 x 2 x 2 design

| | Controversial | Uncontroversial |
| --- | --- | --- |
| High quality | Abortion, George Bush | Volcano, Shark |
| Low quality | Pro-life feminism, Scientology and celebrities | Disk defragmenter, Beeswax |

Visualization
• High stability
• Low stability
• Baseline (none)
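The 3 x 2 x 2 design crosses three visualization conditions with two levels each of controversy and quality. A short sketch enumerating the resulting cells, with the level labels taken from this slide; the variable names are illustrative:

```python
from itertools import product

VISUALIZATION = ["high stability", "low stability", "baseline"]
CONTROVERSY = ["controversial", "uncontroversial"]
QUALITY = ["high quality", "low quality"]

# Every combination of one level from each factor is one cell of the design.
conditions = list(product(VISUALIZATION, CONTROVERSY, QUALITY))

for viz, controversy, quality in conditions:
    print(f"{viz:15} | {controversy:15} | {quality}")
print(len(conditions), "cells")
```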
![Page 34: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/34.jpg)
Example: High trust visualization
![Page 35: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/35.jpg)
Example: Low trust visualization
![Page 36: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/36.jpg)
Summary info
• % from anonymous users
![Page 37: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/37.jpg)
Summary info
• % from anonymous users
• Last change by anonymous or established user
![Page 38: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/38.jpg)
Summary info
• % from anonymous users
• Last change by anonymous or established user
• Stability of words
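The first two summary items can be derived from an article's revision history. The sketch below assumes a hypothetical `Revision` record with an anonymity flag and treats every non-anonymous editor as "established"; it is not WikiDashboard's implementation. Word stability is left out because the slides do not define how it is measured.

```python
from dataclasses import dataclass


@dataclass
class Revision:
    editor: str
    is_anonymous: bool


def percent_anonymous(revisions):
    """Share of revisions made by anonymous users, as a percentage."""
    if not revisions:
        return 0.0
    anonymous = sum(1 for rev in revisions if rev.is_anonymous)
    return 100.0 * anonymous / len(revisions)


def last_change_by(revisions):
    """Label the most recent revision; expects oldest-to-newest order."""
    if not revisions:
        return None
    return "anonymous" if revisions[-1].is_anonymous else "established"


# Hypothetical history, oldest first.
history = [
    Revision("editor_one", False),
    Revision("203.0.113.7", True),
    Revision("editor_two", False),
    Revision("editor_one", False),
]

print(f"{percent_anonymous(history):.0f}% from anonymous users")
print("Last change by:", last_change_by(history))
```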
![Page 39: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/39.jpg)
Graph
• Instability
![Page 40: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/40.jpg)
Method
• Users recruited via Amazon’s Mechanical Turk – 253 participants – 673 ratings – 7 cents per rating – Kittur, Chi, & Suh, CHI 2008: Crowdsourcing user studies
• To ensure salience and valid answers, participants answered: – In what time period was this article the least stable? – How stable has this article been for the last month? – Who was the last editor? – How trustworthy do you consider the above editor?
![Page 41: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/41.jpg)
Results
[Figure: trustworthiness rating (1 to 7) by article quality (low, high) and controversy (uncontroversial, controversial), with series for high stability, baseline, and low stability]
Main effects of quality and controversy:
• high-quality articles > low-quality articles (F(1, 425) = 25.37, p < .001)
• uncontroversial articles > controversial articles (F(1, 425) = 4.69, p = .031)
![Page 42: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/42.jpg)
Results
[Figure: trustworthiness rating (1 to 7) by article quality (low, high) and controversy (uncontroversial, controversial), with series for high stability, baseline, and low stability]
Interaction effects of quality and controversy:
• high-quality articles were rated equally trustworthy whether controversial or not, while
• low-quality articles were rated lower when they were controversial than when they were uncontroversial.
![Page 43: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/43.jpg)
Results
1. Significant effect of visualization – High > low, p < .001
2. Viz has both positive and negative effects – High > baseline, p < .001 – Low < baseline, p < .01
3. No interaction of visualization with either quality or controversy – Robust across conditions
[Figure: trustworthiness rating (1 to 7) by article quality (low, high) and controversy (uncontroversial, controversial), with series for high stability, baseline, and low stability]
![Page 45: Crowdsourcing for HCI Research with Amazon Mechanical Turk](https://reader033.fdocuments.in/reader033/viewer/2022052820/54700ee4af79599f0a8b471e/html5/thumbnails/45.jpg)
Results
1. Significant effect of visualization – High > low, p < .001
2. Viz has both positive and negative effects – High > baseline, p < .001 – Low < baseline, p < .01
3. No interaction effect of visualization with either quality or controversy – Robust across conditions
[Figure: trustworthiness rating (1 to 7) by article quality (low, high) and controversy (uncontroversial, controversial), with series for high stability, baseline, and low stability]