Crowdsourcing Software Engineering Studies:
Opportunities and Perils
Sebastian Elbaum (based on work performed with Kathryn Stolee)
Introduction · Mechanical Turk Study · Summary
Motivation · Background · Objective

Crowdsourcing Services (examples)
• Companies with hard problems connect with people interested in solving them: 1,000+ problems, 200,000+ solvers
• Photographers connect with people who need stock photography: 3,000,000+ members
• Companies with scientific problems connect with retired scientists: 1,000+ companies, 5,000+ scientists
• People with many small tasks connect with a scalable workforce: 100,000+ tasks, 100,000+ workers

Kathryn T. Stolee & Sebastian Elbaum — Crowdsourcing Empirical Studies in Software Engineering — 4 / 18
Who are the workers? ("Conducting behavioral research on Amazon's Mechanical Turk", Winter Mason & Siddharth Suri, Behavior Research Methods, 2011)
• Median 30 years old, $30K salary
• 69% of U.S. workers: "Mechanical Turk is a fruitful way to spend free time and get some cash"
• Majority from the US and India
• Work for at least $1.40/hour, average $4.50/hour
• Completion time is correlated with pay, but not linearly
Potential
ICSE researchers crowdsource our studies to software engineers:
• Access to a population of software engineers
• Low cost
• Speedy / adaptive experimentation
Initial Try
• I may have solved the SE empirical challenge! Look how many answers I am getting for a few dollars!
• Oh… some of those answers are not that useful. Are these real software engineers?
• They are completing my exercise in seconds. How? Damn… they are gaming the system.
• Ouch, I need to check thousands of answers.
• OK, now let's give them a "real" SE task.
Kathryn T. Stolee, Sebastian G. Elbaum: Exploring the use of crowdsourcing to support empirical studies in software engineering. ESEM 2010
Goal: evaluate the impact of smells/refactoring on end-user programmers' preferences and understanding.
Workflow in Mechanical Turk
Workers: Search for Tasks → Select Task → Complete Task → Submit Task
Experimental Task in Mechanical Turk
• Experiment Definition
• Design
• Selection
• Instrumentation
• Operation
• Analysis
• 22 participants, 188 tasks completed, 2 weeks, $42
• The hypothesis was supported
"… Academics are now taking advantage of Turk, and, from my own experience with the difficulties of recruiting students to experiments, I suspect Turk's use will only increase." — Scientific American, 2011
Venue                     | Task                                                                                      | Soft. Engs. / Tasks      | Cost
ESEM 2010 / TSE 2013      | Compare two mashups, determine outcome (10 min)                                           | ~25 (30% SE), 188 tasks  | $42
FSE NIER 2012 / ESEM 2013 | Write small program specification as input/output (compare with class exercise, 15 min)  | ~25, ~100 tasks          | $25
TOSEM 2014                | Rank code search results from various tools, provide qualified feedback (10 min)         | ~50, ~300 tasks          | $300
…                         | Survey on competing scenarios for emerging technology (15 min)                           | 1000+, 1000+ tasks       | $600
Cost per task under a dollar!
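The per-task figures can be checked with a little arithmetic using the numbers from the table above (counts marked with "~" or "+" are approximations, so treat the results as ballpark values):

```python
# Rough cost-per-task arithmetic for the studies in the table above.
# (total_cost_usd, num_tasks) pairs come from the slide; approximate
# counts mean the results are ballpark figures, not exact rates.
studies = {
    "ESEM 2010 / TSE 2013": (42, 188),
    "FSE NIER 2012 / ESEM 2013": (25, 100),
    "TOSEM 2014": (300, 300),
    "Survey": (600, 1000),
}

def cost_per_task(total_cost_usd, num_tasks):
    """Average dollars paid per completed task."""
    return total_cost_usd / num_tasks

for venue, (cost, tasks) in studies.items():
    print(f"{venue}: ${cost_per_task(cost, tasks):.2f} per task")
```

This gives roughly $0.22, $0.25, $1.00, and $0.60 per task across the four rows.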
Can we get …
• X software engineers to participate? — Yes ($, T)
• the K kind of software engineers? — Yes (QA, $)
• X software engineers to do T? — Some Ts
• X software engineers to do a T seriously? — Some Ts (QA, $)
• …
$
• Setting a baseline pay
  • Market forces
  • Enough to motivate software engineers
  • Enough not to motivate others
  • Too high: perceived as "baiting" requesters
• Ethical concerns
  • Multiple deployments
  • Still an IRB "cost" (with good reason)
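One way to ground a baseline pay is to convert a per-task reward into an effective hourly wage and compare it with the Mason & Suri figures quoted earlier ($1.40/hour floor, $4.50/hour average). A minimal sketch — the function and threshold names are mine, not part of any MTurk API:

```python
# Sketch: sanity-check a per-task reward against effective hourly-wage
# targets. Thresholds come from the Mason & Suri figures quoted earlier
# in this deck; names are illustrative, not an MTurk feature.
RESERVATION_WAGE = 1.40   # $/hour: minimum workers reportedly accept
AVERAGE_WAGE = 4.50       # $/hour: reported MTurk average

def hourly_rate(reward_usd, minutes_per_task):
    """Effective hourly wage implied by a per-task reward."""
    return reward_usd * 60.0 / minutes_per_task

def assess_reward(reward_usd, minutes_per_task):
    rate = hourly_rate(reward_usd, minutes_per_task)
    if rate < RESERVATION_WAGE:
        return "too low: below the reservation wage"
    if rate < AVERAGE_WAGE:
        return "plausible: above the floor, below the MTurk average"
    return "generous: above the MTurk average (may look like 'baiting')"

# A $0.25 reward for a 10-minute task implies $1.50/hour
print(assess_reward(0.25, 10))
```

For example, a $0.25 reward on a 10-minute task lands just above the $1.40/hour floor, matching the sub-dollar rewards in the table above.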
Task
• Decompose the SE problem into tasks
  • Small enough to attract participants / keep cost low
  • Large enough to be valuable to SE research
  • Provide motivation other than money
• Design of experiments
  • Tasks are small: need many to test a hypothesis, and even more to study an SE problem
  • Bundled tasks help attract subjects
  • Control for learning, gaming, …
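Controlling for learning effects when bundling tasks can be done with counterbalanced orderings. One hedged sketch of a cyclic Latin-square assignment (treatment names are placeholders, not from the original studies):

```python
# Sketch: counterbalance treatment order across bundled tasks with a
# cyclic Latin square, a standard way to control for learning effects.
# Treatment labels are placeholders, not from the original studies.
def latin_square(treatments):
    """Each row is one subject group's task order; every treatment
    appears exactly once in each row and each column."""
    n = len(treatments)
    return [[treatments[(row + col) % n] for col in range(n)]
            for row in range(n)]

orders = latin_square(["A", "B", "C", "D"])
for i, order in enumerate(orders):
    print(f"group {i}: {' -> '.join(order)}")
```

Assigning each worker bundle to one row means no treatment systematically benefits from appearing late, after the worker has warmed up.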
Quality
• Qualification checks and pre-tests
• Embed obvious/repeated, mostly verifiable questions to check for robots, gamers, and level of attention
• Performance threshold to pay, or pay a little to all but use the good workers to seed the next study
• Control revision costs with tasks on tasks
• Multiple deployments
• Compare performance of subjects in/out of MTurk
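The embedded-check idea above can be sketched as a "gold question" filter: seed each task with questions whose answers are known, and pay only workers above an accuracy threshold. The field names and the 0.75 threshold here are illustrative choices, not MTurk features:

```python
# Sketch: screen submissions with embedded "gold" questions whose
# answers are known in advance, paying only workers who clear an
# accuracy threshold. All names and values here are illustrative.
GOLD_ANSWERS = {"q_gold_1": "yes", "q_gold_2": "blue", "q_gold_3": "7"}
PAY_THRESHOLD = 0.75  # fraction of gold questions answered correctly

def gold_accuracy(submission):
    """Fraction of embedded check questions this worker got right."""
    correct = sum(1 for q, answer in GOLD_ANSWERS.items()
                  if submission.get(q, "").strip().lower() == answer)
    return correct / len(GOLD_ANSWERS)

def should_pay(submission):
    return gold_accuracy(submission) >= PAY_THRESHOLD

careful = {"q_gold_1": "Yes", "q_gold_2": "blue", "q_gold_3": "7"}
gamer = {"q_gold_1": "yes", "q_gold_2": "asdf", "q_gold_3": "1"}
print(should_pay(careful), should_pay(gamer))
```

This is the cheap first line of defense against the robots and gamers described in the "Initial Try" slide; the softer variant is to pay everyone a little and invite only the high-accuracy workers back.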
Limitations
• Accidental
  • Assignment of subjects to tasks (self-selection bias)
  • Single users (small tasks, no collaboration)
  • Rewards are mainly monetary
• Essential
  • Separation from subjects (cannot observe/interact much)
  • Very limited context information
  • Mismatch with many SE problems
Alternative Infrastructures
Tempered enthusiasm
• Not all SE problems can be broken into small tasks
• Many SE problems require a team and communication
• Many SE problems require time to develop
  • "Proof by MTurking"
• Balancing task design, $, and thresholds is tricky
• Lack of contact and context with subjects
Charge
• Great as an initial empirical vehicle (better than ugrads :)
• Could be better:
  • Pool of pre-qualified workers
  • Capabilities to design more complex studies
  • A connection stream to workers for follow-up (+ context)
  • Ability to control the development environment
  • …
MTurk for Software Engineering Studies?