Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered...
Transcript of Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered...
![Page 1: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/1.jpg)
Human-Powered Database Operations:
Part 1
Dongwon Lee
Penn State University, USA [email protected]!
!Slide available @ http://goo.gl/4pNUhB
SBBD 2014 Tutorial
![Page 2: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/2.jpg)
POSTECH Colloquium 2010
2
Where Am I From?
![Page 3: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/3.jpg)
Penn State University
l State College, PA l Out of nowhere,
but close to everywhere
l West: 2.5 hours to Pittsburgh
l East: 4 hours to New York
l South: 3 hours to Washington DC
l North: 3 hours to Niagara Fall
3
![Page 4: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/4.jpg)
Penn State i-School l College of Information Sciences and
Technology (IST) l http://ist.psu.edu/
l 40+ tenure-track faculty on diverse areas l CompSci & EE
l MIS & LIS l Design l Law
l Psychology l Medical Infomatics
4
![Page 5: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/5.jpg)
Other Tutorials on Crowdsourcing 5
![Page 6: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/6.jpg)
The Focus of This Tutorial
l Part 1 on basics of crowdsourcing
l Part 2 on DB
operations that exploit crowdsourcing
6
http://istc-bigdata.org/index.php/crowdsourcing-big-data/
![Page 7: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/7.jpg)
Part 1: Crowdsourcing Basics l Examples l Definitions
l Marketplaces l Computational Crowdsourcing
l Preliminaries
l Transcription l Sorting
l Demo
7
![Page 8: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/8.jpg)
Eg, Francis Galton, 1906 8
Weight-judging competition: 1,197 (mean of 787 crowds) vs. 1,198 pounds (actual measurement)
![Page 9: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/9.jpg)
Eg, StolenSidekick, 2006 l A woman lost a cellphone in a taxi l A 16-year-old girl ended up having the phone
l Refused to return the phone
l Evan Guttman, the woman’s friend, sets up a blog site about the incident l http://stolensidekick.blogspot.com/ l http://www.evanwashere.com/StolenSidekick/
l Attracted a growing amount of attention à the story appeared in Digg main page à NY Times and CNN coverage à Crowds pressure on police …
l NYPD arrested the girl and re-possessed the phone
9
http://www.nytimes.com/2006/06/21/nyregion/21sidekick.html?_r=0
![Page 10: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/10.jpg)
Eg, Finding “Jim Gray”, 2007 10
![Page 11: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/11.jpg)
Eg, Threadless.com
l Sells t-shirts, designed/voted by crowds l Artists whose designs are chosen get paid
11
![Page 12: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/12.jpg)
Eg, 12
l Crowdfunding, started in 2009 l Project creators choose a deadline and a
minimum funding goal l Creators only from US, UK, and Canada
l Donors pledge money to support projects, in exchange of non-monetary values l Eg, t-shirt, thank-u-note, dinner with creators
l Donors can be from anywhere
l Eg, Pebble, smartwatch l 68K people pledged 10M
![Page 13: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/13.jpg)
Eg, reCAPCHA 13
As of 2012
Captcha: 200M every day
ReCaptcha: 750M to date
![Page 14: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/14.jpg)
Eg, DARPA Challenge, 2009
l To locate 10 red balloons in arbitrary locations of US
l Winner gets $40K l MIT team won the race with
the strategy: l 2K per balloon to the first
person, A, to send the correct coordinates
l 1K to the person, B, who invited A
l 0.5K to the person, C, who invited B, …
14
![Page 15: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/15.jpg)
Eg, Berkeley Mobile Millennium 15
![Page 16: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/16.jpg)
Eg, Who Wants to be a Millionaire? 16
Asking the audience usually works è Audience members have diverse knowledge that can be coordinated to provide a correct answer in sum
![Page 17: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/17.jpg)
Eg, Who Wants to be a Millionaire? 17
![Page 18: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/18.jpg)
Eg, Game-With-A-Purpose: GWAP l Term coined by Luis
von Ahn @ CMU l Eg,
l ESP Game à Google Image Labeler: image recognition
l Foldit: protein folding
l Duolingo: language
translation
18
![Page 19: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/19.jpg)
19
http://www.resultsfromcrowds.com/features/crowdsourcing-landscape/
![Page 20: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/20.jpg)
Part 1: Crowdsourcing Basics l Examples l Definitions l Marketplaces l Computational Crowdsourcing
l Preliminaries
l Transcription l Sorting
l Demo
20
![Page 21: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/21.jpg)
James Surowiecki, 2004
“Collective intelligence can be brought to bear on a wide variety of problems, and complexity is no bar… conditions that are necessary for the crowd to be wise: diversity, independence, and … decentralization”
21
![Page 22: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/22.jpg)
Jeff Howe, WIRED, 2006
“Crowdsourcing represents the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call. … The crucial prerequisite is the use of the open call format and the large network of potential laborers…”
22
http://www.wired.com/wired/archive/14.06/crowds.html
![Page 23: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/23.jpg)
“Human Computation”, 2011
“Human computation is simply computation that is carried out by humans… Crowdsourcing can be considered a method or a tool that human computation systems can use…”
By Edith Law & Luis von Ahn
23
![Page 24: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/24.jpg)
Daren Brabhan, 2013
“Crowdsourcing as an online, distributed problem-solving and production model that leverages the collective intelligence of online communities to serve specific organizational goals… top-down and bottom-up …”
24
![Page 25: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/25.jpg)
What is Crowdsourcing? l Many defini*ons
l A few characteris*cs l Outsourced to human workers l Online and distributed l Open call & right incen2ve l Diversity and independence l Top-‐down & bo=om-‐up
25
![Page 26: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/26.jpg)
What is Computational Crowdsourcing?
l Focus on computational aspect of crowdsourcing l Algorithmic aspect l Non-linear optimization problem
l Mainly use micro-tasks
l When to use Computational Crowdsourcing? 1. Machine cannot do the task well
2. Large crowds can probably do it well 3. Task can be split to many micro-tasks
26
![Page 27: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/27.jpg)
Part 1: Crowdsourcing Basics l Examples l Definitions
l Marketplaces l Computational Crowdsourcing
l Preliminaries
l Transcription l Sorting
l Demo
27
![Page 28: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/28.jpg)
Three Parties
l Requesters l People submit some tasks l Pay rewards to workers
l Marketplaces l Provide crowds with tasks
l Crowds l Workers perform tasks
Submit tasks Collect answers
Find tasks Return answers
28
![Page 29: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/29.jpg)
Notable Marketplaces l Mechanical Turk
l CrowdFlower
l CloudCrowd
l Clickworker
l SamaSource
29
![Page 30: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/30.jpg)
AMT: mturk.com 30
Workers Requesters
![Page 31: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/31.jpg)
AMT: Workers vs. Requesters l Workers
l Register w. credit account (only US workers can register as of 2013)
l Bid to do tasks for earning money
l Requesters l First deposit money to account l Post tasks
o Task can specify a qualification for workers
l Gather results l Pay to workers if results are satisfactory
31
![Page 32: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/32.jpg)
AMT: HIT l Tasks
l Called HIT (Human Intelligence Task) l Micro-task
l Eg l Data cleaning l Tagging / labeling l Sentiment analysis
l Categorization l Surveying
l Photo moderation l Transcription
32
Translation task
![Page 33: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/33.jpg)
Micro- vs. Macro-task: Eg, oDesk 33
Workers
Requesters
![Page 34: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/34.jpg)
AMT: HIT List 34
Workers qualification
![Page 35: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/35.jpg)
AMT: HIT Example 35
![Page 36: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/36.jpg)
AMT: HIT Example 36
![Page 37: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/37.jpg)
Open-Source Marketplace S/W 37
![Page 38: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/38.jpg)
Part 1: Crowdsourcing Basics l Examples l Definitions
l Marketplaces l Computational Crowdsourcing
l Preliminaries l Transcription l Sorting
l Demo
38
![Page 39: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/39.jpg)
Three Computational Factors l Latency (or execution time)
l Worker pool size l Job attractiveness
l Monetary cost l Cost per question l # of questions (ie, HITs) l # of workers
l Quality of answers l Worker maliciousness
l Worker skills l Task difficulty
Latency
Cost
Quality
How much $$ does we spend?
How long do we wait for?
How much is the quality of
answers satisfied?
39
![Page 40: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/40.jpg)
#1: Latency l Some crowdsourcing tasks finish faster than
others l Eg, easier, or more rewarding tasks are popular
l Dependency among tasks
40
![Page 41: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/41.jpg)
#2: Cost l Cost per question l # of HITs
41
Remaining cost to pay: $0.03 X 2075 = $62.25
![Page 42: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/42.jpg)
#3: Quality of Answers l Avoid spam workers l Use workers with reputation
l Ask the same question to multiple workers to get consensus (eg, majority voting)
l Assign more number of (better-skilled) workers to more difficult questions
42
![Page 43: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/43.jpg)
Size of Comparison l Diverse forms of questions in a HIT l Different sizes of comparisons in a question
43
Which is better?
Which is the best?
. . .
Accuracy Cost Latency
Binary question
N-ary question
![Page 44: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/44.jpg)
Size of Batch l Repetitions of questions within a HIT l Eg, two n-ary questions (batch factor b=2)
44
Which is the best?
. . .
Which is the best?
. . .
Accuracy
Smaller b
Cost Latency
Larger b
![Page 45: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/45.jpg)
Response (r) l # of human responses seeked for a HIT
45
Which is better?
W1
Which is better?
W1
W2
W3
r = 1 r = 3
Accuracy Cost, Latency
Larger r Smaller r
![Page 46: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/46.jpg)
Round (= Step) l Algorithms are executed in rounds l # of rounds ≈ latency
46
Which is better?
Which is better?
Round #1
Which is better?
Round #2
Parallel Execution
Sequential Execution
![Page 47: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/47.jpg)
Part 1: Crowdsourcing Basics l Examples l Definitions
l Marketplaces l Computational Crowdsourcing
l Preliminaries
l Transcription l Sorting
l Demo
47
![Page 48: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/48.jpg)
Eg, Text Transcription [Miller-13]
l Problem: one person cannot do a good transcription
l Key idea: iterative improvement by many workers
Greg Li:le et al. “Exploring itera*ve and parallel human computa*on processes.” HCOMP 2010
48
![Page 49: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/49.jpg)
Eg, Text Transcription [Miller-13] 49
improvement $0.05
![Page 50: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/50.jpg)
Eg, Text Transcription [Miller-13] 50
3 votes @ $0.01
![Page 51: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/51.jpg)
Eg, Text Transcription [Miller-13] 51
After 9 iterations
![Page 52: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/52.jpg)
Eg, Text Transcription [Miller-13] 52
I had intended to hit the nail, but I’m not a very good aim it seems and I
ended up hitting my thumb. This is a common occurrence I know, but it
doesn’t make me feel any less ridiculous having done it myself. My
new strategy will involve lightly tapping the nail while holding it until it is
embedded into the wood enough that the wood itself is holding it straight and
then I’ll remove my hand and pound carefully away. We’ll see how this
goes.
After 8 iterations with thousands of crowds
![Page 53: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/53.jpg)
Part 1: Crowdsourcing Basics l Examples l Definitions
l Marketplaces l Computational Crowdsourcing
l Preliminaries
l Transcription l Sorting
l Demo
53
![Page 54: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/54.jpg)
Human-Powered Sort l Rank N items using crowdsourcing with
respect to the constraint C l Often C is subjective, fuzzy, ambiguous,
and/or difficult-for-machines-to-compute
l Eg, l Which image is the most “representative” one
of Brazil?
l Which animal is the most “dangerous”? l Which actress is the most “beautiful”?
54
![Page 55: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/55.jpg)
Human-Powered Sort 55
SELECT !*!FROM! !SoccerPlayers AS P!WHERE !P.WorldCupYear = ‘2014’ !ORDER BY !CrowdOp(‘most-valuable’)!
. . .
![Page 56: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/56.jpg)
Naïve Sort l Eg, “Which of two players is better?” l Naïve all pair-wise comparisons takes
comparisons l Optimal # of comparison is O(N log N)
.
56
N2
Who is better? Who is better?
. . . Who is better?
Who is better? Who is better?
. . . Who is better?
. . . . . . . . .
![Page 57: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/57.jpg)
Naïve Sort l Conflicting opinions may occur
o Cycle: A > B, B > C, and C > A
l If no cycle occurs l Naïve all pair-wise comparisons takes
comparisons
l If cycle exists l More comparisons are required
57
N2
A
C
B
![Page 58: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/58.jpg)
Sort [Marcus-VLDB11] l N=5, S=3
58
W1
W2
W3
W4
✖
A
C
B E
D
2
1
1
1
1
1
1
1
1
1
1
A
B
C
D
E
> >
> >
> >
> >
1
![Page 59: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/59.jpg)
Sort [Marcus-VLDB11] l N=5, S=3
59
A
C
B E
D
Topological Sort
DAG
A
B
C
D
E
A
B
C
E
D
Sorted Result
∨
∨
∨
∨
![Page 60: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/60.jpg)
Part 1: Crowdsourcing Basics l Examples l Definitions
l Marketplaces l Computational Crowdsourcing
l Preliminaries
l Transcription l Sorting
l Demo
60
![Page 61: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/61.jpg)
Demo: Human-Powered Sorting l From your smartphone or laptop, access the
following URL or QR code:
http://goo.gl/3tw7b5
61
![Page 62: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/62.jpg)
Part 1 Conclusion l Crowdsourcing ≈ Human Computation l Academia: novel paradigm to solve the
challenging problems in Computer Science l Industry: novel entrepreneurial opportunities
l Eg, Brazil-version Mechanical Turk?
62
This slide is available at
http://goo.gl/4pNUhB
![Page 63: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/63.jpg)
Reference l [Brabham-13] Crowdsourcing, Daren Brabham, 2013 l [Franklin-SIGMOD11] CrowdDB: answering queries with crowdsourcing,
Michael J. Franklin et al, SIGMOD 2011
l [Howe-08] Crowdsourcing, Jeff Howe, 2008 l [LawAhn-11] Human Computation, Edith Law and Luis von Ahn, 2011 l [Li-HotDB12] Crowdsourcing: Challenges and Opportunities, Guoliang Li,
HotDB 2012
l [Marcus-VLDB11] Human-powered Sorts and Joins, Adam Marcus et al., VLDB 2011
l [Miller-13] Crowd Computing and Human Computation Algorithms, Rob Miller, 2013
l [Shirky-08] Here Comes Everybody, Clay Shirky, 2008
63
![Page 64: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/64.jpg)
Human-Powered Database Operations:
Part 2
Dongwon Lee
Penn State University, USA [email protected]!
!Slide available @ http://goo.gl/UEUEBh
SBBD 2014 Tutorial
![Page 65: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/65.jpg)
Part 1: Crowdsourcing Basics l Examples l Definitions
l Marketplaces l Computational Crowdsourcing
l Preliminaries
l Transcription l Sorting
l Demo
2
![Page 66: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/66.jpg)
Part 2: Crowdsourced Algo. in DB l Preliminaries l Sort
l Select l Count l Top-1 l Top-k l Join
3
![Page 67: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/67.jpg)
New Challenges l Open-world
assumption (OWA)
l Non-deterministic algorithmic behavior
l Trade-off among cost,
latency, and accuracy
4
Latency
Cost
Accuracy
http://www.info.teradata.com
![Page 68: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/68.jpg)
Crowdsourcing DB Projects l CDAS @ NUS
l CrowdDB @ UC Berkeley & ETH Zurich
l MoDaS @ Tel Aviv U.
l Qurk @ MIT
l sCOOP @ Stanford & UCSC
5
![Page 69: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/69.jpg)
Part 2: Crowdsourced Algo. in DB l Preliminaries l Sort l Select l Count l Top-1 l Top-k l Join
6
![Page 70: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/70.jpg)
Sort Operation l Rank N items using crowdsourcing w.r.t some
criteria
l Assuming pair-wise comparison of 2 items l Eg, “Which of two images is better?”
l Cycle: A > B, B > C, and C > A
l If no cycle occurs l Naïve all pair-wise comparisons takes
comparisons
l If cycle exists l More comparisons are required
7
N2
A
C
B
![Page 71: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/71.jpg)
Sort [Marcus-VLDB11] l Proposed 3 crowdsourced sort algorithms l #1: Comparison-based Sort
l Workers rank S items ( ) per HIT l Each HIT yields pair-wise comparisons
l Build a directed graph using all pair-wise comparisons from all workers o If i > j, then add an edge from i to j
l Break a cycle in the graph: “head-to-head” o Eg, If i > j occurs 3 times and i < j occurs 2 times, keep
only i > j
l Perform a topological sort in the DAG
8
S ⊂ NS2
!
"#
$
%&
![Page 72: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/72.jpg)
Sort [Marcus-VLDB11] 9
5 4 3 1 2
2 1 3 5 4
Error
![Page 73: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/73.jpg)
Sort [Marcus-VLDB11] l N=5, S=3
10
W1
W2
W3
W4
✖
A
C
B E
D
2
1
1
1
1
1
1
1
1
1
1
A
B
C
D
E
> >
> >
> >
> >
1
![Page 74: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/74.jpg)
Sort [Marcus-VLDB11] l N=5, S=3
11
A
C
B E
D
Topological Sort
DAG
A
B
C
D
E
A
B
C
E
D
Sorted Result
∨
∨
∨
∨
![Page 75: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/75.jpg)
Sort [Marcus-VLDB11]
l #2: Rating-based Sort l W workers rate each item along a numerical scale
l Compute the mean of W ratings of each item l Sort all items using their means
l Requires W*N HITs: O(N)
12
. . .
1.3
3.6
8.2
Worker Rating
W1 4
W2 3
W3 4
Worker Rating
W1 1
W2 2
W3 1
. . .
Mean rating
![Page 76: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/76.jpg)
Sort [Marcus-VLDB11] 13
![Page 77: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/77.jpg)
Sort [Marcus-VLDB11] l #3: Hybrid Sort
l First, do rating-based sort à sorted list L l Second, do comparison-based sort on S ( )
l How to select the size of S o Random o Confidence-based
o Sliding window
14
S ⊂ L
![Page 78: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/78.jpg)
Sort [Marcus-VLDB11] 15
Worker agreement Rank correlation btw. Comparison vs. rating
![Page 79: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/79.jpg)
Sort [Marcus-VLDB11] 16
![Page 80: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/80.jpg)
Part II: Crowdsourced Algo. in DB l Preliminaries l Sort
l Select l Count l Top-1 l Top-k l Join
17
![Page 81: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/81.jpg)
Select Operation l Given N items, select k items that satisfy a
predicate P
l ≈ Filter, Find, Screen, Search
18
![Page 82: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/82.jpg)
Select Operation l Examples
l [Yan-MobiSys10] uses crowds to search an image relevant to a query
l [Parameswaran-SIGMOD12] develops human-powered filtering algorithms
l [Franklin-ICDE13] efficiently enumerates items satisfying conditions via crowdsourcing
l [Sarma-ICDE14] finds a bounded number of items satisfying predicates using the optimal solution by the skyline of cost and time
19
![Page 83: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/83.jpg)
Select [Yan-MobiSys10] l Improving mobile image search using
crowdsourcing
20
![Page 84: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/84.jpg)
Select [Yan-MobiSys10] l Ensuring
accuracy with majority voting
l Given accuracy, optimize cost and latency
l Deadline as latency in mobile phones
21
![Page 85: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/85.jpg)
Select [Yan-MobiSys10] l Goal: For a query image Q, find the first
relevant image I with min cost before the deadline
22
![Page 86: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/86.jpg)
Select [Yan-MobiSys10] l Parallel crowdsourced validation
23
![Page 87: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/87.jpg)
Select [Yan-MobiSys10] l Sequential crowdsourced validation
24
![Page 88: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/88.jpg)
Select [Yan-MobiSys10] l CrowdSearch: using early prediction on the
delay and outcome to start the validation of next candidate early
25
![Page 89: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/89.jpg)
Select [Yan-MobiSys10] 26
![Page 90: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/90.jpg)
Select [Parameswaran-SIGMOD12] l Novel grid-based visualization
27
No
Yes
A
B
D
C
Same person?
Yes No
![Page 91: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/91.jpg)
Select [Parameswaran-SIGMOD12] l Common strategies
l Always ask X questions, return most likely answer à Triangular strategy
l If X YES return “Pass”, Y NO return “Fail”, else keep asking à Rectangular strategy
l Ask until |#YES - #NO| > X, or at most Y questions à Chopped off triangle
28
![Page 92: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/92.jpg)
Select [Parameswaran-SIGMOD12] l What is the best strategy? Find strategy with
minimum overall expected cost s.t. 1. Overall expected error is less than threshold
2. # of questions per item never exceeds m
29
6 5 4 3 2 1
6
5
4
3
2
1
NOs
YESs
![Page 93: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/93.jpg)
Part 2: Crowdsourced Algo. in DB l Preliminaries l Sort
l Select l Count l Top-1 l Top-k l Join
30
![Page 94: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/94.jpg)
Count Operation l Given N items, estimate a fraction of items M
that satisfy a predicate P
l Selectivity estimation in DB à crowd-powered query optimizers
l Evaluating queries with GROUP BY + COUNT/AVG/SUM operators
l Eg, “Find photos of females with red hairs” l Selectivity(“female”) ≈ 50%
l Selectivity(“red hair”) ≈ 2% l Better to process predicate(“red hair”) first
31
![Page 95: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/95.jpg)
Count Operation 32
l Q: “How many teens are participating in the Hong Kong demonstration?”
![Page 96: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/96.jpg)
Count Operation 33
http://www.faceplusplus.com/demo-detect/
10 - 56 20 - 30 15 - 29
l Using Face++, guess the age of a person
![Page 97: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/97.jpg)
Count [Marcus-VLDB13] l Hypothesis: Humans can estimate the
frequency of objects’ properties in a batch without having to explicitly label each item
l Two approaches l #1: Label Count
o Sampling based o Have workers label samples explicitly
l #2: Batch Count o Have workers estimate the frequency in a batch
34
![Page 98: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/98.jpg)
Count [Marcus-VLDB13] 35
l Label Count (via sampling)
![Page 99: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/99.jpg)
Count [Marcus-VLDB13] 36
l Batch Count
![Page 100: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/100.jpg)
Count [Marcus-VLDB13] l Findings on accuracy
l Images: Batch count > Label count l Texts: Batch count < Label count
l Further Contributions l Detecting spammers l Avoiding coordinated attacks
37
![Page 101: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/101.jpg)
Part 2: Crowdsourced Algo. in DB l Preliminaries l Sort
l Select l Count l Top-1 l Top-k l Join
38
![Page 102: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/102.jpg)
Top-1 Operation l Find the top-1, either MAX or MIN, among N
items w.r.t. some criteria
l Objective l Avoid sorting all N items to find top-1
39
![Page 103: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/103.jpg)
Top-1 Operation l Examples
l [Venetis-WWW12] introduces the bubble max and tournament-based max in a parameterized framework
l [Guo-SIGMOD12] studies how to find max using pair-wise questions in the tournament-like setting and how to improve accuracy by asking more questions
40
![Page 104: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/104.jpg)
Max [Venetis-WWW12]
l Introduced two Max algorithms l Bubble Max l Tournament Max
l Parameterized framework l si: size of sets compared at the i-th round l ri: # of human responses at the i-th round
41
Which is better?
si = 2 ri = 3
si = 3 ri = 2
Which is the best?
![Page 105: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/105.jpg)
Max [Venetis-WWW12]
l Bubble Max Case #1
42
s1 = 2 r1 = 3
s2 = 3 r2 = 3
s3 = 2 r3 = 5
• N = 5 • Rounds = 3 • # of questions =
r1 + r2 + r3 = 11
![Page 106: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/106.jpg)
Max [Venetis-WWW12]
l Bubble Max Case #2
43
s1 = 4 r1 = 3
• N = 5 • Rounds = 2 • # of questions =
r1 + r2 = 8
s2 = 2 r2 = 5
![Page 107: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/107.jpg)
Max [Venetis-WWW12]
l Tournament Max
44
• N = 5 • Rounds = 3 • # of questions
= r1 + r2 + r3 + r4 = 10
s1 = 2 r1 = 1
s3 = 2 r3 = 3
s4 = 2 r4 = 5
s2 = 2 r2 = 1
![Page 108: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/108.jpg)
Max [Venetis-WWW12] l How to find optimal parameters?: si and ri l Tuning Strategies (using Hill Climbing)
l Constant si and ri l Constant si and varying ri
l Varying si and ri
45
![Page 109: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/109.jpg)
Max [Venetis-WWW12] l Bubble Max
l Worst case: with si=2, O(N) comparisons needed
l Tournament Max l Worst case: with si=2, O(N) comparisons needed
l Bubble Max is a special case of Tournament Max
46
![Page 110: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/110.jpg)
Max [Venetis-WWW12] 47
![Page 111: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/111.jpg)
Max [Venetis-WWW12] 48
![Page 112: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/112.jpg)
Part 2: Crowdsourced Algo. in DB l Preliminaries l Sort
l Select l Count l Top-1 l Top-k l Join
49
![Page 113: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/113.jpg)
Top-k Operation l Find top-k items among N items w.r.t. some
criteria
l Top-k list vs. top-k set
l Objective l Avoid sorting all N items to find top-k
50
![Page 114: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/114.jpg)
Top-k Operation l Examples
l [Davidson-‐ICDT13] inves&gates the variable user error model in solving top-‐k list problem
l [Polychronopoulous-‐WebDB13] proposes tournament-‐based top-‐k set solu&on
51
![Page 115: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/115.jpg)
Top-k Operation l Naïve solution is to “sort” N items and pick
top-k items l Eg, N=5, k=2, “Find two best Bali images?”
l Ask = 10 pair-wise questions to get a total order
l Pick top-2 images
52
52
![Page 116: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/116.jpg)
l Phase 1: Building a tournament tree l For each comparison, only winners are promoted
to the next round
Top-k: Tournament Solution (k = 2) 53
Round 1
Round 2
Round 3
Total, 4 questions with 3 rounds
![Page 117: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/117.jpg)
l Phase 2: Updating a tournament tree l Iteratively asking pair-wise questions from the
bottom level
Top-k: Tournament Solution (k = 2) 54
Round 1
Round 2
Round 3
![Page 118: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/118.jpg)
l Phase 2: Updating a tournament tree l Iteratively asking pair-wise questions from the
bottom level
Top-k: Tournament Solution (k = 2) 55
Round 4
Round 5
Total, 6 questions With 5 rounds
![Page 119: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/119.jpg)
l This is a top-k list algorithm l Analysis
l If there is no constraint for the number of
rounds, this tournament sort based top-k scheme yields the optimal result
k = 1 k ≥ 2
# of questions O(n)
# of rounds
Top-k: Tournament Solution 56
![Page 120: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/120.jpg)
Top-k [Polychronopoulous-‐WebDB13] l Top-k set algorithm
l Top-k items are “better” than remaining items l Capture NO ranking among top-k items
l Tournament-based approach
l Can become a Top-k list algorithm l Eg, Top-k set algorithm, followed by [Marcus-
VLDB11] to sort k items
57
K items
![Page 121: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/121.jpg)
Top-k [Polychronopoulous-‐WebDB13] l Algorithm
l Input: N items, integer k and s (ie, s > k) l Output: top-k set l Procedure:
o O ß N items o While |O| > k
§ Partition O into disjoint subsets of size s
§ Identify top-k items in each subset of size s: s-rank(s)
§ Merge all top-k items into O
o Return O
l More effective when s and k are small l Eg, s-rank(20) with k=10 may give poor accuracy
58
![Page 122: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/122.jpg)
Top-k [Polychronopoulous-‐WebDB13] l Eg, N=10, s=4, k=2
59
s-rank() s-rank()
s-rank()
s-rank()
Top-2 items
s-rank()
s-rank()
![Page 123: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/123.jpg)
Top-k [Polychronopoulous-‐WebDB13] l s-rank(s)
// workers rank s items and aggregate l Input: s items, integer k (ie, s > k), w workers l Output: top-k items among s items
l Procedure: o For each of w workers
§ Rank s items ≈ comparison-based sort [Marcus-VLDB11]
o Merge w rankings of s items into a single ranking § Use median-rank aggregation [Dwork-WWW01]
o Return top-k item from the merged ranking of s items
60
![Page 124: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/124.jpg)
Top-k [Polychronopoulous-‐WebDB13] l Eg, s-rank(): s=4, k=2, w=3
61
W1
4 1 2 3
W2
4 2 1 3
W3
3 2 3 4
Median Ranks 4 2 2 3
Top-2
![Page 125: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/125.jpg)
Top-k [Polychronopoulous-‐WebDB13] l Comparison to Sort [Marcus-VLDB11]
62
![Page 126: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/126.jpg)
Top-k [Polychronopoulous-‐WebDB13] l Comparison to Max [Venetis-WWW12]
63
![Page 127: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/127.jpg)
Part 2: Crowdsourced Algo. in DB l Preliminaries l Sort
l Select l Count l Top-1 l Top-k l Join
64
![Page 128: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/128.jpg)
Join Operation l Identify matching records or entities within or
across tables l ≈ similarity join, entity resolution (ER), record
linkage, de-duplication, … l Beyond the exact matching
l [Chaudhuri-ICDE06] similarity join l R JOINp S, where p=sim(R.A, S.A) > t l sim() can be implemented as UDFs in SQL
l Often, the evaluation is expensive o DB applies UDF-based join predicate after Cartesian
product of R and S
65
![Page 129: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/129.jpg)
Join Operation l Examples
l [Marcus-VLDB11] proposes 3 types of joins l [Wang-VLDB12] generates near-optimal
cluster-based HIT design to reduce join cost
l [Wang-SIGMOD13] reduces join cost further by exploiting transitivity among items
l [Whang-VLDB13] selects right questions to ask to crowds to improve join accuracy
l [Gokhale-SIGMOD14] proposes the hands-off crowdsourcing for join workflow
66
![Page 130: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/130.jpg)
Join [Marcus-VLDB11] l To join tables R and S l #1: Simple Join
l Pair-wise comparison HIT l |R||S| HITs needed
l #2: Naïve Batching Join l Repetition of #1 with a batch factor b l |R||S|/b HITs needed
l #3: Smart Batching Join l Show r and s images from R and S l Workers pair them up
l |R||S|/rs HITs needed
67
![Page 131: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/131.jpg)
Join [Marcus-VLDB11] 68
#1 Simple Join
![Page 132: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/132.jpg)
Join [Marcus-VLDB11] 69
#2 Naïve Batching
Join
Batch factor b = 2
![Page 133: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/133.jpg)
Join [Marcus-VLDB11] 70
#3 Smart Batching
Join
r images from R
s images from S
![Page 134: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/134.jpg)
Join [Marcus-VLDB11] 71
MV: Majority Voting QA: Quality Adjustment
![Page 135: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/135.jpg)
Join [Marcus-VLDB11] 72
Last 50% of wait time is spent completing
the last 5% of tasks
![Page 136: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/136.jpg)
Join [Wang-VLDB12] l [Marcus-VLDB11] proposed two batch joins
l More efficient smart batch join still generates |R||S|/rs # of HITs
l Eg, (10,000 X 10,000) / (20 x 20) = 250,000 HITs à Still too many !
l [Wang-VLDB12] contributes CrowdER: 1. A hybrid human-machine join
o #1 machine-join prunes obvious non-matches
o #2 human-join examines likely matching cases § Eg, candidate pairs with high similarity scores
2. Algorithm to generate min # of HITs for step #2
73
![Page 137: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/137.jpg)
Join [Wang-VLDB12] l Hybrid idea: generate candidate pairs
using existing similarity measures (eg, Jaccard)
74
Main Issue: HIT Generation Problem
![Page 138: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/138.jpg)
Join [Wang-VLDB12] 75
Pair-based HIT Generation ≈ Naïve Batching in [Marcus-VLDB11]
Cluster-based HIT Generation ≈ Smart Batching in [Marcus-VLDB11]
![Page 139: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/139.jpg)
Join [Wang-VLDB12] l HIT Generation Problem
l Input: pairs of records P, # of records in HIT k l Output: minimum # of HITs s.t.
1. All HITs have at most k records
2. Each pair (pi, pj) P must be in at least one HIT
1. Pair-based HIT Generation l Trivial: P/k # of HITs s.t. each HIT contains k pairs
in P
2. Cluster-based HIT Generation l NP-hard problem à approximation solution
76
∈
![Page 140: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/140.jpg)
Join [Wang-VLDB12] 77
Cluster-based HIT #1
r1, r2, r3, r7
Cluster-based HIT #2
r3, r4, r5, r6
Cluster-based HIT #3
r4, r7, r8, r9
k = 4
This is the minimal # of cluster-based HITs satisfying previous two conditions
![Page 141: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/141.jpg)
Join [Wang-VLDB12] l Two-tiered Greedy Algorithm
l Build a graph G from pairs of records in P l CC ß connected components in G
o LCC: large CC with more than k nodes
o SCC: small CC with no more than k nodes
l Step 1: Partition LCC into SCCs l Step 2: Pack SCCs into HITs with k nodes
o Integer programming based
78
![Page 142: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/142.jpg)
Join [Wang-VLDB12] l Eg, Generate cluster-based HITs (k = 4)
1. Partition the LCC into 3 SCCs o {r1, r2, r3, r7}, {r3, r4, r5, r6}, {r4, r7}
2. Pack SCCs into HITs o A single HIT per {r1, r2, r3, r7} and {r3, r4, r5, r6}
o Pack {r4, r7} and {r8, r9} into a HIT
79
![Page 143: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/143.jpg)
Join [Wang-VLDB12] l Step 1: Partition
l Input: LCC, k Output: SCCs l rmax ß node in LCC with the max degree l scc ß {rmax}
l conn ß nodes in LCC directly connected to rmax
l while |scc| < k and |conn| > 0 o rnew ß node in conn with max indegree (# of edges to
scc) and min outdegree (# of edges to non-scc) if tie o move rnew from conn to scc
o update conn using new scc
l add scc into SCC
80
![Page 144: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/144.jpg)
Join [Wang-VLDB12] 81
![Page 145: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/145.jpg)
Join [Wang-VLDB12] 82
![Page 146: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/146.jpg)
Join [Wang-VLDB12] 83
![Page 147: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/147.jpg)
Join [Wang-SIGMOD13] l Use the same hybrid machine-human
framework as [Wang-VLDB12] l Aim to reduce # of HITs further
l Exploit transitivity among records
84
http://blogs.oc.edu/ece/transitivity/
![Page 148: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/148.jpg)
Join [Wang-SIGMOD13] l Positive transitive relation
l If a=b, and b=c, then a=c
l Negative transitive relation l If a = b, b ≠ c, then a ≠ c
85
iPad 2nd Gen = iPad Two
iPad Two = iPad 2 iPad 2nd Gen = iPad 2
iPad 2nd Gen = iPad Two
iPad Two ≠ iPad 3 iPad 2nd Gen ≠ iPad 3
![Page 149: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/149.jpg)
Join [Wang-SIGMOD13] l Three transitive relations
l If there exists a path from o to o’ which only consists of matching pairs, then (o, o’) can be deduced as a matching pair
l If there exists a path from o to o’ which only contains a single non-matching pair, then (o, o’) can be deduced as a non-matching pair
l If any path from o to o’ contains more than one non-matching pairs, (o, o’) cannot be deduced.
86
![Page 150: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/150.jpg)
Join [Wang-SIGMOD13] 87
(o3, o5) à match
(o5, o7) à non-match
(o1, o7) à ?
![Page 151: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/151.jpg)
Join [Wang-SIGMOD13] l Given a pair (oi, oj), to check the transitivity
l Enumerate path from oi to oj à exponential ! l Count # of non-matching pairs in each path
l Solution: Build a cluster graph l Merge matching pairs to a cluster l Add inter-cluster edge for non-matching pairs
88
(o5, o6) à ?
(o1, o5) à ?
![Page 152: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/152.jpg)
Join [Wang-SIGMOD13] l Problem Definition:
l Given a set of pairs that need to be labeled, minimize the # of pairs requested to crowd workers based on transiCve relaCons
89
?
![Page 153: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/153.jpg)
Join [Wang-SIGMOD13] 90
(o1, o2), (o1, o6), (o2, o6)
vs.
(o1, o6), (o2, o6), (o1, o2)
N
l Labeling order matters !
è Given a set of pairs to label, how to order them affects the # of pairs to deduce using the transitivity
![Page 154: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/154.jpg)
Join [Wang-SIGMOD13] l Theorem: Optimal labeling order
w = <p1, …, pi-1, pi, pi+1, …, pn> w’ = <p1, …, pi-1, pi+1, pi, …, pn>
l If pi is a matching pair and pi+1 is a non-matching pair, then C(w) ≤ C(w’) o C(w): # of crowdsourced pairs required for w
l That is, always better to first label a matching pair and then a non-matching pair
l In reality, optimal label order cannot be achieved
91
![Page 155: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/155.jpg)
Join [Wang-SIGMOD13] l Expected optimal labeling order
l w = <p1, p2, …, pn> l C(w) = # of crowdsourced pairs required for w
l P(pi = crowdsourced) o Enumerate all possible labels of <p1, p2, …, pi-‐1>, and for each possibility, derive whether pi is crowdsourced or not
o Sum of the probability of each possibility that whether pi is crowdsourced
92
![Page 156: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/156.jpg)
Join [Wang-SIGMOD13] l Expected optimal labeling order
l w1 = <p1, p2, p3> l E[C(w1)] = 1 + 1 + 0.05 = 2.05
o P1: P(P1 = crowdsourced) = 1 o P2: P(P2 = crowdsourced) = 1 o P3: P(P3 = crowdsourced) = P(both P1 and P2 are non-‐matching) = (1-‐0.9)(1-‐0.5) = 0.05
93
o1
o2
o3
p1 p2
p3
Probability of matching
P1 0.9
P2 0.5
P3 0.1
Expected value
w1 = <p1, p2, p3> 2.05
w2 = <p1, p3, p2> 2.09
w3 = <p2, p3, p1> 2.45
w4 = <p2, p1, p3> 2.05
… …
![Page 157: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/157.jpg)
Join [Wang-SIGMOD13] l Theorem: Expected optimal labeling order
l Label the pairs in the decreasing order of the probability that they are a matching pair
l Eg, p1, p2, p3, p4, p5, p6, p7, p8
94
High
![Page 158: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/158.jpg)
Join [Wang-SIGMOD13] l Two data sets
l Paper: 997 (author, &tle, venue, date, and pages) l Product: 1081 product (abt.com), 1092 product (buy.com)
95
![Page 159: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/159.jpg)
Join [Wang-SIGMOD13] 96
l Transitivity
![Page 160: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/160.jpg)
Machine vs. Human l Human-Powered Crowdsourcing à “Human-
in-the-loop” Crowdsourcing l Should use machine to process majority of big
data l Should use human to process a small fraction of
challenging cases in big data
l How to split tasks and combine results for machines and human automatically is an open issue
97
http://www.theoddblog.us/2014/ 02/21/damienwaltershumanloop/
![Page 161: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/161.jpg)
Conclusion l New opportunities
l Open-world assumption l Non-deterministic algorithmic behavior l Trade-off among cost, latency, and accuracy
l Crowdsourcing for Big Data?
98
This slide is available at
http://goo.gl/UEUEBh
![Page 162: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/162.jpg)
Reference l [Brabham-13] Crowdsourcing, Daren Brabham, 2013 l [Cao-VLDB12] Whom to Ask? Jury Selection for Decision Making Tasks on
Microblog Services, Caleb Chen Cao et al., VLDB 2012
l [Chaudhuri-ICDE06] A Primitive Operator for Similarity Join in Data Cleaning, Surajit Chaudhuri et al., ICDE 2006
l [Davidson-ICDT13] Using the crowd for top-k and group-by queries, Susan Davidson et al., ICDT 2013
l [Dwork-WWW01] Rank Aggregation Methods for the Web, Cynthia Dwork et al., WWW 2001
l [Franklin-SIGMOD11] CrowdDB: answering queries with crowdsourcing, Michael J. Franklin et al, SIGMOD 2011
l [Franklin-ICDE13] Crowdsourced enumeration queries, Michael J. Franklin et al, ICDE 2013
l [Gokhale-SIGMOD14] Corleone: Hands-Off Crowdsourcing for Entity Matching, Chaitanya Gokhale et al., SIGMOD 2014
l [Guo-SIGMOD12] So who won?: dynamic max discovery with the crowd, Stephen Guo et al., SIGMOD 2012
l [Howe-08] Crowdsourcing, Jeff Howe, 2008
99
![Page 163: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/163.jpg)
Reference l [LawAhn-11] Human Computation, Edith Law and Luis von Ahn, 2011 l [Li-HotDB12] Crowdsourcing: Challenges and Opportunities, Guoliang Li,
HotDB 2012
l [Liu-VLDB12] CDAS: A Crowdsourcing Data Analytics System, Xuan Liu et al., VLDB 2012
l [Marcus-VLDB11] Human-powered Sorts and Joins, Adam Marcus et al., VLDB 2011
l [Marcus-VLDB12] Counting with the Crowd, Adam Marcus et al., VLDB 2012
l [Miller-13] Crowd Computing and Human Computation Algorithms, Rob Miller, 2013
l [Parameswaran-SIGMOD12] CrowdScreen: Algorithms for Filtering Data with Humans, Aditya Parameswaran et al., SIGMOD 2012
l [Polychronopoulous-WebDB13] Human-Powered Top-k Lists, Vassilis Polychronopoulous et al., WebDB 2013
l [Sarma-ICDE14] Crowd-Powered Find Algorithms, Anish Das Sarma et al., ICDE 2014
l [Shirky-08] Here Comes Everybody, Clay Shirky, 2008
100
![Page 164: Human-Powered Database Operations: Part 1pike.psu.edu/publications/tutorial-sbbd14.pdfHuman-Powered Database Operations: Part 1 Dongwon Lee Penn State University, USA ... Part 1 on](https://reader035.fdocuments.in/reader035/viewer/2022070704/5e8a66849f78eb188775a83a/html5/thumbnails/164.jpg)
Reference l [Surowiecki-04] The Wisdom of Crowds, James Surowiecki, 2004 l [Venetis-WWW12] Max Algorithms in Crowdsourcing Environments, Petros
Venetis et al., WWW 2012
l [Wang-VLDB12] CrowdER: Crowdsourcing Entity Resolution, Jiannan Wang et al., VLDB 2012
l [Wang-SIGMOD13] Leveraging Transitive Relations for Crowdsourced Joins, Jiannan Wang et al., SIGMOD 2013
l [Whang-VLDB13] Question Selection for Crowd Entity Resolution, Steven Whang et al., VLDB 2013
l [Yan-MobiSys10] CrowdSearch: exploiting crowds for accurate real-time image search on mobile phones, T. Yan et al., MobiSys 2010
101