Post on 21-May-2015
ALFRED: Crowd Assisted Data Extraction
Valter Crescenzi, Paolo Merialdo, Disheng Qiu
Dipartimento di Ingegneria, Università degli Studi Roma Tre
Via della Vasca Navale, 79, Rome
disheng@dia.uniroma3.it
Extracting data
We have 2M pages from IMDB and want to extract titles, directors, etc.
[Diagram labels: Inference algorithm, Wrapper, DB]
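To make the notion of a wrapper concrete, the sketch below treats a wrapper as a set of named extraction rules that are applied to each page of a site. The page markup, the rule paths, and the names `PAGE`, `WRAPPER`, and `apply_wrapper` are illustrative assumptions, not taken from the slides. The rules use the limited XPath subset of Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one movie page.
PAGE = """
<html><body>
  <h1 class="title">Inception</h1>
  <div class="credits"><span class="director">Christopher Nolan</span></div>
</body></html>
"""

# In this sketch a wrapper is a mapping: attribute name -> extraction rule.
WRAPPER = {
    "title": ".//h1[@class='title']",
    "director": ".//span[@class='director']",
}


def apply_wrapper(wrapper, page_source):
    """Apply every rule to one page; a rule that matches nothing yields None."""
    root = ET.fromstring(page_source.strip())
    record = {}
    for attribute, path in wrapper.items():
        node = root.find(path)
        record[attribute] = node.text if node is not None else None
    return record


record = apply_wrapper(WRAPPER, PAGE)
print(record)  # {'title': 'Inception', 'director': 'Christopher Nolan'}
```

Running the same `WRAPPER` over every page of the site would produce one record per page, which is what would populate the DB. The job of the inference algorithm is to find rules like these automatically.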
Scaling Wrapper Inference
Scaling up the number of workers through crowdsourcing platforms raises new challenges:
Issues and contributions:
Non-expert workers
• Simple interactions to reduce the worker error rate
• Membership queries (yes/no answers)
Costs
• Active learning to carefully select queries
Quality
• Bayesian model to evaluate the expected wrapper quality
• Sampling algorithms
• Tolerance to inaccurate workers
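The interplay of membership queries and active learning can be sketched as follows. Each candidate rule extracts one value per page, and the system asks a worker a yes/no question about a (page, value) pair chosen to split the remaining candidates as evenly as possible. This splitting heuristic is a common active-learning choice and is assumed here; it is not necessarily the selection strategy used by ALFRED. The data and the names `select_query` and `prune` are hypothetical, and the worker is assumed to answer correctly.

```python
# Hypothetical candidate rules and the value each extracts on two pages.
EXTRACTED = {
    "r1": {"p0": "Alien", "p1": "Heat"},
    "r2": {"p0": "Alien", "p1": "1995"},
    "r3": {"p0": "Alien", "p1": None},
    "r4": {"p0": "1979", "p1": "Heat"},
}


def select_query(extracted, candidates):
    """Pick the (page, value) pair whose yes/no answer splits candidates most evenly."""
    best, best_score = None, -1
    pages = sorted({page for rule in candidates for page in extracted[rule]})
    for page in pages:
        values = sorted(
            {extracted[rule][page] for rule in candidates
             if extracted[rule][page] is not None}
        )
        for value in values:
            agree = sum(extracted[rule][page] == value for rule in candidates)
            # A query is useful only if both answers eliminate some rules.
            score = min(agree, len(candidates) - agree)
            if score > best_score:
                best, best_score = (page, value), score
    return best


def prune(extracted, candidates, query, answer_is_yes):
    """Keep only the rules consistent with the worker's yes/no answer."""
    page, value = query
    return [r for r in candidates if (extracted[r][page] == value) == answer_is_yes]


candidates = list(EXTRACTED)
query = select_query(EXTRACTED, candidates)        # ("p1", "Heat")
candidates = prune(EXTRACTED, candidates, query, answer_is_yes=True)
print(query, candidates)                           # ('p1', 'Heat') ['r1', 'r4']
```

Each answer costs one paid task, so a query that halves the candidate set is worth more than one that confirms what every rule already agrees on.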
Architecture
ALFRED is a wrapper inference system supervised by workers from a crowdsourcing platform.
*Research Track: A Framework for Learning Web Wrappers from the Crowd, WWW 2013
Input and Rules Generation
Sample Set and Extracted Values
      page0      page1        page2
r1    Inception  City of God  Oblivion
r2    Inception  City of God  null
r3    Inception  null         Oblivion
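One way to read these values is that a page is only useful for supervision when the candidate rules disagree on it. The sketch below encodes the three rules and three pages as a small matrix and groups the rules by the value they extract on each page. The names `PAGES`, `EXTRACTED`, and `group_rules_by_value` are illustrative, and Python's `None` stands in for `null`.

```python
PAGES = ["page0", "page1", "page2"]

# One row per rule, one column per page; None stands for "null".
EXTRACTED = {
    "r1": ["Inception", "City of God", "Oblivion"],
    "r2": ["Inception", "City of God", None],
    "r3": ["Inception", None, "Oblivion"],
}


def group_rules_by_value(extracted, page_index):
    """Map each value extracted on one page to the rules that produce it."""
    groups = {}
    for rule, values in extracted.items():
        groups.setdefault(values[page_index], []).append(rule)
    return groups


# Pages on which at least two rules extract different values.
informative = [
    page for i, page in enumerate(PAGES)
    if len(group_rules_by_value(EXTRACTED, i)) > 1
]
print(informative)  # ['page1', 'page2']
```

On page0 all three rules return "Inception", so no answer about that page can separate them. On page1 the rules split into {r1, r2} versus {r3}, and on page2 into {r1, r3} versus {r2}.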
Probability and Noise
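A minimal sketch of how a probabilistic model can tolerate inaccurate workers: keep a probability for each candidate rule and update it with Bayes' rule after every yes/no answer, treating the answer as wrong with some probability. This is a simplified illustration, not the model from the ALFRED paper. It assumes that exactly one candidate rule is correct, that the worker error rate is known, and that answers are independent. The values reuse the three rules and three pages from the Sample Set and Extracted Values slide, and the names `update` and `error_rate` are hypothetical.

```python
# Values extracted by each candidate rule; None stands for "null".
EXTRACTED = {
    "r1": {"page0": "Inception", "page1": "City of God", "page2": "Oblivion"},
    "r2": {"page0": "Inception", "page1": "City of God", "page2": None},
    "r3": {"page0": "Inception", "page1": None, "page2": "Oblivion"},
}


def update(prior, extracted, page, value, answer_is_yes, error_rate):
    """Bayesian update of P(rule is correct) after one noisy yes/no answer."""
    posterior = {}
    for rule, p in prior.items():
        # If this rule were the correct one, the true answer would be:
        rule_says_yes = extracted[rule][page] == value
        # The worker reports the true answer with probability 1 - error_rate.
        if rule_says_yes == answer_is_yes:
            likelihood = 1 - error_rate
        else:
            likelihood = error_rate
        posterior[rule] = p * likelihood
    total = sum(posterior.values())
    return {rule: p / total for rule, p in posterior.items()}


# Uniform prior over the candidate rules.
belief = {rule: 1 / len(EXTRACTED) for rule in EXTRACTED}
# Assumed worker error rate of 10% and two "yes" answers.
belief = update(belief, EXTRACTED, "page1", "City of God", True, error_rate=0.1)
belief = update(belief, EXTRACTED, "page2", "Oblivion", True, error_rate=0.1)
print({rule: round(p, 3) for rule, p in belief.items()})
```

Under these assumptions r1 ends with a probability of about 0.82, while r2 and r3 keep a small non-zero probability. A single mistaken answer therefore lowers a rule's probability instead of eliminating the rule outright, and the system can stop asking once the best rule's probability is high enough.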