UNIVERSITY OF MARYLAND GOOGLE INC. AT&T LABS-RESEARCH THEODOROS REKATSINAS XIN LUNA DONG DIVESH...

UNIVERSITY OF MARYLAND, GOOGLE INC., AT&T LABS-RESEARCH

THEODOROS REKATSINAS, XIN LUNA DONG, DIVESH SRIVASTAVA

CHARACTERIZING AND SELECTING FRESH DATA SOURCES

DATA IS A COMMODITY

Myriads of data sources

FREELY AVAILABLE SOURCES

Open data initiative, World Data Bank, crawling the web

DATA MARKETS: SELL YOUR DATA TO OTHERS

DataSift, Microsoft Azure Marketplace, datamarket.com, Infochimps

HETEROGENEOUS DATA SOURCES

Different quality and cost

Cover different topics

Static or dynamic

Exhibit different update patterns

“Truth, like gold, is to be obtained not by its growth, but by washing away from it all that is not gold.”

– LEO TOLSTOY

So, can we find gold in a systematic and automated fashion?

Yes! Use source selection to reason about the benefits and costs of acquiring and integrating data sources [Dong et al., 2013]

Select sources before actual data integration

Prior techniques are agnostic to time and focus on the accuracy of static sources

Time matters

When do we need to use the integration result?

Data in the world and in the sources change

IT IS A DYNAMIC WORLD

Data Sources

Updated every 2 time points

Updated every 3 time points

CHALLENGES AND OPPORTUNITIES

Business listings (BL): ~40 sources, 2 years, ~1,400 categories, 51 locations

QUALITY CHANGES OVER TIME

The optimal set of sources changes over time

LOWER COST OPPORTUNITIES

Integrate updates at a lower frequency to lower cost

Entry types: up-to-date entries, out-of-date entries, non-deleted entries

TIME-BASED SOURCE QUALITY

Notation: S is a source, W is the world, t is a time point

Coverage(S, t) = Entries(S, t) / Entries(W, t)

Freshness(S, t) = UpToDate Entries(S, t) / Entries(S, t)

Coverage ~ Recall

Freshness ~ Precision

Combine: Accuracy(S, t)
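The two ratios above can be sketched as set arithmetic. This is a minimal illustration, not the authors' implementation: it assumes each snapshot is a dictionary from an entry key to its current value, and it counts only source entries that also exist in the world when computing coverage. The function names and the toy data are hypothetical.

```python
def coverage(source: dict, world: dict) -> float:
    """Recall-like: fraction of the world's entries that appear in the source."""
    if not world:
        return 0.0
    return len(source.keys() & world.keys()) / len(world)


def freshness(source: dict, world: dict) -> float:
    """Precision-like: fraction of the source's entries that are up to date."""
    if not source:
        return 0.0
    up_to_date = sum(1 for key, value in source.items()
                     if key in world and world[key] == value)
    return up_to_date / len(source)


# Toy snapshots at a single time point (hypothetical data).
world_now = {"cafe": "open", "bakery": "open", "gym": "moved", "bar": "open"}
source_now = {"cafe": "open", "gym": "old address", "diner": "open"}

print(coverage(source_now, world_now))   # 2 of 4 world entries are covered
print(freshness(source_now, world_now))  # 1 of 3 source entries is up to date
```

In the toy data, "gym" is an out-of-date entry and "diner" is a non-deleted entry, so both lower freshness without lowering coverage of the remaining world entries.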

SELECTING FRESH SOURCES

Time-aware source selection

EXTENSIONS

Optimal frequency

Subset of provided data

Time-aware source selection with many more sources

PROPOSED FRAMEWORK

HISTORICAL SNAPSHOTS OF AVAILABLE SOURCES

Pre-processing

Statistical Modeling

UPDATE MODELS FOR SOURCES

EVOLUTION MODELS FOR DATA DOMAIN

Source selection

USE STATISTICAL MODELS TO ESTIMATE QUALITY OF INTEGRATED DATA

INTEGRATION COST MODEL

Maximize Quality - Cost Tradeoff

SELECT OPTIMAL SUBSET OF SOURCES

WORLD EVOLUTION MODELS

Poisson random process

Exponentially distributed changes

Integrate available data source snapshots to extract the evolution of the world

Ensemble of parametric models
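To make the Poisson assumption concrete, here is a small sketch of fitting a single homogeneous Poisson model to change counts. It assumes changes were counted over intervals of equal length; the function names are hypothetical, and the talk's ensemble would fit one such model per slice of the data domain, which this sketch does not attempt.

```python
import math


def estimate_rate(change_counts, interval_length=1.0):
    """Maximum-likelihood Poisson intensity: total changes / total observed time."""
    if not change_counts:
        raise ValueError("need at least one observed interval")
    return sum(change_counts) / (len(change_counts) * interval_length)


def expected_changes(rate, horizon):
    """Expected number of changes over `horizon` time units."""
    return rate * horizon


def prob_change_within(rate, horizon):
    """P(at least one change within `horizon`), from exponential waiting times."""
    return 1.0 - math.exp(-rate * horizon)


# Hypothetical counts of new world entries seen in four consecutive intervals.
lam = estimate_rate([2, 4, 3, 3])
print(lam, expected_changes(lam, 2), prob_change_within(lam, 0.5))
```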

SOURCE UPDATE MODELS

Shall we consider only the update frequency?

High update frequency does not imply high freshness

Update frequency of the source

Empirical effectiveness distributions

Ensemble of non-parametric models
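One plausible reading of an empirical effectiveness distribution is the empirical CDF of the delay between a change in the world and the moment the source reflects it. The sketch below follows that reading, which is an assumption rather than a definition taken from the slides; the encoding of uncaptured changes as `None` is also hypothetical.

```python
from bisect import bisect_right


def effectiveness_cdf(delays):
    """Return a function d -> fraction of world changes captured within delay d.

    `delays` holds one capture delay per world change; None marks a change
    that the source never captured (hypothetical encoding).
    """
    total = len(delays)
    if total == 0:
        raise ValueError("need at least one observed change")
    captured = sorted(d for d in delays if d is not None)

    def cdf(d):
        # Count captured changes whose delay is at most d.
        return bisect_right(captured, d) / total

    return cdf


# Five world changes: four captured after 0, 1, 1 and 3 time points, one missed.
cdf = effectiveness_cdf([0, 1, 1, 3, None])
print(cdf(0), cdf(1), cdf(10))
```

Under this reading, a source can update at every time point and still plateau well below 1.0, which is one way to see why update frequency alone does not determine freshness.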

SOURCE QUALITY ESTIMATION

Combine statistical models

NewQuality as a function of OldQuality and ΔQuality

Coverage(S, t) = Entries(S, t) / Entries(W, t), where S is a source, W is the world, t is a time point, and both counts must be estimated

Estimating Entries(W, t): use the intensity rates λ of the Poisson models

Estimating Entries(S, t): Entries × Pr(Exist) + New Entries × Pr(Exist)
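The estimation steps above can be sketched numerically. This is a simplified illustration under stated assumptions: entries only appear (deletions and value changes are ignored), the expected number of new world entries is rate times horizon, and the two Pr(Exist) values are supplied directly rather than derived from the source update models. All names and numbers are hypothetical.

```python
def estimate_world_entries(current_world_entries, appearance_rate, horizon):
    """Current world size plus the expected number of new entries."""
    return current_world_entries + appearance_rate * horizon


def estimate_source_entries(old_entries, p_exist_old, new_entries, p_exist_new):
    """Expected source size: old and new entries, each weighted by Pr(Exist)."""
    return old_entries * p_exist_old + new_entries * p_exist_new


def estimate_coverage(current_world_entries, appearance_rate, horizon,
                      p_exist_old, p_exist_new):
    """Estimated coverage of the source at the end of the horizon."""
    new_entries = appearance_rate * horizon
    world = estimate_world_entries(current_world_entries, appearance_rate, horizon)
    source = estimate_source_entries(current_world_entries, p_exist_old,
                                     new_entries, p_exist_new)
    return source / world if world > 0 else 0.0


# 1,000 world entries today, 10 new entries per time point, 10 time points ahead.
print(estimate_coverage(1000, 10, 10, p_exist_old=0.8, p_exist_new=0.5))
```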

SOLVING SOURCE SELECTION

Maximize marginal gain

Greedy

Start with an empty solution and add sources greedily

No quality guarantees: solutions can be arbitrarily bad

Highly efficient
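A minimal sketch of the greedy strategy, assuming the objective is a function from a set of sources to quality minus cost. The `gain` signature, the toy coverage data and the per-source cost are hypothetical, chosen only to keep the example self-contained.

```python
def greedy_select(sources, gain):
    """Add the source with the largest positive marginal gain until none helps.

    `gain` maps a frozenset of sources to quality minus cost (assumed signature).
    """
    selected = frozenset()
    current = gain(selected)
    remaining = set(sources)
    while remaining:
        best, best_value = None, current
        for s in sorted(remaining):  # sorted only for deterministic tie-breaking
            value = gain(selected | {s})
            if value > best_value:
                best, best_value = s, value
        if best is None:  # no remaining source improves the objective
            break
        selected, current = selected | {best}, best_value
        remaining.remove(best)
    return selected, current


# Toy objective: number of covered items minus a linear cost of 0.5 per source.
COVERS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}}


def toy_gain(selected):
    covered = set().union(*(COVERS[s] for s in selected))
    return len(covered) - 0.5 * len(selected)


print(greedy_select(COVERS, toy_gain))
```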

ARBITRARY OBJECTIVE FUNCTIONS

GRASP (k,r) [used in Dong et al., '13]

Local-search and randomized hill-climbing

Run r times and keep best solution

Empirically high-quality solutions

Very expensive
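The sketch below shows one common reading of GRASP (k,r): each of the r runs builds a solution by picking at random among the k best improving additions, then hill-climbs by toggling single sources. It is an illustration under that assumption, not the implementation used in Dong et al.; the `gain` signature, the seed parameter and the toy data are hypothetical.

```python
import random


def grasp(sources, gain, k, r, seed=0):
    """Randomized construction plus local search, repeated r times."""
    rng = random.Random(seed)
    items = sorted(sources)
    best_set, best_value = frozenset(), gain(frozenset())
    for _ in range(r):
        # Construction: choose randomly among the k best improving additions.
        selected, value = frozenset(), gain(frozenset())
        while True:
            ranked = sorted(((gain(selected | {s}), s)
                             for s in items if s not in selected), reverse=True)
            improving = [(v, s) for v, s in ranked[:k] if v > value]
            if not improving:
                break
            value, pick = rng.choice(improving)
            selected = selected | {pick}
        # Local search: add or delete one source while the objective improves.
        improved = True
        while improved:
            improved = False
            for s in items:
                neighbor = selected - {s} if s in selected else selected | {s}
                v = gain(neighbor)
                if v > value:
                    selected, value, improved = neighbor, v, True
        if value > best_value:
            best_set, best_value = selected, value
    return best_set, best_value


# Toy objective: number of covered items minus a linear cost of 0.5 per source.
COVERS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}}


def toy_gain(selected):
    covered = set().union(*(COVERS[s] for s in selected))
    return len(covered) - 0.5 * len(selected)


print(grasp(COVERS, toy_gain, k=2, r=5))
```

Every run re-evaluates the objective for all candidate additions and all single-source toggles, which is consistent with the cost noted above.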

INSIGHTS FOR QUALITY GUARANTEES

A large family of benefit functions are monotone submodular (e.g., functions of coverage)

Under a linear cost function, the marginal gain is submodular

f(A ∪ {x}) - f(A) ≥ f(B ∪ {x}) - f(B), for A ⊆ B and x ∉ B
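The diminishing-returns inequality can be checked by brute force on a small ground set. The sketch below does this for a toy coverage benefit with a linear cost; it enumerates all subset pairs, so it is only practical for a handful of sources. The function names and data are hypothetical.

```python
from itertools import combinations


def is_submodular(ground, f, tol=1e-12):
    """Check f(A | {x}) - f(A) >= f(B | {x}) - f(B) for all A <= B and x not in B."""
    items = list(ground)
    subsets = [frozenset(c) for n in range(len(items) + 1)
               for c in combinations(items, n)]
    for b in subsets:
        for a in subsets:
            if not a <= b:
                continue
            for x in items:
                if x in b:
                    continue
                if f(a | {x}) - f(a) < f(b | {x}) - f(b) - tol:
                    return False
    return True


# Toy objective: number of covered items minus a linear cost of 0.5 per source.
COVERS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}}


def toy_gain(selected):
    covered = set().union(*(COVERS[s] for s in selected))
    return len(covered) - 0.5 * len(selected)


print(is_submodular(COVERS, toy_gain))                 # coverage minus linear cost
print(is_submodular(COVERS, lambda s: len(s) ** 2))    # increasing returns
```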

SUBMODULAR OBJECTIVE FUNCTIONS

Submodular Maximization (MaxSub)

Start by selecting the best source

Explore local neighborhood: add/delete sources

Either selected set or complement is a local optimum

Constant factor approximation [Feige, '11]

Highly efficient

Empirically high-quality even for non-submodular functions
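The three steps above resemble the local-search scheme for submodular maximization, sketched below under stated assumptions: the objective is nonnegative, a move is accepted only if it improves the value by a factor of more than 1 + eps/n², and additions and deletions are tried in a single pass rather than in separate phases. The names, the `eps` default and the toy data are hypothetical, and this is not presented as the authors' MaxSub code.

```python
def max_sub(sources, f, eps=0.01):
    """Local search with add/delete moves; returns the better of S and its complement."""
    items = sorted(sources)
    n = len(items)
    if n == 0:
        return frozenset(), f(frozenset())
    threshold = 1.0 + eps / (n * n)
    # Start from the best single source.
    selected = frozenset({max(items, key=lambda s: f(frozenset({s})))})
    value = f(selected)
    changed = True
    while changed:
        changed = False
        for s in items:
            neighbor = selected - {s} if s in selected else selected | {s}
            v = f(neighbor)
            # Accept only sufficiently large, strict improvements.
            if v > value and v > threshold * value:
                selected, value, changed = neighbor, v, True
                break
    complement = frozenset(items) - selected
    complement_value = f(complement)
    if complement_value > value:
        return complement, complement_value
    return selected, value


# Toy objective: number of covered items minus a linear cost of 0.5 per source.
COVERS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}}


def toy_gain(selected):
    covered = set().union(*(COVERS[s] for s in selected))
    return len(covered) - 0.5 * len(selected)


print(max_sub(COVERS, toy_gain))
```

Each accepted move increases the objective by a multiplicative factor, which bounds the number of moves and keeps the search cheap compared with repeated randomized restarts.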

SELECTED EXPERIMENTS

Business listings (BL): ~40 sources, 2 years, ~1,400 categories, 51 locations

World-wide event listings (GDELT, gdeltproject.org): 15,275 sources, 1 month, 236 event types, 242 locations

WORLD CHANGE ESTIMATION

Small relative error even with little training data

Expected increasing trend over time

SOURCE CHANGE ESTIMATION

Small relative error for source quality

SELECTION QUALITY

Grasp finds the best solution most of the time

MaxSub solutions are mostly comparable to Grasp, but there are cases when Grasp is significantly worse

BENEFIT  METRIC  MSR.   GREEDY        MAXSUB         GRASP
LINEAR   cov.    best   16.7%         50%            100% (5,20)
                 diff.  .005 (.01)%   .001 (.007)%   -
         acc.    best   0.0%          33.3%          83.3% (2,100)
                 diff.  9.5 (53.7)%   .39 (2.31)%    8.9 (53.7)%
STEP     cov.    best   50.0%         66.7%          83.3% (10,100)
                 diff.  7.45 (27.8)%  .012 (.06)%    .7 (4.2)%
         acc.    best   50%           66.7%          83.3% (5,100)
                 diff.  6 (23.98)%    1.76 (10.6)%   3.99 (23.98)%

best: perc. of times finding the best solution

diff.: avg. and worst quality loss

INCREASING NUMBER OF SOURCES

[Figure: Scalability of source selection. x-axis: # data sources (43 to 11,239). Series: Greedy, MaxSub, Grasp (5,20), Grasp (10,100)]

MaxSub is one to two orders of magnitude faster

SELECTION CHARACTERISTICS

Accuracy selects fewer, more focused sources

CONCLUSIONS

Source selection before data integration to increase quality and reduce cost

Collection of statistical models to describe the evolution of the world and the updates of sources

Exploiting submodularity gives more efficient solutions with rigorous guarantees

Thank you! thodrek@cs.umd.edu
