UNIVERSITY OF MARYLAND GOOGLE INC. AT&T LABS-RESEARCH THEODOROS REKATSINAS XIN LUNA DONG DIVESH SRIVASTAVA CHARACTERIZING AND SELECTING FRESH DATA SOURCES

Page 1:

UNIVERSITY OF MARYLAND, GOOGLE INC., AT&T LABS-RESEARCH

THEODOROS REKATSINAS, XIN LUNA DONG, DIVESH SRIVASTAVA

CHARACTERIZING AND SELECTING FRESH DATA SOURCES

Page 2:

DATA IS A COMMODITY
myriads of data sources

Page 3:

DATA IS A COMMODITY
myriads of data sources

FREELY AVAILABLE SOURCES
open data initiative
world data bank
crawling the web

Page 4:

DATA IS A COMMODITY
myriads of data sources

FREELY AVAILABLE SOURCES

DATA MARKETS: SELL YOUR DATA TO OTHERS
DataSift
Microsoft Azure Marketplace
datamarket.com
Infochimps

Page 5:

DATA IS A COMMODITY
myriads of data sources

FREELY AVAILABLE SOURCES

DATA MARKETS: SELL YOUR DATA TO OTHERS

HETEROGENEOUS DATA SOURCES
different quality and cost
cover different topics
static or dynamic
exhibit different update patterns

Page 6:

“Truth, like gold, is to be obtained not by its growth, but by washing away from it all that is not gold.”

- LEO TOLSTOY

Page 7:

So, can we find gold in a systematic and automated fashion?

Yes! Use source selection to reason about the benefits and costs of acquiring and integrating data sources [Dong et al., 2013]

Existing techniques are agnostic to time and focus on the accuracy of static sources

Page 8:

matters

Select sources before actual data integration

When do we need to use the integration result?

Data in the world and in the sources changes

Page 9:

IT IS A DYNAMIC WORLD

[Figure: timelines for the world and for the data sources; one source is updated every 2 time points, another every 3 time points.]

Page 10:

CHALLENGES AND OPPORTUNITIES

Business listings (BL): ~40 sources, 2 years, ~1,400 categories, 51 locations

Page 11:

QUALITY CHANGES OVER TIME

The optimal set of sources changes over time

Page 12:

LOWER COST OPPORTUNITIES

Integrate updates at a lower frequency to lower the cost

Page 13:

TIME-BASED SOURCE QUALITY

Entry types in a source: up-to-date entries, out-of-date entries, non-deleted entries

With S a source, W the world, and t a time point:

$\mathrm{Coverage}(S, t) = \mathrm{Entries}(S, t) \,/\, \mathrm{Entries}(W, t)$

$\mathrm{Freshness}(S, t) = \mathrm{UpToDateEntries}(S, t) \,/\, \mathrm{Entries}(S, t)$

Coverage ~ Recall, Freshness ~ Precision

Combine both: Accuracy(S, t)
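A minimal sketch of how the coverage and freshness ratios above could be computed from one snapshot. The dictionary representation, the function names, and the choice to count only source entries that still exist in the world toward coverage are assumptions made for illustration, not details taken from the slides.

```python
def classify_entries(source, world):
    """Split a source snapshot into the three entry types named on the slide.

    Both arguments map an entry id to its value at the same time point
    (an assumed representation).
    """
    up_to_date = {k for k, v in source.items() if k in world and world[k] == v}
    out_of_date = {k for k, v in source.items() if k in world and world[k] != v}
    non_deleted = {k for k in source if k not in world}  # gone from the world, still listed
    return up_to_date, out_of_date, non_deleted


def coverage(source, world):
    """Recall-like ratio: world entries that the source mentions."""
    up_to_date, out_of_date, _ = classify_entries(source, world)
    return (len(up_to_date) + len(out_of_date)) / len(world) if world else 0.0


def freshness(source, world):
    """Precision-like ratio: source entries that match the current world."""
    up_to_date, _, _ = classify_entries(source, world)
    return len(up_to_date) / len(source) if source else 0.0


# Hypothetical snapshot: "b" is stale in the source, "e" no longer exists in the world.
world = {"a": 1, "b": 2, "c": 3, "d": 4}
source = {"a": 1, "b": 9, "e": 5}
print(coverage(source, world), freshness(source, world))
```

In this toy snapshot the source mentions two of the four world entries and only one of its three entries is current, which illustrates how a source can have moderate coverage but low freshness.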

Page 14:

SELECTING FRESH SOURCES

Time-aware source selection

Page 15:

EXTENSIONS

Optimal frequency

Subset of provided data

Page 16:

EXTENSIONS

Optimal frequency

Subset of provided data

Time-aware source selection with many more sources

Page 17:

PROPOSED FRAMEWORK

HISTORICAL SNAPSHOTS OF AVAILABLE SOURCES

Pre-processing

Statistical Modeling
UPDATE MODELS FOR SOURCES
EVOLUTION MODELS FOR DATA DOMAIN

Source selection
USE STATISTICAL MODELS TO ESTIMATE QUALITY OF INTEGRATED DATA
INTEGRATION COST MODEL
Maximize Quality - Cost Tradeoff
SELECT OPTIMAL SUBSET OF SOURCES

Page 19:

WORLD EVOLUTION MODELS

Poisson random process
Exponentially distributed changes

Integrate available data source snapshots to extract the evolution of the world

Ensemble of parametric models
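A minimal sketch of the Poisson view of world changes, assuming a single homogeneous process whose change timestamps have already been extracted from the integrated snapshots. The function names and the sample timestamps are hypothetical; the slides mention an ensemble of such parametric models, which is not reproduced here.

```python
import math


def estimate_rate(change_times, observation_window):
    """Maximum-likelihood intensity of a homogeneous Poisson process:
    number of observed changes divided by the length of the window."""
    return len(change_times) / observation_window


def expected_changes(rate, dt):
    """Expected number of changes in an interval of length dt."""
    return rate * dt


def prob_change_within(rate, dt):
    """Waiting times are exponentially distributed, so the probability of
    at least one change within dt is 1 - exp(-rate * dt)."""
    return 1.0 - math.exp(-rate * dt)


# Hypothetical change timestamps observed over a window of 10 time points.
rate = estimate_rate([1.0, 3.5, 4.0, 9.0], 10.0)
print(rate, expected_changes(rate, 5.0), prob_change_within(rate, 2.5))
```

The same intensity rate drives both quantities: the expected count grows linearly with the interval, while the probability of seeing at least one change saturates toward 1.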

Page 20:

SOURCE UPDATE MODELS

Shall we consider only the update frequency?

Page 21:

SOURCE UPDATE MODELS

High update frequency does not imply high freshness

Page 22:

SOURCE UPDATE MODELS

Update frequency of the source

Empirical effectiveness distributions

Ensemble of non-parametric models
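One way to picture a non-parametric effectiveness distribution is as an empirical distribution over capture delays. The sketch below assumes, for illustration only, that "effectiveness" is measured as the delay between a change in the world and the moment the source reflects it, and that all delays are fully observed; the slides do not spell out the estimator.

```python
from bisect import bisect_right


def empirical_cdf(delays):
    """Return a function d -> fraction of observed delays that are <= d."""
    ordered = sorted(delays)
    if not ordered:
        raise ValueError("need at least one observed delay")
    n = len(ordered)

    def cdf(d):
        # bisect_right counts how many observed delays are <= d.
        return bisect_right(ordered, d) / n

    return cdf


# Hypothetical delays (in time points) before a source captured four world changes.
captured_within = empirical_cdf([1, 2, 2, 5])
print(captured_within(0), captured_within(2), captured_within(5))
```

A source that updates often but captures changes late would show a slowly rising curve here, which is consistent with the point that high update frequency does not imply high freshness.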

Page 24:

SOURCE QUALITY ESTIMATION

Combine statistical models

NewQuality( , ; ) as a function of OldQuality( , ; ) and ΔQuality( , ; )

Page 25:

SOURCE QUALITY ESTIMATION

Combine statistical models

With S a source, W the world, and t a time point:

$\mathrm{Coverage}(S, t) = \mathrm{Entries}(S, t) \,/\, \mathrm{Entries}(W, t)$ ?

Page 26:

SOURCE QUALITY ESTIMATION

Combine statistical models

Estimating Entries( , ): use the intensity rates λ of the Poisson models

Entries( , ) +

Page 27:

SOURCE QUALITY ESTIMATION

Combine statistical models

Estimating Entries( , ):

Entries( , ) Pr(Exist( , )) + New Entries( , ) Pr(Exist( , ))

Page 28:

SOURCE QUALITY ESTIMATION

Combine statistical models

Estimating Entries( , ):

Entries( , )

Page 30:

SOLVING SOURCE SELECTION

Maximize marginal gain

Page 31:

SOLVING SOURCE SELECTION

Greedy

Start with an empty solution and add sources greedily

No quality guarantees: solutions can be arbitrarily bad

Highly efficient
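The greedy strategy above can be sketched as follows, assuming the objective is a marginal gain of the form benefit minus cost. The catalog, the coverage-style benefit, and the linear cost are hypothetical stand-ins for the quality and cost models of the framework.

```python
# Hypothetical catalog: source name -> (set of covered items, integration cost).
CATALOG = {
    "s1": ({1, 2, 3}, 1.0),
    "s2": ({3, 4}, 0.5),
    "s3": ({1, 2}, 1.5),
}


def gain(selected):
    """Assumed objective: number of covered items minus total (linear) cost."""
    covered = set().union(*[CATALOG[s][0] for s in selected])
    return len(covered) - sum(CATALOG[s][1] for s in selected)


def greedy_select(sources, objective):
    """Start from the empty set; repeatedly add the source that raises the
    objective the most, and stop when no addition improves it."""
    selected, current = set(), objective(set())
    remaining = set(sources)
    while remaining:
        best, best_val = None, current
        for s in sorted(remaining):  # sorted for deterministic tie-breaking
            val = objective(selected | {s})
            if val > best_val:
                best, best_val = s, val
        if best is None:
            break
        selected.add(best)
        remaining.remove(best)
        current = best_val
    return selected, current


chosen, value = greedy_select(CATALOG, gain)
print(sorted(chosen), value)
```

Each round evaluates the objective once per remaining source, which is why the approach is cheap; nothing in the loop protects it from committing early to a source that blocks a better combination.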

Page 32:

ARBITRARY OBJECTIVE FUNCTIONS

GRASP (k,r) [used in Dong et al., `13]

Local search and randomized hill-climbing
Run r times and keep the best solution

Empirically high-quality solutions

Very expensive
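A rough sketch of a GRASP(k, r)-style procedure, under the assumption that k is the number of top candidates the randomized construction chooses from and r is the number of restarts. The catalog, the objective, and the single add/remove neighborhood are illustrative choices, not the exact procedure of Dong et al.

```python
import random

# Hypothetical catalog: source name -> (set of covered items, integration cost).
CATALOG = {"s1": ({1, 2, 3}, 1.0), "s2": ({3, 4}, 0.5), "s3": ({1, 2}, 1.5)}


def gain(selected):
    """Assumed objective: number of covered items minus total (linear) cost."""
    covered = set().union(*[CATALOG[s][0] for s in selected])
    return len(covered) - sum(CATALOG[s][1] for s in selected)


def grasp(sources, objective, k, r, seed=0):
    rng = random.Random(seed)
    sources = sorted(sources)
    best_set, best_val = set(), objective(set())
    for _ in range(r):
        # Randomized construction: pick among the k best improving additions.
        selected, current = set(), objective(set())
        while True:
            scored = [(objective(selected | {s}), s) for s in sources if s not in selected]
            improving = sorted((c for c in scored if c[0] > current), reverse=True)[:k]
            if not improving:
                break
            current, pick = rng.choice(improving)
            selected.add(pick)
        # Local search (hill-climbing): toggle one source while it helps.
        improved = True
        while improved:
            improved = False
            for s in sources:
                neighbor = selected ^ {s}
                val = objective(neighbor)
                if val > current:
                    selected, current, improved = neighbor, val, True
                    break
        if current > best_val:
            best_set, best_val = set(selected), current
    return best_set, best_val


chosen, value = grasp(CATALOG, gain, k=2, r=20)
print(sorted(chosen), value)
```

The cost grows with r and with the number of objective evaluations per restart, which matches the observation that the method is expensive.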

Page 33:

INSIGHTS FOR QUALITY GUARANTEES

A large family of benefit functions is monotone submodular (e.g., functions of coverage)

Under a linear cost function the marginal gain is submodular

Submodularity (diminishing returns): for sets $A \subseteq B$ and an element $x \notin B$,

$f(A \cup \{x\}) - f(A) \ge f(B \cup \{x\}) - f(B)$
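The diminishing-returns property, f(A ∪ {x}) - f(A) ≥ f(B ∪ {x}) - f(B) whenever A ⊆ B and x ∉ B, can be checked by brute force on a tiny ground set. The sketch below uses a hypothetical coverage benefit and linear cost; it only illustrates the definition and is exponential in the number of sources.

```python
from itertools import combinations

# Hypothetical catalog: source name -> set of covered items, and per-source cost.
COVERED = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {1, 2}}
COST = {"s1": 1.0, "s2": 0.5, "s3": 1.5}


def benefit(selected):
    """Coverage-style benefit: number of distinct items covered."""
    return len(set().union(*[COVERED[s] for s in selected]))


def gain(selected):
    """Benefit minus a linear (additive) cost."""
    return benefit(selected) - sum(COST[s] for s in selected)


def subsets(items):
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_submodular(f, ground, tol=1e-12):
    """Check f(A+x) - f(A) >= f(B+x) - f(B) for all A subset of B, x outside B."""
    ground = frozenset(ground)
    for b in subsets(ground):
        for a in subsets(b):
            for x in ground - b:
                if f(a | {x}) - f(a) < f(b | {x}) - f(b) - tol:
                    return False
    return True


print(is_submodular(benefit, COVERED))                   # coverage benefit
print(is_submodular(gain, COVERED))                      # benefit minus linear cost
print(is_submodular(lambda s: len(s) ** 2, COVERED))     # increasing returns
```

Subtracting a linear cost shifts every marginal gain of a source by the same constant, so it preserves submodularity even though the resulting objective is no longer monotone.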

Page 34:

SUBMODULAR OBJECTIVE FUNCTIONS

Submodular Maximization (MaxSub)

Start by selecting the best source
Explore the local neighborhood: add/delete sources
Either the selected set or its complement is a local optimum

Constant factor approximation [Feige, `11]

Highly efficient

Empirically high-quality even for non-submodular functions
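The local-search steps above can be sketched as follows. This is a simplified rendering of the add/delete local search for submodular maximization, assuming a non-negative objective; the eps parameter, the extra strict-improvement check, and the toy catalog are assumptions added for illustration.

```python
# Hypothetical catalog: source name -> (set of covered items, integration cost).
CATALOG = {"s1": ({1, 2, 3}, 1.0), "s2": ({3, 4}, 0.5), "s3": ({1, 2}, 1.5)}


def gain(selected):
    """Assumed objective: number of covered items minus total (linear) cost."""
    covered = set().union(*[CATALOG[s][0] for s in selected])
    return len(covered) - sum(CATALOG[s][1] for s in selected)


def max_sub(ground, f, eps=0.01):
    ground = set(ground)
    n = len(ground)
    if n == 0:
        return set(), f(set())
    threshold = 1.0 + eps / (n * n)  # a move must improve f by this factor
    selected = {max(sorted(ground), key=lambda x: f({x}))}  # best single source
    improved = True
    while improved:
        improved = False
        current = f(selected)
        for x in sorted(ground - selected):  # try additions first
            val = f(selected | {x})
            if val > threshold * current and val > current:
                selected.add(x)
                improved = True
                break
        if improved:
            continue
        for x in sorted(selected):  # then try deletions
            val = f(selected - {x})
            if val > threshold * current and val > current:
                selected.remove(x)
                improved = True
                break
    complement = ground - selected
    # Return the better of the local optimum and its complement.
    if f(complement) > f(selected):
        return complement, f(complement)
    return selected, f(selected)


chosen, value = max_sub(CATALOG, gain)
print(sorted(chosen), value)
```

Requiring each move to improve the objective by a multiplicative factor bounds the number of moves, which is one reason this kind of search stays efficient as the number of sources grows.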

Page 35:

SELECTED EXPERIMENTS

Business listings (BL): ~40 sources, 2 years, ~1,400 categories, 51 locations

World-wide event listings (GDELT, gdeltproject.org): 15,275 sources, 1 month, 236 event types, 242 locations

Page 36:

WORLD CHANGE ESTIMATION

Small relative error even with little training data
Expected increasing trend over time

Page 37:

SOURCE CHANGE ESTIMATION

Small relative error for source quality

Page 38:

SELECTION QUALITY

best: percentage of times each algorithm finds the best solution (Grasp (k,r) setting in parentheses)

BENEFIT  METRIC  MSR.   GREEDY   MAXSUB   GRASP
LINEAR   cov.    best   16.7%    50%      100% (5,20)
LINEAR   acc.    best   0.0%     33.3%    83.3% (2,100)
STEP     cov.    best   50.0%    66.7%    83.3% (10,100)
STEP     acc.    best   50%      66.7%    83.3% (5,100)

Grasp finds the best solution most of the time

Page 39:

SELECTION QUALITY

MaxSub solutions are mostly comparable to Grasp

best: percentage of times each algorithm finds the best solution (Grasp (k,r) setting in parentheses)
diff.: average (worst) quality loss

BENEFIT  METRIC  MSR.   GREEDY          MAXSUB          GRASP
LINEAR   cov.    best   16.7%           50%             100% (5,20)
LINEAR   cov.    diff.  .005 (.01)%     .001 (.007)%    -
LINEAR   acc.    best   0.0%            33.3%           83.3% (2,100)
LINEAR   acc.    diff.  9.5 (53.7)%     .39 (2.31)%     8.9 (53.7)%
STEP     cov.    best   50.0%           66.7%           83.3% (10,100)
STEP     cov.    diff.  7.45 (27.8)%    .012 (.06)%     .7 (4.2)%
STEP     acc.    best   50%             66.7%           83.3% (5,100)
STEP     acc.    diff.  6 (23.98)%      1.76 (10.6)%    3.99 (23.98)%

Page 40:

SELECTION QUALITY

...but there are cases where Grasp is significantly worse (e.g., linear benefit with the accuracy metric: average quality loss 8.9%, worst 53.7%, versus .39% and 2.31% for MaxSub)

Page 41:

INCREASING NUMBER OF SOURCES

[Figure: Scalability of source selection. x-axis: # data sources (43 to 11,239); y-axis: time (msec), log scale from 1E+00 to 1E+06; series: Greedy, MaxSub, Grasp (5,20), Grasp (10,100).]

MaxSub is one to two orders of magnitude faster

Page 42:

SELECTION CHARACTERISTICS

Accuracy selects fewer, more focused sources

Page 43:

CONCLUSIONS

Thank you!

Source selection before data integration to increase quality and reduce cost

Collection of statistical models to describe the evolution of the world and the updates of sources

Exploiting submodularity gives more efficient solutions with rigorous guarantees