UNIVERSITY OF MARYLAND GOOGLE INC. AT&T LABS-RESEARCH THEODOROS REKATSINAS XIN LUNA DONG DIVESH...

UNIVERSITY OF MARYLAND, GOOGLE INC., AT&T LABS-RESEARCH

THEODOROS REKATSINAS, XIN LUNA DONG, DIVESH SRIVASTAVA

CHARACTERIZING AND SELECTING FRESH DATA SOURCES

DATA IS A COMMODITY

myriads of data sources


FREELY AVAILABLE SOURCES

open data initiative
world data bank
crawling the web


DATA MARKETS: SELL YOUR DATA TO OTHERS

datasift
microsoft azure marketplace
datamarket.com
infochimps


HETEROGENEOUS DATA SOURCES

different quality - cost
cover different topics
static or dynamic
exhibit different update patterns

“Truth, like gold, is to be obtained not by its growth, but by washing away from it all that is not gold.”

– LEO TOLSTOY

So, can we find gold in a systematic and automated fashion?

Yes! Use source selection to reason about the benefits and costs of acquiring and integrating data sources [Dong et al., 2013]

Existing techniques are agnostic to time and focus on the accuracy of static sources

matters

Select sources before actual data integration

When do we need to use the integration result?

IT IS A DYNAMIC WORLD

Data in the world and in the sources changes

[Figure: timelines of the World and of the Data Sources; one source is updated every 2 time points, another every 3 time points]

CHALLENGES AND OPPORTUNITIES

Business listings (BL)
~40 sources, 2 years
~1,400 categories, 51 locations

QUALITY CHANGES OVER TIME

The optimal set of sources changes over time

LOWER COST OPPORTUNITIES

Integrate updates at a lower frequency to lower cost

UpToDate Entries, OutOfDate Entries, NonDeleted Entries

TIME-BASED SOURCE QUALITY

For a source S, the world W, and a time point t:

Coverage(S, t) = Entries(S, t) / Entries(W, t)

Freshness(S, t) = UpToDate Entries(S, t) / Entries(S, t)

Coverage ~ Recall
Freshness ~ Precision

Combine: Accuracy(S, t)
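Coverage and Freshness are both ratios of entry counts, so a small worked example may help. The sketch below is only an illustration under stated assumptions: the world and the source are modeled as dictionaries from an entity name to its current value, only source entries that still exist in the world count toward coverage, and the function names and sample data are hypothetical rather than taken from the slides.

```python
def coverage(source, world):
    """Share of the world's entries that also appear in the source (recall-like)."""
    if not world:
        raise ValueError("world must contain at least one entry")
    # Assumption: entries the world no longer contains do not count as covered.
    covered = sum(1 for key in source if key in world)
    return covered / len(world)


def freshness(source, world):
    """Share of the source's entries that match the world's current value (precision-like)."""
    if not source:
        raise ValueError("source must contain at least one entry")
    up_to_date = sum(
        1 for key, value in source.items() if key in world and world[key] == value
    )
    return up_to_date / len(source)


# Hypothetical snapshot at one time point: entity -> current phone extension.
world = {"cafe": 101, "bakery": 102, "gym": 103, "bank": 104}
# The source misses two entities, holds a stale value, and keeps a deleted entity.
source = {"cafe": 101, "bakery": 999, "diner": 105}

print(coverage(source, world))   # 2 of the 4 world entries are present
print(freshness(source, world))  # 1 of the 3 source entries is up to date
```

In this toy snapshot the source has moderate coverage but low freshness, which is the situation the two separate measures are meant to distinguish.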

SELECTING FRESH SOURCES

Time-aware source selection

EXTENSIONS

Optimal frequency
Subset of provided data
Time-aware source selection with many more sources

PROPOSED FRAMEWORK

HISTORICAL SNAPSHOTS OF AVAILABLE SOURCES

Pre-processing: Statistical Modeling
UPDATE MODELS FOR SOURCES
EVOLUTION MODELS FOR DATA DOMAIN

Source selection: Maximize Quality - Cost Tradeoff
USE STATISTICAL MODELS TO ESTIMATE QUALITY OF INTEGRATED DATA
INTEGRATION COST MODEL

SELECT OPTIMAL SUBSET OF SOURCES

WORLD EVOLUTION MODELS

Poisson Random Process
Exponentially distributed changes

Integrate available data source snapshots to extract the evolution of the world

Ensemble of parametric models
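As a rough illustration of the Poisson model named above, the sketch below estimates a single intensity rate λ from observed change times and uses it to predict future changes. It is a minimal example under assumptions: one homogeneous rate instead of the ensemble the slide mentions, and hypothetical function names and sample timestamps.

```python
import math


def estimate_rate(change_times, horizon):
    """Maximum-likelihood Poisson rate: observed changes per unit of time."""
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    return len(change_times) / horizon


def expected_changes(rate, delta):
    """Expected number of changes in a window of length delta."""
    return rate * delta


def prob_no_change(rate, delta):
    """Probability of no change within delta (exponential waiting time)."""
    return math.exp(-rate * delta)


# Hypothetical change timestamps seen in integrated snapshots over 10 time points.
observed = [1, 3, 4, 8]
rate = estimate_rate(observed, horizon=10)

print(rate)                       # estimated changes per time point
print(expected_changes(rate, 5))  # expected changes over the next 5 time points
print(prob_no_change(rate, 5))    # chance that nothing changes in that window
```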

SOURCE UPDATE MODELS

Shall we consider only the update frequency?


High update frequency does not imply high freshness


Update frequency of the source

Empirical Effectiveness distributions

Ensemble of non-parametric models
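One way to read the empirical effectiveness distribution is as the observed share of world changes that a source reflects within a given delay. The sketch below follows that reading as an assumption; the delay lists, the use of None for changes a source never captured, and the function name are all hypothetical.

```python
def effectiveness(delays, within):
    """Empirical probability that a world change shows up in the source
    within `within` time points. A delay of None means never captured."""
    if not delays:
        raise ValueError("delays must contain at least one observation")
    captured = sum(1 for d in delays if d is not None and d <= within)
    return captured / len(delays)


# Hypothetical delays (in time points) between a world change and its capture.
# Source A updates every time point but misses most changes.
source_a = [0, None, None, 1, None, None, None, 0]
# Source B updates less often but eventually captures nearly everything.
source_b = [2, 3, 1, 3, 2, None, 3, 2]

for name, delays in (("A", source_a), ("B", source_b)):
    print(name, effectiveness(delays, within=1), effectiveness(delays, within=3))
```

With these made-up numbers the frequently updated source A still ends up capturing fewer changes than source B, which matches the point that update frequency alone does not determine freshness.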






SOURCE QUALITY ESTIMATION

Combine statistical models

NewQuality as a function of OldQuality and ΔQuality

Coverage is a ratio of two entry counts, Entries / Entries, and both counts have to be estimated for the time point of interest

Estimating the entries of the world: use the intensity rates λ of the Poisson models

Estimating the entries of a source: Entries, weighted by Pr(Exist), plus New Entries, weighted by Pr(Exist)






SOLVING SOURCE SELECTION

Maximize marginal gain

Greedy

Start with an empty solution and add sources greedily

No quality guarantees; solutions can be arbitrarily bad

Highly efficient
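The greedy strategy can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the objective is a toy profit (number of covered items minus a per-source cost), and the source names, coverage sets, and costs are hypothetical.

```python
def greedy_select(sources, profit):
    """Start from the empty set and keep adding the source with the largest
    positive marginal gain; stop when no source improves the profit."""
    selected = set()
    while True:
        base = profit(selected)
        best, best_gain = None, 0
        for s in sorted(set(sources) - selected):
            gain = profit(selected | {s}) - base
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:
            return selected
        selected.add(best)


# Hypothetical sources: the items each one covers and its integration cost.
COVERS = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1}}
COST = {"a": 1, "b": 1, "c": 1, "d": 2}


def profit(selected):
    """Toy objective: number of covered items minus the summed cost."""
    covered = set().union(*(COVERS[s] for s in selected))
    return len(covered) - sum(COST[s] for s in selected)


chosen = greedy_select(COVERS, profit)
print(sorted(chosen), profit(chosen))
```

Each round evaluates every remaining source once, which is why the approach is cheap; nothing in the loop protects it from committing early to a poor choice.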

ARBITRARY OBJECTIVE FUNCTIONS

GRASP (k,r) [used in Dong et al., '13]

Local-search and randomized hill-climbing
Run r times and keep best solution

Empirically high-quality solutions

Very expensive
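A compact sketch of the GRASP idea is shown below. It assumes that k is the number of top candidates from which each construction step picks at random (the slides only state that r is the number of runs), uses single add/delete moves for the hill-climbing phase, and reuses a hypothetical toy profit; none of these details are confirmed by the source.

```python
import random


def grasp(sources, profit, k, r, seed=0):
    """Run r randomized constructions, each followed by hill-climbing,
    and keep the best solution found."""
    rng = random.Random(seed)
    order = sorted(sources)
    best_set, best_value = set(), profit(set())
    for _ in range(r):
        # Construction: repeatedly pick at random among the k best positive gains.
        selected = set()
        while True:
            base = profit(selected)
            gains = [(profit(selected | {s}) - base, s) for s in order if s not in selected]
            candidates = sorted((g for g in gains if g[0] > 0), reverse=True)[:k]
            if not candidates:
                break
            selected.add(rng.choice(candidates)[1])
        # Local search: toggle one source at a time while the profit strictly improves.
        improved = True
        while improved:
            improved = False
            base = profit(selected)
            for s in order:
                neighbor = selected ^ {s}
                if profit(neighbor) > base:
                    selected, improved = neighbor, True
                    break
        value = profit(selected)
        if value > best_value:
            best_set, best_value = set(selected), value
    return best_set, best_value


# Hypothetical sources: the items each one covers and its integration cost.
COVERS = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1}}
COST = {"a": 1, "b": 1, "c": 1, "d": 2}


def profit(selected):
    """Toy objective: number of covered items minus the summed cost."""
    covered = set().union(*(COVERS[s] for s in selected))
    return len(covered) - sum(COST[s] for s in selected)


print(grasp(COVERS, profit, k=1, r=1))
print(grasp(COVERS, profit, k=2, r=20))
```

Every run re-evaluates the objective for all candidates at each step and again during the local search, so the total work grows with r, which is consistent with the cost noted above.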

INSIGHTS FOR QUALITY GUARANTEES

A large family of benefit functions is monotone submodular (e.g., functions of coverage)

Under a linear cost function the marginal gain is submodular

Submodularity (diminishing returns): for A ⊆ B and x ∉ B, f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B)
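The diminishing-returns property can be checked by brute force on a small ground set. The sketch below is only a demonstration: the checker enumerates all nested pairs of subsets, so it is usable for a handful of sources at most, and the coverage sets and function names are hypothetical.

```python
from itertools import combinations


def is_submodular(ground, f):
    """Check f(A + x) - f(A) >= f(B + x) - f(B) for every A within B and x outside B."""
    items = sorted(ground)
    subsets = [frozenset(c) for n in range(len(items) + 1) for c in combinations(items, n)]
    for a in subsets:
        for b in subsets:
            if not a <= b:
                continue
            for x in items:
                if x in b:
                    continue
                if f(a | {x}) - f(a) < f(b | {x}) - f(b):
                    return False
    return True


# Hypothetical sources and the items each one covers.
COVERS = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1}}


def covered_items(selected):
    """Coverage-style benefit: number of distinct items covered."""
    return len(set().union(*(COVERS[s] for s in selected)))


def squared_size(selected):
    """A counterexample: gains grow as the set grows."""
    return len(selected) ** 2


print(is_submodular(COVERS, covered_items))  # coverage has diminishing returns
print(is_submodular(COVERS, squared_size))   # squared size does not
```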

SUBMODULAR OBJECTIVE FUNCTIONS

Submodular Maximization (MaxSub)

Start by selecting the best source
Explore local neighborhood: add/delete sources
Either selected set or complement is a local optimum

Constant factor approximation [Feige, '11]

Highly efficient

Empirically high-quality even for non-submodular functions
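The three steps listed for MaxSub can be sketched as a plain local search. This is a simplified illustration: it accepts any strict improvement, whereas the published analysis behind the constant-factor guarantee requires a minimum improvement threshold, so no guarantee is claimed for this version. The toy profit and source names are hypothetical.

```python
def max_sub(sources, f):
    """Local search: start from the best single source, add or delete one
    source at a time while f improves, then return the better of the
    selected set and its complement."""
    ground = set(sources)
    order = sorted(ground)
    selected = {max(order, key=lambda s: f({s}))}
    improved = True
    while improved:
        improved = False
        for s in order:
            neighbor = selected ^ {s}  # add s if absent, delete it if present
            if f(neighbor) > f(selected):
                selected, improved = neighbor, True
                break
    complement = ground - selected
    return selected if f(selected) >= f(complement) else complement


# Hypothetical sources: the items each one covers and its integration cost.
COVERS = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1}}
COST = {"a": 1, "b": 1, "c": 1, "d": 2}


def profit(selected):
    """Toy objective: number of covered items minus the summed cost."""
    covered = set().union(*(COVERS[s] for s in selected))
    return len(covered) - sum(COST[s] for s in selected)


result = max_sub(COVERS, profit)
print(sorted(result), profit(result))
```

Unlike the randomized multi-run approach, this search makes a single deterministic pass through improving moves, which is one plausible reason for the efficiency noted above.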

SELECTED EXPERIMENTS

Business listings (BL)
~40 sources, 2 years
~1,400 categories, 51 locations

World-wide Event listings GDELT @gdeltproject.org
15,275 sources, 1 month
236 event types, 242 locations

WORLD CHANGE ESTIMATION

Small relative error even with little training data
Expected increasing trend over time

SOURCE CHANGE ESTIMATION

Small relative error for source quality

SELECTION QUALITY

BENEFIT  METRIC  MSR.   GREEDY          MAXSUB          GRASP
LINEAR   cov.    best   16.7%           50%             100% (5,20)
                 diff.  .005 (.01)%     .001 (.007)%    -
         acc.    best   0.0%            33.3%           83.3% (2,100)
                 diff.  9.5 (53.7)%     .39 (2.31)%     8.9% (53.7)%
STEP     cov.    best   50.0%           66.7%           83.3% (10,100)
                 diff.  7.45 (27.8)%    .012 (.06)%     .7 (4.2)%
         acc.    best   50%             66.7%           83.3% (5,100)
                 diff.  6 (23.98)%      1.76 (10.6)%    3.99 (23.98)%

best: perc. of times finding the best solution
diff.: avg. and worst quality loss

Grasp finds the best solution most of the time

MaxSub solutions are mostly comparable to Grasp

but there are cases when Grasp is significantly worse

INCREASING NUMBER OF SOURCES

[Figure: Scalability of source selection. Time (msec, log scale from 1E+00 to 1E+06) against # data sources (43 to 11,239) for Greedy, MaxSub, Grasp (5,20) and Grasp (10,100)]

MaxSub is one to two orders of magnitude faster

SELECTION CHARACTERISTICS

Accuracy selects fewer, more focused sources

CONCLUSIONS

Source selection before data integration to increase quality and reduce cost

Collection of statistical models to describe the evolution of the world and the updates of sources

Exploiting submodularity gives more efficient solutions with rigorous guarantees

Thank you!
