UNIVERSITY OF MARYLAND GOOGLE INC. AT&T LABS-RESEARCH THEODOROS REKATSINAS XIN LUNA DONG DIVESH...
UNIVERSITY OF MARYLAND, GOOGLE INC., AT&T LABS-RESEARCH
THEODOROS REKATSINAS, XIN LUNA DONG, DIVESH SRIVASTAVA
CHARACTERIZING AND SELECTING FRESH DATA SOURCES
DATA IS A COMMODITY
Myriads of data sources
FREELY AVAILABLE SOURCES
Open data initiative, World Data Bank, crawling the web
DATA MARKETS: SELL YOUR DATA TO OTHERS
DataSift, Microsoft Azure Marketplace, datamarket.com, Infochimps
HETEROGENEOUS DATA SOURCES
Different quality and cost
Cover different topics
Static or dynamic
Exhibit different update patterns
“Truth, like gold, is to be obtained not by its growth, but by washing away from it all that is not gold.”
- LEO TOLSTOY
So, can we find gold in a systematic and automated fashion?
Yes! Use source selection to reason about the benefits and costs of acquiring and integrating data sources [Dong et al., 2013]
Techniques are agnostic to time and focus on the accuracy of static sources
matters
Select sources before actual data integration
When do we need to use the integration result?
Data in the world and in the sources change
IT IS A DYNAMIC WORLD
[Figure: timelines of the world and two data sources, one updated every 2 time points and one updated every 3 time points]
CHALLENGES AND OPPORTUNITIES
Business listings (BL): ~40 sources, 2 years, ~1,400 categories, 51 locations
QUALITY CHANGES OVER TIME
The optimal set of sources changes over time
LOWER COST OPPORTUNITIES
Integrate updates at a lower frequency to lower cost
TIME-BASED SOURCE QUALITY
[Figure legend: UpToDate Entries, OutOfDate Entries, NonDeleted Entries]
Coverage(S, t) = Entries(S, t) / Entries(W, t)
Freshness(S, t) = UpToDate Entries(S, t) / Entries(S, t)
(S: data source, W: world, t: time point)
Coverage ~ Recall, Freshness ~ Precision
Combine: Accuracy(S, t)
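A minimal sketch of the two ratios above, assuming the world and one source snapshot are each a dict from entity id to its value at a single time point. The function names and toy data are hypothetical, and counting every world entity listed by the source toward coverage is an assumption, not necessarily the authors' exact definition.

```python
def coverage(source: dict, world: dict) -> float:
    """Fraction of world entities that the source lists (recall-like)."""
    if not world:
        return 0.0
    present = sum(1 for entity in world if entity in source)
    return present / len(world)


def freshness(source: dict, world: dict) -> float:
    """Fraction of source entries whose value matches the world (precision-like)."""
    if not source:
        return 0.0
    up_to_date = sum(1 for e, v in source.items() if world.get(e) == v)
    return up_to_date / len(source)


# Hypothetical snapshot: "gym" is out of date in the source, and "diner"
# no longer exists in the world but was never deleted from the source.
world = {"cafe": "open", "bakery": "open", "gym": "moved", "bank": "open"}
source = {"cafe": "open", "gym": "open", "diner": "open"}

cov = coverage(source, world)     # 2 of 4 world entities are listed
fresh = freshness(source, world)  # 1 of 3 source entries is up to date
```

Here the source lists half of the world but only a third of what it lists is current, which is why the two metrics are tracked separately.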
SELECTING FRESH SOURCES
Time-aware source selection
EXTENSIONS
Optimal frequency
Subset of provided data
Time-aware source selection with many more sources
PROPOSED FRAMEWORK
HISTORICAL SNAPSHOTS OF AVAILABLE SOURCES
Pre-processing
Statistical Modeling
UPDATE MODELS FOR SOURCES
EVOLUTION MODELS FOR DATA DOMAIN
Source selection
USE STATISTICAL MODELS TO ESTIMATE QUALITY OF INTEGRATED DATA
INTEGRATION COST MODEL
Maximize Quality - Cost Tradeoff
SELECT OPTIMAL SUBSET OF SOURCES
WORLD EVOLUTION MODELS
Poisson Random Process
Exponentially distributed changes
Integrate available data source snapshots to extract the evolution of the world
Ensemble of parametric models
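A minimal sketch of fitting a homogeneous Poisson model to world changes, assuming the changes are observed as timestamps over a window of known length. The maximum-likelihood intensity is the number of events divided by the window length, and the gaps between events are then exponentially distributed with mean 1/λ. The function names and numbers are hypothetical.

```python
import math


def fit_poisson_rate(event_times: list, window: float) -> float:
    """Maximum-likelihood intensity λ: observed events per unit time."""
    if window <= 0:
        raise ValueError("window must be positive")
    return len(event_times) / window


def expected_changes(rate: float, horizon: float) -> float:
    """Expected number of changes in the next `horizon` time units."""
    return rate * horizon


def prob_change_within(rate: float, horizon: float) -> float:
    """P(at least one change within `horizon`), from the exponential gap law."""
    return 1.0 - math.exp(-rate * horizon)


# Hypothetical change timestamps (in days) observed over a 12-day window.
times = [0.5, 1.2, 3.1, 4.0, 7.7, 9.3]
lam = fit_poisson_rate(times, window=12.0)   # 6 events / 12 days
upcoming = expected_changes(lam, horizon=4.0)
p_any = prob_change_within(lam, horizon=4.0)
```

An ensemble in this setting could hold one such rate per category or location; that partitioning is an assumption here.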
SOURCE UPDATE MODELS
Shall we consider only the update frequency?
High update frequency does not imply high freshness
Update frequency of the source
Empirical effectiveness distributions
Ensemble of non-parametric models
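One possible reading of an empirical effectiveness distribution, sketched under assumptions: for each world change, record the delay until the source reflects it (None if it never does), then estimate the probability that a change is captured within d time units. The representation, names, and data are hypothetical.

```python
def effectiveness(delays: list, d: float) -> float:
    """Empirical P(a world change is captured by the source within delay d)."""
    if not delays:
        return 0.0
    captured = sum(1 for x in delays if x is not None and x <= d)
    return captured / len(delays)


# Hypothetical capture delays (in days) for five world changes; None = never captured.
frequent_but_shallow = [1, None, None, 2, None]  # updates often, misses most changes
rare_but_thorough = [6, 7, 5, 7, 6]              # updates weekly, captures everything

eff_frequent = effectiveness(frequent_but_shallow, d=7)
eff_rare = effectiveness(rare_but_thorough, d=7)
```

In this toy data the frequently updated source captures fewer changes within a week than the rarely updated one, which matches the point that update frequency alone does not determine freshness.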
SOURCE QUALITY ESTIMATION
Combine statistical models
NewQuality( , ; ) as a function of OldQuality( , ; ) and ΔQuality( , ; )
Coverage( , ) = Entries( , ) / Entries( , ) = ?
Estimating Entries( , ): use the intensity rates λ of the Poisson models
Entries( , ) +
Estimating Entries( , ):
Entries( , ) Pr(Exist( , )) + New Entries( , ) Pr(Exist( , ))
SOLVING SOURCE SELECTION
Maximize marginal gain
Greedy
Start with an empty solution and add sources greedily
No quality guarantees; solutions can be arbitrarily bad
Highly efficient
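A minimal sketch of the greedy strategy, assuming each source is a set of covered entity ids with a fixed integration cost and the objective is coverage benefit minus total cost. The objective, data, and function names are hypothetical, not the authors' benefit or cost model.

```python
def objective(selected, sources, costs, universe_size):
    """Coverage benefit minus linear integration cost (a hypothetical objective)."""
    covered = set().union(*(sources[s] for s in selected))
    return len(covered) / universe_size - sum(costs[s] for s in selected)


def greedy_select(sources, costs, universe_size):
    """Start empty; repeatedly add the source with the largest positive marginal gain."""
    selected = frozenset()
    current = objective(selected, sources, costs, universe_size)
    while True:
        best, best_value = None, current
        for name in sources:
            if name in selected:
                continue
            value = objective(selected | {name}, sources, costs, universe_size)
            if value > best_value:
                best, best_value = name, value
        if best is None:  # no remaining source improves the objective
            return selected, current
        selected, current = selected | {best}, best_value


# Hypothetical sources over a universe of 10 entities.
sources = {"A": set(range(0, 6)), "B": set(range(4, 9)), "C": {0, 1}}
costs = {"A": 0.1, "B": 0.1, "C": 0.15}
chosen, value = greedy_select(sources, costs, universe_size=10)
```

On this data the procedure picks A, then B, and stops because adding C costs more than the coverage it contributes.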
ARBITRARY OBJECTIVE FUNCTIONS
GRASP (k,r) [used in Dong et al., `13]
Local-search and randomized hill-climbing
Run r times and keep best solution
Empirically high-quality solutions
Very expensive
INSIGHTS FOR QUALITY GUARANTEES
A large family of benefit functions are monotone submodular (e.g., functions of coverage)
Under a linear cost function the marginal gain is submodular
[Figure: sets A ⊆ B and an element x outside B]
f(A ∪ {x}) - f(A) ≥ f(B ∪ {x}) - f(B)
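A small check of the diminishing-returns property for a coverage function: adding a source x to a set A helps at least as much as adding it to a superset B. The sources are hypothetical toy data.

```python
def covered(selected, sources):
    """Coverage benefit f: number of distinct entities covered by the selected sources."""
    return len(set().union(*(sources[s] for s in selected)))


sources = {"s1": {1, 2, 3}, "s2": {3, 4}, "x": {2, 4, 5}}
A = {"s1"}
B = {"s1", "s2"}  # A is a subset of B

# Marginal gain of adding x to the smaller set and to the larger set.
gain_A = covered(A | {"x"}, sources) - covered(A, sources)
gain_B = covered(B | {"x"}, sources) - covered(B, sources)
```

Here x adds two new entities to A but only one to B, because B already covers entity 4.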
SUBMODULAR OBJECTIVE FUNCTIONS
Submodular Maximization (MaxSub)
Start by selecting the best source
Explore local neighborhood: add/delete sources
Either selected set or complement is a local optimum
Constant factor approximation [Feige, `11]
Highly efficient
Empirically high-quality even for non-submodular functions
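A simplified sketch in the spirit of the add/delete local search described above, assuming a set function f given as a Python callable. The improvement threshold `eps`, the objective, and the data are hypothetical, and this simplification does not by itself carry the approximation guarantee cited on the slide.

```python
def local_search(items, f, eps=1e-9):
    """Start from the best single item, then add or delete items while f improves."""
    items = list(items)
    current = {max(items, key=lambda x: f({x}))}
    improved = True
    while improved:
        improved = False
        for x in items:
            # Toggle x: delete it if selected, add it otherwise.
            candidate = current - {x} if x in current else current | {x}
            if f(candidate) > f(current) + eps:
                current, improved = candidate, True
    # Return the better of the local optimum and its complement.
    complement = set(items) - current
    return max(current, complement, key=f)


# Hypothetical objective: coverage of 10 entities minus a linear cost.
sources = {"A": set(range(0, 6)), "B": set(range(4, 9)), "C": {0, 1}}
costs = {"A": 0.1, "B": 0.1, "C": 0.15}


def f(selected):
    covered = set().union(*(sources[s] for s in selected))
    return len(covered) / 10 - sum(costs[s] for s in selected)


result = local_search(sources, f)
```

Every accepted move raises f by more than `eps`, so the loop terminates on a finite set of sources.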
SELECTED EXPERIMENTS
Business listings (BL): ~40 sources, 2 years, ~1,400 categories, 51 locations
World-wide event listings (GDELT, gdeltproject.org): 15,275 sources, 1 month, 236 event types, 242 locations
WORLD CHANGE ESTIMATION
Small relative error even with little training data
Expected increasing trend over time
SOURCE CHANGE ESTIMATION
Small relative error for source quality
SELECTION QUALITY
best: perc. of times finding the best solution; diff.: avg. (worst) quality loss; Grasp (k,r) setting in parentheses

BENEFIT  METRIC  MSR.   GREEDY        MAXSUB        GRASP
LINEAR   cov.    best   16.7%         50%           100% (5,20)
                 diff.  .005 (.01)%   .001 (.007)%  -
LINEAR   acc.    best   0.0%          33.3%         83.3% (2,100)
                 diff.  9.5 (53.7)%   .39 (2.31)%   8.9 (53.7)%
STEP     cov.    best   50.0%         66.7%         83.3% (10,100)
                 diff.  7.45 (27.8)%  .012 (.06)%   .7 (4.2)%
STEP     acc.    best   50%           66.7%         83.3% (5,100)
                 diff.  6 (23.98)%    1.76 (10.6)%  3.99 (23.98)%

Grasp finds the best solution most of the time
MaxSub solutions are mostly comparable to Grasp, but there are cases when Grasp is significantly worse
INCREASING NUMBER OF SOURCES
[Figure: Scalability of source selection. x-axis: # data sources (43 to 11,239); y-axis: time (msec), log scale from 1E+00 to 1E+06; series: Greedy, MaxSub, Grasp (5,20), Grasp (10,100)]
MaxSub is one to two orders of magnitude faster
SELECTION CHARACTERISTICS
Accuracy selects fewer, more focused sources
CONCLUSIONS
Source selection before data integration to increase quality and reduce cost
Collection of statistical models to describe the evolution of the world and the updates of sources
Exploiting submodularity gives more efficient solutions with rigorous guarantees
Thank you! [email protected]