VLDB 2006, Seoul

Indexing For Function Approximation

Biswanath Panda, Mirek Riedewald, Stephen B. Pope, Johannes Gehrke, L. Paul Chew

Cornell University

Motivation

• Simulations are important in science
• Large simulations are computationally infeasible
  – Driven by complex mathematical models
  – Require solutions to complex differential equations
• Approximation techniques speed up simulations
  – Bounded error in the simulation
  – Approximate simulation steps using information from previous steps

Outline

• Example scientific application
  – Combustion simulation
• Function approximation problem
  – Formulation
  – Hardness
  – Algorithm
• Indexing problem

Combustion Simulation

[Figure: combustion simulation schematic. Labels: High Dimensional Composition Vector; Inflow; Outflow; Mixing & Reaction; Air; Methane; Air + Methane.]

Properties Of Simulation

• Composition dimensionality
  – 9 for simple hydrogen simulations
  – >50 for complex methane simulations
• Cost of reaction function evaluation: 30 ms
• Number of function evaluations: 10⁸ to 10¹⁰
• Total simulation time
  – 10⁸ function evaluations ≈ 35 days
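The 35-day figure follows directly from the per-evaluation cost. The short sketch below reproduces the arithmetic, assuming the evaluations run strictly one after another on a single processor; the function and variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def total_days(num_evaluations, seconds_per_evaluation=0.030):
    """Serial wall-clock time, in days, for the given number of evaluations."""
    return num_evaluations * seconds_per_evaluation / SECONDS_PER_DAY

days_low = total_days(10**8)    # roughly 35 days
days_high = total_days(10**10)  # roughly 3472 days, about 9.5 years
```

At the upper end of the range the same calculation gives several years of computation, which is the gap the approximation is meant to close.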

Function Approximation

• Approximate the reaction function
• Approach
  – Use previous function evaluations to approximate future function evaluations
  – ISAT (In Situ Adaptive Tabulation) [Pope '97]
• Definition: ε-approximation of f(x)
  – Let f: R^m → R^n be a function, let x ∈ R^m and ε ∈ R. f*(x) is an ε-approximation of f(x) if || f*(x) – f(x) || < ε
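The definition can be checked mechanically once both vectors are known. The sketch below assumes the Euclidean norm (the slides do not name the norm) and uses made-up vectors; `is_eps_approximation` is a hypothetical helper name.

```python
import math

def is_eps_approximation(f_star_x, f_x, eps):
    """True if the Euclidean distance ||f*(x) - f(x)|| is strictly below eps."""
    if len(f_star_x) != len(f_x):
        raise ValueError("vectors must have the same dimension n")
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_star_x, f_x)))
    return distance < eps

exact = [1.00, 2.00, 3.00]    # made-up value of f(x)
approx = [1.01, 1.99, 3.00]   # made-up value of f*(x)

ok = is_eps_approximation(approx, exact, eps=0.05)         # distance is about 0.0141
too_tight = is_eps_approximation(approx, exact, eps=0.01)  # same distance, smaller eps
```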

Example

[Figure: plot of the function f. Label: Cost.]

Example

[Figure: plot of f around the point ( x, f(x) ), with query points x1 and x2 and two ε markers. Labels: Original Cost; Cost.]

f*(x2) = f(x) + s * (x2 - x)

An ε-Local Region R_{f,f*}(x, ε) ⊆ R^m
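The linear approximation f*(x2) = f(x) + s * (x2 - x) and the notion of an ε-Local Region can be illustrated in one dimension. In the sketch below, f(x) = x² is a hypothetical stand-in for the expensive function and s is taken to be its derivative at x; neither choice comes from the slides.

```python
def f(x):
    """Hypothetical stand-in for the expensive function."""
    return x * x

def f_star(x2, x, s):
    """Linear approximation of f at x2, built from the stored point (x, f(x)) and slope s."""
    return f(x) + s * (x2 - x)

def in_local_region(x2, x, s, eps):
    """True if the linear approximation at x2 is within eps of the true value."""
    return abs(f_star(x2, x, s) - f(x2)) < eps

x, s, eps = 1.0, 2.0, 0.01
near = in_local_region(1.05, x, s, eps)  # error is about 0.0025
far = in_local_region(1.20, x, s, eps)   # error is about 0.04
```

Note that this membership test evaluates f at x2, which is exactly the cost the approximation is supposed to avoid; it only illustrates the definition, not how a region would be stored or queried.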

Example

[Figure: plot of f with approximations f1*, f2*, f3* over query points x1 to x6. Labels: Original Cost; Cost.]

Example

[Figure: plot of f with approximations f1*, f2*, f3* over query points x1 to x6.]

When should a local region be added?

Example

Each query point can be covered by several Local Regions

[Figure: plot of f with approximations f1*, f2*, f3*, f4* over query points x1 to x8.]

Challenges

• Finding good f*s and corresponding Local Regions
• Computing a set of Local Regions
• Data management: storing Local Regions for future use
• Problem: Minimize total simulation time by computing and storing a set of Local Regions

Finding The Optimal Set Of Local Regions

• Simplified cost model
  – Both the function value and the Local Region at a point can be obtained at some constant cost, equal across all regions
  – Approximations have zero cost
• Offline Problem
  – Given a set X = { x1, x2, … xn } of query points, find the smallest set L = { l1, l2, … lk } of Local Regions such that for each xi ∈ X there is an lj ∈ L which contains xi
  – NP-Complete: Reduction from Geometric Covering By Discs
• Online Problem
  – No online algorithm is competitive
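To make the offline problem statement concrete, the sketch below searches exhaustively for a smallest covering subset on a tiny instance. It is not the method from the slides: it assumes a finite list of candidate regions is given, models each region as a 1-D closed interval, and takes time exponential in the number of candidates, which is consistent with the hardness result but says nothing about practical algorithms.

```python
from itertools import combinations

def covers(regions, points):
    """True if every point lies inside at least one (lo, hi) interval."""
    return all(any(lo <= p <= hi for lo, hi in regions) for p in points)

def smallest_cover(candidates, points):
    """Try subsets in order of increasing size; return the first that covers all points."""
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if covers(subset, points):
                return list(subset)
    return None  # the candidates cannot cover the points at all

# Hypothetical query points and candidate Local Regions (1-D intervals).
X = [0.1, 0.4, 0.9, 1.5]
candidates = [(0.0, 0.5), (0.3, 1.0), (0.8, 1.6), (0.0, 1.0)]
L = smallest_cover(candidates, X)
```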

Algorithm Illustration

[Figure: plot of f with approximations f1*, f2*, f3*, f4* over query points x1 to x8.]

Algorithm

[Flowchart, rendered as steps]

• Initialize S
• For each query point x issued by the Simulation
  – Retrieve: Lookup x in S
  – Local Region found?
    • Y: Return approximation
    • N: Evaluate function at x, then Add: add new region containing x to S
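The retrieve/add loop in the flowchart can be sketched as follows. This is a simplified 1-D illustration under stated assumptions: each Local Region is an interval of fixed radius around the point where f was evaluated, lookup is a linear scan, and `simulate`, `slope`, and `radius` are hypothetical names rather than anything defined in the slides.

```python
def simulate(queries, f, slope, radius):
    """Answer each query from a stored Local Region if possible, else evaluate f and add one."""
    S = []            # stored regions as (center, f(center), slope at center, radius)
    results = []
    evaluations = 0
    for x in queries:
        # Retrieve: linear scan for a region containing x.
        hit = next((r for r in S if abs(x - r[0]) <= r[3]), None)
        if hit is not None:
            c, fc, s, _ = hit
            results.append(fc + s * (x - c))   # return the linear approximation
        else:
            fx = f(x)                          # expensive evaluation
            evaluations += 1
            S.append((x, fx, slope(x), radius))  # Add: new region containing x
            results.append(fx)
    return results, evaluations, S

results, evaluations, S = simulate(
    queries=[1.0, 1.02, 2.0, 1.99],
    f=lambda x: x * x,        # hypothetical stand-in function
    slope=lambda x: 2 * x,    # its derivative
    radius=0.05,              # assumed fixed region size
)
```

In this run only two of the four queries trigger a function evaluation; the other two are answered from stored regions.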

Possible Instantiation Of Local Regions

• Local Regions can be approximated using high-dimensional ellipsoids [Pope '97]
  – Based on Taylor expansion of the function
• Two-step approach
  – Initial conservative approximation
  – Grow

[Figure labels: x, x1]
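Checking whether a query point falls inside an ellipsoidal Local Region is a quadratic-form test. The sketch below assumes the common representation of an ellipsoid by a center c and a symmetric positive-definite matrix A, with membership defined by (q - c)ᵀ A (q - c) ≤ 1; the slides do not specify how ellipsoids are stored, so this representation is an assumption.

```python
def contains(center, A, q):
    """True if q lies inside the ellipsoid {p : (p - c)^T A (p - c) <= 1}."""
    d = [qi - ci for qi, ci in zip(q, center)]
    Ad = [sum(a * dj for a, dj in zip(row, d)) for row in A]
    return sum(di * adi for di, adi in zip(d, Ad)) <= 1.0

# Axis-aligned 2-D example with semi-axes 2 and 1, so A = diag(1/2^2, 1/1^2).
center = [0.0, 0.0]
A = [[0.25, 0.0],
     [0.0, 1.0]]

inside = contains(center, A, [1.5, 0.5])   # 0.5625 + 0.25 = 0.8125
outside = contains(center, A, [1.5, 0.9])  # 0.5625 + 0.81 = 1.3725
```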

Example

[Figure sequence over three slides: a region around x, first with points x1 and x2, then with points x'1 and x'2. Labels: ε; ε' < ε.]

Updating Existing Regions

[Flowchart, rendered as steps]

• N (no Local Region found): Evaluate function at x
• Can existing region contain x?
  – Y: Grow: update existing regions to contain x
  – N: Add new region containing x to S
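The grow-or-add decision after a miss can be sketched in one dimension. The criterion used here, that a region may grow to include x when its linear approximation at x is within ε of the freshly evaluated f(x), is an assumption suggested by the ε' < ε example rather than something the slides state; intervals stand in for ellipsoids, and all names are hypothetical.

```python
def handle_miss(S, x, f, slope, eps, init_radius):
    """After a failed retrieve: evaluate f(x), then grow an existing region or add a new one."""
    fx = f(x)  # the expensive evaluation has to happen on every miss
    for r in S:
        approx = r["fc"] + r["s"] * (x - r["c"])
        if abs(approx - fx) < eps:
            # Grow: extend the interval just enough to contain x.
            r["lo"] = min(r["lo"], x)
            r["hi"] = max(r["hi"], x)
            return fx, "grow"
    # Add: no existing region approximates f well enough at x.
    S.append({"c": x, "fc": fx, "s": slope(x),
              "lo": x - init_radius, "hi": x + init_radius})
    return fx, "add"

f = lambda x: x * x          # hypothetical stand-in function
slope = lambda x: 2 * x
S = [{"c": 1.0, "fc": 1.0, "s": 2.0, "lo": 0.95, "hi": 1.05}]

_, action_near = handle_miss(S, 1.08, f, slope, eps=0.01, init_radius=0.05)
_, action_far = handle_miss(S, 2.0, f, slope, eps=0.01, init_radius=0.05)
```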


Indexing Problem

• Workload
  – Retrieve: Find ellipsoid containing query point
  – Grow
    • Find ellipsoids to be grown
    • Update grown ellipsoids
  – Add: Insert a new ellipsoid

New Indexing Problem

• Shape of regions
• Updates and queries interleaved
• Additional costs: ellipsoid maintenance costs
• Overall aim: Reduce total simulation time
• Retrieve/grow/add are all optional
  – Tuning parameters at each step

Operation       Cost
Evaluation      2000
Addition        1200
Grow              10
Approximation      1
Search             1

Outline

• Example scientific application
  – Combustion simulation
• Function approximation problem
  – Formulation
  – Hardness
  – Algorithm
• Indexing problem
  – Cost structure, tuning parameters and effects
  – Index structures and experiments

Grow Effects

C_miss = t_f + t_growsearch + I_grow * C_grow + (1 - I_grow) * C_add

• Tuning Parameter: Ell_g
  – Limit on number of ellipsoids examined for growing
  – No pruning criteria
  – Affects
    • t_growsearch
    • Chance of finding a growable ellipsoid
• Tuning Parameter: N_grown
  – Number of ellipsoids grown per step
  – Affects
    • C_grow
    • Structure of the index (overlapping ellipsoids)
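The miss-cost formula can be evaluated for the two possible outcomes of a miss. The sketch below reads I_grow as a 0/1 indicator of whether a growable ellipsoid was found, which is an interpretation; it borrows the Evaluation, Grow, and Addition costs from the Operation/Cost table as t_f, C_grow, and C_add, which is an assumed mapping, and the value of t_growsearch is purely hypothetical.

```python
def c_miss(t_f, t_growsearch, i_grow, c_grow, c_add):
    """Cost of a miss: evaluate f, search for growable ellipsoids, then grow or add."""
    return t_f + t_growsearch + i_grow * c_grow + (1 - i_grow) * c_add

T_F, C_GROW, C_ADD = 2000, 10, 1200   # assumed mapping from the Operation/Cost table
T_GROWSEARCH = 5                      # hypothetical search cost

miss_with_grow = c_miss(T_F, T_GROWSEARCH, 1, C_GROW, C_ADD)  # 2000 + 5 + 10
miss_with_add = c_miss(T_F, T_GROWSEARCH, 0, C_GROW, C_ADD)   # 2000 + 5 + 1200
```

Under these numbers the function evaluation dominates either way, but an add costs roughly 1.6 times as much as a grow, which is why raising Ell_g can pay off even though it increases t_growsearch.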

Retrieve Effects

C_tot = t_search + I_ret * t_la + (1 - I_ret) * C_miss

• Tuning Parameter: Ell_r
  – Limit on number of ellipsoids examined during retrieve
  – Limits how much of the index is searched
  – Affects
    • t_search
    • Chance that the current retrieve, and also future retrieves, succeed
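The total-cost formula shows how strongly the retrieve hit rate drives simulation time. In the sketch below, I_ret is read as a 0/1 indicator of a successful retrieve and t_la as the cost of computing the approximation; both readings are assumptions, and the numeric costs (including the miss cost of 2015) are illustrative rather than measured.

```python
def c_tot(t_search, i_ret, t_la, c_miss):
    """Cost of one query: search, then either approximate (hit) or pay the miss cost."""
    return t_search + i_ret * t_la + (1 - i_ret) * c_miss

T_SEARCH, T_LA, C_MISS = 1, 1, 2015   # illustrative costs, not measurements

cost_hit = c_tot(T_SEARCH, 1, T_LA, C_MISS)   # 1 + 1
cost_miss = c_tot(T_SEARCH, 0, T_LA, C_MISS)  # 1 + 2015

def expected_cost(hit_rate):
    """Average per-query cost if a fraction hit_rate of retrieves succeed."""
    return c_tot(T_SEARCH, hit_rate, T_LA, C_MISS)

avg_99 = expected_cost(0.99)  # about 22.14
avg_90 = expected_cost(0.90)  # about 203.4
```

Even a small drop in hit rate multiplies the average cost, so limiting the search with Ell_r trades a cheaper t_search against more frequent misses.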

Add Effects

C_miss = t_f + t_growsearch + I_grow * C_grow + (1 - I_grow) * C_add

• Tuning parameter: Indirectly controlled by retrieves and grows
  – Affects
    • Should query point be covered by an add or grow?
      (-) Computing new ellipsoids is expensive
      (-) New ellipsoids cover smaller part of the domain
      (+) May lead to better ellipsoid distribution

Candidate Index Structures

• Bounding Box Rtree
• Point Rtree
• Ellipsoid Rtree
• Random Projection Rtree
• Binary Tree
• MRU List + Rtree

Binary Tree

[Figure sequence over four slides: a space containing A, B, C and a binary tree over them, with labels 1, 2 and query point q. Captions, in order: Primary Retrieve; Secondary Retrieve; (no caption); Secondary Retrieve now Primary Retrieve, where labels 3 and D are added.]

Effects In Action: Binary Tree

• 32-dimensional methane simulation
• 6 × 10⁶ queries
• Windows XP machine (2.4 GHz, 2 GB)

MRU List + Rtree

• MRU List for retrieving
  – High locality
• Rtree for searching growable ellipsoids

[Figure labels: MRU List; Rtree]
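A most-recently-used list exploits locality by checking the regions that answered recent queries first. The sketch below is a minimal move-to-front list, not the implementation evaluated in the talk; regions are 1-D intervals for brevity, the containment test is passed in by the caller, and the optional `limit` argument plays the role of a cap on the number of ellipsoids examined (all of these are assumptions).

```python
class MRUList:
    """Move-to-front list of regions; the most recently used region is examined first."""

    def __init__(self):
        self._regions = []

    def add(self, region):
        """New regions go to the front, since they were just used."""
        self._regions.insert(0, region)

    def retrieve(self, q, contains, limit=None):
        """Scan at most `limit` regions from the front; move a hit to the front."""
        candidates = self._regions if limit is None else self._regions[:limit]
        for i, region in enumerate(candidates):
            if contains(region, q):
                self._regions.insert(0, self._regions.pop(i))
                return region
        return None

in_interval = lambda region, q: region[0] <= q <= region[1]

mru = MRUList()
for interval in [(0, 1), (2, 3), (4, 5)]:
    mru.add(interval)

first = mru.retrieve(0.5, in_interval)            # found at the back, moved to front
second = mru.retrieve(0.6, in_interval, limit=1)  # found immediately thanks to locality
third = mru.retrieve(2.5, in_interval, limit=1)   # exists, but outside the search limit
```

The last call shows the trade-off of a search limit: the region exists in the list, yet the retrieve reports a miss because only the front entry was examined.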


Effects In Action: MRU List + Rtree

• Effects very different from Binary Tree

Total Simulation Times

Index Type                 Error Tolerance
                           0.005     0.00005   0.00004
Binary Tree (tuned)         1073      10181     13100
MRU List + Rtree            1125      14000     19920
Bbox Rtree                  1201      14700     20850
Random Projection Rtree     1378      15800     22051
Binary Tree (default)       1344      29186     31200
FIFO List + Rtree           2154      33770     42900
Point Rtree                10431     >44000         -
Ellipsoidal Rtree          14328     >44000         -

Conclusion & Future Work

• Formulated the function approximation problem
• New class of applications for high-dimensional indexing
• Understand index selection for function approximation
• Future work
  – Dynamic parameter settings
  – New benchmark for index structures
  – Evaluation of other index structures
  – Comparison with other function approximation techniques


Questions?