
WILD PROJECT REVIEW

Efficient Allocation Algorithms For OLAP Over Imprecise Data

Doug Burdick

University of Wisconsin – Madison

Raghu Ramakrishnan

Yahoo! Research

Prasad Deshpande

IBM India Research Lab, SIRC

Shivakumar Vaithyanathan

IBM Almaden Research Center

T.S. Jayram

IBM Almaden Research Center

[Figure: dimension hierarchies. LOCATION: State (MA, NY, TX, CA) → Region (East, West) → ALL. AUTOMOBILE: Model (Civic, Camry, F150, Sierra) → Category (Sedan, Truck) → ALL.]

FactID Auto Loc Repair

p1 F150 NY 100

p2 Sierra NY 500

p3 F150 MA 100

p4 Sierra MA 200

Multidimensional Data

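[Figure: facts p1 to p4 placed as points in the Location × Automobile grid.]

Imprecise Data

FactID Auto Loc Repair

p1 F150 NY 100

p2 Sierra NY 500

p3 F150 MA 100

p4 Sierra MA 200

p5 Truck MA 100

[Figure: imprecise fact p5 = <MA,Truck> covers the cells (MA, F150) and (MA, Sierra).]

[BDJ+05] Burdick et al. OLAP Over Uncertain and Imprecise Data. In VLDB 2005.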

Sources of Imprecision: Dimensions Extracted From Free Text

Assume an extractor for Auto dimension values is given

FactID Text

p1 Brakes on F150…

p2 Rotors on the Sierra are…

p3 The F150 has…

p4 The Sierra is…

p5 cust’s Sierra is … but their F150 has …

FactID  Text                                        Auto

p1      Brakes on F150…                             F150

p2      Rotors on the Sierra are…                   Sierra

p3      The F150 has…                               F150

p4      The Sierra is…                              Sierra

p5      cust’s Sierra is … but their F150 has …     {Sierra,F150}


More details for dimensions extracted from text in [BDJ+06] Burdick et al. OLAP Over Uncertain and Imprecise Data. To appear in VLDB Journal

FactID  Text                          Auto

p1      Brakes on F150…               F150

p2      Rotors on the Sierra are…     Sierra

p3      The F150 has…                 F150

p4      The Sierra is…                Sierra

p5      cust’s Truck has…             Truck

Sources of Imprecision: Data Integration

Fact table constructed by integrating multiple data sources

Different sources record same dimension attribute at different granularities


p1 F150 NY 100

p2 Sierra NY 500

p3 F150 MA 100

p4 Sierra MA 200

Call Center: p1 to p4 (above)

Mailing List:

p5 Truck NY 100

[Figure: AUTOMOBILE hierarchy, Model → Category → ALL. F150 and Sierra are Models; Truck is a Category.]


p1 F150 NY 100

p2 Sierra NY 500

p3 F150 MA 100

p4 Sierra MA 200

p5 Truck MA 100


Imprecision In Real Data

Obtained real-world dataset from auto manufacturer

Fact table entries from several source relations

Integrated fact table contained 798,570 facts

Real data has many imprecise facts


Querying Imprecise Facts

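[Figure: p1 to p4 occupy single cells of the Location × Automobile grid; imprecise fact p5 = <MA,Truck> spans the cells (MA, F150) and (MA, Sierra).]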


p1 F150 NY 100

p2 Sierra NY 500

p3 F150 MA 100

p4 Sierra MA 200

p5 Truck MA 100

Query: Auto = F150, Loc = MA, SUM(Repair) = ???

Solution: Allocation

Intuitively: replace each imprecise fact r with a set of precise facts, one for each possible completion of r

Each completion is assigned an allocation weight

Refer to the resulting fact table as the Extended Database (EDB)

Queries operate over this Extended Database
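A minimal sketch of this replacement step, assuming only the Auto dimension can be imprecise and assuming a uniform split of the weight across completions. The `LEAVES` mapping and the function names are illustrative, not part of the talk.

```python
# Hypothetical Category -> Model mapping taken from the example hierarchy.
LEAVES = {"Truck": ["F150", "Sierra"], "Sedan": ["Civic", "Camry"]}

# Fact table: (FactID, Auto, Loc, Repair). p5 is imprecise in Auto.
facts = [
    ("p1", "F150", "NY", 100),
    ("p2", "Sierra", "NY", 500),
    ("p3", "F150", "MA", 100),
    ("p4", "Sierra", "MA", 200),
    ("p5", "Truck", "MA", 100),
]

def completions(auto):
    """Return the Model-level values that an Auto value may stand for."""
    return LEAVES.get(auto, [auto])

def build_edb(fact_table):
    """Replace each fact by one weighted row per possible completion."""
    edb = []
    for fact_id, auto, loc, repair in fact_table:
        models = completions(auto)
        weight = 1.0 / len(models)  # assumed uniform allocation policy
        for model in models:
            edb.append((fact_id, model, loc, repair, weight))
    return edb

edb = build_edb(facts)
for row in edb:
    print(row)
```

Precise facts keep weight 1.0, while p5 becomes two rows whose weights sum to 1.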

Handle Imprecision With Allocation

[Figure: imprecise fact p5 = <MA,Truck> is split between the cells (MA, F150) and (MA, Sierra).]

FactID Auto Loc Repair

p1 F150 NY 100

p2 Sierra NY 500

p3 F150 MA 100

p4 Sierra MA 200

p5 Truck MA 100

Extended Database:

ID FactID Auto Loc Repair Weight

1 p1 F150 NY 100 1.0

2 p2 Sierra NY 500 1.0

3 p3 F150 MA 100 1.0

4 p4 Sierra MA 200 1.0

5 p5 F150 MA 100 0.5

6 p5 Sierra MA 100 0.5

Querying The Extended Database

Query: Auto = F150, Loc = MA, SUM(Repair) = ???

EDB rows with Auto = F150:

ID FactID Auto Loc Repair Weight

1 p1 F150 NY 100 1.0

3 p3 F150 MA 100 1.0

5 p5 F150 MA 100 0.5

Rows 3 and 5 also match Loc = MA, so SUM(Repair) = 100 × 1.0 + 100 × 0.5 = 150
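The query above can be sketched as a weighted sum over the matching EDB rows. This is an illustrative sketch: the tuple layout and the function name `weighted_sum` are assumptions, and row 6 (the Sierra completion of p5) is an assumed row that does not affect this query.

```python
# EDB rows: (ID, FactID, Auto, Loc, Repair, Weight).
edb = [
    (1, "p1", "F150", "NY", 100, 1.0),
    (2, "p2", "Sierra", "NY", 500, 1.0),
    (3, "p3", "F150", "MA", 100, 1.0),
    (4, "p4", "Sierra", "MA", 200, 1.0),
    (5, "p5", "F150", "MA", 100, 0.5),
    (6, "p5", "Sierra", "MA", 100, 0.5),  # assumed second completion of p5
]

def weighted_sum(rows, auto, loc):
    """SUM(Repair) where each row contributes Repair times its weight."""
    return sum(
        repair * weight
        for _id, _fact, a, l, repair, weight in rows
        if a == auto and l == loc
    )

result = weighted_sum(edb, "F150", "MA")
print(result)
```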


The procedure for assigning allocation weights is referred to as an allocation policy

Contributions

Propose generalized template for allocation policies presented in [BDJ+05]

Present operational framework for allocation
    Allocation graph formalism
    Used to derive Independent, Block, Transitive Algorithms

Propose Extended Database Maintenance Algorithm
    Update EDB to reflect changes to given fact table

Experimental Evaluation

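Allocation Policy Template

[Figure: imprecise fact r spans cells c1 and c2 in the Location × Automobile grid.]

$$p_{c,r} = \frac{Q(c)}{\mathit{Qsum}(r)}, \qquad \mathit{Qsum}(r) = \sum_{c' \in \mathit{region}(r)} Q(c')$$

For the two cells in the figure:

$$p_{c1,r} = \frac{Q(c1)}{Q(c1) + Q(c2)}, \qquad p_{c2,r} = \frac{Q(c2)}{Q(c1) + Q(c2)}$$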
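A minimal sketch of the template, assuming Q(c) is some non-negative quantity attached to each cell (for example a count of precise facts). The function name `allocate`, the sample Q values, and the uniform fallback for an all-zero region are assumptions made for illustration.

```python
def allocate(region, q):
    """Return p[c] = Q(c) / Qsum(r) for every cell c in region(r)."""
    qsum = sum(q[c] for c in region)
    if qsum == 0:
        # Assumed fallback, not from the slides: split the fact uniformly.
        return {c: 1.0 / len(region) for c in region}
    return {c: q[c] / qsum for c in region}

# Hypothetical Q values for the two cells that fact r overlaps.
q = {"c1": 2, "c2": 1}
weights = allocate(["c1", "c2"], q)
print(weights)
```

Whatever Q is chosen, the weights for one imprecise fact sum to 1 across its region.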

Interactions between overlapping facts

[Figure: imprecise facts p6 and p7 overlap in the Location × Automobile grid, alongside precise facts p1, p2, p4, p5.]

Allocation weights for imprecise fact p6 depend on allocation weights for fact p7 (and vice versa)

Would like assigned weights to capture these interactions

Idea: Repeatedly allocate p6 and p7 until allocation weights converge

Iterative Allocation Policies

1) Initialize Q^0(c) in each cell c (using precise facts)

2) For each iteration t, until all Q^t(c) have converged

    For each imprecise fact r

$$\mathit{Qsum}^{t}(r) = \sum_{c' \in \mathit{region}(r)} Q^{t}(c')$$

    For each cell c

        For each imprecise fact r overlapping c

$$Q^{t+1}(c) = Q^{0}(c) + \sum_{r \,:\, c \in \mathit{region}(r)} \frac{Q^{t}(c)}{\mathit{Qsum}^{t}(r)}$$

3) For each imprecise fact r

    For each cell c in region(r)

$$p_{c,r} = \frac{Q^{t}(c)}{\mathit{Qsum}^{t}(r)}$$
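The three steps above can be sketched in memory as follows. This is an illustration under stated assumptions, not the talk's implementation: the update rule is taken to be Q^{t+1}(c) = Q^0(c) + Σ_r Q^t(c) / Qsum^t(r), and the convergence threshold `eps`, the iteration cap `max_iters`, and the sample data are hypothetical.

```python
def iterative_allocation(q0, regions, eps=1e-6, max_iters=100):
    """Iterate cell quantities until they stabilise, then derive weights.

    q0      -- initial Q^0(c) per cell, computed from precise facts
    regions -- imprecise fact -> list of cells it overlaps
    """
    q = dict(q0)
    for _ in range(max_iters):
        # Qsum^t(r): total quantity over the region of each imprecise fact.
        qsum = {r: sum(q[c] for c in cells) for r, cells in regions.items()}
        # Start from Q^0 and add each overlapping fact's share of the cell.
        q_next = dict(q0)
        for r, cells in regions.items():
            if qsum[r] > 0:
                for c in cells:
                    q_next[c] += q[c] / qsum[r]
        converged = all(abs(q_next[c] - q[c]) < eps for c in q)
        q = q_next
        if converged:
            break
    # Step 3: final weights p[c, r] = Q(c) / Qsum(r).
    qsum = {r: sum(q[c] for c in cells) for r, cells in regions.items()}
    return {
        (c, r): q[c] / qsum[r]
        for r, cells in regions.items()
        if qsum[r] > 0
        for c in cells
    }

# Hypothetical example: p6 and p7 overlap on cell c2.
q0 = {"c1": 2.0, "c2": 1.0, "c3": 1.0}
regions = {"p6": ["c1", "c2"], "p7": ["c2", "c3"]}
weights = iterative_allocation(q0, regions)
print(weights)
```

Because p6 and p7 share c2, the weight p6 gives to c2 changes as p7's share of c2 is added, which is the interaction the iteration is meant to capture.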

Benefits of Iterative Allocation

Imprecise facts can be allocated in any order and the same allocation weights are obtained
    Leverage this idea to obtain scalable allocation algorithms

Leads to Expectation Maximization (EM) framework for allocation
    Final allocation weights have pleasing mathematical properties
    See [BDJ+05] for details

Allocation Graph

[Figure: allocation graph for the running example. One side holds the imprecise facts (e.g. <MA,Truck>); the other holds the precise cells Cell(NY,F150), Cell(NY,Sierra), Cell(MA,F150), Cell(MA,Sierra). The grid view shows p1 to p4 in single cells and imprecise facts p5 and p6 over cells c1 and c2.]

Processing With Allocation Graph

[Figure: the allocation graph with imprecise fact p5 = <MA,Truck> and precise cells Cell(NY,F150), Cell(NY,Sierra), Cell(MA,F150), Cell(MA,Sierra). Values annotated in the figure: 1, 2, 3 and the weights 2/3, 1/3.]

Initialize Q^0(c) in each cell c

$$\mathit{Qsum}^{t}(r) = \sum_{c' \in \mathit{region}(r)} Q^{t}(c')$$

$$p_{c,r} = \frac{Q^{t}(c)}{\mathit{Qsum}^{t}(r)}$$

Efficient Allocation Algorithms

Independent Algorithm
    Requires multiple sorts of precise cells for each iteration
    Optimizations based on re-using each sort as much as possible

Block Algorithm
    Reduces the number of required sorts for precise cells to 1
    Optimizations based on increasing buffer utilization

[Figure: example allocation graph.]

Imprecise facts: p6 <MA,Sedan>, p7 <MA,Truck>, p8 <CA,ALL>, p9 <East,Truck>, p10 <West,Sedan>, p11 <ALL,Civic>, p12 <ALL,Sierra>, p13 <West,Civic>, p14 <West,Sierra>

Precise cells: p1 <MA,Civic>, p2 <MA,Sierra>, p3 <NY,F150>, p4 <CA,Civic>, p5 <CA,Sierra>

Levels of the imprecise facts:

S1: <State,Category>

S2: <State,ALL>

S3: <Region,Category>

S4: <ALL,Model>

S5: <Region,Model>

Iteration-Aware Allocation

Optimizations for Independent and Block reduce work for a single iteration

Problem: Each iteration of allocation is still expensive
    Involves multiple scans of entire fact table
    Not feasible for real data warehouses!

Can we do better?

Required Data For Allocating A Fact

[Figure: allocation graph with imprecise facts p6 to p14 and precise cells c1 <MA,Civic>, c2 <MA,Sierra>, c3 <NY,F150>, c4 <CA,Civic>, c5 <CA,Sierra>. The figure then separates the graph into two parts.]

Part 1
    Imprecise facts: p6 <MA,Sedan>, p8 <CA,ALL>, p10 <West,Sedan>, p11 <ALL,Civic>, p13 <West,Civic>, p14 <West,Sierra>
    Precise cells: c1 <MA,Civic>, c4 <CA,Civic>, c5 <CA,Sierra>

Part 2
    Imprecise facts: p7 <MA,Truck>, p9 <East,Truck>, p12 <ALL,Sierra>
    Precise cells: c2 <MA,Sierra>, c3 <NY,F150>

Connected components in allocation graph can be processed independently

Transitive Algorithm

Transitive Algorithm has two steps:

1) Connected component identification step

2) Process each connected component
    Read component into memory
    Perform all iterations of allocation for facts in component

If each component fits into memory, then the required I/O for Transitive is independent of the number of iterations!
    Components larger than buffer processed using Block algorithm
    In real datasets, all components were memory resident

Use concepts from Transitive Algorithm to develop EDB Maintenance Algorithm
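A minimal in-memory sketch of the component identification step, using union-find over the allocation graph. The talk does not say how components are identified on disk, so the union-find approach, the function name, and the sample graph are all assumptions for illustration.

```python
from collections import defaultdict

def connected_components(regions):
    """Group imprecise facts and precise cells into connected components.

    regions -- imprecise fact -> list of precise cells it overlaps
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Each edge of the allocation graph joins a fact node and a cell node.
    for fact, cells in regions.items():
        find(("fact", fact))
        for cell in cells:
            union(("fact", fact), ("cell", cell))

    groups = defaultdict(lambda: {"facts": set(), "cells": set()})
    for kind, name in list(parent):
        groups[find((kind, name))][kind + "s"].add(name)
    return list(groups.values())

# Hypothetical graph: p6 and p8 share c1; p7 touches only c2 and c3.
regions = {"p6": ["c1"], "p8": ["c1", "c4"], "p7": ["c2", "c3"]}
components = connected_components(regions)
for comp in components:
    print(sorted(comp["facts"]), sorted(comp["cells"]))
```

Each component could then be loaded and iterated to convergence on its own, since no fact in it depends on a cell outside it.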

Experimental Setup

Algorithms evaluated on several datasets
    Real-world dataset: 798K facts, 4 dimensions
    Used several synthetic datasets
        Vary level of imprecision in the data: percentage of imprecise facts, severity of imprecision
        Scalability (up to 5 million tuples)

Important parameter: ratio of input table size to available memory
    Memory limited to restricted buffer pool

Experiment 1a: Memory Resident

[Figure: Real Dataset. Time (sec, 0 to 300) versus iterations until converged (1 to 7) for Independent, Block, and Transitive.]

Experiment: Memory Resident (2)

[Figure: Synthetic Dataset (more imprecision). Time (sec, 0 to 500) versus iterations until converged (0 to 10).]

Experiment: Algorithm Scalability

[Figure: ε = 0.1 (3 iterations). Time (sec, 0 to 1400) versus buffer size (600KB, 1MB, 6MB, 12MB).]


Experiment 1b: Algorithm Scalability

[Figure: ε = 0.005 (10 iterations). Time (sec, 0 to 7000) versus buffer size (600KB, 1MB, 6MB, 12MB).]


Conclusions

Imprecision is a compelling real-world problem
    Propose allocation as a solution

Allocation graph formalism
    Basis for 3 scalable allocation algorithms: Independent, Block, Transitive

Transitive algorithm is quite intriguing
    Performance is stable as the number of iterations increases
    The connected components it identifies can be used in the proposed EDB maintenance algorithm