WILD PROJECT REVIEW
Efficient Allocation Algorithms For OLAP Over Imprecise Data
Doug Burdick
University of Wisconsin – Madison
Raghu Ramakrishnan
Yahoo! Research
Prasad Deshpande
IBM India Research Lab, SIRC
Shivakumar Vaithyanathan
IBM Almaden Research Center
T.S. Jayram
IBM Almaden Research Center
2
Multidimensional Data

[Figure: dimension hierarchies. LOCATION: ALL > Region (West, East) > State (MA, NY, TX, CA). AUTOMOBILE: ALL > Category (Truck, Sedan) > Model (Civic, Sierra, F150, Camry). Facts p1 to p4 are plotted in the cells of the State × Model grid.]

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200

Imprecise Data

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100

[BDJ+05] Burdick et al. OLAP Over Uncertain and Imprecise Data. In VLDB 2005.
3
Sources of Imprecision
Dimensions extracted from free text
Assume given extractor for Auto dimension values
FactID Text
p1 Brakes on F150…
p2 Rotors on the Sierra are…
p3 The F150 has…
p4 The Sierra is…
p5 cust’s Sierra is … but their F150 has …
5
Sources of Imprecision
Dimensions extracted from free text
Assume given extractor for Auto dimension values

FactID  Text                                     Auto
p1      Brakes on F150…                          F150
p2      Rotors on the Sierra are…                Sierra
p3      The F150 has…                            F150
p4      The Sierra is…                           Sierra
p5      cust’s Sierra is … but their F150 has …  {Sierra,F150}

More details for dimensions extracted from text in [BDJ+06] Burdick et al. OLAP Over Uncertain and Imprecise Data. To appear in VLDB Journal.
6
Sources of Imprecision
Dimensions extracted from free text
Assume given extractor for Auto dimension values
FactID Text
p1 Brakes on F150…
p2 Rotors on the Sierra are…
p3 The F150 has…
p4 The Sierra is…
p5 cust’s Truck has…
8
Sources of Imprecision
Dimensions extracted from free text
Assume given extractor for Auto dimension values

FactID  Text                        Auto
p1      Brakes on F150…             F150
p2      Rotors on the Sierra are…   Sierra
p3      The F150 has…               F150
p4      The Sierra is…              Sierra
p5      cust’s Truck has…           Truck
9
Sources of Imprecision
Data Integration
Fact table constructed by integrating multiple data sources
Different sources record same dimension attribute at different granularities

Source tables (Call Center, Mailing List):

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200

FactID  Auto    Loc  Repair
p5      Truck   NY   100

[Figure: AUTOMOBILE hierarchy. ALL > Category (Truck, Sedan) > Model (Civic, Sierra, F150, Camry).]

Integrated fact table:

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100
10
Imprecision In Real Data
Obtained real-world dataset from auto manufacturer
Fact table entries from several source relations
Integrated fact table contained 798,570 facts
Real data has many imprecise facts
11
Querying Imprecise Facts
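[Figure: facts p1 to p4 plotted in the grid of Location (MA, NY; region East) × Automobile (F150, Sierra; category Truck). Imprecise fact p5 spans both MA cells.]

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100

Query: Auto = F150, Loc = MA, SUM(Repair) = ???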
12
Solution: Allocation
Intuitively: Replace each imprecise fact r with set of precise facts, one for each possible completion of r
Each completion is assigned an allocation weight
Refer to the resulting fact table as the Extended Database (EDB)
Queries operate over this Extended Database
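
A minimal sketch of how an Extended Database could be built, under stated assumptions: the `CHILDREN` hierarchy, the function names, and the uniform weighting (each completion gets 1/n) are illustrative choices and not taken from the slides. Only the Auto dimension is treated as possibly imprecise here.

```python
# Hypothetical hierarchy: each category maps to its leaf-level models.
CHILDREN = {"Truck": ["F150", "Sierra"], "Sedan": ["Civic", "Camry"]}


def completions(auto):
    """Return the possible precise models for an Auto value."""
    return CHILDREN.get(auto, [auto])


def build_edb(facts):
    """Replace every fact by one weighted row per possible completion.

    Assumption: weights are uniform over the completions. A real
    allocation policy would compute them from the data instead.
    """
    edb = []
    for fact_id, auto, loc, repair in facts:
        models = completions(auto)
        weight = 1.0 / len(models)
        for model in models:
            edb.append((fact_id, model, loc, repair, weight))
    return edb


facts = [
    ("p1", "F150", "NY", 100),
    ("p2", "Sierra", "NY", 500),
    ("p3", "F150", "MA", 100),
    ("p4", "Sierra", "MA", 200),
    ("p5", "Truck", "MA", 100),  # imprecise: only the category is known
]

edb = build_edb(facts)
for row in edb:
    print(row)
```

Precise facts pass through with weight 1.0, while p5 becomes two rows (F150 and Sierra) with weight 0.5 each.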
13
Handle Imprecision With Allocation

[Figure: facts p1 to p4 plotted in the grid of Location (MA, NY; region East) × Automobile (F150, Sierra; category Truck). Imprecise fact p5 spans both MA cells.]

Fact table:

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100

Extended Database:

ID  FactID  Auto    Loc  Repair  Weight
1   p1      F150    NY   100     1.0
2   p2      Sierra  NY   500     1.0
3   p3      F150    MA   100     1.0
4   p4      Sierra  MA   200     1.0
5   p5      F150    MA   100     0.5
6   p5      Sierra  MA   100     0.5
15
Querying The Extended Database

[Figure: grid of Location (MA, NY; region East) × Automobile (F150, Sierra; category Truck) with facts p1 to p4. p5 now appears in both MA cells.]

Query: Auto = F150, Loc = MA, SUM(Repair) = 150

ID  FactID  Auto    Loc  Repair  Weight
1   p1      F150    NY   100     1.0
2   p2      Sierra  NY   500     1.0
3   p3      F150    MA   100     1.0
4   p4      Sierra  MA   200     1.0
5   p5      F150    MA   100     0.5
6   p5      Sierra  MA   100     0.5

Procedure for assigning allocation weights is referred to as an allocation policy
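
The answer of 150 can be reproduced with a short sketch. It assumes that SUM over the Extended Database multiplies each Repair value by its allocation weight, which is consistent with the numbers on the slide; the function name `weighted_sum` is hypothetical.

```python
# Extended Database rows: (FactID, Auto, Loc, Repair, Weight)
edb = [
    ("p1", "F150", "NY", 100, 1.0),
    ("p2", "Sierra", "NY", 500, 1.0),
    ("p3", "F150", "MA", 100, 1.0),
    ("p4", "Sierra", "MA", 200, 1.0),
    ("p5", "F150", "MA", 100, 0.5),
    ("p5", "Sierra", "MA", 100, 0.5),
]


def weighted_sum(rows, auto, loc):
    """SUM(Repair) for one cell, weighting each row by its allocation weight."""
    return sum(
        repair * weight
        for _, row_auto, row_loc, repair, weight in rows
        if row_auto == auto and row_loc == loc
    )


result = weighted_sum(edb, "F150", "MA")
print(result)  # 100 * 1.0 (p3) + 100 * 0.5 (p5) = 150.0
```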
16
Contributions
Propose generalized template for allocation policies presented in [BDJ+05]
Present operational framework for allocation
  Allocation graph formalism
  Used to derive Independent, Block, Transitive Algorithms
Propose Extended Database Maintenance Algorithm
  Update EDB to reflect changes to given fact table
Experimental Evaluation
17
Allocation Policy Template
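[Figure: imprecise fact r spanning cells c1 and c2 in the grid of Location (MA, NY; region East) × Automobile (F150, Sierra; category Truck).]

\[ p_{c,r} = \frac{Q(c)}{\mathrm{Qsum}(r)} = \frac{Q(c)}{\sum_{c' \in \mathrm{region}(r)} Q(c')} \]

For r with region {c1, c2}:

\[ p_{c1,r} = \frac{Q(c1)}{Q(c1) + Q(c2)}, \qquad p_{c2,r} = \frac{Q(c2)}{Q(c1) + Q(c2)} \]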
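
As a worked illustration of the template, the sketch below computes the weights of one imprecise fact from per-cell quantities Q(c). The quantities 1 and 2, the function name, and the uniform fallback for an all-zero region are assumptions made for the example.

```python
from fractions import Fraction


def allocation_weights(region, q):
    """Return p[c] = Q(c) / Qsum(r) for every cell c in region(r)."""
    qsum = sum(q[c] for c in region)
    if qsum == 0:
        # Assumption: spread the fact uniformly when no cell carries any quantity.
        return {c: Fraction(1, len(region)) for c in region}
    return {c: Fraction(q[c], qsum) for c in region}


# Hypothetical quantities, e.g. the number of precise facts in each cell.
q = {"c1": 1, "c2": 2}
weights = allocation_weights(["c1", "c2"], q)
print(weights)  # c1 gets 1/3 of the fact, c2 gets 2/3
```

Exact fractions are used so that the weights of a fact visibly sum to 1.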
18
Interactions between overlapping facts

[Figure: grid of Location (MA, NY; region East) × Automobile (F150, Sierra; category Truck) with precise facts p1, p2, p4, p5 and overlapping imprecise facts p6 and p7.]

Allocation weights for imprecise fact p6 depend on allocation weights for fact p7 (and vice-versa)
Would like assigned weights to capture these interactions
Idea: Repeatedly allocate p6 and p7 until allocation weights converge
19
Iterative Allocation Policies
\[ \mathrm{Qsum}^{t}(r) = \sum_{c' \in \mathrm{region}(r)} Q^{t}(c') \]

1) Initialize $Q^{0}(c)$ in each cell c (using precise facts)
2) For each iteration t until all $Q^{t}(c)$ converged
     For each cell c
       For each imprecise fact r overlapping c
         \[ Q^{t}(c) \leftarrow Q^{t}(c) + \frac{Q^{t-1}(c)}{\mathrm{Qsum}^{t-1}(r)} \]
3) For each imprecise fact r
     For each cell c in region(r)
       \[ p_{c,r} = \frac{Q^{t}(c)}{\mathrm{Qsum}^{t}(r)} \]
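
The three steps can be sketched in memory as follows. This is an illustration under stated assumptions, not the authors' implementation: each iteration restarts from the precise-fact quantities and adds every overlapping fact's share, the threshold `eps` and the cap `max_iters` are hypothetical parameters, and the cells and regions in the example are made up.

```python
def iterative_allocation(q0, regions, eps=1e-6, max_iters=100):
    """Iteratively allocate imprecise facts to cells.

    q0:      cell -> initial quantity computed from precise facts
    regions: imprecise fact id -> list of cells in its region
    Returns: fact id -> {cell: allocation weight}
    """
    q_prev = dict(q0)
    for _ in range(max_iters):
        # Qsum(r): total quantity over the cells in region(r).
        qsum = {r: sum(q_prev[c] for c in cells) for r, cells in regions.items()}
        # Assumption: every iteration starts again from the precise-fact quantities.
        q_next = dict(q0)
        for r, cells in regions.items():
            if qsum[r] == 0:
                continue  # nothing to distribute proportionally
            for c in cells:
                q_next[c] += q_prev[c] / qsum[r]
        converged = max(abs(q_next[c] - q_prev[c]) for c in q0) < eps
        q_prev = q_next
        if converged:
            break

    weights = {}
    for r, cells in regions.items():
        total = sum(q_prev[c] for c in cells)
        if total == 0:
            # Assumption: uniform weights when the whole region is empty.
            weights[r] = {c: 1.0 / len(cells) for c in cells}
        else:
            weights[r] = {c: q_prev[c] / total for c in cells}
    return weights


# Hypothetical example: two imprecise facts that overlap in cell c2.
q0 = {"c1": 1.0, "c2": 2.0, "c3": 1.0}
regions = {"p6": ["c1", "c2"], "p7": ["c2", "c3"]}
weights = iterative_allocation(q0, regions)
print(weights)
```

Because p6 and p7 share c2, the quantity each adds to c2 changes the other's weights, which is the interaction the iteration is meant to capture.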
20
Benefits of Iterative Allocation
Imprecise facts can be allocated in any order and same allocation weights are obtained
  Leverage this idea to obtain scalable allocation algorithms
Leads to Expectation Maximization (EM) framework for allocation
  Final allocation weights have pleasing mathematical properties
  See [BDJ+05] for details
21
Allocation Graph
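[Figure: allocation graph. Imprecise facts (e.g. <MA,Truck>) on one side and precise cells Cell(NY,F150), Cell(NY,Sierra), Cell(MA,F150), Cell(MA,Sierra) on the other, with an edge from each imprecise fact to the cells it overlaps. Shown next to the grid of Location (MA, NY; region East) × Automobile (F150, Sierra; category Truck) with facts p1 to p6 and cells c1, c2.]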
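
A minimal sketch of how such a graph could be represented: a mapping from each imprecise fact to the precise cells in its region. The leaf tables, the function names, and the region chosen for p6 are assumptions made for illustration.

```python
from itertools import product

# Hypothetical hierarchies: each value maps to the leaf values it covers.
LOC_LEAVES = {"MA": ["MA"], "NY": ["NY"], "East": ["MA", "NY"]}
AUTO_LEAVES = {"F150": ["F150"], "Sierra": ["Sierra"], "Truck": ["F150", "Sierra"]}


def region(fact):
    """Return the precise cells (loc, auto) that an imprecise fact overlaps."""
    loc, auto = fact
    return list(product(LOC_LEAVES[loc], AUTO_LEAVES[auto]))


def build_allocation_graph(imprecise_facts):
    """Map every imprecise fact id to the cells it has an edge to."""
    return {fact_id: region(fact) for fact_id, fact in imprecise_facts.items()}


graph = build_allocation_graph({
    "p5": ("MA", "Truck"),
    "p6": ("East", "F150"),  # assumed region, for illustration only
})
for fact_id, cells in graph.items():
    print(fact_id, "->", cells)
```

Both facts have an edge to Cell(MA,F150), so their allocations interact through that cell.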
22
Processing With Allocation Graph

[Figure: allocation graph with imprecise fact <MA,Truck> (p5) and precise cells Cell(NY,F150), Cell(NY,Sierra), Cell(MA,F150), Cell(MA,Sierra), shown next to the grid with facts p1 to p6 and cells c1, c2. Annotated values: 1, 2, 3 and allocation weights 2/3, 1/3.]

Initialize $Q^{0}(c)$ in each cell c

\[ \mathrm{Qsum}^{t}(r) = \sum_{c' \in \mathrm{region}(r)} Q^{t}(c') \]

\[ p_{c,r} = \frac{Q^{t}(c)}{\mathrm{Qsum}^{t}(r)} \]
23
Efficient Allocation Algorithms
Independent Algorithm
  Requires multiple sorts of precise cells for each iteration
  Optimizations based on re-using each sort as much as possible
Block Algorithm
  Reduces the number of required sorts for precise cells to 1
  Optimizations based on increasing buffer utilization
24
Imprecise facts, grouped by level:
S1: <State,Category>   p6 <MA,Sedan>, p7 <MA,Truck>
S2: <State,ALL>        p8 <CA,ALL>
S3: <Region,Category>  p9 <East,Truck>, p10 <West,Sedan>
S4: <ALL,Model>        p11 <ALL,Civic>, p12 <ALL,Sierra>
S5: <Region,Model>     p13 <West,Civic>, p14 <West,Sierra>

Precise cells:
p1 <MA,Civic>
p2 <MA,Sierra>
p3 <NY,F150>
p4 <CA,Civic>
p5 <CA,Sierra>
25
Iteration aware allocation
Optimizations for Independent and Block reduce work for single iteration
Problem: Each iteration of allocation is still expensive
  Involves multiple scans of entire fact table
  Not feasible for real data warehouses!
Can we do better?
26
Required Data For Allocating A Fact

Imprecise facts:
p6 <MA,Sedan>
p7 <MA,Truck>
p8 <CA,ALL>
p9 <East,Truck>
p10 <West,Sedan>
p11 <ALL,Civic>
p12 <ALL,Sierra>
p13 <West,Civic>
p14 <West,Sierra>

Precise cells:
c1 <MA,Civic>
c2 <MA,Sierra>
c3 <NY,F150>
c4 <CA,Civic>
c5 <CA,Sierra>
27
Required Data For Allocating A Fact

First component:
  Imprecise facts: p6 <MA,Sedan>, p8 <CA,ALL>, p10 <West,Sedan>, p11 <ALL,Civic>, p13 <West,Civic>, p14 <West,Sierra>
  Precise cells: c1 <MA,Civic>, c4 <CA,Civic>, c5 <CA,Sierra>

Second component:
  Imprecise facts: p7 <MA,Truck>, p9 <East,Truck>, p12 <ALL,Sierra>
  Precise cells: c2 <MA,Sierra>, c3 <NY,F150>

Connected components in allocation graph can be processed independently
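
One way to find connected components of an allocation graph is a union-find pass over its edges. The sketch below is an in-memory illustration with made-up fact and cell names; it is not the disk-based identification step of the Transitive algorithm.

```python
def connected_components(graph):
    """Group imprecise facts and cells into connected components.

    graph:   imprecise fact id -> list of cells it overlaps
    Returns: list of (set of fact ids, set of cells), one pair per component
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag nodes so that a fact and a cell with the same name cannot collide.
    for fact, cells in graph.items():
        find(("fact", fact))
        for cell in cells:
            union(("fact", fact), ("cell", cell))

    components = {}
    for node in list(parent):
        facts, cells = components.setdefault(find(node), (set(), set()))
        (facts if node[0] == "fact" else cells).add(node[1])
    return list(components.values())


# Hypothetical graph: r1 and r2 share cell b, r3 is isolated from both.
graph = {"r1": ["a", "b"], "r2": ["b"], "r3": ["c"]}
components = connected_components(graph)
for facts, cells in components:
    print(sorted(facts), sorted(cells))
```

Facts in different components share no cells, so iterating on one component never changes the quantities seen by another.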
28
Transitive Algorithm
Transitive Algorithm has two steps:
1) Connected component identification step
2) Process each connected component
     Read component into memory
     Perform all iterations of allocation for facts in component
If each component fits into memory then required I/O operations for Transitive is independent of number of iterations!
Components larger than buffer processed using Block algorithm
In real datasets, all components were memory resident
Use concepts from Transitive Algorithm to develop EDB Maintenance Algorithm
29
Experimental Setup
Algorithms evaluated on several datasets
  Real-world dataset: 798K facts, 4 dimensions
  Used several synthetic datasets
Vary level of imprecision in the data
  Percentage of imprecise facts
  Severity of imprecision
Scalability (up to 5 million tuples)
Important parameter: Ratio of input table size to available memory
  Memory limited to restricted buffer pool
30
Experiment 1a: Memory Resident
[Plot: Real Dataset. Time (sec), 0 to 300, versus Iterations (until converged), 1 to 7, for Independent, Block, and Transitive.]
31
Experiment: Memory Resident (2)
[Plot: Synthetic Dataset (more imprecision). Time (sec), 0 to 500, versus Iterations (until converged), 0 to 10, for Independent, Block, and Transitive.]
32
Experiment: Algorithm Scalability
ε = 0.1 (3 iterations)
[Plot: Time (sec), 0 to 1400, versus Buffer Size (600KB, 1MB, 6MB, 12MB) for Independent, Block, and Transitive.]
33
Experiment 1b: Algorithm Scalability
ε = 0.005 (10 iterations)
[Plot: Time (sec), 0 to 7000, versus Buffer Size (600KB, 1MB, 6MB, 12MB) for Independent, Block, and Transitive.]
34
Conclusions
Imprecision is a compelling real-world problem
  Propose allocation as a solution
Allocation graph formalism
  Basis for 3 scalable allocation algorithms
  Independent, Block, Transitive
Transitive algorithm is quite intriguing
  Performance is stable as number of iterations increases
  Connected components the algorithm identifies can be used in proposed EDB maintenance algorithm