Post on 01-Apr-2015
MDL Summarization with Holes
Shaofeng Bu, Laks V.S. Lakshmanan, Raymond T. Ng
University of British Columbia, Canada
VLDB 05 Shaofeng Bu UBC 2
Introduction
Multi-dimensional OLAP queries typically produce data-intensive answers.
Often the question is how to express the large answer set of cells that satisfy the OLAP query conditions:
Simple enumeration: accurate but not necessarily the most intuitive;
Summaries: not (necessarily) 100% accurate but can be more intuitive and informative.
Summarized answers can be more easily understood.
OLAP Data Cube Example
[Figure: 2-D data cube with a location axis (cities grouped into northwest, midwest and northeast) and a clothes axis (items grouped into women's and men's).]
Each dimension is associated with a hierarchical tree.
OLAP Data Cube Example
[Figure: 2-D data cube with a location axis (cities grouped into northwest, midwest and northeast) and a clothes axis (items grouped into women's and men's).]
Data Cell: a pair (c1, c2) where c1 and c2 are leaf nodes in the axis trees, e.g. (Vancouver, ties).
Data Region: a pair (x1, y1) of nodes in the axis trees; it describes all data cells covered by those nodes, e.g. (Vancouver, ties), (Vancouver, women's), (northwest, women's).
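A minimal Python sketch of the cell/region distinction. The two toy axis trees are hypothetical (they do not reproduce the slide's full hierarchy), and the names `leaves` and `cells_of_region` are illustrative only.

```python
from itertools import product

# Hypothetical axis trees: each internal node maps to its children.
# Nodes that do not appear as keys are leaves.
LOCATION = {"northwest": ["Vancouver", "Edmonton"]}
CLOTHES = {"women's": ["blouses", "skirts"], "men's": ["ties"]}


def leaves(tree, node):
    """Return the leaf nodes under `node` (the node itself if it is a leaf)."""
    children = tree.get(node)
    if not children:
        return [node]
    result = []
    for child in children:
        result.extend(leaves(tree, child))
    return result


def cells_of_region(x1, y1):
    """All data cells (pairs of leaves) covered by the region (x1, y1)."""
    return set(product(leaves(LOCATION, x1), leaves(CLOTHES, y1)))


# A region made of two leaves is a single data cell.
print(cells_of_region("Vancouver", "ties"))
# A region made of two internal nodes covers every leaf pair below them.
print(len(cells_of_region("northwest", "women's")))
```

A data cell is simply a region whose two nodes are both leaves, so one function serves both notions.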
OLAP Data Cube Example
[Figure: the same data cube, with the cells that satisfy the query shaded blue.]
Blue cells: the cells that satisfy the query conditions;
How to find a summary of the blue cells in a data cube?
MDL Summarization
MDL: Minimum Description Length.
Use regions to cover the blue cells;
The length of an MDL description is the number of included regions and cells;
The MDL problem is to find the description with the minimum length.
An Example of MDL Summarization
[Figure: the blue cells of the data cube covered by regions R1 to R9.]
A Motivating Example: A New Case
[Figure: the same data cube; some cells inside R1, R3 and R9 (now marked "?") are not blue any more, and regions R10 to R13 appear.]
MDL Summarization: 10 regions + 8 single blue cells; total length = 18.
Can we do better?
Yes! We present a new compression approach: MDL with Holes:
Identify regions with blue cells, even if they contain non-blue cells;
Express the included blue cells by using regions with the exception of the covered non-blue cells;
Non-blue cells are called holes.
A Motivating Example: MDL with Holes
[Figure: the same data cube, covered by regions with holes.]
R1 - (Vancouver, skirts) and R3 - (Vancouver, skirts) share one hole: R1 + R3 - (Vancouver, skirts), length 3.
R9 - (Boston, ties) - (New York, dress skirts), length 3.
Plus the other 6 regions: R2, R4, R5, R6, R7, R8.
MDL with Holes: length = 6 + 3 + 3 = 12.
MDL approach: length is 18.
Problem Statements
The MDL with Holes (MDLH) problem is to find a description with holes that has the minimum length, and therefore the maximum benefit.
In practice, we can drill down on regions to get additional details.
Definitions: Length & Benefit
Given a set B of data cells (blue cells), an MDLH description for B is D = S - H, where S is a set of data regions and H is a set of data cells, also called 'holes'; D covers exactly the data cells in B.
Length: the total number of regions and cells included in the description: |D| = |S| + |H|.
Benefit: how much shorter the MDLH summary is than the enumeration of B: Benefit(D) = |B| - |D|.
Example 1: B1 = {a, b, c}, D1 = s - d, |D1| = 2, Benefit(D1) = |B1| - |D1| = 1.
Example 2: B2 = {e, g}, D2 = t - f - h, |D2| = 3, Benefit(D2) = |B2| - |D2| = -1.
[Figure: hierarchy with leaf cells a to h, regions s and t above them, and x above s and t.]
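The two examples can be checked with a few lines of Python. This is only a sketch: the coverage of `s` (cells a to d) and `t` (cells e to h) is an assumption inferred from the examples, and the function names are illustrative.

```python
def length(regions, holes):
    """|D| = |S| + |H|: number of regions plus number of holes."""
    return len(regions) + len(holes)


def benefit(blue, regions, holes):
    """Benefit(D) = |B| - |D|."""
    return len(blue) - length(regions, holes)


def covered(regions, holes, cover):
    """Cells described by D = S - H, given the leaf cells of each region."""
    cells = set()
    for region in regions:
        cells |= cover[region]
    return cells - set(holes)


# Assumed coverage of the two regions (inferred, not stated on the slide).
COVER = {"s": set("abcd"), "t": set("efgh")}

B1, S1, H1 = set("abc"), ["s"], ["d"]
B2, S2, H2 = set("eg"), ["t"], ["f", "h"]

# A valid description must cover exactly the blue cells.
assert covered(S1, H1, COVER) == B1
assert covered(S2, H2, COVER) == B2

print(length(S1, H1), benefit(B1, S1, H1))  # 2 1
print(length(S2, H2), benefit(B2, S2, H2))  # 3 -1
```

A negative benefit, as for D2, means that simply enumerating the blue cells is shorter than the description with holes.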
Related Work
The Generalized MDL Approach for Summarization, Laks V.S. Lakshmanan, Raymond T. Ng et al., VLDB 2002
Reduces description length by allowing non-blue cells to be covered in the regions;
The regions are not pure.
Concise Descriptions of Subsets of Structured Sets, Alberto O. Mendelzon & Ken Q. Pu, PODS 2003
Allows Cartesian products to be formed;
Not purely hierarchical: the NP-completeness result is less surprising;
What about the purely hierarchical case?
Intelligent Rollups in Multidimensional OLAP Data, Gayatri Sathe and Sunita Sarawagi, VLDB 2001
Only reports consistent generalizations: a tuple can be generalized along a set of dimensions only if it can be generalized along all subsets of those dimensions.
Outline
Introduction to MDL with Holes
A motivating example
1-D Case: MDLH is Tractable
2-D Case: MDLH is NP-Hard
Heuristics
A Greedy Heuristic
Dynamic Programming
Quadratic Programming
Experimental Results
Summarization on Holes: An Extension
Conclusions & Contributions
1-D Case: MDLH is Tractable
[Figure: 1-D hierarchy with leaf cells a to r, regions s, t, u, v, w above them, then x and y, and the root z.]
Region 'x':
D1 = x - d - f - j, Benefit(D1) = 7 - 4 = 3
D2 = (s - d) + e + (u - j), Benefit(D2) = 7 - 5 = 2
Region 'y':
D3 = y - m - p - q - r, Benefit(D3) = 4 - 5 = -1
D4 = (v - m) + o, Benefit(D4) = 4 - 3 = 1
Region 'z':
D5 = z - d - f - j - m - p - q - r, Benefit(D5) = 11 - 8 = 3
D6 = (x - d - f - j) + (v - m) + o, Benefit(D6) = 11 - 7 = 4
MDLH is tractable: in the 1-D case, the optimal MDLH description, i.e. the one with the maximum benefit, can be generated in polynomial time.
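One way to see why the 1-D case is tractable is a bottom-up pass over the hierarchy: at each node, either take the node itself minus its non-blue leaves, or concatenate the best descriptions of its children. The Python sketch below follows that idea; it is not taken from the paper, and the tree and blue set are assumptions inferred from the descriptions D1 to D6.

```python
# Hierarchy and blue cells inferred from D1..D6 (an assumption).
CHILDREN = {
    "z": ["x", "y"],
    "x": ["s", "t", "u"],
    "y": ["v", "w"],
    "s": list("abcd"),
    "t": list("ef"),
    "u": list("ghij"),
    "v": list("klmn"),
    "w": list("opqr"),
}
BLUE = set("abceghiklno")


def best(node):
    """Return (length, terms, non_blue) for the shortest description of the
    blue leaves under `node`; `terms` is a list of (region, holes) pairs."""
    kids = CHILDREN.get(node)
    if kids is None:  # leaf cell
        if node in BLUE:
            return 1, [(node, [])], []
        return 0, [], [node]
    length, terms, non_blue = 0, [], []
    for kid in kids:
        k_length, k_terms, k_non_blue = best(kid)
        length += k_length
        terms += k_terms
        non_blue += k_non_blue
    whole = 1 + len(non_blue)  # the node itself plus one hole per non-blue leaf
    if whole < length:
        return whole, [(node, non_blue)], non_blue
    return length, terms, non_blue


length, terms, _ = best("z")
print(terms)
print(length, len(BLUE) - length)  # 7 4
```

Each node is visited once, so the pass is polynomial in the size of the hierarchy. On this example it returns x - d - f - j, v - m and o, which agrees with D6.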
2-D Case: Optimality is not Preserved Any More
[Figure: 2-D grid with rows a to g (parent i) and columns 1 to 7 (parent 8).]
rows                            length  benefit
(c,8),(d,8),(e,8)               4       0
(f,8),(g,8)                     3       2
(a,8),(b,8)                     5       -2
columns                         length  benefit
(i,1)                           3       2
(i,5)                           5       -2
(i,2),(i,3),(i,4),(i,6),(i,7)   4       0
Optimal solution: {(c,8)+(d,8)+(e,8)+(i,2)+(i,3)+(i,4)} - {(c,2)+(c,3)+(c,4)+(d,2)+(d,3)+(d,4)+(e,2)+(e,3)+(e,4)} + (f,1)+(g,1)+(f,6)+(g,7)
Length = 19
Benefit = 28 - 19 = 9
MDLH is NP-Hard in 2-D Case
It is NP-hard to find the optimal MDLH description in a 2-D data cube;
Not a trivial proof: details are in the paper;
Reduction strategy: Clique, then Maximum Induced Subgraph in Complete Edge-Weighted (CEW) Bipartite Graph, then MDL with Holes.
Heuristics for MDLH
Greedy: each time, choose the row/column with the most benefit;
Dynamic Programming: a bottom-up method that builds the description of a region from the descriptions of its children regions;
Quadratic Programming: uses a quadratic function to represent the benefit of a 2-D data cube.
Example for Comparison with Heuristics
[Figure: 2-D grid with rows a to d (parent e) and columns 1 to 9, with higher-level column nodes 10, 11 and 12.]
The optimal description for this example: (e,1)-(a,1)+(e,2)-(b,2)+(e,3)-(b,3)+(d,4)+(b,5)+(e,6)+(e,8)+(a,11)-(a,8)
Length = 12
Benefit = 8
Heuristics: A Greedy Heuristic
[Figure: the example grid with rows a to d (parent e) and columns 1 to 9, with higher-level column nodes 10, 11 and 12.]
region   length  benefit  holes
(e,6)    1       3        -
(d,10)   2       2        (d,5)
(e,1)    2       1        (a,1)
(e,2)    2       1        (b,2)
(e,3)    2       1        (b,3)
(a,11)   2       1        (a,8)
(e,8)    2       1        (a,8)
(c,10)   3       0        (c,4)(c,5)
Description by Greedy: (e,6)+(a,11)+(e,8)-(a,8)+(d,10)-(d,5)+(a,2)+(a,3)+(b,1)+(b,5)+(c,1)+(c,2)+(c,3)
The length is 13.
The benefit is 20 - 13 = 7.
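A simplified Python sketch of the greedy idea. It assumes a flat grid whose only regions are whole rows and whole columns (no intermediate hierarchy levels such as node 10 or 11), so it is not the talk's implementation. The toy blue set and the name `greedy_mdlh` are hypothetical.

```python
def greedy_mdlh(blue, rows, cols):
    """Repeatedly pick the row/column with the largest positive gain."""
    regions = {("row", r): {(r, c) for c in cols} for r in rows}
    regions.update({("col", c): {(r, c) for r in rows} for c in cols})
    chosen, covered, holes = [], set(), set()
    while True:
        best_gain, best_name = 0, None
        for name, cells in regions.items():
            if name in chosen:
                continue
            new_blue = (cells & blue) - covered   # blue cells no longer listed singly
            new_holes = (cells - blue) - holes    # extra holes this region needs
            gain = len(new_blue) - 1 - len(new_holes)
            if gain > best_gain:
                best_gain, best_name = gain, name
        if best_name is None:
            break
        chosen.append(best_name)
        covered |= regions[best_name] & blue
        holes |= regions[best_name] - blue
    singles = blue - covered                      # remaining cells are listed one by one
    length = len(chosen) + len(holes) + len(singles)
    return chosen, holes, singles, length


# Hypothetical 3 x 4 grid: column 0 is all blue, row 0 has one non-blue cell.
blue = {(0, 0), (0, 1), (0, 3), (1, 0), (2, 0)}
chosen, holes, singles, length = greedy_mdlh(blue, range(3), range(4))
print(chosen, sorted(singles))
print(length, len(blue) - length)  # 3 2
```

The gain of a candidate is recomputed after every pick, because cells that are already covered and holes that are already paid for no longer count.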
Greedy: Why is it not optimal?
[Figure: the example grid shown twice, once with the description from Greedy and once with the optimal description.]
Selecting a row/column may reduce the benefit of other rows/columns by more than it contributes.
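Heuristics: Dynamic Programming
[Figure: the example grid with rows a to d (parent e) and columns 1 to 9, with higher-level column nodes 10, 11 and 12.]
L: the length of a region ("." = entry not shown)
L    1  2  3  4  5  10  6  7  8  9  11  12
a    .  .  .  .  .   2  .  .  .  .   2   4
b    .  .  .  .  .   2  .  .  .  .   2   4
c    .  .  .  .  .   3  .  .  .  .   2   5
d    .  .  .  .  .   2  .  .  .  .   2   4
e    2  2  2  1  1   8  1  1  2  1   5  13
S: selection of rows & columns ("." = entry not shown)
S    1   2   3   4   5   10  6   7   8   9   11  12
a    .   .   .   .   .   t2  .   .   .   .   g   t2
b    .   .   .   .   .   t2  .   .   .   .   t2  t2
c    .   .   .   .   .   t2  .   .   .   .   t2  t2
d    .   .   .   .   .   g   .   .   .   .   t2  t2
e    g   g   g   t1  t1  t2  g   t1  g   t1  t2  t2
Examples:
(a,10): (a,2) + (a,3); L(a,10) = 2, S(a,10) = 't2'
(e,4): (d,4); L(e,4) = 1, S(e,4) = 't1'
(d,10): (d,10) - (d,5); L(d,10) = 2, S(d,10) = 'g'
[Figure: legend illustrating the selections t1 and t2.]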
Heuristics: Dynamic Programming (2)
[Figure: the example grid and the S table shown above, with a legend illustrating t1 and t2.]
D(x1,x2): description for region (x1,x2)
S(e,12) = 't2': D(e,12) = D(e,10) + D(e,11)
S(e,10) = 't2': D(e,10) = D(e,1) + D(e,2) + D(e,3) + D(e,4) + D(e,5)
S(e,11) = 't2': D(e,11) = D(e,6) + D(e,7) + D(e,8) + D(e,9)
D(e,1) to D(e,9): (e,1)-(a,1), (e,2)-(b,2), (e,3)-(b,3), (d,4), (b,5), (e,6), (a,7), (e,8)-(a,8), (a,9)
Generated description: (e,1)-(a,1)+(e,2)-(b,2)+(e,3)-(b,3)+(d,4)+(b,5)+(e,6)+(a,7)+(e,8)-(a,8)+(a,9)
The length is 13 and the benefit is 20 - 13 = 7.
Dynamic Programming: Why is it not optimal?
[Figure: the example grid shown twice, once with the description by Dynamic Programming and once with the optimal description.]
It misses combinations of rows and columns.
Heuristics: Quadratic Programming
Use variables to represent rows/columns; for a variable v:
v = 1: the corresponding row/column is selected;
v = 0: the corresponding row/column is not selected;
f = -Benefit(D): maximizing the benefit is equivalent to minimizing the value of f;
For the previous example, quadratic programming generates the optimal description;
Optimality is not guaranteed.
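The slides do not give the quadratic function itself. As a hedged illustration, assume a flat grid in which a cell (i, j) is covered when row i or column j is selected, i.e. cover = r_i + c_j - r_i*c_j, which is where the quadratic term comes from. Under that assumption f = sum(r) + sum(c) + (covered non-blue cells) - (covered blue cells). The Python sketch below evaluates this f and minimizes it by exhaustive search on a hypothetical toy grid. Exhaustive search stands in for a real quadratic programming solver here.

```python
from itertools import product


def f(r, c, blue):
    """f = -Benefit for the 0/1 row selection r and column selection c."""
    value = sum(r) + sum(c)  # each selected row/column adds 1 to the length
    for i, j in product(range(len(r)), range(len(c))):
        cover = r[i] + c[j] - r[i] * c[j]  # 1 if cell (i, j) is covered, else 0
        # A covered non-blue cell costs a hole; a covered blue cell saves a single.
        value += -cover if (i, j) in blue else cover
    return value


def brute_force_minimum(n_rows, n_cols, blue):
    """Try every 0/1 assignment; only feasible for tiny grids."""
    best = None
    for bits in product((0, 1), repeat=n_rows + n_cols):
        r, c = bits[:n_rows], bits[n_rows:]
        score = f(r, c, blue)
        if best is None or score < best[0]:
            best = (score, r, c)
    return best


# Hypothetical 3 x 4 grid: column 0 is all blue, row 0 has one non-blue cell.
blue = {(0, 0), (0, 1), (0, 3), (1, 0), (2, 0)}
score, r, c = brute_force_minimum(3, 4, blue)
print(r, c, -score)  # maximum benefit is -score
```

A real solver would relax the 0/1 variables and round the result, which is one reason optimality is not guaranteed in general.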
Experiments
We ran a set of experiments on the TPC-H benchmark data set;
We compared the three MDLH heuristics with MDL and GMDL.
Experimental Results: Comparison of All Methods
Compression ratio:
MDLH-Quadratic generates the most concise descriptions: a yardstick of quality;
MDLH-Dynamic is a very close second.
[Figure: compression ratio (axis 1 to 2.8) versus number of blue cells and blue density, from 3916 (25%) to 11307 (72%), for MDL, MDLH-Greedy, MDLH-Dynamic, MDLH-Quadratic, GMDL-5% and GMDL-10%.]
Experimental Results: Compression Ratio
[Figure: compression ratio (axis 1 to 4.5) versus number of blue cells and blue density, from 10000 (20%) to 40000 (80%), for MDL, MDLH-Greedy, MDLH-Dynamic, GMDL-5% and GMDL-10%.]
The more children per parent node, the greater the benefit.
Experimental Results: Summary
Running time & scalability:
MDLH-Greedy is the fastest;
MDLH-Dynamic runs slower than MDLH-Greedy, but it is still scalable w.r.t. the number of cells.
[Figure: run time in seconds (axis 0 to 100, with one bar labeled 379 secs) for MDL, GMDL, MDLH-Greedy and MDLH-Dynamic on a 3-d 3-level, a 3-d 4-level and a 5-d 4-level data cube.]
Extension: Summarization on Holes
[Figure: example grid with rows a to d and columns 1 to 9, plus higher-level nodes e, 10 and 11.]
As the blue density becomes high, a large part of the MDLH description is made up of holes.
Can we further reduce the total length by summarizing the holes?
MDLH description: (a,11)-{(a,6)+(a,8)+(a,9)} + (d,11)-{(d,6)+(d,7)+(d,8)} + (b,6) + (c,8)
Total length is 10.
Summarization on holes:
(a,6)+(a,8)+(a,9) = (a,10)-(a,7)
(d,6)+(d,7)+(d,8) = (d,10)-(d,9)
After summarization on holes: (a,11)-{(a,10)-(a,7)} + (d,11)-{(d,10)-(d,9)} + (b,6) + (c,8)
Total length is 8.
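The two totals can be verified by counting every region or cell that appears in a description, at any nesting depth. The nested-list representation below is an illustrative choice, not the authors' data structure.

```python
def cell(name):
    """A region or cell with no holes."""
    return (name, [])


def length(description):
    """Count every region or cell, including those nested inside holes."""
    return sum(1 + length(holes) for _region, holes in description)


# Each term is (region, holes); holes is itself a description.
before = [
    ("(a,11)", [cell("(a,6)"), cell("(a,8)"), cell("(a,9)")]),
    ("(d,11)", [cell("(d,6)"), cell("(d,7)"), cell("(d,8)")]),
    cell("(b,6)"),
    cell("(c,8)"),
]
after = [
    ("(a,11)", [("(a,10)", [cell("(a,7)")])]),
    ("(d,11)", [("(d,10)", [cell("(d,9)")])]),
    cell("(b,6)"),
    cell("(c,8)"),
]

print(length(before), length(after))  # 10 8
```

Each group of three holes is replaced by one region with one hole of its own, which saves one unit per group.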
Conclusions & Contributions
We present a new method, MDLH, to compress the answers of OLAP queries;
We present a bottom-up algorithm for the 1-D cube;
We proved the NP-hardness of the MDLH problem;
We provided three heuristics for MDLH: greedy, dynamic programming, and quadratic programming;
We extended MDLH with summarization on holes to further reduce the total length;
We ran a set of experiments on the TPC-H benchmark data to compare the heuristics.
Ongoing Work
Based on the summarization on blue cells and the summarization on holes, build a visualization tool with MDLH summarization:
Return summarized answers to users' queries;
Provide a drill-down operation for users: browse details on blue cells; browse details on holes.
Design a k-approximation algorithm for MDLH: what is the best quality we can guarantee?