MDL Summarization with Holes Shaofeng Bu Laks V.S. Lakshmanan Raymond T. Ng University of British...

MDL Summarization with Holes

Shaofeng Bu Laks V.S. Lakshmanan

Raymond T. Ng

University of British Columbia, Canada

VLDB 05, Shaofeng Bu, UBC

Introduction

Multi-dimensional OLAP queries typically produce data-intensive answers.

Often the question is how to express the large answer set of cells that satisfy the OLAP query conditions:

  Simple enumeration: accurate but not necessarily the most intuitive;

  Summaries: not (necessarily) 100% accurate but can be more intuitive and informative.

Summarized answers can be more easily understood.

OLAP Data Cube Example

[Figure: 2-D data cube with a location axis (New York, Vancouver, Edmonton, San Jose, San Francisco, Chicago, Minneapolis, Boston, Summit, Albany) and a clothes axis (including women's and men's categories)]

Each dimension is associated with a hierarchical tree.

[Figure: the data cube, location axis by clothes axis]

Data Cell: (c1, c2), where c1 and c2 are leaf nodes in the axis trees, e.g. (Vancouver, ties).

Data Region: (x1, y1) describes all data cells covered by the given nodes in the axis trees, e.g.:

(Vancouver, ties), (Vancouver, women's), (northwest, women's)

[Figure: the data cube with the cells that satisfy the query shaded blue]

Blue cells: the cells that satisfy the query conditions;

How to find a summary of the blue cells in a data cube?

MDL Summarization

MDL: Minimum Description Length

Use regions to cover the blue cells;

Length of an MDL description is the number of included regions and cells;

MDL is to find the description with the minimum length.

An Example of MDL Summarization

[Figure: the data cube with the blue cells covered by regions labelled R2, R3, R4, R6, R7, R8 and others]

10 regions + 8 single blue cells

Total length = 18

A Motivating Example: A New Case

[Figure: the same data cube and regions (R2, R3, R4, ...), with some cells inside the regions marked "Not blue cells any more"]

Can we do better?

Yes! We present a new compression approach: MDL with Holes:

Identify regions with blue cells, even if they contain non-blue cells;

Express the included blue cells by using regions with the exception of the covered non-blue cells;

Non-blue cells are called holes.

R2, R4, plus other 6 regions

R1 - (Vancouver, Skirts)

R9 - (Boston, ties) - (New York, dress skirts)

R3 - (Vancouver, Skirts)

A Motivating Example: MDL with Holes

[Figure: the data cube covered by regions with holes]

R1 + R3 - (Vancouver, Skirts)

MDL with Holes: Length = 6 + 3 + 3 = 12

MDL Approach: Length is 18

Problem Statements

The MDL with Holes (MDLH) problem is to find a description with holes that has the minimum length and, equivalently, the maximum benefit.

In practice, we can drill down on regions to get additional details.

Definitions: Length & Benefit

Given a set B of data cells (blue cells), an MDLH description for B is D = S - H, where S is a set of data regions, H is a set of data cells, also called 'holes', and D covers exactly the data cells in B.

Length: total number of the included regions and cells in the description: |D| = |S| + |H|.

Benefit: how much shorter the MDLH summary is than the enumeration of B: Benefit(D) = |B| - |D|.

B1 = {a, b, c}, D1 = s - d, |D1| = 2, Benefit(D1) = |B1| - |D1| = 1

B2 = {e, g}, D2 = t - f - h, |D2| = 3, Benefit(D2) = |B2| - |D2| = -1

[Figure: 1-D axis tree in which s covers leaves a, b, c, d and t covers leaves e, f, g, h]
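
The two definitions can be checked mechanically. The sketch below is only an illustration: the function names `description_length` and `benefit` are hypothetical, and regions and holes are represented as plain Python sets of labels. It recomputes the D1 and D2 examples above.

```python
def description_length(regions, holes):
    """|D| = |S| + |H|: count the regions plus the holes."""
    return len(regions) + len(holes)


def benefit(blue_cells, regions, holes):
    """Benefit(D) = |B| - |D|."""
    return len(blue_cells) - description_length(regions, holes)


# D1 = s - d describes B1 = {a, b, c}
b1 = benefit({"a", "b", "c"}, regions={"s"}, holes={"d"})

# D2 = t - f - h describes B2 = {e, g}
b2 = benefit({"e", "g"}, regions={"t"}, holes={"f", "h"})

print(b1, b2)
```

A negative benefit, as for D2, means that simply enumerating the blue cells is shorter than the description with holes.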

Related Work

The Generalized MDL Approach for Summarization, Laks V.S. Lakshmanan, Raymond T. Ng et al., VLDB 2002

  Reduces description length by allowing non-blue cells to be covered in the regions;

  The regions are not pure.

Concise Descriptions of Subsets of Structured Sets, Alberto O. Mendelzon & Ken Q. Pu, PODS 2003

  Allows Cartesian products to be formed;

  Not purely hierarchical: the NP-completeness result is less surprising;

  What about the purely hierarchical case?

Intelligent Rollups in Multidimensional OLAP Data, Gayatri Sathe and Sunita Sarawagi, VLDB 2001

  Only reports consistent generalizations: a tuple can be generalized along a set of dimensions only if it can be generalized along all subsets of dimensions.

Outline

Introduction to MDL with Holes

  A motivating example

  1-D Case: MDLH is Tractable

  2-D Case: MDLH is NP-Hard

Heuristics

  A Greedy Heuristic

  Dynamic Programming

  Quadratic Programming

Experimental Results

Summarization on Holes: An Extension

Conclusions & Contributions

1-D Case: MDLH is Tractable

[Figure: 1-D axis tree with leaves a to r, internal nodes s, t, u, v, w, and higher-level nodes x, y, z]

Node 'x':

D1 = x - d - f - j, Benefit(D1) = 7 - 4 = 3

D2 = (s - d) + e + (u - j), Benefit(D2) = 7 - 5 = 2

Node 'y':

D3 = y - m - p - q - r, Benefit(D3) = 4 - 5 = -1

D4 = (v - m) + o, Benefit(D4) = 4 - 3 = 1

Node 'z':

D5 = z - d - f - j - m - p - q - r, Benefit(D5) = 11 - 8 = 3

D6 = (x - d - f - j) + ((v - m) + o), Benefit(D6) = 11 - 7 = 4

MDLH is tractable: the optimal MDLH description, which has the maximum benefit, can be generated in polynomial time in the 1-D case.
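
One way to see why the 1-D case is tractable is a bottom-up pass over the axis tree: at each node, compare "take the node and subtract all its non-blue leaves" against "concatenate the children's descriptions". The sketch below is not taken from the paper. The tree shape (`CHILDREN`) and hole set (`HOLES`) are assumptions inferred from the D1 to D6 examples, and `summarize` is a hypothetical name.

```python
# Assumed tree, inferred from the slide's examples (the figure itself is lost).
CHILDREN = {
    "z": ["x", "y"],
    "x": ["s", "t", "u"],
    "y": ["v", "w"],
    "s": list("abcd"),
    "t": list("ef"),
    "u": list("ghij"),
    "v": list("klmn"),
    "w": list("opqr"),
}
HOLES = set("dfjmpqr")  # assumed non-blue leaves


def summarize(node):
    """Return (blue_count, hole_leaves, description_terms, length) for a subtree."""
    kids = CHILDREN.get(node)
    if kids is None:  # leaf: a blue leaf costs 1, a non-blue leaf costs nothing
        if node in HOLES:
            return 0, [node], [], 0
        return 1, [], [node], 1

    blue, holes, terms, length = 0, [], [], 0
    for kid in kids:
        b, h, t, n = summarize(kid)
        blue += b
        holes += h
        terms += t
        length += n

    if blue == 0:  # nothing to describe below this node
        return 0, holes, [], 0

    whole = 1 + len(holes)  # the node itself plus one hole per non-blue leaf
    if whole < length:
        return blue, holes, [node + "".join(" - " + h for h in holes)], whole
    return blue, holes, terms, length


blue, _, terms, length = summarize("z")
print(" + ".join(terms), "| length", length, "| benefit", blue - length)
```

Under these assumptions the pass reproduces D6: it keeps `x - d - f - j` for node x, keeps `(v - m) + o` for node y, and rejects covering z as a whole because that would cost 8 instead of 7.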


2-D Case: Optimality is not Preserved Any More

[Figure: 2-D grid with columns 1 to 7 and rows a to g]

rows                 length   benefit
(c,8),(d,8),(e,8)    4        0
(f,8),(g,8)          3        2
(a,8),(b,8)          5        -2

columns              length   benefit
(i,1)                3        2
(i,5)                5        -2
(i,2),(i,3),(i,4)
(i,6),(i,7)

Optimal Solution: {(c,8)+(d,8)+(e,8)+(i,2)+(i,3)+(i,4)} - {(c,2)+(c,3)+(c,4)+(d,2)+(d,3)+(d,4)+(e,2)+(e,3)+(e,4)} + (f,1)+(g,1)+(f,6)+(g,7)

Length = 19, Benefit = 28 - 19 = 9

MDLH is NP-Hard in 2-D Case

It is NP-Hard to find the optimal MDLH description in a 2-D data cube;

Not a trivial proof: details are in the paper;

Reduction strategy: Clique -> Maximum Induced Subgraph in Complete Edge-Weighted (CEW) Bipartite Graph -> MDL with Holes

Heuristics for MDLH

Greedy: each time, choose the row/column with the most benefit.

Dynamic Programming: a bottom-up method to get the description of a region from the descriptions of its children regions.

Quadratic Programming: using a quadratic function to represent the benefit of a 2-D data cube.
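
To make the greedy rule concrete, here is a minimal sketch for a flat grid with a single level of hierarchy, so that the only candidate regions are whole rows and whole columns. This simplification, the gain formula, and the name `greedy_mdlh` are assumptions for illustration and not the paper's algorithm. The gain of a candidate is the number of blue cells it newly covers, minus 1 for the region itself, minus the number of new holes it introduces.

```python
def greedy_mdlh(grid):
    """grid[i][j] is True for a blue cell. Returns (regions, holes, single_cells)."""
    n_rows, n_cols = len(grid), len(grid[0])
    lines = [("row", i) for i in range(n_rows)] + [("col", j) for j in range(n_cols)]

    def cells_of(line):
        kind, k = line
        if kind == "row":
            return [(k, j) for j in range(n_cols)]
        return [(i, k) for i in range(n_rows)]

    chosen, covered, holes = [], set(), set()
    while True:
        best, best_gain = None, 0
        for line in lines:
            if line in chosen:
                continue
            cells = cells_of(line)
            new_blue = sum(1 for (i, j) in cells if grid[i][j] and (i, j) not in covered)
            new_holes = sum(1 for (i, j) in cells if not grid[i][j] and (i, j) not in holes)
            gain = new_blue - 1 - new_holes
            if gain > best_gain:  # only strictly positive gains are accepted
                best, best_gain = line, gain
        if best is None:
            break
        chosen.append(best)
        for (i, j) in cells_of(best):
            (covered if grid[i][j] else holes).add((i, j))

    singles = [(i, j) for i in range(n_rows) for j in range(n_cols)
               if grid[i][j] and (i, j) not in covered]
    return chosen, sorted(holes), singles


grid = [
    [True, True, True, True],
    [True, True, False, True],
    [False, False, True, False],
]
regions, holes, singles = greedy_mdlh(grid)
total_length = len(regions) + len(holes) + len(singles)
print(regions, holes, singles, total_length)
```

On this small grid the sketch selects the first two rows, records one hole, and lists the remaining blue cell on its own.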

Example for Comparison with Heuristics

The optimal description for this example:

(e,1)-(a,1) + (e,2)-(b,2) + (e,3)-(b,3) + (d,4) + (b,5) + (e,6) + (e,8) + (a,11)-(a,8)

Length = 12

Benefit = 8

[Figure: example grid with columns 1 to 9 and rows a to d]

Heuristics: A Greedy Heuristic

[Figure: example grid with columns 1 to 9 and rows a to d]

region   length   benefit   holes
(e,6)    1        3         -
(d,10)   2        2         (d,5)
(e,1)    2        1         (a,1)
(e,2)    2        1         (b,2)
(e,3)    2        1         (b,3)
(a,11)   2        1         (a,8)
(e,8)    2        1         (a,8)
(c,10)   3        0         (c,4)(c,5)

Description by Greedy:

(e,6) + (a,11) + (e,8) - (a,8) + (d,10) - (d,5) + (a,2) + (a,3) + (b,1) + (b,5) + (c,1) + (c,2) + (c,3)

The length is 13. The benefit is 20 - 13 = 7.

Greedy: Why is it not optimal?

[Figure: the example grid shown twice, once with the description from Greedy and once with the optimal description]

The selection of one row/column may reduce the total benefit that can still be obtained from the others.

Heuristics: Dynamic Programming

[Figure: example grid with columns 1 to 9 and rows a to d]

L   1   2   3   4   5   10  6   7   8   9   11  12
a                       2                   2   4
b                       2                   2   4
c                       3                   2   5
d                       2                   2   4
e   2   2   2   1   1   8   1   1   2   1   5   13

S   1   2   3   4   5   10  6   7   8   9   11  12
a                       t2                  g   t2
b                       t2                  t2  t2
c                       t2                  t2  t2
d                       g                   t2  t2
e   g   g   g   t1  t1  t2  g   t1  g   t1  t2  t2

L: the length of a region

S: selection of rows & columns

(a,10): (a,2) + (a,3), so L(a,10) = 2, S(a,10) = 't2'

(e,4): (d,4), so L(e,4) = 1, S(e,4) = 't1'

(d,10): (d,10) - (d,5), so L(d,10) = 2, S(d,10) = 'g'

Heuristics: Dynamic Programming(2)

[Figure: example grid with columns 1 to 9 and rows a to d]

S   1   2   3   4   5   10  6   7   8   9   11  12
a                       t2                  g   t2
b                       t2                  t2  t2
c                       t2                  t2  t2
d                       g                   t2  t2
e   g   g   g   t1  t1  t2  g   t1  g   t1  t2  t2

D(x1,x2): description for region (x1,x2)

S(e,12) = 't2': D(e,12) = D(e,10) + D(e,11)

S(e,10) = 't2': D(e,10) = D(e,1) + D(e,2) + D(e,3) + D(e,4) + D(e,5)

S(e,11) = 't2': D(e,11) = D(e,6) + D(e,7) + D(e,8) + D(e,9)

D(e,1) to D(e,9): (e,1)-(a,1), (e,2)-(b,2), (e,3)-(b,3), (d,4), (b,5), (e,6), (a,7), (e,8)-(a,8), (a,9)

Generated Description:

(e,1)-(a,1) + (e,2)-(b,2) + (e,3)-(b,3) + (d,4) + (b,5) + (e,6) + (a,7) + (e,8)-(a,8) + (a,9)

The length is 13 and the benefit is 20 - 13 = 7.

Dynamic Programming: Why is it not optimal?

[Figure: the example grid shown twice, once with the description by Dynamic Programming and once with the optimal description]

It misses the combination of rows and columns.

Heuristics: Quadratic Programming

Use variables to represent rows/columns; for a variable v:

  v = 1: the corresponding row/column is selected;

  v = 0: the corresponding row/column is not selected.

f = -Benefit(D): maximizing the benefit is to minimize the value of f.

For the previous example, quadratic programming generates the optimal description;

Optimality is not guaranteed.
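
The slides do not spell out the quadratic function, so the following is only one possible formulation, under the assumption of a flat grid whose candidate regions are whole rows and whole columns. With 0/1 variables r[i] and c[j], a cell (i, j) is covered when r[i] + c[j] - r[i]*c[j] equals 1, and the product term is what makes f quadratic. The names `f`, `example`, and `best` are hypothetical. The brute-force search is there only to check the objective on a tiny grid and is not how a quadratic programming solver would work.

```python
from itertools import product


def f(r, c, grid):
    """f = -Benefit(D) for row selection r and column selection c (0/1 sequences)."""
    length = sum(r) + sum(c)  # one term per selected row or column
    blue_total = 0
    for i, row in enumerate(grid):
        for j, is_blue in enumerate(row):
            covered = r[i] + c[j] - r[i] * c[j]  # 1 if row i or column j is selected
            if is_blue:
                blue_total += 1
                length += 1 - covered  # an uncovered blue cell is listed on its own
            else:
                length += covered  # a covered non-blue cell becomes a hole
    return length - blue_total


example = [
    [True, True, True, True],
    [True, True, False, True],
    [False, False, True, False],
]

# Exhaustive check over all selections; feasible only for very small grids.
best = min(
    (f(r, c, example), r, c)
    for r in product([0, 1], repeat=len(example))
    for c in product([0, 1], repeat=len(example[0]))
)
print(best)
```

Selecting nothing gives f = 0 (pure enumeration), so any selection with a negative f is an improvement over listing the blue cells one by one.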

Experiments

We ran a set of experiments on the TPC-H benchmark data set;

We compared the three MDLH heuristics with MDL and GMDL.

Experimental Results: Comparison of All Methods

Compression Ratio:

MDLH-Quadratic generates the most concise descriptions: a yardstick of quality;

MDLH-Dynamic is a very close second.

[Figure: compression ratio versus number of blue cells (blue density), from 3916 (25%) to 11307 (72%), for MDLH-Greedy, MDLH-Dynamic, MDLH-Quadratic, GMDL-5%, and GMDL-10%]

Experimental Results: Compression Ratio

[Figure: compression ratio versus number of blue cells (blue density), from 10000 (20%) to 40000 (80%), for MDL, MDLH-Greedy, MDLH-Dynamic, GMDL-5%, and GMDL-10%]

The more children per parent node, the greater the benefit.

Experimental Results: Summary

Running time & scalability:

MDLH-Greedy is the fastest;

MDLH-Dynamic runs slower than MDLH-Greedy, but it is still scalable w.r.t. the number of cells.

[Figure: running times of MDL, GMDL, MDLH-Greedy, and MDLH-Dynamic on a 3-d 3-level data cube, a 3-d 4-level data cube, and a 5-d 4-level data cube; one value is labelled 379 secs]

Extension: Summarization on holes

As the blue density becomes high, a large part of the MDLH description is made up of holes.

Can we further reduce the total length by summarizing 'holes'?

MDLH description:

(a,11) - {(a,6)+(a,8)+(a,9)} + (d,11) - {(d,6)+(d,7)+(d,8)} + (b,6) + (c,8)

Total length is 10.

Summarization on holes:

(a,6)+(a,8)+(a,9) = (a,10) - (a,7)

(d,6)+(d,7)+(d,8) = (d,10) - (d,9)

After summarization on holes:

(a,11) - {(a,10) - (a,7)} + (d,11) - {(d,10) - (d,9)} + (b,6) + (c,8)

Total length is 8.

[Figure: example grid with columns 1 to 9 and rows a to d]
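
The two totals can be recounted with a small helper that walks a nested description and counts every region or cell symbol. The tuple encoding and the name `count_terms` are assumptions made for this illustration only.

```python
def count_terms(term):
    """Count the region/cell symbols in a nested ('+' | '-', operand, ...) description."""
    if isinstance(term, str):  # a single region or cell
        return 1
    _operator, *operands = term
    return sum(count_terms(operand) for operand in operands)


before = (
    "+",
    ("-", "(a,11)", "(a,6)", "(a,8)", "(a,9)"),
    ("-", "(d,11)", "(d,6)", "(d,7)", "(d,8)"),
    "(b,6)",
    "(c,8)",
)

after = (
    "+",
    ("-", "(a,11)", ("-", "(a,10)", "(a,7)")),
    ("-", "(d,11)", ("-", "(d,10)", "(d,9)")),
    "(b,6)",
    "(c,8)",
)

print(count_terms(before), count_terms(after))
```

Each group of three holes is replaced by a region minus one cell, which saves one term per group.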

Conclusions & Contributions

We present a new method, MDLH, to compress the answers of OLAP queries;

We present a bottom-up algorithm for the 1-D cube;

We proved the NP-hardness of the MDLH problem;

We provided three heuristics for MDLH: greedy, dynamic programming, and quadratic programming;

We extended the summarization to holes to further reduce the total length;

We did a set of experiments on the TPC-H benchmark data to compare the heuristics.

Ongoing Work

Based on the summarization on blue cells and the summarization on holes, build a visualization tool with MDLH summarization:

  Return summarized answers to users' queries;

  Provide a drill-down operation for users: browse details on blue cells; browse details on holes.

Design a k-approximation algorithm for MDLH: what is the best quality we can guarantee?
