Graph Cube: On Warehousing and OLAP Multidimensional Networks
Peixiang Zhao†, Xiaolei Li‡, Dong Xin§, Jiawei Han†
†Department of Computer Science, UIUC
‡Groupon Inc.
§Google Cooperation
†[email protected], [email protected], ‡[email protected], §[email protected]
June 16th, 2011
SIGMOD 2011 Athens, Greece 1 / 24
Outline
1 Introduction
2 The Graph Cube Model
3 OLAP on Graph Cube
Cuboid Query
Crossboid Query
4 Implementing Graph Cube
5 Experiment
6 Conclusion
Introduction
Recent years have seen an astounding growth of networks in a wide spectrum of application domains
Communication networks
Social networks
Biological networks
The Web
Multidimensional networks
1 An underlying graph structure comprising entities and relationships
2 Multidimensional attributes are specified and associated with entities of the network
There exist considerable technology gaps in managing, querying and summarizing multidimensional networks effectively
A Sample Multidimensional Network
(a) Graph (vertices 1 to 10)
ID  Gender  Location  Profession  Income
1   Male    CA        Teacher     $70,000
2   Female  WA        Teacher     $65,000
3   Female  CA        Engineer    $80,000
4   Female  NY        Teacher     $90,000
5   Male    IL        Lawyer      $80,000
6   Female  WA        Teacher     $90,000
7   Male    NY        Lawyer      $100,000
8   Male    IL        Engineer    $75,000
9   Female  CA        Lawyer      $120,000
10  Male    IL        Engineer    $95,000
(b) Vertex Attribute Table
Figure: A Multidimensional Network Comprising a Graph Structure and a Multidimensional Vertex Attribute Table
Introduction
Motivation: Can we extend decision support facilities to multidimensional networks?
Data warehouses and OLAP are advantageous in the multidimensional network scenario
Summarizing the massive networks into different levels of granularity for more effective analysis and exploration
Business Intelligence: in Facebook and Twitter, advertisers and marketers take advantage of social networks within different multidimensional spaces to better promote their products via social targeting or viral marketing
However, in multidimensional networks, much of the value and interest lies in the network itself!
Simple numeric-value-based group-bys in traditional data warehouses are no longer insightful and are of limited use, because the structural information of the networks is simply ignored
Network Aggregation vs. Traditional Group-by
(a) Aggregate Network (vertices: Male, Female)
Gender  COUNT(*)
Male    5
Female  5
(b) Aggregate Table
Figure: Multidimensional Network Aggregation vs. Traditional RDB Aggregation (Group by Gender)
(a) Aggregate Network (vertices: (Female, CA), (Male, IL), (Male, CA), (Female, WA), (Female, NY), (Male, NY))
Gender  Location  COUNT(*)
Male    CA        1
Female  CA        2
Female  WA        2
Male    IL        3
Male    NY        1
Female  NY        1
(b) Aggregate Table
Figure: Multidimensional Network Aggregation vs. Traditional RDB Aggregation (Group by Gender and Location)
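The two aggregate tables above can be reproduced from the vertex attribute table of the sample network. The following is a minimal sketch in Python; the names ATTRS, DIMS and group_counts are illustrative choices, not part of the original work. It covers only the traditional group-by side (the COUNT(*) column); the edges of the aggregate network are not computed here.

```python
from collections import Counter

# Vertex attribute table of the sample network: ID -> (Gender, Location, Profession).
# Income is left out because it is not used as a grouping dimension here.
ATTRS = {
    1: ("Male", "CA", "Teacher"),
    2: ("Female", "WA", "Teacher"),
    3: ("Female", "CA", "Engineer"),
    4: ("Female", "NY", "Teacher"),
    5: ("Male", "IL", "Lawyer"),
    6: ("Female", "WA", "Teacher"),
    7: ("Male", "NY", "Lawyer"),
    8: ("Male", "IL", "Engineer"),
    9: ("Female", "CA", "Lawyer"),
    10: ("Male", "IL", "Engineer"),
}
DIMS = ("Gender", "Location", "Profession")


def group_counts(attrs, dims):
    """Return COUNT(*) per group for a group-by on the given dimensions."""
    idx = [DIMS.index(d) for d in dims]
    return Counter(tuple(row[i] for i in idx) for row in attrs.values())


by_gender = group_counts(ATTRS, ["Gender"])
by_gender_location = group_counts(ATTRS, ["Gender", "Location"])

for group, count in sorted(by_gender_location.items()):
    print(group, count)
```

The counts double as the vertex weights of the aggregate network: each group of vertices is coalesced into one vertex that carries the group size.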
Introduction
Graph Cube
A multidimensional network can be summarized to aggregate networks in coarser levels of granularity within different multidimensional spaces
Vertex coalescence
Structure summarization
Different query models and OLAP solutions are proposed for multidimensional networks
Cuboid Queries
Crossboid Queries
Efficient implementation is based on a combination of
Well-studied data cube implementation techniques
Special characteristics of multidimensional networks
The first to systematically address warehousing and OLAP issues on large multidimensional networks
The Graph Cube Model
Multidimensional Network
A multidimensional network, N, is a graph denoted as N = (V, E, A), where V is a set of vertices, E ⊆ V × V is a set of edges and A = {A_1, A_2, ..., A_n} is a set of n vertex-specific attributes, i.e., ∀u ∈ V, there is a tuple A(u) of u, denoted as A(u) = (A_1(u), A_2(u), ..., A_n(u)), where A_i(u) is the value of u on the i-th attribute, 1 ≤ i ≤ n. A is called the dimensions of the network N.
Some (or all) dimensions A_i could be ∗ (ALL), representing a super-aggregation along A_i
Given a set of n dimensions of a network, there exist 2^n multidimensional spaces (aggregations)
The measure within each possible space is no longer a simple numeric value, but an aggregate network
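To make the 2^n count concrete, the sketch below enumerates every multidimensional space for the three dimensions of the sample network, writing ∗ (ALL) as "*" for each dimension that is aggregated away. The function name multidimensional_spaces is a hypothetical helper introduced only for this illustration.

```python
from itertools import combinations


def multidimensional_spaces(dims):
    """List all 2^n aggregations of the given dimensions.

    Each space is a tuple with one slot per dimension: the dimension name
    if it is kept, or "*" (ALL) if it is aggregated away.
    """
    spaces = []
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):
            spaces.append(tuple(d if d in kept else "*" for d in dims))
    return spaces


spaces = multidimensional_spaces(("Gender", "Location", "Profession"))
for space in spaces:
    print(space)
```

With n = 3 this yields 8 spaces, from the apex ("*", "*", "*") to the base ("Gender", "Location", "Profession").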
The Graph Cube Model
Graph Cube
Given a multidimensional network N = (V, E, A), the graph cube is obtained by restructuring N in all possible aggregations of A. For each possible aggregation A′ of A, the grouping measure is an aggregate network G′ w.r.t. A′.
Apex: 2
(Gender): 5, (Location): 12, (Profession): 8
(Gender, Location): 15, (Gender, Profession): 16, (Location, Profession): 19
Base: 23
Figure: The Graph Cube Lattice
OLAP on Graph Cubes
Cuboid Query: return as output the aggregate network corresponding to a specific aggregation of the dimensions of the multidimensional network
What is the network structure between various genders?
What is the network structure between the various gender and location combinations?
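Figure: the aggregate networks answering the two queries, one over (Gender) with vertices Male and Female, and one over (Gender, Location) with vertices (Female, CA), (Male, IL), (Male, CA), (Female, WA), (Female, NY) and (Male, NY)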
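A cuboid query can be sketched as a group-by that also summarizes structure: vertices with equal values on the chosen dimensions are coalesced, and original edges are counted between (or within) the resulting groups. The Python below is a minimal sketch under stated assumptions. The attribute values of vertices 1 to 5 come from the sample table, but the edge list EDGES is hypothetical (the edges of the sample graph are not reproduced here), and counting edges as the edge weight is an assumption about the aggregate function.

```python
from collections import Counter

# Attributes of vertices 1..5, taken from the sample vertex attribute table.
ATTRS = {
    1: {"Gender": "Male", "Location": "CA"},
    2: {"Gender": "Female", "Location": "WA"},
    3: {"Gender": "Female", "Location": "CA"},
    4: {"Gender": "Female", "Location": "NY"},
    5: {"Gender": "Male", "Location": "IL"},
}
# HYPOTHETICAL undirected edges, chosen only for illustration.
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]


def cuboid_query(attrs, edges, dims):
    """Aggregate the network on `dims`.

    Returns (vertex_weight, edge_weight): group sizes, and the number of
    original edges between each unordered pair of groups (a pair of equal
    groups represents a self-loop on the aggregate vertex).
    """
    def key(v):
        return tuple(attrs[v][d] for d in dims)

    vertex_weight = Counter(key(v) for v in attrs)
    edge_weight = Counter()
    for u, v in edges:
        a, b = sorted((key(u), key(v)))  # canonical order for undirected edges
        edge_weight[(a, b)] += 1
    return vertex_weight, edge_weight


vw, ew = cuboid_query(ATTRS, EDGES, ["Gender"])
print(dict(vw))
print(dict(ew))
```

Passing ["Gender", "Location"] instead of ["Gender"] answers the second question on the slide in the same way, at a finer granularity.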
OLAP on Graph Cubes
A cuboid query is within a single multidimensional space, which follows the traditional OLAP model
A crossboid query crosses multiple multidimensional spaces of the network, i.e., more than one cuboid is involved in a query
What is the network structure between the user with ID = 3 and various locations?
What is the network structure between users grouped by gender vs. users grouped by location?
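Figure: the crossboid results for the two questions, one linking the user with ID 3 to the locations WA, IL, CA and NY, and one linking the gender groups Male and Female to the location groups CA, IL, WA and NY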
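A crossboid query can be sketched by grouping the two endpoints of each edge under different cuboids. The Python below is only one plausible reading: the slides do not state the weighting rule, so counting each undirected edge once in each direction is an assumption, the edge list EDGES is hypothetical, and treating the vertex ID as a dimension (to express the base cuboid) is a convenience of this sketch.

```python
from collections import Counter

# Attributes of vertices 1..5 from the sample table, with ID added as a dimension.
ATTRS = {
    1: {"ID": 1, "Gender": "Male", "Location": "CA"},
    2: {"ID": 2, "Gender": "Female", "Location": "WA"},
    3: {"ID": 3, "Gender": "Female", "Location": "CA"},
    4: {"ID": 4, "Gender": "Female", "Location": "NY"},
    5: {"ID": 5, "Gender": "Male", "Location": "IL"},
}
# HYPOTHETICAL undirected edges, chosen only for illustration.
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]


def crossboid_query(attrs, edges, dims_a, dims_b):
    """Count edges between groups of cuboid A and groups of cuboid B.

    ASSUMPTION: an undirected edge {u, v} contributes once with u grouped
    under A and v under B, and once the other way round.
    """
    def key_a(v):
        return tuple(attrs[v][d] for d in dims_a)

    def key_b(v):
        return tuple(attrs[v][d] for d in dims_b)

    weight = Counter()
    for u, v in edges:
        weight[(key_a(u), key_b(v))] += 1
        weight[(key_a(v), key_b(u))] += 1
    return weight


# "What is the network structure between the user with ID = 3 and various locations?"
id_vs_location = crossboid_query(ATTRS, EDGES, ["ID"], ["Location"])
user3 = {b: w for (a, b), w in id_vs_location.items() if a == (3,)}
print(user3)

# "... users grouped by gender vs. users grouped by location?"
gender_vs_location = crossboid_query(ATTRS, EDGES, ["Gender"], ["Location"])
print(dict(gender_vs_location))
```

Restricting the ID side to a single user, as done for user 3, corresponds to slicing the base cuboid before crossing it with (Location).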
Cuboid Queries vs. Crossboid Queries
(a) Traditional Cuboid Queries: each query addresses one cuboid of the lattice, shown from Apex through (Gender), (Location), (Profession), (Gender, Location) and (Gender, Profession) down to (Gender, Location, Profession)
(b) Crossboid Queries Straddling Multiple Cuboids:
"What is the network structure between users grouped by gender and users grouped by location?" straddles (Gender) and (Location)
"What is the network structure between users and the locations?" straddles (Location) and (Gender, Location, Profession)
Graph Cube Implementation
Objective: compute the aggregate networks of different cuboids grouping on all possible dimension combinations of a multidimensional network
1 Full materialization: Best query response time, worst space cost
2 No materialization: Best space cost, worst query response time
3 Partial materialization: A small portion of cuboids is materialized in order to balance the tradeoff between query response time and cube resource requirement
Graph Cube Implementation: Partial Materialization
Problem: Select a set S of k cuboids in the graph cube for materialization such that the average time taken to evaluate the queries is minimized
The partial materialization problem is NP-complete, by reduction from set cover
Greedy Algorithm: Selecting k cuboids with the highest size-reduction benefit
Theorem
Let B_greedy be the benefit of the k cuboids chosen by the greedy algorithm and let B_opt be the benefit of any optimal set of k cuboids. Then B_greedy ≥ (1 − 1/e) × B_opt, and this bound is tight
MinLevel Algorithm: Materializing the cuboids c with dim(c) = l_0, where l_0 is the level in the cube lattice at which we start materializing cuboids
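The two selection strategies can be sketched on the three-dimensional lattice of the sample network. This is a hedged illustration, not the authors' implementation: the benefit function follows the classic greedy view-selection scheme for data cubes (the gain of a cuboid is the total reduction in the size of the cheapest materialized ancestor over all cuboids it can answer), which is an assumption about what "size-reduction benefit" means here. The SIZE values reuse the numbers shown on the lattice figure and treating them as cuboid sizes is also an assumption. All names are hypothetical.

```python
DIMS = ("Gender", "Location", "Profession")
BASE = frozenset(DIMS)

# ASSUMED cuboid sizes, keyed by the set of dimensions kept.
SIZE = {
    frozenset(): 2,
    frozenset({"Gender"}): 5,
    frozenset({"Location"}): 12,
    frozenset({"Profession"}): 8,
    frozenset({"Gender", "Location"}): 15,
    frozenset({"Gender", "Profession"}): 16,
    frozenset({"Location", "Profession"}): 19,
    BASE: 23,
}


def cost(cuboid, materialized):
    """Size of the cheapest materialized cuboid that can answer `cuboid`."""
    return min(SIZE[m] for m in materialized if cuboid <= m)


def benefit(candidate, materialized):
    """Total size reduction over every cuboid answerable from `candidate`."""
    return sum(
        max(0, cost(w, materialized) - SIZE[candidate])
        for w in SIZE
        if w <= candidate
    )


def greedy_select(k):
    """Pick k cuboids, one at a time, each with the highest current benefit."""
    chosen = {BASE}  # the base network is always available
    for _ in range(k):
        remaining = [c for c in SIZE if c not in chosen]
        if not remaining:
            break
        chosen.add(max(remaining, key=lambda c: benefit(c, chosen)))
    return chosen - {BASE}


def min_level_select(l0):
    """Materialize every cuboid c with dim(c) = l0."""
    return {c for c in SIZE if len(c) == l0}


print(greedy_select(2))
print(min_level_select(2))
```

With these assumed sizes the greedy pass first picks (Gender) and then (Gender, Location), while MinLevel at l_0 = 2 picks the three two-dimensional cuboids regardless of their sizes.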
Experimental Evaluation
DBLP data set
A co-authorship graph with 28,702 authors as vertices and 66,832 coauthor relationships as edges
Three dimensions: name, area, productivity
area: DB, DM, AI, IR
productivity: Excellent, Good, Fair, Poor
IMDB data set
A movie rating network with 116,164 vertices and 5,452,350 edges
Seven dimensions: Title, Year, Length, Budget, Rating, MPAA and Type
MPAA: G, PG, PG-13, R, NC-17, NR
Type: action, animation, comedy, drama, documentary, romance, short
Effectiveness Evaluation
(c) (Area): vertex weights DB 7752, DM 4590, AI 11329, IR 5031
(d) (Productivity): vertex weights Poor 26170, Fair 2165, Good 321, Excellent 46
Figure: Cuboid Queries of the Graph Cube on DBLP Data Set
Effectiveness Evaluation
Vertex weights of the (Area, Productivity) cuboid:
Area  Poor   Fair  Good  Excellent
DB    6825   732   161   34
DM    4209   331   43    7
AI    10498  747   83    1
IR    4638   355   34    4
(a) (Area, Productivity)
Figure: Cuboid Queries of the Graph Cube on DBLP Data Set
Effectiveness Evaluation
(a) Area ⋈ Productivity: area vertices DB 7752, DM 4590, AI 11329, IR 5031; productivity vertices Poor 26170, Fair 2165, Good 321, Excellent 46
Figure: Crossboid Queries of the Graph Cube on DBLP Data Set
Effectiveness Evaluation
(a) Area ⋈ Base ⋈ Productivity for “Hector Garcia-Molina”
Area:          DB 97, DM 4, AI 3, IR 11
Productivity:  Poor 52, Fair 33, Good 24, Excellent 6
(b) Area ⋈ Base ⋈ Productivity for “Philip S. Yu”
Area:          DB 66, DM 71, AI 4, IR 13
Productivity:  Poor 71, Fair 52, Good 12, Excellent 13
Figure: Crossboid Queries of the Graph Cube on DBLP Data Set
Efficiency Evaluation
(a) Time vs. # Dimensions: runtime (seconds) against number of dimensions (1 to 3), Raw Table vs. Graph Cube
(b) Time vs. # Edges: runtime (seconds) against number of edges (×10K, 1 to 6), Raw Table vs. Graph Cube
Figure: Full Materialization of Graph Cube for DBLP Data Set
Efficiency Evaluation
(a) Time vs. # Dimensions: runtime (seconds) against number of dimensions (1 to 6), Graph Cube vs. Raw Table
(b) Time vs. # Edges: runtime (seconds) against number of edges (×1M, 1 to 5), Graph Cube vs. Raw Table
Figure: Full Materialization of Graph Cube for IMDB Data Set
Efficiency Evaluation
(a) Cuboid Queries: runtime (seconds) against number of materialized cuboids (6 to 16), Greedy vs. MinLevel
(b) Crossboid Queries: runtime (seconds) against number of materialized cuboids (6 to 16), Greedy vs. MinLevel
Figure: Average Query Response Time w.r.t. Different Partial Materialization Algorithms
Conclusion
1 This work seeks to enhance decision-support functionality on large multidimensional networks
2 Graph Cube: A new data warehousing model designed specifically for efficient aggregation on multidimensional networks
3 Different query models and OLAP solutions for Graph Cube are proposed and studied
Crossboid queries break the boundary of the traditional OLAP model by straddling multiple cuboids of the Graph Cube
4 The implementation of Graph Cube is discussed, and the experimental results demonstrate the power and efficacy of Graph Cube as the first tool, to the best of our knowledge, for warehousing and OLAP on large multidimensional networks
Thank you