Post on 31-Dec-2015
TOWARDS DATA ANALYTICS ON ATTRIBUTED GRAPHS
NGS QE Oral Presentation
Student: Qi Fan
Supervisor: Prof. Kian-lee Tan
Outline
• Attributed Graph Analytic
• Graph Window Query
• Graph Window Query Processing
• Experiments
• Future Works
Graph Analytic Window Query Query Processing Experiments Future Work
Outline
Attributed Graph Analytic
• Graph Window Query
• Graph Window Query Processing
• Experiments
• Future Works
Data Analytics
• Data analytics plays an important part in business [1]
  • Web analytics for advertising and recommendation
  • Customer analytics for market optimization
  • Portfolio analytics for risk control
• Analytics on data yields:
  • Data products
  • Data-driven decision support
  • Insights into the data model
[1] Analytics Examples: http://en.wikipedia.org/wiki/Analytics
Relational Data Analytic
• Table as data representation, SQL as the query language
• Analytic SQL:
  • Ranking
  • Windowing
  • LAG/LEAD
  • FIRST/LAST
  • SKYLINE
  • TOP-K
  • …
Emergence of Large Linked Data
• In the real world, linked data is emerging:
  • Facebook, LinkedIn, biological networks, phone call networks, Twitter, etc.
• Modeling linked data relationally and querying it with SQL is inefficient:
  • Graph queries are often traversal-based
  • SQL-based traversal is reported to be 100 times slower than adjacency-list-based traversal [1]
• The graph model is a better fit for linked data
[1] http://java.dzone.com/articles/mysql-vs-neo4j-large-scale
Graph Data Model
G = (V, E, A)
• Attributed graph: vertices V, edges E, attributes A
[Figure: a graph of vertices and edges, alongside an attribute table with columns Vertex, Attr1, Attr2, Attr3, …]
Graph structure + attribute dimensions
Graph Data Model
• Graph data:
  • Vertex: entities, e.g. User, Webpage, Molecule
  • Edge: relationships, e.g. follow, cite, depends-on, friends-of
  • Attribute: profile information for a vertex or edge
• The specific model depends on the data:
  • Edge: directed / undirected
  • Attribute: homogeneous / inhomogeneous
Graph Data Model Example
People and follow relationships...
People and friends relationships…
Biomolecules and depends-on relationships...
Attributed Graph models a wealth of information
Graph Data Analytics
• The graph database ecosystem is growing:
  • Neo4j, Titan, SPARQL, Pregel, etc.
• Graph data analytics is becoming popular:
  • Graph Summarization [1], Graph OLAP [2], etc.
• In our research, we focus on:
  • Discovering the need for native graph analytical queries
  • Processing graph analytical queries efficiently
[1] Tian, Y., Hankins, R. A., & Patel, J. M. (2008, June). Efficient aggregation for graph summarization. In Proceedings of the 2008 ACM SIGMOD
[2] C. Chen, X. Yan, F. Zhu, J. Han, and P. S. Yu, “Graph olap: Towards online analytical processing on graphs,” in Data Mining, 2008. ICDM’08
Outline
• Attributed Graph Analytic
Graph Window Query
• Graph Window Query Processing
• Experiments
• Future Works
SQL Window Query
• A SQL window query:
  • Partitions a table
  • Sorts each partition
  • Implicitly forms a window for each tuple
[Figure: window of tuple 7]
The window of a tuple contains the other tuples related to it
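As a rough illustration of the partition, sort, and window steps, the sketch below emulates a running-sum window in plain Python. The function name `running_sum_by_partition`, the column names, and the sample rows are hypothetical and not from the presentation. It mirrors what a SQL expression such as `SUM(pay) OVER (PARTITION BY dept ORDER BY pay ROWS UNBOUNDED PRECEDING)` would compute.

```python
from itertools import groupby
from operator import itemgetter


def running_sum_by_partition(rows, part_key, order_key, value_key):
    """Return rows extended with a running sum computed inside each partition."""
    # Sorting by (partition, order) both partitions the table and sorts each partition.
    ordered = sorted(rows, key=itemgetter(part_key, order_key))
    out = []
    for _, partition in groupby(ordered, key=itemgetter(part_key)):
        total = 0
        for row in partition:
            # The window of this tuple is every tuple seen so far in its partition.
            total += row[value_key]
            out.append({**row, "window_sum": total})
    return out


# Hypothetical sample table.
rows = [
    {"dept": "a", "pay": 3},
    {"dept": "b", "pay": 5},
    {"dept": "a", "pay": 1},
]
result = running_sum_by_partition(rows, "dept", "pay", "pay")
```

Each output row carries the aggregate of its own window, which is the per-tuple behavior that the graph window query generalizes to vertices.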
Graph Window Query
• In a graph, each vertex can likewise have a set of related vertices as its window.
• Aggregating over that window gives a personalized analysis for each vertex.
Graph Window Examples
• These queries focus on the neighborhood of each user, so the neighborhood forms a vertex’s window
Summarizing the age distribution of each user’s friends
Summarizing the activeness of each user’s friends
Analyzing the industry distribution of a user’s potential connections
Graph Window Examples
• These queries focus on the ancestor-descendant relationship of molecules, so a vertex’s ancestors / descendants form its window
Find how many enzymes are in each molecule’s pathway
Find how many molecules are affected by each enzyme in the pathway
Graph Window Queries
• We thus identify two types of graph window queries:
• K-hop window (k-window):
  • A vertex’s k-hop window contains all the vertices that are its k-hop neighbors.
• Topological window (t-window):
  • A vertex’s topological window contains all the vertices that are its ancestors / descendants
Graph Window Queries
• K-hop window:
  • Similar to the ego-centric analysis used by the network analysis community
  • For an undirected graph:
    • All vertices that can reach a vertex within k hops
  • For a directed graph:
    • In-k-hop: vertices that reach a vertex within k hops
    • Out-k-hop: vertices reached by a vertex within k hops
    • K-hop: union of in-k-hop and out-k-hop
• T-window:
  • Requires the graph to be a DAG
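The k-hop definitions can be sketched with a depth-bounded BFS. This is a minimal illustration, not the presentation's implementation: the adjacency-dict representation and the names `k_hop_window`, `reverse_graph`, and `k_hop_union` are assumptions, as is the choice to include the vertex in its own window.

```python
from collections import deque


def k_hop_window(adj, source, k):
    """Vertices reachable from `source` in at most k hops, including `source`."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        vertex, dist = frontier.popleft()
        if dist == k:
            continue  # do not expand beyond k hops
        for nxt in adj.get(vertex, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen


def reverse_graph(adj):
    """Flip every edge so that in-k-hop can reuse the same BFS."""
    rev = {v: [] for v in adj}
    for v, targets in adj.items():
        for t in targets:
            rev.setdefault(t, []).append(v)
    return rev


def k_hop_union(adj, source, k):
    """K-hop window of a directed graph: union of out-k-hop and in-k-hop."""
    return k_hop_window(adj, source, k) | k_hop_window(reverse_graph(adj), source, k)
```

Calling `k_hop_window(adj, v, k)` on the original adjacency gives the out-k-hop window, and calling it on `reverse_graph(adj)` gives the in-k-hop window. For an undirected graph stored with both edge directions, the single call is enough.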
Graph Window Queries
• Graph Window Query:
  • INPUT: a specific window (k-hop or topological) and an aggregation function
  • OUTPUT: the aggregated value over each vertex’s window
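The input/output contract can be written as a small higher-order function. The sketch below is illustrative only: `graph_window_query` and `ancestors_window` are hypothetical names, the pathway DAG and enzyme flags are made-up sample data, and the topological window is assumed to contain a vertex's ancestors plus the vertex itself.

```python
def ancestors_window(parents, vertex):
    """Topological window of `vertex`: all its ancestors plus the vertex itself."""
    window = {vertex}
    stack = [vertex]
    while stack:
        current = stack.pop()
        for p in parents.get(current, ()):
            if p not in window:
                window.add(p)
                stack.append(p)
    return window


def graph_window_query(vertices, window_of, value, aggregate):
    """Apply `aggregate` to the attribute values inside each vertex's window."""
    return {v: aggregate(value[u] for u in window_of(v)) for v in vertices}


# Hypothetical pathway DAG: parents[x] lists the direct parents of x.
parents = {"m1": [], "m2": ["m1"], "m3": ["m1"], "m4": ["m2", "m3"]}
is_enzyme = {"m1": 1, "m2": 0, "m3": 1, "m4": 0}
enzymes_in_pathway = graph_window_query(
    parents, lambda v: ancestors_window(parents, v), is_enzyme, sum
)
```

Because every window is computed and aggregated on its own, this sketch has no sharing between vertices, which is the cost the indexes later in the talk aim to avoid.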
Outline
• Attributed Graph Analytic
• Graph Window Query
Graph Window Query Processing
• Experiments
• Future Works
Related Work
• In [1], a system called EAGr was proposed to process neighborhood queries
  • Focuses on 1-hop neighbors
• It uses iterative planning methods to share aggregation results between different vertices’ windows
• However, it assumes that large intermediate data resides in memory, which is not reasonable for k-window and t-window
[1] J. Mondal and A. Deshpande, “Eagr: Supporting continuous ego-centric aggregate queries over large dynamic graphs,” SIGMOD, 2015.
Graph Window Query Processing
• Naïve Processing I:
  1. Compute each vertex’s window sequentially
  2. Aggregate each vertex’s window individually
• Advantage:
  • No large intermediate data is generated
• Inefficiencies:
  • Repeated computation of every vertex’s window:
    • Each k-window or t-window needs its own traversal of the graph
  • Slow individual aggregation:
    • A single window can contain up to |V| vertices
    • Total aggregation cost can therefore be quadratic in |V|
Graph Window Query Processing
• Naïve Processing II:
  1. Materialize each vertex’s window
  2. At query time, aggregate each vertex’s window individually
• Advantage:
  • No computation of windows at run time
• Inefficiencies:
  • Materialization is not memory efficient
    • Together, the windows of all vertices can hold up to |V|² entries
  • Query processing is still as slow as in Naïve Processing I
Overview of our approach
• Two index schemes:
  • Dense Block Index: for general windows and k-hop windows
  • Parent Index: for topological windows
• The indexes achieve:
  • Complete preservation of the window information for each vertex
  • Space efficiency
  • Efficient run-time query processing
Dense Block Index – Matrix View
• Window Matrix:
  • Records the vertex-window mapping
  • Rows represent vertices
  • Columns represent windows
    A B C D E F
  A 1 1 1 1 1 1
  B 1 1 0 1 0 1
  C 1 0 1 1 1 1
  D 1 1 1 1 0 0
  E 1 0 1 0 1 0
  F 1 1 1 0 0 1
Dense Block Index – Matrix View
• Window Matrix Properties:
  • Boolean matrix
  • Completely keeps the vertex-window information
• Equivalent Matrices:
  • Row and column permutations can be applied to a window matrix
  • Invariant: the number of non-zero elements
Original window matrix:
    A B C D E F
  A 1 1 1 1 1 1
  B 1 1 0 1 0 1
  C 1 0 1 1 1 1
  D 1 1 1 1 0 0
  E 1 0 1 0 1 0
  F 1 1 1 0 0 1
Equivalent matrix after permuting rows and columns:
    A C B E D F
  B 1 0 1 0 1 1
  D 1 1 1 0 1 0
  F 1 1 1 0 0 1
  A 1 1 1 1 1 1
  C 1 1 0 1 1 1
  E 1 1 0 1 0 0
Dense Block Index – Matrix View
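• Window-matrix-based aggregation:
  • Similar to Naïve Processing II
  1. Traverse the matrix vertically (column by column)
  2. Aggregate the cells with value one, ignore cells with value zero
• Space and query complexity:
  • Proportional to the number of non-zero elements in sparse matrix format
  • Proportional to |V|² in dense matrix format
  • Note that the number of non-zero elements can itself be as large as |V|²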
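A minimal sketch of the column-wise aggregation, using the 6 × 6 window matrix from the earlier slide. The function name `aggregate_by_matrix` and the attribute values passed to it are hypothetical. SUM is used as the aggregate purely for illustration.

```python
NAMES = ["A", "B", "C", "D", "E", "F"]
WINDOW_MATRIX = [
    [1, 1, 1, 1, 1, 1],  # row A
    [1, 1, 0, 1, 0, 1],  # row B
    [1, 0, 1, 1, 1, 1],  # row C
    [1, 1, 1, 1, 0, 0],  # row D
    [1, 0, 1, 0, 1, 0],  # row E
    [1, 1, 1, 0, 0, 1],  # row F
]


def aggregate_by_matrix(matrix, names, value):
    """Sum `value` over each window by scanning the matrix one column at a time."""
    result = {}
    for col, window in enumerate(names):
        total = 0
        for row, vertex in enumerate(names):
            if matrix[row][col]:  # cells with value zero are ignored
                total += value[vertex]
        result[window] = total
    return result


# With every attribute set to 1, the aggregate is simply each window's size.
window_sizes = aggregate_by_matrix(WINDOW_MATRIX, NAMES, {v: 1 for v in NAMES})
```

The nested loop touches every cell, which corresponds to the dense-matrix cost. Iterating only over stored non-zero cells would give the sparse-format cost.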
Dense Block Index
• Dense Blocks:
  • Given a matrix, a dense block is a submatrix whose values are all non-zero
• Properties of a dense block with r rows and c columns:
  • Space complexity
    • r + c ids, compared to r × c cells
  • Query complexity
    • On the order of r + c operations, compared to r × c
Example: the dense block {A, B} × {A, B, C}
    A B C D
  A 1 1 1 0
  B 1 1 1 0
  C 0 0 0 1
  D 1 0 0 1
Store the row ids and column ids, i.e. (A,B)(A,B,C), rather than 6 elements
Query: compute A+B first, then share the result across the windows A, B and C
Space and query cost have the same asymptotic bounds, so both can be optimized simultaneously
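The sharing idea in the {A, B} × {A, B, C} example can be sketched as follows. The function names and the attribute values are hypothetical. The sketch computes A + B once and reuses it for the three windows covered by the block, then adds any cells outside the block.

```python
NAMES = ["A", "B", "C", "D"]
MATRIX = [
    [1, 1, 1, 0],  # row A
    [1, 1, 1, 0],  # row B
    [0, 0, 0, 1],  # row C
    [1, 0, 0, 1],  # row D
]


def naive_sums(matrix, names, value):
    """Aggregate every window (column) cell by cell."""
    return {
        names[j]: sum(value[names[i]] for i in range(len(names)) if matrix[i][j])
        for j in range(len(names))
    }


def sums_with_block(matrix, names, value, block_rows, block_cols):
    """Same result, but the block's partial sum is computed once and reused."""
    partial = sum(value[r] for r in block_rows)  # A + B, computed a single time
    result = {}
    for j, window in enumerate(names):
        shared = window in block_cols
        total = partial if shared else 0
        for i, vertex in enumerate(names):
            covered = shared and vertex in block_rows
            if matrix[i][j] and not covered:
                total += value[vertex]
        result[window] = total
    return result


value = {"A": 1, "B": 2, "C": 4, "D": 8}  # hypothetical attribute values
block_result = sums_with_block(MATRIX, NAMES, value, {"A", "B"}, {"A", "B", "C"})
```

Both functions return the same sums. The block version reads the values of A and B once instead of once per covered window.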
Dense Block Index
• Dense Block Index:
  • For every window to be computed, index all the dense blocks in a window matrix
  • A bipartite graph
[Figure: bipartite graph linking the windows A, B, C, D, E, F to the dense-block nodes (A,F,D), (B), (A,C), (C,E), (D), (E), (F)]
    A C B E D F
  B 1 0 1 0 1 1
  D 1 1 1 0 1 0
  F 1 1 1 0 0 1
  A 1 1 1 1 1 1
  C 1 1 0 1 1 1
  E 1 1 0 1 0 0
Dense Block Index
• Properties:
  • Preserves every non-zero entry of the window matrix
  • During a query, there is no need to access the original window matrix
• Query Processing:
  1. Compute partial aggregates for each dense block
  2. Compute final aggregates for every window
Dense Block Index Query Processing
Summarizing the activeness of each user’s friends:
Compute on graph G over the 1-hop window
  Vertex  Result
  A       118
  B       64
  C       103
  D       78
  E       66
  F       55
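The two-step query can be sketched against a block-to-window index. Everything specific here is an assumption: the per-vertex values do not appear in the text and were chosen so that a SUM over the 1-hop windows of the earlier window matrix reproduces the results listed above, and the block-to-window links are one decomposition consistent with the dense-block labels and that matrix. The name `dbi_sum` is hypothetical.

```python
# Assumed per-vertex "activeness" values (not given in the slides).
VALUES = {"A": 12, "B": 15, "C": 21, "D": 30, "E": 33, "F": 7}

# Assumed bipartite index: (vertices in the dense block, windows that use it).
DBI = [
    ({"A", "F", "D"}, {"A", "B", "C"}),
    ({"B"}, {"A", "B", "D", "F"}),
    ({"A", "C"}, {"D", "E", "F"}),
    ({"C", "E"}, {"A", "C"}),
    ({"D"}, {"D"}),
    ({"E"}, {"E"}),
    ({"F"}, {"F"}),
]


def dbi_sum(index, value):
    """Step 1: one partial sum per dense block. Step 2: add it to every linked window."""
    result = {}
    for members, windows in index:
        partial = sum(value[v] for v in members)  # computed once per block
        for w in windows:
            result[w] = result.get(w, 0) + partial
    return result


window_sums = dbi_sum(DBI, VALUES)
```

The query never consults the window matrix itself. Each window's answer is assembled entirely from the partial aggregates of the blocks linked to it.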
Dense Block Index
• Equivalent matrices may have different optimal partitions
• Find the best dense block partition among all equivalent matrices
  • Fixed-size dense block partitioning is NP-hard [1]
  • Heuristics need to be applied
    A B C D E F
  A 1 1 1 1 1 1
  B 1 1 0 1 0 1
  C 1 0 1 1 1 1
  D 1 1 1 1 0 0
  E 1 0 1 0 1 0
  F 1 1 1 0 0 1
    A C B E D F
  B 1 0 1 0 1 1
  D 1 1 1 0 1 0
  F 1 1 1 0 0 1
  A 1 1 1 1 1 1
  C 1 1 0 1 1 1
  E 1 1 0 1 0 0
[1] V. Vassilevska and A. Pinar, “Finding nonoverlapping dense blocks of a sparse matrix,” Lawrence Berkeley National Laboratory, 2004
MinHash Clustering for DBI
• Heuristic
  • Group similar windows together, then mine the dense blocks in each cluster
  • Clustering + Mining
• Clustering:
  • The Jaccard coefficient is used to measure the similarity between windows
    • Since each window is a set of vertices
  • MinHash is an efficient way to perform Jaccard-coefficient-based clustering
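A minimal MinHash sketch, assuming windows are non-empty sets of vertex ids. The helper names, the use of `zlib.crc32` as a stable base hash, the linear hash family, and the rule of clustering by identical signature are illustrative choices, not the presentation's implementation. For one hash function, two windows receive the same minimum with probability approximately equal to their Jaccard coefficient, which is why similar windows tend to land in the same cluster.

```python
import random
import zlib
from collections import defaultdict

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, used as the hash modulus


def jaccard(a, b):
    """Exact Jaccard coefficient of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def make_hashes(n, seed=0):
    """Draw n (a, b) pairs for hash functions of the form (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def signature(window, hashes):
    """MinHash signature: the minimum hashed element under each hash function."""
    ids = [zlib.crc32(str(v).encode()) for v in window]
    return tuple(min((a * x + b) % PRIME for x in ids) for a, b in hashes)


def cluster_by_signature(windows, hashes):
    """Put windows with identical signatures into the same cluster."""
    clusters = defaultdict(list)
    for owner, window in windows.items():
        clusters[signature(window, hashes)].append(owner)
    return sorted(sorted(c) for c in clusters.values())
```

Using fewer hash functions makes clusters coarser, since collisions become more likely. Using more makes them stricter.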
MinHash Clustering for DBI
• Mining:
  1. Build a partial window matrix for each cluster
  2. Condense the rows with identical values
  3. For uncondensed rows, recursively cluster and mine until the stop condition is met
MinHash Clustering for DBI
Window matrix:
    A B C D E F
  A 0 0 1 1 1 1
  B 1 1 1 1 1 0
  C 0 0 1 1 1 1
  D 1 1 1 1 0 1
  E 0 0 1 1 0 0
  F 0 1 1 1 1 1
MinHash clustering splits the columns into two partial matrices:
    A B
  A 0 0
  B 1 1
  C 0 0
  D 1 1
  E 0 0
  F 0 1
    C D E F
  A 1 1 1 1
  B 1 1 1 0
  C 1 1 1 1
  D 1 1 0 1
  E 1 1 0 0
  F 1 1 1 1
Outputs after condensing identical rows:
        A B
  B,D   1 1
  F     0 1
          C D E F
  A,C,F   1 1 1 1
  Output dense block: {A, C, F} × {C, D, E, F}
Split: uncondensed rows, passed on to the recursive cluster step:
    C D E F
  B 1 1 1 0
  D 1 1 0 1
  E 1 1 0 0
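The condensing step for the {C, D, E, F} cluster of this example can be sketched as below. The function name `condense_rows` and the rule used here (groups of two or more identical rows become a dense block, single rows are left for the recursive step, all-zero rows are dropped) are assumptions made for illustration. The slides do not spell out the exact stop condition.

```python
from collections import defaultdict


def condense_rows(rows, columns):
    """Group identical rows; return (dense_blocks, leftover_rows)."""
    groups = defaultdict(list)
    for vertex, bits in rows.items():
        groups[tuple(bits)].append(vertex)

    blocks, leftover = [], {}
    for bits, members in groups.items():
        cols = [c for c, bit in zip(columns, bits) if bit]
        if not cols:
            continue  # all-zero rows carry no window information
        if len(members) > 1:
            blocks.append((sorted(members), cols))  # rows x cols is all ones
        else:
            leftover[members[0]] = list(bits)  # handed to the recursive step
    return blocks, leftover


# Partial window matrix of the cluster {C, D, E, F} from the example.
columns = ["C", "D", "E", "F"]
rows = {
    "A": [1, 1, 1, 1], "B": [1, 1, 1, 0], "C": [1, 1, 1, 1],
    "D": [1, 1, 0, 1], "E": [1, 1, 0, 0], "F": [1, 1, 1, 1],
}
blocks, leftover = condense_rows(rows, columns)
```

On this input the only emitted block is {A, C, F} × {C, D, E, F}, and rows B, D and E remain for recursive clustering, matching the example.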
MinHash Clustering for DBI
• DBI generation can be summarized in the following steps:
  • Clustering Step:
    1. MinHash each vertex, based on its window
  • Mining Step:
    1. Generate a partial matrix for each window
    2. Group identical rows
    3. Recursive clustering
• Bottlenecks:
  • MinHash cost
  • Window cost, for both k-window and t-window
  • Both are too high in practice
Estimated MinHash Clustering
• For k-hop windows, we developed an estimation scheme to speed up index creation.
• The observation is that as the number of hops grows, the overlap between vertices’ windows also grows
  • Thus we can use lower-hop window information in the clustering phase
Comparison
• MinHash Clustering
  1. Clustering Step:
     1. MinHash each vertex, based on its window
  2. Mining Step:
     1. Generate a partial matrix for each window
     2. Group identical rows
     3. Recursive clustering
• Estimated Clustering
  1. Clustering Step:
     1. MinHash each vertex, based on its lower-hop window
  2. Mining Step:
     1. Generate a partial matrix for each window
     2. Group identical rows
     3. Recursive clustering
The estimation reduces the indexing time since:
1. A lower-hop window has fewer elements, so MinHash is faster
2. Lower-hop window generation requires less time
Topological Window Processing
• The Dense Block Index can be used on topological windows as well
  • However, a more efficient index exists for a t-window query
• Containment relationship in t-windows
  • If a vertex u is an ancestor of a vertex v, then u’s window is contained in v’s window
  • Thus, when computing the window of v, u’s result can be used directly.
Parent Index
• Given a vertex v and an ancestor u, in order to use u’s window for computing v’s window, we need to materialize the difference between the two windows
• For a given vertex v, the vertex with the smallest difference must be one of v’s parents
• Thus, for each vertex, we only index the parent that has the smallest difference
Parent Index
• A parent index is a lookup table with three fields:
  • Vertex: the index entry
  • Parent: the id of the closest parent
  • Diff: the vertices that differ between the windows of Vertex and Parent
Parent Index based Query Processing
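• Process each vertex’s window in topological order
• Use the formula: a vertex’s aggregate is its indexed parent’s aggregate combined with the aggregate over its Diff vertices
• Topological order ensures that when a vertex is processed, its parents’ results are ready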
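A sketch of parent-index query processing for a SUM aggregate. The index layout (a dict from vertex to a `(parent, diff)` pair), the function name, and the sample DAG with edges a→c, b→c, a→d, c→d are assumptions. Windows are taken to be a vertex's ancestors plus the vertex itself.

```python
def query_with_parent_index(topo_order, parent_index, value):
    """Sum over each topological window, reusing the indexed parent's result."""
    result = {}
    for v in topo_order:
        parent, diff = parent_index[v]
        # The parent precedes v in topological order, so its result is ready.
        base = result[parent] if parent is not None else 0
        result[v] = base + sum(value[u] for u in diff)
    return result


# Hypothetical index for the DAG a->c, b->c, a->d, c->d.
parent_index = {
    "a": (None, {"a"}),
    "b": (None, {"b"}),
    "c": ("a", {"b", "c"}),  # window(c) = window(a) + {b, c}
    "d": ("c", {"d"}),       # window(d) = window(c) + {d}
}
value = {"a": 1, "b": 2, "c": 3, "d": 4}
sums = query_with_parent_index(["a", "b", "c", "d"], parent_index, value)
```

Reuse by addition is valid here because the parent's window is contained in the child's window and Diff is exactly the remainder. Aggregates that cannot be combined this way would need a different merge step.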
Parent Index Creation
• Efficient creation based on a topological scan:
  • During the scan, each vertex passes its current ancestor information to its children
  • On receiving a parent’s ancestor information, a child unions it with the ancestors it already holds
  • Once a child has received all its parents’ information, it records the parent with the smallest difference
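The topological scan can be sketched as follows. This is an in-memory illustration under stated assumptions: `parents` maps each vertex to a list of its direct parents, windows are ancestors plus the vertex itself, ties between parents are broken by list order, and the name `build_parent_index` is hypothetical. It keeps every window in memory, which a real implementation may well avoid.

```python
def build_parent_index(topo_order, parents):
    """Return {vertex: (closest_parent, diff)} from one scan in topological order."""
    window = {}  # vertex -> its ancestors plus itself
    index = {}
    for v in topo_order:
        # Union the ancestor sets received from all parents.
        current = {v}
        for p in parents.get(v, ()):
            current |= window[p]
        window[v] = current
        # Record the parent whose window differs least from v's window.
        best = min(
            parents.get(v, ()),
            key=lambda p: len(current - window[p]),
            default=None,
        )
        diff = current - window[best] if best is not None else set(current)
        index[v] = (best, diff)
    return index


# Hypothetical DAG: a->c, b->c, a->d, c->d.
parents = {"a": [], "b": [], "c": ["a", "b"], "d": ["c", "a"]}
index = build_parent_index(["a", "b", "c", "d"], parents)
```

For vertex d the scan picks c rather than a, because c's window already covers a and b and leaves only d itself in the difference.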
Outline
• Attributed Graph Analytic
• Graph Window Query
• Graph Window Query Processing
Experiments
• Future Works
Experiments
• Machine: 2.27GHz CPU with 32 GB memory
• Synthetic data:
  • SNAP [1] generator for directed graphs
  • DAGGER [2] generator for DAGs
[1] Stanford Network Analysis Platform, http://snap.stanford.edu/snap/index.html
[2] H. Yildirim, V. Chaoji, and M. J. Zaki, “Dagger: A scalable index for reachability queries in large dynamic graphs,” arXiv preprint arXiv:1301.0977, 2013.
Comparing Algorithms
• K-hop window:
  • MA: materialize-ahead algorithm (materializes the vertex-window mapping, aggregates individually)
  • KBBFS: bounded BFS for computing the window of each vertex
  • MC: MinHash Clustering
  • EMC: Estimated MinHash Clustering
• Topological window:
  • MA
  • DBI: Dense Block Index
  • TS: topological scan to compute the window of each vertex
  • PI: Parent Index
Effectiveness of Estimation
[Figure: four panels, Hop = 1, Hop = 2, Hop = 3, Hop = 4]
Benefit of Estimation
Degree 160
  Hop  MC_HASH    MC_BFS     EMC_HASH  EMC_BFS    EMC/MC
  2    157,885    241,072    1,666     120,931    0.307294
  3    2,281,794  4,494,853  1,637     2,257,493  0.33337
  4    4,355,439  8,633,192  1,631     4,414,207  0.339977
Degree 40
  Hop  MC_HASH    MC_BFS     EMC_HASH  EMC_BFS    EMC/MC
  2    33,611     19,559     484       9,974      0.19669
  3    417,102    742,502    470       374,489    0.323351
  4    964,521    1,843,078  471       927,751    0.330611
Index size of MC and EMC
Degree = 40
Scalability of EMC
[Figure: two panels, V = 100k with hop = 1 and V = 100k with hop = 2]
Effectiveness of PI
V = 10k
Index size of PI
Vertex = 10k
Indexing Time of PI
Degree = 20
Scalability of PI
Degree = 10
Outline
• Attributed Graph Analytic
• Graph Window Query
• Graph Window Query Processing
• Experiments
Future Works
Conclusion and Future Work
• Conclusion:
  • We proposed two graph window queries and two indexes for processing them efficiently
• In future:
  • Extend the query processing to handle large graphs (on a parallel platform / with a disk-resident index)
  • More complex aggregation processing (including graph OLAP)
  • Dynamic graphs (handling updates)
Thank you!