Beyond Triangles: A Distributed Framework for Estimating 3...
Transcript of Beyond Triangles: A Distributed Framework for Estimating 3...
Beyond Triangles: A Distributed Framework forEstimating 3-profiles of Large Graphs
Ethan R. Elenberg, Karthikeyan Shanmugam,Michael Borokhovich, Alexandros G. Dimakis
University of Texas, Austin, USA
August 12, 2015
E. R. Elenberg Beyond Triangles 1/20
Introduction
• Perform analytics on large graphs
- World Wide Web, social networks, bioinformatics
• More descriptive than triangle count, clustering coefficient
• Scalable, distributed algorithms
E. R. Elenberg Beyond Triangles 2/20
3-profile
• Count the induced subgraphs formed by selecting all triples ofvertices
H3
Definition
Let ni be the number of Hi’s in a graph G. The vectorn(G) = [n0, n1, n2, n3] is called the 3-profile of G.
- Always sums to(|V |
3
), the total number of 3-subgraphs
E. R. Elenberg Beyond Triangles 3/20
3-profile
• Count the induced subgraphs formed by selecting all triples ofvertices
H3H2H1H0
Definition
Let ni be the number of Hi’s in a graph G. The vectorn(G) = [n0, n1, n2, n3] is called the 3-profile of G.
- Always sums to(|V |
3
), the total number of 3-subgraphs
E. R. Elenberg Beyond Triangles 3/20
3-profile
• Count the induced subgraphs formed by selecting all triples ofvertices
H3H2H1H0
Definition
Let ni be the number of Hi’s in a graph G. The vectorn(G) = [n0, n1, n2, n3] is called the 3-profile of G.
- Always sums to(|V |
3
), the total number of 3-subgraphs
E. R. Elenberg Beyond Triangles 3/20
Examples
• 4-clique: n(K4) = [0, 0, 0, 4]
H3
E. R. Elenberg Beyond Triangles 4/20
Examples
• 4-clique: n(K4) = [0, 0, 0, 4]
H3
E. R. Elenberg Beyond Triangles 4/20
Examples
• 4-clique: n(K4) = [0, 0, 0, 4]
H3
E. R. Elenberg Beyond Triangles 4/20
Examples
• 4-clique: n(K4) = [0, 0, 0, 4]
H3
E. R. Elenberg Beyond Triangles 4/20
Examples
• 5-cycle: n(C5) = [?, ?, ?, ?]
E. R. Elenberg Beyond Triangles 5/20
Examples
• 5-cycle: n(C5) = [0, ?, ?, ?]
H0
E. R. Elenberg Beyond Triangles 5/20
Examples
• 5-cycle: n(C5) = [0, 5, ?, ?]
H1
E. R. Elenberg Beyond Triangles 5/20
Examples
• 5-cycle: n(C5) = [0, 5, 5, ?]
H2
E. R. Elenberg Beyond Triangles 5/20
Examples
• 5-cycle: n(C5) = [0, 5, 5, 0]
H3
E. R. Elenberg Beyond Triangles 5/20
Related Terms
For each v ∈ V :
Definition
The local 3-profile counts how many times v participates in eachHi with 2 other vertices.
Definition
The ego 3-profile is the 3-profile of ego graph N(v).
- Graph induced by set of neighbors Γ(v)
E. R. Elenberg Beyond Triangles 6/20
Related Terms
For each v ∈ V :
Definition
The local 3-profile counts how many times v participates in eachHi with 2 other vertices.
Definition
The ego 3-profile is the 3-profile of ego graph N(v).
- Graph induced by set of neighbors Γ(v)
E. R. Elenberg Beyond Triangles 6/20
Motivation
• Global 3-profile concisely describes local connectivity
- Molecule classification
• Local and ego 3-profiles are feature vectors for each vertex
- Spam detection- Generative models
E. R. Elenberg Beyond Triangles 7/20
Introduction
• Problem: Compute (or approximate) 3-profile quantities for alarge graph
• Approach: Edge sub-sampling and distributed implementation
E. R. Elenberg Beyond Triangles 8/20
Introduction
• Problem: Compute (or approximate) 3-profile quantities for alarge graph
• Approach: Edge sub-sampling and distributed implementation
E. R. Elenberg Beyond Triangles 8/20
Contributions
1 Derive a 3-profile sparsifier with provable guarantees
2 Design distributed, graph engine algorithms to calculate localand ego 3-profiles
3 Evaluate performance on real-world datasets
E. R. Elenberg Beyond Triangles 9/20
Related Work
Well studied across several communities:
• Graph sub-sampling
[Kim, Vu ’00] [Tsourakakis, et al. ’08 -’11] [Ahmed, et al. ’14]
• Large-scale triangle counting
[Satish, et al. ’14] [Shank ’07] [Suri, Vassilvitskii ’11]
• Subgraph counting
[Alon, et al. ’97] [Kloks, et al. ’00] [Kowaluk, et al. ’13]
• Graphlets
[Przulj ’07] [Shervashidze, et al. ’09]
E. R. Elenberg Beyond Triangles 10/20
Outline
1 Introduction
2 3-profile SparsifierEdge Sub-sampling ProcessConcentration Bound
3 3-PROF Algorithm
4 Experiments
5 Conclusions
E. R. Elenberg Beyond Triangles 10/20
Edge Sub-sampling Process
• Sub-sample each edge in the graph independently withprobability p
• Relate the original and sub-sampled graphs via a 1-stepMarkov chain
E. R. Elenberg Beyond Triangles 11/20
Edge Sub-sampling Process
Original
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
p3
Sub-sampled
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
Estimator
=
1 1− p (1− p)2 (1− p)3
0 p 2p(1− p) 3p(1− p)2
0 0 p2 3p2(1− p)0 0 0 p3
−1 Sub-sampled
E. R. Elenberg Beyond Triangles 12/20
Edge Sub-sampling Process
Original
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
p3
Sub-sampled
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
Estimator
=
1 1− p (1− p)2 (1− p)3
0 p 2p(1− p) 3p(1− p)2
0 0 p2 3p2(1− p)0 0 0 p3
−1 Sub-sampled
E. R. Elenberg Beyond Triangles 12/20
Edge Sub-sampling Process
Original
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
p2
p3
Sub-sampled
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
Estimator
=
1 1− p (1− p)2 (1− p)3
0 p 2p(1− p) 3p(1− p)2
0 0 p2 3p2(1− p)0 0 0 p3
−1 Sub-sampled
E. R. Elenberg Beyond Triangles 12/20
Edge Sub-sampling Process
Original
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
p2
3p2 (1−
p)
p3
Sub-sampled
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
Estimator
=
1 1− p (1− p)2 (1− p)3
0 p 2p(1− p) 3p(1− p)2
0 0 p2 3p2(1− p)0 0 0 p3
−1 Sub-sampled
E. R. Elenberg Beyond Triangles 12/20
Edge Sub-sampling Process
Original
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
1
(1−p)
(1− p
)2
(1−p)
3
p
2p(1− p
)
3p(1− p
)2
p2
3p2 (1−
p)
p3
Sub-sampled
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
Estimator
=
1 1− p (1− p)2 (1− p)3
0 p 2p(1− p) 3p(1− p)2
0 0 p2 3p2(1− p)0 0 0 p3
−1 Sub-sampled
E. R. Elenberg Beyond Triangles 12/20
Edge Sub-sampling Process
Original
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
1
(1−p)
(1− p
)2
(1−p)
3
p
2p(1− p
)
3p(1− p
)2
p2
3p2 (1−
p)
p3
Sub-sampled
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
Estimator
=
1 1− p (1− p)2 (1− p)3
0 p 2p(1− p) 3p(1− p)2
0 0 p2 3p2(1− p)0 0 0 p3
−1 Sub-sampled
E. R. Elenberg Beyond Triangles 12/20
Main Result
Theorem (3-profile sparsifiers)
For all (ε,p)-balanced graphs∗, the l∞-norm of the 3-profilesparsifier error is bounded by ε
(|V |3
)with high probability.
Definition
A graph is (ε,p)-balanced if the majority of “triangles,” “wedges,”or “single-edges” do not depend on one common edge.
Proof Sketch:
- Apply multivariate polynomial concentration inequalities [Kim,Vu ’00] to each estimator
f(G, p) = e1e2e4 + e4e5e6 + . . .
E. R. Elenberg Beyond Triangles 13/20
Main Result
Theorem (3-profile sparsifiers)
For all (ε,p)-balanced graphs∗, the l∞-norm of the 3-profilesparsifier error is bounded by ε
(|V |3
)with high probability.
Definition
A graph is (ε,p)-balanced if the majority of “triangles,” “wedges,”or “single-edges” do not depend on one common edge.
Proof Sketch:
- Apply multivariate polynomial concentration inequalities [Kim,Vu ’00] to each estimator
f(G, p) = e1e2e4 + e4e5e6 + . . .
E. R. Elenberg Beyond Triangles 13/20
Main Result
Theorem (3-profile sparsifiers)
For all (ε,p)-balanced graphs∗, the l∞-norm of the 3-profilesparsifier error is bounded by ε
(|V |3
)with high probability.
Definition
A graph is (ε,p)-balanced if the majority of “triangles,” “wedges,”or “single-edges” do not depend on one common edge.
Proof Sketch:
- Apply multivariate polynomial concentration inequalities [Kim,Vu ’00] to each estimator
f(G, p) = e1e2e4 + e4e5e6 + . . .
E. R. Elenberg Beyond Triangles 13/20
Outline
1 Introduction
2 3-profile SparsifierEdge Sub-sampling ProcessConcentration Bound
3 3-PROF Algorithm
4 Experiments
5 Conclusions
E. R. Elenberg Beyond Triangles 13/20
3-PROF
Vertex program in the Gather-Apply-Scatter framework
1 For each vertex v: Gather and Apply vertex IDs to store Γ(v)
2 For each edge va: Scatter
v an3,va = |Γ(v) ∩ Γ(a)|,
v anc2,va = |Γ(v)| − |Γ(v) ∩ Γ(a)| − 1, . . .
3 For each vertex v: Gather and Apply
v an3,v = 1
2
∑a∈Γ(v) n3,va
v anc2,v = 1
2
∑a∈Γ(v) n
c2,va, . . .
E. R. Elenberg Beyond Triangles 14/20
3-PROF
Vertex program in the Gather-Apply-Scatter framework
1 For each vertex v: Gather and Apply vertex IDs to store Γ(v)
2 For each edge va: Scatter
v an3,va = |Γ(v) ∩ Γ(a)|,
v anc2,va = |Γ(v)| − |Γ(v) ∩ Γ(a)| − 1, . . .
3 For each vertex v: Gather and Apply
v an3,v = 1
2
∑a∈Γ(v) n3,va
v anc2,v = 1
2
∑a∈Γ(v) n
c2,va, . . .
E. R. Elenberg Beyond Triangles 14/20
3-PROF
Vertex program in the Gather-Apply-Scatter framework
1 For each vertex v: Gather and Apply vertex IDs to store Γ(v)
2 For each edge va: Scatter
v an3,va = |Γ(v) ∩ Γ(a)|,
v anc2,va = |Γ(v)| − |Γ(v) ∩ Γ(a)| − 1, . . .
3 For each vertex v: Gather and Apply
v an3,v = 1
2
∑a∈Γ(v) n3,va
v anc2,v = 1
2
∑a∈Γ(v) n
c2,va, . . .
E. R. Elenberg Beyond Triangles 14/20
3-PROF
Vertex program in the Gather-Apply-Scatter framework
1 For each vertex v: Gather and Apply vertex IDs to store Γ(v)
2 For each edge va: Scatter
v an3,va = |Γ(v) ∩ Γ(a)|,
v anc2,va = |Γ(v)| − |Γ(v) ∩ Γ(a)| − 1, . . .
3 For each vertex v: Gather and Apply
v an3,v = 1
2
∑a∈Γ(v) n3,va
v anc2,v = 1
2
∑a∈Γ(v) n
c2,va, . . .
E. R. Elenberg Beyond Triangles 14/20
Outline
1 Introduction
2 3-profile SparsifierEdge Sub-sampling ProcessConcentration Bound
3 3-PROF Algorithm
4 Experiments
5 Conclusions
E. R. Elenberg Beyond Triangles 14/20
Implementation
• GraphLab PowerGraph v2.2
• Multicore server• 256 GB RAM, 72 logical cores
• EC2 cluster (Amazon Web Services)• 20 c3.8xlarge, 60 GB RAM, 32 logical cores each
Datasets
Name Vertices Edges (undirected)
Twitter 41, 652, 230 1, 202, 513, 046PLD 39, 497, 204 582, 567, 291LiveJournal 4, 846, 609 42, 851, 237Wikipedia 3, 515, 067 42, 375, 912DBLP 317, 080 1, 049, 866
E. R. Elenberg Beyond Triangles 15/20
Implementation
• GraphLab PowerGraph v2.2
• Multicore server• 256 GB RAM, 72 logical cores
• EC2 cluster (Amazon Web Services)• 20 c3.8xlarge, 60 GB RAM, 32 logical cores each
Datasets
Name Vertices Edges (undirected)
Twitter 41, 652, 230 1, 202, 513, 046PLD 39, 497, 204 582, 567, 291LiveJournal 4, 846, 609 42, 851, 237Wikipedia 3, 515, 067 42, 375, 912DBLP 317, 080 1, 049, 866
E. R. Elenberg Beyond Triangles 15/20
Results: 3-profile Sparsifier Accuracy, 5 runs
p=0.7 p=0.4 p=0.1 p=0.010.985
0.990
0.995
1.000
1.005
1.010
1.015A
ccur
acy
[exa
ct/a
ppro
x]PLD, Accuracy, 3-profiles
triangleswedges
edgeempty
E. R. Elenberg Beyond Triangles 16/20
Results: Multicore, 3 runs
Compare 3-PROF to GraphLab’s default triangle count
Twitter PLD0
100
200
300
400
500
600
Run
ning
time
[sec
]Twitter and PLD, Multicore (p=1)
3-prof Trian
E. R. Elenberg Beyond Triangles 17/20
Results: AWS, 5 runs
Compare EGO-PAR to naive, serial algorithm (EGO-SER )
100 egos 1K egos 10K egos10−1
100
101
102
103
104
105
Run
ning
time
[sec
]
>1000 sec
>10000 sec
LiveJournal, AWS c3 8xlargeEgo-ser 12 nodes Ego-par 12 nodes
E. R. Elenberg Beyond Triangles 18/20
Results: AWS, 5 runs
10k egos0
2
4
6
8
10
12
14R
unni
ngtim
e[s
ec]
LiveJournal, AWS c3 8xlargeEgo-par 12 nodes Ego-par 16 nodes Ego-par 20 nodes
E. R. Elenberg Beyond Triangles 19/20
Outline
1 Introduction
2 3-profile SparsifierEdge Sub-sampling ProcessConcentration Bound
3 3-PROF Algorithm
4 Experiments
5 Conclusions
E. R. Elenberg Beyond Triangles 19/20
Summary
1 Edge sub-sampling produces fast, accurate 3-profile estimates
2 3-profile counting consumes roughly the same resources astriangle counting
3 Distributed algorithms scale well over large data and largecomputing clusters
github.com/eelenberg/3-profiles
E. R. Elenberg Beyond Triangles 20/20
(Backup) Edge Pivot Equations
v a
∑a∈Γ(v)
(n3,va
2
) F2(v) 3F3(v)
= +
E. R. Elenberg Beyond Triangles 20/20
(Backup) Edge Pivot Equations
v a
∑a∈Γ(v)
(n3,va
2
) F2(v) 3F3(v)
= +
E. R. Elenberg Beyond Triangles 20/20
(Backup) Edge Pivot Equations
v a
∑a∈Γ(v)
(n3,va
2
) F2(v) 3F3(v)
= +
E. R. Elenberg Beyond Triangles 20/20
(Backup) Results: 3-profile Sparsifier Accuracy, 5 runs
p=0.7 p=0.5 p=0.3 p=0.1
0.996
0.998
1.000
1.002
1.004
Acc
urac
y[e
xact
/app
rox]
Twitter, Accuracy, 3-profilestriangleswedges
edgeempty
E. R. Elenberg Beyond Triangles 20/20
(Backup) Results: 3-PROF vs. TRIAN, AWS, 3 runs
12 nodes 16 nodes 20 nodes0
1
2
3
4
5
6
7
Run
ning
time
[sec
]
LiveJournal, AWS c3 8xlarge3-prof p=13-prof p=0.5
3-prof p=0.1Trian p=1
Trian p=0.5Trian p=0.1
LiveJournal Running Time
12 nodes 16 nodes 20 nodes0
20
40
60
80
100
120
Run
ning
time
[sec
]
PLD, AWS c3 8xlarge3-prof p=13-prof p=0.5
3-prof p=0.1Trian p=1
Trian p=0.5Trian p=0.1
PLD Running Time
E. R. Elenberg Beyond Triangles 20/20
(Backup) Results: 3-PROF vs. TRIAN, AWS, 3 runs
12 nodes 16 nodes 20 nodes0.0
0.2
0.4
0.6
0.8
1.0
Net
wor
kse
nt[b
ytes
]
×1010 LiveJournal, AWS c3 8xlarge3-prof p=13-prof p=0.5
3-prof p=0.1Trian p=1
Trian p=0.5Trian p=0.1
LiveJournal Network Usage
12 nodes 16 nodes 20 nodes0.0
0.2
0.4
0.6
0.8
1.0
1.2
Net
wor
kse
nt[b
ytes
]
×1011 PLD, AWS c3 8xlarge3-prof p=13-prof p=0.5
3-prof p=0.1Trian p=1
Trian p=0.5Trian p=0.1
PLD Network Usage
E. R. Elenberg Beyond Triangles 20/20
(Backup) Results: AWS, 5 runs
100 egos0
20
40
60
80
100
120
Run
ning
time
[sec
]
LiveJournal, AWS c3 8xlargeEgo-ser 12 nodes Ego-ser 16 nodes Ego-ser 20 nodes
EGO-SER
100 egos0
2
4
6
8
10
12
Run
ning
time
[sec
]
LiveJournal, AWS c3 8xlargeEgo-par 12 nodes Ego-par 16 nodes Ego-par 20 nodes
EGO-PAR
E. R. Elenberg Beyond Triangles 20/20