Source: home.gwu.edu/~asherliu/publication/ibfs_slides.pdf
iBFS: Concurrent Breadth-First Search on GPUs
Hang Liu, H. Howie Huang, Yang Hu
The George Washington University
Graphs are Everywhere …
Many Algorithms Need Concurrent Traversals
❖ Single-Source Shortest Path
❖ Multi-Source Shortest Path
❖ All-Pairs Shortest Path
❖ Reachability Index Construction
❖ Centrality
Concurrent Traversals
[Figure: an example 9-vertex graph (vertices 0–8) traversed by four concurrent BFS instances — BFS-1 from root vertex 0, BFS-2 from root vertex 3, BFS-3 from root vertex 6, and BFS-4 from root vertex 8 — each producing its own BFS tree.]
Opportunities
Shared Frontiers
[Figure: the four BFS instances (roots 0, 3, 6, and 8) traversing the same graph reach overlapping frontiers at the same level; several instances share frontier vertices.]
Frontier Sharing is Common
[Bar chart: average frontier sharing (%), 0–50, for Facebook, Hollywood, KG2, Orkut, Random, Twitter, and Wiki graphs; the average across graphs is 39%.]
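To make the sharing ratio concrete, here is a minimal C++ sketch (the function name and data layout are invented for illustration, not the paper's instrumentation) that runs several BFS instances in lockstep and, at each level, reports the percentage of frontier vertices that appear in more than one instance's frontier:

```cpp
#include <set>
#include <vector>
using namespace std;

// Hypothetical helper: measure what fraction of the current frontier
// vertices are shared by at least two BFS instances, then advance every
// instance by one level. Returns the sharing ratio in percent.
double frontier_sharing_step(const vector<vector<int>>& adj,
                             vector<set<int>>& frontiers,
                             vector<set<int>>& visited) {
  set<int> all, shared;
  for (const auto& f : frontiers)
    for (int v : f)
      if (!all.insert(v).second)  // already seen in another frontier
        shared.insert(v);
  // Expand each instance to its next frontier.
  for (size_t i = 0; i < frontiers.size(); ++i) {
    set<int> next;
    for (int v : frontiers[i])
      for (int u : adj[v])
        if (visited[i].insert(u).second)  // first visit by instance i
          next.insert(u);
    frontiers[i] = next;
  }
  return all.empty() ? 0.0 : 100.0 * shared.size() / all.size();
}
```

For example, on a 3x3 grid graph with two BFS roots at opposite corners, the two frontiers do not overlap at levels 0 and 1 but fully coincide at level 2, giving 100% sharing at that level.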
iBFS Techniques
❖ Joint Traversal
❖ Bitwise Optimization
❖ GroupBy
Joint Frontier Queue and Joint Status Array
[Figure: four per-instance frontier queues and status arrays merged into a single Joint Frontier Queue (JFQ) and a single Joint Status Array (JSA) over vertices 0–8 for BFS-1 through BFS-4. Each JSA entry records, per instance, either a depth number (e.g., 1 or 2), F for frontier, or U for unvisited; the JFQ stores each shared frontier vertex only once.]
Benefits of Joint Traversal
❖ Load the neighbors of each frontier vertex once
❖ Coalesce memory accesses to the status array
❖ Eliminate redundant frontiers
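The benefits above can be sketched as a CPU-side model of one joint level (an assumed simplification of iBFS's GPU kernel, with invented names, and with the JSA reduced to a plain visited bitmask): each status word carries one bit per concurrent BFS instance, so one pass over the JFQ loads each frontier vertex's neighbor list once and updates up to 32 instances with a few bitwise operations.

```cpp
#include <cstdint>
#include <vector>
using namespace std;

// jsa[v]: bit i set means vertex v has been visited by BFS instance i.
// frontier[v]: bits of the instances for which v is a current frontier.
void joint_level(const vector<vector<int>>& adj,
                 const vector<int>& jfq,            // joint frontier queue
                 vector<uint32_t>& jsa,             // joint status array
                 const vector<uint32_t>& frontier,  // per-vertex frontier bits
                 vector<uint32_t>& next_frontier) {
  for (int v : jfq) {              // each frontier vertex appears only once
    uint32_t bits = frontier[v];   // all instances sharing this frontier
    for (int u : adj[v]) {         // neighbors loaded once, for all BFSs
      uint32_t unvisited = bits & ~jsa[u];
      if (unvisited) {             // bitwise check across every instance
        jsa[u] |= unvisited;       // mark visited in all of them at once
        next_frontier[u] |= unvisited;
      }
    }
  }
}
```

On a path 0–1–2 with BFS-1 rooted at 0 (bit 0) and BFS-2 rooted at 2 (bit 1), a single joint level discovers vertex 1 for both instances, setting both bits in its status word.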
Improving Frontier Sharing
❖ The performance of iBFS is, to a large extent, determined by how many frontiers are shared at each level during the joint traversal of each group.
❖ For any BFS group, say group A, its sharing ratio equals the expected value of SpeedupA (Lemma 1).
❖ It is observed that a higher sharing ratio at the initial levels leads to a higher expected sharing ratio at later levels (Theorem 1 and Lemma 2).
GroupBy
It is observed that a higher sharing ratio at the initial levels leads to a higher expected sharing ratio at later levels (Theorem 1 and Lemma 2).
[Line chart, built up across several animation slides: frontier sharing (%), 0–90, vs. iteration (2–6) for Group A, Group B, and a random group.]
How to Select Group A?
Outdegree-Based GroupBy Rules
[Figure: two source vertices, each with outdegree less than p (Rule 1), connecting to a common high-outdegree vertex whose outdegree exceeds q (Rule 2).]
❖ Rule 1: The outdegrees of the two source vertices are both less than p, ensuring that the source vertices have small outdegrees.
❖ Rule 2: The two source vertices connect to at least one common vertex whose outdegree is greater than q, ensuring that non-shared neighbors do not dilute the sharing ratio.
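The two rules can be sketched as a simple predicate (a hedged illustration: the function name is invented, and p and q are tunable thresholds whose values below are arbitrary, not the paper's):

```cpp
#include <vector>
using namespace std;

// Decide whether two BFS source vertices s1 and s2 may be grouped
// under the outdegree-based GroupBy rules.
bool groupable(const vector<vector<int>>& adj, int s1, int s2,
               size_t p, size_t q) {
  // Rule 1: both source vertices must have outdegree below p.
  if (adj[s1].size() >= p || adj[s2].size() >= p) return false;
  // Rule 2: they must share at least one neighbor of outdegree above q.
  for (int u : adj[s1])
    for (int w : adj[s2])
      if (u == w && adj[u].size() > q) return true;
  return false;
}
```

Intuitively, two low-outdegree sources hanging off the same hub will reach the hub's large neighborhood at the same level, so their frontiers overlap heavily from the start.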
Evaluation
❖ Implemented in 4,000 lines of C++ and CUDA code
❖ Also implemented a CPU version
❖ Extends our SC'15 Enterprise work (GPU-based single-source BFS)
❖ Run on NVIDIA K40 and K20 GPUs
❖ Scales to over 100 nodes on TACC Stampede
❖ Reports TEPS (traversed edges per second)
Speedup Insights
[Bar chart: TEPS (billion), 0–455, for Naive, +Joint traversal, +Bitwise optimizations, and +GroupBy on Facebook, Hollywood, KG0, Orkut, Pokec, Random, R-MAT, and Twitter; off-scale bars are annotated 506, 729, 640, and 831.]
Joint traversal improves performance by 40%; bitwise optimization contributes an additional 11x, and GroupBy an additional 2x.
Scalability
[Line chart: speedup (1–100, log scale) vs. number of machines (1 to 112) for Random, Facebook, Orkut, and Twitter.]
Workload imbalance is the only obstacle to scalability; there is no communication during computation.
Conclusion
iBFS achieves 57,000 billion TEPS on 112 GPUs:
❖ Joint Traversal
❖ Bitwise Optimization
❖ GroupBy Strategy

iBFS vs. MS-BFS [VLDB '15] (state of the art):
❖ MS-BFS extends single-threaded BFS
❖ MS-BFS's bitwise operation cannot support early termination
❖ MS-BFS does not have GroupBy
Acknowledgement
Thank You
{asherliu, howie, huyang}@gwu.edu
Please visit us at Poster #28 in Grand Ballroom A at 3:30PM today!