Source: home.gwu.edu/~asherliu/publication/ibfs_slides.pdf
iBFS: Concurrent Breadth-First Search on GPUs
Hang Liu, H. Howie Huang, Yang Hu
The George Washington University
Graphs are Everywhere …
Many Algorithms Need Concurrent Traversals
❖ Single-Source Shortest Path
❖ Multi-Source Shortest Path
❖ All-Pairs Shortest Path
❖ Reachability Index Construction
❖ Centrality
Concurrent Traversals
[Figure: an example 9-vertex graph (vertices 0–8) traversed by four concurrent BFS instances — BFS-1 from root vertex 0, BFS-2 from root vertex 3, BFS-3 from root vertex 6, and BFS-4 from root vertex 8 — each producing its own BFS tree.]
Opportunities
Shared Frontiers
[Figure: the four BFS instances (roots 0, 3, 6, and 8) traversing the same graph reach overlapping frontiers at the same level; several instances share frontier vertices.]
Frontier Sharing is Common
[Bar chart: average frontier sharing (%), 0–50, for Facebook, Hollywood, KG2, Orkut, Random, Twitter, and Wiki graphs; the average across graphs is 39%.]
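To make the sharing ratio concrete, here is a minimal C++ sketch (the function name and data layout are invented for illustration, not the paper's instrumentation) that runs several BFS instances in lockstep and, at each level, reports the percentage of frontier vertices that appear in more than one instance's frontier:

```cpp
#include <set>
#include <vector>
using namespace std;

// Hypothetical helper: measure what fraction of the current frontier
// vertices are shared by at least two BFS instances, then advance every
// instance by one level. Returns the sharing ratio in percent.
double frontier_sharing_step(const vector<vector<int>>& adj,
                             vector<set<int>>& frontiers,
                             vector<set<int>>& visited) {
  set<int> all, shared;
  for (const auto& f : frontiers)
    for (int v : f)
      if (!all.insert(v).second)  // already seen in another frontier
        shared.insert(v);
  // Expand each instance to its next frontier.
  for (size_t i = 0; i < frontiers.size(); ++i) {
    set<int> next;
    for (int v : frontiers[i])
      for (int u : adj[v])
        if (visited[i].insert(u).second)  // first visit by instance i
          next.insert(u);
    frontiers[i] = next;
  }
  return all.empty() ? 0.0 : 100.0 * shared.size() / all.size();
}
```

For example, on a 3x3 grid graph with two BFS roots at opposite corners, the two frontiers do not overlap at levels 0 and 1 but fully coincide at level 2, giving 100% sharing at that level.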
iBFS Techniques
❖ Joint Traversal
❖ Bitwise Optimization
❖ GroupBy
Joint Frontier Queue and Joint Status Array
[Figure: four per-instance frontier queues and status arrays merged into a single Joint Frontier Queue (JFQ) and a single Joint Status Array (JSA) over vertices 0–8 for BFS-1 through BFS-4. Each JSA entry records, per instance, either a depth number (e.g., 1 or 2), F for frontier, or U for unvisited; the JFQ stores each shared frontier vertex only once.]
Benefits of Joint Traversal
❖ Load the neighbors of each frontier vertex once
❖ Coalesce memory accesses to the status array
❖ Eliminate redundant frontiers
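The benefits above can be sketched as a CPU-side model of one joint level (an assumed simplification of iBFS's GPU kernel, with invented names, and with the JSA reduced to a plain visited bitmask): each status word carries one bit per concurrent BFS instance, so one pass over the JFQ loads each frontier vertex's neighbor list once and updates up to 32 instances with a few bitwise operations.

```cpp
#include <cstdint>
#include <vector>
using namespace std;

// jsa[v]: bit i set means vertex v has been visited by BFS instance i.
// frontier[v]: bits of the instances for which v is a current frontier.
void joint_level(const vector<vector<int>>& adj,
                 const vector<int>& jfq,            // joint frontier queue
                 vector<uint32_t>& jsa,             // joint status array
                 const vector<uint32_t>& frontier,  // per-vertex frontier bits
                 vector<uint32_t>& next_frontier) {
  for (int v : jfq) {              // each frontier vertex appears only once
    uint32_t bits = frontier[v];   // all instances sharing this frontier
    for (int u : adj[v]) {         // neighbors loaded once, for all BFSs
      uint32_t unvisited = bits & ~jsa[u];
      if (unvisited) {             // bitwise check across every instance
        jsa[u] |= unvisited;       // mark visited in all of them at once
        next_frontier[u] |= unvisited;
      }
    }
  }
}
```

On a path 0–1–2 with BFS-1 rooted at 0 (bit 0) and BFS-2 rooted at 2 (bit 1), a single joint level discovers vertex 1 for both instances, setting both bits in its status word.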
Improving Frontier Sharing
❖ The performance of iBFS is, to a large extent, determined by how many frontiers are shared at each level during the joint traversal of each group.
❖ For any BFS group, say group A, its sharing ratio equals the expected value of SpeedupA (Lemma 1).
❖ It is observed that a higher sharing ratio at the initial levels leads to a higher expected sharing ratio at later levels (Theorem 1 and Lemma 2).
GroupBy
It is observed that a higher sharing ratio at the initial levels leads to a higher expected sharing ratio at later levels (Theorem 1 and Lemma 2).
[Line chart, built up across several animation slides: frontier sharing (%), 0–90, vs. iteration (2–6) for Group A, Group B, and a random group.]
How to Select Group A?
Outdegree-Based GroupBy Rules
[Figure: two source vertices, each with outdegree less than p (Rule 1), connecting to a common high-outdegree vertex whose outdegree exceeds q (Rule 2).]
❖ Rule 1: The outdegrees of the two source vertices are both less than p, ensuring that the source vertices have small outdegrees.
❖ Rule 2: The two source vertices connect to at least one common vertex whose outdegree is greater than q, ensuring that non-shared neighbors do not dilute the sharing ratio.
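The two rules can be sketched as a simple predicate (a hedged illustration: the function name is invented, and p and q are tunable thresholds whose values below are arbitrary, not the paper's):

```cpp
#include <vector>
using namespace std;

// Decide whether two BFS source vertices s1 and s2 may be grouped
// under the outdegree-based GroupBy rules.
bool groupable(const vector<vector<int>>& adj, int s1, int s2,
               size_t p, size_t q) {
  // Rule 1: both source vertices must have outdegree below p.
  if (adj[s1].size() >= p || adj[s2].size() >= p) return false;
  // Rule 2: they must share at least one neighbor of outdegree above q.
  for (int u : adj[s1])
    for (int w : adj[s2])
      if (u == w && adj[u].size() > q) return true;
  return false;
}
```

Intuitively, two low-outdegree sources hanging off the same hub will reach the hub's large neighborhood at the same level, so their frontiers overlap heavily from the start.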
Evaluation
❖ Implemented in 4,000 lines of C++ and CUDA code
❖ Also implemented a CPU version
❖ Extends our SC'15 Enterprise work (GPU-based single-source BFS)
❖ Run on NVIDIA K40 and K20 GPUs
❖ Scales to over 100 nodes on TACC Stampede
❖ Reports TEPS (traversed edges per second)
Speedup Insights
[Bar chart: TEPS (billion), 0–455, for Naive, +Joint traversal, +Bitwise optimizations, and +GroupBy on Facebook, Hollywood, KG0, Orkut, Pokec, Random, R-MAT, and Twitter; off-scale bars are annotated 506, 729, 640, and 831.]
Joint traversal improves performance by 40%; bitwise optimization contributes an additional 11x, and GroupBy an additional 2x.
Scalability
[Line chart: speedup (1–100, log scale) vs. number of machines (1 to 112) for Random, Facebook, Orkut, and Twitter.]
Workload imbalance is the only obstacle to scalability; there is no communication during computation.
Conclusion
iBFS achieves 57,000 billion TEPS on 112 GPUs:
❖ Joint Traversal
❖ Bitwise Optimization
❖ GroupBy Strategy

iBFS vs. MS-BFS [VLDB '15] (state of the art):
❖ MS-BFS extends single-threaded BFS
❖ MS-BFS's bitwise operation cannot support early termination
❖ MS-BFS does not have GroupBy
Acknowledgement
Thank You
{asherliu, howie, huyang}@gwu.edu
Please visit us at Poster #28 in Grand Ballroom A at 3:30PM today!