Download - Multi&GPUGraphAnalytics Yuechao Pan, Yangzihao …sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/... · Multi&GPUGraphAnalytics Yuechao Pan, Yangzihao Wang, Yuduo

Multi-‐GPU Graph Analytics Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang and John D. Owens, University of California, Davis{ychpan, yzhwang, yudwu, ctcyang, jowens}@ucdavis.edu

Introduction -‐ about Gunrock

Programming Model

Multi-‐GPU Framework Results

Future Work

Samples

AcknowledgementThe GPU hardware and cluster access was provided by NVIDIA.This work was funded by the DARPA XDATA program under AFRL Contract FA8750-‐13-‐C-‐0002 and by NSF awards CCF-‐1017399 and OCI-‐1032859.

References[1] Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D. Owens. Multi-‐GPU Graph Analytics. CoRR, abs/1504.04804, Apr. 2015.[2] Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. Gunrock: A High-‐Performance Graph Processing Library on the GPU. CoRR, abs/1501.05387v2, March 2015.[3] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU graph traversal. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 117-‐128, Feb.2012. [4] J. Zhong and B. He. Medusa: Simplified graph processing on GPUs. IEEE Transactions on Parallel and Distributed Systems, 25(6):1543-‐1552,June 2014.[5] Z. Fu, H. K. Dasari, B. Bebee, M. Berzins, and B. Thompson. Parallel breadth first search on GPU clusters. In IEEE International Conference on Big Data, pages 110-‐118, Oct. 2014.

Gunrock is a multi-‐GPU graph processing library, which targets at:• High performance analytics of large graphs• Low programming complexity in implementing parallel

graph algorithms on GPUsHomepage: http://gunrock.github.ioThe copyright of Gunrock is owned by The Regents of the University of California, 2015. All source code are released under Apache 2.0.

ref. ref. hardware ref. performance our hardware our performancermat_n20_128 Merrill et al. [3] 4x Tesla C2050 8.3 GTEPS 4x Tesla K40 11.2 GTEPSrmat_n20_16 Zhong et al. [4] 4x Tesla C2050 15.4 ms 4x Tesla K40 9.29 mspeak GTEPS Fu et al. [5] 16x Tesla K20 15 GTEPS 6x Tesla K40 22.3 GTEPSpeak GTEPS Fu et al. [5] 64x Tesla K20 29.1 GTEPS 6x Tesla K40 22.3 GTEPS

• performance analysis and optimization• extending Gunrock onto multiple nodes• asynchronized graph algorithms

Local input frontier

PartitionerInput graph Partition table

Sub-graph builder Sub-graphs


Output sub-frontier

Merge

Received data package

Remote input frontier

Output sub-frontier

Full-queue kernels

Merged frontier

Output frontier

Separate


Remote output frontier

Data package

Output sub-frontier

Merge

Remote input frontier

Output sub-frontier

Merged frontier

Output frontier

Separate

Local input

frontier

Remote output frontier

Data package

Received data package

FinishConverged? Converged?

GPU0 GPU1

Package data

Push to peer

Unpackage

Sub-queue kernels Sub-queue kernelsSub-queue kernels

Unpackage

Legend:

Package data

Push to peer

Parameters required from user

User provided operations

Sub-queue kernels

1

4

3

2

6

5

0

1

1

1

2

2

Graph traversal

V-IdLabel

Source vertex /Frontier 0

Frontier 1

Frontier 2

1

2

3

04

6

5

0

4

1

35

2

6

4

1

5

02

6

3

4

1

5

0

2

6

3

0

4

1

35

2

6

GPU0 GPU1

Graph partition

xy

Original vertices

x y

xy

x yLocal vertices

Remote vertices(with local replica)

x Local V-idy Remote V-id

• 2D partitioning• Fixed partitioning• more algorithms

Comparison with previous work on GPU BFS

5

10

15

20

2 4 6Number of GPUs

GTE

PS

rmat_n20_1023rmat_n23_48

Traversed edges per second (TEPS) for BFS

Number of GPUs

1

2

3

4

5

2 4 6Number of GPUs

Spee

dup

vs. G

PU x

1 BFSSSSPBCCCPR

Number of GPUs

Strong scaling on rmat_n22_48

1.0

1.5

2.0

2.5

2 4 6Number of GPUs

T1*n

/Tn

BFSSSSPBCCCPR

Number of GPUs

Weak scaling on R-‐MAT graphs(scale 48, each GPU hosting ~180M edges)

Speedup of BFS for different graph types(error bars showing minima and maxima from graphs of the same types)

Speedup of PR for different graph types(error bars showing minima and maxima from graphs of the same types)

Speedup of different graph algorithms(error bars showing minima and maxima from different graphs) Gunrock’s multi-‐GPU framework aims at:

• Programmability: easy to develop graph primitives to support multiple GPUs-‐> hides most implementation details in the framework, and only requires little inputs (what data to exchange, how to combine data, when to stop)

• Algorithm generality: support a wide range of graph algorithms-‐> isolates from the actual algorithm implementations

• Hardware compatibility: usable on most single node GPU systems-‐> works on any number of GPUs, with or w/o peer GPU memory access

• Performance: low runtime, and leverages the underlying hardware well-‐> uses multiple CPU control threads and GPU streams to overlap computations on different portions of frontier, as well as communication

• Scalability: scalable in terms of both performance and memory usage-‐> Performs just enough GPU memory (re)allocation to keep usage small

Graph algorithm as a data-‐centric processFrontier: compact queue of nodes or edges

Generation

Operation

Advance: visit neighbor lists Filter: select and reorganize

Compute: per-‐element computation kernels in parallelcan be combined with advance or filter

Full-queue kernelsSingle GPU data flow

Multi GPU data flow

Download - Multi&GPU*Graph*Analytics Yuechao Pan, Yangzihao …sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/... · Multi&GPU*Graph*Analytics Yuechao Pan, Yangzihao Wang, Yuduo

Download - Multi&GPUGraphAnalytics Yuechao Pan, Yangzihao …sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/... · Multi&GPUGraphAnalytics Yuechao Pan, Yangzihao Wang, Yuduo