Multi-GPU Graph Analytics
Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang and John D. Owens, University of California, Davis
{ychpan, yzhwang, yudwu, ctcyang, jowens}@ucdavis.edu
Introduction - about Gunrock
Programming Model
Multi-GPU Framework
Results
Future Work
Samples
Acknowledgement
The GPU hardware and cluster access were provided by NVIDIA. This work was funded by the DARPA XDATA program under AFRL Contract FA8750-13-C-0002 and by NSF awards CCF-1017399 and OCI-1032859.
References
[1] Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D. Owens. Multi-GPU Graph Analytics. CoRR, abs/1504.04804, Apr. 2015.
[2] Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. Gunrock: A High-Performance Graph Processing Library on the GPU. CoRR, abs/1501.05387v2, Mar. 2015.
[3] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU graph traversal. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 117-128, Feb. 2012.
[4] J. Zhong and B. He. Medusa: Simplified graph processing on GPUs. IEEE Transactions on Parallel and Distributed Systems, 25(6):1543-1552, June 2014.
[5] Z. Fu, H. K. Dasari, B. Bebee, M. Berzins, and B. Thompson. Parallel breadth first search on GPU clusters. In IEEE International Conference on Big Data, pages 110-118, Oct. 2014.
Gunrock is a multi-GPU graph processing library that targets:
• High-performance analytics of large graphs
• Low programming complexity in implementing parallel graph algorithms on GPUs
Homepage: http://gunrock.github.io
The copyright of Gunrock is owned by The Regents of the University of California, 2015. All source code is released under Apache 2.0.
graph / metric   ref.                 ref. hardware    ref. performance   our hardware    our performance
rmat_n20_128     Merrill et al. [3]   4x Tesla C2050   8.3 GTEPS          4x Tesla K40    11.2 GTEPS
rmat_n20_16      Zhong et al. [4]     4x Tesla C2050   15.4 ms            4x Tesla K40    9.29 ms
peak GTEPS       Fu et al. [5]        16x Tesla K20    15 GTEPS           6x Tesla K40    22.3 GTEPS
peak GTEPS       Fu et al. [5]        64x Tesla K20    29.1 GTEPS         6x Tesla K40    22.3 GTEPS
• performance analysis and optimization
• extending Gunrock onto multiple nodes
• asynchronous graph algorithms
[Figure: Multi-GPU data flow, shown for two GPUs (GPU0 and GPU1). Setup: the partitioner turns the input graph into a partition table, and the sub-graph builder produces the sub-graphs. In each iteration, every GPU runs sub-queue kernels on its local input frontier and on the remote input frontier (unpackaged from the received data package), merges the output sub-frontiers into a merged frontier, runs full-queue kernels to produce the output frontier, and separates it into the next local input frontier and a remote output frontier. The remote output frontier is packaged into a data package and pushed to the peer GPU. The loop repeats until the "Converged?" check passes, then finishes. Legend: parameters required from user; user-provided operations.]
[Figure: Graph traversal. An example graph with vertices 0 to 6, each annotated with its V-Id and label; the source vertex forms Frontier 0, followed by Frontier 1 and Frontier 2.]
[Figure: Graph partition. The original vertices are split between GPU0 and GPU1; each GPU holds its local vertices plus remote vertices (with local replica). Legend: x = local V-id, y = remote V-id.]
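The partition figure distinguishes local vertices from remote vertices that keep a local replica. A minimal Python sketch of that idea follows. The round-robin `partition` rule, the function names, and the ID layout (local vertices first, replicas after) are assumptions made for illustration, not Gunrock's actual partitioner or data structures.

```python
from typing import Dict, List, Tuple

Graph = Dict[int, List[int]]  # adjacency list: vertex -> neighbor list


def partition(num_vertices: int, num_gpus: int) -> List[int]:
    """Partition table: owning GPU of each vertex (round-robin, an assumption)."""
    return [v % num_gpus for v in range(num_vertices)]


def build_subgraph(graph: Graph, table: List[int], gpu: int
                   ) -> Tuple[Dict[int, int], Dict[int, List[int]]]:
    """Build the sub-graph hosted on `gpu`.

    Returns (local_id, edges): `local_id` maps an original vertex ID to its
    ID on this GPU; `edges` is the adjacency list in local IDs.
    """
    local = [v for v in sorted(graph) if table[v] == gpu]
    # Local vertices receive the first IDs ...
    local_id = {v: i for i, v in enumerate(local)}
    # ... and remote neighbors receive replica IDs after them.
    for v in local:
        for nbr in graph[v]:
            if nbr not in local_id:
                local_id[nbr] = len(local_id)
    edges = {local_id[v]: [local_id[n] for n in graph[v]] for v in local}
    return local_id, edges


example = {0: [1, 2], 1: [3], 2: [3], 3: [0]}
table = partition(4, 2)                      # [0, 1, 0, 1]
ids0, edges0 = build_subgraph(example, table, 0)
```

In this sketch only local vertices carry outgoing edges; a replica exists so that an edge to a remote vertex has a valid endpoint on the local GPU.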
• 2D partitioning
• Fixed partitioning
• more algorithms
Comparison with previous work on GPU BFS
[Figure: Traversed edges per second (TEPS) for BFS. x-axis: number of GPUs (2, 4, 6); y-axis: GTEPS (5 to 20); series: rmat_n20_1023, rmat_n23_48.]
[Figure: Strong scaling on rmat_n22_48. x-axis: number of GPUs (2, 4, 6); y-axis: speedup vs. GPU x 1 (1 to 5); series: BFS, SSSP, BC, CC, PR.]
[Figure: Weak scaling on R-MAT graphs (scale 48, each GPU hosting ~180M edges). x-axis: number of GPUs (2, 4, 6); y-axis: T1*n/Tn (1.0 to 2.5); series: BFS, SSSP, BC, CC, PR.]
[Figure: Speedup of BFS for different graph types (error bars showing minima and maxima from graphs of the same types).]
[Figure: Speedup of PR for different graph types (error bars showing minima and maxima from graphs of the same types).]
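The plots above use three metrics: GTEPS, speedup vs. one GPU (strong scaling), and T1*n/Tn (weak scaling). The sketch below spells them out in Python. The function names are hypothetical and the numbers in the usage lines are made-up inputs, not measurements from the poster.

```python
def gteps(edges_traversed: float, seconds: float) -> float:
    """Billions of traversed edges per second."""
    return edges_traversed / seconds / 1e9


def strong_scaling_speedup(t1: float, tn: float) -> float:
    """Speedup vs. one GPU on a fixed graph: T1 / Tn."""
    return t1 / tn


def weak_scaling_ratio(t1: float, tn: float, n: int) -> float:
    """Weak-scaling ratio T1 * n / Tn, where the graph grows with n GPUs."""
    return t1 * n / tn


# Hypothetical inputs, for illustration only.
rate = gteps(2e9, 1.0)                       # 2e9 edges in 1 s
speedup = strong_scaling_speedup(10.0, 4.0)  # same graph, 10 s -> 4 s
ratio = weak_scaling_ratio(1.0, 2.0, 4)      # 4x the work in 2x the time
```

A weak-scaling ratio equal to n would mean the runtime stayed constant while the per-GPU workload was held fixed.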
[Figure: Speedup of different graph algorithms (error bars showing minima and maxima from different graphs).]
Gunrock’s multi-GPU framework aims for:
• Programmability: easy to develop graph primitives that support multiple GPUs
  -> hides most implementation details in the framework, and requires only a few inputs (what data to exchange, how to combine data, when to stop)
• Algorithm generality: supports a wide range of graph algorithms
  -> the framework is isolated from the actual algorithm implementations
• Hardware compatibility: usable on most single-node GPU systems
  -> works on any number of GPUs, with or without peer GPU memory access
• Performance: low runtime, and good use of the underlying hardware
  -> uses multiple CPU control threads and GPU streams to overlap computation on different portions of the frontier, as well as communication
• Scalability: scalable in terms of both performance and memory usage
  -> performs just enough GPU memory (re)allocation to keep usage small
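The three user inputs named under Programmability (what data to exchange, how to combine data, when to stop) can be illustrated with a sequential Python simulation of multi-GPU BFS. This is a sketch under stated assumptions: `multi_gpu_bfs` and its arguments are hypothetical names, the GPUs are simulated one after another in a single thread, and vertices keep their original IDs. It does not reproduce Gunrock's API or its use of CPU threads and GPU streams.

```python
from typing import Dict, List, Tuple

Graph = Dict[int, List[int]]  # adjacency list: vertex -> neighbor list


def multi_gpu_bfs(graph: Graph, owner: Dict[int, int],
                  num_gpus: int, source: int) -> Dict[int, int]:
    """Simulate BFS over `num_gpus` partitions; returns vertex -> depth."""
    labels: List[Dict[int, int]] = [{} for _ in range(num_gpus)]
    frontiers: List[List[int]] = [[] for _ in range(num_gpus)]
    labels[owner[source]][source] = 0
    frontiers[owner[source]].append(source)
    depth = 0
    # When to stop: every GPU has an empty input frontier.
    while any(frontiers):
        depth += 1
        packages: List[List[Tuple[int, int]]] = [[] for _ in range(num_gpus)]
        next_frontiers: List[List[int]] = [[] for _ in range(num_gpus)]
        for gpu in range(num_gpus):
            for v in frontiers[gpu]:
                for nbr in graph[v]:
                    if owner[nbr] == gpu:
                        # Local vertex: label it and keep it on this GPU.
                        if nbr not in labels[gpu]:
                            labels[gpu][nbr] = depth
                            next_frontiers[gpu].append(nbr)
                    else:
                        # What to exchange: (vertex, depth) pushed to the owner.
                        packages[owner[nbr]].append((nbr, depth))
        for gpu in range(num_gpus):
            for v, d in packages[gpu]:
                # How to combine: keep the smaller depth for each vertex.
                if v not in labels[gpu] or d < labels[gpu][v]:
                    labels[gpu][v] = d
                    next_frontiers[gpu].append(v)
        frontiers = next_frontiers
    merged: Dict[int, int] = {}
    for part in labels:
        merged.update(part)
    return merged


example = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
owner = {v: v % 2 for v in example}  # round-robin ownership, an assumption
depths = multi_gpu_bfs(example, owner, 2, 0)
```

Everything outside the three commented decisions is generic bookkeeping, which is the part a framework can hide from the primitive's author.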
Graph algorithm as a data-centric process
Frontier: compact queue of nodes or edges
Generation
Operation
• Advance: visit neighbor lists
• Filter: select and reorganize
• Compute: per-element computation kernels in parallel; can be combined with advance or filter
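As an illustration of the frontier-based model, the following Python sketch writes BFS as repeated advance and filter steps, with the per-element compute (assigning a depth label) fused into the filter. The function names are hypothetical and the code runs sequentially; in Gunrock these steps are parallel GPU kernels.

```python
from typing import Dict, List

Graph = Dict[int, List[int]]  # adjacency list: vertex -> neighbor list


def advance(graph: Graph, frontier: List[int]) -> List[int]:
    """Visit the neighbor lists of every vertex in the frontier."""
    return [nbr for v in frontier for nbr in graph[v]]


def filter_frontier(candidates: List[int], labels: Dict[int, int],
                    depth: int) -> List[int]:
    """Select unvisited vertices and drop duplicates.

    The compute step (setting the depth label) is combined with the filter.
    """
    out = []
    for v in candidates:
        if v not in labels:
            labels[v] = depth  # per-element compute
            out.append(v)
    return out


def bfs(graph: Graph, source: int) -> Dict[int, int]:
    """Return vertex -> BFS depth for all vertices reachable from `source`."""
    labels = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:  # an empty frontier ends the traversal
        depth += 1
        frontier = filter_frontier(advance(graph, frontier), labels, depth)
    return labels


example = {0: [1, 2], 1: [3], 2: [3], 3: []}
result = bfs(example, 0)
```

Each loop iteration consumes one frontier and generates the next, so the frontier is the only data structure that changes shape between steps.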
Single-GPU data flow
Multi-GPU data flow