Parallel BFS using 2 stacks
Uploaded by saptaparni-kumar
Parallel BFS on Distributed Memory Systems
Aydin Buluc and Kamesh Madduri
Sapta
DC reading group
September 29, 2016
Outline
Introduction
Shared Memory BFS
Model
Contributions
Serial BFS overview
Another paper: Parallel BFS using 2 queues
This paper: Hybrid Parallel BFS using 2 stacks
Experimental Results
Conclusion
Introduction
BFS is important.
- BFS usually forms a sub-part of more complex graph algorithms.
- Now that we have BIG graphs, parallelizing it is very important.
- Shared Memory BFS involves: (1) communication between processors and (2) distribution of the graph (vertices) among processors.
Model
- Graph G(V, E) with |V| = n and |E| = m; m is O(n), i.e. sparse graphs.
- All edge weights = 1.
Contributions
- Traditional representation: 1-dimensional BFS (1D adjacency arrays).
- Sparse matrix representation: 2D partitioning of the graph (not discussed).
Serial BFS overview
- Sequential BFS uses a queue data structure.
- BFS requirement:
  - all vertices at distance k from the source should be “visited” before vertices at distance k + 1.
- Explanation?
- Level Synchronous BFS is a key concept in correct shared memory BFS.
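The BFS requirement above can be checked on a small example. The following is a minimal sketch of queue-based serial BFS, not code from the paper; the function name `serial_bfs` and the 6-vertex graph are assumptions made for illustration.

```python
from collections import deque

def serial_bfs(adj, source):
    """Return the BFS level (hop distance) of every vertex reachable from source."""
    level = {source: 0}
    queue = deque([source])
    while queue:
        # FIFO order: every distance-k vertex leaves the queue
        # before any distance-(k+1) vertex does.
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:          # first discovery fixes the level
                level[v] = level[u] + 1
                queue.append(v)
    return level

# Hypothetical undirected graph, used only for illustration.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
levels = serial_bfs(adj, 0)
print(levels)
```

Vertex 5 has no edges, so it never receives a level.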
Modified BFS : Use 2 stacks
Can be parallelized as is: perform lines 6-7 in parallel, and execute lines 8-10 atomically.
Related Work: Level Synchronous Parallel BFS using 2 queues by Agarwal et al., SC'10 [1]
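The pseudocode that the line numbers above refer to did not survive in this transcript. As a rough stand-in, here is a sequential sketch of the two-stack idea: a current stack FS holds one level, and newly discovered vertices are pushed onto a next stack NS. The names `fs`, `ns` and `two_stack_bfs` are assumptions, and the sketch does not reproduce the original line numbering.

```python
def two_stack_bfs(adj, source):
    """Level-synchronous BFS using a current stack (fs) and a next stack (ns)."""
    level = {source: 0}
    fs = [source]   # vertices of the current level
    ns = []         # vertices discovered for the next level
    depth = 1
    while fs:
        while fs:
            # Pop order within one level does not matter, so a stack is enough.
            u = fs.pop()
            for v in adj[u]:
                # In a parallel version this check-and-set, together with
                # the push, would have to be done atomically.
                if v not in level:
                    level[v] = depth
                    ns.append(v)
        fs, ns = ns, []   # the next level becomes the current level
        depth += 1
    return level

# Hypothetical undirected graph, used only for illustration.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
levels = two_stack_bfs(adj, 0)
print(levels)
```

Because the stacks are swapped only after the current level is exhausted, no vertex at distance k + 1 is expanded before all vertices at distance k.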
Hybrid 1D Parallel BFS Algorithm
One of the main areas for optimization in this basic parallel algorithm is load balancing: ensuring that the work of the parallel edge-visit steps is spread evenly across processors.
- 1D partitioning: if there are p processors in the system, give ownership of n/p vertices to each processor.
- Random shuffling of the vertex identifiers prior to partitioning, so that all processors get roughly the same number of vertices (n/p) and edges (m/p).
- Use of local stacks NSi for pushes, followed by a global union (overhead < 3% of execution time).
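The shuffling and ownership steps above can be sketched as follows. This is an illustrative assumption about one way to do it, not the paper's implementation: the function name, the fixed seed and the block-ownership rule are all hypothetical, and the local stacks NSi are not modelled here.

```python
import random

def shuffle_and_partition(n, p, seed=0):
    """Randomly relabel vertices 0..n-1, then assign contiguous blocks
    of the new identifiers to p processors."""
    rng = random.Random(seed)
    new_id = list(range(n))
    rng.shuffle(new_id)          # new_id[v] is the shuffled id of original vertex v
    block = -(-n // p)           # ceil(n / p) vertices per processor
    # owner[v] is the processor that owns original vertex v.
    owner = [new_id[v] // block for v in range(n)]
    return new_id, owner

new_id, owner = shuffle_and_partition(n=12, p=4)
counts = [owner.count(rank) for rank in range(4)]
print(counts)
```

With n divisible by p every processor owns exactly n/p vertices. The shuffle is what spreads high-degree vertices around, so edge counts per processor also tend toward m/p.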
1D BFS
1D BFS contd..
1D BFS errors
- The value of level is not incremented.
- The Next Stack NSi data structure should be emptied before traversing the next level.
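Both corrections above show up in a small single-process simulation of 1D BFS. This is a sketch under stated assumptions, not the paper's algorithm: the p "ranks" are ordinary loops, ownership is the hypothetical rule `v % p`, and the exchange loop stands in for an all-to-all communication step.

```python
def bfs_1d_simulated(adj, source, p):
    """Simulate level-synchronous 1D BFS over p ranks in a single process."""
    n = len(adj)

    def owner(v):
        return v % p    # hypothetical ownership rule, for illustration only

    level = [-1] * n
    level[source] = 0
    fs = [[] for _ in range(p)]          # fs[i]: current frontier owned by rank i
    fs[owner(source)].append(source)
    depth = 1
    while any(fs):
        # Fix 2: the local next stacks start empty at every level.
        # ns[i][j] holds vertices discovered by rank i and owned by rank j.
        ns = [[[] for _ in range(p)] for _ in range(p)]
        for i in range(p):
            for u in fs[i]:
                for v in adj[u]:
                    ns[i][owner(v)].append(v)
        # Exchange step: each rank j collects the vertices it owns.
        for j in range(p):
            fs[j] = []
            for i in range(p):
                for v in ns[i][j]:
                    if level[v] == -1:   # only the owner marks a vertex visited
                        level[v] = depth
                        fs[j].append(v)
        depth += 1                       # Fix 1: advance the level counter
    return level

# Hypothetical undirected graph as adjacency lists, for illustration only.
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3], []]
levels = bfs_1d_simulated(adj, source=0, p=2)
print(levels)
```

Without the `depth += 1` line every discovered vertex would get level 1. Without re-creating `ns`, vertices from earlier levels would be exchanged again on every iteration.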
Experiments
- 1D Flat MPI: one process per core.
- 1D Hybrid: one or more MPI processes within a node.
- Synthetic graphs based on the R-MAT random graph model (default m : n ratio of 16), and a web crawl of the UK domain (133 million vertices and 5.5 billion edges).
- Systems: Hopper (6392-node Cray XE6) and Franklin (9660-node Cray XT4).
Experimental Results
- Strong scaling on Franklin
- Higher is better
- GTEPS: Giga Traversed Edges per Second
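As a quick illustration of the GTEPS unit, the rate is the number of traversed edges divided by the traversal time, scaled by 10^9. The function name and the numbers below are made up for the example and are not results from the paper.

```python
def gteps(edges_traversed, seconds):
    """Giga traversed edges per second: edges / time / 1e9."""
    return edges_traversed / seconds / 1e9

# Hypothetical run: 4 billion edges traversed in 2 seconds.
rate = gteps(edges_traversed=4_000_000_000, seconds=2.0)
print(rate)
```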
Experimental Results
- Lower is better
- Strong scaling on Franklin
Experimental Results
- Weak scaling on Franklin
- Lower is better
Experiments
- Flat 1D algorithms are about 1.5 to 1.8 times faster than the 2D algorithms.
- The 1D hybrid algorithm, although slower than the flat 1D algorithm at smaller concurrencies, starts to perform significantly faster at larger concurrencies.
Conclusion
- Conjecture: Level synchronous BFS can be implemented without any error with relaxed queues.
- Question: Can the error be bounded if we don’t have a level synchronous algorithm?
[1] V. Agarwal, F. Petrini, D. Pasetto, and D.A. Bader. Scalable graph exploration on multicore processors. In Proc. ACM/IEEE Conference on Supercomputing (SC10), November 2010.
[2] A. Buluc and K. Madduri. Parallel breadth-first search on distributed memory systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 65:1–65:12, New York, NY, USA, 2011. ACM.
[3] C.E. Leiserson and T.B. Schardl. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). In Proc. 22nd ACM Symp. on Parallelism in Algorithms and Architectures (SPAA ’10), pages 303–314, June 2010.
Thank You :)