Alex Pothen Purdue University CSCAPES Institute Assefaw Gebremedhin, Mahantesh Halappanavar (PNNL),...

Alex Pothen, Purdue University, CSCAPES Institute, www.cs.purdue.edu/homes/apothen/

Assefaw Gebremedhin, Mahantesh Halappanavar (PNNL), John Feo (PNNL), Umit Catalyurek (Ohio State)

CSC’11 Workshop May 2011

Multithreaded Algorithms for Graph Coloring

References

Multithreaded algorithms for graph coloring. Catalyurek, Feo, Gebremedhin, Halappanavar, and Pothen, 40 pp., submitted to Parallel Computing.

New multithreaded ordering and coloring algorithms for multicore architectures. Patwary, Gebremedhin, and Pothen, 12 pp., EuroPar 2011.

Graph coloring for derivative computation and beyond: Algorithms, software and analysis. Gebremedhin, Nguyen, Pothen, and Patwary, 32 pp., submitted to TOMS.

Distributed Memory Parallel Algorithms for Matching and Coloring. Catalyurek, Dobrian, Gebremedhin, Halappanavar, and Pothen, 10 pp., IPDPS Workshop PCO, 2011.

[Figure: diagram with the labels Architecture, Algorithm, Concurrency, Latency Tolerance, Performance]

Outline

The many-core and multi-threaded world
◦ Intel Nehalem
◦ Sun Niagara
◦ Cray XMT

A case study on multithreaded graph coloring
◦ An Iterative and Speculative Coloring Algorithm
◦ A Dataflow algorithm

RMAT graphs: ER, G, and B

Experimental results

Conclusions

Architectural Features

O(|E|)-time implementations possible for all four

• B = max back degree over entire seq.
• B+1 colors suffice to color G.
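The B+1 bound can be checked on a small example. The sketch below assumes that "back degree" means the number of neighbors of a vertex that appear earlier in the ordering. When greedy coloring reaches a vertex, only its back neighbors are already colored, so at most B colors are forbidden. The function names and the example graph are illustrative, not taken from the talk.

```python
def back_degrees(adj, order):
    """Count, for each vertex, the neighbors that precede it in `order`."""
    position = {v: i for i, v in enumerate(order)}
    return {v: sum(1 for w in adj[v] if position[w] < position[v])
            for v in order}


def greedy_in_order(adj, order):
    """Give each vertex the smallest color unused by its colored neighbors."""
    color = {}
    for v in order:
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


# Hypothetical example: a 5-cycle 0-1-2-3-4-0 with the extra chord 0-2.
adj = {0: [1, 4, 2], 1: [0, 2], 2: [1, 3, 0], 3: [2, 4], 4: [3, 0]}
order = [0, 1, 2, 3, 4]

B = max(back_degrees(adj, order).values())
coloring = greedy_in_order(adj, order)
num_colors = max(coloring.values()) + 1
print(B, num_colors)
```

An ordering that makes B small therefore gives a better guarantee on the number of colors.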

Proc.          | Threads/Core | Cores/Socket | Threads | Cache     | Clock   | Multithreading, Other Detail
Intel Nehalem  | 2            | 4            | 16      | Shared L3 | 2.5 GHz | Simultaneous, cache coherence protocol
Sun Niagara 2  | 8            | 2            | 128     | Shared L2 | 1.2 GHz | Simultaneous
Cray XMT       | 128          | 128 procs.   | 16,384  | None      | 500 MHz | Interleaved, fine-grained synchronization

Multithreaded: Iterative Greedy Algorithm

Forbidden Colors

Multi-threaded: Data Flow Algorithm

Forbidden Colors

RMAT Graphs

R-MAT: Recursive MATrix method

Experiments
◦ RMAT-ER (0.25, 0.25, 0.25, 0.25)
◦ RMAT-G (0.45, 0.15, 0.15, 0.25)
◦ RMAT-B (0.55, 0.15, 0.15, 0.15)

Chakrabarti, D. and Faloutsos, C. 2006. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38, 1.
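The recursive construction can be sketched as follows. This is a minimal illustration and not the generator used in the experiments. It assumes the usual convention that the four probabilities (a, b, c, d) select the top-left, top-right, bottom-left and bottom-right quadrant of the adjacency matrix at each level. The function name `rmat_edges` and its parameters are hypothetical. The sketch keeps duplicate edges and self-loops, which a production generator would typically filter out.

```python
import random


def rmat_edges(scale, num_edges, probs, seed=0):
    """Generate edges of a graph with 2**scale vertices by recursive quadrant choice."""
    a, b, c, _d = probs  # d is implied: the remaining probability mass
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        row = col = 0
        for _ in range(scale):
            # Descend one level: shift in one more bit of the row and column index.
            row <<= 1
            col <<= 1
            r = rng.random()
            if r < a:
                pass                 # top-left quadrant
            elif r < a + b:
                col |= 1             # top-right quadrant
            elif r < a + b + c:
                row |= 1             # bottom-left quadrant
            else:
                row |= 1             # bottom-right quadrant
                col |= 1
        edges.append((row, col))
    return edges


# The three parameter sets listed above, on a tiny 2**4-vertex instance.
er_edges = rmat_edges(4, 100, (0.25, 0.25, 0.25, 0.25))
g_edges = rmat_edges(4, 100, (0.45, 0.15, 0.15, 0.25))
b_edges = rmat_edges(4, 100, (0.55, 0.15, 0.15, 0.15))
```

Equal probabilities spread edges uniformly, as in an Erdős-Rényi graph, while skewed probabilities concentrate edges on a few high-degree vertices.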

RMAT Graphs

[Figure: adjacency matrix quadrants labeled a, b, c, d]

Nehalem: Strong Scaling (Niagara)

[Figure panels: RMAT-ER, RMAT-G, RMAT-B]

Cray XMT: Strong and Weak Scaling

[Figure panels: Iter-G, Iter-B, DF-G, DF-B]

Comparing Three Platforms

[Figure panels: a) ER, c) Good, e) Bad]

No. Colors in Parallel Algorithms

Computing SL Orderings in Parallel: RMAT-G graphs (Nehalem)

[Figure panels: SL Ordering, Relaxed SL Ordering]

Our contributions: Multithreaded Coloring

Massive multithreading
◦ Can tolerate memory latency for graphs/sparse matrices
◦ Dataflow algorithms are easier to implement than distributed-memory versions
◦ Thread concurrency ameliorates the lack of caches and lower clock speeds
◦ Thread parallelism can be exploited at fine grain if supported by lightweight synchronization
◦ Graph structure critically influences performance

Many-core machines
◦ Developed an iterative algorithm for greedy coloring (distance-1 and -2) and ordering algorithms that port to different machines
◦ Simultaneous multithreading can hide latency (X threads on 1 core vs. 1 thread on X cores)
◦ Decompose into tasks at a finer grain than the distributed-memory version, and relax synchronization to enhance concurrency
◦ Will form nodes of Peta- and Exa-scale machines, so single node performance studies are needed

Multi-threaded Parallelism

• Memory access times determine performance
• By issuing multiple threads, mask memory latency if a ready thread is available when a functional unit becomes free
• Interleaved vs. Simultaneous multithreading (IMT or SMT)

Figure from Robert Golla, Sun

Multi-core: Sun Niagara 2

• Two 8-core sockets
• 8 hw threads per core
• 1.2 GHz processors linked by 8 x 9 crossbar to L2 cache banks
• Simultaneous multithreading
• Two threads from a core can be issued in a cycle
• Shallow pipeline

Multicore: Intel Nehalem

• Two quad-core sockets, 2.5 GHz
• Two hyperthreads per core support SMT
• Off-chip data latency 106 cycles
• Advanced architectural features: cache coherence protocol to reduce traffic, loop-stream detection, improved branch prediction, out-of-order execution

Massive Multithreading: Cray XMT

Latency tolerance via massive multithreading
◦ Context switch between threads in a single clock cycle
◦ Global address space, hashed to memory banks to reduce hot-spots
◦ No cache or local memory, average latency 600 cycles

Memory request doesn’t stall processor
◦ Other threads work while the request is fulfilled

Light-weight, word-level synchronization (full/empty bits)

Notes:
◦ 500 MHz clock
◦ 128 hardware thread streams/proc.
◦ Interleaved multithreading
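A back-of-the-envelope model shows why many threads are needed to hide a latency of several hundred cycles. This is a deliberately simplified sketch and not a model of the Cray XMT. It assumes every thread alternates between `work_cycles` of computation and one memory request that takes `latency_cycles`, and that switching threads is free. Both function names are hypothetical.

```python
def utilization(threads, work_cycles, latency_cycles):
    """Fraction of cycles the processor is busy under the toy model."""
    busy_share_per_thread = work_cycles / (work_cycles + latency_cycles)
    return min(1.0, threads * busy_share_per_thread)


def threads_to_saturate(work_cycles, latency_cycles):
    """Smallest thread count at which the toy model reports full utilization."""
    total = work_cycles + latency_cycles
    return -(-total // work_cycles)  # ceiling division on integers


# With 600-cycle latency, a lone thread that computes 1 cycle per request
# keeps the processor almost entirely idle.
single = utilization(1, 1, 600)
# More computation between requests lowers the number of threads needed.
needed = threads_to_saturate(10, 600)
```

Under these assumptions the thread count required to saturate the processor grows with the ratio of latency to computation per memory request.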

Multithreaded Algorithms for Graph Coloring

◦ We developed two kinds of multithreaded algorithms for graph coloring:
  ▪ An iterative, coarse-grained method for generic shared-memory architectures
  ▪ A dataflow algorithm designed for massively multithreaded architectures with hardware support for fine-grain synchronization, such as the Cray XMT

◦ Benchmarked the algorithms on three systems: Cray XMT, Sun Niagara 2 and Intel Nehalem

◦ Excellent speedup observed on all three platforms
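The dataflow idea can be imitated in ordinary Python threads. The sketch below is an assumption about how such an algorithm may be organized and is not the authors' XMT code. Each vertex waits until all lower-numbered neighbors are colored, then takes the smallest free color. A `threading.Event` per vertex stands in for the hardware full/empty bit. Tasks are submitted in increasing vertex order, so the lowest-numbered unfinished vertex can always proceed and the waits cannot deadlock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def dataflow_coloring(adj, num_workers=4):
    """Color vertices 0..n-1; adj[v] lists the neighbors of v."""
    n = len(adj)
    color = [None] * n
    # One event per vertex: "set" plays the role of a full bit on color[v].
    colored = [threading.Event() for _ in range(n)]

    def color_vertex(v):
        forbidden = set()
        for w in adj[v]:
            if w < v:
                colored[w].wait()        # block until neighbor w has its color
                forbidden.add(color[w])
        # Higher-numbered neighbors are skipped: they will wait on v instead.
        c = 0
        while c in forbidden:
            c += 1
        color[v] = c
        colored[v].set()

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(color_vertex, v) for v in range(n)]
        for f in futures:
            f.result()                   # re-raise any worker exception
    return color
```

Because every vertex sees final colors on all lower-numbered neighbors, the result has no conflicts and matches serial greedy coloring in vertex-number order.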

Coloring Algorithms

Greedy coloring algorithms
◦ Distance-k, star, and acyclic coloring are NP-hard
◦ Approximating coloring to within O(n^{1-ε}) is NP-hard for any ε > 0

GREEDY(G=(V,E))
  Order the vertices in V
  for i = 1 to |V| do
    Determine colors forbidden to v_i
    Assign v_i the smallest permissible color
  end-for
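The GREEDY procedure can be written as runnable Python. This is a minimal sketch under a few assumptions: vertices are numbered 0..n-1, `adj[v]` lists the neighbors of v, and colors are numbered from 0. The forbidden-color array is stamped with the current vertex instead of being cleared each time, a common implementation trick. That trick is a choice made here and is not something stated on the slide.

```python
def greedy_coloring(adj, order=None):
    """Distance-1 greedy coloring in the given vertex order."""
    n = len(adj)
    if order is None:
        order = range(n)
    max_degree = max((len(nbrs) for nbrs in adj), default=0)
    color = [-1] * n
    # forbidden[c] == v means color c is used by some neighbor of v.
    # Greedy never needs more than max_degree + 1 colors.
    forbidden = [-1] * (max_degree + 1)
    for v in order:
        # Determine colors forbidden to v.
        for w in adj[v]:
            if color[w] >= 0:
                forbidden[color[w]] = v
        # Assign v the smallest permissible color.
        c = 0
        while forbidden[c] == v:
            c += 1
        color[v] = c
    return color


# Hypothetical example: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
natural = greedy_coloring(adj)
reverse = greedy_coloring(adj, order=[3, 2, 1, 0])
```

Each vertex touches its adjacency list once, so the coloring loop does work proportional to the number of edges.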

A greedy heuristic usually gives a near-optimal solution

The key is to find good orderings for coloring, and many have been developed

Ref: Gebremedhin, Tarafdar, Manne, Pothen, SIAM J. Sci. Comput. 29:1042--1072, 2007.

Distance-1 Coloring, Greedy Alg.

Many-core greedy coloring

Given a graph, parallelize greedy coloring on many-core machines such that
◦ Speedup is attained, and
◦ Number of colors is roughly the same as in serial

Difficult task since greedy is inherently sequential, computation is small relative to communication, and data accesses are irregular

D1 coloring: Approaches based on Luby’s parallel algorithm for maximal independent set had limited success

Gebremedhin and Manne (2000) developed a parallel greedy coloring algorithm on shared memory machines

◦ Uses speculative coloring to enhance concurrency, randomized partitioning to reduce conflicts, and serial conflict resolution

◦ Number of conflicts bounded, so this approach yields an effective algorithm

◦ Extended to distance-2 coloring by Gebremedhin, Manne, and Pothen (2002)

We adapt this approach to implement the greedy algorithm for many-core computing
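The speculate-then-repair loop can be simulated serially. The sketch below is an illustration under stated assumptions and not the authors' implementation. The uncolored vertices are split among `num_threads` simulated threads. Within a round, each simulated thread sees its own new colors but only the round-start colors of the others, which mimics stale reads. After each round, the higher-numbered endpoint of every conflicting edge is queued for recoloring. The function name and the round-robin partitioning are hypothetical choices.

```python
def iterative_greedy_coloring(adj, num_threads=4):
    """Return (color, rounds) for a speculative coloring with conflict repair."""
    n = len(adj)
    color = [-1] * n
    work = list(range(n))
    rounds = 0
    while work:
        rounds += 1
        snapshot = list(color)                  # colors visible across threads
        chunks = [work[t::num_threads] for t in range(num_threads)]
        # Phase 1: speculative coloring, one chunk per simulated thread.
        for chunk in chunks:
            local = {}                          # colors written by this thread
            for v in chunk:
                forbidden = set()
                for w in adj[v]:
                    c = local.get(w, snapshot[w])
                    if c >= 0:
                        forbidden.add(c)
                c = 0
                while c in forbidden:
                    c += 1
                local[v] = c
                color[v] = c
        # Phase 2: conflict detection; the higher-numbered endpoint loses.
        conflicts = [v for v in work
                     if any(w < v and color[w] == color[v] for w in adj[v])]
        for v in conflicts:
            color[v] = -1
        work = conflicts
    return color, rounds


# Worst case for speculation: a complete graph with one vertex per thread.
k5 = [[w for w in range(5) if w != v] for v in range(5)]
colors, rounds = iterative_greedy_coloring(k5, num_threads=5)
```

The lowest-numbered vertex in each round always keeps its color, so the work list shrinks every round and the loop terminates. On sparse graphs far fewer conflicts arise than in this complete-graph example.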

Parallel Coloring

Parallel Coloring: Speculation

Experimental results

Iterative Dataflow

Cray XMT: RMAT-G with 2^24, …, 2^27 vertices and 134M, …, 1B edges

Experimental results: Niagara 2 (Iterative)

Performance gain from doubling threads on a core = gain from doubling cores!

Experimental results: RMAT-G with 2^24 = 16M vertices and 134M edges

All Platforms

RMAT-B, 2^24 vertices, 134M edges

Iterative Greedy Coloring: Multithreaded Algorithm

Adj(v), color(w), forbidden(v): d(v) reads each

forbidden(v): d(v) writes

Adj(v), color(w): d(v) reads each

Experimental results: RMAT-G with 2^24 = 16M vertices and 134M edges

All Platforms

Tentative Conclusions, Future Work

Future Plans: Multithreaded Coloring

Massive multithreading
◦ Microbenchmarking to understand where the cycles go: thread management, data accesses, synchronization, instruction scheduling, function unit limitations…
◦ Develop a performance model of the computation
◦ Experiment with other graph classes
◦ Consider new algorithmic paradigms

Many-core machines
◦ Four items as above
◦ Ordering for coloring: Archetype of a problem for computing a sequential ordering in a parallel environment (Mostofa Patwary and Assefaw Gebremedhin)
◦ Extend to nodes of Peta-scale machines, so single node performance is enhanced, and complete our work on the Blue Gene and the Cray XT5

Thanks

Rob Bisseling, Erik Boman, Ümit Çatalyürek, Karen Devine, Florin Dobrian, John Feo, Assefaw Gebremedhin, Mahantesh Halappanavar, Bruce Hendrickson, Paul Hovland, Gary Kumfert, Fredrik Manne, Ali Pınar, Sivan Toledo, Jean Utke
