Advances in PUMI for High Core Count Machines

Dan Ibanez, Micah Corah, Seegyoung Seol, Mark Shephard (2/27/2013)
Scientific Computation Research Center, Rensselaer Polytechnic Institute

Transcript:

Page 1: Advances in PUMI for High Core Count Machines

Dan Ibanez, Micah Corah, Seegyoung Seol, Mark Shephard (2/27/2013)
Scientific Computation Research Center, Rensselaer Polytechnic Institute

Page 2: Outline

1. Distributed Mesh Data Structure
2. Phased Message Passing
3. Hybrid (MPI/thread) Programming Model
4. Hybrid Phased Message Passing
5. Hybrid Partitioning
6. Hybrid Mesh Migration

Page 3: Unstructured Mesh Data Structure

[Diagram: the mesh data structure. A Mesh contains Parts; each Part holds Regions, Faces, Edges, and Vertices, connected by pointers in the data structure.]

Page 4: Distributed Mesh Representation

- Mesh elements are assigned to parts
- Entities are uniquely identified by a handle or global ID
- Each part is treated as a serial mesh with the addition of part boundaries

- Part boundary: groups of mesh entities shared on links between parts
- Remote copy: duplicated entity copy on a non-local part
- Resident part set: list of parts where the entity exists

A process can hold multiple parts.
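To make these terms concrete, the following sketch shows how a mesh entity might carry its resident part set and remote copies; the type and field names are hypothetical, not PUMI's actual data structure.

    #include <map>
    #include <set>

    typedef int PartId;          // hypothetical part identifier
    typedef void* EntityHandle;  // hypothetical handle to a remote copy

    struct Entity {
      long global_id;                         // unique across the whole mesh
      std::set<PartId> resident_parts;        // parts where this entity exists
      std::map<PartId, EntityHandle> remotes; // remote copy per non-local part
      // an entity lies on a part boundary when it exists on several parts
      bool on_part_boundary() const { return resident_parts.size() > 1; }
    };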


Page 5: Message Passing

Primitive functional set:
- Size: number of members in the group
- Rank: ID of self in the group
- Send: non-blocking synchronous send
- Probe: non-blocking probe
- Receive: blocking receive
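As a rough illustration, these primitives map onto MPI calls as in the sketch below; the wrapper names are hypothetical, not PUMI's actual wrapper code.

    #include <mpi.h>

    // Size: members in the group
    int size(MPI_Comm c) { int s; MPI_Comm_size(c, &s); return s; }

    // Rank: ID of self in the group
    int rank(MPI_Comm c) { int r; MPI_Comm_rank(c, &r); return r; }

    // Send: non-blocking *synchronous* send; completion implies the
    // receiver has started receiving, which phased messaging relies on
    void send(const void* buf, int n, int dest, MPI_Comm c, MPI_Request* req) {
      MPI_Issend(buf, n, MPI_BYTE, dest, 0, c, req);
    }

    // Probe: non-blocking probe for a message from any source
    bool probe(MPI_Comm c, MPI_Status* status) {
      int flag;
      MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, c, &flag, status);
      return flag != 0;
    }

    // Receive: blocking receive of a previously probed message
    void receive(void* buf, int n, MPI_Status const& st, MPI_Comm c) {
      MPI_Recv(buf, n, MPI_BYTE, st.MPI_SOURCE, st.MPI_TAG, c,
               MPI_STATUS_IGNORE);
    }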

Non-blocking barrier (ibarrier) API:
- Call 1: begin the ibarrier
- Call 2: wait for ibarrier termination

Used for phased message passing. An ibarrier will be available in MPI-3; for now a custom solution is used.
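With MPI-3, the two-call API maps directly onto MPI_Ibarrier plus a test loop; a minimal sketch, assuming an MPI-3 implementation:

    #include <mpi.h>

    // API call 1: begin the ibarrier
    MPI_Request ibarrier_begin(MPI_Comm comm) {
      MPI_Request req;
      MPI_Ibarrier(comm, &req);
      return req;
    }

    // API call 2 loops over this until it returns true,
    // receiving messages or computing between polls
    bool ibarrier_done(MPI_Request* req) {
      int done;
      MPI_Test(req, &done, MPI_STATUS_IGNORE);
      return done != 0;
    }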


Page 6: ibarrier Implementation

Built using only non-blocking point-to-point calls:

- For N ranks, messages travel lg(N) tree levels to and from rank 0
- Uses a separate MPI communicator

[Diagram: ranks 0 through 4 form a tree; a reduce converges on rank 0, then a broadcast fans back out.]
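A minimal sketch of such a custom non-blocking barrier: a binary tree rooted at rank 0 (reduce up, broadcast down) over a duplicated communicator, advanced by repeated test() calls. The class is illustrative rather than PUMI's actual code, and the one-byte blocking sends are a simplification of a fully non-blocking version (one-byte sends complete eagerly on essentially all MPI implementations).

    #include <mpi.h>
    #include <initializer_list>
    #include <vector>

    class IBarrier {
      MPI_Comm comm_;
      int rank_, size_;
      std::vector<MPI_Request> reqs_;
      enum State { IDLE, REDUCING, RELEASING, DONE } state_;
      char token_[2]; // message contents carry no information

      void recv_from_children() {
        for (int child : {2 * rank_ + 1, 2 * rank_ + 2})
          if (child < size_) {
            reqs_.push_back(MPI_REQUEST_NULL);
            MPI_Irecv(&token_[reqs_.size() - 1], 1, MPI_BYTE,
                      child, 0, comm_, &reqs_.back());
          }
      }

    public:
      explicit IBarrier(MPI_Comm app_comm) : state_(IDLE) {
        MPI_Comm_dup(app_comm, &comm_); // separate communicator for barrier traffic
        MPI_Comm_rank(comm_, &rank_);
        MPI_Comm_size(comm_, &size_);
      }
      ~IBarrier() { MPI_Comm_free(&comm_); }

      // API call 1: begin the ibarrier
      void begin() {
        state_ = REDUCING;
        recv_from_children();
      }

      // API call 2 loops over this; true means the barrier terminated
      bool test() {
        if (state_ == DONE) return true;
        int done;
        MPI_Testall((int)reqs_.size(), reqs_.data(), &done,
                    MPI_STATUSES_IGNORE);
        if (!done) return false;
        reqs_.clear();
        if (state_ == REDUCING) {
          state_ = RELEASING;
          if (rank_ > 0) {
            int parent = (rank_ - 1) / 2;
            // children arrived: notify the parent, then await its release
            MPI_Send(&token_[0], 1, MPI_BYTE, parent, 0, comm_);
            reqs_.push_back(MPI_REQUEST_NULL);
            MPI_Irecv(&token_[0], 1, MPI_BYTE, parent, 0, comm_,
                      &reqs_.back());
          }
          return test(); // rank 0 proceeds straight to the release
        }
        // RELEASING completed: pass the release down to our children
        for (int child : {2 * rank_ + 1, 2 * rank_ + 2})
          if (child < size_)
            MPI_Send(&token_[0], 1, MPI_BYTE, child, 0, comm_);
        state_ = DONE;
        return true;
      }
    };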


Page 7: Phased Message Passing

Similar to Bulk Synchronous Parallel; built on the non-blocking barrier.

1. Begin phase
2. Send all messages
3. Receive any messages sent this phase
4. End phase

Benefits:
- Efficient termination detection when neighbors are unknown
- Phases are implicit barriers, which simplifies algorithms
- Allows buffering all messages per rank per phase


Page 8: Phased Message Passing

Implementation:

1. Post all sends for this phase
2. While local sends are incomplete: receive any message
   (local sends are now complete; remember they are synchronous)
3. Begin the “stopped sending” ibarrier
4. While the ibarrier is incomplete: receive any message
   (all sends are now complete, so it is safe to stop receiving)
5. Begin the “stopped receiving” ibarrier
6. While the ibarrier is incomplete: compute
   (all ranks have now stopped receiving, so it is safe to send the next phase)
7. Repeat

[Diagram: phases alternate send and receive intervals; the ibarriers act as signal edges between them.]
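Putting the steps together, here is a minimal sketch of one phase, assuming MPI-3's MPI_Ibarrier is available; run_phase, Outgoing, and the handler callback are hypothetical names rather than PUMI's API.

    #include <mpi.h>
    #include <vector>

    struct Outgoing { int dest; std::vector<char> data; }; // hypothetical

    // Drain at most one pending message, if any, into the handler.
    static void receive_any(MPI_Comm comm,
                            void (*handle)(std::vector<char> const&)) {
      int flag;
      MPI_Status status;
      MPI_Iprobe(MPI_ANY_SOURCE, 0, comm, &flag, &status);
      if (!flag) return;
      int count;
      MPI_Get_count(&status, MPI_BYTE, &count);
      std::vector<char> buf(count);
      MPI_Recv(buf.data(), count, MPI_BYTE, status.MPI_SOURCE, 0, comm,
               MPI_STATUS_IGNORE);
      handle(buf);
    }

    void run_phase(std::vector<Outgoing> const& out, MPI_Comm comm,
                   void (*handle)(std::vector<char> const&)) {
      // 1. post all sends; synchronous mode means a completed send
      //    has been matched by a receive on the other side
      std::vector<MPI_Request> sends(out.size());
      for (size_t i = 0; i < out.size(); ++i)
        MPI_Issend(out[i].data.data(), (int)out[i].data.size(), MPI_BYTE,
                   out[i].dest, 0, comm, &sends[i]);
      // 2. while local sends are incomplete: receive any
      int done = 0;
      while (!done) {
        MPI_Testall((int)sends.size(), sends.data(), &done,
                    MPI_STATUSES_IGNORE);
        receive_any(comm, handle);
      }
      // 3-4. "stopped sending" ibarrier: keep receiving until every
      //      rank has finished its sends
      MPI_Request bar;
      MPI_Ibarrier(comm, &bar);
      done = 0;
      while (!done) {
        MPI_Test(&bar, &done, MPI_STATUS_IGNORE);
        receive_any(comm, handle);
      }
      // 5-6. "stopped receiving" ibarrier: once it completes, every rank
      //      has stopped receiving and the next phase may safely send
      MPI_Ibarrier(comm, &bar);
      MPI_Wait(&bar, MPI_STATUS_IGNORE); // could overlap local computation
    }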


Page 9: Hybrid System

[Diagram: a Blue Gene/Q node with 16 cores maps onto one program process with 16 threads, one thread per core.]

*Processes per node and threads per core are variable.

Page 10: Hybrid Programming System

1. Message passing: the de facto standard programming model for distributed-memory architectures.

2. The classic shared-memory programming model: mutexes, atomic operations, lock-free structures.

Most massively parallel code currently uses model 1.

The models are very different; converting from 1 to 2 is hard.


Page 11: Hybrid Programming System

We will try message passing between threads.

Threads can send to other threads in the same process and to threads in a different process.

It is the same model as MPI, with “process” replaced by “thread”.

Porting is faster: only the message passing API changes.

Shared memory is still exploited; locking becomes messaging:


    Thread 1:           Thread 2:
      Write(A)            Lock(lockA)
      Release(lockA)      Write(A)

becomes

    Thread 1:           Thread 2:
      Write(A)            ReceiveFrom(1)
      SendTo(2)           Write(A)

Page 12: Parallel Control Utility

Multi-threading API for hybrid MPI/thread mode:
- Launch a function pointer on N threads
- Get the thread ID and the number of threads in the process
- Uses pthreads directly
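A minimal sketch of the thread-launch portion, using pthreads directly; the function names are hypothetical stand-ins for the utility's real API.

    #include <pthread.h>
    #include <vector>

    static pthread_key_t tid_key; // thread-local slot for the thread ID

    struct Launch { void (*fn)(void*); void* arg; int tid; };

    static void* trampoline(void* p) {
      Launch* l = static_cast<Launch*>(p);
      pthread_setspecific(tid_key, &l->tid); // make the thread ID queryable
      l->fn(l->arg);
      return nullptr;
    }

    // Launch a function pointer on nthreads threads and join them all.
    void thread_run(void (*fn)(void*), void* arg, int nthreads) {
      pthread_key_create(&tid_key, nullptr);
      std::vector<pthread_t> threads(nthreads);
      std::vector<Launch> launches(nthreads);
      for (int i = 0; i < nthreads; ++i) {
        launches[i] = Launch{fn, arg, i};
        pthread_create(&threads[i], nullptr, trampoline, &launches[i]);
      }
      for (int i = 0; i < nthreads; ++i)
        pthread_join(threads[i], nullptr);
      pthread_key_delete(tid_key);
    }

    // Get this thread's ID within the process (valid inside fn).
    int thread_id() {
      return *static_cast<int*>(pthread_getspecific(tid_key));
    }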

Phased communication API:
- Send messages in batches per phase; detect the end of each phase

Hybrid MPI/thread communication API:
- Uses hybrid ranks and size
- Same phased API; automatically switches to hybrid mode when called within threads

Future: hardware queries by wrapping hwloc*

*Portable Hardware Locality (http://www.open-mpi.org/projects/hwloc/)


Page 13: Hybrid Message Passing

Everything is built from primitives, so we need hybrid primitives:
- Size: number of threads on the whole machine
- Rank: machine-unique ID of the thread
- Send, Probe, and Receive using hybrid ranks

[Diagram: process ranks 0 and 1, each with thread ranks 0 through 3; together these form hybrid ranks 0 through 7 (hybrid rank 4, for example, is thread 0 of process 1).]


Page 14: Hybrid Message Passing

Initial simple hybrid primitives just wrap the MPI primitives:
- MPI_Init_thread with MPI_THREAD_MULTIPLE
- MPI rank = floor(hybrid rank / threads per process)
- MPI tag bit fields:

    [ from thread | to thread | hybrid tag ]
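A minimal sketch of the rank arithmetic and tag packing; the field widths (8-bit thread IDs, 16-bit user tag) are illustrative assumptions and must fit under the implementation's MPI_TAG_UB.

    // hybrid rank arithmetic; tpp = threads per process, fixed at startup
    int hybrid_size(int mpi_size, int tpp) { return mpi_size * tpp; }
    int hybrid_rank(int mpi_rank, int thread, int tpp) {
      return mpi_rank * tpp + thread;
    }
    int mpi_rank_of(int hybrid, int tpp) { return hybrid / tpp; } // floor
    int thread_of(int hybrid, int tpp)   { return hybrid % tpp; }

    // pack (from-thread, to-thread, user tag) into one MPI tag;
    // the widths are illustrative and must fit under MPI_TAG_UB
    int make_tag(int from_thread, int to_thread, int user_tag) {
      return (from_thread << 24) | (to_thread << 16) | (user_tag & 0xFFFF);
    }
    int from_thread_of(int tag) { return (tag >> 24) & 0xFF; }
    int to_thread_of(int tag)   { return (tag >> 16) & 0xFF; }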

Layering:

    MPI mode:     Phased / ibarrier / MPI primitives
    Hybrid mode:  Phased / ibarrier / Hybrid primitives / MPI primitives


Page 15: Hybrid Partitioning

- Partition the mesh to processes, then partition each piece to threads
- Map parts to threads, one-to-one
- Share entities on inter-thread part boundaries

[Diagram: MPI partitions the mesh across processes 1 through 4; within each process, pthreads hold four parts, one part per thread.]

Page 16: Hybrid Partitioning

- Entities are shared within a process
- A part boundary entity is created once per process
- A part boundary entity is shared by all local parts
- Only the owning part can modify an entity (this avoids almost all contention)
- Remote copy: duplicate entity copy on another process
- The parallel control utility can provide architecture information to the mesh, which is distributed accordingly

[Diagram: partition model with parts P0, P1, P2 split between processes i and j, showing entities M0_i and M0_j; an inter-process boundary separates the processes, while the intra-process part boundary is implicit.]

Page 17: Mesh Migration

Moving mesh entities between parts:
- Input: local mesh elements to send to other parts
- The other entities to move are determined by adjacencies

Complex subtasks:
- Reconstructing mesh adjacencies
- Restructuring the partition model
- Recomputing remote copies

Considerations:
- Neighborhoods change: try to maintain scalability despite the loss of communication locality
- How to benefit from shared memory


Page 18: Mesh Migration

Migration steps (illustrated on parts P0, P1, P2):

(A) Mark destination part IDs
(B) Get affected entities and compute post-migration residence parts
(C) Exchange entities and update the part boundary
(D) Delete migrated entities

[Diagram: four snapshots of the example mesh on parts P0, P1, P2, one per step, with elements marked by their destination part IDs 1 and 2.]

Page 19: Hybrid Migration

Shared memory optimizations:
- Thread-to-part matching: use the partition model for concurrency
- Threads handle the part boundary entities they own; other entities are “released”
- Inter-process entity movement: send the entity to one thread per process
- Intra-process entity movement: send a message containing a pointer

[Diagram: threads 0 through 3 in two processes; shared entities are released before movement and grabbed again afterward.]


Page 20: Hybrid Migration

1. Release shared entities
2. Update entity resident part sets
3. Move entities between processes
4. Move entities between threads
5. Grab shared entities

Two-level temporary ownership, Master and Process Master:
- Master: smallest resident part ID
- Process Master: smallest on-process resident part ID
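A minimal sketch of the two ownership rules, assuming resident parts are stored in an ascending std::set of global part IDs and that parts map to processes in contiguous blocks (a hypothetical layout):

    #include <set>

    // Master: smallest resident part ID overall.
    int master(std::set<int> const& resident_parts) {
      return *resident_parts.begin(); // std::set iterates in ascending order
    }

    // Process Master: smallest resident part ID held by this process,
    // assuming parts are assigned to processes in contiguous blocks.
    int process_master(std::set<int> const& resident_parts,
                       int process, int parts_per_process) {
      for (int part : resident_parts)
        if (part / parts_per_process == process)
          return part;
      return -1; // this process holds no resident part of the entity
    }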


Page 21: Hybrid Migration

A representative phase:
1. The old Master Part sends the entity to the new Process Master Parts
2. Receivers bounce back the addresses of the created entities
3. Senders broadcast the union of all addresses


[Diagram: parts 0 and 1 send to parts 4 through 7. Old resident parts: {1,2,3}; new resident parts: {5,6,7}. The messages carry, in turn: the data to create a copy, the address of the local copy, and the addresses of all copies.]

Page 22: Hybrid Migration

Many subtle complexities:
1. Most steps must be done one dimension at a time
2. Assigning upward adjacencies causes thread contention
   - Use a separate communication phase to create them
   - Use another phase to remove them when entities are deleted
3. Assigning downward adjacencies requires addresses on the new process
   - Use a separate phase to gather remote copies


Page 23: Preliminary Results

- Model: bi-unit cube
- Mesh: 260K tets, 16 parts
- Migration: sort by X coordinate


Page 24: Preliminary Results

First test of the hybrid algorithm, using 1 node of the CCNI Blue Gene/Q:

1. 16 MPI ranks, 1 thread per rank:
   - 18.36 seconds for migration
   - 433 MB mesh memory use (sum over all MPI ranks)
2. 1 MPI rank, 16 threads per rank:
   - 9.62 seconds for migration + thread create/join
   - 157 MB mesh memory use (sum over all threads)


Page 25: Thank You

Seegyoung Seol – FMDB architect, part boundary sharing
Micah Corah – SCOREC undergraduate, threaded part loading