Optimizing Threaded MPI Execution on SMP Clusters

Hong Tang and Tao Yang, Department of Computer Science, University of California, Santa Barbara


Transcript of Optimizing Threaded MPI Execution on SMP Clusters

Page 1: Optimizing Threaded MPI Execution on SMP Clusters

Optimizing Threaded MPI Execution on SMP Clusters

Hong Tang and Tao Yang

Department of Computer Science

University of California, Santa Barbara

June 20, 2001

Page 2: Optimizing Threaded MPI Execution on SMP Clusters


Parallel Computation on SMP Clusters

Massively Parallel Machines → SMP Clusters

Commodity Components: Off-the-shelf Processors + Fast Network (Myrinet, Fast/Gigabit Ethernet)

Parallel Programming Model for SMP Clusters
- MPI: Portability, Performance, Legacy Programs
- MPI+Variations: MPI+Multithreading, MPI+OpenMP

Page 3: Optimizing Threaded MPI Execution on SMP Clusters


Threaded MPI Execution

MPI Paradigm: Separated Address Spaces for Different MPI Nodes

Natural Solution: MPI Nodes → Processes. What if we map MPI nodes to threads? (See the sketch after this list.)
- Faster synchronization among MPI nodes running on the same machine.
- Demonstrated in previous work [PPoPP '99] for a single shared-memory machine. (Developed techniques to safely execute MPI programs using threads.)

Threaded MPI Execution on SMP Clusters
- Intra-Machine Comm. through Shared Memory
- Inter-Machine Comm. through Network
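To make "MPI nodes as threads" concrete, the sketch below (illustrative only, not TMPI source; node_main and NODES_PER_MACHINE are made-up names) launches several MPI nodes as threads of one process, so nodes on the same machine share an address space and can synchronize through ordinary thread primitives:

    #include <pthread.h>
    #include <stdio.h>

    #define NODES_PER_MACHINE 4   /* MPI nodes hosted on this SMP (illustrative) */

    /* Each MPI node runs as a thread inside a single process per machine. */
    static void *node_main(void *arg)
    {
        int rank = (int)(long)arg;
        printf("MPI node %d runs as a thread and shares memory with its local peers\n",
               rank);
        /* ... the (transformed) user MPI program for this node would run here ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t nodes[NODES_PER_MACHINE];
        for (long r = 0; r < NODES_PER_MACHINE; r++)
            pthread_create(&nodes[r], NULL, node_main, (void *)r);
        for (int r = 0; r < NODES_PER_MACHINE; r++)
            pthread_join(nodes[r], NULL);
        return 0;
    }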

Page 4: Optimizing Threaded MPI Execution on SMP Clusters


Threaded MPI Execution Benefits Inter-Machine Communication

Common Intuition: Inter-machine communication cost is dominated by network delay, so the advantage of executing MPI nodes as threads diminishes.

Our Findings: Using threads can significantly reduce the buffering and orchestration overhead for inter-machine communications.

Page 5: Optimizing Threaded MPI Execution on SMP Clusters


Related Work

MPI on Network Clusters
- MPICH – a portable MPI implementation.
- LAM/MPI – communication through a standalone RPI server.

Collective Communication Optimization
- SUN-MPI and MPI-StarT – modify the MPICH ADI layer; target SMP clusters.
- MagPIe – targets SMP clusters connected through a WAN.

Lower Communication Layer Optimization
- MPI-FM and MPI-AM.

Threaded Execution of Message Passing Programs
- MPI-Lite, LPVM, TPVM.

Page 6: Optimizing Threaded MPI Execution on SMP Clusters


Background: MPICH Design

[Diagram: MPICH layered design. Layers, top to bottom: MPI Collective; MPI Point-to-Point; Abstract Device Interface (ADI) with the Chameleon interface; devices (T3D, SGI, others, and P4 over TCP and shmem).]

Page 7: Optimizing Threaded MPI Execution on SMP Clusters


MPICH Communication Structure

[Diagram: MPICH communication structure across several cluster nodes, shown without shared memory and with shared memory. Legend: WS – a cluster node; MPI node (process); MPICH daemon process; inter-process pipe; shared memory; TCP connection.]

Page 8: Optimizing Threaded MPI Execution on SMP Clusters


TMPI Communication Structure

[Diagram: TMPI communication structure across several cluster nodes. Legend: WS – a cluster node; MPI node (thread); TMPI daemon thread; TCP connection; direct memory access and thread synchronization.]

Page 9: Optimizing Threaded MPI Execution on SMP Clusters


Comparison of TMPI and MPICH

Drawbacks of MPICH w/ Shared Memory
- Intra-node communication limited by shared memory size.
- Busy polling to check messages from either the daemon or a local peer.
- Cannot do automatic resource clean-up.

Drawbacks of MPICH w/o Shared Memory
- Big overhead for intra-node communication.
- Too many daemon processes and open connections.

Drawbacks of both MPICH Systems
- Extra data copying for inter-machine communication.

Page 10: Optimizing Threaded MPI Execution on SMP Clusters


TMPI Communication Design

[Diagram: TMPI layered communication design. Layers, top to bottom: MPI Communication (MPI); Inter- and Intra-Machine Communication (INTER, INTRA); Abstract Network and Thread Sync Interface (NETD, THREAD); OS Facilities (TCP and others; pthread and other thread implementations).]

Page 11: Optimizing Threaded MPI Execution on SMP Clusters


Separation of Point-to-Point and Collective Communication Channels

Observation: MPI point-to-point communication and collective communication have different semantics (compared below).

Separated channels for pt2pt and collective comm.
- Eliminates daemon intervention for collective communication.
- Less effective for MPICH – no sharing of ports among processes.

Point-to-point: unknown source (MPI_ANY_SOURCE); out-of-order delivery (matched by message tag); asynchronous (non-blocking receive).

Collective: determined source (the ancestor in the spanning tree); in-order delivery; synchronous.
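To make this contrast concrete, here is a minimal MPI-1 sketch in C (illustrative only; the tag values and message contents are made up): the point-to-point receive accepts messages from any source with any tag, can overlap with other work, and learns the actual source and tag afterwards, while the broadcast has a fixed root and is called by every rank.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Point-to-point: source/tag may be unknown, receive is non-blocking. */
        if (rank == 0) {
            for (int i = 1; i < size; i++) {
                int msg;
                MPI_Request req;
                MPI_Status st;
                MPI_Irecv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                          MPI_COMM_WORLD, &req);   /* returns immediately      */
                /* ... unrelated work could overlap communication here ...     */
                MPI_Wait(&req, &st);               /* completes on arrival     */
                printf("got %d from rank %d (tag %d)\n",
                       msg, st.MPI_SOURCE, st.MPI_TAG);
            }
        } else {
            int msg = rank;
            MPI_Send(&msg, 1, MPI_INT, 0, rank /* tag */, MPI_COMM_WORLD);
        }

        /* Collective: fixed root, every rank participates, ordering implicit. */
        if (rank == 0) data = 42;
        MPI_Bcast(&data, 1, MPI_INT, 0 /* root */, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }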

[Diagram: TMPI communication structure (as on Page 8). Legend: WS – a cluster node; MPI node (thread); TMPI daemon thread; TCP connection; direct memory access and thread synchronization.]

Page 12: Optimizing Threaded MPI Execution on SMP Clusters


Hierarchy-Aware Collective Communication

Observation: two-level communication hierarchy.
- Inside an SMP node: shared memory (~10^-8 sec).
- Between SMP nodes: network (~10^-6 sec).

Idea: build the communication spanning tree in two steps (a parent-selection sketch follows the figure below).
- First, choose a root MPI node on each cluster node and build a spanning tree among all the cluster nodes.
- Second, all other MPI nodes connect to the local root node.

[Figure: spanning trees for an MPI program with 9 nodes on three cluster nodes; the cluster nodes contain MPI nodes 0-2, 3-5, and 6-8 respectively. Thick edges are network edges. Panels: MPICH (balanced binary tree), MPICH (hypercube), TMPI.]
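The two-step construction can be rendered as a small parent-selection routine. The sketch below is illustrative only (not TMPI source) and assumes that ranks on the same cluster node are contiguous, that each cluster node hosts the same number of ranks, that the lowest rank on each cluster node acts as the local root, and that the local roots form a binary tree over the network; the real implementation may choose roots and tree shapes differently.

    #include <stdio.h>

    /* Parent of `rank` in a two-level broadcast tree, or -1 for the global root. */
    static int tree_parent(int rank, int ranks_per_node)
    {
        int node       = rank / ranks_per_node;   /* which cluster node           */
        int local_root = node * ranks_per_node;   /* lowest rank on that node     */

        if (rank != local_root)
            return local_root;                    /* step 2: attach to local root */

        if (node == 0)
            return -1;                            /* global root of the tree      */

        /* Step 1: binary tree among local roots, indexed by cluster node. */
        return ((node - 1) / 2) * ranks_per_node;
    }

    int main(void)
    {
        /* Example matching the figure: 9 MPI nodes on 3 cluster nodes. */
        for (int r = 0; r < 9; r++)
            printf("rank %d -> parent %d\n", r, tree_parent(r, 3));
        return 0;
    }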

Page 13: Optimizing Threaded MPI Execution on SMP Clusters


Adaptive Buffer Management

Question: how do we manage temporary buffering of message data when the remote receiver is not ready to accept the data?

Choices:
- Send the data with the request – eager push.
- Send the request only, and send the data when the receiver is ready – three-phase protocol.
- TMPI – adapt between both methods (a sender-side sketch of this decision follows the protocol diagrams below).

[Diagrams: one-step eager-push protocol (remote node can buffer the msg; the request and data travel together); three-phase protocol (remote node cannot buffer the msg; request, then "receiver ready", then data); and graceful degradation from the eager-push to the three-phase protocol.]
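The slides do not give code for the adaptive choice; the following is one plausible sender-side rendering under stated assumptions: the sender keeps an estimate of the remote buffer space it may consume (the names channel_t, remote_credit, send_eager, send_request, wait_receiver_ready, and send_data are hypothetical, not TMPI's API) and degrades to the three-phase protocol when a message would not fit.

    #include <stdio.h>
    #include <stddef.h>

    typedef struct {
        size_t remote_credit;   /* estimated free buffer space on the remote node */
    } channel_t;

    /* Stub transport primitives (hypothetical names): print what they would do. */
    static void send_eager(channel_t *ch, const void *d, size_t n)
    { (void)ch; (void)d; printf("eager push: request + %zu bytes of data\n", n); }
    static void send_request(channel_t *ch, size_t n)
    { (void)ch; printf("phase 1: request for %zu bytes\n", n); }
    static void wait_receiver_ready(channel_t *ch)
    { (void)ch; printf("phase 2: receiver ready\n"); }
    static void send_data(channel_t *ch, const void *d, size_t n)
    { (void)ch; (void)d; printf("phase 3: %zu bytes of data\n", n); }

    static void adaptive_send(channel_t *ch, const void *data, size_t len)
    {
        if (len <= ch->remote_credit) {
            /* Eager push: the remote daemon can buffer the message. */
            send_eager(ch, data, len);
            ch->remote_credit -= len;   /* credit returns when the receiver drains */
        } else {
            /* Graceful degradation: request, wait for the receiver, then data. */
            send_request(ch, len);
            wait_receiver_ready(ch);
            send_data(ch, data, len);
        }
    }

    int main(void)
    {
        char payload[4096] = {0};
        channel_t ch = { .remote_credit = 1024 };
        adaptive_send(&ch, payload, 256);    /* fits: eager push          */
        adaptive_send(&ch, payload, 4096);   /* does not fit: three-phase */
        return 0;
    }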

Page 14: Optimizing Threaded MPI Execution on SMP Clusters


Experimental Study

Goal: illustrate the advantage of threaded MPI execution on SMP clusters.

Hardware Setting
- A cluster of 6 Quad-Xeon 500MHz SMPs, with 1GB main memory and 2 Fast Ethernet cards per machine.

Software Setting
- OS: RedHat Linux 6.0, kernel version 2.2.15 w/ channel bonding enabled.
- Process-based MPI system: MPICH 1.2.
- Thread-based MPI system: TMPI (45 functions in the MPI 1.1 standard).

Page 15: Optimizing Threaded MPI Execution on SMP Clusters


Inter-Cluster-Node Point-to-Point

Ping-pong, TMPI vs MPICH w/ shared memory.

[Plots: (a) Ping-Pong Short Message – round-trip time (us) vs. message size (bytes); (b) Ping-Pong Long Message – transfer rate (MB/s) vs. message size (KB); curves for TMPI and MPICH.]
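For reference, the microbenchmark pattern behind these plots is a standard ping-pong: rank 0 sends a message of a given size to rank 1, rank 1 echoes it back, and the round trip is timed over many iterations. The sketch below is a generic version of that pattern (the actual TMPI/MPICH benchmark source is not shown in the slides; the iteration count and default size are arbitrary).

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int iters = 1000;
        const int bytes = (argc > 1) ? atoi(argv[1]) : 1024;   /* message size */
        char *buf = malloc(bytes);
        int rank;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {            /* ping: send, then wait for the echo */
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {     /* pong: receive, then echo back      */
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d-byte round trip: %.2f us on average\n",
                   bytes, (t1 - t0) / iters * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Run with two MPI nodes placed on different cluster nodes for the inter-machine case, or on the same SMP for the intra-machine case on the next slide.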

Page 16: Optimizing Threaded MPI Execution on SMP Clusters


Intra-Cluster-Node Point-to-Point

Ping-pong, TMPI vs MPICH1 (MPICH w/ shared memory) and MPICH2 (MPICH w/o shared memory).

[Plots: (a) Ping-Pong Short Message – round-trip time (us) vs. message size (bytes); (b) Ping-Pong Long Message – transfer rate (MB/s) vs. message size (KB); curves for TMPI, MPICH1, and MPICH2.]

Page 17: Optimizing Threaded MPI Execution on SMP Clusters


Collective Communication

Reduce, Bcast, Allreduce. Each cell reports TMPI / MPICH_SHM / MPICH_NOSHM latency in microseconds, over three node distributions and three root-node settings (Allreduce has no root, so there is one value per distribution).

(us)  root    Reduce          Bcast           Allreduce
4x1   same    9/121/4384      10/137/7913     160/175/627
      rotate  33/81/3699      129/91/4238
      combo   25/102/3436     17/32/966
1x4   same    28/1999/1844    21/1610/1551    571/675/775
      rotate  146/1944/1878   164/1774/1834
      combo   167/1977/1854   43/409/392
4x4   same    39/2532/4809    56/2792/10246   736/1412/19914
      rotate  161/1718/8566   216/2204/8036
      combo   141/2242/8515   62/489/2054

1) MPICH w/o shared memory performs the worst.

2) TMPI is 70+ times faster than MPICH w/ Shared Memory for MPI_Bcast and MPI_Reduce.

3) For TMPI, the performance of 4X4 cases is roughly the summation of that of the 4X1 cases and that of the 1X4 cases.
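For example, TMPI's Allreduce time for the 4x4 case (736 us) is close to the sum of its 4x1 and 1x4 times: 160 us + 571 us = 731 us.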

Page 18: Optimizing Threaded MPI Execution on SMP Clusters


Macro-Benchmark Performance

[Plots: MFLOP rate vs. number of MPI nodes for (a) Matrix Multiplication and (b) Gaussian Elimination; curves for TMPI and MPICH.]

Page 19: Optimizing Threaded MPI Execution on SMP Clusters


Conclusions

http://www.cs.ucsb.edu/projects/tmpi/

Great advantage of threaded MPI execution on SMP clusters
- Micro-benchmark: 70+ times faster than MPICH.
- Macro-benchmark: 100% faster than MPICH.

Optimization Techniques
- Separated collective and point-to-point communication channels
- Adaptive buffer management
- Hierarchy-aware communications

Page 20: Optimizing Threaded MPI Execution on SMP Clusters


Background: Safe Execution of MPI Programs using Threads

Program Transformation: eliminate global and static variables (called permanent variables).

Thread-Specific Data (TSD): each thread can associate a pointer-sized data variable with a commonly defined key value (an integer). With the same key, different threads can set/get the values of their own copy of the data variable.

TSD-based Transformation: each permanent variable declaration is replaced with a KEY declaration. Each node associates its private copy of the permanent variable with the corresponding key. In places where global variables are referenced, use the global keys to retrieve the per-thread copies of the variables. (A POSIX-threads rendering of this idea appears below; the deck's own example follows on Page 21.)
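As a concrete illustration of TSD (the slides describe it abstractly), here is a minimal sketch using the standard POSIX-threads TSD calls pthread_key_create, pthread_setspecific, and pthread_getspecific; the variable and function names are illustrative, and TMPI's own key_create/setval/getval wrappers may differ.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Key standing in for the permanent variable `int X = 1;`. */
    static pthread_key_t  kX;
    static pthread_once_t kX_once = PTHREAD_ONCE_INIT;

    static void make_key(void) { pthread_key_create(&kX, free); }

    /* Each thread (MPI node) lazily creates and initializes its own copy of X. */
    static int *get_X(void)
    {
        pthread_once(&kX_once, make_key);
        int *pX = pthread_getspecific(kX);
        if (pX == NULL) {
            pX = malloc(sizeof(int));
            *pX = 1;                          /* original initializer of X */
            pthread_setspecific(kX, pX);
        }
        return pX;
    }

    static void *mpi_node(void *arg)
    {
        int rank = (int)(long)arg;            /* illustrative MPI node id */
        int *pX = get_X();
        printf("node %d sees its own X = %d\n", rank, (*pX)++);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (long i = 0; i < 4; i++) pthread_create(&t[i], NULL, mpi_node, (void *)i);
        for (int  i = 0; i < 4; i++) pthread_join(t[i], NULL);
        return 0;
    }

Every node prints X = 1, because each thread increments only its private copy.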

Page 21: Optimizing Threaded MPI Execution on SMP Clusters


Program Transformation – An Example

Source program:

    int X=1;

    int f() {
        return X++;
    }

Program after transformation:

    int kX=0;

    void main_init() {
        if (kX==0) kX=key_create();
    }

    void user_init() {
        int *pX=malloc(sizeof(int));
        *pX=1;
        setval(kX, pX);
    }

    int f() {
        int *pX=getval(kX);
        return (*pX)++;
    }
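A plausible reading of the transformed code (an assumption; the slide does not spell out the calling convention) is that main_init() runs once to create the key – the kX==0 guard makes repeated calls harmless – and user_init() runs once per thread so that each MPI node allocates and registers its private copy of X; f() then behaves exactly like the original, but operates on the per-thread copy.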