Protocol-Dependent Message-Passing Performance on Linux Clusters
Dave Turner – Xuehua Chen – Adam Oline
This work is funded by the DOE MICS office.
http://www.scl.ameslab.gov/
Inefficiencies in the communication system
[Diagram: the path from applications through MPI, the native layer, internal buses, the driver & NIC, and the switch fabric, annotated with the sources of inefficiency at each stage: poor MPI usage, no mapping, 50% bandwidth, 2-3x latency, OS bypass, TCP tuning, PCI and memory hardware limits, driver tuning, and topological bottlenecks]
[Diagram: NetPIPE (Network Protocol Independent Performance Evaluator) and its interchangeable modules]
2-sided protocols: MPI (MPICH, LAM, MPI/Pro, MP_Lite), PVM, and TCGMSG (runs on ARMCI or MPI)
1-sided protocols: MPI-2 1-sided gets or puts, SHMEM 1-sided gets or puts (Cray T3E, SGI Origins), GPSHMEM on ARMCI, ARMCI, and LAPI (IBM SP)
Raw performance: TCP (workstations, PCs), GM (Myrinet cards), and VIA OS-bypass (Giganet hardware, M-VIA Ethernet)
The NetPIPE utility
NetPIPE does a series of ping-pong tests between two nodes.
Message sizes are chosen at regular intervals, and with slight perturbations, to fully test the communication system for idiosyncrasies.
Latencies reported represent half the ping-pong time for messages smaller than 64 Bytes.
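As an illustration, a minimal sketch of this ping-pong pattern in MPI follows. It is not NetPIPE's actual source; the buffer size, trial count, and message tag are arbitrary choices. NetPIPE sweeps this measurement over the full range of message sizes described above.

/* Minimal ping-pong sketch (illustrative, not NetPIPE source code).
 * Rank 0 sends nbytes to rank 1, which echoes them back; half the
 * average round-trip time approximates the one-way transfer time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, i, nbytes = 1024, trials = 100;
    char *buf;
    double t0, one_way;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(nbytes);

    t0 = MPI_Wtime();
    for (i = 0; i < trials; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    one_way = (MPI_Wtime() - t0) / (2.0 * trials);   /* half the ping-pong time */

    if (rank == 0)
        printf("%d Bytes: %.1f us one-way, %.1f Mbps\n",
               nbytes, one_way * 1.0e6, 8.0 * nbytes / one_way / 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}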
Some typical uses
Measuring the overhead of message-passing protocols.
Helping to tune the optimization parameters of message-passing libraries.
Identifying dropouts in networking hardware.
Optimizing driver and OS parameters (socket buffer sizes, etc.).
What is not measured
NetPIPE can measure the load on the CPU using getrusage (a sketch follows this list), but this was not done here.
The effects from the different methods for maintaining message progress.
Scalability with system size.
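For completeness, a minimal sketch of the getrusage() measurement mentioned above: read the process CPU time before and after a transfer to estimate the CPU load. The surrounding program structure is illustrative.

/* Sketch of measuring per-process CPU time with getrusage() (illustrative). */
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Return user + system CPU time consumed so far by this process, in seconds. */
static double cpu_seconds(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_stime.tv_sec
         + (ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) * 1.0e-6;
}

int main(void)
{
    double before = cpu_seconds();
    /* ... the communication being measured would run here ... */
    printf("CPU time used: %.3f s\n", cpu_seconds() - before);
    return 0;
}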
A NetPIPE example: Performance on a Cray T3E
[Plot: throughput in Mbps vs. message size in Bytes for raw SHMEM, MP_Lite, the old Cray MPI, and the new Cray MPI]
Raw SHMEM delivers: 2600 Mbps, 2-3 us latency.
Cray MPI originally delivered: 1300 Mbps, 20 us latency.
MP_Lite delivers: 2600 Mbps, 9-10 us latency.
New Cray MPI delivers: 2400 Mbps, 20 us latency.
The tops of the spikes are where the message size is divisible by 8 Bytes.
The network hardware and computer test-beds
Linux PC test-bed
Two 1.8 GHz P4 computers
768 MB PC133 memory
32-bit 33 MHz PCI bus
RedHat 7.2 Linux 2.4.7-10
Alpha Linux test-bed
Two 500 MHz dual-processor Compaq DS20s
1.5 GB memory
32/64-bit 33 MHz PCI bus
RedHat 7.1 Linux 2.4.17
PC SMP test-bed
1.7 GHz dual-processor Xeon
1.0 GB memory
RedHat 7.3 Linux 2.4.18-3smp
All measurements were done back-to-back except for the Giganet hardware, which went through an 8-port switch.
MPICH
MPICH 1.2.3 release
Uses the p4 device for TCP.
P4_SOCKBUFSIZE must be increased to ~256 kBytes.
Rendezvous threshold can be changed in the source code.
MPICH-2.0 will be out soon!
Developed by Argonne National Laboratory and Mississippi State University.
[Plot: throughput in Mbps vs. message size in Bytes for raw TCP, MPICH out of the box, and MPICH with P4_SOCKBUFSIZE = 256 kB]
LAM/MPI
LAM 6.5.6-4 release from the RedHat 7.2 distribution.
Must lamboot the daemons.
-lamd directs messages through the daemons.
-O avoids data conversion for homogeneous systems.
No socket buffer size tuning.
No threshold adjustments.
Currently developed at Indiana University. http://www.lam-mpi.org/
[Plot: throughput in Mbps vs. message size in Bytes for raw TCP, LAM/MPI out of the box, LAM/MPI with -O, and LAM/MPI with -lamd]
MPI/Pro
MPI/Pro 1.6.3-1 release
Easy to install RPM
Requires rsh, not ssh
Setting -tcp_long to 128 kBytes gets rid of most of the dip at the rendezvous threshold.
Other parameters didn’t help.
Thanks to MPI Software Technology for supplying the MPI/Pro software for testing. http://www.mpi-softtech.com/
[Plot: throughput in Mbps vs. message size in Bytes for raw TCP, MPI/Pro out of the box, and MPI/Pro with -tcp_long = 128 kB]
The MP_Lite message-passing library
• A light-weight MPI implementation
• Highly efficient for the architectures supported
• Designed to be very user-friendly
• Ideal for performing message-passing research
http://www.scl.ameslab.gov/Projects/MP_Lite/
[Diagram: MPI applications (restricted to a subset of the MPI commands) and MP_Lite-syntax applications run on MP_Lite, which has modules for TCP (workstations, PCs), VIA OS-bypass (Giganet hardware, M-VIA Ethernet), an SMP shared-memory segment, SHMEM one-sided functions (Cray T3E, SGI Origins), MPI (to retain portability for the MP_Lite syntax), and mixed systems of distributed SMPs]
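To illustrate the idea behind an SMP shared-memory module, the sketch below shows two processes on one node passing a message by copying through a region mapped into both address spaces. This is a sketch of the general technique only, not MP_Lite's actual implementation; the segment size, flag byte, and message are placeholders.

/* Sketch of intra-node message passing through a shared-memory segment
 * (illustrative only; a real library would use proper synchronization
 * and a ring of buffers rather than a single flag byte). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define SEG_SIZE 4096

int main(void)
{
    /* Map one segment that stays visible to both processes after fork(). */
    char *seg = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    volatile char *ready = seg;      /* first byte: "message ready" flag */
    *ready = 0;

    if (fork() == 0) {               /* receiver process */
        while (*ready == 0)          /* spin until the sender flags the data */
            ;
        printf("received: %s\n", seg + 1);
        _exit(0);
    }

    /* Sender process: copy the message into the segment, then set the flag. */
    strcpy(seg + 1, "hello through shared memory");
    __sync_synchronize();            /* make the copy visible before the flag */
    *ready = 1;

    wait(NULL);
    return 0;
}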
PVM
[Plot: throughput in Mbps vs. message size in Bytes for raw TCP, PVM out of the box (uses the pvmd daemons), PVM with XDR encoding, and PVM fully optimized]
PVM 3.4.3 release from the RedHat 7.2 distribution.
Uses XDR encoding and the pvmd daemons by default.
pvm_setopt(PvmRoute, PvmRouteDirect) bypasses the pvmd daemons.
pvm_initsend(PvmDataInPlace) avoids XDR encoding for homogeneous systems (see the sketch below).
Developed at Oak Ridge National Laboratory. http://www.csm.ornl.gov/pvm/
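A hedged sketch of how the two calls above might appear on the sending side; the wrapper function, message tag, and variable names are illustrative, not part of PVM.

/* Send-side sketch using the PVM optimizations described above. */
#include <pvm3.h>

void optimized_send(int dest_tid, double *data, int n)
{
    /* Route messages directly between tasks instead of through the
     * pvmd daemons (needs to be set once per task). */
    pvm_setopt(PvmRoute, PvmRouteDirect);

    /* Leave the data in place instead of packing it with XDR encoding,
     * which is only needed on heterogeneous systems. */
    pvm_initsend(PvmDataInPlace);
    pvm_pkdouble(data, n, 1);
    pvm_send(dest_tid, 99);          /* 99 is an arbitrary message tag */
}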
Performance on Netgear GA620 Fiber Gigabit Ethernet cards between two PCs
[Plot: throughput in Mbps vs. message size in Bytes for raw TCP, MP_Lite, MPICH, LAM/MPI, MPI/Pro, PVM, and TCGMSG]
All libraries do reasonably well on this mature card and driver.
MPICH and PVM suffer from an extra memory copy.
LAM/MPI, MPI/Pro, and MPICH have dips at the rendezvous threshold due to the large 180 us latency. Tunable thresholds would easily eliminate this minor drop in performance.
Netgear GA620 fiber GigE 32/64-bit 33/66 MHz AceNIC driver
Performance on TrendNet and Netgear GA622T Gigabit Ethernet cards between two Linux PCs
[Plot: throughput in Mbps vs. message size in Bytes for raw TCP, MP_Lite, MPICH, LAM/MPI, MPI/Pro, and PVM]
Both cards are very sensitive to the socket buffer sizes.
MPICH and MP_Lite do well because they adjust the socket buffer sizes.
Increasing the default socket buffer size in the other libraries, or making it an adjustable parameter, would fix this problem (see the sketch below).
More tuning of the ns83820 driver would also fix this problem.
TrendNet TEG-PCITX copper GigE 32-bit 33/66 MHz ns83820 driver
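Under the hood, adjusting the socket buffer sizes comes down to a setsockopt() call on each TCP socket, as in the sketch below. The function name and the 256 kByte example value are illustrative; the kernel also caps these values through /proc/sys/net/core/rmem_max and wmem_max, and the calls are most effective before the connection is established.

/* Sketch: enlarge a TCP socket's send and receive buffers (illustrative). */
#include <sys/types.h>
#include <sys/socket.h>

int set_socket_buffers(int sock, int nbytes)
{
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &nbytes, sizeof(nbytes)) < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &nbytes, sizeof(nbytes)) < 0)
        return -1;
    return 0;
}

/* e.g. set_socket_buffers(sock, 256 * 1024) before connecting. */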
Performance on SysKonnect Gigabit Ethernet cards between Compaq DS20s running Linux
[Plot: throughput in Mbps vs. message size in Bytes for raw TCP, MP_Lite, MPICH, LAM/MPI, and PVM]
The SysKonnect cards using a 9000 Byte MTU provide a more challenging environment.
MP_Lite delivers nearly all the 900 Mbps performance.
LAM/MPI again suffers due to the smaller socket buffer sizes.
MPICH suffers from the extra memory copy.
PVM suffers from both.
SysKonnect SK-9843-SX fiber GigE 32/64-bit 33/66 MHz sk98lin driver
Performance on Myrinet cards between two Linux PCs
[Plot: throughput in Mbps vs. message size in Bytes for raw GM, MPICH-GM, MPI/Pro-GM, IP-GM, and TCP/IP on GigE]
MPICH-GM and MPI/Pro-GM both pass almost all the performance of GM through to the application.
SCore claims to provide better performance, but is not quite ready for prime time yet.
IP-GM provides little benefit over TCP on Gigabit Ethernet, and at a much greater cost.
Myrinet PCI64A-2 SAN card 66 MHz RISC with 2 MB memory
Performance on VIA Giganet hardware and on SysKonnect GigE cards using M-VIA between two Linux PCs
[Plot: throughput in Mbps vs. message size in Bytes for MP_Lite and MVICH on Giganet, MPI/Pro on Giganet, and MP_Lite and MVICH on M-VIA over the SysKonnect cards (via_sk)]
MPI/Pro, MVICH, and MP_Lite all provide 800 Mbps bandwidth on the Giganet hardware, but MPI/Pro has a longer latency of 42 us compared with 10 us for the others.
The M-VIA 1.2b2 performance is roughly at the same level that raw TCP provides.
The M-VIA 1.2b3 release has not been tested, nor has using jumbo frames.
Giganet CL1000 cards through an 8-port CL5000 switch
http://www.nersc.gov/research/ftg/{via,mvich}/
SMP message-passing performance on a dual-processor Compaq DS20 running Alpha Linux
With the data starting in main memory.
[Plot: throughput in Mbps vs. message size in Bytes for MP_Lite (7 us), LAM/MPI (48 us), MPICH (78 us), and PVM (128 us)]
SMP message-passing performance on a dual-processor Compaq DS20 running Alpha Linux
With the data starting in cache.
[Plot: throughput in Mbps vs. message size in Bytes for MP_Lite (6 us), LAM/MPI (47 us), MPICH (75 us), and PVM (131 us)]
SMP message-passing performance on a dual-processor Xeon running Linux
With the data starting in main memory.
[Plot: throughput in Mbps vs. message size in Bytes for MP_Lite (3 us), LAM/MPI (1 us), MPICH (25 us), and PVM (33 us)]
SMP message-passing performance on a dual-processor Xeon running Linux
With the data starting in cache.
[Plot: throughput in Mbps vs. message size in Bytes for MP_Lite (2 us), LAM/MPI (1 us), MPICH (25 us), and PVM (33 us)]
One-sided Puts between two Linux PCs
[Plot: throughput in Mbps vs. message size in Bytes for raw TCP, MP_Lite, ARMCI, and LAM/MPI]
MP_Lite is SIGIO-based, so MPI_Put() and MPI_Get() finish without a fence.
LAM/MPI has no message progress, so a fence is required (the fence-synchronized pattern is sketched below).
ARMCI uses a polling method, and therefore does not require a fence.
An MPI-2 implementation of MPICH is under development.
An MPI-2 implementation of MPI/Pro is under development.
Netgear GA620 fiber GigE 32/64-bit 33/66 MHz AceNIC driver
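The fence-synchronized put pattern compared above might look like the following minimal MPI-2 sketch; the window size and datatype are arbitrary choices, and the code is illustrative rather than any library's benchmark.

/* Minimal MPI-2 one-sided put sketch (illustrative). Every rank exposes a
 * window; rank 0 puts into rank 1, and the closing fence guarantees the
 * transfer is complete even in a library with no independent progress. */
#include <mpi.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, i;
    double local[N], remote[N];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < N; i++)
        local[i] = rank;

    MPI_Win_create(remote, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                         /* open the access epoch */
    if (rank == 0)
        MPI_Put(local, N, MPI_DOUBLE, 1, 0, N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                         /* close it; the put is complete */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}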
Conclusions
Most message-passing libraries do reasonably well if properly tuned.
All need to have the socket buffer sizes and thresholds user-tunable.
Optimizing the network drivers would also correct some of the problems.
There is still much room for improvement for SMP and 1-sided communications.
Future Work
All network cards should be tested on a 64-bit 66 MHz PCI bus to put more strain on the message-passing libraries.
Testing within real applications is vital to verify NetPIPE results, test scalability of the implementation methods, investigate loading of the CPU, and study the effects of the various approaches to maintaining message progress.
SCore should be compared to GM.
VIA and InfiniBand modules are needed for NetPIPE.