Page 1:

The MPC Parallel Computer

Hardware, Low-level Protocols and Performances

University P. & M. Curie (PARIS)

LIP6 laboratory

Olivier Glück

Page 2:

Introduction

Very low cost, high performance parallel computer

PC cluster using an optimized interconnection network

A PCI network board (FastHSL) developed at LIP6:
  High speed communication network (HSL, 1 Gbit/s)
  RCUBE: router (8x8 crossbar, 8 HSL ports)
  PCIDDC: PCI network controller (implements a specific communication protocol)

Goal: supply efficient software layers

Page 3:

Hardware architecture

[Diagram: two standard PCs running Linux or FreeBSD, each fitted with a FastHSL board (RCUBE router and PCIDDC controller), connected by HSL links]

Page 4:

The MPC machine

Page 5:

The FastHSL board

Page 6:

Hardware layers

HSL link (1 Gbit/s):
  coaxial cable, point-to-point, full duplex
  data encoded on 12 bits
  low-level flow control

RCUBE:
  Rapid Reconfigurable Router, extensible
  latency: 150 ns
  wormhole strategy, interval routing schemes

PCIDDC:
  the network interface controller
  implements the communication protocol: remote DMA, zero copy

Page 7:

Low-level communication protocol

Zero-copy ("direct deposit") protocol

The FastHSL board accesses host memory directly.

[Diagram: data paths between sender and receiver. A traditional protocol copies through process memory, kernel memory and I/O memory on each side; the direct deposit protocol moves data straight from the sender's process memory to the receiver's process memory]

Page 8:

PUT : the lowest level software API

Unix-based kernel layer: FreeBSD or Linux

Zero-copy strategy

Provides a basic kernel API using the PCIDDC remote write

Parameters of a PUT() call:
  remote node
  local physical address
  remote physical address
  size of data
  message identifier
  callback functions for signaling
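Taken together, these parameters suggest a call along the following lines. This is a hedged sketch reconstructed from the list above; the names and types are illustrative, not the actual MPC kernel interface.

/* Illustrative sketch of a PUT() signature, derived from the parameter
 * list above. Not the real MPC API: names and types are assumptions. */
#include <stdint.h>
#include <stddef.h>

typedef void (*put_callback_t)(unsigned msg_id, void *arg);

int put(unsigned       remote_node,   /* destination node               */
        uint64_t       local_paddr,   /* physical address of the data   */
        uint64_t       remote_paddr,  /* where to deposit it remotely   */
        size_t         size,          /* message size in bytes          */
        unsigned       msg_id,        /* message identifier             */
        put_callback_t sent_cb,       /* signals local completion       */
        put_callback_t received_cb,   /* signals remote delivery        */
        void          *cb_arg);       /* passed back to the callbacks   */

Note that both addresses are physical: the caller is responsible for the virtual-to-physical mapping, which is exactly what the MPI port has to deal with on the following slides.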

Page 9:

PUT performances

PC Pentium II 350 MHz

Throughput: 494 Mbit/s
Half-throughput point: 66 bytes
Latency: 4 µs (without system call)

[Plot: PUT throughput (Mbit/s, 0 to 600) versus message size (bytes, 1 to 100000, log scale)]

Page 10:

MPI over MPC

Implementation of MPICH over the PUT API

[Layer stack: MPI / PUT / FreeBSD or Linux driver / HSL network]
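The figures on the following slides are the kind usually measured with a simple ping-pong micro-benchmark written against the standard MPI interface. The sketch below is generic MPI code, not MPI-MPC internals; run with, e.g., mpirun -np 2.

/* Minimal MPI ping-pong between ranks 0 and 1. Generic MPI code,
 * usable over any MPICH-based implementation such as MPI-MPC. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    char buf[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof buf);

    if (rank == 0) {
        double t0 = MPI_Wtime();
        MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("round trip: %.1f us\n", (MPI_Wtime() - t0) * 1e6);
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}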

Page 11:

MPI implementation (1)

2 main problems:
  Where to write data in remote physical memory?
  PUT only transfers blocks that are contiguous in physical memory (see the sketch after this list)

2 kinds of messages:
  control (or short) messages
  data messages
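To make the contiguity constraint concrete: a buffer that is contiguous in virtual memory generally is not in physical memory, so a transfer has to be split at page boundaries, one PUT per physically contiguous run. A minimal sketch; virt_to_phys() and put() are stubs standing in for the real kernel services, which the slides do not detail.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096UL

/* Stub for illustration only: real code would obtain the mapping
 * from the kernel. */
static uint64_t virt_to_phys(const void *vaddr)
{
    return (uint64_t)(uintptr_t)vaddr;   /* identity mapping, for the demo */
}

/* Stub standing in for the real PUT service. */
static void put(unsigned node, uint64_t lpaddr, uint64_t rpaddr, size_t len)
{
    printf("PUT node=%u local=0x%llx remote=0x%llx len=%zu\n",
           node, (unsigned long long)lpaddr, (unsigned long long)rpaddr, len);
}

/* Send a virtually contiguous buffer by issuing one PUT per physically
 * contiguous run; conservatively, one per page. */
static void put_fragmented(unsigned node, const char *buf, size_t len,
                           uint64_t remote_paddr)
{
    size_t off = 0;
    while (off < len) {
        /* Bytes left in the current page. */
        size_t in_page = PAGE_SIZE - ((uintptr_t)(buf + off) & (PAGE_SIZE - 1));
        size_t chunk = (len - off < in_page) ? len - off : in_page;
        put(node, virt_to_phys(buf + off), remote_paddr + off, chunk);
        off += chunk;
    }
}

int main(void)
{
    static char msg[10000];
    memset(msg, 'x', sizeof msg);
    put_fragmented(1, msg, sizeof msg, 0x100000);
    return 0;
}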

Page 12:

MPI implementation (2)

Short (or control) messages:
  carry control information or limited-size user data
  use buffers allocated at start-up, contiguous in physical memory
  one memory copy on emission and one on reception (sketched below)
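A minimal sketch of this path, assuming preallocated, physically contiguous slots whose addresses both sides already know; all names are invented for illustration.

#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SLOT_SIZE 256   /* assumed short-message payload limit */

/* Buffer allocated once at start-up, contiguous in physical memory,
 * so its physical address can be exchanged in advance. */
static char send_slot[SLOT_SIZE];

/* Stub standing in for a PUT of the slot into the peer's matching slot. */
static void put_slot(unsigned node, const char *slot, size_t len)
{
    printf("PUT %zu bytes of slot %p to node %u\n",
           len, (const void *)slot, node);
}

/* Emission: exactly one memory copy, user buffer -> preallocated slot. */
static void send_short(unsigned node, const void *data, size_t len)
{
    assert(len <= SLOT_SIZE);
    memcpy(send_slot, data, len);       /* the one copy on emission */
    put_slot(node, send_slot, len);
}

/* Reception mirrors it: one copy, slot -> user buffer. */
static void recv_short(void *data, const char *recv_slot, size_t len)
{
    memcpy(data, recv_slot, len);       /* the one copy on reception */
}

int main(void)
{
    char in[16], out[16] = "hello";
    send_short(1, out, sizeof out);
    recv_short(in, send_slot, sizeof in);   /* loopback, for the demo */
    return 0;
}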

Page 13:

MPI implementation (3)

Data messages:
  transfer user data larger than the maximum size of a control message, or serve specific MPI functions (e.g. MPI_Ssend)
  use a rendez-vous (RDV) protocol
  allow zero-copy transfer

Rendez-vous protocol

[Diagram: message exchange between sender and receiver — ctl, ack and data messages; the sender announces the transfer, the receiver acknowledges, then the data moves]
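One plausible reading of that exchange in code. A hedged sketch: the message layouts and helper functions are invented, and only the ctl/ack/data sequencing comes from the slide.

#include <stdint.h>
#include <stddef.h>

/* Illustrative message types for the handshake (invented names). */
struct rdv_ctl { unsigned msg_id; size_t len; };           /* announce */
struct rdv_ack { unsigned msg_id; uint64_t dst_paddr; };   /* go-ahead */

/* Hypothetical transport helpers: short messages carry ctl/ack,
 * PUT carries the payload. */
void send_ctl(unsigned node, const struct rdv_ctl *c);
void send_ack(unsigned node, const struct rdv_ack *a);
struct rdv_ack wait_for_ack(unsigned msg_id);
struct rdv_ctl wait_for_ctl(void);
uint64_t posted_recv_paddr(unsigned msg_id);  /* physical address of the
                                                 matching user buffer */
void put(unsigned node, uint64_t lpaddr, uint64_t rpaddr, size_t len);

/* Sender: announce, wait for the destination address, then deposit the
 * payload straight into the receiver's user buffer (zero copy). */
void rdv_send(unsigned node, uint64_t src_paddr, size_t len, unsigned id)
{
    struct rdv_ctl c = { id, len };
    send_ctl(node, &c);                       /* ctl  */
    struct rdv_ack a = wait_for_ack(id);      /* ack  */
    put(node, src_paddr, a.dst_paddr, len);   /* data */
}

/* Receiver: once the matching receive is posted, reply with the physical
 * address of the user buffer; the data then lands in place. */
void rdv_serve(unsigned sender_node)
{
    struct rdv_ctl c = wait_for_ctl();
    struct rdv_ack a = { c.msg_id, posted_recv_paddr(c.msg_id) };
    send_ack(sender_node, &a);
}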

Page 14:

MPI performances (1)

Latency: 26 µs
Throughput: 490 Mbit/s

[Plot: MPI-MPC throughput (Mbit/s, 0 to 600) versus message size (bytes); curve: MPI-MPC / P350 / FreeBSD]

Page 15:

MPI performances (2)

[Plot: throughput (log2 scale) versus message size (1 to 262144 bytes) for MPI-T3E / Proc 300 and MPI-MPC / P350 / FreeBSD]

Cray T3E: latency 57 µs, throughput 1200 Mbit/s
MPC: latency 26 µs, throughput 490 Mbit/s

Page 16:

MPI performances (3)

[Plot: throughput (Mbit/s, 0 to 450) versus message size (1 to 65536 bytes) for MPI-BIP / P200 / Linux and MPI-MPC / P166 / Linux]

Page 17:

Conclusion

MPC: a very low cost PC cluster

Performance similar to Myrinet clusters

Very good extensibility (no centralized router)

Perspectives:
  a new router
  another network controller
  improvements to MPI over MPC