Parallel Computing on Wide-Area Clusters: the Albatross Project
Aske Plaat, Thilo Kielmann, Jason Maassen,
Rob van Nieuwpoort, Ronald Veldema, Henri Bal
Vrije Universiteit Amsterdam, Faculty of Sciences
Introduction
• Cluster computing is becoming popular
  - Excellent price/performance ratio
  - Fast commodity networks
• Next step: wide-area cluster computing
  - Use multiple clusters for a single application
  - A form of metacomputing
• Challenges
  - Software infrastructure (e.g., Legion, Globus)
  - Parallel applications that can tolerate WAN latencies
Albatross project
• Study applications and programming environments for wide-area parallel systems
• Basic assumption: the wide-area system is hierarchical
  - Connect clusters, not individual workstations
• General approach
  - Optimize applications to exploit the hierarchical structure → most communication stays local
Outline
• Experimental system and programming environments
• Application-level optimizations
• Performance analysis
• Wide-area optimized programming environments
Distributed ASCI Supercomputer (DAS)
• Four clusters: VU (128 nodes), UvA (24), Leiden (24), Delft (24)
• Clusters connected by 6 Mbit/s ATM
• Node configuration:
  - 200 MHz Pentium Pro
  - 64-128 MB memory
  - 2.5 GB local disk
  - Myrinet LAN
  - Fast Ethernet LAN
  - RedHat Linux 2.0.36
Programming environments
• Existing library/language + expose hierarchical structure
  - Number of clusters
  - Mapping of CPUs to clusters
• Panda library
  - Point-to-point communication
  - Group communication
  - Multithreading
• Layered stack: Java, Orca, and MPI run on Panda; Panda runs on LFC (over Myrinet) and TCP/IP (over ATM)
Example: Java
• Remote Method Invocation (RMI)
  - Simple, transparent, object-oriented, RPC-like communication primitive
• Problem: RMI performance
  - JDK RMI on Myrinet is a factor of 40 slower than C RPC (1228 vs. 30 µs)
• Manta: high-performance Java system [PPoPP'99]
  - Native (static) compilation: source → executable
  - Fast RMI protocol between Manta nodes
  - JDK-style protocol to interoperate with JVMs
JDK versus Manta

                 JDK            time (µs)   Manta          time (µs)
Serialization    runtime        670         compiler       11
RMI protocol     heavy-weight   950         light-weight   10
Communication    TCP/IP         280         RPC/LFC        30

(200 MHz Pentium Pro, Myrinet, JDK 1.1.4 interpreter, 1 object as parameter)
Manta on wide-area DAS

              null-latency (µs)   bandwidth (MByte/s)
Myrinet LAN   39.9                38.6
ATM WAN       5600                0.55

• 2 orders of magnitude between intra-cluster (LAN) and inter-cluster (WAN) communication performance
• Application-level optimizations [JavaGrande'99]
  - Minimize WAN overhead
Example: SOR
• Red/black Successive Overrelaxation
  - Neighbor communication, using RMI
• Problem: nodes at cluster boundaries
  - Overlap wide-area communication with computation
  - RMI is synchronous → use multithreading

[Figure: CPUs 1-3 in cluster 1 and CPUs 4-6 in cluster 2 exchange boundary rows; intra-cluster RMI takes ~50 µs, inter-cluster RMI ~5600 µs]
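The overlap pattern on the boundary nodes can be sketched as follows. This is an illustrative simulation, not the Manta API: `sendBoundary` is a hypothetical stand-in for a synchronous RMI to the neighboring cluster, and the simulated WAN delay lets interior updates proceed while the transfer is "in flight".

```java
// Sketch of hiding WAN latency in red/black SOR: ship the boundary row in a
// background thread, update rows that need no remote data meanwhile, then
// finish the rows that depend on the exchange. Names are illustrative.
public class SorOverlap {
    static double[][] grid = new double[6][6];

    // Placeholder for a synchronous RMI carrying a boundary row to the next
    // cluster (~5600 us over the ATM WAN; simulated here with a short sleep).
    static void sendBoundary(double[] row) {
        try { Thread.sleep(5); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // One Gauss-Seidel-style update of row i from its four neighbors.
    static void updateRow(int i) {
        for (int j = 1; j < grid[i].length - 1; j++) {
            grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                               + grid[i][j - 1] + grid[i][j + 1]);
        }
    }

    // One iteration on this cluster's partition (rows 1..4).
    static void iterate() {
        Thread sender = new Thread(() -> sendBoundary(grid[4]));
        sender.start();                              // WAN transfer overlaps...
        for (int i = 2; i <= 3; i++) updateRow(i);   // ...with interior updates
        try { sender.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        updateRow(1);                                // boundary rows last
        updateRow(4);
    }

    public static void main(String[] args) {
        for (double[] r : grid) java.util.Arrays.fill(r, 1.0);
        iterate();                                   // constant grid is a fixed point
        System.out.println(grid[2][2]);
    }
}
```

Because RMI itself is synchronous, the extra thread is what buys the overlap; the computation/communication ratio of the interior region determines how much of the ~5600 µs round-trip can actually be hidden.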
Wide-area optimizations

Application   Communication structure   Wide-area optimization
SOR           nearest-neighbor          latency hiding
ASP           broadcast                 spanning-tree broadcast
TSP           central job queue         static distribution over clusters
IDA*          work stealing             steal from local cluster first
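The IDA* policy in the table above can be sketched as a two-pass victim search. This is an illustrative model, not the Albatross implementation: `Worker` and `trySteal` are hypothetical names, and "cluster" is just an integer tag.

```java
import java.util.ArrayDeque;
import java.util.List;

// Sketch of "steal from local cluster first": an idle worker scans peers in
// its own cluster (cheap LAN round-trip) before paying a WAN round-trip to
// steal from a remote cluster. Names are illustrative.
public class LocalFirstStealing {
    static class Worker {
        final int cluster;
        final ArrayDeque<Integer> jobs = new ArrayDeque<>();
        Worker(int cluster) { this.cluster = cluster; }
    }

    // Returns a victim with work, preferring same-cluster peers; null if none.
    static Worker trySteal(Worker thief, List<Worker> all) {
        for (Worker w : all)      // pass 1: local cluster (LAN latency)
            if (w != thief && w.cluster == thief.cluster && !w.jobs.isEmpty())
                return w;
        for (Worker w : all)      // pass 2: remote clusters (WAN latency)
            if (w != thief && !w.jobs.isEmpty())
                return w;
        return null;              // no work anywhere: terminate or retry
    }
}
```

The point of the ordering is that most steals stay inside a cluster, so the expensive inter-cluster path is only taken when an entire cluster has run dry.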
Performance of Java applications
• Wide-area DAS system: 4 clusters of 10 CPUs
• Sensitivity to wide-area latency and bandwidth: see HPCA'99

[Bar chart: speedups of SOR, ASP, TSP, and IDA* on 1 x 10 CPUs, 4 x 10 CPUs, and 1 x 40 CPUs; y-axis 0-45]
Discussion
• Optimized applications obtain good speedups
  - Reduce wide-area communication, or hide its latency
• Java RMI is easy to use, but some optimizations are awkward to express
  - Lack of asynchronous communication and broadcast
• The RMI model does not help exploit the hierarchical structure of wide-area systems
• Need a wide-area optimized programming environment
MagPIe: wide-area collective communication
• Collective communication among many processors
  - e.g., multicast, all-to-all, scatter, gather, reduction
• MagPIe: MPI's collective operations optimized for hierarchical wide-area systems [PPoPP'99]
• Transparent to the application programmer
Spanning-tree broadcast

[Figure: broadcast trees spanning clusters 1-4]

• MPICH (WAN-unaware)
  - Wide-area latencies are chained
  - Data is sent multiple times over the same WAN link
• MagPIe (WAN-optimized)
  - Each sender-receiver path contains at most 1 WAN link
  - No data item travels multiple times to the same cluster
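The WAN-optimized scheme can be sketched as a small simulation (illustrative names, not the MagPIe API): the root sends the message once per remote cluster, to a coordinator, which then fans it out over the fast LAN, so every root-to-receiver path crosses at most one WAN link.

```java
import java.util.List;

// Sketch of hierarchy-aware broadcast: count WAN messages to show that only
// one crosses to each remote cluster. Cluster 0 holds the root; each int[]
// models the receive buffers of one cluster's nodes. Names are illustrative.
public class WideAreaBroadcast {
    static int wanMessages = 0;

    static void broadcast(int value, List<int[]> clusters) {
        for (int c = 0; c < clusters.size(); c++) {
            if (c != 0) wanMessages++;        // one WAN message per remote cluster
            int[] cluster = clusters.get(c);
            for (int n = 0; n < cluster.length; n++)
                cluster[n] = value;           // coordinator fans out over the LAN
        }
    }

    public static void main(String[] args) {
        List<int[]> clusters = new java.util.ArrayList<>();
        for (int c = 0; c < 4; c++) clusters.add(new int[3]);  // 4 clusters of 3 nodes
        broadcast(42, clusters);
        System.out.println(wanMessages);      // 3: one per remote cluster
    }
}
```

A WAN-unaware tree, by contrast, may route through nodes in different clusters arbitrarily, chaining several 5600 µs hops on one path and resending the same data into a cluster it has already reached.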
MagPIe results
• MagPIe collective operations are wide-area optimal, except non-associative reduction
• Operations up to 10 times faster than MPICH
• Factor 2-3 speedup improvement over MPICH for some (unmodified) MPI applications
Conclusions
• Wide-area parallel programming is feasible for many applications
• Exploit hierarchical structure of wide-area systems to minimize WAN overhead
• Programming systems should take hierarchical structure of wide-area systems into account