Parallel Computing on Wide-Area Clusters: the Albatross Project
Aske Plaat, Thilo Kielmann, Jason Maassen,
Rob van Nieuwpoort, Ronald Veldema, Henri Bal
Vrije Universiteit Amsterdam, Faculty of Sciences
Introduction
• Cluster computing is becoming popular
  - Excellent price/performance ratio
  - Fast commodity networks
• Next step: wide-area cluster computing
  - Use multiple clusters for a single application
  - A form of metacomputing
• Challenges
  - Software infrastructure (e.g., Legion, Globus)
  - Parallel applications that can tolerate WAN latencies
Albatross project
• Study applications and programming environments for wide-area parallel systems
• Basic assumption: the wide-area system is hierarchical
  - Connect clusters, not individual workstations
• General approach
  - Optimize applications to exploit the hierarchical structure → most communication stays local
Outline
• Experimental system and programming environments
• Application-level optimizations
• Performance analysis
• Wide-area optimized programming environments
Distributed ASCI Supercomputer (DAS)
• Four clusters: VU (128 nodes), UvA (24), Leiden (24), Delft (24)
• Clusters connected by 6 Mbit/s ATM
• Node configuration:
  - 200 MHz Pentium Pro
  - 64-128 MB memory
  - 2.5 GB local disk
  - Myrinet LAN
  - Fast Ethernet LAN
  - RedHat Linux 2.0.36
Programming environments
• Existing library/language + expose hierarchical structure
  - Number of clusters
  - Mapping of CPUs to clusters
• Panda library
  - Point-to-point communication
  - Group communication
  - Multithreading
• Layered stack: Java, Orca, and MPI run on Panda; Panda runs on LFC (over Myrinet) and TCP/IP (over ATM)
Example: Java
• Remote Method Invocation (RMI)
  - Simple, transparent, object-oriented, RPC-like communication primitive
• Problem: RMI performance
  - JDK RMI on Myrinet is a factor of 40 slower than C RPC (1228 vs. 30 µs)
• Manta: high-performance Java system [PPoPP'99]
  - Native (static) compilation: source → executable
  - Fast RMI protocol between Manta nodes
  - JDK-style protocol to interoperate with JVMs
JDK versus Manta

                 JDK            time (µs)   Manta          time (µs)
Serialization    runtime        670         compiler       11
RMI protocol     heavy-weight   950         light-weight   10
Communication    TCP/IP         280         RPC/LFC        30

(200 MHz Pentium Pro, Myrinet, JDK 1.1.4 interpreter, 1 object as parameter)
Manta on wide-area DAS

              null-latency (µs)   bandwidth (MByte/s)
Myrinet LAN   39.9                38.6
ATM WAN       5600                0.55

• 2 orders of magnitude between intra-cluster (LAN) and inter-cluster (WAN) communication performance
• Application-level optimizations [JavaGrande'99]
  - Minimize WAN overhead
Example: SOR
• Red/black Successive Overrelaxation
  - Neighbor communication, using RMI
• Problem: nodes at cluster boundaries
  - Overlap wide-area communication with computation
  - RMI is synchronous → use multithreading

[Figure: CPUs 1-3 in cluster 1 and CPUs 4-6 in cluster 2 exchange boundary rows; intra-cluster RMI takes ~50 µs, inter-cluster RMI ~5600 µs]
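The overlap pattern on the boundary nodes can be sketched as follows. This is an illustrative simulation, not the Manta API: `sendBoundary` is a hypothetical stand-in for a synchronous RMI to the neighboring cluster, and the simulated WAN delay lets interior updates proceed while the transfer is "in flight".

```java
// Sketch of hiding WAN latency in red/black SOR: ship the boundary row in a
// background thread, update rows that need no remote data meanwhile, then
// finish the rows that depend on the exchange. Names are illustrative.
public class SorOverlap {
    static double[][] grid = new double[6][6];

    // Placeholder for a synchronous RMI carrying a boundary row to the next
    // cluster (~5600 us over the ATM WAN; simulated here with a short sleep).
    static void sendBoundary(double[] row) {
        try { Thread.sleep(5); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // One Gauss-Seidel-style update of row i from its four neighbors.
    static void updateRow(int i) {
        for (int j = 1; j < grid[i].length - 1; j++) {
            grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                               + grid[i][j - 1] + grid[i][j + 1]);
        }
    }

    // One iteration on this cluster's partition (rows 1..4).
    static void iterate() {
        Thread sender = new Thread(() -> sendBoundary(grid[4]));
        sender.start();                              // WAN transfer overlaps...
        for (int i = 2; i <= 3; i++) updateRow(i);   // ...with interior updates
        try { sender.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        updateRow(1);                                // boundary rows last
        updateRow(4);
    }

    public static void main(String[] args) {
        for (double[] r : grid) java.util.Arrays.fill(r, 1.0);
        iterate();                                   // constant grid is a fixed point
        System.out.println(grid[2][2]);
    }
}
```

Because RMI itself is synchronous, the extra thread is what buys the overlap; the computation/communication ratio of the interior region determines how much of the ~5600 µs round-trip can actually be hidden.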
Wide-area optimizations

Application   Communication structure   Wide-area optimization
SOR           nearest-neighbor          latency hiding
ASP           broadcast                 spanning-tree broadcast
TSP           central job queue         static distribution over clusters
IDA*          work stealing             steal from local cluster first
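The IDA* policy in the table above can be sketched as a two-pass victim search. This is an illustrative model, not the Albatross implementation: `Worker` and `trySteal` are hypothetical names, and "cluster" is just an integer tag.

```java
import java.util.ArrayDeque;
import java.util.List;

// Sketch of "steal from local cluster first": an idle worker scans peers in
// its own cluster (cheap LAN round-trip) before paying a WAN round-trip to
// steal from a remote cluster. Names are illustrative.
public class LocalFirstStealing {
    static class Worker {
        final int cluster;
        final ArrayDeque<Integer> jobs = new ArrayDeque<>();
        Worker(int cluster) { this.cluster = cluster; }
    }

    // Returns a victim with work, preferring same-cluster peers; null if none.
    static Worker trySteal(Worker thief, List<Worker> all) {
        for (Worker w : all)      // pass 1: local cluster (LAN latency)
            if (w != thief && w.cluster == thief.cluster && !w.jobs.isEmpty())
                return w;
        for (Worker w : all)      // pass 2: remote clusters (WAN latency)
            if (w != thief && !w.jobs.isEmpty())
                return w;
        return null;              // no work anywhere: terminate or retry
    }
}
```

The point of the ordering is that most steals stay inside a cluster, so the expensive inter-cluster path is only taken when an entire cluster has run dry.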
Performance of Java applications
• Wide-area DAS system: 4 clusters of 10 CPUs
• Sensitivity to wide-area latency and bandwidth: see HPCA'99

[Bar chart: speedups of SOR, ASP, TSP, and IDA* on 1 x 10 CPUs, 4 x 10 CPUs, and 1 x 40 CPUs; y-axis 0-45]
Discussion
• Optimized applications obtain good speedups
  - Reduce wide-area communication, or hide its latency
• Java RMI is easy to use, but some optimizations are awkward to express
  - Lack of asynchronous communication and broadcast
• The RMI model does not help exploit the hierarchical structure of wide-area systems
• Need a wide-area optimized programming environment
MagPIe: wide-area collective communication
• Collective communication among many processors
  - e.g., multicast, all-to-all, scatter, gather, reduction
• MagPIe: MPI's collective operations optimized for hierarchical wide-area systems [PPoPP'99]
• Transparent to the application programmer
Spanning-tree broadcast

[Figure: broadcast trees spanning clusters 1-4]

• MPICH (WAN-unaware)
  - Wide-area latencies are chained
  - Data is sent multiple times over the same WAN link
• MagPIe (WAN-optimized)
  - Each sender-receiver path contains at most 1 WAN link
  - No data item travels multiple times to the same cluster
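The WAN-optimized scheme can be sketched as a small simulation (illustrative names, not the MagPIe API): the root sends the message once per remote cluster, to a coordinator, which then fans it out over the fast LAN, so every root-to-receiver path crosses at most one WAN link.

```java
import java.util.List;

// Sketch of hierarchy-aware broadcast: count WAN messages to show that only
// one crosses to each remote cluster. Cluster 0 holds the root; each int[]
// models the receive buffers of one cluster's nodes. Names are illustrative.
public class WideAreaBroadcast {
    static int wanMessages = 0;

    static void broadcast(int value, List<int[]> clusters) {
        for (int c = 0; c < clusters.size(); c++) {
            if (c != 0) wanMessages++;        // one WAN message per remote cluster
            int[] cluster = clusters.get(c);
            for (int n = 0; n < cluster.length; n++)
                cluster[n] = value;           // coordinator fans out over the LAN
        }
    }

    public static void main(String[] args) {
        List<int[]> clusters = new java.util.ArrayList<>();
        for (int c = 0; c < 4; c++) clusters.add(new int[3]);  // 4 clusters of 3 nodes
        broadcast(42, clusters);
        System.out.println(wanMessages);      // 3: one per remote cluster
    }
}
```

A WAN-unaware tree, by contrast, may route through nodes in different clusters arbitrarily, chaining several 5600 µs hops on one path and resending the same data into a cluster it has already reached.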
MagPIe results
• MagPIe collective operations are wide-area optimal, except non-associative reduction
• Operations up to 10 times faster than MPICH
• Factor 2-3 speedup improvement over MPICH for some (unmodified) MPI applications
Conclusions
• Wide-area parallel programming is feasible for many applications
• Exploit hierarchical structure of wide-area systems to minimize WAN overhead
• Programming systems should take hierarchical structure of wide-area systems into account