An Evaluation of Current High-Performance Networks
Unified Parallel C at LBNL/UCB
An Evaluation of Current High-Performance Networks
Christian Bell, Dan Bonachea, Yannick Cote, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Michael Welcome, Kathy Yelick
Lawrence Berkeley National Lab
& U.C. Berkeley
http://upc.lbl.gov
Motivation
• Benchmark a variety of current high-speed networks
- Measure latency and software overhead, not just bandwidth
- Does one-sided communication provide advantages vs. two-sided MPI?
• Global Address Space (GAS) languages
- UPC, Titanium (Java), Co-Array Fortran
- Small message performance (8 bytes)
- Support for sparse/irregular/adaptive programs
- Programming model: incremental optimization
- Overlapping messages can hide the latency
Systems Evaluated

System          Network      Bus (bandwidth/sec)   1-sided hardware   APIs
Cray T3E        Custom       Custom (330 MB)       yes                SHMEM, E-registers
IBM SP          SP Switch 2  GX bus (2 GB)                            LAPI
HP AlphaServer  Quadrics     PCI 64/66 (532 MB)    yes                SHMEM
IBM Netfinity   Myrinet      PCI 32/66 (266 MB)    yes                GM
PC cluster      GigE         PCI 64/66 (532 MB)                       VIPL
Modified LogGP Model
• LogGP: no overlap
• Observed: overheads can overlap, so L can be negative
[Diagram: P0/P1 timelines showing o_send, transport latency L, and o_recv, without and with overlapping overheads]
• EEL: end-to-end latency (instead of transport latency L)
• g: minimum time between small message sends
• G: additional gap per byte for larger messages
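The model above can be sketched as a small cost function. This is an illustrative sketch, not code from the study; the parameter values in the example are invented, and only EEL, g, and G come from the slides' definitions:

```python
# Sketch of the modified LogGP cost model: EEL replaces o_send + L + o_recv
# as the small-message latency, and a flood of messages is paced by the
# fixed gap g plus a per-byte gap G.

def small_message_latency(eel_us: float) -> float:
    """EEL is measured directly, since overlapping send/receive
    overheads can make the inferred transport latency L negative."""
    return eel_us

def flood_time(n_msgs: int, msg_bytes: int, g_us: float, G_us_per_byte: float) -> float:
    """Time to issue n_msgs messages back-to-back: each message
    costs the fixed gap g plus G for every byte."""
    return n_msgs * (g_us + G_us_per_byte * msg_bytes)

# Made-up numbers, in microseconds:
print(flood_time(1000, 8, g_us=5.0, G_us_per_byte=0.01))  # 1000 * 5.08 us
```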
Microbenchmarks
[Diagram: timelines for the flood test and two CPU tests, showing o_send, gap, and CPU compute slots]
1) Ping-pong test: measures EEL (end-to-end latency)
2) Flood test: measures gap (g/G)
3) CPU overlap test: measures software overheads
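The first two tests reduce to simple timing loops. A minimal sketch, with an in-process stub standing in for the real transport (MPI, SHMEM, GM, etc.); the function names and stub are assumptions for illustration:

```python
import time

def ping_pong_eel(send, recv, iters=1000):
    """Ping-pong test: half the average round-trip time estimates EEL."""
    t0 = time.perf_counter()
    for _ in range(iters):
        send(b"x" * 8)   # 8-byte message
        recv()
    return (time.perf_counter() - t0) / iters / 2  # one-way, in seconds

def flood_gap(send, iters=1000):
    """Flood test: issue sends back-to-back; average spacing estimates g."""
    t0 = time.perf_counter()
    for _ in range(iters):
        send(b"x" * 8)
    return (time.perf_counter() - t0) / iters

# Stub "transport" for illustration only; a real benchmark would call
# the network API under test.
buf = []
send = lambda msg: buf.append(msg)
recv = lambda: buf.pop()

print(ping_pong_eel(send, recv), flood_gap(send))
```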
Latencies for 8-byte ‘puts’
[Chart: end-to-end latency (1-way), in usec (0-25), for T3E/Shm, T3E/E-Reg, T3E/MPI, IBM/LAPI, IBM/MPI, Quadrics/Shm, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, and GigE/MPI]
8-byte ‘put’ Latencies with Software Overheads
[Chart: latency in usec (0-25) per network/API, broken down into send overhead only (S), overhead overlap (V), receive overhead only (R), and other]
Gap varies with message clustering
[Chart: gap between messages (usec, 0-30) per network/API at queue depths q = 1, 2, 4, 8, 16]
Clustering messages can both use idle cycles and reduce the number of idle cycles that need to be filled.
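A toy model of why clustering helps (an illustrative simplification, not a formula from the slides): while one message's gap elapses, the CPU can spend the send overhead of the next q-1 queued messages, so idle time per message shrinks as the queue depth q grows:

```python
def idle_gap_per_message_us(g_us: float, o_send_us: float, q: int) -> float:
    """Illustrative model: with q messages clustered, send overheads of
    queued messages fill the gap left by the current one; idle time
    per message drops toward zero as q grows."""
    return max(0.0, g_us - q * o_send_us)

# Made-up numbers: g = 10 us, o_send = 2 us.
print(idle_gap_per_message_us(10.0, 2.0, 1))  # -> 8.0 us idle
print(idle_gap_per_message_us(10.0, 2.0, 4))  # -> 2.0 us idle
print(idle_gap_per_message_us(10.0, 2.0, 8))  # -> 0.0 us idle
```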
Potential for CPU overlap during clustered message sends
Hardware support for 1-way communication provides more opportunity for computational overlap.
[Chart: gap (g), send overhead, and receive overhead, in usec (0-12), per network/API]
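One way to read the chart: the computation a CPU can overlap per message is roughly the gap minus the software overheads it must still pay, which is why networks with hardware 1-sided support (small receive overhead) leave more room. The formula below is an illustrative simplification, not from the slides:

```python
def cpu_overlap_us(g_us: float, o_send_us: float, o_recv_us: float) -> float:
    """Approximate CPU time free for computation between back-to-back
    sends: the gap g minus the software overheads still charged to the
    CPU. Zero or negative means the CPU is busy the whole time."""
    return g_us - (o_send_us + o_recv_us)

# Made-up numbers: with hardware 1-sided puts, o_recv is ~0,
# leaving most of the gap free for computation.
print(cpu_overlap_us(6.0, 1.5, 0.0))  # -> 4.5 us free
print(cpu_overlap_us(6.0, 3.0, 3.0))  # -> 0.0 us free
```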
Fixed message cost (g) vs. per-byte cost (G)
[Chart: per-message cost (g) and per-KByte cost (G*1024), in usec (0-12), per network/API]
“Large” Messages
Factor of 6 between the minimum sizes needed for a “large” message (large = bandwidth dominates the fixed message cost).
[Chart: cross-over size between g and G, in bytes (0-4000), per network/API]
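The cross-over size can be computed directly from the model: a message counts as "large" once its per-byte cost G*n exceeds the fixed cost g. A sketch with made-up parameter values:

```python
def crossover_bytes(g_us: float, G_us_per_kb: float) -> float:
    """Message size (bytes) at which the per-byte cost equals the
    fixed per-message cost: solve (G_per_byte) * n = g for n."""
    G_us_per_byte = G_us_per_kb / 1024.0
    return g_us / G_us_per_byte

# Made-up numbers: g = 6 us, G = 12 us per KByte.
print(crossover_bytes(6.0, 12.0))  # -> 512.0 bytes
```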
Small message performance over time
[Chart: software send overhead for 8-byte messages over time]
Not improving much over time (even in absolute terms).
Conclusion
• Latency and software overhead of messages vary widely among today’s HPC networks
- Affects the ability to effectively mask communication latency, with a large effect on GAS language viability
- Especially software overhead: latency can be hidden, but overhead cannot
• These parameters have historically been overlooked in benchmarks and vendor evaluations
- Hopefully this will change
- Recent discussions with vendors are promising
- Incorporation into standard benchmarks would be nice…
http://upc.lbl.gov