Architectural Interactions in High Performance Clusters
Transcript of Architectural Interactions in High Performance Clusters
RTPP98 - 1
Architectural Interactions in High Performance Clusters
RTPP 98
David E. Culler
Computer Science Division
University of California, Berkeley
RTPP98 - 2
Run-Time Framework
[Diagram: several machines connected by a network ("° ° °" indicating more); each machine runs the stack Parallel Program / Run Time / Machine Architecture]
RTPP98 - 3
Two Example RunTime Layers
• Split-C – thin global address space abstraction over Active Messages
– get, put, read, write
• MPI – thicker message passing abstraction over Active Messages
– send, receive
RTPP98 - 4
Split-C over Active Messages
• Read, Write, Get, Put built on small Active Message request / reply (RPC)
• Bulk-Transfer (store & get)
[Diagram: a request message invokes a handler on the remote node; the handler issues a reply, which invokes a handler back at the requester]
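The request/reply pattern above can be sketched in Python. This is an illustrative stand-in, not the GAM/Active Messages API: `Node`, the handler names, and the synchronous dispatch are all hypothetical, and real handlers run asynchronously on message arrival.

```python
# Sketch of a blocking Split-C-style read built from an Active Message
# request/reply pair (hypothetical names; real AM handlers run on arrival).

class Node:
    def __init__(self, memory):
        self.memory = memory      # addr -> value on this node
        self.pending = {}         # request id -> reply slot

    def read_request_handler(self, requester, req_id, addr):
        # Runs on the remote node: fetch the word and fire the reply.
        requester.read_reply_handler(req_id, self.memory[addr])

    def read_reply_handler(self, req_id, value):
        # Runs back on the requesting node: deposit the value.
        self.pending[req_id] = value

    def read(self, remote, addr):
        # Issue the request; a real runtime would poll until the reply lands.
        req_id = len(self.pending)
        self.pending[req_id] = None
        remote.read_request_handler(self, req_id, addr)
        return self.pending[req_id]

a = Node({})
b = Node({0x10: 42})
assert a.read(b, 0x10) == 42
```

A bulk store/get would follow the same shape, with the request carrying (or fetching) a fragment of data per message.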
RTPP98 - 5
Model Framework: LogP
[Diagram: P processor/memory modules ("M P ... ° ° °") attached to an interconnection network, annotated with o (overhead), L (latency), g (gap), and limited volume (L/g messages in flight to any one processor)]
• L: latency in sending a (small) message between modules
• o: overhead felt by the processor on sending or receiving a message
• g: gap between successive sends or receives (1/rate)
• P: number of processors
Round-trip time: 2 × (2o + L)
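As a hedged sketch (illustrative function names, not from the talk), the basic quantities the LogP parameters define can be written down directly:

```python
# Basic quantities implied by the LogP model (sketch).

def round_trip(L, o):
    # Request plus reply: each direction costs send overhead o,
    # network latency L, and receive overhead o.
    return 2 * (2 * o + L)

def max_msg_rate(g):
    # The gap g is the minimum spacing between sends, so rate = 1/g.
    return 1.0 / g

def in_flight_limit(L, g):
    # Limited volume: at most about L/g messages can be in transit
    # toward any one processor.
    return L / g

# e.g. with o = 3 us and L = 5 us, a small-message round trip takes 22 us
assert round_trip(L=5, o=3) == 22
```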
RTPP98 - 6
LogP Summary of Current Machines
[Bar chart: per-machine LogP parameters g, L, Or, Os, each on a 0–16 µs scale; peak bandwidths 38, 141, and 47 MB/s]
RTPP98 - 7
Methodology
• Apparatus:
– 35 Ultra 170s (64 MB, .5 MB L2, Solaris 2.5)
– M2F LANai + Myricom net in a fat-tree variant
– GAM + Split-C
• Modify the Active Message layer to inflate L, o, g, or G independently
• Execute a diverse suite of applications and observe effect
• Evaluate against natural performance models
RTPP98 - 8
Adjusting L, o, and g (and G) in situ
[Diagram: host workstation (AM lib) and LANai interface card on each side of the Myrinet]
• o: stall the Ultra on message write (send side) and on message read (receive side)
• g: delay the LANai after message injection (after each fragment for bulk transfers)
• L: defer marking a message as valid until Rx time + L
RTPP98 - 9
Calibration
[Plots: measured o, g, and L (µs) versus the desired L, O, and g settings (0–100 µs); each measured parameter tracks its desired value]
RTPP98 - 10
Application Characteristics
• Message Frequency
• Write-based vs. Read-based
• Short vs. Bulk Messages
• Synchronization
• Communication Balance
RTPP98 - 11
Applications used in the Study
Program  Description           Input                          Time(16)  Time(32)  Msg Interval (µs)
radix    Int Radix Sort        16 M 32-bit keys               17.0 s    9.8 s     7.6
em3d     EM Wave Prop.         8 K nodes, deg. 10, 100 steps  101 s     44.3 s    10.2
sample   Int Sample Sort       32 M 32-bit keys               24.3 s    15.9 s    14.0
ebarnes  Hierarchical N-body   1 M bodies                     81.2 s    46.6 s    60.1
p-ray    Ray Tracer            1 M pixel image, 16 k objs     25.8 s    17.8 s    108.1
murphi   Protocol Verifier     SCI, 2 procs, 1 line, 1 mem    71.3 s    37.9 s    219.4
connect  Connected Components  4 M nodes, 2D, 30 %            2.5 s     1.45 s    282.9
radb     Bulk Radix            16 M 32-bit keys               6.5 s     3.95 s    1260.0
RTPP98 - 12
Baseline Communication
Program  µs/msg  ms/Barrier  Avg Msgs/Proc  Max Msgs/Proc  Reads  Bulk Msgs
radix 7.6 895 2,228,364 2,229,106 0.0% 0.0%
em3d 10.2 324 9,953,384 9,974,265 0.0% 0.0%
sample 14.0 2499 1,966,199 2,319,362 0.0% 0.0%
ebarnes 60.1 233 1,351,194 1,400,601 9.6% 23.5%
p-ray 108.1 1465 216,869 353,640 47.7% 47.9%
murphi 219.4 36054 328,699 332,054 0.0% 51.2%
connect 282.9 101 8,912 9,221 34.1% 0.1%
radb 1260.0 80 5,520 5,927 0.0% 43.7%
RTPP98 - 13
Application Sensitivity to Communication Performance
RTPP98 - 14
Sensitivity to Overhead
[Line chart: slowdown (0–60) versus added overhead (µs, 0–110) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, NOWsort, and RadB]
RTPP98 - 15
Sensitivity to gap (1/msg rate)
[Line chart: slowdown (0–20) versus added gap (µs, 0–110) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, and NOWsort]
RTPP98 - 16
Sensitivity to Latency
[Line chart: slowdown (0–9) versus added latency (µs, 0–110) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, and NOWsort]
RTPP98 - 17
Sensitivity to bulk BW (1/G)
[Line chart: slowdown (0–2.5) versus bulk bandwidth (MB/s, 0–40) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, NOWsort, and RadB]
RTPP98 - 18
Modeling Effects of Overhead
• Tpred = Torig + 2 × max #msgs × o
– request / response
– the processor with the most messages limits overall time
• Why does this model under-predict?

Overhead (µs)  radix  sample  em3d  ebarnes  p-ray  connect  murphi  radb
0              1.00   1.00    1.00  1.00     1.00   1.00     1.00    1.00
1              1.03   1.00    1.14  1.02     1.03   1.02     1.00    1.01
2              1.02   1.00    1.26  1.00     1.04   1.04     1.02    1.00
4              1.02   1.00    1.50  0.90     1.07   1.07     1.04    1.00
5              1.01   0.99    1.59  0.88     1.06   1.12     1.03    1.00
10             1.01   0.99    2.01  4.83     1.21   1.30     1.09    0.99
20             1.02   0.98    2.53  N/A      1.31   1.32     1.16    0.99
50             1.03   0.98    3.22  N/A      1.50   1.60     1.30    1.02
100            1.04   0.98    3.61  N/A      1.58   1.85     1.46    1.07
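The overhead model can be sketched directly; the example numbers below are from the baseline communication table (slide 12), and the function name is illustrative:

```python
# Sketch of the slide's overhead model: Tpred = Torig + 2 * max_msgs * o.

def t_pred_overhead(t_orig, max_msgs, o_added):
    # Each message involves a request and a response (hence the factor 2),
    # and the processor with the most messages bounds completion time.
    return t_orig + 2 * max_msgs * o_added

# e.g. radix: 9.8 s baseline, ~2.23M messages on the busiest processor,
# 10 us of added overhead per message event -> roughly 54 s predicted.
t = t_pred_overhead(9.8, 2_229_106, 10e-6)
```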
RTPP98 - 19
Modeling Effects of gap
• Uniform communication model:
Tpred = Torig,                 if g < I, the average message interval
Tpred = Torig + m (g − I),     otherwise
• Bursty communication:
Tpred = Torig + m g
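The two gap models can be sketched together; the function name and the split into a `bursty` flag are illustrative, not from the talk:

```python
# Sketch of the slide's gap models. m = message count, g = imposed gap,
# interval = I, the application's average message interval.

def t_pred_gap(t_orig, m, g, interval, bursty=False):
    if bursty:
        # Bursty traffic pays the full gap on every message.
        return t_orig + m * g
    # Uniform traffic only slows down once the gap exceeds the
    # natural spacing between messages.
    if g < interval:
        return t_orig
    return t_orig + m * (g - interval)

# With 1M messages sent every 10 us, a 5 us gap is hidden entirely;
# a 20 us gap adds about (20 - 10) us per message.
uniform_hidden = t_pred_gap(10.0, 1_000_000, 5e-6, 10e-6)
uniform_paid = t_pred_gap(10.0, 1_000_000, 20e-6, 10e-6)
```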
RTPP98 - 20
Extrapolating to Low Overhead
[Line chart: slowdown (0–5) versus added overhead (µs, 0–15) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, NOWsort, and RadB]
RTPP98 - 21
MPI over AM: ping-pong bandwidth
[Chart: ping-pong bandwidth (MB/s, 0–70) versus message size (10 bytes to 1,000,000 bytes) for SGI Challenge, Meiko CS2, NOW, IBM SP2, and Cray T3D]
RTPP98 - 22
MPI over AM: start-up
[Bar chart: MPI message start-up cost (microseconds, 0–90) for SGI Challenge, Meiko, NOW, IBM SP2, and Cray T3D]
RTPP98 - 23
NPB2 Speedup: NOW vs SP2
RTPP98 - 24
NOW vs. Origin
RTPP98 - 25
Single Processor Performance
            Origin  Ultra 170  SP2
BT          2488    4178       2574
SP          1652    2897       1817
LU          1373    2470       1871
MG          53      90         53
IS          37      41         29
FT          133     131        139
SPECfp95    19      9.4        9.7
SPECint95   9.5     5.6        3.2
Triad MB/s  317     254        655
RTPP98 - 26
Understanding Speedup
SpeedUp(p) = T1 / MAXp (Tcompute + Tcomm + Twait)
Tcompute = (work/p + extra) × efficiency
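The decomposition above can be sketched as follows; function names are illustrative, and `efficiency` is treated here as a time multiplier (> 1 means each unit of work runs slower in parallel, e.g. from cache effects):

```python
# Sketch of the speedup decomposition: the slowest processor's
# (compute + communicate + wait) time determines parallel time.

def speedup(t1, per_proc_times):
    # per_proc_times: one (t_compute, t_comm, t_wait) tuple per processor.
    return t1 / max(sum(t) for t in per_proc_times)

def t_compute(work, p, extra, efficiency):
    # extra = redundant work introduced by parallelization;
    # efficiency scales per-unit cost relative to the serial run.
    return (work / p + extra) * efficiency

# Perfect balance: 100 s of serial work, two processors each busy 25 s.
s = speedup(100.0, [(24, 1, 0), (20, 2, 3)])  # both sum to 25 -> 4.0
```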
RTPP98 - 27
Performance Tools for Clusters
• Independent data collection on every node
– Timing
– Sampling
– Tracing
• Little perturbation of global effects
RTPP98 - 28
Where the Time Goes: LU-a
[Stacked bar chart: total time (0–3000) broken into Compute, Send, Receive, and Wait at 4, 8, 16, and 32 processors]
RTPP98 - 29
Where the Time Goes: BT-a
[Stacked bar chart: total time (3400–4400) broken into Compute, Send, Receive, and Wait at 4, 9, 16, 25, and 36 processors]
RTPP98 - 30
Constant Problem Size Scaling
[Chart: constant-problem-size scaling across 4, 8, 16, 32, 64, 128, and 256 processors]
RTPP98 - 31
Communication Scaling
[Two charts versus processors (0–40) for FT, IS, LU, MG, SP, and BT: normalized messages per processor (10^2 to 10^7, log scale) and average message size (0–8)]
RTPP98 - 32
Communication Scaling: Volume
[Two charts versus processors (0–40) for FT, IS, LU, MG, SP, and BT: bytes per processor (0–1.2) and total bytes (0 to 9×10^9)]
RTPP98 - 33
Extra Work
RTPP98 - 34
Cache Working Sets: LU
8-fold reduction in miss rate from 4 to 8 processors
RTPP98 - 35
Cache Working Sets: BT
RTPP98 - 36
Cycles per Instruction
RTPP98 - 37
MPI Internal Protocol
[Diagram: sender/receiver message exchange in the MPI internal protocol]
RTPP98 - 38
Revised Protocol
[Diagram: sender/receiver message exchange in the revised protocol]
RTPP98 - 39
Sensitivity to Overhead
[Line chart: slowdown (0.98–1.14) versus added overhead (0–400) for P = 32, 16, 8, 4, and 2]
RTPP98 - 40
Conclusions
• Run Time systems for Parallel Programs must deal with a host of architectural interactions
– communication
– computation
– memory system
• Build a performance model of your RTPP
– the only way to recognize anomalies
• Build tools along with the RT to reflect characteristics and sensitivity back to PP
• Much can lurk beneath a perfect speedup curve