Supercomputer Design Through Simulation
Cray User Group (CUG) Meeting
Lugano, Switzerland
Rolf Riesen (rolf@cs.sandia.gov)
Sandia National Laboratories
May 9, 2006
Talk Overview
1 Goal
2 Design
3 Usage
4 Validation
5 Experiments
6 Comparison to Other Work
7 Future Work
8 Summary
Goal
Simulate a supercomputer (e.g., Red Storm) using federated discrete event simulators
With enough fidelity to make future purchase and design decisions concerning things like:
  CPU choice
  Memory size and speed
  Network interface
  Topology
  Application behavior
  Research directions
  etc.
Created initial prototype with promising attributes
This talk describes the simulator
Results for collectives follow on Thursday
Section Outline
1 Goal
2 Design
  Node (Application)
  MPI Wrapper Library
  MPI Communicators
  Virtual Time
  Linking
3 Usage
4 Validation
5 Experiments
6 Comparison to Other Work
7 Future Work
8 Summary
Node (Application)
Hybrid simulator:
App runs regularly and uses MPI to exchange data
Each MPI send and receive generates an event to the network simulator
Sim generates receive events that are matched by the clients
Algorithm determines when and how to update virtual time on each node
Use MPI wrappers and profiling interface
Current network simulator uses simple model:
∆ = s/B + L
where ∆ is the network delay, s the message size, B the network bandwidth, and L the network latency
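As a sketch, the model above can be written as a plain C helper. The function and parameter names are illustrative, not taken from the simulator's source:

```c
/* Simple network model from the slide: delta = s/B + L.
 * size_bytes    - message size s in bytes
 * bandwidth_bps - network bandwidth B in bytes per second
 * latency_s     - network latency L in seconds
 * Returns the simulated network delay (delta) in seconds. */
static double net_delay(double size_bytes, double bandwidth_bps,
                        double latency_s)
{
    return size_bytes / bandwidth_bps + latency_s;
}
```

For example, a 1 MB message over a 1 GB/s link with 5 µs latency is modeled as about 1.005 ms.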
MPI Wrapper Library
int MPI_Send(void *data, int len, MPI_Datatype dt, int dest,
             int tag, MPI_Comm comm)
{
    int rc;
    double tx;    /* virtual time of the send */

    tx = get_vtime();

    /* Send the MPI message */
    rc = PMPI_Send(data, len, dt, dest, tag, comm);

    /* Send event to simulator */
    event_send(tx, len, dt, dest, tag);

    return rc;
}
MPI Wrapper Library
int MPI_Recv(void *data, int len, MPI_Datatype dt, int src,
             int tag, MPI_Comm comm, MPI_Status *stat)
{
    int rc;
    double t1, t3, tx, delta;

    t1 = get_vtime();

    /* Receive the MPI message */
    rc = PMPI_Recv(data, len, dt, src, tag, comm, stat);

    /* Wait for the matching event */
    event_wait(&tx, &delta, stat->MPI_TAG, stat->MPI_SOURCE);

    if (tx + delta > t1)
        t3 = tx + delta;
    else
        t3 = t1;

    set_vtime(t3);    /* Adjust virtual time */
    return rc;
}
MPI Wrapper Library
[Figure: event traffic for collectives]
MPI Communicators
Simulator framework sets up communicator for application nodes only
MPI_COMM_WORLD covers application and simulator
Wrappers swap MPI_COMM_WORLD with internal communicator when application calls MPI
Application never sees real MPI_COMM_WORLD
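The swap can be sketched in plain C. Here the MPI types are stubbed out so the snippet stands alone; a real wrapper would include <mpi.h> and pass the result to the PMPI_* call. All names are illustrative:

```c
/* Stand-ins for MPI types, for illustration only */
typedef int MPI_Comm;
#define MPI_COMM_WORLD 0              /* covers app + simulator */

static MPI_Comm sim_app_comm = 1;     /* app-nodes-only communicator,
                                         created by the framework */

/* Every wrapper translates the communicator the application passed
 * in before handing it to the PMPI_* layer. */
static MPI_Comm swap_comm(MPI_Comm comm)
{
    return (comm == MPI_COMM_WORLD) ? sim_app_comm : comm;
}
```

Because the translation happens inside the wrappers, the application can keep using MPI_COMM_WORLD unmodified.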
Virtual Time
if (tx + delta > t1)
    t3 = tx + delta;
else
    t3 = t1;
set_vtime(t3);
If the message was sent earlier than we started looking for it, we have to assume it was already here
Just "erase" the time we spent actually receiving it
If the message arrived after we started waiting for it, use the virtual send time + ∆ to set the local virtual clock
Linking
Currently need to rename main() to main_node()
Should not be necessary once we make use of MPI_Init()
In Fortran, program has to be changed to subroutine main_node
No other changes to the application are necessary!
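In C, the only source change currently looks like this (a sketch; the framework supplies its own main(), which sets up the federation and then calls main_node()):

```c
/* Before: int main(int argc, char *argv[])
 * After:  the entry point is renamed so the simulator framework
 *         can run its own setup first. */
int main_node(int argc, char *argv[])
{
    (void)argc;    /* unchanged application code would use these */
    (void)argv;
    /* ... original application code ... */
    return 0;
}
```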
Usage
Two steps:
  Create point-to-point model
  Create collective model
Measure two-node latency curve and write function to model it
Measure all-to-all performance and write model
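One way to turn the two-node measurements into a point-to-point model is an ordinary least-squares fit of t = s/B + L: the regression slope gives 1/B and the intercept gives L. This is only an illustration of the idea, not the tooling behind the slides:

```c
#include <stddef.h>

/* Fit t = s/B + L over n (size, time) samples.
 * The fitted slope is 1/B and the intercept is L. */
static void fit_p2p_model(const double *s, const double *t, size_t n,
                          double *bandwidth, double *latency)
{
    double sum_s = 0, sum_t = 0, sum_ss = 0, sum_st = 0;

    for (size_t i = 0; i < n; i++) {
        sum_s  += s[i];
        sum_t  += t[i];
        sum_ss += s[i] * s[i];
        sum_st += s[i] * t[i];
    }

    double slope = (n * sum_st - sum_s * sum_t) /
                   (n * sum_ss - sum_s * sum_s);
    double intercept = (sum_t - slope * sum_s) / n;

    *bandwidth = 1.0 / slope;   /* B in bytes/second */
    *latency   = intercept;     /* L in seconds */
}
```

Feeding in measured (message size, round-trip-derived time) pairs recovers the B and L used by the simulator's delay model.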
Usage
[Figure: Bandwidth (MB/s, 0–1200) vs. message size (32 kB–192 kB) on Red Storm, Apr. 14 — three runs (13:54:33, 14:03:01, 14:16:29) plotted against the model]
Usage
[Figure: All-to-all time (0 s–0.09 ms) vs. number of ints exchanged (16 k–128 k) for 4, 16, and 64 nodes, plotted against the model]
Validation
[Figure: NAS Class A run times in seconds, real vs. simulation, for BT, CG, EP, FT, IS, LU, MG, and SP at 16 and 64 nodes]
Section Outline
1 Goal
2 Design
3 Usage
4 Validation
5 Experiments
  Communication Patterns
  Varying Bandwidth and Latency
  Zero-Cost Collectives
  Intrusion-Free MPI Traces
6 Comparison to Other Work
7 Future Work
8 Summary
Communication Patterns
MG (class B) message density distribution
[Figure: number of messages (0–600) per source/destination node pair, nodes 0–60]
Communication Patterns
BT (class A) data density distribution
[Figure: megabytes (0–10) per source/destination node pair, nodes 0–14]
Varying Bandwidth and Latency
Simulator can change bandwidth and latency independently
This can be used to evaluate application performance under varying network characteristics
→ predict impact of a new network
Zero-Cost Collectives
Putting collectives into the NIC, building a specialized NIC, or otherwise optimizing them is interesting
How much application performance can be gained is not clear
Simulator can assign ∆ = 0 to collectives and leave point-to-point traffic alone
See talk on Thursday
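The experiment can be sketched as a switch in the delay model: collective events get ∆ = 0 while point-to-point messages keep the normal cost. The names here are illustrative, not from the simulator's source:

```c
typedef enum { MSG_PT2PT, MSG_COLLECTIVE } msg_class;

static int zero_cost_collectives = 1;   /* experiment toggle */

/* Delay assigned to an event by a (hypothetical) simulator core */
static double event_delay(msg_class cls, double size_bytes,
                          double bandwidth_bps, double latency_s)
{
    if (cls == MSG_COLLECTIVE && zero_cost_collectives)
        return 0.0;                      /* collectives become free */
    return size_bytes / bandwidth_bps + latency_s;   /* delta = s/B + L */
}
```

Comparing application run time with the toggle on and off bounds the benefit of a perfect collective offload.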
Intrusion-Free MPI Traces
So far we have gathered only limited amounts of data
Simulator can gather, and save to disk, large amounts of data
Without changing the application's virtual time
Comparison to Other Work
This approach seems to be new
Combines low-intrusion measurement research with discrete event simulation
Needs more validation, but seems to be very accurate
Opens up many different and simple ways of evaluating applications and research directions
Comparison to Other Work
No instrumentation code inserted into the app
Renaming main() (program in Fortran) is the only change to the app
No disturbance of the (virtual) runtime of the app
Independent of the amount of data collected
No extra memory needed on compute nodes to store trace data
Language independent (Fortran, Fortran 90 with MPI-2, and C)
Future Work
Continuing Work
Need to incorporate a more accurate network model
This will allow simulation of congestion, and evaluation of topology choices, node allocation, etc.
Move below MPI into the NIC for more fine-grained simulation
Incorporate non-network simulators; e.g., CPU and NIC simulators
Summary
Novel tool to collect MPI data
Language independent
Only relinking of the application is needed
Virtual runtime of the application is not changed
Lots of future possibilities