The BigSim Parallel Simulation System
Gengbin Zheng, Ryan Mokos
Charm++ Workshop 2010
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
4/28/2010
Outline
Overview
BigSim Emulator
BigSim Simulator
Summarizing the State of the Art
Petascale
  Very powerful parallel machines exist (Jaguar, Roadrunner, etc.)
  Application domains exist that need that kind of power
New generation of applications
  Use sophisticated algorithms
  Dynamic adaptive refinements
  Multi-scale, multi-physics
Parallel applications are more complex than sequential ones; their performance is hard to predict without actually running them
Challenge: Is it possible to simulate these applications on large scale using small clusters?
BigSim
Why BigSim, and why on Charm++?
  Targets large-scale simulation
  Object-based processor virtualization
  For a virtualized execution environment
  Efficient message-passing runtime by Charm++
  Support fine-grained decomposition
  Portability
BigSim Infrastructure
Emulator
  A virtualized execution environment
  Runs Charm++ and MPI applications
  Requires little or no change to MPI application source code
  Facilitates code development and debugging
Simulator
  Trace-driven approach
  Parallel Discrete Event Simulation
  Simple latency or full network contention modeling
  Predicts parallel performance at varying levels of resolution
Architecture of BigSim
[Figure: block diagram with the components Charm++/MPI applications, AMPI Runtime, Charm++ Runtime, BigSim Emulator, simulation trace logs, BigSim Simulator, POSE, and performance visualization (Projections).]
MPI Alltoall Timeline
BigSim Emulator
Emulates a full target machine on existing machines
  Actually runs the parallel program
  E.g., NAMD on 256K target processors using 8K cores of the Ranger cluster
Implemented on Charm++
  Libraries that link to the user application
Simple architecture abstraction
  Many multiprocessor (SMP) nodes connected via message passing
  Does not emulate at the instruction level
BigSim Emulator: functional view
[Figure: a physical processor (Converse scheduler, Converse queue) hosting target nodes, each shown with an incoming queue, a node-level queue, processor-level queues, communication processors, and worker processors.]
Processor Virtualization
[Figure: user view vs. system view]
Programmer: Decomposes the computation into objects
Runtime: Maps the computation onto the processors
Major Challenges
Running multiple copies of code on each processor
  Shared global variables
  Charm++ applications already handle this
  AMPI
    Global/static variables
    Runtime techniques, compiler tools
E.g., NAMD on 1024 target processors using 8 cores
Simulation time
Memory footprint
  Global read-only variables can be shared
  Out-of-core execution
NAMD Emulation
Only a 19x slowdown
Only a 7x increase in memory
Out-of-core Emulation
Motivation
  Applications with large memory footprints
  The VM system cannot handle them well
Use the hard drive
  Similar to checkpointing
Message-driven execution
  Peek at the message queue to see what executes next (prefetch)
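The idea on this slide (spill object state to disk, then peek at the message queue to decide what to bring back next) can be sketched in a few lines. This is only an illustration under stated assumptions: the class OutOfCoreStore, the functions run and demo, and the pickle-per-object file layout are hypothetical and are not part of the BigSim emulator. The prefetch is synchronous here; a real runtime would presumably overlap the disk read with the current computation.

```python
import os
import pickle
import tempfile
from collections import OrderedDict, deque


class OutOfCoreStore:
    """Hypothetical store: keeps at most `capacity` object states in memory."""

    def __init__(self, capacity, directory):
        self.capacity = capacity
        self.directory = directory
        self.resident = OrderedDict()  # obj_id -> state, least recently used first
        self.loads = 0                 # number of reads from disk

    def _path(self, obj_id):
        return os.path.join(self.directory, f"obj_{obj_id}.pkl")

    def put(self, obj_id, state):
        self.resident[obj_id] = state
        self.resident.move_to_end(obj_id)
        # Evict least recently used states to disk (similar to checkpointing).
        while len(self.resident) > self.capacity:
            victim, victim_state = self.resident.popitem(last=False)
            with open(self._path(victim), "wb") as f:
                pickle.dump(victim_state, f)

    def get(self, obj_id):
        if obj_id not in self.resident:
            with open(self._path(obj_id), "rb") as f:
                state = pickle.load(f)
            self.loads += 1
            self.put(obj_id, state)
        self.resident.move_to_end(obj_id)
        return self.resident[obj_id]


def run(store, messages):
    """Message-driven loop: each message is (target object id, increment)."""
    queue = deque(messages)
    while queue:
        obj_id, delta = queue.popleft()
        store.get(obj_id)["count"] += delta
        if queue:
            # Peek at the next message and bring its target object into memory.
            store.get(queue[0][0])


def demo():
    with tempfile.TemporaryDirectory() as directory:
        store = OutOfCoreStore(capacity=2, directory=directory)
        for obj_id in range(4):
            store.put(obj_id, {"count": 0})
        run(store, [(0, 1), (3, 2), (0, 1), (2, 5)])
        return [store.get(obj_id)["count"] for obj_id in range(4)]
```

Calling demo() runs four target objects with only two resident at a time and returns their final counters, so the effect of eviction and reload on correctness can be checked directly.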
What is in the Trace Logs?
Traces for 2 target processors
Each SEB has:
• startTime, endTime
• Incoming Message ID
• Outgoing messages
• Dependences
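As a reading aid, the four SEB fields listed above can be pictured as a small record type. This is a sketch only: the class and field names below (SEB, OutgoingMessage, depends_on, and so on) are hypothetical and do not describe the actual bgTrace binary layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OutgoingMessage:
    # Hypothetical fields; the slide only says "Outgoing messages".
    msg_id: int
    dest_proc: int
    send_time: float


@dataclass
class SEB:
    """One Sequential Execution Block as described on the slide."""
    start_time: float
    end_time: float
    incoming_msg_id: Optional[int]  # message that triggered this block, if any
    outgoing: List[OutgoingMessage] = field(default_factory=list)
    depends_on: List[int] = field(default_factory=list)  # indices of earlier SEBs

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time


# Example: a block triggered by message 7 that sends one message and
# depends on the SEB at index 0.
example = SEB(
    start_time=0.25,
    end_time=0.75,
    incoming_msg_id=7,
    outgoing=[OutgoingMessage(msg_id=8, dest_proc=1, send_time=0.5)],
    depends_on=[0],
)
```

The duration property is the quantity that the sequential time prediction techniques later in the talk would replace or rescale.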
Tools for reading bgTrace binary files:
1. charm/example/bigsim/tools/loadlog
   Convert to human-readable format
2. charm/example/bigsim/tools/log2proj
   Convert to trace projections log files
BigSim Simulator: BigNetSim
Post-mortem network simulator built on POSE (Parallel Object-oriented Simulation Environment), which is built on Charm++
Parallel Discrete Event Simulation
Pass emulator traces through different network models in BigNetSim to get final performance results
Details of using BigNetSim:
  http://charm.cs.uiuc.edu/workshops/charmWorkshop2009/slides/tut_BigSim09.ppt
  http://charm.cs.uiuc.edu/manuals/html/bignetsim/manual.html
POSE
Network layer constructs (NIC, Switch, Node, etc.) implemented as poser simulation objects
Network data constructs (message, packet, etc.) implemented as event methods on simulation objects
Posers
Each poser is a tiny simulation
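To make "each poser is a tiny simulation" concrete, here is a minimal sequential sketch of one simulation object with its own virtual time and an event method. It does not use the POSE API: the Channel class, its recv_packet method, and the ticks_per_byte parameter are hypothetical, and a plain heap stands in for the parallel event scheduling that POSE provides.

```python
import heapq


class Channel:
    """Hypothetical poser: one link that transmits packets one at a time."""

    def __init__(self, ticks_per_byte):
        self.ovt = 0            # object virtual time of this poser
        self.busy_ticks = 0     # total time spent transmitting
        self.packets_sent = 0
        self.ticks_per_byte = ticks_per_byte

    def recv_packet(self, timestamp, size_bytes):
        """Event method: a packet arrives at `timestamp` and is transmitted."""
        start = max(self.ovt, timestamp)          # wait if the link is busy
        service = size_bytes * self.ticks_per_byte
        self.ovt = start + service
        self.busy_ticks += service
        self.packets_sent += 1
        return self.ovt


def run(channel, events):
    """Deliver (timestamp, size_bytes) events to the poser in timestamp order."""
    heap = list(events)
    heapq.heapify(heap)
    while heap:
        timestamp, size_bytes = heapq.heappop(heap)
        channel.recv_packet(timestamp, size_bytes)
    return channel.ovt


link = Channel(ticks_per_byte=1)
final_ovt = run(link, [(50, 100), (0, 100), (500, 10)])
```

In this example the second packet arrives while the link is still busy and is delayed, which is the kind of contention effect a full network model accumulates across many such objects.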
Performance Prediction
Two components:
  Time to execute blocks of sequential, computational code
    SEBs = Sequential Execution Blocks
  Communication time based on a particular network topology
Sequential Time Prediction (Emulator)
Manual
  Advance processor time using BgElapse() calls in application code
Wallclock time
  Use multiplier (scale factor) to account for architecture differences
Performance counters
  Count instructions with hardware counters
  Use expected time of each instruction on target machine to derive execution time
Instruction-level simulation (e.g., Mambo)
  Record cycle-accurate execution times for functions
  Use interpolation tool to replace SEB times
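The wallclock and performance-counter options above both reduce to simple arithmetic, sketched below. The function names, the instruction categories ("flop", "load"), and all numeric values are assumptions chosen for illustration; they are not taken from BigSim or from any particular machine.

```python
def scale_wallclock(host_seconds, scale_factor):
    """Wallclock option: multiply measured host time by a scale factor."""
    return host_seconds * scale_factor


def time_from_counters(instruction_counts, seconds_per_instruction):
    """Counter option: sum (count x expected time on the target) per category."""
    return sum(
        count * seconds_per_instruction[kind]
        for kind, count in instruction_counts.items()
    )


# Hypothetical SEB measured at 2 ms on the host; target assumed 1.5x slower.
scaled = scale_wallclock(0.002, 1.5)

# Hypothetical counter readings and per-instruction costs on the target.
counted = time_from_counters(
    {"flop": 2_000_000, "load": 1_000_000},
    {"flop": 1e-9, "load": 2e-9},
)
```

Either result would stand in for the duration of one SEB in the emulator traces.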
Sequential Time Prediction (continued)
Model-based (recent work)
  Performed after emulation
  Determine application functions responsible for most of the computation time
  Run these functions on target machine
    Obtain run times based on function parameters to create model
  Feed emulation traces through offline modeling tool (like interpolation tool) to replace SEB times
    Generates corrected set of traces
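The model-based flow above can be illustrated with a very small stand-in: fit a run-time model from measurements on the target machine, then rewrite SEB durations in the traces. The linear model, the function name "stencil", the dictionary-based trace records, and the sample timings are all assumptions; the slides do not say what form the actual model or the offline tool takes.

```python
def fit_linear(samples):
    """Least-squares fit of run_time = intercept + slope * parameter."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def correct_traces(sebs, models):
    """Return a new trace list with modeled durations where a model exists."""
    corrected = []
    for seb in sebs:
        new_seb = dict(seb)
        if seb["function"] in models:
            intercept, slope = models[seb["function"]]
            new_seb["duration"] = intercept + slope * seb["param"]
        corrected.append(new_seb)
    return corrected


# Hypothetical timings of one hot function on the target: (problem size, seconds).
target_runs = [(100, 0.0012), (200, 0.0022), (400, 0.0042)]
models = {"stencil": fit_linear(target_runs)}

# Hypothetical emulation trace entries with host-measured durations.
trace = [
    {"function": "stencil", "param": 300, "duration": 0.0100},
    {"function": "setup", "param": 0, "duration": 0.0005},
]
corrected = correct_traces(trace, models)
```

Only the SEB whose function has a model is rewritten; the other entry keeps its emulated duration.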
Communication Time Prediction (Simulator)
Valid for a particular network topology
Generic: Simple Latency model
  Formula predicts time using latency and bandwidth parameters
Specific
  BlueGene, Blue Waters, and others
  Latency-only option: uses formula specific to network
  Full contention
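The slide does not give the Simple Latency formula itself. A common form for a latency-plus-bandwidth model is time = latency + size / bandwidth, and the sketch below assumes that form; the function name and the parameter values are hypothetical and may differ from what BigNetSim actually computes.

```python
def simple_latency_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed model: fixed latency plus size divided by bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Hypothetical network: 5 microseconds latency, 1 GB/s bandwidth.
small = simple_latency_time(8, 5e-6, 1e9)          # latency-dominated
large = simple_latency_time(1_000_000, 5e-6, 1e9)  # bandwidth-dominated
```

A model of this kind ignores contention entirely, which is the gap the full network models are meant to fill.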
Specific Model (Full Network)
[Figure: network model diagram with the components BGnode, two BGprocs, Net Interface, Switch, Transceiver, and six Channels.]
Generic Model (Simple Latency)
[Figure: network model diagram with the components BGnode, two BGprocs, Net Interface, Switch, Transceiver, and six Channels.]
What We Model
Processors
Nodes
NICs
Switches/hubs
Channels
Packet-level direct and indirect routing
Buffers with credit scheme
Virtual channels
Other BigNetSim Features
Skip points
  Set skip points in application code (e.g., after startup)
  Simulate only between skip points
Transceiver
  Traffic pattern generator: replaces nodes and processors
Windowing
  Set file window size to decrease memory footprint
  Can cut footprint in half or better, depending on trace structure
Checkpoint-to-disk (recent work)
  Saves simulator state based on time or GVT interval for restart if crash occurs
BigNetSim Tools
Located in BigNetSim/trunk/tools
Log Analyzer
  Provides info about a set of traces
  Number of events / simulated processor
  Number of messages sent
Log Transformation (recently completed)
  Produces new set of traces with remapped objects
  Useful for testing load-balancing scenarios
BigNetSim Output
BgPrintf() statements
  Added to application code
  “%f” converted to committed time during simulation
GVT = Global Virtual Time
  Each GVT tick = 1/factor seconds
  factor is defined in BigNetSim/trunk/Main/TCsim.h
Link utilization statistics
Projections traces
  Use -tproj command-line parameter
BigNetSim Output Example
Charm++: standalone mode (not using charmrun)
Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. Thread migration may not work!
Charm++> cpu topology info is being gathered!
Charm++> 1 unique compute nodes detected!
bgtrace: totalBGProcs=8 X=8 Y=1 Z=1 #Cth=1 #Wth=1 #Pes=1
Opts: netsim on: 0
Initializing POSE...
POSE initialization complete.
Using Inactivity Detection for termination.
netsim skip_on 0 0
Info> timing factor 1.000000e+08 ...
Info> invoking startup task from proc 0 ...
[0:RECV_RESUME] Start of major loop at 0.014741
[0:RECV_RESUME] End of major loop at 0.034914
Simulation inactive at time: 38129444
Final GVT = 38129444
Final link stats [Node 0, Channel 0, ### Link]: ovt: 38129444, utilization time: 29685846, utilization %: 77.855439, packets sent: 472210 gvt=38129444
Final link stats [Node 0, Channel 3, ### Link]: ovt: 38129444, utilization time: 631019, utilization %: 0.016549, packets sent: 4259 gvt=38129444
1 PE Simulation finished at 18.052671.
Program finished.
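The numbers in this output can be related to each other with the rule that each GVT tick equals 1/factor seconds. The sketch below assumes that the "timing factor" printed in the log is that same factor, and that a link's utilization is its utilization time divided by its ovt; both are readings of the output, not statements about the BigNetSim source. The helper names are hypothetical.

```python
def gvt_to_seconds(gvt_ticks, factor):
    """Convert GVT ticks to seconds, using 1 tick = 1/factor seconds."""
    return gvt_ticks / factor


def utilization_percent(utilization_time, ovt):
    """Assumed definition: share of virtual time the link spent busy."""
    return 100.0 * utilization_time / ovt


# Values copied from the example output above.
final_gvt = 38129444
timing_factor = 1.0e8

predicted_seconds = gvt_to_seconds(final_gvt, timing_factor)
channel0_percent = utilization_percent(29685846, final_gvt)
```

Under these assumptions the final GVT corresponds to roughly 0.38 seconds of predicted time, and the computed utilization for Channel 0 agrees with the 77.855439 reported in the log.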
Ring Projections Timeline
BigNetSim Performance
Examples of sequential simulator performance on Blue Print
4k-VP MILC
  Startup time: 0.7 hours
  Execution time: 5.6 hours
  Total run time: 6.3 hours
  Memory footprint: ~3.1 GB
256k-VP 3D Jacobi (10x10x10 grid, 3 iterations)
  Startup time: 0.5 hours
  Execution time: 1.5 hours
  Total run time: 2.0 hours
  Memory footprint: ~20 GB
Still tuning parallel simulator performance
Thank you!
Free download of Charm++ and BigSim: http://charm.cs.uiuc.edu
Send questions and comments to: [email protected]