The BigSim Parallel Simulation System

Gengbin Zheng, Ryan Mokos

Charm++ Workshop 2010
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign

4/28/2010

Outline

• Overview
• BigSim Emulator
• BigSim Simulator

Summarizing the State of the Art

• Petascale
  • Very powerful parallel machines exist (Jaguar, Roadrunner, etc.)
  • Application domains exist that need that kind of power
• New generation of applications
  • Use sophisticated algorithms
  • Dynamic adaptive refinements
  • Multi-scale, multi-physics
• Parallel applications are more complex than sequential ones and hard to predict without actually running them
• Challenge: Is it possible to simulate these applications at large scale using small clusters?

BigSim

Why BigSim, and why on Charm++?
• Targets large scale simulation
• Object-based processor virtualization
  • For a virtualized execution environment
• Efficient message passing runtime by Charm++
• Support fine-grained decomposition
• Portability

BigSim Infrastructure

• Emulator
  • A virtualized execution environment
  • Charm++ and MPI applications
  • No or small changes to MPI application source code
  • Facilitates code development and debugging
• Simulator
  • Trace-driven approach
  • Parallel Discrete Event Simulation
  • Simple latency, full network contention modeling
  • Predict parallel performance at varying levels of resolution

Architecture of BigSim

[Diagram components: Charm++/MPI applications, AMPI Runtime, BigSim Emulator, Charm++ Runtime, simulation trace logs, BigSim Simulator, POSE, performance visualization (Projections)]

MPI Alltoall Timeline

BigSim Emulator

• Emulate full machine on existing machines
  • Actually run a parallel program
  • E.g. NAMD on 256K target processors using 8K cores of Ranger cluster
• Implemented on Charm++
  • Libraries that link to user application
• Simple architecture abstraction
  • Many multiprocessor (SMP) nodes connected via message passing
  • Do not emulate at instruction level

BigSim Emulator: functional view

[Diagram components: a Physical Processor with a Converse scheduler and Converse Queue; two Target Nodes, each with an incoming queue, a node-level queue, processor-level queues, communication processors, and worker processors]

Processor Virtualization

• Programmer: Decomposes the computation into objects
• Runtime: Maps the computation onto the processors

[Figure: User View vs. System View]

Major Challenges

• Running multiple copies of code on each processor
  • Shared global variables
  • Charm++ applications already handle this
  • AMPI
    • Global/static variables
    • Runtime techniques, compiler tools
• E.g. NAMD on 1024 target processors using 8 cores
• Simulation time
• Memory footprint
  • Global read-only variables can be shared
  • Out-of-core execution

NAMD Emulation

• Only a 19x slowdown
• Only a 7x increase in memory

Out-of-core Emulation

• Motivation
  • Applications with large memory footprint
  • VM system cannot handle this well
• Use hard drive
  • Similar to checkpointing
• Message-driven execution
  • Peek at the message queue => what executes next? (prefetch)
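
The prefetch idea can be sketched roughly as follows. This is a standalone Python toy model, not the BigSim implementation: the class name OutOfCoreStore, the dict standing in for the hard drive, and the fixed memory capacity are all assumptions made for illustration. In a real system the prefetch would overlap with execution; here it simply happens before the current message runs.

```python
from collections import deque


class OutOfCoreStore:
    """Toy model: object state lives on 'disk'; only `capacity` objects fit in memory."""

    def __init__(self, disk, capacity=2):
        assert capacity >= 2, "need room for the running object plus one prefetched object"
        self.disk = dict(disk)   # stand-in for files on the hard drive (assumption)
        self.memory = {}         # objects currently resident
        self.capacity = capacity
        self.loads = []          # order in which objects were brought into memory

    def load(self, obj_id, keep=None):
        """Bring obj_id into memory, evicting a resident object other than `keep` if full."""
        if obj_id in self.memory:
            return
        if len(self.memory) >= self.capacity:
            victim = next(k for k in self.memory if k != keep)
            # Write the victim back to disk, much like taking a checkpoint.
            self.disk[victim] = self.memory.pop(victim)
        self.memory[obj_id] = self.disk[obj_id]
        self.loads.append(obj_id)


def run(store, queue):
    """Deliver (target, payload) messages in order, prefetching the next target."""
    executed = []
    while queue:
        target, payload = queue.popleft()
        store.load(target)                        # no-op if it was prefetched
        if queue:
            # Peek at the head of the queue to learn what executes next.
            store.load(queue[0][0], keep=target)
        store.memory[target] += payload           # the "work" done by this message
        executed.append(target)
    return executed
```

Because execution is message driven, the queue itself says which object is needed next, so no access-pattern guessing is required.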

What is in the Trace Logs?

[Figure: traces for 2 target processors]

Each SEB (Sequential Execution Block) has:
• startTime, endTime
• Incoming message ID
• Outgoing messages
• Dependences

Tools for reading bgTrace binary files:
1. charm/example/bigsim/tools/loadlog
   Convert to human-readable format
2. charm/example/bigsim/tools/log2proj
   Convert to trace projections log files
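
As a rough illustration, one trace entry could be represented as a record with the four fields listed above. The Python names, the field types, and the duration and ready helpers are assumptions; the actual bgTrace binary layout is not described in these slides.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SEB:
    """One Sequential Execution Block as it might appear in a trace (hypothetical layout)."""
    start_time: float
    end_time: float
    incoming_msg_id: int
    outgoing_msgs: List[int] = field(default_factory=list)
    dependences: List[int] = field(default_factory=list)  # indices of SEBs that must finish first

    def duration(self) -> float:
        return self.end_time - self.start_time


def ready(seb: SEB, finished: Set[int]) -> bool:
    """An SEB can be replayed once every SEB it depends on has finished."""
    return all(dep in finished for dep in seb.dependences)
```

A simulator replaying such records would use the dependences to decide when a block may start and the outgoing messages to decide what to inject into the network model.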

BigSim Simulator: BigNetSim

• Post-mortem network simulator built on POSE (Parallel Object-oriented Simulation Environment), which is built on Charm++
• Parallel Discrete Event Simulation
• Pass emulator traces through different network models in BigNetSim to get final performance results
• Details of using BigNetSim:
  • http://charm.cs.uiuc.edu/workshops/charmWorkshop2009/slides/tut_BigSim09.ppt
  • http://charm.cs.uiuc.edu/manuals/html/bignetsim/manual.html

POSE

• Network layer constructs (NIC, Switch, Node, etc.) implemented as poser simulation objects
• Network data constructs (message, packet, etc.) implemented as event methods on simulation objects

Posers

Each poser is a tiny simulation
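
To make the "objects plus event methods" idea concrete, here is a minimal sequential discrete-event sketch in Python. It does not use the POSE API, and it omits the parallel execution that POSE provides; the class names (Poser, Switch, NIC, Simulator), the method names, and the fixed link delay are all hypothetical.

```python
import heapq


class Poser:
    """A simulation object that tracks its own local virtual time."""

    def __init__(self, name):
        self.name = name
        self.ovt = 0      # object virtual time, advanced as events are handled
        self.log = []


class Switch(Poser):
    def recv_packet(self, sim, t, packet):
        """Event method: a packet arrives at the switch at time t."""
        self.ovt = t
        self.log.append((t, packet))


class NIC(Poser):
    def __init__(self, name, switch, link_delay):
        super().__init__(name)
        self.switch = switch
        self.link_delay = link_delay

    def send_packet(self, sim, t, packet):
        """Event method: forward the packet; it reaches the switch after link_delay."""
        self.ovt = t
        sim.schedule(t + self.link_delay, self.switch.recv_packet, packet)


class Simulator:
    """Sequential event loop: always run the earliest pending event next."""

    def __init__(self):
        self.events = []
        self.seq = 0      # tie-breaker so equal timestamps keep insertion order

    def schedule(self, t, method, data):
        heapq.heappush(self.events, (t, self.seq, method, data))
        self.seq += 1

    def run(self):
        while self.events:
            t, _, method, data = heapq.heappop(self.events)
            method(self, t, data)
```

Each object only reacts to events addressed to it and advances its own clock, which is what makes it reasonable to think of each poser as a tiny simulation.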

Performance Prediction

Two components:
• Time to execute blocks of sequential, computational code
  • SEBs = Sequential Execution Blocks
• Communication time based on a particular network topology

Sequential Time Prediction (Emulator)

• Manual
  • Advance processor time using BgElapse() calls in application code
• Wallclock time
  • Use multiplier (scale factor) to account for architecture differences
• Performance counters
  • Count instructions with hardware counters
  • Use expected time of each instruction on target machine to derive execution time
• Instruction-level simulation (e.g., Mambo)
  • Record cycle-accurate execution times for functions
  • Use interpolation tool to replace SEB times
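
The wallclock and performance-counter options amount to simple arithmetic, sketched below in Python. The function names, the instruction categories, and the per-instruction costs are made up for illustration; they are not part of BigSim, and real instruction timings depend on the target machine.

```python
def scaled_wallclock(host_seconds, scale_factor):
    """Wallclock method: measured host time times a host-to-target scale factor."""
    return host_seconds * scale_factor


def counter_based(instruction_counts, seconds_per_instruction):
    """Counter method: sum of (count x expected target-machine time) per instruction kind."""
    return sum(
        count * seconds_per_instruction[kind]
        for kind, count in instruction_counts.items()
    )


# Hypothetical numbers: the target is assumed twice as fast as the host.
target_time = scaled_wallclock(2.0, 0.5)

# Hypothetical counts and costs for two instruction kinds.
estimate = counter_based({"fp": 1000, "mem": 500}, {"fp": 1e-9, "mem": 4e-9})
```

The scale-factor approach needs only a timer, while the counter approach trades extra instrumentation for less dependence on the host machine's speed.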

Sequential Time Prediction (continued)

• Model-based (recent work)
  • Performed after emulation
  • Determine application functions responsible for most of the computation time
  • Run these functions on target machine
    • Obtain run times based on function parameters to create model
  • Feed emulation traces through offline modeling tool (like interpolation tool) to replace SEB times
    • Generates corrected set of traces
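
One way to picture the model-based flow is a fit followed by a trace rewrite, sketched below in Python. The slides do not say what kind of model is used, so the linear model in a single parameter, the dictionary-based trace entries, and the names fit_linear and correct_trace are all assumptions.

```python
def fit_linear(samples):
    """Least-squares fit of time = a + b * n from (n, time) pairs measured on the target."""
    k = len(samples)
    mean_n = sum(n for n, _ in samples) / k
    mean_t = sum(t for _, t in samples) / k
    var = sum((n - mean_n) ** 2 for n, _ in samples)
    b = sum((n - mean_n) * (t - mean_t) for n, t in samples) / var
    a = mean_t - b * mean_n
    return a, b


def correct_trace(trace, models):
    """Return a new trace whose durations come from the model where one exists."""
    corrected = []
    for entry in trace:
        entry = dict(entry)                 # copy, so the original trace is untouched
        model = models.get(entry["function"])
        if model is not None:
            a, b = model
            entry["duration"] = a + b * entry["param"]
        corrected.append(entry)
    return corrected
```

Entries for functions without a model keep their emulated durations, so only the dominant functions need to be measured on the target machine.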

Communication Time Prediction (Simulator)

• Valid for a particular network topology
• Generic: Simple Latency model
  • Formula predicts time using latency and bandwidth parameters
• Specific
  • BlueGene, Blue Waters, and others
  • Latency-only option – uses formula specific to network
  • Full contention
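
The slides do not give the Simple Latency formula itself. A common form for a latency-and-bandwidth model, shown here purely as an assumption, is a fixed latency plus message size divided by bandwidth. The function and parameter names are hypothetical.

```python
def simple_latency_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed model: time = latency + size / bandwidth (no contention)."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Hypothetical network: 1 microsecond latency, 1 GB/s bandwidth, 1 MB message.
t = simple_latency_time(1_000_000, 1e-6, 1e9)
```

A model of this shape ignores contention entirely, which is why the full-contention option exists for the specific network models.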

Specific Model (Full Network)

[Diagram components: BGnode, two BGprocs, Net Interface, Switch, Transceiver, six Channels]

Generic Model (Simple Latency)

[Diagram components: BGnode, two BGprocs, Net Interface, Switch, Transceiver, six Channels]

What We Model

• Processors
• Nodes
• NICs
• Switches/hubs
• Channels
• Packet-level direct and indirect routing
• Buffers with credit scheme
• Virtual channels
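
For readers unfamiliar with credit schemes, the sketch below shows the general idea of credit-based flow control in Python. It is a generic illustration, not BigNetSim's implementation; the class name CreditLink and its methods are hypothetical.

```python
from collections import deque


class CreditLink:
    """The sender holds one credit per free buffer slot at the receiver."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots
        self.receiver_buffer = deque()
        self.waiting = deque()   # packets stalled at the sender for lack of credits

    def send(self, packet):
        """Send if a credit is available; otherwise stall the packet at the sender."""
        if self.credits > 0:
            self.credits -= 1
            self.receiver_buffer.append(packet)
            return True
        self.waiting.append(packet)
        return False

    def receiver_forward(self):
        """Receiver drains one packet, which returns a credit to the sender."""
        packet = self.receiver_buffer.popleft()
        self.credits += 1
        if self.waiting:
            # A stalled packet immediately uses the returned credit.
            self.send(self.waiting.popleft())
        return packet
```

Because a packet is only sent when a buffer slot is guaranteed, buffers never overflow, and stalls at the sender are one of the ways contention shows up in simulated time.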

Other BigNetSim Features

• Skip points
  • Set skip points in application code (e.g., after startup)
  • Simulate only between skip points
• Transceiver
  • Traffic pattern generator – replaces nodes and processors
• Windowing
  • Set file window size to decrease memory footprint
  • Can cut footprint in half or better, depending on trace structure
• Checkpoint-to-disk (recent work)
  • Saves simulator state based on time or GVT interval for restart if crash occurs

BigNetSim Tools

• Located in BigNetSim/trunk/tools
• Log Analyzer
  • Provides info about a set of traces
  • Number of events / simulated processor
  • Number of messages sent
• Log Transformation (recently completed)
  • Produces new set of traces with remapped objects
  • Useful for testing load-balancing scenarios

BigNetSim Output

• BgPrintf() statements
  • Added to application code
  • “%f” converted to committed time during simulation
• GVT = Global Virtual Time
  • Each GVT tick = 1/factor seconds
  • factor is defined in BigNetSim/trunk/Main/TCsim.h
• Link utilization statistics
• Projections traces
  • Use -tproj command-line parameter
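
The tick-to-seconds conversion is a single division, shown below in Python. The helper name gvt_to_seconds is hypothetical. The sample values are the timing factor (1.000000e+08) and final GVT (38129444) that appear in the output example in this deck.

```python
def gvt_to_seconds(gvt_ticks, factor):
    """Each GVT tick is 1/factor seconds, so seconds = ticks / factor."""
    return gvt_ticks / factor


# Values taken from the sample output: timing factor 1e8, Final GVT = 38129444.
predicted_seconds = gvt_to_seconds(38129444, 1e8)
```

With those values the predicted run time comes to roughly 0.38 seconds of target-machine time.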

BigNetSim Output Example

Charm++: standalone mode (not using charmrun)
Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. Thread migration may not work!
Charm++> cpu topology info is being gathered!
Charm++> 1 unique compute nodes detected!
bgtrace: totalBGProcs=8 X=8 Y=1 Z=1 #Cth=1 #Wth=1 #Pes=1
Opts: netsim on: 0
Initializing POSE...
POSE initialization complete.
Using Inactivity Detection for termination.
netsim skip_on 0 0
Info> timing factor 1.000000e+08 ...
Info> invoking startup task from proc 0 ...
[0:RECV_RESUME] Start of major loop at 0.014741
[0:RECV_RESUME] End of major loop at 0.034914
Simulation inactive at time: 38129444
Final GVT = 38129444
Final link stats [Node 0, Channel 0, ### Link]: ovt: 38129444, utilization time: 29685846, utilization %: 77.855439, packets sent: 472210 gvt=38129444
Final link stats [Node 0, Channel 3, ### Link]: ovt: 38129444, utilization time: 631019, utilization %: 0.016549, packets sent: 4259 gvt=38129444
1 PE Simulation finished at 18.052671.
Program finished.

Ring Projections Timeline

BigNetSim Performance

• Examples of sequential simulator performance on Blue Print
  • 4k-VP MILC
    • Startup time: 0.7 hours
    • Execution time: 5.6 hours
    • Total run time: 6.3 hours
    • Memory footprint: ~3.1 GB
  • 256k-VP 3D Jacobi (10x10x10 grid, 3 iterations)
    • Startup time: 0.5 hours
    • Execution time: 1.5 hours
    • Total run time: 2.0 hours
    • Memory footprint: ~20 GB
• Still tuning parallel simulator performance

Thank you!

Free download of Charm++ and BigSim: http://charm.cs.uiuc.edu

Send questions and comments to: [email protected]