Parallel Simulations on High-Performance Clusters

C.D. Pham, RESAM laboratory

Univ. Lyon 1

Outline

• Background
  – Discrete Event Simulation (DES)
  – Parallel DES and the synchronization problems
• The CSAM Tool
  – Architecture of the simulator kernel
  – The communication network model
• Results
  – On mono-processor cluster
  – On multi-processor cluster

Simulation

• To simulate is to reproduce the behavior of a physical system with a model
• In practice, computers are used to numerically simulate a logical model
• Simulations are used for performance evaluation and prediction of complex systems
  – fluid dynamics, chemical reactions (continuous)
  – communication network models: routing, congestion avoidance, mobile… (discrete)
• Simulation is more flexible than analytical methods

Discrete Event Simulation (DES)

• DES assumes that a system changes its state only at discrete points in simulation time

[Figure: time axis from 0 to 6t in time-steps of t, with events a1 to a4 and d1 to d3 and states S1 to S3]

DES concepts

• fundamental concepts:
  – system state (variables)
  – state transitions (events)
  – simulation time: totally ordered set of values representing time in the system being modeled
• the system state can only be modified upon reception of an event
• modeling can be
  – event-oriented
  – process-oriented

Life cycle of a DES

• a DES system can be viewed as a collection of simulated objects and a sequence of event computations
• each event computation contains a time stamp indicating when that event occurs in the physical system
• each event computation may:
  – modify state variables
  – schedule new events into the simulated future
• events are stored in a local event list
  – events are processed in time stamp order
  – usually, no more events = termination
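The life cycle above can be sketched as a short event loop. This is only an illustrative sketch, not the kernel of any particular simulator: the `Simulator` class, its `schedule` and `run` methods, and the arrival/departure handler are hypothetical names chosen for the example.

```python
import heapq
import itertools


class Simulator:
    """Minimal sequential discrete event simulator (illustrative sketch)."""

    def __init__(self):
        self.now = 0                    # current simulation time
        self.event_list = []            # min-heap ordered by (timestamp, sequence)
        self._seq = itertools.count()   # tie-breaker for equal timestamps
        self.trace = []                 # (timestamp, label) of processed events

    def schedule(self, timestamp, label, handler=None):
        # An event can only be scheduled at or after the current time.
        if timestamp < self.now:
            raise ValueError("cannot schedule an event in the past")
        heapq.heappush(self.event_list,
                       (timestamp, next(self._seq), label, handler))

    def run(self):
        # Process events in timestamp order; an empty list means termination.
        while self.event_list:
            timestamp, _, label, handler = heapq.heappop(self.event_list)
            self.now = timestamp
            self.trace.append((timestamp, label))
            if handler is not None:
                handler(self)           # may modify state and schedule new events
        return self.trace


def demo():
    sim = Simulator()

    # Hypothetical handler: each arrival schedules a departure 5 time units later.
    def arrival(s):
        s.schedule(s.now + 5, "departure")

    sim.schedule(12, "arrival", arrival)
    sim.schedule(5, "arrival", arrival)
    return sim.run()
```

Calling `demo()` processes the two arrivals and the departures they generate in timestamp order, whatever the order in which they were scheduled.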

A simple DES model

[Figure: nodes A and B joined by a link of delay 5, with the local event list shown along the time axis]

• link model delay = 5
• send processing time = 5
• receive processing time = 1
• packet arrivals: P1 at 5, P2 at 12, P3 at 22

Events in time stamp order:

<e1,5> A receives packet P1
<e2,10> A sends P1 to B
<e3,12> A receives packet P2
<e4,15> B receives P1 from A
<e5,16> B sends ACK(P1) to A
<e6,17> A sends P2 to B
<e7,21> A receives ACK(P1)
<e9,22> A receives packet P3
<e8,23> B receives P2 from A

Why does it work?

• events are processed in time stamp order

• an event at time t can only generate future events with a timestamp greater than or equal to t (no event in the past)

• generated events are put and sorted in the event list, according to their timestamp

– the event with the smallest timestamp is always processed first,

– causality constraints are implicitly maintained.

Why change? It’s so simple!

• models become larger and larger
• the simulation time is overwhelming or the simulation is simply intractable
• examples:
  – parallel programs with millions of lines of code,
  – mobile networks with millions of mobile hosts,
  – ATM networks with hundreds of complex switches,
  – multicast models with thousands of sources,
  – ever-growing Internet,
  – and much more...

Some figures to convince...

• ATM network models
  – Simulation at the cell-level,
  – 200 switches
  – 1000 traffic sources, 50 Mbits/s
  – 155 Mbits/s links,
  – 1 simulation event per cell arrival.

• simulation time increases as link speed increases,
• usually more than 1 event per cell arrival,
• how scalable is traditional simulation?

More than 26 billion events to simulate 1 second! 30 hours if 1 event is processed in 1 µs
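To give a feel for the orders of magnitude involved, the per-source and per-link cell rates can be worked out from the figures above. This is a back-of-the-envelope sketch that only assumes the standard 53-byte ATM cell. It does not reproduce the slide's total, which also depends on how many switches and links each cell crosses and on the number of events per cell, neither of which is detailed here.

```python
ATM_CELL_BITS = 53 * 8  # a standard ATM cell is 53 bytes long


def cells_per_second(bit_rate):
    """Number of cells per second carried at the given bit rate (bits/s)."""
    return bit_rate / ATM_CELL_BITS


source_rate = cells_per_second(50e6)        # one 50 Mbits/s traffic source
link_rate = cells_per_second(155e6)         # one fully loaded 155 Mbits/s link
aggregate_source_rate = 1000 * source_rate  # all 1000 sources together

print(f"cells/s per source: {source_rate:,.0f}")
print(f"cells/s per link:   {link_rate:,.0f}")
print(f"cells/s injected by all sources: {aggregate_source_rate:,.0f}")
```

With at least one event per cell arrival at every hop, the event count grows as a multiple of these rates.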

Parallel simulation - principles

• execution of a discrete event simulation on a parallel or distributed system with several physical processors.

• the simulation model is decomposed into several sub-models that can be executed in parallel
  – spatial partitioning,
  – temporal partitioning,

• radically different from simple simulation replications.

Parallel simulation - pros & cons

• pros
  – reduction of the simulation time,
  – increase of the model size,
• cons
  – causality constraints are difficult to maintain,
  – need for special mechanisms to synchronize the different processors,
  – increases both the model and the simulation kernel complexity.
• challenges
  – ease of use, transparency.

Parallel simulation - example

[Figure: a model partitioned into logical processes (LP) that run in parallel; labels: packet, event, t]

A simple PDES model

[Figure: A and B as two logical processes joined by a link of delay 5, each with its own local event list along the time axis t; annotated "causality error, violation"]

• link model delay = 5
• send processing time = 5
• receive processing time = 1
• packet arrivals: P1 at 5, P2 at 12, P3 at 22

Events of A:

<e1,5> A rec. packet P1
<e2,10> A sends P1 to B
<e3,12> A rec. packet P2
<e6,17> A sends P2 to B
<e7,21> A rec. ACK(P1)
<e9,22> A rec. packet P3

Events of B:

<e4,15> B rec. P1 from A
<e5,16> B sends ACK(P1)
<e8,23> B rec. P2 from A
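One way such a causality error can arise is sketched below. It assumes that A executes its own local events without waiting for B, so that the ACK with time stamp 21 reaches A after A has already executed the event at 22. The `LogicalProcess` class and this particular interleaving are assumptions made for illustration, not a trace taken from the slide.

```python
class LogicalProcess:
    """A logical process with its own simulation clock (illustrative sketch)."""

    def __init__(self, name):
        self.name = name
        self.clock = 0          # local simulation time
        self.violations = []    # (timestamp, label, clock) of late events

    def execute(self, timestamp, label):
        # Local causality constraint: executed timestamps never decrease.
        if timestamp < self.clock:
            self.violations.append((timestamp, label, self.clock))
            return False
        self.clock = timestamp
        return True


def demo():
    a = LogicalProcess("A")
    # A runs ahead on its local events without synchronizing with B.
    for ts, label in [(5, "rec. P1"), (10, "send P1"), (12, "rec. P2"),
                      (17, "send P2"), (22, "rec. P3")]:
        a.execute(ts, label)
    # The ACK from B carries time stamp 21 but arrives once A is at 22.
    ok = a.execute(21, "rec. ACK(P1)")
    return ok, a.clock, a.violations
```

Here `demo()` reports that the ACK could not be executed in order: A's clock is already at 22 when the event stamped 21 shows up.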

Synchronization problems

• fundamental concepts
  – each Logical Process (LP) can be at a different simulation time
  – local causality constraints: events in each LP must be executed in time stamp order
• synchronization algorithms
  – Conservative: avoids local causality violations by waiting until it’s safe
  – Optimistic: allows local causality violations but provides mechanisms to recover from them at runtime

CSAM (Pham, UCBL)

• CSAM: Conservative Simulator for ATM network Model

• Simulation at the cell-level
• Conservative and/or sequential
• C++ programming-style, predefined generic model of sources, switches, links…

• New models can be easily created by deriving from base classes

• Configuration file that describes the topology

CSAM - Kernel characteristics

• Exploits the lookahead of communication links: transparent for the user

• Virtual Input Channels
  – reduces overhead for event manipulation,
  – reduces overhead for null-message handling.
• Cyclic event execution
• Message aggregation
  – static aggregation size,
  – asymmetric aggregation size on CLUMPS,
  – sender-initiated,
  – receiver-initiated.
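The idea behind message aggregation with a static size can be sketched as follows. This is a generic batching buffer, not CSAM's implementation: the `Aggregator` class is hypothetical, the aggregation size is counted in messages (an assumed unit), and the asymmetric and receiver-initiated variants listed above are not modeled.

```python
class Aggregator:
    """Buffers outgoing messages and sends them in batches (sketch)."""

    def __init__(self, size, send):
        self.size = size    # static aggregation size, in messages (assumed unit)
        self.send = send    # callable that transmits one batch (a list)
        self.buffer = []

    def push(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.size:
            self.flush()    # the sender flushes as soon as the batch is full

    def flush(self):
        # Also usable at the end of a cycle to send a partial batch.
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []


def demo():
    batches = []
    agg = Aggregator(size=3, send=batches.append)
    for cell in range(7):
        agg.push(cell)
    agg.flush()             # send the remainder
    return batches
```

Seven messages with an aggregation size of 3 result in three transmissions instead of seven, at the cost of delaying the messages that wait in the buffer.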

CSAM - Life cycle

[Figure: two snapshots of a logical process with three input channels fed from the MPI buffers and a Future Event List. Each channel i keeps a time stamp last[i], and safetime = min(last[i]). Labels include time stamps t2 to t10 and t2+L. (a) end of cycle, send a null-message (b) get new messages, begin new cycle]
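The cycle described in the figure can be sketched as a generic conservative logical process. The code below is an illustration under stated assumptions, not CSAM's kernel: the `ConservativeLP` class and its methods are hypothetical, channels are assumed to deliver messages in nondecreasing time stamp order, and the null-message time stamp is taken as safetime plus the lookahead L.

```python
import heapq


class ConservativeLP:
    """Conservative logical process with per-channel clocks (sketch)."""

    def __init__(self, n_inputs, lookahead):
        self.last = [0] * n_inputs   # last[i]: latest time stamp seen on input i
        self.lookahead = lookahead   # L: minimum delay of the outgoing link
        self.fel = []                # future event list (min-heap)
        self.clock = 0               # local simulation time

    def receive(self, channel, timestamp, payload=None):
        # A null-message (payload None) only advances the channel clock.
        self.last[channel] = max(self.last[channel], timestamp)
        if payload is not None:
            heapq.heappush(self.fel, (timestamp, payload))

    def cycle(self):
        # No input can still deliver an event earlier than safetime.
        safetime = min(self.last)
        done = []
        while self.fel and self.fel[0][0] <= safetime:
            done.append(heapq.heappop(self.fel))
        self.clock = safetime
        # End of cycle: promise downstream that nothing will be sent
        # with a time stamp smaller than safetime + L.
        null_message = safetime + self.lookahead
        return done, null_message


def demo():
    lp = ConservativeLP(n_inputs=3, lookahead=5)
    lp.receive(0, 3, "c1")
    lp.receive(1, 4, "c2")
    lp.receive(2, 5, "c3")
    lp.receive(0, 7, "c4")
    first = lp.cycle()      # safetime = min(7, 4, 5) = 4
    lp.receive(1, 9)        # a null-message unblocks channel 1
    second = lp.cycle()     # safetime = min(7, 9, 5) = 5
    return first, second
```

In `demo()`, the event at time 5 stays in the future event list during the first cycle because channel 1 has only reached time 4. The null-message received on that channel is what allows the second cycle to make progress.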

Test case: 78-switch ATM network

Distance-Vector Routing with dynamic link cost functions

Connection setup, admission control protocols

Why is it difficult?

• Very small granularity: 1 message represents 1 cell transfer
  – high level of message synchronization
  – very small computation/communication ratio
• Load imbalance between links
  – large number of control messages
  – partitioning and load balancing are difficult

CSAM - Some results...

[Figure: routing protocol’s reconfiguration time]

CSAM - Some results...

Parallel Simulation on High Performance Clusters

• Myrinet-based cluster of 12 Pentium Pro at 200 MHz, 64 MBytes, Linux

• Myrinet-based cluster of 4 dual Pentium Pro at 450 MHz, 128 MBytes, Linux

• Myrinet board with LANai 4.1, 256KB

• BIP, BIP-SMP, MPI/BIP, MPI/BIP-SMP communication libraries

Speedup on a Myrinet cluster

Pentium Pro 200MHz

More than 53 million events to simulate 0.31 s

[Figure: speedup (axis 0 to 7) versus number of processors (2 to 10)]

Speedup with CLUMPS

Dual Pentium Pro 450MHz

[Figure: speedup (axis 0 to 2.5) for each aggregation setting: no aggr, 156, 256, 512, 1024, 256-156, 512-156, 1024-156; series: 2 ext., 2 int., 4 ext., 2x2 int.]

Increasing the model size (CLUMPS)

Dual Pentium Pro 450MHz, 4x2 int

[Figure: speedup (axis 0 to 5) for each aggregation setting: no aggr, 156, 256, 512, 1024, 256-156, 512-156, 1024-156; series: 78 switches, 156 switches]

Speedup on SGI/Cray Origin 2000

[Figure: speedup (axis 0 to 7) versus number of processors (4 to 10)]

Conclusions

• Parallel Simulation is very sensitive to latency

• High-performance clusters are a good alternative to traditional massively parallel computers

• CLUMPS architectures are very attractive as the cost of communication cards can be cut in half
