RESAM Laboratory Univ. Lyon 1, France
RESAM Laboratory, Univ. Lyon 1, France
led by Prof. B. Tourancheau
Laurent Lefèvre, CongDuc Pham, Pascale Primet
PhD students: Patrick Geoffray, Roland Westrelin
Research interests
• High-performance communication systems
– Myrinet-based clusters, cluster management
– BIP, MPI-BIP, BIP-SMP
• Distributed Shared Memory systems
– DOSMOS system
• Network support for Multimedia and Cooperative applications
– QoS, multicast
– CoTool environment
• Parallel simulation, synchronization algorithms, communication network models
– CSAM tools
Parallel and Distributed Simulation of Communication Networks
(towards cluster-based solutions)
C.D. Pham, RESAM laboratory
Univ. Lyon 1, France
[email protected]
Outline
• Introduction
– Discrete Event Simulation (DES)
– Parallel DES and the synchronization problems
• Conservative protocols
– Architecture of a conservative LP
– The Chandy-Misra-Bryant protocol
– The lookahead ability
• Optimistic protocols
– Architecture of an optimistic LP
– Time Warp
Outline, more...
• CSAM, a tool for ATM network models
– kernel characteristics
– results
• Cluster-based solutions
– Myrinet, BIP, BIP-SMP, MPI/BIP, MPI/BIP-SMP
– Fast Ethernet, Gamma?
– GigaEthernet?
Introduction
Discrete Event Simulation (DES)
Parallel DES and synchronization problems
Discrete Event Simulation (DES)
• assumption that a system changes its state at discrete points in simulation time
[figure: a state trajectory where arrivals a1..a4 and departures d1..d3 change the state (S1, S2, S3) at discrete points in simulation time (0, t, 2t, 3t, ...), NOT continuously]
DES concepts
• fundamental concepts:
– system state (variables)
– state transitions (events)
– simulation time: totally ordered set of values representing time in the system being modeled
• the system state can only be modified upon reception of an event
• modeling can be
– event-oriented
– process-oriented
Life cycle of a DES
• a DES system can be viewed as a collection of simulated objects and a sequence of event computations
• each event computation carries a time stamp indicating when that event occurs in the physical system
• each event computation may:
– modify state variables
– schedule new events into the simulated future
• events are stored in a local event list
– events are processed in time stamp order
– usually, no more events = termination
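The life cycle above can be sketched as a minimal event-oriented kernel. This is a hypothetical Python illustration (names like `Simulator` and `schedule` are not from the slides, and CSAM itself is written in C++):

```python
import heapq

class Simulator:
    """Minimal event-oriented DES kernel: a time-ordered event list,
    processed until empty (hypothetical sketch, not the CSAM kernel)."""
    def __init__(self):
        self.now = 0.0
        self._events = []   # heap of (timestamp, seq, handler, data)
        self._seq = 0       # tie-breaker keeps insertion order at equal times

    def schedule(self, delay, handler, data=None):
        # schedule a new event into the simulated future (now + delay)
        heapq.heappush(self._events, (self.now + delay, self._seq, handler, data))
        self._seq += 1

    def run(self):
        # events are processed in time stamp order; no more events = termination
        while self._events:
            self.now, _, handler, data = heapq.heappop(self._events)
            handler(data)

# toy model: each event modifies state (the log) and schedules the next one
sim = Simulator()
log = []
def tick(n):
    log.append((sim.now, n))
    if n < 3:
        sim.schedule(5, tick, n + 1)

sim.schedule(0, tick, 1)
sim.run()
```

Events sit in a heap ordered by time stamp, so the smallest time stamp is always processed first and causality is implicitly maintained.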
A simple DES model
[figure: two nodes A and B connected by a link; link model delay = 5, send processing time = 5, receive processing time = 1; packet arrivals P1 at 5, P2 at 12, P3 at 22]
Local event list, processed in time stamp order:
<e1,5> A receives packet P1
<e2,10> A sends P1 to B
<e3,12> A receives packet P2
<e4,15> B receives P1 from A
<e5,16> B sends ACK(P1) to A
<e6,17> A sends P2 to B
<e7,21> A receives ACK(P1)
<e9,22> A receives packet P3
<e8,23> B receives P2 from A
Why it works?
• events are processed in time stamp order
• an event at time t can only generate future events with time stamps greater than or equal to t (no event in the past)
• generated events are inserted into the event list, sorted by their time stamp
– the event with the smallest time stamp is always processed first,
– causality constraints are implicitly maintained.
Why change? It's so simple!
• models become larger and larger
• the simulation time is overwhelming, or the simulation is simply intractable
• examples:
– parallel programs with millions of lines of code,
– mobile networks with millions of mobile hosts,
– ATM networks with hundreds of complex switches,
– multicast models with thousands of sources,
– the ever-growing Internet,
– and much more...
Some figures to convince...
• ATM network models
– simulation at the cell level,
– 200 switches,
– 1000 traffic sources at 50 Mbit/s,
– 155 Mbit/s links,
– 1 simulation event per cell arrival.
• scalability concerns
– simulation time increases as link speed increases,
– usually more than 1 event per cell arrival,
– how scalable is traditional simulation?
More than 26 billion events to simulate 1 second!
30 hours if 1 event is processed in 1 µs
Parallel simulation - principles
• execution of a discrete event simulation on a parallel or distributed system with several physical processors.
• the simulation model is decomposed into several sub-models that can be executed in parallel
– spatial partitioning,
– temporal partitioning,
• radically different from simple simulation replications.
Parallel simulation - pros & cons
• pros
– reduction of the simulation time,
– increase of the model size,
• cons
– causality constraints are difficult to maintain,
– need for special mechanisms to synchronize the different processors,
– increase in both the model and the simulation kernel complexity.
• challenges
– ease of use, transparency.
Parallel simulation - example
[figure: the model is partitioned into logical processes (LPs) that run in parallel; a packet in the physical system becomes a timestamped event in the simulator]
A simple PDES model
[figure: the same A-B model, now distributed over two LPs; link model delay = 5, send processing time = 5, receive processing time = 1; packet arrivals P1 at 5, P2 at 12, P3 at 22]
Each LP processes its own local event list:
<e1,5> A receives packet P1
<e2,10> A sends P1 to B
<e3,12> A receives packet P2
<e4,15> B receives P1 from A
<e5,16> B sends ACK(P1) to A
<e6,17> A sends P2 to B
<e7,21> A receives ACK(P1)
<e9,22> A receives packet P3
<e8,23> B receives P2 from A
If LP A has already processed <e9,22> when <e7,21> arrives, time stamp order is violated: a causality error.
Synchronization problems
• fundamental concepts
– each Logical Process (LP) can be at a different simulation time
– local causality constraint: events in each LP must be executed in time stamp order
• synchronization algorithms
– Conservative: avoids local causality violations by waiting until it's safe
– Optimistic: allows local causality violations, but provisions are made to recover from them at runtime
Conservative protocols
Architecture of a conservative LP
The Chandy-Misra-Bryant protocol
The lookahead ability
Architecture of a conservative LP
– LPs communicate by sending messages with non-decreasing time stamps
– each LP keeps a static FIFO channel for each LP it receives from
– each FIFO channel (input channel, IC) has a clock ci that ticks according to the time stamp of the topmost message, if any; otherwise it keeps the time stamp of the last message received
[figure: LP A with input channels from LPs B, C and D; queued messages tB1, tB2 on one channel, tC3, tC4, tC5 on another, tD4 on the third; channel clocks c1 = tB1, c2 = tC3, c3 = tD3]
A simple conservative algorithm
• each LP has to process events in time stamp order to avoid local causality violations
The Chandy-Misra-Bryant algorithm
while (simulation is not over) {
    determine the ICi with the smallest ci
    if (ICi empty)
        wait for a message
    else {
        remove topmost event from ICi
        process event
    }
}
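The loop above can be sketched in Python. This is a hypothetical illustration of one conservative LP: the channel and message layout are assumed, and blocking on an empty channel is only signalled, not actually implemented.

```python
from collections import deque

class InputChannel:
    """A FIFO input channel; its clock is the time stamp of the topmost
    message, or the last time stamp seen when the channel is empty."""
    def __init__(self):
        self.fifo = deque()
        self.clock = 0
    def push(self, ts, msg):
        self.fifo.append((ts, msg))
    def head_clock(self):
        return self.fifo[0][0] if self.fifo else self.clock

def cmb_step(channels):
    """One iteration of the Chandy-Misra-Bryant loop: find the IC with
    the smallest clock; block if it is empty, otherwise process its event."""
    ic = min(channels, key=lambda c: c.head_clock())
    if not ic.fifo:
        return None              # must block and wait for a message on ic
    ts, msg = ic.fifo.popleft()
    ic.clock = ts                # the channel clock advances to this event
    return ts, msg

# two incoming channels with timestamped messages
a, b = InputChannel(), InputChannel()
for ts in (3, 6):
    a.push(ts, "from B")
for ts in (1, 4, 7):
    b.push(ts, "from C")

processed = []
while (ev := cmb_step([a, b])) is not None:
    processed.append(ev[0])
```

After processing 1, 3, 4 and 6, channel `a` is empty with the smallest clock, so the LP blocks even though a message with time stamp 7 is already waiting on `b`: safe, but it has to block.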
Safe but has to block
[figure: LP A with three input channels IC1, IC2, IC3 holding timestamped messages; a table traces, step by step, which IC has the minimum clock and which event is processed, until A must BLOCK on an empty channel]
Blocks and even deadlocks!
[figure: a source S feeds LPs A and B, whose outputs join at a merge point M, forming a cycle; if S sends all messages to B, the channel through A stays empty, every LP in the cycle waits on it, and the simulation is BLOCKED: a deadlock]
How to solve deadlock: null-messages
[figure: the same S, A, B, M topology; null-messages artificially propagate simulation time along the otherwise empty channels, and the merge point is UNBLOCKED]
At what frequency should null-messages be sent?
How to solve deadlock: null-messages
• a null-message indicates a Lower Bound Time Stamp (LBTS)
• minimum delay between links is 4; LP C is initially at simulation time 0
[figure: ring A → B → C with pending events at time stamps 7, 9, 10, 11]
– LP C sends a null-message with time stamp 4
– LP A sends a null-message with time stamp 8
– LP B sends a null-message with time stamp 12
– LP C can now process the event with time stamp 7
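The cascade on this slide can be reproduced with a small sketch (assumed ring A → B → C → A, each hop adding its lookahead of 4 to the propagated bound; the function name is hypothetical):

```python
def cascade(c_time, event_ts, ring_len=3, lookahead=4):
    """Propagate null-messages around the ring until LP C's input channel
    clock covers the pending event with time stamp event_ts."""
    nulls = []
    bound = c_time
    while True:
        # one trip around the ring: each LP forwards bound + its lookahead
        for _ in range(ring_len):
            bound += lookahead
            nulls.append(bound)
        if bound >= event_ts:    # C's channel clock now covers the event
            return nulls

# LP C at simulation time 0, event with time stamp 7 blocked at C:
rounds = cascade(0, 7)    # null-messages with time stamps 4, 8, 12
```

With a lookahead of 1 instead of 4, the same ring needs 9 null-messages before the event at time stamp 7 becomes safe, which is exactly the small-lookahead problem discussed on a later slide.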
The lookahead ability
• null-messages are sent by an LP to indicate a lower bound on the time stamps of the messages it will send in the future
• null-messages rely on the « lookahead » ability
– communication link delays
– server processing time (FIFO)
• lookahead is very dependent on the application model and needs to be explicitly identified
Lookahead for concurrent processing
[figure: LPs A, B, C and D concurrently process the events falling in the window [TA, TA + LA); events inside the window are safe (s), events beyond it are unsafe]
What if lookahead is small?
• a null-message indicates a Lower Bound Time Stamp (LBTS)
• minimum delay between links is now 1; LP C is initially at simulation time 0
[figure: the same ring A → B → C with pending events at time stamps 7, 9, 10, 11]
– LP C sends null-messages with time stamps 1, then 5, then 6, then 7...
– LP A sends null-messages with time stamps 2, ...
– LP B sends null-messages with time stamps 3, ...
– many rounds of null-messages are needed before LP C can process the event with time stamp 7
Conservative: pros & cons
• pros
– simple, easy to implement
– good performance when lookahead is large (communication networks, FIFO queues)
• cons
– pessimistic in many cases
– large lookahead is essential for performance
– no transparent exploitation of parallelism
– performance may drop even with small changes in the model (adding preemption, adding one small-lookahead link...)
Optimistic protocols
Architecture of an optimistic LP
Time Warp
Architecture of an optimistic LP
– LPs send timestamped messages, not necessarily in non-decreasing time stamp order
– no static communication channels between LPs, dynamic creation of LPs is easy
– each LP processes events as they are received, no need to wait for safe events
– local causality violations are detected and corrected at runtime
[figure: LPs B, C and D send messages with time stamps tB1, tB2, tC3, tC4, tC5, tD4 to LP A, in no particular order]
Processing events as they arrive
[figure: LP A receives and immediately processes events with time stamps 11 (from LPB), 13 (LPD), 18 (LPB), 22 (LPC), 25 (LPD), 28 (LPC), 32 (LPD) and 36 (LPB); all are already processed when a late message arrives]
what to do with late messages?
TimeWarp. Rollback? How?
• Late messages (stragglers) are handled with a rollback mechanism
– undo false/incorrect local computations,
• state saving: save the state variables of an LP
• reverse computation
– undo false/incorrect remote computations,
• anti-messages: anti-messages and (real) messages annihilate each other
– process the late message
– re-process previous messages: processed events are NOT discarded!
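The mechanism can be sketched as follows (hypothetical Python: state saving before every event, rollback on a straggler, and re-processing of the undone events; anti-messages are omitted here for brevity):

```python
import copy

class OptimisticLP:
    """Time Warp sketch: checkpoint before each event; a straggler rolls
    the LP back to the last state saved before its time stamp, and the
    undone events are re-processed, not discarded."""
    def __init__(self):
        self.state = {"count": 0}
        self.clock = 0
        self.processed = []      # (timestamp, event), kept for re-processing
        self.checkpoints = []    # (timestamp, saved state)

    def process(self, ts, ev):
        self.checkpoints.append((self.clock, copy.deepcopy(self.state)))
        self.clock = ts
        self.state["count"] += ev
        self.processed.append((ts, ev))

    def receive(self, ts, ev):
        if ts < self.clock:                      # straggler: roll back
            redo = [e for e in self.processed if e[0] > ts]
            while self.checkpoints and self.clock > ts:
                self.clock, self.state = self.checkpoints.pop()
            self.processed = [e for e in self.processed if e[0] <= ts]
            self.process(ts, ev)                 # process the late message
            for rts, rev in redo:                # re-process undone events
                self.process(rts, rev)
        else:
            self.process(ts, ev)

lp = OptimisticLP()
for ts in (11, 13, 18, 22):
    lp.receive(ts, 1)
lp.receive(15, 1)   # straggler: rollback below 15, then redo 18 and 22
```

After the rollback, the straggler and the redone events leave the LP with the same final state it would have reached in a purely sequential execution.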
A pictured-view of a rollback
[figure: an LP has processed events 11, 13, 18, 22, 25, 28, 32, 36, 43, 45 when a straggler arrives; it restores the last state point saved before the straggler's time stamp, sends anti-messages for the messages generated since, and re-processes the undone events]
– the real rollback distance depends on the state saving period: a short period reduces rollback overhead but increases state saving overhead
[figure: events after the restored state point go from processed back to unprocessed and are executed again]
Reception of an anti-message
– may initiate a rollback if the corresponding positive message has already been processed,
– may annihilate the corresponding positive message if it is still unprocessed,
– may wait in the input queue if the corresponding positive message has not been received yet.
[figure: input queue holding 22, 25, 28, 36, 43, 45; an anti-message for 43 annihilates the still-unprocessed 43; an anti-message for 25 triggers a rollback; a message with time stamp 48 simply waits in the queue]
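The three cases above can be sketched as follows (hypothetical Python; messages are identified by a (timestamp, id) pair, and the rollback itself is elided):

```python
def receive_anti(anti, input_queue, processed):
    """Handle an incoming anti-message; returns which case applied."""
    if anti in processed:
        processed.remove(anti)
        return "rollback"        # positive msg already processed: roll back
    if anti in input_queue:
        input_queue.remove(anti) # still unprocessed: annihilate in place
        return "annihilate"
    input_queue.append(anti)     # positive msg not received yet: wait for it
    return "wait"

q = [(43, "m2")]                 # unprocessed messages in the input queue
done = [(22, "m1")]              # already-processed messages
case1 = receive_anti((43, "m2"), q, done)   # annihilates the queued 43
case2 = receive_anti((22, "m1"), q, done)   # 22 was processed: rollback
case3 = receive_anti((50, "m3"), q, done)   # nothing to cancel yet: wait
```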
Need for a Global Virtual Time
• Motivations
– an indicator that the simulation time advances
– reclaim memory (fossil collection)
• Basically, GVT is the minimum of
– all LPs' local simulation times
– the time stamps of messages in transit
• GVT guarantees that
– events below GVT are definitive events (I/O can be committed)
– no rollback can occur before the GVT
– state points before GVT can be reclaimed
– anti-messages before GVT can be reclaimed
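The definition above can be written down directly (a hypothetical sketch; real GVT algorithms must also capture the in-transit messages consistently, which is the hard part in a distributed setting):

```python
def gvt(lp_clocks, in_transit):
    """GVT = min over all LPs' local simulation times and over the
    time stamps of messages still in transit."""
    return min(list(lp_clocks) + [ts for ts, _ in in_transit])

clocks = [34, 28, 41]                 # local simulation times of three LPs
transit = [(30, "msg"), (38, "msg")]  # timestamped messages not yet delivered
g = gvt(clocks, transit)              # no rollback can ever reach below g
```

Everything older than `g` (state points, anti-messages) can be reclaimed, and I/O with time stamps below `g` can be committed as definitive.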
A pictured-view of the GVT
[figure: LPs A, B, C and D, each holding conditional events (c) and definitive events (D); advancing from the old GVT to the new GVT turns the conditional events below it into definitive ones]
Optimistic overheads
• Periodic state savings
– states may be large, very large!
– copies are very costly
• Periodic GVT computations
– costly in a distributed architecture,
– may block computations,
• Rollback thrashing
– cascaded rollbacks, no advancement!
• Memory!
– memory is THE limitation
Optimistic: pros & cons
• pros
– exploits all the parallelism in the model, lookahead is less important,
– transparent to the end-user
– interactive simulations can be enabled
– can hopefully be general-purpose
• cons
– very complex, needs lots of memory,
– large overheads (state saving, GVT, rollbacks...)
Optimizations, variations
Conservative
Optimistic
Mixed approaches, adaptive approaches
Conservative: outline
• Add more information to reduce the number of null-messages
– special messages: carrier null-messages [Cai90]
– topology information: [DeVries90]
– time/delay information: Bounded Lag [Lubachevsky89]
– time windows: CTW [Ayani92]
• In general, one tries to add additional knowledge of the model into the simulator; this may not be general-purpose
Optimistic: outline
• Reduce rollback-related overhead
– lazy cancellation, lazy re-evaluation
– limit optimism (time windows, blocking)
• Reduce memory consumption
– fast GVT algorithms, hardware support
– incremental state saving, reverse computation
– cancelback, artificial rollback
• In general, one tries to reduce the optimism to avoid too much speculative computation
Mixed/adaptive approaches
• General framework that (automatically) switches between conservative and optimistic
• Adaptive approaches may determine at runtime the amount of conservatism or optimism
[figure: a spectrum from conservative to optimistic with mixed approaches in between; performance plotted against the number of messages for the conservative and optimistic extremes]
Parallel simulation today
• Lots of algorithms have been proposed
– variations on conservative and optimistic
– adaptive approaches
• Few end-users
– impossible to compete with sequential simulators in terms of user interface, generality, ease of use etc.
• Research mainly focuses on
– applications, ultra-large scale simulations
– tools and execution environments (clusters)
– composability issues
CSAM (Pham, UCBL)
• CSAM: Conservative Simulator for ATM network Models
• Simulation at the cell level
• Conservative and/or sequential
• C++ programming style, predefined generic models of sources, switches, links...
• New models can easily be created by deriving from base classes
• A configuration file describes the topology
CSAM - Kernel characteristics
• Exploits the lookahead of communication links: transparent for the user
• Virtual Input Channels
– reduce overhead for event manipulation,
– reduce overhead for null-message handling.
• Cyclic event execution
• Message aggregation
– static aggregation size,
– asymmetric aggregation sizes on CLUMPS,
– sender-initiated,
– receiver-initiated.
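Sender-initiated aggregation with a static size can be sketched as follows (hypothetical Python; the callback stands in for the actual MPI send that CSAM would use, which is not shown in the slides):

```python
class Aggregator:
    """Buffer events and ship them as one physical message when the
    static aggregation size is reached (sender-initiated)."""
    def __init__(self, size, send):
        self.size = size         # static aggregation size
        self.buffer = []
        self.send = send         # callback standing in for an MPI send

    def push(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.size:   # full: one physical message
            self.flush()

    def flush(self):
        # also called at the end of a cycle to ship a partial batch
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()

sent = []
agg = Aggregator(3, sent.append)
for ev in range(7):
    agg.push(ev)
agg.flush()          # flush the partial batch at cycle end
```

Fewer, larger physical messages amortize the per-message communication cost, at the price of delaying individual events until a batch fills or the cycle ends.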
CSAM - Life cycle
[figure: (a) at the end of a cycle, each LP sends a null-message carrying its clock plus lookahead (t2 + L) into the MPI buffers; (b) it then gets the new messages, updates last[i] for each virtual input channel, computes safetime = min(last[i]), moves the safe events into the Future Event List, and begins a new cycle]
Test case: 78-switch ATM network
Distance-Vector Routing with dynamic link cost functions
Connection setup, admission control protocols
CSAM - Some results...
[figure: routing protocol's reconfiguration time]
CSAM - Some results...
[figure: end-to-end delays]
Cluster-based solution
• Myrinet-based cluster of 12 Pentium Pro at 200MHz, 64 MBytes, Linux
• Myrinet-based cluster of 4 dual Pentium Pro 450MHz, 128 Mbytes, Linux
• Myrinet board with LANai 4.1, 256KB
• BIP, BIP-SMP, MPI/BIP, MPI/BIP-SMP communication libraries
CSAM - speedup on a Myrinet cluster
[figure: speedup (0 to 6) vs. number of processors (2 to 10), Pentium Pro 200 MHz]
More than 53 million events to simulate 0.31 s
CSAM - speedup with CLUMPS
[figure: speedup (0 to 2.5) for aggregation sizes no aggr, 156, 256, 512, 1024, 256-156, 512-156, 1024-156; configurations 2 ext., 2 int., 4 ext., 2x2 int.; dual Pentium Pro 450 MHz]
Increasing the model size (CLUMPS)
[figure: speedup (0 to 5) for the same aggregation sizes, comparing the 78-switch and 156-switch models; dual Pentium Pro 450 MHz, 4x2 int.]
Conclusions
• Parallel simulation techniques can successfully be applied to communication network models
• To enable the PSTTL approach (parallel simulation to the labs), we need
– powerful software,
– a powerful BUT cheap and accessible execution environment
– Myrinet? Fast Ethernet? Giga Ethernet?
We will always take the cheapest if performance is good
References
• Parallel simulation
– K. M. Chandy and J. Misra, Distributed Simulation: A Case Study in Design and Verification of Distributed Programs, IEEE Trans. on Soft. Eng., 1979, pp. 440-452
– R. Fujimoto, Parallel Discrete Event Simulation, Comm. of the ACM, Vol. 33(10), Oct. 1990, pp. 31-53
– http://