Parallel and Distributed Simulation
Time Parallel Simulation
Outline
• Space-Time Framework
• Time Parallel Simulation
• Parallel Cache Simulation
• Simulation Using Parallel Prefix
Space-Time Framework

A simulation computation can be viewed as computing the state of the physical processes in the system being modeled over simulated time.

Algorithm:
1. partition space-time region into non-overlapping regions
2. assign each region to a logical process
3. each LP computes state of physical system for its region, using inputs from other regions and producing new outputs to those regions
4. repeat step 3 until a fixed point is reached
[Figure: space-time diagram with simulated time on the horizontal axis and physical processes on the vertical axis; one region of the diagram is highlighted together with the inputs to that region. A second diagram shows space-parallel simulation (e.g., Time Warp): the physical processes are partitioned into horizontal strips, one per logical process (LP1-LP6), each spanning all of simulated time.]
Time Parallel Simulation
Basic idea:
• divide the simulated time axis into non-overlapping intervals
• each processor computes the sample path of the interval assigned to it
Observation: The simulation computation is a sample path through the set of possible states across simulated time.
[Figure: possible system states (vertical axis) versus simulated time (horizontal axis); the simulated time axis is divided into five non-overlapping intervals, assigned to processors 1-5, and each processor computes the sample path for its interval.]
Key question: What is the initial state of each interval (processor)?
Time Parallel Simulation: Relaxation Approach
1. guess initial state of each interval (processor)
2. each processor computes sample path of its interval
3. using final state of previous interval as initial state, “fix up” sample path
4. repeat step 3 until a fixed point is reached

Benefit: massively parallel execution
Liabilities: cost of “fix up” computation; convergence may be slow (worst case, N iterations for N processors); state may be complex
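The relaxation steps above can be sketched in a few lines of Python. This is an illustrative sketch, not from the slides: the function name `time_parallel_relax`, the `step(state, x)` transition function, and the use of a single common guess for every interval's initial state are all assumptions.

```python
def time_parallel_relax(step, interval_inputs, guess):
    """Time-parallel simulation via relaxation (hypothetical names).

    step(state, x) -> next state.  Each interval's sample path is computed
    from a guessed initial state; intervals are then "fixed up" from the
    final state of their predecessor until a fixed point is reached
    (worst case, N iterations for N intervals)."""
    n = len(interval_inputs)
    final = [None] * n
    # step 1: guess each interval's initial state (interval 0's guess is
    # taken to be the true initial state of the simulation)
    initial = [guess] * n
    while True:
        # steps 2-3: recompute every interval's sample path; these n loops
        # are independent, so each could run on its own processor
        new_final = []
        for init, xs in zip(initial, interval_inputs):
            state = init
            for x in xs:
                state = step(state, x)
            new_final.append(state)
        if new_final == final:     # step 4: fixed point reached -> done
            return final
        final = new_final
        # fix-up: interval i restarts from interval i-1's final state
        initial = [guess] + final[:-1]
```

With `step = max` the state is a running maximum, so `time_parallel_relax(max, [[3, 1], [2, 5], [4, 0]], 0)` converges to the per-interval final states `[3, 5, 5]` after the fix-up passes.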
Example: Cache Memory
• Cache holds a subset of the entire memory
  – Memory organized as blocks
  – Hit: referenced block in cache
  – Miss: referenced block not in cache
• Replacement policy determines which block to delete when requested data is not in cache (miss)
  – LRU: delete least recently used block from cache
• Implementation: Least Recently Used (LRU) stack
  – Stack contains addresses of memory blocks (block numbers)
  – For each memory reference in the input (memory reference trace):
    – if referenced address is in the stack (hit), move it to the top of the stack
    – if not in the stack (miss), place the address on top of the stack, deleting the address at the bottom
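The LRU stack update just described can be written as a short routine. A minimal sketch in Python; the function name `lru_simulate` and the default stack depth of 4 are assumptions (the depth matches the worked example on these slides).

```python
def lru_simulate(trace, depth=4, initial=None):
    """Count hits and misses for an LRU stack of the given depth.

    stack[0] is the most recently used block; stack[-1] is the least
    recently used block, which falls out of the cache on a miss."""
    stack = list(initial) if initial else []
    hits = misses = 0
    for addr in trace:
        if addr in stack:            # hit: move referenced block to the top
            hits += 1
            stack.remove(addr)
        else:                        # miss: block at the bottom falls out
            misses += 1
            if len(stack) == depth:
                stack.pop()
        stack.insert(0, addr)
    return hits, misses, stack
```

On processor 1's trace from the example that follows, `lru_simulate([1, 2, 1, 3, 4, 3, 6, 7])` returns 2 hits, 6 misses, and the final stack `[7, 6, 3, 4]`.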
Example: Trace-Driven Cache Simulation

Given a sequence of references to blocks in memory, determine the number of hits and misses using LRU replacement.

First iteration: assume each stack is initially empty.

processor 1
  address:    1    2    1    3    4    3    6    7
  LRU stack:  1--- 21-- 12-- 312- 4312 3412 6341 7634

processor 2
  address:    2    1    2    6    9    3    3    6
  LRU stack:  2--- 12-- 21-- 621- 9621 3962 3962 6392

processor 3
  address:    4    2    3    1    7    2    7    4
  LRU stack:  4--- 24-- 324- 1324 7132 2713 7213 4721

Second iteration: processor i uses the final state of processor i-1 as its initial state.

processor 1: (idle)

processor 2 (initial stack 7634)
  address:    2    1    2    6    9
  LRU stack:  2763 1276 2176 6217 9621   match! (same state as in the first iteration, so the rest of the sample path is unchanged)

processor 3 (initial stack 6392)
  address:    4    2    3    1
  LRU stack:  4639 2463 3246 1324   match! (same state as in the first iteration, so the rest of the sample path is unchanged)

Done!
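The two-iteration example above can be reproduced in code. A sketch under stated assumptions: the function names are hypothetical, and the fix-up pass below runs intervals in order (rather than as a second fully parallel iteration), stopping early as soon as a stack state matches the first pass.

```python
def lru_states(trace, initial, depth=4):
    """Return the list of LRU stack states after each reference."""
    stack, states = list(initial), []
    for addr in trace:
        if addr in stack:
            stack.remove(addr)
        elif len(stack) == depth:
            stack.pop()
        stack.insert(0, addr)
        states.append(list(stack))
    return states

def time_parallel_lru(traces, depth=4):
    """Time-parallel trace-driven cache simulation.

    First pass: every interval starts from an empty stack; these runs are
    independent and could execute in parallel.  Fix-up: interval i re-runs
    from its predecessor's final state, but stops as soon as its stack
    matches the first-pass state, since the rest of the path is unchanged."""
    first = [lru_states(t, [], depth) for t in traces]   # "parallel" pass
    paths = [first[0]]                                   # interval 0 is exact
    for i in range(1, len(traces)):
        stack = list(paths[i - 1][-1])                   # predecessor's final state
        fixed = []
        for j, addr in enumerate(traces[i]):
            if addr in stack:
                stack.remove(addr)
            elif len(stack) == depth:
                stack.pop()
            stack.insert(0, addr)
            fixed.append(list(stack))
            if fixed[-1] == first[i][j]:     # match! remainder is unchanged
                fixed += first[i][j + 1:]
                break
        paths.append(fixed)
    return paths
```

Running it on the three traces from the slide reproduces the matches at stacks 9621 and 1324 and the final stacks 7634, 6392, and 4721.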
Outline
• Space-Time Framework
• Time Parallel Simulation
• Parallel Cache Simulation
• Simulation Using Parallel Prefix
Time Parallel Simulation Using Parallel Prefix
Basic idea: formulate the simulation computation as a linear recurrence, and solve it using a parallel prefix computation.

Parallel prefix: given N values X1, X2, …, XN, compute the N initial products

P1 = X1
P2 = X1 • X2
…
Pi = X1 • X2 • X3 • … • Xi, for i = 1, …, N, where • is an associative operator
[Figure: parallel prefix over X1 … X8 in three steps:
step 1, add the value one position to the left, producing partial products 1-2, 2-3, 3-4, 4-5, 5-6, 6-7, 7-8;
step 2, add the value two positions to the left, producing 1-3, 1-4, 2-5, 3-6, 4-7, 5-8;
step 3, add the value four positions to the left, producing 1-5, 1-6, 1-7, 1-8; every element now holds its prefix product.]
Parallel prefix requires O(log N) time
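The doubling scheme in the figure can be sketched as follows; the updates within each step are independent and would run in parallel on a real machine, though here they are written as a sequential list comprehension. The function name `parallel_prefix` is an assumption.

```python
def parallel_prefix(xs, op):
    """Inclusive scan by doubling: in step k, each element combines with the
    partial product 2**k positions to its left, so all N elements finish
    after ceil(log2(N)) steps.  op must be associative."""
    ys = list(xs)
    shift = 1                      # 1, 2, 4, ... positions to the left
    while shift < len(ys):
        ys = [ys[i] if i < shift else op(ys[i - shift], ys[i])
              for i in range(len(ys))]
        shift *= 2
    return ys
```

With addition as the operator, `parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b)` yields the prefix sums `[1, 3, 6, 10, 15, 21, 28, 36]` in three doubling steps.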
Example: G/G/1 Queue

Given:
• ri = interarrival time of the ith job
• si = service time of the ith job
Compute
• Ai = arrival time of the ith job
• Di = departure time of the ith job, for i=1, 2, 3, … N
Solution: rewrite equations as parallel prefix computations:
• Ai = Ai-1 + ri (= r1 + r2 + r3 + … + ri)
• Di = max (Di-1 , Ai ) + si
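A minimal sketch of the two recurrences, assuming Python and a hypothetical name `gg1_times`: the arrival times are an ordinary prefix sum, computed here with the standard-library scan; the departure times are computed sequentially, with a comment noting the (max, +) reformulation, a standard technique that the slides' "rewrite as parallel prefix" step relies on but does not spell out.

```python
from itertools import accumulate
import operator

def gg1_times(r, s):
    """Arrival and departure times for a G/G/1 queue.

    A_i = A_{i-1} + r_i is a plain prefix sum over the interarrival times,
    so it maps onto a parallel prefix directly.  D_i = max(D_{i-1}, A_i) + s_i
    is evaluated sequentially below; because (max, +) is associative, it can
    also be cast as a parallel prefix over that algebra."""
    A = list(accumulate(r, operator.add))    # A_i = r_1 + r_2 + ... + r_i
    D, prev = [], 0
    for a, svc in zip(A, s):
        prev = max(prev, a) + svc            # D_i = max(D_{i-1}, A_i) + s_i
        D.append(prev)
    return A, D
```

For interarrival times `[1, 2, 1]` and service times `[4, 1, 2]`, this gives arrivals `[1, 3, 4]` and departures `[5, 6, 8]`: job 2 waits behind job 1, while job 3 waits behind job 2.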
Summary of Time Parallel Algorithms
Pro:
• allows for massive parallelism
• often, little or no synchronization is required after spawning the parallel computations
• substantial speedups obtained for certain problems: queueing networks, caches, ATM multiplexers

Con:
• only applicable to a very limited set of problems