
Page 1

Efficient Optimistic Parallel Simulations Using Reverse Computation

Chris Carothers, Department of Computer Science, Rensselaer Polytechnic Institute

Kalyan Perumalla and Richard M. Fujimoto, College of Computing, Georgia Institute of Technology

Page 2

Why Parallel/Distributed Simulation?

Goal: speed up discrete-event simulation programs using multiple processors.

Enabling technology for: making intractable simulation models tractable, off-line decision aids, and on-line aids for time-critical situation analysis.

DPAT: a distributed simulation success story. A simulation model of the National Airspace developed at MITRE using Georgia Tech Time Warp (GTW). It simulates 50,000 flights in under 1 minute, a run that used to take 1.5 hours, and provides a web-based user interface to be used in the FAA Command Center for on-line "what if" planning.

Parallel/distributed simulation has the potential to improve how "what if" planning strategies are evaluated.

Page 3

How to Synchronize Distributed Simulations?

Parallel time-stepped simulation: lock-step execution, with all PEs advancing virtual time together through barrier synchronization.

Parallel discrete-event simulation: must allow for sparse, irregular event computations across PEs.

Problem: events arriving in the simulated past ("straggler" events behind already processed events).

Solution: Time Warp.

[Figure: virtual-time diagrams for PE 1, PE 2, PE 3 contrasting time-stepped execution (barriers) with discrete-event execution (processed events and a "straggler" event).]

Page 4

Time Warp...

Local control mechanism: error detection and rollback. When a "straggler" event arrives in an LP's past, the LP (1) undoes its state changes and (2) cancels the events it "sent" during the rolled-back computation.

Global control mechanism: compute Global Virtual Time (GVT). Versions of state and events with timestamps less than GVT can be collected and their I/O operations performed; such events are "committed".

[Figure: virtual-time diagrams for LP 1, LP 2, LP 3 showing processed, "straggler", unprocessed, and "committed" events, and the GVT line.]
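To make the local control mechanism concrete, here is a minimal C sketch of a rollback routine, assuming a simple processed-event list per LP; the lp_t/event_t types and the undo_event, cancel_sent_events, and reinsert_pending helpers are invented for this illustration and are not GTW's API:

    /* Minimal sketch of Time Warp rollback at one LP (illustrative only). */
    typedef struct event { double ts; struct event *next; } event_t;
    typedef struct lp { double now; event_t *processed; /* newest first */ } lp_t;

    void undo_event(lp_t *lp, event_t *e);         /* restore state (state saving or reverse code) */
    void cancel_sent_events(lp_t *lp, event_t *e); /* send anti-messages for events that e scheduled */
    void reinsert_pending(lp_t *lp, event_t *e);   /* move e back onto the pending event list */

    /* Called when a straggler with timestamp 'ts' arrives in the LP's past. */
    void rollback(lp_t *lp, double ts)
    {
        while (lp->processed && lp->processed->ts >= ts) {
            event_t *e = lp->processed;
            lp->processed = e->next;
            cancel_sent_events(lp, e);   /* (2) cancel "sent" events */
            undo_event(lp, e);           /* (1) undo state changes   */
            reinsert_pending(lp, e);     /* will be re-executed after the straggler */
        }
        /* clock falls back to the last event that was kept */
        lp->now = lp->processed ? lp->processed->ts : 0.0;
    }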

Page 5

Challenge: Efficient Implementation?

Advantages of Time Warp:
  • automatically finds available parallelism
  • makes development easier
  • outperforms conservative schemes by a factor of N

Disadvantages:
  • large memory requirements to support the rollback operation
  • state saving incurs high overheads for fine-grain event computations
  • Time Warp is outside the "performance envelope" for many applications

[Figure: Time Warp running on processors connected by shared memory or a high-speed network.]

Our Solution: Reverse Computation

Page 6

Outline...

  • Reverse Computation
  • Example: ATM Multiplexor
  • Beneficial Application Properties
  • Rules for Automation
  • Reversible Random Number Generator
  • Experimental Results
  • Conclusions
  • Future Work

Page 7

Our Solution: Reverse Computation...

Use Reverse Computation (RC):
  • automatically generate reverse code from the model source
  • undo by executing the reverse code

Delivers better performance:
  • negligible overhead for forward computation
  • significantly lower memory utilization

Page 8

Example: ATM Multiplexor

[Figure: a multiplexor with N input links and a buffer of size B.]

On cell arrival...

Original:
    if (qlen < B) {
        qlen++;
        delays[qlen]++;
    } else {
        lost++;
    }

Forward (instrumented):
    if (qlen < B) {
        b1 = 1;
        qlen++;
        delays[qlen]++;
    } else {
        b1 = 0;
        lost++;
    }

Reverse:
    if (b1 == 1) {
        delays[qlen]--;
        qlen--;
    } else {
        lost--;
    }
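For concreteness, the three fragments above can be fleshed out into a compilable C sketch; the mux_state_t layout and the handler names are assumptions made for this example (in the actual technique the bit b1 would travel with the processed event rather than sit in the LP state):

    #define B 100

    typedef struct {
        int qlen;           /* current queue length                        */
        int lost;           /* number of dropped cells                     */
        int delays[B + 1];  /* histogram of queueing delays                */
        int b1;             /* 1 bit of "control state" saved for reversal */
    } mux_state_t;

    /* Forward (instrumented) event handler: on cell arrival. */
    void cell_arrival(mux_state_t *s)
    {
        if (s->qlen < B) {
            s->b1 = 1;              /* record which branch was taken */
            s->qlen++;
            s->delays[s->qlen]++;
        } else {
            s->b1 = 0;
            s->lost++;
        }
    }

    /* Reverse handler: undoes exactly one cell_arrival. */
    void cell_arrival_reverse(mux_state_t *s)
    {
        if (s->b1 == 1) {
            s->delays[s->qlen]--;   /* inverse statements, in reverse order */
            s->qlen--;
        } else {
            s->lost--;
        }
    }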

Page 9

Gains...

State size reduction from B+2 words to 1 word; e.g., B=100 => 100x reduction!

Negligible overhead in the forward computation: the cost of supporting rollback is removed from the forward computation and moved to the rollback phase.

Result: a significant increase in speed and a significant decrease in memory.

How?...

Page 10

Beneficial Application Properties

1. Majority of operations are constructive, e.g., ++, --, etc.

2. Size of control state < size of data state, e.g., size of b1 < size of qlen, sent, lost, etc.

3. Perfectly reversible high-level operations gleaned from irreversible smaller operations, e.g., random number generation.

Page 11

Rules for Automation...

Generation rules, and upper bounds on bit requirements, for various statement types (x, x1, ..., xn denote the bits required by the child statements; "self" bits are those added by the rule itself):

T0  simple choice
    original:    if() s1; else s2;
    translated:  if() {s1; b=1;} else {s2; b=0;}
    reverse:     if(b==1) {inv(s1);} else {inv(s2);}
    bits:        self 1; children x1, x2; total 1 + max(x1, x2)

T1  compound choice (n-way)
    original:    if() s1; elseif() s2; elseif() s3; ... else sn;
    translated:  if() {s1; b=1;} elseif() {s2; b=2;} elseif() {s3; b=3;} ... else {sn; b=n;}
    reverse:     if(b==1) {inv(s1);} elseif(b==2) {inv(s2);} elseif(b==3) {inv(s3);} ... else {inv(sn);}
    bits:        self lg(n); children x1, ..., xn; total lg(n) + max(x1, ..., xn)

T2  fixed iterations (n)
    original:    for(n) s;
    translated:  for(n) s;
    reverse:     for(n) inv(s);
    bits:        self 0; child x; total n*x

T3  variable iterations (maximum n)
    original:    while() s;
    translated:  b=0; while() {s; b++;}
    reverse:     for(b) inv(s);
    bits:        self lg(n); child x; total lg(n) + n*x

T4  function call
    original:    foo();
    translated:  foo();
    reverse:     inv(foo)();
    bits:        self 0; child x; total x

T5  constructive assignment
    original:    v @= w;
    translated:  v @= w;
    reverse:     v = @w;
    bits:        self 0; child 0; total 0

T6  k-byte destructive assignment
    original:    v = w;
    translated:  {b = v; v = w;}
    reverse:     v = b;
    bits:        self 8k; child 0; total 8k

T7  sequence
    original:    s1; s2; ... sn;
    translated:  s1; s2; ... sn;
    reverse:     inv(sn); ... inv(s2); inv(s1);
    bits:        self 0; children x1, ..., xn; total x1 + ... + xn

T8  nesting of T0-T7
    Recursively apply the above rules.
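As a small illustration of how these rules compose, consider applying T0 (simple choice) and T6 (destructive assignment) to a two-branch statement; the variables and function names below are invented for this sketch:

    /* Original model code (invented example):
     *     if (x > 0) y = x;   -- destructive assignment to y
     *     else       x++;     -- constructive operation
     */

    int b;      /* 1 bit of branch state (rule T0)     */
    int b_y;    /* saved copy of y, 4 bytes (rule T6)  */

    /* Translated forward code: T0 records which branch ran,
     * T6 saves the bytes of y that are about to be overwritten. */
    void forward(int *x, int *y)
    {
        if (*x > 0) { b = 1; b_y = *y; *y = *x; }
        else        { b = 0; (*x)++; }
    }

    /* Generated reverse code: invert the statements of the branch taken. */
    void reverse(int *x, int *y)
    {
        if (b == 1) { *y = b_y; }
        else        { (*x)--; }
    }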

Page 12

Destructive Assignment...

Destructive assignment (DA): examples: x = y; x %= y. A DA requires all modified bytes to be saved.

Caveat: the reversing technique for DAs can degenerate to traditional incremental state saving.

Good news: certain collections of DAs are perfectly reversible! Queueing network models contain collections of easily/perfectly reversible DAs:
  • queue handling (swap, shift, tree insert/delete, ...); a swap sketch follows below
  • statistics collection (increment, decrement, ...)
  • random number generation (reversible RNGs)
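For example, a queue-handling swap consists of destructive assignments yet needs no saved state, because applying the swap a second time restores the original values; a minimal illustrative sketch:

    /* A swap overwrites both variables (two destructive assignments),
     * yet no bytes need to be saved: the swap is its own inverse. */
    void swap_forward(int *a, int *b)
    {
        int t = *a;   /* t is dead after the swap, so it needs no saving */
        *a = *b;
        *b = t;
    }

    void swap_reverse(int *a, int *b)
    {
        swap_forward(a, b);   /* applying the swap again undoes it */
    }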

Page 13

Reversing an RNG?

double RNGGenVal(Generator g)
{
    long k, s;
    double u = 0.0;

    s = Cg[0][g];
    k = s / 46693;
    s = 45991 * (s - k * 46693) - k * 25884;
    if (s < 0) s = s + 2147483647;
    Cg[0][g] = s;
    u = u + 4.65661287524579692e-10 * s;

    s = Cg[1][g];
    k = s / 10339;
    s = 207707 * (s - k * 10339) - k * 870;
    if (s < 0) s = s + 2147483543;
    Cg[1][g] = s;
    u = u - 4.65661310075985993e-10 * s;
    if (u < 0) u = u + 1.0;

    s = Cg[2][g];
    k = s / 15499;
    s = 138556 * (s - k * 15499) - k * 3979;
    if (s < 0) s = s + 2147483423;
    Cg[2][g] = s;
    u = u + 4.65661336096842131e-10 * s;
    if (u >= 1.0) u = u - 1.0;

    s = Cg[3][g];
    k = s / 43218;
    s = 49689 * (s - k * 43218) - k * 24121;
    if (s < 0) s = s + 2147483323;
    Cg[3][g] = s;
    u = u - 4.65661357780891134e-10 * s;
    if (u < 0) u = u + 1.0;

    return (u);
}

Observation: k = s / 46693 is a destructive assignment.
Result: RC degrades to classic state saving... can we do better?

Page 14

RNGs: A Higher-Level View

The previous RNG is based on the following recurrence:

    x_{i,n} = a_i * x_{i,n-1} mod m_i

where x_{i,n} is one of the four seed values in the nth set, m_i is one of the four largest primes less than 2^31, and a_i is a primitive root of m_i.

Now, the above recurrence is in fact reversible. The inverse of a_i modulo m_i is defined as

    b_i = a_i^(m_i - 2) mod m_i

Using b_i, we can generate the reverse recurrence as follows:

    x_{i,n-1} = b_i * x_{i,n} mod m_i
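A minimal C sketch of this reversal for the first of the four component generators follows; the constants a = 45991 and m = 2147483647 are taken from the first block of RNGGenVal (46693 and 25884 are its Schrage decomposition, m/a and m mod a), while the function names and the use of 64-bit arithmetic instead of Schrage's trick are choices made for this illustration:

    typedef long long i64;

    static long mulmod(long x, long y, long m)       /* (x*y) mod m via 64-bit math */
    {
        return (long)(((i64)x * (i64)y) % (i64)m);
    }

    static long powmod(long base, long exp, long m)  /* base^exp mod m (square and multiply) */
    {
        long r = 1;
        while (exp > 0) {
            if (exp & 1) r = mulmod(r, base, m);
            base = mulmod(base, base, m);
            exp >>= 1;
        }
        return r;
    }

    /* Undo one forward step of the first component: given s_n, return s_{n-1}. */
    long reverse_seed(long s_n)
    {
        const long a = 45991, m = 2147483647;
        long b = powmod(a, m - 2, m);   /* multiplicative inverse of a mod m (m is prime) */
        return mulmod(b, s_n, m);
    }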

Page 15

Reverse Code Efficiency...

Future RNGs may result in even greater savings. Consider the MT19937 generator:
  • has a period of 2^19937
  • uses 2496 bytes of state for a single "generator"

Property: non-reversibility of individual steps does NOT imply that the computation as a whole is not reversible. Can we automatically find this "higher-level" reversibility?

Other reversible structures include:
  • circular shift operations (a sketch follows below)
  • insertion and deletion operations on trees (i.e., priority queues)

Reverse computation is well-suited for queueing network models!
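For instance, a circular shift needs no saved bytes at all, since rotating left by k is exactly undone by rotating right by k; a minimal illustrative sketch:

    #include <stdint.h>

    /* A circular (rotate) shift is perfectly reversible with zero saved state. */
    static uint32_t rotl32(uint32_t x, unsigned k) { k &= 31; return (x << k) | (x >> ((32 - k) & 31)); }
    static uint32_t rotr32(uint32_t x, unsigned k) { k &= 31; return (x >> k) | (x << ((32 - k) & 31)); }

    void shift_forward(uint32_t *x, unsigned k) { *x = rotl32(*x, k); }
    void shift_reverse(uint32_t *x, unsigned k) { *x = rotr32(*x, k); }  /* exact inverse */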

Page 16

Performance Study

Platform:
  • SGI Origin 2000, 16 processors (R10000), 4 GB RAM

Model:
  • 3 levels of multiplexers, fan-in N
  • N^3 sources => N^3 + N^2 + N + 1 entities in total; e.g., N=4 => 85 entities, N=64 => 266,305 entities

Page 17

Reverse Computation executes significantly faster than State Saving!

[Figure: ratio of the event rate of reverse computation to the event rate of state saving (y-axis, 0 to 3.5) versus the number of processors (x-axis, 1 to 12), with curves for fan-in 4, 12, 32, and 48; event rates are measured in million events/second.]

Why the large increase in parallel performance?

Page 18

Cache Performance...

Faults (12 PEs)     TLB           Primary cache     Secondary cache
State saving (SS)   43,966,018    1,283,032,615     162,449,694
Reverse comp. (RC)  11,595,326      590,555,715      94,771,426

Page 19

Related Work...

Reverse computation has been used in low-power processors, debugging, garbage collection, database recovery, reliability, etc.

All previous work either prohibits irreversible constructs or uses a copy-on-write implementation for every modification (corresponding to incremental state saving).

Many operate at a coarse, virtual-page level.

Page 20

Contributions

We show that RC makes Time Warp usable for fine-grain models!
  • disproves the previous belief that "fine-grain models can't be optimistically simulated efficiently"
  • less memory consumption, more speed, without extra user effort

RC generalizes state saving, e.g., incremental state saving and copy state saving.

For certain data types, RC is more memory efficient than SS, e.g., priority queues.

Page 21

Future Work

Develop state-minimization algorithms, via:
  • State compression: the bit size needed for reversibility is less than the bit size of the data variables.
  • State reuse: the same state bits can serve different statements, based on liveness, analogous to register allocation.

Complete the RC automation algorithm design, avoiding the straightforward incremental state saving approach, for:
  • lossy integer and floating-point arithmetic
  • jump statements
  • recursive functions

Page 22

Geronimo! System Architecture

[Figure: a high-performance simulation application running on Geronimo, a distributed compute server of multiprocessor rack-mounted CPUs connected by Myrinet (not in the demonstration).]

Geronimo features: (1) "risky" or "speculative" processing of object computations, (2) reverse computation to support the "undo" operation, and (3) "Active Code", in a combined, heterogeneous, shared-memory / message-passing environment...

Page 23

Geronimo!: "Risky" Processing...

Error detection and rollback: when a "straggler" thread arrives in an object's past, the object (1) undoes its state changes and (2) cancels the "scheduled" tasks.

[Figure: virtual-time diagram for Object 1, Object 2, Object 3 showing processed, "straggler", and unprocessed threads.]

Execution framework:
  • Objects
  • schedule Threads / Tasks
  • at some "virtual time"

Applications:
  • discrete-event simulations
  • scientific computing applications

CAVEAT: good performance relies on (cost of recovery * probability of failure) being less than the cost of being "safe"!

Page 24

Geronimo!: Efficient "Undo"

Traditional approach: state saving
  • save byte-copies of modified items
  • high overhead for fine-granularity computations
  • large memory utilization
  • need an alternative for large-scale, fine-grain simulations

Our approach: reverse computation (joint with Kalyan Perumalla and Richard Fujimoto)
  • automatically generate reverse code from the model source
  • utilize the reverse code to do rollback
  • negligible overhead for forward computation
  • significantly lower memory utilization

Observation: "reverse" computation treats "code" as "state". This results in a code-state duality. Can we generalize the notion?...

Page 25

Geronimo!: Active Code

Key idea: allow object methods/code to be dynamically changed at run-time.
  • Objects can schedule, at a future time, a new method, or re-define old methods, of other objects and themselves.
  • Objects can erase/delete methods on themselves or other objects.
  • New methods can contain "Active Code" which can re-specialize itself or other objects.
  • Works in a heterogeneous environment.

How is this useful?
  • Increases performance by allowing the program to consistently "execute the common case fast".
  • Adaptive, perturbation-free monitoring of distributed systems.
  • Potential for increasing a language's "expressive power".

Our approach? Java... no, we need higher performance (maybe usable in the future). A special compiler... no, it can't keep up with changes to microprocessors.

Page 26

Geronimo!: Active Code Implementation

Runtime infrastructure:
  • modifies the source code tree
  • starts a rebuild of the executable on another existing machine
  • uses the system's native compiler

Re-exec system call:
  • reloads only the new text (code) segment of the new executable
  • fixes up the old stack to reflect the new code changes
  • fixes up pointers to functions
  • will run in "user space" for portability across platforms

Language preprocessor:
  • instruments code to support stack and function-pointer fix-up
  • instruments code to support stack reconstruction and the re-start process

Page 27

Research Issues

  • Software architecture for the heterogeneous, shared-memory / message-passing environment.
  • Development of distributed algorithms that are fully optimized for this "combination" environment.
  • What language to use for development: C, C++, or both?
  • Geronimo! API.
  • Active Code language and systems support.
  • Mapping relevant application types to this framework.

Homework problem: can you find specific applications/problems where we can apply Geronimo!?