Rajiv Gupta

Chen Tian, Min Feng, Vijay Nagarajan

Speculative Parallelization of Applications on Multicores

Speculative Parallelization

Goal: exploit parallelism that is frequently observed but not guaranteed to be present, because of possible dependences:

  Dependences due to cold code

  Dependences that are harmless

  Dependences that are silent

Outline

Thread-based Execution Model

  Non-speculative thread and state

  Speculative threads and state

  Committing results

  Rollback-free recovery

Software Only – Coarse-grained Parallelism

  Speculative parallelization of loops

  Speedups on a real machine

Execution Model

Main Thread

  Performs non-speculative computation: non-parallelizable code and parts of the parallelized code.

  Controls the parallel threads: initialization and memory allocation, termination and misspeculation checks, and in-order commit of results.

Multiple Parallel Threads

  Perform speculative computations, e.g., speculative loop bodies.

Execution Model

[Figure: the static loop code is split into prologue, speculative body, and epilogue. Under sequential execution, iterations 1 to 4 run one after another, each as prologue (p), speculative body (sp), and epilogue (e). Under parallel execution, the work is spread over the main thread and parallel threads P1 and P2.]

1. Results are committed in order.

2. At any time, only two threads are executing, so the main thread does not require a separate core.

Memory State

Non-Speculative State (D space)

  Maintained by the main thread.

Speculative State (P space)

  Allocated by the main thread and used by parallel threads.

  Results are either committed to D space or discarded.

Coordinating State (C space)

  Version numbers for variables in D space.

  Mapping table for variables in P space.

[Figure: the main thread with its D space, and the parallel threads with their P spaces, linked through C space.]

Copy Operations

Naïve scheme

  Copy-in: copy values from D space to P space when work is assigned.

  Copy-out: copy variable values from P space to D space when the speculation check succeeds.

Optimized scheme

  Use profiling to discover the access pattern of each variable in the speculative loop body: In-Out, Only-In, Only-Out, or Thread-Local.

  Unknown variables (those untouched in the profiling run) are copied on the fly through message passing.

Mapping table

  Holds the mapping information for copied variables: D space address, P space address, size, version, and write flag.

  Updated when variables are copied into P space.

  Consulted when variables are copied back to D space.
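The copy-in and copy-out steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: dictionaries stand in for the D and P address spaces, variable names stand in for addresses, and `MappingEntry`, `copy_in`, `store`, and `copy_out` are hypothetical names that do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class MappingEntry:
    """One mapping-table row (names stand in for real addresses)."""
    d_name: str            # stands in for the D space address
    p_name: str            # stands in for the P space address
    size: int              # placeholder; a real table would store a byte count
    version: int           # version seen when the variable was copied in
    written: bool = False  # write flag


def copy_in(d_space, versions, p_space, table, name):
    """Copy one variable from D space to P space and record it in the table."""
    p_space[name] = d_space[name]
    table[name] = MappingEntry(name, name, 1, versions[name])


def store(p_space, table, name, value):
    """Speculative store: update the private copy and set the write flag."""
    p_space[name] = value
    table[name].written = True


def copy_out(d_space, versions, p_space, table):
    """Commit written variables back to D space and bump their versions."""
    for entry in table.values():
        if entry.written:
            d_space[entry.d_name] = p_space[entry.p_name]
            versions[entry.d_name] += 1


d_space = {"error_num": 0, "limit": 10}
versions = {"error_num": 0, "limit": 0}
p_space, table = {}, {}
copy_in(d_space, versions, p_space, table, "error_num")
copy_in(d_space, versions, p_space, table, "limit")
store(p_space, table, "error_num", p_space["error_num"] + 1)
copy_out(d_space, versions, p_space, table)
print(d_space, versions)
```

Only the variable whose write flag is set is copied back, so a variable that was merely read leaves both D space and its version number unchanged.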

Misspeculation Check

Version number (maintained by the main thread)

  Kept for each variable that is potentially read or written by parallel threads.

  Copied into the mapping table when the corresponding variable is copied into P space.

Misspeculation check (performed by the main thread)

  For every entry in the mapping table, compare its version number with the one maintained by the main thread.

  If all match, the speculation succeeds: perform the copy-out operations and update the version numbers accordingly.

  If any version number differs, the speculation fails because an earlier thread has changed that variable's value: re-execute the speculative body with the latest value.

  Value-based dependence check.

Rollback-free recovery
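A minimal sketch of the version comparison and the re-execution path, assuming dictionaries for D space and for the version numbers. The function names and the `body` that increments `error_num` are invented for illustration; the slides do not give an implementation.

```python
def speculation_check(copied_versions, current_versions):
    """True if no variable copied by the thread has been committed since."""
    return all(current_versions[n] == v for n, v in copied_versions.items())


def copy_in(d_space, versions):
    """Give a thread private values plus the versions it saw at copy time."""
    return {"p_space": dict(d_space), "copied": dict(versions)}


def body(p_space):
    """Hypothetical speculative body: the rare path that bumps error_num."""
    p_space["error_num"] += 1
    return {"error_num"}  # set of variables written


def commit(d_space, versions, p_space, written):
    """Copy-out: publish written variables and update their versions."""
    for name in written:
        d_space[name] = p_space[name]
        versions[name] += 1


d_space, versions = {"error_num": 0}, {"error_num": 0}

# Two iterations start together, so both copy version 0 of error_num.
threads = [copy_in(d_space, versions) for _ in range(2)]
results = [body(t["p_space"]) for t in threads]

reexecutions = 0
for t, written in zip(threads, results):      # in-order check and commit
    while not speculation_check(t["copied"], versions):
        t.update(copy_in(d_space, versions))  # refresh with the latest values
        written = body(t["p_space"])          # re-execute; D space untouched
        reexecutions += 1
    commit(d_space, versions, t["p_space"], written)

print(d_space, versions, reexecutions)
```

The second iteration fails its check because the first one committed a new version of `error_num`. Recovery needs no rollback: the stale result lived only in the private copy, so it is simply overwritten by the re-execution.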

On-the-fly Copying

Access checks consult the mapping table at:

  Loads and stores

  Pointer assignments

Reducing the overhead of access checks

  Stack and global variables: based on classification (In-Out, Only-In, Only-Out, Thread-Local).

  Heap: optimizations beyond classification.

    Locally created objects require no checks.

    Once an object is copied, its other fields are accessed without checking.

    Copy-on-write only: no checks are needed at loads. Since the version number is not copied on a read, misspeculation detection is implicitly carried out by another copied variable.
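The access check can be pictured as a lookup that copies a variable the first time it is touched. The sketch below is an assumption-laden illustration: the `PSpace` class is hypothetical, a direct dictionary read stands in for the message sent to the main thread, and both loads and stores are checked (that is, without the copy-on-write-only optimization).

```python
class PSpace:
    """Private state of one parallel thread, filled lazily from D space."""

    def __init__(self, d_space, versions):
        self._d_space = d_space    # direct read stands in for message passing
        self._versions = versions
        self.values = {}           # private copies
        self.table = {}            # mapping table: name -> copied version
        self.copies = 0            # number of on-the-fly copies performed

    def _check(self, name):
        """Access check: copy the variable in if the table has no entry."""
        if name not in self.table:
            self.values[name] = self._d_space[name]
            self.table[name] = self._versions[name]
            self.copies += 1

    def load(self, name):
        self._check(name)
        return self.values[name]

    def store(self, name, value):
        self._check(name)
        self.values[name] = value


d_space = {"a": 1, "b": 2, "c": 3}
versions = {"a": 0, "b": 0, "c": 0}
p = PSpace(d_space, versions)
p.store("a", p.load("a") + p.load("b"))  # touches a and b, never c
print(p.values, p.copies)
```

Because `c` is never accessed, it is never copied and never enters the mapping table, so a later commit to `c` by an earlier thread could not cause this thread to fail its check.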

Other Enhancements

Reducing thread idling

  Scenario: an earlier thread finishes its task, but the main thread is still assigning tasks to later threads and cannot yet handle the earlier thread. Performance fell when 4 or more parallel threads were used.

  Solution: assign more work to each thread by loop unrolling.

Reducing the misspeculation rate

  Scenario: the value of a speculative variable used by a thread is changed by an earlier thread, so the speculation fails. For benchmark 181.mcf, the misspeculation rate rises as more threads are used.

  Solution: delay the copying of some variables (the on-the-fly mechanism), which increases the chance of getting the latest version.
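The effect of unrolling on the main thread's workload can be illustrated by grouping iterations into larger tasks. This is a rough sketch, not the transformation used in the talk: `unroll` is a hypothetical helper, and the factor of 4 is an arbitrary choice.

```python
def unroll(iterations, factor):
    """Group consecutive iterations so that each task carries `factor` of them."""
    return [iterations[i:i + factor] for i in range(0, len(iterations), factor)]


iterations = list(range(10))
tasks = unroll(iterations, 4)

# Each task costs the main thread one assignment, one check, and one commit.
handoffs_before = len(iterations)
handoffs_after = len(tasks)
print(tasks, handoffs_before, handoffs_after)
```

With fewer, larger tasks the main thread spends less time on assignment and commit per unit of work, so a finished thread is less likely to wait for it.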

Speculative Parallelization

An example from 197.parser:

    while () {
        line = read_one_line(input_file);
        if (line cannot be parsed) {
            error_num++;
        } else {
            result = parse(line);
        }
        line_handled++;
        print(result);
    }

Prologue

  Input statements (e.g., fgets).

  Loop counters.

Epilogue

  Output statements (e.g., printf).

  Statements highly dependent on the previous iteration, e.g., line_handled.

Speculative body

  The remainder. The loop-carried dependence on error_num rarely manifests itself.

Main Thread

Original loop:

    while () {
        Prologue code;
        Speculative body code;
        Epilogue code;
    }

Main thread after transformation:

    /* Create threads and initialize their tasks */
    for (i = 0; i < Num_Proc; i++) {
        allocate P and C space for thread i;
        Prologue code;
        create thread i to execute thrd_func(i);
    }
    i = 0;
    while () {
        /* Handle misspeculation */
        while (speculation_check(i) == FAIL) {
            update P and C space for thread i;
            re-execute thrd_func(i);
        }
        /* In-order commit */
        commit result and execute Epilogue code;
        /* Assign a new iteration */
        Prologue code;
        update P and C space for thread i;
        ask thread i to execute thrd_func(i);
        i = (i + 1) % Num_Proc;
    }
    wait for all threads' completion and execute Epilogue code;
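The main-thread loop can be approximated with a small runnable Python sketch built on the parser-style example. Several things here are assumptions made for illustration: `spec_body` and `run` are invented names, `int(line) * 2` stands in for `parse(line)`, a thread pool replaces explicitly created threads, and on misspeculation the main thread re-runs the body itself instead of asking thread i to do so.

```python
from concurrent.futures import ThreadPoolExecutor


def spec_body(line, error_num):
    """Speculative body on private copies.

    Returns (result, new_error_num); new_error_num is None when error_num
    was never touched, so there is nothing to check or commit for it.
    """
    try:
        return int(line) * 2, None       # stand-in for parse(line)
    except ValueError:
        return None, error_num + 1       # rare path: error_num++


def run(lines, num_proc=2):
    d_space = {"error_num": 0}           # non-speculative state
    version = 0                          # C space: version of error_num
    output, reexecuted = [], 0
    it = iter(lines)

    def assign(pool):
        """Prologue (read a line), copy-in, then start the speculative body."""
        line = next(it, None)
        if line is None:
            return None
        return line, version, pool.submit(spec_body, line, d_space["error_num"])

    with ThreadPoolExecutor(max_workers=num_proc) as pool:
        slots = [assign(pool) for _ in range(num_proc)]
        i = 0
        while slots[i] is not None:
            line, seen, future = slots[i]
            result, new_err = future.result()              # wait for "finish"
            while new_err is not None and seen != version:  # misspeculation
                seen = version
                result, new_err = spec_body(line, d_space["error_num"])
                reexecuted += 1
            if new_err is not None:                        # in-order commit
                d_space["error_num"] = new_err
                version += 1
            output.append(result)                          # epilogue: print(result)
            slots[i] = assign(pool)                        # assign a new iteration
            i = (i + 1) % num_proc
    return output, d_space["error_num"], reexecuted


print(run(["1", "x", "oops", "4"]))
```

Slot `i` always holds the oldest outstanding iteration, which is what makes the round-robin index give an in-order commit. In the sample input the two unparsable lines are in flight together, so the later one is re-executed once with the updated `error_num`.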

Parallel Thread

Original loop:

    while () {
        Prologue code;
        Speculative body code;
        Epilogue code;
    }

Parallel thread function:

    void *thrd_func(i) {
        while (1) {
            wait for the "start" message;
            Speculative body code;  /* with checks preceding/following loads, stores, and pointer assignments */
            send "finish" message;
        }
    }
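The start/finish handshake of the parallel thread can be sketched with Python queues. This is a simplified illustration, not the authors' implementation: the queue-based messaging, the `body` that sums a list, and the extra "stop" message (added so the example terminates) are all assumptions.

```python
import queue
import threading


def thrd_func(inbox, outbox, body):
    """Parallel thread: wait for "start", run the body, send "finish"."""
    while True:
        msg, p_space = inbox.get()       # wait for the "start" message
        if msg == "stop":                # hypothetical shutdown message
            break
        body(p_space)                    # speculative body on P space only
        outbox.put(("finish", p_space))  # send the "finish" message


def body(p_space):
    """Hypothetical speculative body: works only on its private P space."""
    p_space["sum"] = sum(p_space["data"])


inbox, outbox = queue.Queue(), queue.Queue()
worker = threading.Thread(target=thrd_func, args=(inbox, outbox, body))
worker.start()

finished = []
for chunk in ([1, 2, 3], [10, 20]):
    inbox.put(("start", {"data": chunk}))  # main thread assigns work
    finished.append(outbox.get())          # main thread waits for "finish"

inbox.put(("stop", None))
worker.join()
print(finished)
```

The thread never touches shared state directly; everything it reads or writes arrives in, and leaves through, the P space carried by the messages.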

Experimental Setup

[Figure: tool flow. A profiling tool (Pin) takes the binary and a small input and produces the dependence graph and access patterns. The compiler infrastructure (LLVM) takes the source code, the symbols from objdump, and a transformation template, and produces an x86 binary (-native option).]

Machine: Dell PowerEdge 1900, two Intel Xeon quad-core processors, 3 GHz, 16 GB.

Experimental Setup

Benchmarks

  5 SPEC benchmarks: 197.parser, 181.mcf, 130.li, 256.bzip2, and 255.vortex.

  1 MiBench benchmark: CRC32, which achieved the best speedup among all benchmarks.

Variables in the speculative body (obtained via profiling):

    Programs     Only-In  Only-Out  Thread-Local  In-Out
    197.parser   49       6         12            2
    181.mcf      5        0         5             3
    130.li       30       0         3             6
    256.bzip2    12       8         11            1
    255.vortex   76       5         4             6
    CRC32        1        0         2             1

Machine: Dell PowerEdge 1900 server with two quad-core processors, 3 GHz, and 16 GB.

Execution Speedups

All benchmarks achieve their best speedup when 8 threads are used.

The highest speedups range from 3.7 to 7.8 across all benchmarks.

Thread Idling

Delayed Copying

Without delayed copying, the misspeculation rate of 181.mcf increases from 0.7% to 17.5% as the number of parallel threads increases from 2 to 8.

With delayed copying, the misspeculation rate of 181.mcf stays below 10%.

The misspeculation rate of the other benchmarks is less than 2%.

[Figure: chart with the number of threads on the horizontal axis.]

Copy Optimization

Three schemes were considered:

1. All: all variables are copied before the parallel thread starts work. Unnecessary copying occurs.

2. On-the-fly: all variables are copied on the fly via message passing. Every variable must be checked to see whether it has already been copied into P space.

3. Opt.: profiling is used to determine when to copy.

Results are shown for 4 threads.

Opt. outperforms the other two schemes.

On-the-fly outperforms All when heap accesses dominate (bzip2, mcf).

Overhead – Instruction Count

Overhead breakdown per core when 8 threads are used.

No more than 7% of total instructions are used for operations related to the execution model.

    Programs     Copy on Start  Copy On-the-fly  Exception Check  Misspeculation Check  Setup
    197.parser   3.51%          0.33%            0.02%            1.76%                 0.62%
    181.mcf      0.08%          0%               0%               1.08%                 0.07%
    130.li       1.32%          0.25%            0.06%            1.03%                 0.48%
    256.bzip2    1.97%          0.13%            0.08%            2.81%                 2.15%
    255.vortex   5.28%          0.04%            0.01%            1.25%                 0.39%
    CRC32        0.02%          0%               0%               0.01%                 0.32%

Overhead – Memory Space

For most benchmarks, the space overhead is around 2-3x.

256.bzip2: a large chunk of heap needs to be copied to P space.