Source: wahn/papers/PRES/present_ppopp06.pdf (28 pages)

Page 1

POSH: A TLS Compiler that Exploits Program Structure

Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau† and Josep Torrellas

Department of Computer Science
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu

†Computer Engineering Department
University of California, Santa Cruz

Page 2

POSH: A TLS Compiler that Exploits Program Structure Wei Liu et al, University of Illinois 2

CMPs are available now!

Chip multiprocessors (CMPs) have arrived
• Pentium D, Opteron, Power5, Niagara, …

Good speedups for parallel programs

How to speed up single-threaded applications?
• Especially the hard-to-parallelize applications, such as SPECint

Thread-Level Speculation (TLS)

Page 3

What is Thread Level Speculation (TLS)?

Execute potentially-dependent tasks in parallel
• Assume no dependence across tasks will be violated

TLS hardware
• Track memory accesses; buffer unsafe state
• Detect any violation (RAW)
• Squash offending tasks, repair polluted state, restart tasks

for (i = 0; i < n; i++) {
    … = A[B[i]] …
    …
    A[C[i]] = …
}

Figure: three iterations of the loop shown side by side.
    Iteration J:    … = A[4] …    A[5] = …
    Iteration J+1:  … = A[2] …    A[2] = …
    Iteration J+2:  … = A[5] …    A[6] = …
RAW: iteration J+2 reads A[5], which iteration J writes.
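The RAW check on this slide can be sketched in software. The following is a minimal illustration, not the TLS hardware itself: it assumes the worst case, where a later iteration always performs its read before an earlier in-flight iteration performs its write. The function name, the `window` parameter (number of iterations in flight), and the array names are hypothetical.

```c
#include <stddef.h>

/* Count iterations that would be squashed by a cross-iteration RAW violation.
 * read_idx[i] plays the role of B[i], write_idx[i] the role of C[i].
 * Worst-case assumption: the later read always precedes the earlier write. */
int count_squashed_iterations(const int *read_idx, const int *write_idx,
                              size_t n, size_t window)
{
    int squashed = 0;
    if (window < 2)
        return 0;   /* one iteration at a time: nothing can be violated */
    for (size_t k = 1; k < n; k++) {
        /* Oldest iteration that may still be running alongside iteration k. */
        size_t first = (k >= window - 1) ? k - (window - 1) : 0;
        for (size_t j = first; j < k; j++) {
            if (write_idx[j] == read_idx[k]) {
                squashed++;   /* iteration k consumed a stale value */
                break;
            }
        }
    }
    return squashed;
}
```

With the slide's values (reads of A[4], A[2], A[5] and writes of A[5], A[2], A[6]), only iteration J+2 conflicts, and only if iteration J is still in flight. The read and write of A[2] fall in the same iteration, so they are not a cross-task dependence.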

Page 4

Novel Compiler and Architecture Optimizations for TLS Wei Liu, University of Illinois 4

Key to the Acceptance of TLS

Fully automated TLS compilers
• Do not need to prove the independence between tasks
• Crucial impact on performance
  • How to break the code into tasks
  • When to spawn the tasks

Page 5

Contributions

POSH, a fully automated TLS compiler infrastructure
• Exploits code structure (loop iterations and subroutines of any nesting level) for tasks
• Uses a simple profiling pass to discard ineffective tasks
• Leverages both parallelism and data prefetching

Speedup: 1.30 on average for SPECint 2000

Detailed characterization of speedup sources and task behavior
• 26% of the speedup is from prefetching (i.e. 8% w.r.t. the baseline)

Page 6

Outline

• Background of TLS
• Contributions of POSH
• POSH: A TLS Compiler
• Experimental Results
• Conclusions

Page 7

Flowchart of the POSH Framework

Figure: flowchart. Compiler passes in gcc-3.5: Task Selection, Value Prediction, Spawn Hoisting, Refinement. Other labels in the chart: Program Structure, Dependence Restriction, Placement, Task Size, Profiling Feedback, Parallelism. Remaining boxes: Generate Binary Code, Profiler.

Page 8

Hardware Assumptions

No special hardware for register communication
• Dependence detection through memory

Support spawn and commit instructions

Page 9

Task Selection

Figure: the POSH framework flowchart (compiler passes in gcc-3.5), repeated; the code-generation box is labeled Generate Task Code.

Page 10

Task Selection

Break the sequential program into tasks

Select the following structures
• Subroutines
• Subroutine continuations
• Loop iterations
• Loop continuations
• Continuation == Code region that follows subroutines or loops

Mark the begin point for each task

Figure: code regions A, B, C and D, each marked with its begin point (Begin point A, Begin point B, Begin point C, Begin point D).
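To make the four kinds of structure concrete, here is a small hand-annotated C function. The comments mark where begin points could fall under one reading of the list above. The function names are hypothetical and the annotations are illustrative, not actual POSH output.

```c
#include <stddef.h>

/* Subroutine: its body is a candidate subroutine task,
 * with its begin point at the function entry. */
static int weigh(int x)
{
    return x * x + 1;
}

int checksum(const int *data, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* begin point: loop iteration task */
        int w = weigh(data[i]);
        /* begin point: subroutine continuation (code that follows the call) */
        sum += w;
    }
    /* begin point: loop continuation (code that follows the loop) */
    return sum;
}
```

Each marked point is only a candidate; in the slides, the later passes and the profiler decide which candidate tasks are kept.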

Page 11

Value Prediction

Figure: the POSH framework flowchart (compiler passes in gcc-3.5), repeated; the code-generation box is labeled Generate Task Code.

Page 12

Value Prediction

Predict the values of variables that cross task boundaries
• Reduce violations

Software Value Predictor
• Function return variables
• Loop induction variables
• Induction-like variables

for (i = 0; i < 100; i++) {
    …
    if (result == 0) {
        j++; /* induction-like variable */
    }
    …
}
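The slides do not say how the software value predictor computes its guess. As a rough sketch, the code below assumes a last-value predictor for the induction-like variable `j`: the next iteration starts with the value `j` had when the current iteration began, and the guess is checked when the current iteration finishes. The function name and the choice of predictor are assumptions made for illustration.

```c
#include <stddef.h>

/* Count iterations whose successor would start with a wrong guess for j.
 * Assumed predictor: "j does not change during this iteration". */
size_t count_mispredictions(const int *result, size_t n)
{
    int j = 0;          /* actual value of the induction-like variable */
    size_t wrong = 0;
    for (size_t i = 0; i < n; i++) {
        int predicted_next = j;     /* value handed to the next task */
        if (result[i] == 0)
            j++;                    /* conditional, induction-like update */
        if (predicted_next != j)    /* verification at the end of the task */
            wrong++;
    }
    return wrong;
}
```

Under this assumption a misprediction occurs exactly when the branch is taken, so the predictor pays off when `result == 0` is rare. A predictor that guessed `j + 1` would suit the opposite case.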

Page 13

Spawn Hoisting

Figure: the POSH framework flowchart (compiler passes in gcc-3.5), repeated; the code-generation box is labeled Generate Task Code.

Page 14

Spawn Hoisting

The spawn instruction is hoisted relative to the beginning of a task

Restrictions on hoisting (see paper for more details)
• Definition of any variables in the task
• Control safe location
• Reverse sequential order for multiple spawn instructions

Figure: tasks A and B, with a "Spawn B" instruction.

Page 15

Refinement

Figure: the POSH framework flowchart (compiler passes in gcc-3.5), repeated; the code-generation box is labeled Generate Task Code.

Page 16

The POSH Profiler

Figure: the POSH framework flowchart (compiler passes in gcc-3.5), repeated; the code-generation box is labeled Generate Task Code.

Page 17

POSH Profiler

Runs a TLS binary with tasks selected by the compiler
• Uses train input set

Executes the TLS binary sequentially
• No TLS architectural support assumed for generality
• With rudimentary timing, it can collect information about potential run-time violations

Provides a simple L2 cache model to estimate misses

Not tied to a fixed number of processors

Page 18

Profiling Pass

Goal: Minimize Execution Time

Preserve parallelism
• Find tasks with the best overlap

Reward prefetching due to task squashes
• Keep squashed tasks that prefetch even with small or no overlap

Benefit = Overlap + Prefetching
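Reading the slide's formula as Benefit = Overlap + Prefetching, a task filter could look like the sketch below. The struct, the units (cycles), and the `threshold` cut-off are hypothetical; the slides do not give the actual criterion POSH uses to discard tasks.

```c
#include <stddef.h>

typedef struct {
    double overlap;      /* estimated cycles of useful parallel overlap */
    double prefetching;  /* estimated cycles saved by prefetched misses */
} task_profile;

/* Benefit = Overlap + Prefetching, as on the slide. */
double task_benefit(const task_profile *t)
{
    return t->overlap + t->prefetching;
}

/* Compact the array in place, keeping tasks whose benefit exceeds
 * the (assumed) threshold. Returns the number of tasks kept. */
size_t keep_beneficial(task_profile *tasks, size_t n, double threshold)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (task_benefit(&tasks[i]) > threshold)
            tasks[kept++] = tasks[i];
    }
    return kept;
}
```

Because the two terms are summed, a task with no overlap at all can still survive on its prefetching term alone, which matches the slide's point about keeping squashed tasks that prefetch.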

Page 19

Estimate Overlap

Figure: timeline of profiler execution for Task 1 and Task 2. Marked times: TSpawn (Spawn Task 2), TSpawn+Toverhead, TBegin, TEnd, TStore (ST X) and TLoad (LD X). One region is labeled "Profitable Overlap".

TLoad < TStore !!!
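One way to turn these timestamps into an overlap estimate is sketched below. It rests on assumptions the slide does not spell out: Task 2 would start at TSpawn + Toverhead under TLS, a load of X projected to occur before Task 1's store to X triggers a squash, and after a squash useful work only counts from the store onward. The struct and function names are hypothetical.

```c
typedef struct {
    double t_spawn;     /* time Task 1 executes the spawn */
    double t_overhead;  /* assumed cost of starting a task */
    double t_end;       /* time Task 1 finishes */
    double t_store;     /* time Task 1 stores X; negative if there is none */
    double t_load;      /* projected time Task 2 loads X; negative if none */
} overlap_sample;

/* Estimated time during which Task 2 does useful work alongside Task 1. */
double profitable_overlap(const overlap_sample *s)
{
    double start = s->t_spawn + s->t_overhead;
    int has_dep = s->t_store >= 0.0 && s->t_load >= 0.0;

    /* TLoad < TStore: the load would read a stale value, so the task is
     * squashed; assume it restarts once the store has been performed. */
    if (has_dep && s->t_load < s->t_store && s->t_store > start)
        start = s->t_store;

    double overlap = s->t_end - start;
    return overlap > 0.0 ? overlap : 0.0;
}
```

For example, with a spawn at 100, overhead 20 and Task 1 ending at 500, an independent Task 2 overlaps for 380 time units; if it loads X at 200 while Task 1 stores X at 300, the estimate drops to 200.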

Page 20

Estimate Prefetching

Figure: timeline of profiler execution for Task 1 and Task 2 (spawned at "Spawn Task 2"). Labeled accesses: ST X, LD X and LD Y (each load appears twice). One region is labeled "Profitable Prefetching".

Page 21

Outline

• Background of TLS
• Contributions of POSH
• POSH: A TLS Compiler
• Experimental Results
• Conclusions

Page 22

Methodology

Simulated CMP architecture
• 4 GHz
• Private 16 KB L1 per core; 1 MB shared L2

Unmodified SPECint 2000 applications

TLS CMP: four 3-issue cores with TLS

Serial: one 3-issue core

Page 23

Task Characterization

Static Information
• On average 32.2 Subr Tasks
• On average 7.3 loops with Loop Tasks
• About 6 live-ins and 6 live-outs per task

Dynamic Information
• Average task size: 476 instructions
• Most tasks have 50-1000 dynamic instructions
• 45.7% of tasks commit

Page 24

Application Speedups

Using both forms of tasks is simple and effective

TLS: an average speedup of 1.30!

Figure: bar chart of speedup over Serial (y-axis 0 to 2) for bzip2, crafty, gap, gzip, mcf, parser, twolf, vortex, vpr and G.mean; series Subr, Loop and Subr+Loop. Labeled value: 1.30.

Page 25

Contribution of Prefetching

Figure: bar chart of speedup over Serial (y-axis 0 to 2) for bzip2, crafty, gap, gzip, mcf, parser, twolf, vortex, vpr and G.mean; series TLS_NoPrefetch and TLS. Labeled G.mean values: 1.22 (TLS_NoPrefetch) and 1.30 (TLS).

About a quarter of the speedup is from prefetching
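The "about a quarter" figure follows from the two mean speedups labeled in the chart, 1.30 and 1.22. A small helper makes the arithmetic explicit. The function name is hypothetical, and the serial baseline is taken to be a speedup of 1.0.

```c
/* Share of the gain over the serial baseline (speedup 1.0) that is lost
 * when prefetching is not counted. The caller must ensure that
 * speedup_tls is not exactly 1.0, to avoid dividing by zero. */
double prefetch_share(double speedup_tls, double speedup_no_prefetch)
{
    return (speedup_tls - speedup_no_prefetch) / (speedup_tls - 1.0);
}
```

With 1.30 and 1.22, the difference is 0.08 (the 8% with respect to the baseline) and the share is 0.08 / 0.30, roughly 0.27. The Contributions slide quotes 26%, presumably computed from unrounded speedups.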

Page 26

Effectiveness of Profiler

Figure: bar chart of speedup over Serial (y-axis 0 to 2) for bzip2, crafty, gap, gzip, mcf, parser, twolf, vortex, vpr and G.mean; series No Profiler and With Profiler. Labeled G.mean values: 1.04 (No Profiler) and 1.30 (With Profiler).

Profiling is needed for good performance

Profiler reduces the average number of tasks from 198 to 39

Page 27

Conclusions

POSH: A fully automated TLS compiler

• Simple: Use program structure

• Effective: Prefetching + Parallelism

• Speedup: on average 1.30 for SPECint 2000
