Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program...

56
Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson Purdue University

Transcript of Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program...

Page 1: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Algorithm Composition usingDomain-Specific Libraries

and

Program Decomposition forThread-Level Speculation

Troy A. JohnsonPurdue University

Page 2: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 2

My Background

• ParaMount research group at Purdue ECE– http://cobweb.ecn.purdue.edu/ParaMount– high-performance computing, compilers, software

tools, automatic parallelization

• Computational Science & Engineering Specialization– http://www.cse.purdue.edu

Page 3: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

3

Compilers

ArtificialIntelligence

Scientific& Parallel

Computing

Architecture

Web, Grid, & Peer-to-Peer (P2P)Circuits

ProgrammingLanguages

Cetus – A Source-to-Source Transforming C Compiler

barrier synchronization

Kosha – P2P Enhancement for the Network File System

algorithm composition

thread-level speculation for multi-core

transactions

OS3 – Online Student Survey System double-edge triggered flip-flops

LCPC 03, 04

LCPC International Workshop on Languages and Compilers for Parallel Computing

PLDI 06

PLDI ACM SIGPLAN Conf. on Programming Language Design & Implementation

PLDI 04PPoPP 07

PPoPP ACM SIGPLAN Symp. on the Principles and Practice of Parallel Programming

LCPC 06

WMPP IEEE Workshop on Massively Parallel Processing

WMPP 01

ICECS 01

ICECS IEEE International Conf. on Electronics, Circuits, and Systems

SC 04 JGC 06

SC ACM/IEEE Conference on SupercomputingJGC Journal of Grid Computing

FIE 01 JCAEE 02

JCAEE Journal of Computer Applications in Engineering EducationFIE ASEE/IEEE Frontiers in Education Conference

Page 4: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 4

Outline

• Part I: Algorithm Composition using Domain-Specific Libraries

• Part II: Program Decomposition for Thread-Level Speculation

• Part III: Future Research, Funding, & Opportunities for Collaboration

PLDI 06PLDI 04PPoPP 07

Page 5: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 5

Outline

• Part I: Algorithm Composition using Domain-Specific Libraries

• Part II: Program Decomposition for Thread-Level Speculation

• Part III: Future Research, Funding, & Opportunities for Collaboration

PLDI 06PLDI 04PPoPP 07

Page 6: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 6

Motivation

• Increasing programmer productivity• Typical language approach: increase abstraction

– abstract further from machine; get closer to problem– do more using less code– reduce software development & maintenance costs

• Domain-specific languages / libraries (DSLs) provide a high level of abstraction– e.g., domains are biology, chemistry, physics, etc.

• But, library procedures are most useful when called in sequence

Page 7: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 7

Example DSL: BioPerl

• http://www.bioperl.org

• DSL for Bioinformatics

• Written in the Perl language

• Popular, actively developed since 1995

• Used in the Dept. of Biological Sciences at Purdue

Page 8: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 8

A Typical BioPerl Call Sequence

• Query a remote database and save the result to local storage:

Query q = bio_db_query_genbank_new(“nucleotide”,

“arabidopsis[ORGN] AND topoisomerase[TITL] AND

0:3000[SLEN]”);

DB db = bio_db_genbank_new( );

Stream stream = get_stream_by_query(db, q);

SeqIO seqio = bio_seqio_new(“>sequence.fasta”, “fasta”);

Seq seq = next_seq(stream);

write_seq(seqio, seq); Example adapted fromhttp://www.bioperl.org/wiki/HOWTO:Beginners

5 data types, 6 procedure callsType Procedure

Page 9: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 9

A Library User’s Problem

• Novice users don’t know these call sequences– procedures documented independently– tutorials provide some example code fragments

• not an exhaustive list• may need adjusted for calling context (no copy paste)

• User knows what they want to do, but not how to do it

Page 10: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 10

As Observed by Others

• “most users lack the [programming] expertise to properly identify and compose the routines appropriate to their application”– Mark Stickel et al. Deductive Composition of Astronomical Software

from Subroutine Libraries. International Conference on Automated Deduction, 1994.

• “a common scenario is that the programmer knows what type of object he needs, but does not know how to write the code to get the object”

– David Mandlin et al. Jungloid Mining: Helping to Navigate the API Jungle. ACM SIGPLAN Conf. on Programming Language Design and Implementation, June 2005.

Page 11: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 11

My Solution

• Add an “abstract algorithm” (AA) construct to the programming language– An AA is named and defined by the programmer

• definition is the programmer's goal

– An AA is called like a procedure• compiler replaces the call with a sequence of library calls

• How does the compiler compose the sequence?– short answer: it uses a domain-independent

planner

Page 12: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 12

Defining and Calling the AA

• AA (goal) defined using some properties...algorithmsave_query_result_locally(db_name, query_string, filename, format) => { query_result(result, db_name, query_string), contains(filename, result), in_format(filename, format) }

...and called like a procedure

Seq seq = save_query_result_locally(“nucleotide”, “arabidopsis[ORGN] AND topoisomerase[TITL] AND 0:3000[SLEN]”, “sequence.fasta”, “fasta”);

Properties are not procedure calls.Their order does not matter.result is a named return value.

1 data type, 1 AA call

Type Procedure Property AA

Page 13: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 13

Describing the Programmer's Goal

• Programmer must indicate their goal somehow

• Library author provides a domain glossary– query_result(result, db, query) – result is the

outcome of sending query to the database db– contains(filename, data) – file filename contains

data– in_format(filename, format) – file filename is in

format format

• Glossary terms are properties (facts), whereas procedure names are actions

Page 14: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 14

Composing the Call Sequence

• AI planners solve a similar problem

• Given an initial state, a goal state, and a set of operators, a planner discovers a sequence (a plan) of instantiated operators (actions) that transforms the initial state to the goal state

• Operators define a state-transition system– planner finds a path from initial state to goal state– typically too many states to enumerate– planner searches intelligently

Page 15: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

15

Traditional Planning Example

• Planners are not normally applied to software; they traditionally solve problems like this:

Initial State Plan Goal State

A

CB C

B

A

on(B, table)on(C, table)on(A, C)

on(C, table)on(B, C)on(A, B)

move(A, table)move(B, C)move(A, B)

These are properties.(Planner’s Input)

These are properties.(Planner’s Input)

These are actions.(Planner’s Output)

Actions in the planmodify a “world”of blocks

Page 16: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 16

To Solve Composition using Planning

• Initial state : calling context (from compiler analysis)

• Goal state : AA definition (from the library user)

• Operators : procedure specifications (from the library author)– Actions : procedure calls

• World : program state

Page 17: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 17

My Composition System, DIPACS

Application Code

DIPACSPlanner

Domain-SpecificLibrary

ProcedureSpecifications

Initial State

Goal State

Plan

gcc

C or C++ Code

Binary (Actions)

0010101001000010

Run-TimeProgram State

(World)

Operators

DIPACSCompiler

DIPACS = Domain-Independent Planned Algorithm Composition and Selection

Page 18: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 18

My Composition System, DIPACS

Application Code

DIPACSPlanner

Domain-SpecificLibrary

ProcedureSpecifications

Initial State

Goal State

Plan

gcc

C or C++ Code

Binary (Actions)

0010101001000010

Run-TimeProgram State

(World)

Operators

DIPACSCompiler

DIPACS = Domain-Independent Planned Algorithm Composition and Selection

(Challenge #1)

1. Ontological Engineering – choosing a vocabulary for the domain

Page 19: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 19

Why High-Level Abstraction is OK

• Library author understands the properties

• Library user understands– via prior familiarity with the domain– via some communication from the author (glossary)

• Compiler propagates terms during analysis– meaning of properties does not matter

• Planner matches properties to goals– meaning of properties does not matter

Page 20: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 20

My Composition System, DIPACS

Application Code

DIPACSPlanner

Domain-SpecificLibrary

ProcedureSpecifications

Initial State

Goal State

Plan

gcc

C or C++ Code

Binary (Actions)

0010101001000010

Run-TimeProgram State

(World)

Operators

DIPACSCompiler

DIPACS = Domain-Independent Planned Algorithm Composition and Selection

(Challenge #1)

1. Ontological Engineering – choosing a vocabulary for the domain

(Challenge #4)

(Challenge #3)(Challenge #2)

2. Determine Initial & Goal States – requires flow analysis; translation to a planning language

3. Object Creation – most planners assume a fixed set of objects

4. Merge the Plan into the Program – destructive vs. non-destructive plans

Page 21: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 21

Selection of a Call Sequence

• Multiple call sequences can be found– programmer or compiler can choose

• Incomplete specifications may cause undesirable sequences to be suggested– requiring complete specifications is not practical– permitting incompleteness is a strength– use programmer-compiler interaction for oversight

Page 22: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 22

Related Work

• Languages and Compilers– Jungloids

• David Mandlin et al. Jungloid Mining: Helping to Navigate the API Jungle. ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2005.

– Broadway• Samuel Z. Guyer and Calvin Lin. Broadway: A Compiler for Exploiting the

Domain-Specific Semantics of Software Libraries. Proceedings of the IEEE, 93(2):342–357, February 2005.

– Speckle• Mark T. Vandevoorde. Exploiting Specifications to Improve Program

Performance. PhD thesis, Massachusetts Institute of Technology, 1994.

Page 23: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 23

Related Work (continued)

• Automatic Programming– Robert Balzer. A 15-year Perspective on Automatic Programming. IEEE Transactions on

Software Engineering, 11(11):1257–1268, November 1985.– David R. Barstow. Domain-Specific Automatic Programming. IEEE Transactions on

Software Engineering, 11(11):1321–1336, November 1985.– Charles Rich and Richard C. Waters. Automatic Programming: Myths and Prospects. IEEE

Computer, 21(8):40–51, August 1988.

• Automated (AI) Planning– Keith Golden. A Domain Description Language for Data Processing. Proc. of the

International Conference on Automated Planning and Scheduling, 2003.– M. Stickel et al. Deductive Composition of Astronomical Software from Subroutine

Libraries. Proc. of the International Conference on Automated Deduction, 1994.– Other work at NASA Ames Research Center

Page 24: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 24

Conclusion

• A DSL compiler can use a planner to implement a useful language construct

• Gave an example using a real DSL (BioPerl)

• Identified implementation challenges and their general solutions in this talk– for detailed solutions see my PLDI 06 paper

Page 25: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 25

Outline

• Part I: Algorithm Composition using Domain-Specific Libraries

• Part II: Program Decomposition for Thread-Level Speculation

• Part III: Future Research, Funding, & Opportunities for Collaboration

PLDI 06PLDI 04PPoPP 07

Page 26: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 26

Motivation

• Multiple processors on a chip becoming common

• What should we do with them?– Run multiple programs simultaneously?

• possibly more processor cores than programs• individual cores are generally simpler and slower

– Run a single program using multiple cores?• program must be parallel• same problems as automatic parallelization in order to

apply to a wide variety of existing programs– must prove independence of program sections

– Is there a way to avoid these problems?

Page 27: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 27

Thread-Level Speculation

• Decompose a sequential program into threads

• Run the threads in parallel on multiple cores

• Buffer program state and rollback execution to an earlier, correct state if not really parallel

• Inter-thread data dependences are OK, but slow execution due to the rollback overhead

Page 28: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 28

Hardware Support

• Hardware uses a predictor to dispatch a sequence of threads

• Memory writes by threads are buffered

• All writes by a thread commit to main memory after they are known to be correct

• A misprediction or a data dependence violation causes a roll back and restart

Page 29: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 29

Execution Model• Data dependence and thread sequence

misprediction contribute to overhead

T0T1

T2 T3

T2(b)

T3(b)

processors

P0 P1 P2 P3

time

x =

= x

dependenceviolation ormispredictioncauses aroll back

sequentialprogram

x =

= x

Page 30: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 30

Optimizing for Speculation

• Any set of threads is valid, but amounts of overhead and parallelism will differ– thread decomposition is crucial to performance

• Decomposition can be done manually, statically (by a compiler), or dynamically (by the hardware or virtual machine)– I'll discuss two static approaches that use profiling– dynamic approaches have run-time overhead, do

not know the program's high-level structure, & have difficulty performing trade-offs among overheads

Page 31: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 31

Approach #1 (PLDI 04)

• Start with a control-flow graph• Model data dependence and misprediction

overheads as edge weight– use the min-cut algorithm to minimize overheads

while partitioning the graph into threads– a new thread begins after a cut

• Load imbalance hard to model as edge weight– instead use balanced min-cut [Yang & Wong IEEE

Trans. on CAD of ICS 96] to simultaneously minimize cut edge weight and help balance thread size

Page 32: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 32

Classical Min-Cut

• Determines where to partition a graph such that the cost of the cut is minimal

b

c d

e

f5

5

5

55

5

a

min-cutcost = 5

thread 1

thread 2thread dispatchoverhead = 5 cycles

Page 33: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 33

Min-Cut with Overhead Weights

• Edge weights cause the cut to avoid placing the producer and consumer in different threads

b

c d

e

f 5

5 + 25 = 305

5 + 5 + 10 = 205 + 5 = 10

5 + 25 = 30

a

min-cutcost = 10

now suppose thatthere is a dependencebetween a and e,and the branch atb is 50% taken

thread 1

thread 2

x = …

… = x

dependence weightprediction weight

Page 34: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 34

Not Quite That Simple

• Need to factor in parallelism– otherwise zero cuts (sequential program) looks

“best”– estimate ideal execution time then add cut cost

• Edge weights depend on thread size– rollback penalty proportional to the amount of work

thrown away– but when weights are assigned, the cut has not yet

created the threads– solution: assign weights based on where threads

will be, perform a balanced min-cut, repeat, …

Page 35: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 35

Experimental Setup

• Simulator– Multiscalar architecture [Sohi ISCA 95] with 4 cores– core-to-core latency = 10 cycles– memory latency = 300 cycles

• Input– SPEC CPU2000 benchmark suite– compiler based on gcc– profiling data collected with train input

• dependences, branch-taken frequencies, etc.– performance data collected with ref input

Page 36: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 36

SPEC CPU2000 Speedup Results

Benchmark Single Thread [Vijay Micro 98] [Johnson PLDI 04]Insns Per Cycle Speedup Min-Cut Speedup

applu 0.48 1.98 2.11art 0.19 2.06 2.43

equake 0.56 1.49 1.79mgrid 0.35 5.41 5.54swim 0.17 4.51 4.51

geometric mean 2.72 2.97

bzip2 0.70 1.01 1.09gzip 0.72 1.27 1.35mcf 0.07 1.01 1.63

parser 0.51 0.87 1.24vpr 0.63 1.38 1.09

geometric mean 1.09 1.27

Page 37: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 37

Approach #2 (PPoPP 07)

• Position: static decomposition's effectiveness is limited due to the need for so much estimation

• Instead, embed a search algorithm into a profile version of the program and have it try various decompositions while executing

• Profile-time empirical optimization benefits from– compiler-inserted instrumentation that guides the

search based on high-level program structure– run-time system measuring performance

Page 38: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 38

Candidate Threads

• Loop iterations– iterations are naturally balanced and predictable– dependence may cause rollback overhead

• Procedure calls– create larger threads in non-numerical applications

• Elsewhere– tends to make smaller threads; leads to imbalance– not the focus of this work

Page 39: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 39

Decomposition Search Space

• Architecture provides three options for executing loops and calls– Option 0 (fine-grain)

• all loop iterations or all threads in the callee are executed in parallel

– Option 1 (sequential)• the loop or the callee is executed sequentially with the

code before and after the loop or call

– Option 2 (coarse-grain)• the loop or the callee is executed sequentially with the

code before the loop or call but in parallel with what follows

Page 40: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 40

Execution Options

...

call...

... ...

call...

...

call...

0 1 2

threading optionssourcecode

call

(fine-grain) (sequential) (coarse-grain)

...

...

multiplenew threads

no newthreads

one newthread

Page 41: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 41

Specifying a Decomposition

f() {loopA {

g();...

}...g();...h();...

}

g() {loopB {...}...

}

f

loopA

g h

g

g

loopB

e1e2

e3

e4

e5

Each edge has a value {0, 1, 2}. Vertices are procedures or loops; we call this an extended call graph.

This program has 35 = 243 possible decompositions.

=0

=0

=0=0 =0

Multiple Threads

One Threaddecomposition = “00000”

Page 42: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 42

Speculation Locality

T-1

T0

TM-1 T

MT

-2T

-3T

M+1T

M+2...... ...

withincallee or

loop

potentiallyaffect callee

or loop

potentiallyaffectedby calleeor loop

TM+3T

-4

high

low

Potentialto affectT

0 ... T

M-1

Locality shown for a 4-core system.Threads T0 through TM-1 are inside a call or loop.

Page 43: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 43

Example Decomposition

f

loopA

g h

g

g

loopB

e1e2

e3

e4

e5

Within loopA, the call to g runs as a single thread.

Because e4 != 0, any thread spawn points withing (and therefore e5) are ignored by the hardwareduring that call to g.

The other call to g executes as multiple threads.

=0

=1

=0=1 =0

Multiple Threads

One Thread

f() {loopA {

g();...

}...g();...h();...

}

g() {loopB {...}...

}

f = “001”loopA = “1”

g = “0”

Page 44: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 44

Overall Search Algorithm

• Bottom-up on extended call graph– measure performance of various options for leaf

procedures & loops and keep the best options– when the search moves higher on the graph

• either Option 1 (sequential) or 2 (coarse) is measured as best, such that the values of lower edges are ignored

• or Option 0 (fine) is measured as best and the speculation locality property provides confidence that the values of lower edges are good and not influenced much by the decisions higher up

• the values of lower edges are not reevaluated; once they are set, they remain fixed

Page 45: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 45

Search Per Vertex• Vertices with a small branching factor (number

of outgoing edges) are searched exhaustively

• We evaluate two strategies for the rest– Greedy: tries 2n + 1 solutions

• starts by measuring performance of “00...0”• tries “10...0” and “20...0”, the best picks the first value (0,

1, or 2), then moves on to varying the next edge value

– Hierarchical: tries at most 0.5n2 + 1.5n + 1 solutions• starts by measuring performance of “22...2”• first pass tries changing each value to 1, keeping those

1s that improve performance• second pass tries changing each value to 0

Page 46: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 46

SPEC CFP2000 Speedup Results

Benchmark Single Thread [Vijay Micro 98] [Johnson PLDI 04] [Johnson PPoPP 07]Insns Per Cycle Min-Cut Greedy Hierarchical

applu 0.48 1.98 2.11 3.37 3.21art 0.19 2.06 2.43 3.00 3.00

equake 0.56 1.49 1.79 1.92 1.70mgrid 0.35 5.41 5.54 6.09 6.09swim 0.17 4.51 4.51 4.51 4.51

geometric mean 2.72 2.97 3.51 3.39

speedup factors using 4 cores

Page 47: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 47

SPEC CFP2000 Overheads

applu-orig

applu-mincut

applu-greedy

applu-hier

art-orig

art-mincut

art-greedy

art-hier

equake-orig

equake-mincut

equake-greedy

equake-hier

mgrid-orig

mgrid-m

incut

mgrid-greedy

mgrid-hier

swim

-orig

swim

-mincut

swim

-greedy

swim

-hier

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Function Units Memory Latency Thread Dispatch Load Imbalance Misprediction Dependence

Page 48: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 48

SPEC CFP2000 Mean Thread Size

Benchmark [Vijay Micro 98] [Johnson PLDI 04] [Johnson PPoPP 07]Min-Cut Greedy Hierarchical

applu 36.7 46.2 296.0 223.3art 15.7 19.3 27.4 27.4

equake 15.5 19.6 27.2 26.8mgrid 113.9 115.4 160.6 160.6swim 101.9 112.8 101.9 101.9

arithmetic mean 56.7 62.7 122.6 108.0

mean instructions per thread (dynamic)

Page 49: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 49

SPEC CINT2000 Speedup Results

Benchmark Single Thread [Vijay Micro 98] [Johnson PLDI 04] [Johnson PPoPP 07]IPC Min-Cut Greedy Hierarchical

bzip2 0.70 1.01 1.09 1.07 1.17gzip 0.72 1.27 1.35 1.11 1.17mcf 0.07 1.01 1.63 1.07 1.09

parser 0.51 0.87 1.24 1.20 1.18vpr 0.63 1.38 1.09 1.38 1.38

geometric mean 1.09 1.27 1.16 1.19

speedup factors using 4 cores

Page 50: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 50

Sampling Error in SPEC CINT2000

• Instrumentation varies a value in the solution string that corresponds to a call or loop that is not executed during the measurement

• Not significant for CFP2000– few branches, high coverage of loops and calls– speedup improvement is consistently good

• Significant for CINT2000– many branches, low coverage of loops and calls– speedup improvement is inconsistent

Page 51: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 51

Related Work

• Manual Decomposition– Steffan, Hardware Support for Thread Level

Speculation, CMU PhD Thesis, 2003– Prabhu & Olukotun, Exposing Speculative Thread

Parallelism in SPEC2000, PPoPP 2005

• Static Decomposition– Vijaykumar & Sohi,Task Selection for a Multiscalar

Processor, Micro 1998– Liu et al., POSH: A TLS Compiler that Exploits

Program Structure, PPoPP 2006

Page 52: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 52

Conclusions

• Decomposition performed at profile time via empirical optimization shows significant speedup over previous static methods for SPEC CFP2000

• The technique does not rely on specific architecture parameters, so should apply widely

• Sampling error limits improvement for SPEC CINT2000

Page 53: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 53

Outline

• Part I: Algorithm Composition using Domain-Specific Libraries

• Part II: Program Decomposition for Thread-Level Speculation

• Part III: Future Research, Funding, & Opportunities for Collaboration

Page 54: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 54

Future Research

• Algorithm Composition (Part I of this talk)– show that it applies to more domains

• collaborate with those doing research in the domains– investigate interaction with other language features and how

much programmers benefit from using composition• collaborate with researchers in programming languages and

software engineering– investigate alternative planning algorithms to see if they

work better (faster, solve harder problems)• collaborate with researchers in artificial intelligence

– examine the lower (instruction-level) and upper (application-level, grid-level) limits of composition

• collaborate with researchers in architecture and grid computing

Page 55: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 55

Possible Source of Funding

• National Science Foundation– Computer Systems Research Grant

• Advanced Execution Systems– Application Composition Systems

– “Designing methods for automatically selecting application components suitable to the application problem specifications”

Page 56: Troy A. Johnson - Purdue1 Algorithm Composition using Domain-Specific Libraries and Program Decomposition for Thread-Level Speculation Troy A. Johnson.

Troy A. Johnson - Purdue 56

Future Research (2)

• Multi-core Systems (Part II of this talk)– mitigate the sampling problem for empirical

optimization of non-numerical applications– investigate parallelization strategies (speculation or

otherwise) for these systems• collaborate with researchers in parallel & scientific

computing, compilers, and architecture