Compiling for IA-64

Carol Thompson
Optimization Architect
Hewlett Packard

History of ILP Compilers

• CISC era: no significant ILP
– Compiler is merely a tool to enable use of high-level language, at some performance cost
• RISC era: advent of ILP
– Compiler-influenced architecture
– Instruction scheduling becomes important
• EPIC era: ILP as driving force
– Compiler-specified ILP

Increasing Scope for ILP Compilation

• Early RISC Compilers
– Basic block scope (delimited by branches & branch targets)
• Superscalar RISC and early VLIW Compilers
– Trace scope (single entry, single path)
– Superblocks & Hyperblocks (single entry, multiple path)
• EPIC Compilers
– Composite regions: multiple entry, multiple path

[Figure: scopes shown as Basic Blocks, Traces, Superblock, and Composite Regions]
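As an illustration of "delimited by branches & branch targets", here is a minimal sketch that splits a linear instruction list into basic blocks. The tuple layout `(label, opcode, target)`, the opcode name `br`, and the sample program are assumptions made for this example; they are not taken from the slides.

```python
def split_basic_blocks(instrs):
    """Partition (label, opcode, target) tuples into basic blocks.

    A block starts at the first instruction, at every branch target,
    and at the instruction that follows a branch.
    """
    labels = {ins[0]: i for i, ins in enumerate(instrs) if ins[0]}
    leaders = {0}
    for i, (_, op, target) in enumerate(instrs):
        if op == "br":
            if target in labels:
                leaders.add(labels[target])   # branch target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)            # fall-through starts a block
    starts = sorted(leaders)
    return [instrs[s:e] for s, e in zip(starts, starts[1:] + [len(instrs)])]


# Hypothetical if/else diamond.
prog = [
    (None, "cmp", None),
    (None, "br", "else_"),
    (None, "add", None),
    (None, "br", "join"),
    ("else_", "sub", None),
    ("join", "st", None),
]
blocks = split_basic_blocks(prog)
print([len(b) for b in blocks])
```

Even this small diamond yields four blocks of one or two instructions each, which is why basic-block scope leaves so little to schedule in parallel.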

Unbalanced and Unbiased Control Flow

• Most code is not well balanced
– Many very small blocks
– Some very large
– Then and else clauses frequently unbalanced
  – Number of instructions
  – Path length
• Many branches are highly biased
– But some are not
– Compiler can obtain frequency information from profiling or derive it heuristically

[Figure: example control flow graph annotated with frequencies 60, 40, 55, 5, and 0]

Basic Blocks

• Basic Blocks are simple
– No issues with executing unnecessary instructions
– No speculation or predication support required
• But, very limited ILP
– Short blocks offer very little opportunity for parallelism
– Long latency code is unable to take advantage of issue bandwidth in an earlier block

Traces

• Traces allow scheduling of multiple blocks together
– Increases available ILP
– Long latency operations can be moved up, as long as they are on the same trace
• But, unbiased branches are a problem
– Long latency code in slightly less frequent paths can’t move up
– Issue bandwidth may go unused (not enough concurrent instructions to fill available execution units)
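To make the unbiased-branch problem concrete, the sketch below picks a trace greedily by following the most frequent outgoing edge. The greedy rule and the graph shape are assumptions for illustration; the edge weights echo the numbers in the slide's figure, but the original graph structure is not recoverable from the transcript.

```python
def select_trace(succ, entry):
    """Follow the most frequent outgoing edge from each block until the
    path ends or would revisit a block (a simple greedy heuristic)."""
    trace, seen, block = [], set(), entry
    while block is not None and block not in seen:
        trace.append(block)
        seen.add(block)
        edges = succ.get(block, {})
        block = max(edges, key=edges.get) if edges else None
    return trace


# Hypothetical CFG: block -> {successor: edge frequency}.
cfg = {
    "A": {"B": 60, "C": 40},
    "B": {"D": 55, "E": 5},
    "C": {"F": 40},
    "D": {"F": 55},
    "E": {"F": 5},
    "F": {},
}
trace = select_trace(cfg, "A")
off_trace = [b for b in cfg if b not in trace]
print(trace, off_trace)
```

Block `C` runs on 40 of every 100 executions here, yet it falls off the trace, so any long-latency operation inside it cannot be hoisted by a trace scheduler.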

Superblocks and Hyperblocks

• Superblocks and Hyperblocks allow inclusion of multiple important paths
– Long latency code may migrate up from multiple paths
– Hyperblocks may be fully predicated
– More effective utilization of issue bandwidth
• But, requires code duplication
• Wholesale predication may lengthen important paths

Composite Regions

• Allow rejoin from non-Region code
– Wholesale code duplication is not required
– Support full code motion across region
– Allow all interesting paths to be scheduled concurrently
• Nested, less important Regions bear the burden of the rejoin
– Compensation code, as needed

Predication Approaches

• Full Predication of entire Region
– Penalizes short paths
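A small worked example can show why full predication penalizes short paths. The sketch below is a deliberately crude model with assumed numbers: it counts only instructions issued per execution of an if/else, and ignores issue width, latencies, and branch misprediction cost.

```python
def issued_per_execution(then_len, else_len, then_pct, predicated):
    """Average instructions issued per execution of an if/else.

    With full predication both arms are always issued; with a branch,
    only the arm actually taken is issued.
    """
    if predicated:
        return then_len + else_len
    return (then_pct * then_len + (100 - then_pct) * else_len) / 100


# Hypothetical unbalanced diamond: 2-instruction arm taken 60% of the time,
# 20-instruction arm taken 40% of the time.
branchy = issued_per_execution(2, 20, 60, predicated=False)
full = issued_per_execution(2, 20, 60, predicated=True)
print(branchy, full)
```

Under these assumptions the short arm, which needs only 2 instructions, always pays for all 22 once the region is fully predicated.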

On-Demand Predication

• Predicate (and Speculate) as needed
– reduce critical path(s)
– fully utilize issue bandwidth
• Retain control flow to accommodate unbalanced paths

Predicate Analysis

• Instruction scheduler requires knowledge of predicate relationships
– For dependence analysis
– For code motion
– …
• Predicate Query System
– Graphical representation of predicate relationships
– Superset, subset, disjoint, …
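The queries listed above can be sketched with a toy model. This example is an assumption-laden stand-in: it represents each predicate as the set of execution paths on which it is true, rather than the graph representation the slide mentions, and the class and method names are invented for illustration.

```python
class PredicateQuery:
    """Toy predicate query system: each predicate maps to the set of
    paths (leaf conditions) under which it evaluates to true."""

    def __init__(self):
        self.cover = {}

    def define(self, name, paths):
        self.cover[name] = frozenset(paths)

    def disjoint(self, a, b):
        return self.cover[a].isdisjoint(self.cover[b])

    def subset(self, a, b):
        return self.cover[a] <= self.cover[b]

    def superset(self, a, b):
        return self.cover[a] >= self.cover[b]


pqs = PredicateQuery()
pqs.define("p0", {"then", "else"})   # always true
pqs.define("p1", {"then"})           # compare result
pqs.define("p2", {"else"})           # its complement
print(pqs.disjoint("p1", "p2"), pqs.subset("p1", "p0"))
```

A scheduler or register allocator would ask questions like `disjoint("p1", "p2")` to decide whether two predicated instructions can ever execute on the same run.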

Predicate Computation

• Compute all predicates possibly needed
• Optimize
– to share predicates where possible
– to utilize parallel compares
– to fully utilize dual-targets

Predication and Branch Counts

• Predication reduces branches
– at both moderate and aggressive opt. levels

[Chart: Normalized Dynamic Branch Counts per benchmark; series: -O, -O w/pred, +O4 +P, +O4 +P w/pred]

Predication & Branch Prediction

• Comparable misprediction rate with predication
– despite significantly fewer branches
– hence increased mean time between mispredicted branches

[Chart: Normalized Mispredict Rates per benchmark; series: -O, -O w/pred, +O4 +P, +O4 +P w/pred]

Register Allocation

• Modeled as a graph-coloring problem.
– Nodes in the graph represent live ranges of variables
– Edges represent a temporal overlap of the live ranges
– Nodes sharing an edge must be assigned different colors (registers)

    x = ...
    y = ...
      = ... x
    z = ...
      = … y
      = … z

[Figure: interference graph over x, y, z]

Requires Two Colors
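The straight-line example can be reproduced with a short sketch. The statement positions used as live-range endpoints and the greedy coloring order are assumptions for illustration; production allocators use more refined heuristics, and greedy coloring is not optimal in general.

```python
from itertools import combinations


def interference(ranges):
    """Build an interference graph from {var: (def_pos, last_use_pos)}."""
    graph = {v: set() for v in ranges}
    for a, b in combinations(ranges, 2):
        (s1, e1), (s2, e2) = ranges[a], ranges[b]
        if s1 < e2 and s2 < e1:          # live ranges overlap in time
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_color(graph):
    """Assign each node the lowest color unused by its neighbors,
    visiting high-degree nodes first."""
    colors = {}
    for v in sorted(graph, key=lambda n: -len(graph[n])):
        used = {colors[n] for n in graph[v] if n in colors}
        colors[v] = next(c for c in range(len(graph)) if c not in used)
    return colors


# Positions follow the six statements of the slide's example (0-based).
ranges = {"x": (0, 2), "y": (1, 4), "z": (3, 5)}
colors = greedy_color(interference(ranges))
print(colors, len(set(colors.values())))
```

Here `x` and `z` never overlap, so they can share a register while `y` takes the other.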


With Control Flow

    x = ...
    y = ...
    (branch)
      one path:    z = ...
                     = … z
      other path:    = … y
                   x = ...
    (join)
      = … x

[Figure: interference graph over x, y, z]

Requires Two Colors


With Predication

    x = ...
    y = ...
    z = ...
      = … y
    x = ...
      = … z
      = … x

[Figure: interference graph over x, y, z]

Now Requires Three Colors

Predicate Analysis

[Figure: predicate relationships among p0, p1, and p2]

    x = ...
    y = ...
    z = ...
      = … y
    x = ...
      = … z
      = … x

p1 and p2 are disjoint: if p1 is TRUE, p2 is false, and vice versa


With Predicate Analysis

    x = ...
    y = ...
    z = ...
      = … y
    x = ...
      = … z
      = … x

[Figure: interference graph over x, y, z]

Now Back to Two Colors
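The drop from three colors back to two can be sketched as follows. This is a simplified model under stated assumptions: each live range is tagged with a single predicate under which its value is needed, the positions and predicate tags are guesses at the slide's example, and the disjointness fact (p1, p2) is supplied by hand rather than computed.

```python
from itertools import combinations

# Assumed result of predicate analysis: p1 and p2 are never both true.
DISJOINT = {frozenset(("p1", "p2"))}


def build_graph(ranges, predicate_aware):
    """ranges: {var: (start, end, predicate)}. Two ranges interfere when
    they overlap in time, unless their predicates are known disjoint."""
    graph = {v: set() for v in ranges}
    for a, b in combinations(ranges, 2):
        (s1, e1, pa), (s2, e2, pb) = ranges[a], ranges[b]
        overlap = s1 < e2 and s2 < e1
        exclusive = predicate_aware and frozenset((pa, pb)) in DISJOINT
        if overlap and not exclusive:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def colors_needed(graph):
    """Greedy coloring, high-degree nodes first; returns the color count."""
    colors = {}
    for v in sorted(graph, key=lambda n: -len(graph[n])):
        used = {colors[n] for n in graph[v] if n in colors}
        colors[v] = next(c for c in range(len(graph)) if c not in used)
    return len(set(colors.values()))


# Hypothetical tagging of the predicated code: x is needed on all paths,
# y only where p2 holds, z only where p1 holds.
ranges = {"x": (0, 6, "p0"), "y": (1, 3, "p2"), "z": (2, 5, "p1")}
naive = colors_needed(build_graph(ranges, predicate_aware=False))
aware = colors_needed(build_graph(ranges, predicate_aware=True))
print(naive, aware)
```

Without the disjointness fact, `y` and `z` appear to overlap and force a third register; with it, they can share one.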

Effect of Predicate-Aware Register Allocation

• Reduces register requirements for individual procedures by 0% to 75%
– Depends upon how aggressively predication is applied
• Average dynamic reduction in register stack allocation for gcc is 4.7%

Object-Oriented Code

• Challenges
– Small procedures, many indirect (virtual) calls
  – Limits size of regions, scope for ILP
– Exception handling
– Bounds checking (Java)
  – Inherently serial: must check before executing load or store

• Solutions
– Inlining
  – for non-virtual functions or provably unique virtual functions
  – Speculative inlining for most common variant
– Liveness analysis of handlers
  – Architectural support for speculation ensures recoverability
– Speculative execution
  – Guarantees correct exception behavior
– Dynamic optimization (e.g. Java)
  – Make use of dynamic profile

Method Calls

• Barrier between execution streams
• Often, location of called method must be determined at runtime
– Costly “identity check” on object must complete before method may begin
– Even if the call nearly always goes to the same place
– Little ILP

[Figure: flow diagram with nodes "Resolve target method", three "Possible target" boxes, and "Call-dependent code"]

Speculating Across Method Calls

• Compiler predicts target method
– Profiling
– Current state of class hierarchy
• Predicted method is inlined
– Full or partial
• Speculative execution of called method begins while actual target is determined
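The transformation described above resembles guarded inlining, sketched here at source level. The class names and the choice of `Square` as the predicted target are hypothetical. Python cannot express the hardware side: on IA-64 the inlined body could begin executing speculatively while the identity check is still resolving, whereas here the check simply runs first.

```python
import math


class Shape:
    def area(self):
        raise NotImplementedError


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius * self.radius


def total_area(shapes):
    """Sum areas, with the predicted dominant target (Square) inlined."""
    total = 0.0
    for sh in shapes:
        if type(sh) is Square:            # identity check on predicted class
            total += sh.side * sh.side    # inlined body of Square.area
        else:
            total += sh.area()            # call other method if needed
    return total


shapes = [Square(2), Circle(1), Square(3)]
print(total_area(shapes))
```

The inlined arm removes the call barrier for the common case, so its instructions can be scheduled together with the surrounding loop body.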

Speculation Across Method Calls

[Figure: two flow diagrams with nodes "Resolve target method", "call method", "Dominant called method", "Other target method" (twice), and "call other method if needed"]

Bounds & Null Checks

• Checks inhibit code motion
• Null checks

    x = y.foo;

  expands to

    if( y == null ) throw NullPointerException;
    x = y.foo;

• Bounds checks

    x = a[i];

  expands to

    if( a == null ) throw NullPointerException;
    if( i < 0 || i >= a.length )
        throw ArrayIndexOutOfBoundsException;
    x = a[i];

Speculating Across Bounds Checks

• Bounds checks rarely fail

    x = a[i];

  expands to

    ld.s  t = a[i];
    if( a == null ) throw NullPointerException;
    if( i < 0 || i >= a.length )
        throw ArrayIndexOutOfBoundsException;
    chk.s t
    x = t;

• Long latency load can begin before checks
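The `ld.s` / `chk.s` pattern can be mimicked in software as a rough analogy. In this sketch the sentinel `NAT` stands in for the deferred-exception token, Python's `TypeError` and `IndexError` stand in for the Java exceptions, and all function names are invented; real hardware defers a fault rather than catching an exception.

```python
NAT = object()   # stand-in for a deferred-exception ("Not a Thing") token


def load_speculative(a, i):
    """Like ld.s: never raises; returns NAT if the load would fault."""
    if i < 0:                 # Python would wrap negative indices silently
        return NAT
    try:
        return a[i]
    except (TypeError, IndexError):
        return NAT


def checked_load(a, i):
    t = load_speculative(a, i)        # ld.s t = a[i], hoisted above checks
    if a is None:
        raise TypeError("null array reference")
    if i < 0 or i >= len(a):
        raise IndexError(i)
    if t is NAT:                      # chk.s t: branch to recovery code
        t = a[i]                      # recovery: redo the load normally
    return t


print(checked_load([10, 20, 30], 1))
```

The checks still run and still raise in program order; only the load itself has moved ahead of them.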

Exception Handling

• Exception handling inhibits motion of subsequent code

    if( y.foo ) throw MyException;
    x = y.bar + z.baz;

Speculation in the Presence of Exception Handling

• Execution of subsequent instructions may begin before exception is resolved

    if( y.foo ) throw MyException;
    x = y.bar + z.baz;

  is scheduled as

    ld    t1 = y.foo
    ld.s  t2 = y.bar
    ld.s  t3 = z.baz
    add   x = t2 + t3
    if( t1 ) throw MyException;
    chk.s x
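The schedule above relies on a deferred fault propagating through the `add` until the final `chk.s`. A software analogy, under the assumption that a sentinel value can model that token, might look like this; the helper names and the use of `SimpleNamespace` objects are illustrative only.

```python
from types import SimpleNamespace

NAT = object()   # stand-in for a deferred-exception token


class MyException(Exception):
    pass


def ld_s(obj, field):
    """Speculative load: yields NAT instead of faulting."""
    return getattr(obj, field, NAT)


def add(a, b):
    """NAT propagates through arithmetic on speculative values."""
    return NAT if a is NAT or b is NAT else a + b


def compute(y, z):
    t1 = y.foo                 # ld    t1 = y.foo
    t2 = ld_s(y, "bar")        # ld.s  t2 = y.bar
    t3 = ld_s(z, "baz")        # ld.s  t3 = z.baz
    x = add(t2, t3)            # add   x = t2 + t3
    if t1:
        raise MyException()    # thrown before any deferred fault surfaces
    if x is NAT:               # chk.s x: recovery re-executes non-speculatively
        x = y.bar + z.baz
    return x


print(compute(SimpleNamespace(foo=False, bar=1), SimpleNamespace(baz=2)))
```

If `y.foo` is true and `z` is null, the caller sees `MyException`, exactly as in the original program order, because the failed speculative load of `z.baz` is never allowed to raise on its own.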

Dependence Graph for Instruction Scheduling

add t1 = 8,p

(p1) ld4 t3 = [log]

(p1) add t2 = 1,t2

mov out0 = 0

br.ret rp

(p1) ld4 out0 = [t4]

shladd t4 = n,4,t3

(p1) ld4 t3 = [p]

(p1) st4 [log] = t2

ld4 count = [t1]

cmp4.ge p1,p2=n,count

    if( n < p->count ) {
        (*log)++;
        return p->x[n];
    } else {
        return 0;
    }

Dependence Graph with Predication & Speculation

add t1 = 8,p

(p1) ld4 t3 = [log]

(p1) add t2 = 1,t2

mov out0 = 0

br.ret rp

(p1) ld4 out0 = [t4]

shladd t4 = n,4,t3

(p1) ld4 t3 = [p]

(p1) st4 [log] = t2

ld4 count = [t1]


chk.a t4

chk.a p

• During dependence graph construction, potentially control and data speculative edges and nodes are identified
• Check nodes are added where possibly needed (note that only data speculation checks are shown here)

Dependence Graph with Predication & Speculation

add t1 = 8,p

(p1) ld4 t3 = [log]

(p1) add t2 = 1,t2

(p2) mov out0 = 0

br.ret rp

(p1) ld4 out0 = [t4]

shladd t4 = n,4,t3

(p1) ld4 t3 = [p]

(p1) st4 [log] = t2

ld4 count = [t1]


chk.a t4

chk.a p

• Speculative edges may be violated. Here the graph is re-drawn to show the enhanced parallelism
• Note that the speculation of both writes to the out0 register would require insertion of a copy. The scheduler must consider this in its scheduling
• Nodes with sufficient slack (e.g. writes to out0) will not be speculated
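Slack, as used in the last bullet, can be computed from earliest and latest start times over the dependence graph. The sketch below uses a small hypothetical graph with assumed latencies (2 cycles for loads, 1 otherwise); it is not a transcription of the slide's graph.

```python
def slack(order, latency, succs):
    """Return {node: latest_start - earliest_start} for a DAG.

    order   : nodes in topological order
    latency : {node: cycles until its result is available}
    succs   : {node: iterable of dependent nodes}
    """
    earliest = {n: 0 for n in order}
    for n in order:                                   # forward pass
        for s in succs.get(n, ()):
            earliest[s] = max(earliest[s], earliest[n] + latency[n])
    length = max(earliest[n] + latency[n] for n in order)
    latest = {n: length - latency[n] for n in order}
    for n in reversed(order):                         # backward pass
        for s in succs.get(n, ()):
            latest[n] = min(latest[n], latest[s] - latency[n])
    return {n: latest[n] - earliest[n] for n in order}


order = ["addr", "load_count", "compare", "load_elem", "mov_zero", "ret"]
latency = {"addr": 1, "load_count": 2, "compare": 1,
           "load_elem": 2, "mov_zero": 1, "ret": 1}
succs = {"addr": ["load_count"], "load_count": ["compare"],
         "compare": ["load_elem", "mov_zero"],
         "load_elem": ["ret"], "mov_zero": ["ret"]}
result = slack(order, latency, succs)
print(result)
```

In this toy graph only `mov_zero` has nonzero slack, so a scheduler following the slide's rule would leave it unspeculated and spend speculation on the critical-path loads instead.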

Conclusions

• IA-64 compilers push the complexity of the compiler
– However, the technology is a logical progression from today’s
– Today’s RISC compilers
  – are more complex
  – are more reliable
  – and deliver more performance
  than those of the early days
– Complexity trend is mirrored in both hardware and applications
– Need a balance to maximize benefits from each
