Compiling for IA-64
Carol Thompson
Optimization Architect
Hewlett Packard
History of ILP Compilers
• CISC era: no significant ILP
  – Compiler is merely a tool to enable use of high-level language, at some performance cost
• RISC era: advent of ILP
  – Compiler-influenced architecture
  – Instruction scheduling becomes important
• EPIC era: ILP as driving force
  – Compiler-specified ILP
Increasing Scope for ILP Compilation
• Early RISC Compilers
  – Basic block scope (delimited by branches & branch targets)
• Superscalar RISC and early VLIW Compilers
  – Trace scope (single entry, single path)
  – Superblocks & Hyperblocks (single entry, multiple path)
• EPIC Compilers
  – Composite regions: multiple entry, multiple path
[Figure: scheduling scopes labeled Basic Blocks, Traces, Superblock, Composite Regions]
Unbalanced and Unbiased Control Flow
• Most code is not well balanced
  – Many very small blocks
  – Some very large
  – Then and else clauses frequently unbalanced
    – Number of instructions
    – Path length
• Many branches are highly biased
  – But some are not
  – Compiler can obtain frequency information from profiling or derive it heuristically
[Figure: example control-flow graph annotated with execution frequencies 60, 40, 55, 5 and 0]
Basic Blocks
• Basic Blocks are simple
  – No issues with executing unnecessary instructions
  – No speculation or predication support required
• But, very limited ILP
  – Short blocks offer very little opportunity for parallelism
  – Long latency code is unable to take advantage of issue bandwidth in an earlier block
Traces
• Traces allow scheduling of multiple blocks together
  – Increases available ILP
  – Long latency operations can be moved up, as long as they are on the same trace
• But, unbiased branches are a problem
  – Long latency code in slightly less frequent paths can’t move up
  – Issue bandwidth may go unused (not enough concurrent instructions to fill available execution units)
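The trace idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the block names `A` to `F`, the shape of the graph and the helper `select_trace` are hypothetical, and only the edge frequencies (60, 40, 55, 5) echo the numbers on the slides. The sketch follows the most frequent successor from the entry block, which is one simple way a trace could be picked.

```python
def select_trace(cfg, entry):
    """Greedily follow the most frequent successor edge, starting at `entry`."""
    trace, seen = [entry], {entry}
    block = entry
    while cfg.get(block):
        # Pick the successor reached by the highest-frequency edge.
        succ, _freq = max(cfg[block].items(), key=lambda kv: kv[1])
        if succ in seen:  # stop rather than loop around a back edge
            break
        trace.append(succ)
        seen.add(succ)
        block = succ
    return trace


# Hypothetical CFG: block -> {successor: edge frequency in percent}.
CFG = {
    "A": {"B": 60, "C": 40},
    "B": {"D": 55, "E": 5},
    "C": {"F": 40},
    "D": {"F": 55},
    "E": {"F": 5},
    "F": {},
}

trace = select_trace(CFG, "A")
off_trace = [b for b in CFG if b not in trace]
```

In this made-up graph block `C` is left off the trace even though it runs 40% of the time, which is the "slightly less frequent path" problem named above: any long latency code in `C` cannot be scheduled together with the trace.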
Superblocks and Hyperblocks
• Superblocks and Hyperblocks allow inclusion of multiple important paths
  – Long latency code may migrate up from multiple paths
  – Hyperblocks may be fully predicated
  – More effective utilization of issue bandwidth
• But, requires code duplication
• Wholesale predication may lengthen important paths
Composite Regions
• Allow rejoin from non-Region code
  – Wholesale code duplication is not required
  – Support full code motion across region
  – Allow all interesting paths to be scheduled concurrently
• Nested, less important Regions bear the burden of the rejoin
  – Compensation code, as needed
Predication Approaches
• Full Predication of entire Region
  – Penalizes short paths
On-Demand Predication
• Predicate (and Speculate) as needed
  – reduce critical path(s)
  – fully utilize issue bandwidth
• Retain control flow to accommodate unbalanced paths
Predicate Analysis
• Instruction scheduler requires knowledge of predicate relationships
  – For dependence analysis
  – For code motion
  – …
• Predicate Query System
  – Graphical representation of predicate relationships
  – Superset, subset, disjoint, …
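The kinds of queries listed above can be sketched as follows. This is a deliberately simplified stand-in, not the graph-based system the slide describes: each predicate is modeled as the set of execution paths on which it is true, so that superset, subset and disjoint become ordinary set operations. The class `PredicateQuery`, the path labels `"then"` and `"else"`, and the predicate names are all hypothetical.

```python
class PredicateQuery:
    """Toy predicate query system: a predicate is the set of paths where it holds."""

    def __init__(self):
        self.paths = {}  # predicate name -> frozenset of path labels

    def define(self, name, paths):
        self.paths[name] = frozenset(paths)

    def disjoint(self, a, b):
        # No path on which both predicates are true.
        return not (self.paths[a] & self.paths[b])

    def subset(self, a, b):
        # Whenever a is true, b is true as well.
        return self.paths[a] <= self.paths[b]

    def superset(self, a, b):
        # Whenever b is true, a is true as well.
        return self.paths[a] >= self.paths[b]


pqs = PredicateQuery()
pqs.define("p0", {"then", "else"})  # always true inside the region
pqs.define("p1", {"then"})          # one result of a compare
pqs.define("p2", {"else"})          # its complement
```

A scheduler could ask `pqs.disjoint("p1", "p2")` before deciding that two predicated instructions can never both execute, or `pqs.subset("p1", "p0")` before moving an instruction under a wider guard.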
Predicate Computation
• Compute all predicates possibly needed
• Optimize
  – to share predicates where possible
  – to utilize parallel compares
  – to fully utilize dual-targets
Predication and Branch Counts
• Predication reduces branches
  – at both moderate and aggressive optimization levels
[Chart: Normalized Dynamic Branch Counts by benchmark; series: -O, -O w/pred, +O4 +P, +O4 +P w/pred]
Predication & Branch Prediction
• Comparable misprediction rate with predication
  – despite significantly fewer branches
  – increased mean time between mispredicted branches
[Chart: Normalized Mispredict Rates by benchmark; series: -O, -O w/pred, +O4 +P, +O4 +P w/pred]
Register Allocation
• Modeled as a graph-coloring problem.
  – Nodes in the graph represent live ranges of variables
  – Edges represent a temporal overlap of the live ranges
  – Nodes sharing an edge must be assigned different colors (registers)

x = ...
y = ...
  = ... x
z = ...
  = ... y
  = ... z

[Figure: live ranges and interference graph for x, y, z]
Requires Two Colors
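The graph-coloring model can be sketched for the six-line example on this slide. The instruction positions 0 to 5, the helper names and the greedy coloring order are assumptions made for illustration; a production allocator would use a more careful coloring heuristic and handle spilling. Each live range runs from the definition to the last use, and two ranges interfere when they overlap in time.

```python
from itertools import combinations


def interference_graph(live_ranges):
    """live_ranges: name -> (def position, last-use position)."""
    edges = {v: set() for v in live_ranges}
    for a, b in combinations(live_ranges, 2):
        (a_start, a_end), (b_start, b_end) = live_ranges[a], live_ranges[b]
        # Strict overlap: a value dying where another is born does not interfere.
        if a_start < b_end and b_start < a_end:
            edges[a].add(b)
            edges[b].add(a)
    return edges


def greedy_color(edges):
    """Assign each node the lowest color not used by an already-colored neighbor."""
    colors = {}
    for node in sorted(edges, key=lambda n: -len(edges[n])):
        used = {colors[n] for n in edges[node] if n in colors}
        colors[node] = next(c for c in range(len(edges)) if c not in used)
    return colors


# Positions: 0 "x =", 1 "y =", 2 "= x", 3 "z =", 4 "= y", 5 "= z".
LIVE_RANGES = {"x": (0, 2), "y": (1, 4), "z": (3, 5)}

edges = interference_graph(LIVE_RANGES)
colors = greedy_color(edges)
registers_needed = len(set(colors.values()))
```

Here `y` overlaps both `x` and `z`, while `x` is dead before `z` is defined, so `x` and `z` can share a register and two colors suffice, matching the slide.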
Register Allocation
With Control Flow

x = ...
y = ...
  (one path)    z = ...
                  = ... z
  (other path)    = ... y
                x = ...
  = ... x

[Figure: live ranges and interference graph for x, y, z]
Requires Two Colors
Register Allocation
With Predication

x = ...
y = ...
z = ...
  = ... y
x = ...
  = ... z
  = ... x

[Figure: live ranges and interference graph for x, y, z]
Now Requires Three Colors
Predicate Analysis

x = ...
y = ...
z = ...
  = ... y
x = ...
  = ... z
  = ... x

[Figure: predicate tree with p0 at the root and p1, p2 below it, next to the live ranges of x, y, z]
p1 and p2 are disjoint: if p1 is TRUE, p2 is FALSE, and vice versa
Register Allocation
With Predicate Analysis

x = ...
y = ...
z = ...
  = ... y
x = ...
  = ... z
  = ... x

[Figure: live ranges and interference graph for x, y, z]
Now Back to Two Colors
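The effect of predicate analysis on the coloring can be sketched as below. The assignment of guards is an assumption, since the transcript does not record which instruction carries which predicate: here `z` is taken to be live only under `p1`, `y` to be needed only under `p2`, and `x` to be live under `p0`. Positions 0 to 6 number the seven predicated instructions, and all helper names are hypothetical.

```python
from itertools import combinations

# name -> (start, end, predicate under which the value is needed); guards are assumed.
RANGES = {"x": (0, 6, "p0"), "y": (1, 3, "p2"), "z": (2, 5, "p1")}
DISJOINT = {frozenset(("p1", "p2"))}  # the fact supplied by predicate analysis


def interferes(a, b, use_predicates):
    (a_start, a_end, a_pred), (b_start, b_end, b_pred) = a, b
    if use_predicates and frozenset((a_pred, b_pred)) in DISJOINT:
        return False  # never simultaneously live, whatever the positions say
    return a_start < b_end and b_start < a_end


def colors_needed(ranges, use_predicates):
    edges = {v: set() for v in ranges}
    for a, b in combinations(ranges, 2):
        if interferes(ranges[a], ranges[b], use_predicates):
            edges[a].add(b)
            edges[b].add(a)
    colors = {}
    for node in sorted(edges, key=lambda n: -len(edges[n])):
        used = {colors[n] for n in edges[node] if n in colors}
        colors[node] = next(c for c in range(len(edges)) if c not in used)
    return len(set(colors.values()))


without_analysis = colors_needed(RANGES, use_predicates=False)
with_analysis = colors_needed(RANGES, use_predicates=True)
```

Judged by position alone, the three ranges overlap pairwise and form a triangle, hence three colors. Once the allocator knows that `p1` and `p2` are disjoint, the edge between `y` and `z` disappears and two colors are enough again.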
Effect of Predicate-Aware Register Allocation
• Reduces register requirements for individual procedures by 0% to 75%
  – Depends upon how aggressively predication is applied
• Average dynamic reduction in register stack allocation for gcc is 4.7%
Object-Oriented Code
• Challenges
  – Small procedures, many indirect (virtual) calls
    – Limits size of regions, scope for ILP
  – Exception handling
  – Bounds checking (Java)
    – Inherently serial: must check before executing load or store
• Solutions
  – Inlining
    – for non-virtual functions or provably unique virtual functions
    – Speculative inlining for most common variant
  – Liveness analysis of handlers
  – Architectural support for speculation ensures recoverability
  – Speculative execution
    – Guarantees correct exception behavior
  – Dynamic optimization (e.g. Java)
    – Make use of dynamic profile
Method Calls
• Barrier between execution streams
• Often, location of called method must be determined at runtime
  – Costly “identity check” on object must complete before method may begin
  – Even if the call nearly always goes to the same place
  – Little ILP
[Figure: resolve target method, then one of several possible targets, then call-dependent code]
Speculating Across Method Calls
• Compiler predicts target method
  – Profiling
  – Current state of class hierarchy
• Predicted method is inlined
  – Full or partial
• Speculative execution of called method begins while actual target is determined
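The shape of the transformation can be sketched at source level. Python cannot express instruction-level parallelism, so this only shows the guard structure a compiler might produce: the body of the predicted target is inlined behind an identity check, with an ordinary virtual call as the fallback. The classes `Shape`, `Square` and `Rect`, and the premise that `Square` is the dominant receiver, are hypothetical.

```python
class Shape:
    def area(self):
        raise NotImplementedError


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Rect(Shape):
    def __init__(self, w, h):
        self.w, self.h = w, h

    def area(self):
        return self.w * self.h


def total_area_speculative(shapes):
    """Call site after speculative inlining of the predicted target Square.area."""
    total = 0
    for s in shapes:
        if type(s) is Square:         # identity check against the predicted class
            total += s.side * s.side  # inlined body of Square.area
        else:
            total += s.area()         # call the other method if needed
    return total


shapes = [Square(2), Rect(2, 3), Square(3)]
result = total_area_speculative(shapes)
```

On IA-64 the inlined body can be scheduled alongside the identity check rather than after it, which is where the gain comes from. The exact-type test matters: a subclass of `Square` that overrides `area` must take the fallback path.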
Speculation Across Method Calls
[Figure, before: resolve target method, then call method, leading to the dominant called method or one of the other target methods]
[Figure, after: the dominant called method is placed alongside resolve target method, followed by "call other method if needed" leading to the other target methods]
Bounds & Null Checks
• Checks inhibit code motion
• Null checks
    Source:
      x = y.foo;
    With check:
      if( y == null ) throw NullPointerException;
      x = y.foo;
• Bounds checks
    Source:
      x = a[i];
    With checks:
      if( a == null ) throw NullPointerException;
      if( i < 0 || i >= a.length)
        throw ArrayIndexOutOfBoundsException;
      x = a[i];
Speculating Across Bounds Checks
• Bounds checks rarely fail
    Source:
      x = a[i];
    With speculation:
      ld.s t = a[i];
      if( a == null ) throw NullPointerException;
      if( i < 0 || i >= a.length)
        throw ArrayIndexOutOfBoundsException;
      chk.s t
      x = t;
• Long latency load can begin before checks
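The ld.s / chk.s pattern can be modeled in software as below. This is a toy model, not a description of the hardware: the sentinel `NAT` stands in for the deferred-exception token a failed speculative load leaves in its target register, and treating every null or out-of-range access as a failed load is a modeling assumption. The function names are hypothetical, and Python's `TypeError` and `IndexError` stand in for the Java exceptions.

```python
NAT = object()  # stand-in for the deferred-exception token of a failed ld.s


def speculative_load(a, i):
    """Model of ld.s: never raises; yields NAT if the access would fault."""
    if a is None or not (0 <= i < len(a)):
        return NAT
    return a[i]


def checked_read(a, i):
    t = speculative_load(a, i)  # ld.s t = a[i], hoisted above both checks
    if a is None:
        raise TypeError("null array")            # NullPointerException
    if i < 0 or i >= len(a):
        raise IndexError("index out of bounds")  # ArrayIndexOutOfBoundsException
    if t is NAT:                # chk.s t: recovery redoes the load non-speculatively
        t = a[i]
    return t                    # x = t


value = checked_read([10, 20, 30], 1)
```

The load is issued first, so its latency overlaps the two checks, yet the exceptions are still raised in source order because the speculative load itself can never fault. In this model the recovery branch is unreachable once the checks pass; it is kept to mirror `chk.s`, which on real hardware also covers deferrals unrelated to the checks.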
Exception Handling
• Exception handling inhibits motion of subsequent code
    if( y.foo ) throw MyException;
    x = y.bar + z.baz;
Speculation in the Presence of Exception Handling
• Execution of subsequent instructions may begin before exception is resolved
    Source:
      if( y.foo ) throw MyException;
      x = y.bar + z.baz;
    With speculation:
      ld    t1 = y.foo
      ld.s  t2 = y.bar
      ld.s  t3 = z.baz
      add   x = t2 + t3
      if( t1 ) throw MyException;
      chk.s x
Dependence Graph for Instruction Scheduling
add t1 = 8,p
(p1) ld4 t3 = [log]
(p1) add t2 = 1,t2
mov out0 = 0
br.ret rp
(p1) ld4 out0 = [t4]
shladd t4 = n,4,t3
(p1) ld4 t3 = [p]
(p1) st4 [log] = t2
ld4 count = [t1]
cmp4.ge p1,p2=n,count
if( n < p->count ) {
  (*log)++;
  return p->x[n];
} else {
  return 0;
}
Dependence Graph with Predication & Speculation
add t1 = 8,p
(p1) ld4 t3 = [log]
(p1) add t2 = 1,t2
mov out0 = 0
br.ret rp
(p1) ld4 out0 = [t4]
shladd t4 = n,4,t3
(p1) ld4 t3 = [p]
(p1) st4 [log] = t2
ld4 count = [t1]
cmp4.ge p1,p2=n,count
chk.a t4
chk.a p
• During dependence graph construction, potentially control and data speculative edges and nodes are identified
• Check nodes are added where possibly needed (note that only data speculation checks are shown here)
Dependence Graph with Predication & Speculation
add t1 = 8,p
(p1) ld4 t3 = [log]
(p1) add t2 = 1,t2
(p2) mov out0 = 0
br.ret rp
(p1) ld4 out0 = [t4]
shladd t4 = n,4,t3
(p1) ld4 t3 = [p]
(p1) st4 [log] = t2
ld4 count = [t1]
cmp4.ge p1,p2=n,count
chk.a t4
chk.a p
• Speculative edges may be violated. Here the graph is redrawn to show the enhanced parallelism
• Note that the speculation of both writes to the out0 register would require insertion of a copy. The scheduler must consider this in its scheduling
• Nodes with sufficient slack (e.g. writes to out0) will not be speculated
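Slack can be made concrete with a small critical-path calculation. The graph below is hypothetical: it reuses a few instruction names from the slide, but the edges and latencies are invented for illustration and are not read off the original figure. Slack is the difference between the latest and the earliest cycle at which a node can start without lengthening the schedule.

```python
# Hypothetical latencies (cycles) and dependence edges; not taken from the slide's graph.
LATENCY = {"add t1": 1, "ld4 count": 2, "cmp4.ge": 1, "mov out0": 1, "br.ret": 1}
EDGES = [
    ("add t1", "ld4 count"),
    ("ld4 count", "cmp4.ge"),
    ("cmp4.ge", "br.ret"),
    ("mov out0", "br.ret"),
]
ORDER = ["add t1", "mov out0", "ld4 count", "cmp4.ge", "br.ret"]  # a topological order


def slack(order, edges, latency):
    # Forward pass: earliest start once all predecessors have completed.
    earliest = {n: 0 for n in order}
    for n in order:
        for src, dst in edges:
            if src == n:
                earliest[dst] = max(earliest[dst], earliest[n] + latency[n])
    finish = max(earliest[n] + latency[n] for n in order)
    # Backward pass: latest start that still lets every successor start on time.
    latest = {n: finish - latency[n] for n in order}
    for n in reversed(order):
        for src, dst in edges:
            if src == n:
                latest[n] = min(latest[n], latest[dst] - latency[n])
    return {n: latest[n] - earliest[n] for n in order}


slacks = slack(ORDER, EDGES, LATENCY)
```

In this toy graph the chain from `add t1` to `br.ret` has zero slack, while `mov out0` can start up to three cycles late without delaying the return. A scheduler following the rule above would therefore leave `mov out0` unspeculated, since moving it earlier gains nothing.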
Conclusions
• IA-64 compilers push the complexity of the compiler
  – However, the technology is a logical progression from today’s
  – Today’s RISC compilers
    – are more complex
    – are more reliable
    – and deliver more performance
    than those of the early days
  – Complexity trend is mirrored in both hardware and applications
  – Need a balance to maximize benefits from each