Compiling for EDGE Architectures: The TRIPS Prototype Compiler
ASPLOS XII, April 22, 2023
Kathryn McKinley, Doug Burger, Steve Keckler, Jim Burrill¹, Xia Chen, Katie Coons, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Aaron Smith, Bill Yoder, et al.
The University of Texas at Austin
¹University of Massachusetts, Amherst
Technology Scaling Hitting the Wall
[Figure: a 20 mm chip edge at the 130 nm, 100 nm, 70 nm, and 35 nm technology nodes]
Analytically … Qualitatively …
Either way … Partitioning for on-chip communication is key
OO Superscalars Out of Steam
• Clock ride is over
  • Wire and pipeline limits
  • Quadratic out-of-order issue logic
  • Power, a first-order constraint
• Problems for any architectural solution
  • ILP (instruction-level parallelism)
  • Memory and on-chip latency
• Major vendors ending processor lines
What’s next?
Post-RISC Solutions
• CMP: an evolutionary path
  • Replicate what we already have 2 to N times on a chip
  • Coarse-grain parallelism
  • Exposes the resources to the programmer and compiler
• Explicit Data Graph Execution (EDGE)
  1. Program graph is broken into a sequence of blocks
     • Blocks commit atomically or not at all; a block never partially commits
  2. Dataflow within a block, with ISA support for direct producer-consumer communication
     • No shared named registers (point-to-point dataflow edges only)
     • Memory is still a shared namespace
     • The block's dataflow graph (DFG) is explicit in the architecture
Outline
• TRIPS Execution Model & ISA
• TRIPS Architectural Constraints
• Compiler Structure
• Spatial Path Scheduling
Block Atomic Execution Model
• TRIPS block: single-entry constrained hyperblock
• Dataflow execution with target position encoding
[Figure: flow graph of basic blocks (ld, shl, sw, br, add, cmp) -> TRIPS block dataflow graph (read, sw, add, br, write) -> execution substrate (register file, data caches, G-tile) running read, lw_f, addi, mov, bro_t, and write instructions]
TRIPS Block Constraints
• Fixed size: 128 instructions
  • Padded with no-ops if needed
• Load/store identifiers: 32 load or store queue identifiers
  • More than 32 static loads and stores are possible
• Registers: 32 reads and 32 writes, 8 to each of 4 banks (in addition to the 128 instructions)
• Constant output: all stores and writes execute, plus one branch
  • Simplifies hardware logic for detecting block completion
  • Every path of execution through a block must produce the same stores and register writes
• Simplifies the hardware, more work for the compiler
[Figure: a block's dataflow graph (DFG) of 1 to 128 instructions, with 32 reads and 32 writes to the register banks, 32 loads and 32 stores to memory, a PC read, and a terminating branch]
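The block constraints listed on this slide can be pictured as a simple checker. This is a minimal sketch, not the TRIPS compiler's code: the `Block` record, the `violations` function, and the rule that a register's bank is its number modulo 4 are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

MAX_INSTRUCTIONS = 128  # fixed block size
MAX_LSIDS = 32          # load/store queue identifiers per block
NUM_BANKS = 4           # register banks
MAX_PER_BANK = 8        # reads (and, separately, writes) per bank


@dataclass
class Block:
    # Hypothetical summary of one block, for illustration only.
    num_instructions: int
    lsids: set = field(default_factory=set)     # distinct load/store IDs
    reads: list = field(default_factory=list)   # registers read
    writes: list = field(default_factory=list)  # registers written


def violations(block, bank_of=lambda reg: reg % NUM_BANKS):
    """Return a list of constraint violations (empty if the block is legal).

    bank_of is an assumed register-to-bank mapping (interleaved modulo 4).
    """
    problems = []
    if block.num_instructions > MAX_INSTRUCTIONS:
        problems.append("more than 128 instructions")
    if len(block.lsids) > MAX_LSIDS:
        problems.append("more than 32 load/store IDs")
    for kind, regs in (("reads", block.reads), ("writes", block.writes)):
        per_bank = Counter(bank_of(r) for r in regs)
        for bank, count in sorted(per_bank.items()):
            if count > MAX_PER_BANK:
                problems.append(f"{count} {kind} in bank {bank} (limit 8)")
    return problems
```

A compiler phase could run a check like this after each transformation and split or shrink the block whenever the list is non-empty. The constant-output constraint is not modeled here.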
Compiler Phases (Classic)
Scale Compiler (UTexas/UMass)
• Frontend: C, FORTRAN
• Inlining, Unrolling/Flattening, Scalar Optimizations
• PRE, Global Value Numbering, Scalar Replacement, Global Variable Replacement, SCC, Copy Propagation, Array Access Strength Reduction, LICM, Tree Height Reduction, Useless Copy Removal, Dead Variable Elimination
• Code Generation: Alpha, SPARC, PPC, TRIPS TIL
TIL: TRIPS Intermediate Language, a RISC-like three-address form
TASL: TRIPS Assembly Language, a dataflow target form with locations encoded
Backend Compiler Flow
TIL -> Hyperblock Formation -> Resource Allocation -> Scheduling -> TASL
• Hyperblock Formation: if-conversion, loop peeling, while loop unrolling, instruction merging, predicate optimizations
• Resource Allocation: register allocation, reverse if-conversion & split, load/store ID assignment, SSA for constant outputs
• Scheduling: fanout insertion, instruction placement, target form generation
Correctness: Progressively Satisfy Constraints
TIL -> Hyperblock Formation -> Resource Allocation -> Scheduling -> TASL
• Hyperblock Formation: if-conversion, loop peeling, while loop unrolling, instruction merging, predicate optimizations
• Resource Allocation: register allocation, reverse if-conversion & split, load/store ID assignment, SSA for constant outputs
• Scheduling: fanout insertion, instruction placement, target form generation
Constraints:
• 128 instructions
• 32 load/store IDs
• 32 register reads/writes (8 per bank, 4 banks)
• Constant output
Predication & Hyperblock Formation
• Predication
  • Converts control dependence to data dependence
  • Improves instruction fetch bandwidth
  • Eliminates branch mispredictions
  • Adds overhead
  • Any instruction can have a predicate, but...
  • Predicate the head (low power) or the bottom (speculative)
• Hyperblock
  • Scheduling region (set of basic blocks)
  • Single entry, multiple exit, predicated instructions
  • Exposes parallelism without oversaturating resources
  • Must satisfy block constraints
[Figure: dependence chains with the predicate (P) applied at the head versus at the bottom]
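To make "convert control dependence to data dependence" concrete, here is a small sketch in plain Python. It is only an analogy for what the compiler does on TRIPS instructions: the function names are made up, and the second version mimics predicating the bottom of the chain, where both arms run speculatively and only the final select depends on the predicate.

```python
def branchy(a, b):
    # Control dependence: which assignment runs depends on the branch.
    if a < b:
        x = a + 1
    else:
        x = b * 2
    return x


def if_converted(a, b):
    # Data dependence: the comparison produces a value (the predicate),
    # and no branch is needed to pick the result.
    p = a < b              # predicate-defining test
    t = a + 1              # computed regardless of p (speculative)
    f = b * 2              # computed regardless of p (speculative)
    return t if p else f   # only this final select consumes p
```

Predicating the head instead would guard `a + 1` and `b * 2` themselves, so the untaken arm never executes. That saves work at the cost of serializing behind the predicate.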
Accuracy?
[Backend compiler flow and block constraints repeated from the "Correctness: Progressively Satisfy Constraints" slide]
Spatial Scheduling Problem
• Partitioned microarchitecture
• Anchor points
• Balance latency and concurrency
[Figure: a dataflow graph of ld, mul, add, and st instructions placed onto the partitioned execution array]
Outline
• Background
• Spatial Path Scheduling
• Simulated Annealing
• Extending SPS
• Conclusions and Future Work
Dissecting the Problem
Scheduling can have two components:
• Placement: where an instruction executes
• Issue: when an instruction executes

                     Static issue       Dynamic issue
Static placement     VLIW (SPSI)        TRIPS (SPDI), EDGE
Dynamic placement    Bad idea (DPSI)    Superscalars (DPDI)
Explicit Data Graph Execution
• Block-atomic execution
  • Instruction groups fetch, execute, and commit atomically
• Direct instruction communication
  • Explicitly encode the dataflow graph by specifying targets

RISC:
  add r1, r4, r5
  add r2, r5, r6
  add r3, r1, r2

EDGE:
  i1: add i3
  i2: add i3
  i3: add i4

[Figure: RISC instructions exchange values through a centralized register file (R4, R5, R6); EDGE instructions send results directly to their target instructions]
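The RISC-to-EDGE rewrite on this slide can be sketched as a pass that replaces each named destination register with the list of instructions that consume it. This is an illustrative sketch only: `to_target_form` and its tuple format are hypothetical, and it ignores operand slots, fanout limits, and values that live across blocks.

```python
def to_target_form(instrs):
    """instrs: list of (name, opcode, dest_reg, src_regs) in program order.

    Returns a list of (name, opcode, targets), where targets are the names
    of the instructions that consume this instruction's result.
    """
    last_def = {}  # register -> name of the instruction that last wrote it
    targets = {name: [] for name, _, _, _ in instrs}
    for name, _, dest, srcs in instrs:
        for reg in srcs:
            producer = last_def.get(reg)
            if producer is not None:   # value produced inside this block
                targets[producer].append(name)
        last_def[dest] = name
    return [(name, op, targets[name]) for name, op, _, _ in instrs]


risc = [
    ("i1", "add", "r1", ["r4", "r5"]),
    ("i2", "add", "r2", ["r5", "r6"]),
    ("i3", "add", "r3", ["r1", "r2"]),
]
edge = to_target_form(risc)
```

For the three adds on the slide, `i1` and `i2` each end up targeting `i3`. The slide's `i3: add i4` names a later consumer that this three-instruction snippet does not include, so `i3` has no targets here.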
Scheduling for TRIPS
• TRIPS ISA
  • Up to 128 instructions/block
  • Any instruction can be in any slot
• TRIPS microarchitecture
  • Up to 8 blocks in flight
  • 1 cycle latency between adjacent ALUs
• Known
  • Execution latencies
  • Lower bound for communication latency
• Unknown (estimated)
  • Memory access latencies
  • Resource conflicts
[Figure: TRIPS tile layout with a control tile (Ctrl), register tiles R0 to R3, data cache tiles D0 to D3, and a 4x4 array of execution tiles E0 to E15]
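Since the slide gives one cycle between adjacent ALUs and treats communication latency as a known lower bound, that bound can be sketched as a hop count on the 4x4 grid. Two assumptions are made here and are not stated on the slide: operands travel along rows and columns (Manhattan distance), and tile E(k) sits at row k // 4, column k % 4.

```python
GRID_WIDTH = 4  # 4x4 array of execution tiles, E0..E15


def tile_coords(e, width=GRID_WIDTH):
    """Map execution tile number to (row, col); assumed row-major layout."""
    return divmod(e, width)


def min_comm_latency(src, dst, hop_cycles=1):
    """Lower bound on operand latency between two (row, col) tiles.

    Counts one hop per row or column crossed; contention would only add
    to this, which is why it is a lower bound.
    """
    return hop_cycles * (abs(src[0] - dst[0]) + abs(src[1] - dst[1]))
```

For example, under these assumptions an operand sent from E0 to E5 needs at least 2 cycles, and one sent from E0 to E15 needs at least 6.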
Greedy Scheduling for TRIPS
• GRST [PACT '04]: based on VLIW list scheduling, augmented with five heuristics
  1. Prioritizes critical path (C)
  2. Reprioritizes after each placement (R)
  3. Accounts for data cache locality (L)
  4. Accounts for register output locality (O)
  5. Load balancing for local issue contention (B)
• Drawbacks
  • Unnecessary restrictions on scheduling order
  • Inelegant and overly specific
• Replace heuristics with an elegant approach designed for spatial scheduling
Spatial Path Scheduling Overview
[Figure: the scheduler takes a dataflow graph (read R1, read R2, ld, ld, mul, mul, add, add, br, write R1) and the machine topology (register tiles R1 and R2, data cache tiles D0 and D1, control tile, execution tiles) and produces a placement. Legend: register, data cache, execution, control.]
• Initialize all known anchor points
• Until all instructions are scheduled:
  1. Populate the open list
  2. Find placement costs
  3. Choose the minimum cost location
  4. Schedule the instruction whose minimum placement cost is largest (choose the max of the mins)
Spatial Path Scheduling Example
Initialize all known anchor points.
[Figure: the example dataflow graph (read R1, read R2, ld, ld, mul, mul, add, add, br, write R1) beside the tile grid (R1, R2, D0, D1, ctrl, register file, data cache). Legend: register, data cache, execution, control, unplaced.]
Spatial Path Scheduling Example
Populate the open list (marked in yellow).
• Open list: instructions that are candidates for scheduling
• We include instructions with no parents, or with at least one placed parent
[Figure: the example dataflow graph with open-list instructions highlighted]
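The open-list rule stated on this slide is small enough to write down directly. In this sketch the dataflow graph is a plain dictionary from each instruction to its parents, and the instruction names are made up for the example.

```python
def open_list(parents, placed):
    """Return unplaced instructions that are candidates for scheduling.

    parents: dict mapping instruction -> list of parent instructions.
    placed:  set of instructions that already have a location
             (anchor points count as placed).
    An instruction qualifies if it has no parents, or if at least one of
    its parents has been placed.
    """
    return [
        instr
        for instr, ps in parents.items()
        if instr not in placed and (not ps or any(p in placed for p in ps))
    ]


dfg = {
    "read_R1": [],
    "read_R2": [],
    "ld": ["read_R1"],
    "mul": ["ld", "read_R2"],
    "add": ["mul"],
    "write_R1": ["add"],
}
anchors = {"read_R1", "read_R2", "write_R1"}
candidates = open_list(dfg, anchors)
```

With only the anchors placed, `ld` and `mul` are open because each has a placed parent, while `add` must wait until `mul` is placed.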
Spatial Path Scheduling Example
Calculate the placement cost for each instruction in the open list at each slot.
• Placement cost(i, slot): longest path length through i if placed at slot
• cost = inputCost + execCost + outputCost (includes communication and execution latencies)
[Figure: the example dataflow graph annotated with latencies of 1 or 3 cycles]
Spatial Path Scheduling Example
Calculate the placement cost for each instruction in the open list at each slot.
[Figure: mul placed at E1 on the tile grid, with path segments of 5 cycles, 3 cycles, and 1 cycle marked]
Total placement cost = 16 + 3 + 3 = 22
Spatial Path Scheduling Example
Calculate the placement cost for each instruction in the open list at each slot.
Placement cost of each open-list instruction at each tile of the 4x4 execution array:

add:
  10   8   8  10
  10  10  10  12
  12  12  12  14
  14  14  14  16

mul:
  24  24  22  24
  22  22  22  24
  24  24  24  28
  26  26  26  28

mul:
  22  22  24  26
  22  22  24  26
  24  24  26  28
  26  26  28  30

add:
  22  22  24  26
  22  22  24  26
  24  24  26  28
  26  26  28  30
Spatial Path Scheduling Example
Choose the minimum cost location for each instruction.
[Figure: the per-instruction placement cost grids for add, mul, mul, and add, repeated]
Spatial Path Scheduling Example
Break ties. Example heuristics:
• Links consumed
• ALU utilization
[Figure: the per-instruction placement cost grids for add, mul, mul, and add, repeated]
Spatial Path Scheduling Example
Place the instruction with the highest minimum cost (choose the max of the mins).
[Figure: mul is placed on the tile grid; the per-instruction placement cost grids are repeated]
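The "max of the mins" choice can be replayed on the cost grids from this example. The sketch below assumes each run of 16 numbers on the slides is a 4x4 grid read row by row. The suffixes `_a` and `_b` are added only to tell the two adds and two muls apart.

```python
SHARED = [[22, 22, 24, 26], [22, 22, 24, 26], [24, 24, 26, 28], [26, 26, 28, 30]]

grids = {
    "add_a": [[10, 8, 8, 10], [10, 10, 10, 12], [12, 12, 12, 14], [14, 14, 14, 16]],
    "mul_a": [[24, 24, 22, 24], [22, 22, 22, 24], [24, 24, 24, 28], [26, 26, 26, 28]],
    "mul_b": SHARED,
    "add_b": SHARED,
}


def min_cost(grid):
    """Cheapest placement cost over all tiles for one instruction."""
    return min(min(row) for row in grid)


def max_of_mins(grids):
    """Return each instruction's minimum cost and the instructions whose
    minimum is largest (more than one name means a tie to break)."""
    mins = {name: min_cost(grid) for name, grid in grids.items()}
    top = max(mins.values())
    return mins, [name for name, m in mins.items() if m == top]


mins, tied = max_of_mins(grids)
```

Here the first add has a minimum of 8 and the other three instructions all have a minimum of 22, so three candidates tie. That is where tie-breaking heuristics such as links consumed or ALU utilization come in.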
Spatial Path Scheduling Algorithm

Schedule(block, topology)
  initialize known anchor points
  while (not all instructions scheduled)
    for each instruction i in open list
      for each available location n
        calculate placement cost for (i, n)
        keep track of n with min placement cost
      keep track of i with highest min placement cost
    schedule i with highest min placement cost

Per-block complexity (i = # of instructions, n = # of ALUs):
• SPS: O(i^2 * n)
• GRST: O(i^2 + i * n)
• Exhaustive search: i!
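The pseudocode above can be turned into a small runnable sketch. Several simplifications here are assumptions, not the TRIPS scheduler's actual model: tiles are (row, col) pairs with Manhattan-distance communication, each tile holds one instruction, unplaced parents are ignored when estimating input cost, unplaced children contribute only their latency-only path, and ties go to the first candidate in sorted order. `sps_schedule` and its helpers are hypothetical names.

```python
from functools import lru_cache


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def sps_schedule(parents, latency, anchors, slots):
    """parents: instr -> list of parents; latency: instr -> cycles;
    anchors: instr -> fixed (row, col); slots: free (row, col) tiles."""
    children = {i: [] for i in parents}
    for i, ps in parents.items():
        for p in ps:
            children[p].append(i)

    @lru_cache(maxsize=None)
    def tail(i):
        # Latency-only longest path below i (ignores communication).
        return max((latency[c] + tail(c) for c in children[i]), default=0)

    loc = dict(anchors)  # initialize known anchor points
    free = set(slots)

    def done(i):
        # Cycle at which placed instruction i produces its result.
        return input_cost(i, loc[i]) + latency[i]

    def input_cost(i, at):
        return max((done(p) + manhattan(loc[p], at)
                    for p in parents[i] if p in loc), default=0)

    def output_cost(i, at):
        return max(((manhattan(at, loc[c]) if c in loc else 0)
                    + latency[c] + tail(c) for c in children[i]), default=0)

    def cost(i, at):
        # cost = inputCost + execCost + outputCost
        return input_cost(i, at) + latency[i] + output_cost(i, at)

    todo = set(parents) - set(loc)
    while todo:
        open_list = sorted(i for i in todo if not parents[i]
                           or any(p in loc for p in parents[i]))
        best = None
        for i in open_list:
            c, at = min((cost(i, s), s) for s in sorted(free))
            if best is None or c > best[0]:   # max of the mins
                best = (c, i, at)
        _, i, at = best
        loc[i] = at
        free.discard(at)
        todo.discard(i)
    return loc
```

As a usage example, scheduling a three-instruction chain `read -> add -> write`, with both register accesses anchored at (0, 0) and unit latencies, places `add` on the free tile nearest the anchor, since that tile minimizes the path through it.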
SPS Benefits and Limitations
• Benefits
  • Automatically exploits known communication latencies
  • Designed for spatial scheduling
  • Minimizes critical path length at each step
  • Naturally encompasses four of five GRST heuristics
• Limitations of basic algorithm
  • Does not account for resource contention
  • Uses no global information
  • Minimum communication latencies may be optimistic
Experimental Methodology
• 26 hand-optimized microbenchmarks
  • Extracted from SPEC2000, EEMBC, Livermore Loops, MediaBench, and C libraries
  • Average dynamic instructions fetched/block: 67.3 (ranges from 14.5 to 117.5)
• Cycle-accurate simulator
  • Within 4% of RTL on average
  • Models communication and contention delays
• Comparison points
  • Greedy Scheduling for TRIPS (GRST)
  • Simulated annealing
SPS Performance
Geometric mean of speedup over GRST: 1.19
[Figure: bar chart of Basic SPS speedup over GRST (y-axis: Speedup, 0.8 to 2) for each hand-coded microbenchmark: a2time01, ammp_2, art_1, art_2, art_3, bzip2_1, cfar, conv, ct, equake_1, genalg, gzip_1, gzip_2, matrix_1, memchr, memcpy, memset, parser_1, pm, qr, rbtree, sha, strcmp, svd, transpose_GMTI, vadd, and the geometric mean]
How well can we do?
• Simulated annealing
  • Artificial intelligence search technique
  • Uses random perturbations to avoid local optima
  • Approximates a global optimum
• Cost function: simulated cycles
  • Uncertainty makes static cost functions insufficient
  • Best cost function
• Purpose
  • Optimization
  • Discover performance upper bound
  • Tool to improve scheduler
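A generic simulated-annealing loop over placements might look like the sketch below. It is not the annealer used in the talk: the cooling schedule, the swap move, and the parameter values are assumptions, and the `wire_length` stand-in replaces the real cost function (cycles reported by the simulator) so that the example runs on its own.

```python
import math
import random


def anneal(placement, slots, cost, steps=2000, t0=5.0, alpha=0.995, seed=0):
    """placement: instr -> slot (one instruction per slot).
    cost: function from a placement to a number (lower is better)."""
    rng = random.Random(seed)
    cur, cur_cost = dict(placement), cost(placement)
    best, best_cost = dict(cur), cur_cost
    instrs = sorted(cur)
    temp = t0
    for _ in range(steps):
        cand = dict(cur)
        i = rng.choice(instrs)
        s = rng.choice(slots)
        for j, sj in cand.items():   # random perturbation: move i to s,
            if sj == s:              # swapping with any current occupant
                cand[j] = cand[i]
                break
        cand[i] = s
        c = cost(cand)
        # Always accept improvements; accept a worse candidate with a
        # probability that shrinks as the temperature drops.
        if c <= cur_cost or rng.random() < math.exp((cur_cost - c) / temp):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = dict(cand), c
        temp *= alpha
    return best, best_cost


def wire_length(placement, edges=(("ld", "mul"), ("mul", "add"), ("add", "st"))):
    """Stand-in cost: total Manhattan distance along dataflow edges."""
    return sum(abs(placement[a][0] - placement[b][0])
               + abs(placement[a][1] - placement[b][1]) for a, b in edges)


slots = [(r, c) for r in range(2) for c in range(2)]
start = {"ld": (0, 0), "mul": (1, 1), "add": (0, 1), "st": (1, 0)}
best, best_cost = anneal(start, slots, wire_length)
```

Because the best placement seen is tracked separately from the current one, the result is never worse than the starting schedule. That property is what makes annealing usable as an estimate of the performance upper bound.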
Speedup with Simulated Annealing
Geometric mean of speedup over GRST: Basic SPS 1.19, Annealed 1.40
[Figure: bar chart of speedup over GRST (y-axis: Speedup, 0.8 to 2.2) for Basic SPS and Annealed schedules on each of the 26 hand-coded microbenchmarks and the geometric mean]
Extending SPS
• Contention
  • Network link contention
  • Local and global ALU contention
• Global register prioritization
• Path volume scheduling
ALU Contention
What if two instructions are ready to execute on the same ALU at the same time?
[Figure: the example dataflow graph and its placement on the tile grid (R1, R2, D0, D2, ctrl, register file, data cache)]
Local vs. Global ALU Contention
• Local ALU contention
  • Keep track of expected issue time
  • Increase placement cost if a conflict occurs
• Global ALU contention
  • Resource utilization in previous/next block
  • Weighting function
• Modify placement cost
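The local-contention idea (track expected issue times, raise the placement cost on a conflict) could be sketched as follows. The one-cycle penalty, the per-ALU set of claimed cycles, and both function names are illustrative assumptions. The slide does not say how large the increase is or how issue times are estimated.

```python
def local_contention_cost(base_cost, alu, issue_cycle, claimed, penalty=1):
    """Return base_cost, raised by `penalty` if another instruction in the
    same block is already expected to issue on `alu` at `issue_cycle`."""
    busy = issue_cycle in claimed.get(alu, set())
    return base_cost + (penalty if busy else 0)


def claim(alu, issue_cycle, claimed):
    """Record the expected issue cycle of an instruction placed on `alu`."""
    claimed.setdefault(alu, set()).add(issue_cycle)


claimed = {}
claim("E5", 7, claimed)   # an earlier placement expects to issue at cycle 7
cost_conflict = local_contention_cost(22, "E5", 7, claimed)
cost_free = local_contention_cost(22, "E5", 8, claimed)
```

Global contention would need a different signal, such as the utilization of the same ALU by the previous and next blocks, folded in through a weighting function.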
Speedup over GRST
Geometric mean of speedup over GRST: Basic SPS 1.19, SPS extended 1.31, Annealed 1.40
[Figure: bar chart of speedup over GRST (y-axis: Speedup, 0.8 to 2.2) for Basic SPS, SPS extended, and Annealed schedules on each of the 26 hand-coded microbenchmarks and the geometric mean]
Related Work
• Scheduling for VLIW [Ellis, Fisher]
• Scheduling for other partitioned architectures
  • Partitioned VLIW [Gilbert, Kailas, Kessler, Özer, Qian, Zalamea]
  • RAW [Lee]
  • WaveScalar [Mercaldi]
• ASIC and FPGA place and route [Paulin]
  • Resource conflicts known statically
  • Substrate may not be fixed
  • Simulated annealing [Betz]
Conclusions and Future Work
• Conclusions
  • General spatial instruction scheduling algorithm
  • Reasons explicitly about anchor points
  • Performance within 4% of annealed results
• Future work
  • Register allocation
  • Memory placement
  • Reliability-aware scheduling
Questions?
Mapping Instructions to Physical Locations
• Scheduler converts operand format to target format, and assigns IDs
• ID assigned to each instruction indicates physical location
• The microarchitecture can interpret this ID in many different ways
• To schedule well, the scheduler must understand how the microarchitecture translates ID -> physical location

TIL (operand format)    TASL (target format)
read t0, g1             R[1] read, G[1], N[5]
read t1, g2             R[2] read, N[2], N[6]
muli t2, t1, 4          N[2] muli, N[34], N[1]
ld t3, 0(t2)            N[34] ld, N[32]
ld t4, 4(t2)            N[1] ld, N[32]
mul t5, t3, t4          N[32] mul, N[5]
add t6, t5, t0          N[5] add, W[1]
addi t7, t1, 8          N[6] addi, N[0]
br t7                   N[0] br
write g1, t6            W[1] write, G[1]
Mapping Instructions to Physical Locations
Instruction IDs held by each tile. Register read/write IDs are listed across the top, one column per register tile; each row of execution tiles sits beside one data cache tile (D0 to D3):

ctrl    R0,R4,… R28     R1,R5,… R29     R2,R6,… R30     R3,R7,… R31
D0      0,4,8,… 28      1,5,9,… 29      2,6,10,… 30     3,7,11,… 31
D1      32,36,… 60      33,37,… 61      34,38,… 62      35,39,… 63
D2      64,68,… 92      65,69,… 93      66,70,… 94      67,71,… 95
D3      96,100,… 124    97,101,… 125    98,102,… 126    99,103,… 127
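The ID grid on this slide follows a regular pattern that can be written as a few integer operations. This sketch captures only the interpretation shown on the slide (IDs 0 to 127, stride 4 within a tile, 32 per row). The slide itself notes that the microarchitecture could interpret IDs in other ways, and the function names are made up.

```python
def tile_of(instr_id):
    """Map an instruction ID N[k] to (row, col, slot) on the 4x4 array.

    Each tile holds 8 IDs; a row of tiles covers 32 consecutive IDs, and
    consecutive IDs within a row rotate across the 4 columns.
    """
    if not 0 <= instr_id < 128:
        raise ValueError("instruction IDs range from 0 to 127")
    row = instr_id // 32
    col = instr_id % 4
    slot = (instr_id % 32) // 4
    return row, col, slot


def bank_of(reg_id):
    """Map a register read/write ID R[k] (0..31) to its register tile."""
    if not 0 <= reg_id < 32:
        raise ValueError("register IDs range from 0 to 31")
    return reg_id % 4
```

In the TASL example, `N[2]` (muli) lands in row 0, column 2, and its target `N[34]` (ld) lands in row 1 of the same column, so the two are on adjacent tiles.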
Simulated Annealing Over Time
[Figure: line chart of simulation cycles (y-axis, 60000 to 100000) versus annealing iterations (x-axis, 1 to 1680) for four series: random accepted, random best, guided accepted, guided best]
Simulated Annealing
• Cost function: simulated cycles
• Prune space further with critical path tool
[Figure: Guided vs. unguided annealing for memset_hand. Simulation cycles (y-axis, 76000 to 83000) versus annealing times (x-axis, 1 to 101) for two series: Random Move and Guided Move]
Contention
• ALU contention
  • Local (within a block): estimate temporal schedule
  • Global (between blocks): probabilistic, use weighting function
• Network link contention
  • Precise measurements too inaccurate
  • Estimate with threshold, weighting function
• Weight network link and global ALU contention based on annealed results

weight = (1 - fullness) * (1 - criticality / concurrency)
Global Register Prioritization
• Problem: any register dependence may be important with speculative execution
• Solution: extend path lengths through registers
• Register prioritization:
  1) Schedule smaller loops before larger loops
  2) Schedule loop-carried dependences first
  3) Extend placement cost through registers to previous/next block
Path Volume Scheduling
• Problem: the basic SPS algorithm does not account for the number of instructions in the path
• Solution: perform a depth-first search with iterative deepening to find the shortest path that holds all instructions
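One way to read "the shortest path that holds all instructions" is as a search for the shortest tile-to-tile route whose free instruction slots add up to at least the number of instructions on the dataflow path. The sketch below implements that reading with depth-first search under iterative deepening. The grid model, the `free` slot counts, and the function name are assumptions, and the actual scheduler may define path volume differently.

```python
def shortest_path_with_volume(free, src, dst, need):
    """free: dict (row, col) -> number of free instruction slots on that tile.
    Returns the shortest simple path from src to dst whose tiles together
    offer at least `need` free slots, or None if no such path exists."""

    def dfs(tile, budget, path, volume):
        path = path + [tile]
        volume += free[tile]
        if tile == dst:
            return path if volume >= need else None
        if budget == 0:
            return None
        r, c = tile
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in free and nxt not in path:
                found = dfs(nxt, budget - 1, path, volume)
                if found:
                    return found
        return None

    # Iterative deepening: retry with a longer hop budget until a path
    # with enough volume turns up, so the first hit is a shortest one.
    for depth in range(len(free)):
        found = dfs(src, depth, [], 0)
        if found:
            return found
    return None
```

On a 2x2 grid with one free slot per tile, a two-instruction path fits on the direct route between neighbors, while a four-instruction path forces the detour through all four tiles.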