Compiling for EDGE Architectures: The TRIPS Prototype Compiler

ASPLOS XII, April 22, 2023


Kathryn McKinley, Doug Burger, Steve Keckler, Jim Burrill (1), Xia Chen, Katie Coons, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Aaron Smith, Bill Yoder, et al.

The University of Texas at Austin
(1) University of Massachusetts, Amherst


Technology Scaling Hitting the Wall

[Figure: a 20 mm chip edge at the 130 nm, 100 nm, 70 nm, and 35 nm technology nodes]

Analytically … Qualitatively …

Either way … Partitioning for on-chip communication is key


OO Superscalars Out of Steam

Clock ride is over
  Wire and pipeline limits
  Quadratic out-of-order issue logic
  Power, a first-order constraint

Problems for any architectural solution
  ILP - instruction-level parallelism
  Memory and on-chip latency

Major vendors ending processor lines



What’s next?


Post-RISC Solutions

CMP - An evolutionary path
  Replicate what we already have 2 to N times on a chip
  Coarse-grain parallelism
  Exposes the resources to the programmer and compiler

Explicit Data Graph Execution (EDGE)
  1. Program graph is broken into a sequence of blocks
     Blocks commit atomically or not at all; a block never partially commits
  2. Dataflow within a block, with ISA support for direct producer-consumer communication
     No shared named registers (point-to-point dataflow edges only)
     Memory is still a shared namespace
     The block’s dataflow graph (DFG) is explicit in the architecture


Outline
  TRIPS Execution Model & ISA
  TRIPS Architectural Constraints
  Compiler Structure
  Spatial Path Scheduling


Block Atomic Execution Model

• TRIPS block - single-entry constrained hyperblock
• Dataflow execution w/ target position encoding

[Figure: a flow graph, the TRIPS block formed from it, the block's dataflow graph with register reads and writes, and its mapping onto the execution substrate (register file, data caches, G tile). The mapped example uses read, addi, lw_f, mov, bro_t, and write instructions.]

TRIPS Block Constraints

Fixed size: 128 instructions
  Padded with no-ops if needed

Load/store identifiers: 32 load or store queue identifiers
  More than 32 static loads and stores are possible

Registers: 32 reads and 32 writes, 8 to each of 4 banks (in addition to the 128 instructions)

Constant output: all stores and writes execute, one branch
  Simplifies hardware logic for detecting block completion
  Every path of execution through a block must produce the same stores and register writes

Simplifies the hardware, more work for the compiler

[Figure: a block as a DFG of 1 - 128 instructions. Inputs: 32 register reads from the register banks, 32 loads from memory, and a PC read. Outputs: 32 register writes, 32 stores, and a terminating branch that sets the PC.]
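The structural limits on this slide can be checked mechanically. The following is a minimal sketch, not the TRIPS compiler's own code: the `Block` class, its field names, and the `violations` helper are hypothetical, and the bank mapping (register number modulo 4) is an assumption. The constant-output constraint needs path-sensitive analysis and is not checked here.

```python
from dataclasses import dataclass, field

# Limits taken from the slide; all names below are hypothetical.
MAX_INSTRUCTIONS = 128
MAX_LSIDS = 32
NUM_BANKS = 4
MAX_PER_BANK = 8  # 8 reads and 8 writes per bank, 32 of each in total


@dataclass
class Block:
    num_instructions: int
    lsids: set = field(default_factory=set)          # load/store IDs in use
    reg_reads: list = field(default_factory=list)    # register numbers read
    reg_writes: list = field(default_factory=list)   # register numbers written


def bank_of(reg: int) -> int:
    # Assumption: registers are interleaved across the four banks.
    return reg % NUM_BANKS


def violations(block: Block) -> list:
    """Return human-readable constraint violations (empty list if legal)."""
    problems = []
    if not 1 <= block.num_instructions <= MAX_INSTRUCTIONS:
        problems.append(
            f"{block.num_instructions} instructions (limit {MAX_INSTRUCTIONS})")
    if len(block.lsids) > MAX_LSIDS:
        problems.append(f"{len(block.lsids)} load/store IDs (limit {MAX_LSIDS})")
    for kind, regs in (("read", block.reg_reads), ("write", block.reg_writes)):
        for bank in range(NUM_BANKS):
            count = sum(1 for r in set(regs) if bank_of(r) == bank)
            if count > MAX_PER_BANK:
                problems.append(
                    f"{count} register {kind}s in bank {bank} (limit {MAX_PER_BANK})")
    return problems


ok = Block(num_instructions=100, lsids=set(range(10)), reg_reads=[1, 2], reg_writes=[1])
too_big = Block(num_instructions=130, lsids=set(range(33)),
                reg_reads=list(range(0, 36, 4)))  # nine reads, all in bank 0
print(violations(ok))
print(violations(too_big))
```

A compiler phase could run a check like this after each transformation and split or shrink any block that reports a violation.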


Compiler Phases (Classic)

Scale Compiler (UTexas/UMass)

Frontend: C, FORTRAN

Inlining, Unrolling/Flattening, Scalar Optimizations
  PRE
  Global Value Numbering
  Scalar Replacement
  Global Variable Replacement
  SCC
  Copy Propagation
  Array Access Strength Reduction
  LICM
  Tree Height Reduction
  Useless Copy Removal
  Dead Variable Elimination

Code Generation: Alpha, SPARC, PPC, TRIPS TIL

TIL: TRIPS Intermediate Language - RISC-like three-address form

TASL: TRIPS Assembly Language - dataflow target form w/ locations encoded


Backend Compiler Flow

TIL -> Hyperblock Formation -> Resource Allocation -> Scheduling -> TASL

Hyperblock Formation
  If-conversion
  Loop peeling
  While loop unrolling
  Instruction merging
  Predicate optimizations

Resource Allocation
  Register allocation
  Reverse if-conversion & split
  Load/Store ID assignment
  SSA for constant outputs

Scheduling
  Fanout insertion
  Instruction placement
  Target form generation


Correctness: Progressively Satisfy Constraints

[Figure: the backend flow from TIL through Resource Allocation and Scheduling to TASL, set against the block constraints]

Constraints
  128 instructions
  32 load/store IDs
  32 reg. read/write (8 per 4 banks)
  Constant output


Predication & Hyperblock Formation

Predication
  Convert control dependence to data dependence
  Improves instruction fetch bandwidth
  Eliminates branch mispredictions
  Adds overhead
  Any instruction can have a predicate, but...
  Predicate head (low power) or bottom (speculative)

Hyperblock
  Scheduling region (set of basic blocks)
  Single entry, multiple exit, predicated instructions
  Expose parallelism w/o oversaturating resources
  Must satisfy block constraints

[Figure: predicate (P) applied at the head of a dependence chain vs. at the bottom]


Accuracy?

[Figure: the backend flow from TIL through Resource Allocation and Scheduling to TASL, set against the block constraints]

Constraints
  128 instructions
  32 load/store IDs
  32 reg. read/write (8 per 4 banks)
  Constant output



Spatial Scheduling Problem

Partitioned microarchitecture

Anchor points

Balance latency and concurrency

[Figure: a dataflow graph of ld, mul, add, and st instructions mapped onto the partitioned microarchitecture, with the ld and st anchor points marked]


Outline
  Background
  Spatial Path Scheduling
  Simulated Annealing
  Extending SPS
  Conclusions and Future Work


Dissecting the Problem

Scheduling can have two components
  Placement: Where an instruction executes
  Issue: When an instruction executes

                     Static issue       Dynamic issue
Static placement     VLIW (SPSI)        TRIPS (SPDI), EDGE
Dynamic placement    Bad idea (DPSI)    Superscalars (DPDI)


Explicit Data Graph Execution

Block-atomic execution
  Instruction groups fetch, execute, and commit atomically

Direct instruction communication
  Explicitly encode the dataflow graph by specifying targets

RISC                  EDGE
add r1, r4, r5        i1: add i3
add r2, r5, r6        i2: add i3
add r3, r1, r2        i3: add i4

[Figure: in the RISC case the adds communicate through a centralized register file (R4, R5, R6); in the EDGE case instructions i1, i2, and i3 send results directly to their targets]
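The step from register names to explicit targets can be sketched as a small def-use pass. This is an illustrative sketch only: `to_target_form` and its tuple format are hypothetical, and it ignores the encoding details of the real ISA. In the slide, i3 targets i4, a consumer outside the three-instruction fragment, so i3 has no target here.

```python
def to_target_form(instructions):
    """Convert three-address instructions into producer -> consumer targets.

    `instructions` is a list of (opcode, dest, src1, src2) tuples. The result
    is a list of (label, opcode, targets), where targets are the labels of
    the instructions that consume the produced value.
    """
    last_writer = {}  # register name -> index of the instruction that last wrote it
    targets = [[] for _ in instructions]
    for idx, (_, dest, *sources) in enumerate(instructions):
        label = f"i{idx + 1}"
        for src in sources:
            if src in last_writer and label not in targets[last_writer[src]]:
                targets[last_writer[src]].append(label)
        last_writer[dest] = idx
    return [(f"i{i + 1}", op, targets[i])
            for i, (op, *_rest) in enumerate(instructions)]


risc = [
    ("add", "r1", "r4", "r5"),
    ("add", "r2", "r5", "r6"),
    ("add", "r3", "r1", "r2"),
]
for label, op, tgts in to_target_form(risc):
    print(f"{label}: {op} {' '.join(tgts)}".rstrip())
```

Registers such as r4, r5, and r6 that no instruction in the fragment writes are block inputs; in an EDGE block they would arrive through register reads rather than through named operands.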


Scheduling for TRIPS

TRIPS ISA
  Up to 128 instructions/block
  Any instruction can be in any slot

TRIPS microarchitecture
  Up to 8 blocks in flight
  1 cycle latency between adjacent ALUs

Known
  Execution latencies
  Lower bound for communication latency

Unknown (estimated)
  Memory access latencies
  Resource conflicts

[Figure: TRIPS tile layout with a control tile (Ctrl), register file banks R0-R3, data cache banks D0-D3, and a 4x4 array of execution tiles E0-E15]



Greedy Scheduling for TRIPS

GRST [PACT '04]: Based on VLIW list scheduling, augmented with five heuristics
  1. Prioritizes critical path (C)
  2. Reprioritizes after each placement (R)
  3. Accounts for data cache locality (L)
  4. Accounts for register output locality (O)
  5. Load balancing for local issue contention (B)

Drawbacks
  Unnecessary restrictions on scheduling order
  Inelegant and overly specific

Replace heuristics with an elegant approach designed for spatial scheduling


Spatial Path Scheduling Overview

[Figure: the scheduler takes a dataflow graph (read, ld, ld, mul, mul, add, add, br, write) and a topology (register banks R1 and R2, data cache banks D0 and D1, control tile, execution tiles) and produces a placement. Legend: Register, Data cache, Execution, Control.]

Spatial Path Scheduling Overview

Initialize all known anchor points

Until all instructions are scheduled:
  1. Populate the open list
  2. Find placement costs
  3. Choose the minimum cost location
  4. Schedule the instruction whose minimum placement cost is largest
     (Choose the max of the mins)

[Figure: example dataflow graph (read R1, read R2, ld, ld, mul, mul, add, add, br, write R1) beside the topology (register file banks R1 and R2, data cache banks D0 and D1, ctrl)]

Spatial Path Scheduling Example

Initialize all known anchor points

Legend: Register, Data cache, Execution, Control, Unplaced


Spatial Path Scheduling Example

Populate the open list (marked in yellow)

Open list: Instructions that are candidates for scheduling

We include: Instructions with no parents, or with at least one placed parent

[Figure: the example dataflow graph with the open-list instructions highlighted]
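The open-list rule translates directly into a filter over the dataflow graph. A minimal sketch follows; the graph and instruction names are hypothetical and only loosely modeled on the slide's example.

```python
def open_list(parents, placed):
    """Unplaced instructions with no parents or with at least one placed parent.

    parents: dict mapping each instruction to the list of its producers
    placed:  set of instructions that already have a location
    """
    return [
        instr
        for instr, preds in parents.items()
        if instr not in placed and (not preds or any(p in placed for p in preds))
    ]


# Hypothetical dataflow graph; register reads and writes act as anchor points.
parents = {
    "read_r1": [],
    "read_r2": [],
    "muli": ["read_r2"],
    "ld_a": ["muli"],
    "ld_b": ["muli"],
    "mul": ["ld_a", "ld_b"],
    "add": ["mul", "read_r1"],
    "write_r1": ["add"],
}
placed = {"read_r1", "read_r2", "write_r1"}
print(open_list(parents, placed))
```

Note that `add` is a candidate even though `mul` is still unplaced: one placed parent is enough under the rule on the slide.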


Spatial Path Scheduling Example

Calculate placement cost for each instruction in the open list at each slot

Placement cost(i, slot): Longest path length through i if placed at slot

cost = inputCost + execCost + outputCost (includes communication and execution latencies)

[Figure: the example dataflow graph annotated with latencies of 1 and 3 cycles]

Spatial Path Scheduling Example

Calculate placement cost for each instruction in the open list at each slot

[Figure: candidate placement of mul at E1, with path segments of 5 cycles, 3 cycles, and 1 cycle marked on the topology]

Total placement cost = 16 + 3 + 3 = 22



Calculate placement cost for each instruction in the open list at each slot

Placement cost at each slot of the 4x4 execution array, one grid per open-list instruction:

add:  10  8  8 10
      10 10 10 12
      12 12 12 14
      14 14 14 16

mul:  24 24 22 24
      22 22 22 24
      24 24 24 28
      26 26 26 28

mul:  22 22 24 26
      22 22 24 26
      24 24 26 28
      26 26 28 30

add:  22 22 24 26
      22 22 24 26
      24 24 26 28
      26 26 28 30



Choose the minimum cost location for each instruction

[Figure: per-slot placement cost grids for the open-list instructions (add, mul, mul, add), with each instruction's minimum cost location marked]


Spatial Path Scheduling Example

Break ties
  Example heuristics: Links consumed, ALU utilization

[Figure: per-slot placement cost grids for the open-list instructions (add, mul, mul, add)]


Spatial Path Scheduling Example

Place the instruction with the highest minimum cost
(Choose the max of the mins)

[Figure: mul is placed on the execution array; per-slot placement cost grids shown for the open-list instructions]


Spatial Path Scheduling Algorithm

Schedule (block, topology)
  initialize known anchor points
  while (not all instructions scheduled)
    for each instruction in open list, i
      for each available location, n
        calculate placement cost for (i, n)
        keep track of n with min placement cost
      keep track of i with highest min placement cost
    schedule i with highest min placement cost

Per-block complexity (i = # of instructions, n = # of ALUs):
  SPS: O(i^2 * n)
  GRST: O(i^2 + i * n)
  Exhaustive search: i!
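The max-of-mins loop can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: each slot holds a single instruction, the graph and the `DOWNSTREAM` path lengths are hypothetical, and `toy_cost` is a stand-in for the real inputCost + execCost + outputCost estimate.

```python
def schedule(instructions, parents, anchors, slots, placement_cost):
    """Greedy max-of-mins placement loop (simplified: one instruction per slot)."""
    placement = dict(anchors)  # anchor points are placed before the loop starts
    free = [s for s in slots if s not in placement.values()]
    remaining = [i for i in instructions if i not in placement]
    while remaining:
        # 1. Populate the open list.
        candidates = [i for i in remaining
                      if not parents[i] or any(p in placement for p in parents[i])]
        if not candidates:
            raise ValueError("no candidate has a placed parent")
        # 2-3. Find each candidate's cheapest free location.
        best = {i: min((placement_cost(i, s, placement), s) for s in free)
                for i in candidates}
        # 4. Schedule the candidate whose cheapest location is most expensive.
        chosen = max(candidates, key=lambda i: best[i][0])
        _, slot = best[chosen]
        placement[chosen] = slot
        free.remove(slot)
        remaining.remove(chosen)
    return placement


# Hypothetical example: register reads are anchored in a row above a 2x2 grid.
PARENTS = {"read_r1": [], "read_r2": [], "muli": ["read_r2"],
           "ld": ["muli"], "add": ["ld", "read_r1"]}
DOWNSTREAM = {"muli": 2, "ld": 1, "add": 0}  # instructions left on the longest path


def hops(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def toy_cost(instr, slot, placement):
    placed = [p for p in PARENTS[instr] if p in placement]
    arrival = max((hops(placement[p], slot) for p in placed), default=0)
    return arrival + 1 + DOWNSTREAM[instr]  # input + exec + estimated output path


result = schedule(["muli", "ld", "add"], PARENTS,
                  {"read_r1": (-1, 0), "read_r2": (-1, 1)},
                  [(r, c) for r in range(2) for c in range(2)], toy_cost)
print(result)
```

Placing the instruction with the highest minimum cost first gives the most constrained instruction its best location before less critical instructions can take it.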


SPS Benefits and Limitations

Benefits
  Automatically exploits known communication latencies
  Designed for spatial scheduling
  Minimizes critical path length at each step
  Naturally encompasses four of five GRST heuristics

Limitations of basic algorithm
  Does not account for resource contention
  Uses no global information
  Minimum communication latencies may be optimistic


Experimental Methodology

26 hand-optimized microbenchmarks
  Extracted from SPEC2000, EEMBC, Livermore Loops, MediaBench, and C libraries
  Average dynamic instructions fetched/block: 67.3 (ranges from 14.5 to 117.5)

Cycle-accurate simulator
  Within 4% of RTL on average
  Models communication and contention delays

Comparison points
  Greedy Scheduling for TRIPS (GRST)
  Simulated annealing


SPS Performance

Geometric mean of speedup over GRST: 1.19

[Figure: bar chart of Basic SPS speedup (y-axis, 0.8 to 2) for each hand-coded microbenchmark: a2time01, ammp_2, art_1, art_2, art_3, bzip2_1, cfar, conv, ct, equake_1, genalg, gzip_1, gzip_2, matrix_1, memchr, memcpy, memset, parser_1, pm, qr, rbtree, sha, strcmp, svd, transpose_GMTI, vadd, and the geometric mean]




How well can we do?

Simulated annealing
  Artificial intelligence search technique
  Uses random perturbations to avoid local optima
  Approximates a global optimum

Cost function: simulated cycles
  Uncertainty makes static cost functions insufficient
  Best cost function

Purpose
  Optimization
  Discover performance upper bound
  Tool to improve scheduler
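A generic annealing loop over placements can be sketched as follows. The talk uses simulated cycles as the cost function; a simulator is not available here, so this example substitutes total wire length as a stand-in cost, which is an assumption. The cooling schedule, move function, and all names are likewise hypothetical.

```python
import math
import random


def anneal(initial, cost, neighbor, steps=2000, t_start=5.0, t_end=0.01, seed=0):
    """Simulated annealing; returns the best state seen and its cost."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = neighbor(current, rng)
        delta = cost(candidate) - current_cost
        # Always accept improvements; accept regressions with prob. exp(-delta/temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


EDGES = [("a", "b"), ("b", "c"), ("c", "d")]          # hypothetical dataflow edges
SLOTS = [(r, c) for r in range(2) for c in range(3)]  # hypothetical 2x3 grid


def wire_length(placement):
    return sum(abs(placement[u][0] - placement[v][0]) +
               abs(placement[u][1] - placement[v][1]) for u, v in EDGES)


def swap_or_move(placement, rng):
    new = dict(placement)
    instr = rng.choice(sorted(new))
    slot = rng.choice(SLOTS)
    for other, occupied in new.items():
        if occupied == slot:           # slot taken: swap the two instructions
            new[other] = new[instr]
            break
    new[instr] = slot
    return new


start = {"a": (0, 0), "b": (1, 2), "c": (0, 1), "d": (1, 0)}
best, best_cost = anneal(start, wire_length, swap_or_move)
print(wire_length(start), best_cost)
```

Because the loop tracks the best state seen, the result is never worse than the starting placement, which is what makes annealing usable as an approximate upper bound on scheduler quality.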


Speedup with Simulated Annealing

Geometric mean of speedup over GRST
  Basic SPS: 1.19
  Annealed: 1.40

[Figure: bar chart of speedup over GRST (y-axis, 0.8 to 2.2) for each hand-coded microbenchmark and the geometric mean, Basic SPS vs. Annealed]


Extending SPS

Contention
  Network link contention
  Local and global ALU contention

Global register prioritization

Path volume scheduling


ALU Contention

What if two instructions are ready to execute on the same ALU at the same time?

[Figure: the example dataflow graph placed on the topology (register file banks R1 and R2, data cache banks D0 and D2, ctrl), with more than one instruction mapped to the same ALU]


Local vs. Global ALU Contention

Local ALU contention
  Keep track of expected issue time
  Increase placement cost if conflict occurs

Global ALU contention
  Resource utilization in previous/next block
  Weighting function

Modify placement cost
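Local ALU contention tracking can be sketched as a per-ALU record of claimed issue cycles. This is a hypothetical illustration: the `IssueTracker` class is not from the talk, and it assumes each ALU issues at most one instruction per cycle.

```python
from collections import defaultdict


class IssueTracker:
    """Tracks expected issue cycles per ALU within one block."""

    def __init__(self):
        self._busy = defaultdict(set)  # ALU name -> cycles already claimed

    def penalty(self, alu, cycle):
        """Cycles an instruction ready at `cycle` would wait for a free issue slot."""
        wait = 0
        while cycle + wait in self._busy[alu]:
            wait += 1
        return wait

    def reserve(self, alu, cycle):
        """Claim the first free cycle at or after `cycle`; return the issue cycle."""
        issue = cycle + self.penalty(alu, cycle)
        self._busy[alu].add(issue)
        return issue


tracker = IssueTracker()
first = tracker.reserve("E1", 3)     # first instruction issues when ready
extra = tracker.penalty("E1", 3)     # a second one ready at cycle 3 must wait
second = tracker.reserve("E1", 3)
print(first, extra, second)
```

A scheduler could add `penalty(alu, expected_issue)` to the placement cost of a candidate location, which raises the cost of slots whose ALU is already expected to be busy.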


Speedup over GRST

Geometric mean of speedup over GRST
  Basic SPS: 1.19
  SPS extended: 1.31
  Annealed: 1.40

[Figure: bar chart of speedup over GRST (y-axis, 0.8 to 2.2) for each hand-coded microbenchmark and the geometric mean: Basic SPS, SPS extended, and Annealed]


Related Work

Scheduling for VLIW [Ellis, Fisher]

Scheduling for other partitioned architectures
  Partitioned VLIW [Gilbert, Kailas, Kessler, Özer, Qian, Zalamea]
  RAW [Lee]
  Wavescalar [Mercaldi]

ASIC and FPGA place and route [Paulin]
  Resource conflicts known statically
  Substrate may not be fixed
  Simulated annealing [Betz]


Conclusions and Future Work

Future work
  Register allocation
  Memory placement
  Reliability-aware scheduling

Conclusions
  General spatial instruction scheduling algorithm
  Reasons explicitly about anchor points
  Performance within 4% of annealed results


Questions?


Mapping Instructions to Physical Locations

Scheduler converts operand format to target format, and assigns IDs
ID assigned to each instruction indicates physical location
The microarchitecture can interpret this ID in many different ways
To schedule well, the scheduler must understand how the microarchitecture translates ID -> Physical location

TIL (operand format)      Scheduler ->      TASL (target format)
read t0, g1                                 R[1] read, G[1], N[5]
read t1, g2                                 R[2] read, N[2], N[6]
muli t2, t1, 4                              N[2] muli, N[34], N[1]
ld t3, 0(t2)                                N[34] ld, N[32]
ld t4, 4(t2)                                N[1] ld, N[32]
mul t5, t3, t4                              N[32] mul, N[5]
add t6, t5, t0                              N[5] add, W[1]
addi t7, t1, 8                              N[6] addi, N[0]
br t7                                       N[0] br
write g1, t6                                W[1] write, G[1]





TASL (target format): instruction and register IDs at each physical location

ctrl   R0,R4,… R28     R1,R5,… R29     R2,R6,… R30     R3,R7,… R31
D0     0,4,8,… 28      1,5,9,… 29      2,6,10,… 30     3,7,11,… 31
D1     32,36,… 60      33,37,… 61      34,38,… 62      35,39,… 63
D2     64,68,… 92      65,69,… 93      66,70,… 94      67,71,… 95
D3     96,100,… 124    97,101,… 125    98,102,… 126    99,103,… 127

The slide builds this grid up in steps: first IDs 0-3, 32-35, 64-67, and 96-99 with registers R0-R3, then IDs 4-7, 36-39, 68-71, and 100-103 with registers R4-R7, and finally all 128 instruction IDs and 32 register IDs.

[The TASL listing from the previous slide is repeated beside the grid on each build.]
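The ID grid on this slide implies a simple arithmetic mapping. The sketch below encodes that one interpretation (the slide notes that the microarchitecture can interpret IDs in many different ways); the function names are hypothetical, and the E tile numbering is assumed to run row by row.

```python
def instruction_location(inst_id: int) -> dict:
    """Map an instruction ID (0-127) to a grid position under this interpretation."""
    if not 0 <= inst_id < 128:
        raise ValueError("instruction ID must be in 0..127")
    row, offset = divmod(inst_id, 32)   # 32 IDs per grid row (D0..D3)
    col = inst_id % 4                   # consecutive IDs go to adjacent columns
    return {
        "row": row,
        "col": col,
        "slot": offset // 4,            # which of the 8 IDs held at that position
        "tile": f"E{row * 4 + col}",    # assumed row-by-row tile numbering
    }


def register_bank(reg_id: int) -> int:
    """Register IDs are interleaved across the four banks (R0, R4, ... in bank 0)."""
    if not 0 <= reg_id < 32:
        raise ValueError("register ID must be in 0..31")
    return reg_id % 4


for inst_id in (0, 1, 2, 5, 6, 32, 34):  # the N[...] IDs used in the TASL listing
    print(inst_id, instruction_location(inst_id))
```

Under this mapping the two loads N[34] and N[1] feed N[32], and N[34] sits two columns away from N[32] in the same row, which is the kind of distance the scheduler has to reason about when it assigns IDs.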


Simulated Annealing Over Time

[Figure: simulation cycles (y-axis, 60000 to 100000) against annealing iterations (x-axis, 1 to 1680) for four series: random accepted, random best, guided accepted, guided best]


Simulated Annealing

Cost function: Simulated cycles
Prune space further with critical path tool

[Figure: guided vs. unguided annealing for memset_hand. Simulation cycles (y-axis, 76000 to 83000) against annealing times (x-axis, 1 to 101) for two series: Random Move and Guided Move]


Contention

ALU contention
  Local (within a block) - Estimate temporal schedule
  Global (between blocks) - Probabilistic - use weighting function

Network link contention
  Precise measurements too inaccurate
  Estimate with threshold, weighting function

Weight network link and global ALU contention based on annealed results

weight = (1 - fullness) * (1 - criticality / concurrency)


Global Register Prioritization

Problem: Any register dependence may be important with speculative execution

Solution: Extend path lengths through registers

Register prioritization:
  1) Schedule smaller loops before larger loops
  2) Schedule loop-carried dependences first
  3) Extend placement cost through registers to previous/next block


Path Volume Scheduling

Problem: The basic SPS algorithm does not account for the number of instructions in the path

Solution: Perform a depth-first search with iterative deepening to find the shortest path that holds all instructions
