Compiling for EDGE Architectures: The TRIPS Prototype Compiler

ASPLOS XII, June 21, 2022
Compiling for EDGE Architectures: The TRIPS Prototype Compiler
Kathryn McKinley, Doug Burger, Steve Keckler, Jim Burrill (1), Xia Chen, Katie Coons, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Aaron Smith, Bill Yoder, et al.
The University of Texas at Austin; (1) University of Massachusetts, Amherst


Page 1: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

ASPLOS XII, April 22, 2023

Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Kathryn McKinley, Doug Burger, Steve Keckler, Jim Burrill (1), Xia Chen, Katie Coons, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Aaron Smith, Bill Yoder, et al.

The University of Texas at Austin
(1) University of Massachusetts, Amherst

Page 2: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Technology Scaling Hitting the Wall

[Figure: a 20 mm chip edge at the 130 nm, 100 nm, 70 nm, and 35 nm technology nodes]

Analytically … Qualitatively …

Either way … Partitioning for on-chip communication is key

Page 3: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

OO Superscalars Out of Steam

• Clock ride is over
  • Wire and pipeline limits
  • Quadratic out-of-order issue logic
  • Power, a first-order constraint
• Problems for any architectural solution
  • ILP (instruction-level parallelism)
  • Memory and on-chip latency
• Major vendors ending processor lines

Page 4: Compiling for EDGE Architectures: The TRIPS Prototype Compiler


What’s next?

Page 5: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Post-RISC Solutions

• CMP: an evolutionary path
  • Replicate what we already have 2 to N times on a chip
  • Coarse-grain parallelism
  • Exposes the resources to the programmer and compiler
• Explicit Data Graph Execution (EDGE)
  1. Program graph is broken into a sequence of blocks
     • Blocks commit atomically or not at all; a block never partially commits
  2. Dataflow within a block, ISA support for direct producer-consumer communication
     • No shared named registers (point-to-point dataflow edges only)
     • Memory is still a shared namespace
     • The block's dataflow graph (DFG) is explicit in the architecture

Page 6: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Outline

• TRIPS Execution Model & ISA
• TRIPS Architectural Constraints
• Compiler Structure
• Spatial Path Scheduling

Page 7: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Block Atomic Execution Model

• TRIPS block: single-entry constrained hyperblock
• Dataflow execution with target position encoding

[Figure: panels "Flow Graph", "TRIPS block", "Dataflow Graph", and "Execution Substrate" (Register File, Gtile, Data Caches); the example block uses read, addi, lw_f, mov, bro_t, and write instructions]

Page 8: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

TRIPS Block Constraints

• Fixed size: 128 instructions
  • Padded with no-ops if needed
• Load/store identifiers: 32 load or store queue identifiers
  • More than 32 static loads and stores is possible
• Registers: 32 reads and 32 writes, 8 to each of 4 banks (in addition to the 128 instructions)
• Constant output: all stores and writes execute, one branch
  • Simplifies hardware logic for detecting block completion
  • Every path of execution through a block must produce the same stores and register writes
• Simplifies the hardware, more work for the compiler

[Figure: a DFG of 1 to 128 instructions with its block interface: 32 reads and 32 writes to the register banks, 32 loads and 32 stores to memory, a PC read, and a terminating branch that produces the next PC]
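The size limits on this slide can be checked mechanically. The following is a minimal sketch, not the TRIPS compiler's code: the `Block` record, the `violations` helper, and the rule that a register's bank is its number modulo 4 are assumptions made for illustration. The constant-output constraint is not checked here because it needs path information.

```python
from collections import Counter
from dataclasses import dataclass, field

MAX_INSTRUCTIONS = 128   # fixed block size
MAX_LSIDS = 32           # load/store queue identifiers
NUM_BANKS = 4            # register banks
MAX_PER_BANK = 8         # reads (and, separately, writes) per bank


@dataclass
class Block:
    """Hypothetical summary of one candidate block."""
    num_instructions: int
    lsids: set = field(default_factory=set)         # distinct load/store IDs
    reg_reads: list = field(default_factory=list)   # register numbers read
    reg_writes: list = field(default_factory=list)  # register numbers written


def violations(block):
    """Return a list of violated constraints (empty if the block is legal)."""
    problems = []
    if block.num_instructions > MAX_INSTRUCTIONS:
        problems.append("more than 128 instructions")
    if len(block.lsids) > MAX_LSIDS:
        problems.append("more than 32 load/store IDs")
    for kind, regs in (("read", block.reg_reads), ("write", block.reg_writes)):
        # Assumption: bank = register number mod 4.
        per_bank = Counter(r % NUM_BANKS for r in regs)
        for bank, count in sorted(per_bank.items()):
            if count > MAX_PER_BANK:
                problems.append(f"bank {bank}: {count} register {kind}s (limit 8)")
    return problems


ok = Block(100, set(range(10)), [0, 1, 2, 3], [4])
bad = Block(130, set(range(33)), [0, 4, 8, 12, 16, 20, 24, 28, 32], [])
```

The per-bank limit of 8 implies the overall limit of 32 reads and 32 writes, so the sketch checks only the per-bank counts.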

Page 9: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Compiler Phases (Classic)

Scale Compiler (UTexas/UMass)

• Frontend: C, FORTRAN
• Inlining, Unrolling/Flattening, Scalar Optimizations
• Optimizations: PRE, Global Value Numbering, Scalar Replacement, Global Variable Replacement, SCC, Copy Propagation, Array Access Strength Reduction, LICM, Tree Height Reduction, Useless Copy Removal, Dead Variable Elimination
• Code Generation: Alpha, SPARC, PPC, TRIPS TIL

TIL: TRIPS Intermediate Language, a RISC-like three-address form

TASL: TRIPS Assembly Language, a dataflow target form with locations encoded

Page 10: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Backend Compiler Flow

TIL -> Hyperblock Formation -> Resource Allocation -> Scheduling -> TASL

• Hyperblock Formation
  • If-conversion
  • Loop peeling
  • While loop unrolling
  • Instruction merging
  • Predicate optimizations
• Resource Allocation
  • Register allocation
  • Reverse if-conversion & split
  • Load/store ID assignment
  • SSA for constant outputs
• Scheduling
  • Fanout insertion
  • Instruction placement
  • Target form generation

Page 11: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Correctness: Progressively Satisfy Constraints

TIL -> Hyperblock Formation -> Resource Allocation -> Scheduling -> TASL (same phases as the Backend Compiler Flow slide)

Constraints:
• 128 instructions
• 32 load/store IDs
• 32 register reads/writes (8 per bank, 4 banks)
• Constant output

Page 12: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Predication & Hyperblock Formation

• Predication
  • Converts control dependence to data dependence
  • Improves instruction fetch bandwidth
  • Eliminates branch mispredictions
  • Adds overhead
  • Any instruction can have a predicate, but...
  • Predicate head (low power) or bottom (speculative)
• Hyperblock
  • Scheduling region (set of basic blocks)
  • Single entry, multiple exit, predicated instructions
  • Expose parallelism without oversaturating resources
  • Must satisfy block constraints

[Figure: predicate (P) applied at the head versus at the bottom of a dependence chain]
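As a rough illustration of converting control dependence to data dependence, the sketch below compares a branching function with an if-converted version in which both arms sit in one block and a predicate selects which one takes effect. The function names and the tuple encoding of predicated instructions are hypothetical and are not TRIPS syntax.

```python
def branchy(a, b):
    # Control dependence: which assignment runs depends on the branch.
    if a < b:
        x = a + 1
    else:
        x = b * 2
    return x


def if_converted(a, b):
    # Data dependence: the test produces a predicate p, and both arms
    # become predicated instructions in a single block.
    p = a < b
    block = [
        (True, lambda: a + 1),    # takes effect when p is true
        (False, lambda: b * 2),   # takes effect when p is false
    ]
    x = None
    for wanted, op in block:
        if p == wanted:           # predicate match replaces the branch
            x = op()
    return x
```

Both versions compute the same value for every input; the second has no internal branch to mispredict, at the cost of fetching both arms.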

Page 13: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Accuracy?

[Figure: the same backend compiler flow and constraint list as on the previous slide]

Page 14: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Block Atomic Execution Model

• TRIPS block: single-entry constrained hyperblock
• Dataflow execution with target position encoding

[Figure: repeated from Page 7]

Page 15: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Scheduling Problem

Partitioned microarchitecture

[Figure: an example dataflow graph of ld, mul, add, and st instructions to be mapped onto the partitioned microarchitecture]

Page 16: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Scheduling Problem

Partitioned microarchitecture

Anchor points

[Figure build: the same dataflow graph, with the ld and st instructions tied to anchor points]

Page 17: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Scheduling Problem

Partitioned microarchitecture

Anchor points

Balance latency and concurrency

[Figure build: the same dataflow graph, with the mul and add instructions also placed]

Page 18: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Outline

• Background
• Spatial Path Scheduling
• Simulated Annealing
• Extending SPS
• Conclusions and Future Work

Page 19: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Dissecting the Problem

Scheduling can have two components:
• Placement: where an instruction executes
• Issue: when an instruction executes

                     Static issue       Dynamic issue
Static placement     VLIW (SPSI)        TRIPS (SPDI), EDGE
Dynamic placement    Bad idea (DPSI)    Superscalars (DPDI)

Page 20: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Explicit Data Graph Execution

• Block-atomic execution: instruction groups fetch, execute, and commit atomically
• Direct instruction communication: explicitly encode the dataflow graph by specifying targets

RISC (centralized register file):
  add r1, r4, r5
  add r2, r5, r6
  add r3, r1, r2

EDGE:
  i1: add i3
  i2: add i3
  i3: add i4

[Figure: the three adds communicating through a centralized register file (R4, R5, R6) in the RISC case and directly from producer to consumer in the EDGE case]
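The step from named registers to explicit targets can be sketched in a few lines. This is an illustrative sketch only: `to_target_form` and its tuple formats are hypothetical, and it ignores operand slots, predicates, and fanout limits.

```python
def to_target_form(risc):
    """Convert (opcode, dest, src1, src2) tuples into (name, opcode, targets).

    Each producer lists the instructions that consume its result, so no
    register name is needed for values that stay inside the block.
    """
    last_writer = {}                   # register -> index of its producer
    targets = [[] for _ in risc]
    for idx, (_, dest, *srcs) in enumerate(risc):
        for src in srcs:
            if src in last_writer:     # operand produced inside the block
                targets[last_writer[src]].append(f"i{idx + 1}")
        last_writer[dest] = idx
    return [(f"i{k + 1}", op, targets[k]) for k, (op, *_rest) in enumerate(risc)]


risc = [
    ("add", "r1", "r4", "r5"),
    ("add", "r2", "r5", "r6"),
    ("add", "r3", "r1", "r2"),
]
edge = to_target_form(risc)
# edge == [("i1", "add", ["i3"]), ("i2", "add", ["i3"]), ("i3", "add", [])]
```

In the slide, i3 targets i4, a consumer that lies outside the three-instruction snippet, which is why the sketch leaves the target list of i3 empty.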

Page 21: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Scheduling for TRIPS

• TRIPS ISA
  • Up to 128 instructions/block
  • Any instruction can be in any slot
• TRIPS microarchitecture
  • Up to 8 blocks in flight
  • 1 cycle latency between adjacent ALUs
• Known
  • Execution latencies
  • Lower bound for communication latency
• Unknown (estimated)
  • Memory access latencies
  • Resource conflicts

[Figure: tile topology with a control tile (Ctrl), register file banks R0 to R3, data cache banks D0 to D3, and a 4x4 grid of execution tiles E0 to E15]
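The "lower bound for communication latency" can be estimated from the topology alone. The sketch below assumes that E0 to E15 are numbered row-major on the 4x4 grid and that operands travel one hop per cycle along grid links, so the bound is the Manhattan distance; both the numbering and the helper names are assumptions for illustration.

```python
GRID_WIDTH = 4  # assumption: E0..E15 laid out row-major on a 4x4 grid


def tile_coords(tile):
    """Return (row, col) of execution tile E<tile>."""
    return divmod(tile, GRID_WIDTH)


def min_comm_latency(src_tile, dst_tile, cycles_per_hop=1):
    """Lower bound on operand latency: one cycle per hop, no contention."""
    (r1, c1), (r2, c2) = tile_coords(src_tile), tile_coords(dst_tile)
    return cycles_per_hop * (abs(r1 - r2) + abs(c1 - c2))
```

Under these assumptions, adjacent tiles such as E0 and E1 are 1 cycle apart and opposite corners E0 and E15 are 6 cycles apart. Contention can only add to this bound, which is why the slide lists resource conflicts as unknown.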

Page 22: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Scheduling for TRIPS

[Figure build: the same bullets and topology as the previous slide, with only execution tiles E2 and E4 shown]

Page 23: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Greedy Scheduling for TRIPS

• GRST [PACT '04]: based on VLIW list scheduling, augmented with five heuristics
  1. Prioritizes critical path (C)
  2. Reprioritizes after each placement (R)
  3. Accounts for data cache locality (L)
  4. Accounts for register output locality (O)
  5. Load balancing for local issue contention (B)
• Drawbacks
  • Unnecessary restrictions on scheduling order
  • Inelegant and overly specific
• Replace heuristics with an elegant approach designed for spatial scheduling

Page 24: Compiling for EDGE Architectures: The TRIPS Prototype Compiler


Page 25: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Outline

• Background
• Spatial Path Scheduling
• Simulated Annealing
• Extending SPS
• Conclusions and Future Work

Page 26: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Path Scheduling Overview

[Figure: the scheduler takes a dataflow graph (read, ld, mul, add, br, write) and a topology (R1, R2, D0, D1, ctrl) and produces a placement. Legend: register, data cache, execution, control]

Page 27: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Path Scheduling Overview

[Figure build: the same dataflow graph and topology, with part of the placement filled in]

Page 28: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Path Scheduling Overview

[Figure build: the same dataflow graph and topology, with the completed placement]

Page 29: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Path Scheduling Overview

Initialize all known anchor points

Until all instructions are scheduled:
1. Populate the open list
2. Find placement costs
3. Choose the minimum cost location
4. Schedule the instruction whose minimum placement cost is largest (choose the max of the mins)

[Figure: example dataflow graph with read R1, read R2, ld, ld, mul, mul, add, add, br, and write R1]

Page 30: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Path Scheduling Example

Initialize all known anchor points

[Figure: the example dataflow graph next to the topology (register file banks R1 and R2, data cache banks D0 and D1, ctrl). Legend: register, data cache, execution, control, unplaced]

Page 31: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Path Scheduling Example

Populate the open list (marked in yellow)

• Open list: instructions that are candidates for scheduling
• We include: instructions with no parents, or with at least one placed parent

[Figure: the example dataflow graph with the open-list instructions highlighted]
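The open-list rule is small enough to state as code. The sketch below borrows the TIL example from the "Mapping instructions to Physical Locations" backup slides as its dataflow graph; the instruction names and the choice to treat register reads and writes as already-placed anchors are assumptions for illustration.

```python
def open_list(parents, placed):
    """Unplaced instructions with no parents or at least one placed parent."""
    return [
        inst for inst, ps in parents.items()
        if inst not in placed and (not ps or any(p in placed for p in ps))
    ]


# instruction -> producers of its operands
parents = {
    "read_g1": [],
    "read_g2": [],
    "muli": ["read_g2"],
    "ld_a": ["muli"],
    "ld_b": ["muli"],
    "mul": ["ld_a", "ld_b"],
    "add": ["mul", "read_g1"],
    "addi": ["read_g2"],
    "br": ["addi"],
    "write_g1": ["add"],
}

# Assumption: register reads and writes are anchors placed up front.
placed = {"read_g1", "read_g2", "write_g1"}
candidates = open_list(parents, placed)
# candidates == ["muli", "add", "addi"]
```

Note that add is a candidate even though its mul parent is unplaced, because one placed parent is enough under the rule on this slide.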

Page 32: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Path Scheduling Example

Calculate placement cost for each instruction in the open list at each slot

• Placement cost(i, slot): longest path length through i if placed at slot
• cost = inputCost + execCost + outputCost (includes communication and execution latencies)

[Figure: the example dataflow graph annotated with instruction latencies of 1 and 3 cycles]

Page 33: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Path Scheduling Example

Calculate placement cost for each instruction in the open list at each slot

Total placement cost = 16 + 3 + 3 = 22

[Figure: a mul placed on an execution tile (E1) of the topology, with path segments of 5 cycles, 3 cycles, and 1 cycle marked on the longest path through it]

Page 34: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Path Scheduling Example

Calculate placement cost for each instruction in the open list at each slot

Placement cost at each slot of the 4x4 ALU grid:

add:   10   8   8  10
       10  10  10  12
       12  12  12  14
       14  14  14  16

mul:   24  24  22  24
       22  22  22  24
       24  24  24  28
       26  26  26  28

mul:   22  22  24  26
       22  22  24  26
       24  24  26  28
       26  26  28  30

add:   22  22  24  26
       22  22  24  26
       24  24  26  28
       26  26  28  30

[Figure: the example dataflow graph next to the topology (R1, R2, D0, D1, ctrl), with one cost grid overlaid on the ALU array]

Page 35: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Path Scheduling Example

Choose the minimum cost location for each instruction

[Figure build: the same four cost grids as the previous slide (add, mul, mul, add), with one ld already placed next to D1]

Page 36: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Path Scheduling Example

Break ties. Example heuristics:
• Links consumed
• ALU utilization

[Figure build: the same four cost grids (add, mul, mul, add)]

Page 37: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Path Scheduling Example

Place the instruction with the highest minimum cost (choose the max of the mins)

[Figure build: the same four cost grids, with a mul now placed on the ALU array]

Page 38: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Spatial Path Scheduling Algorithm

Schedule (block, topology)
  initialize known anchor points
  while (not all instructions scheduled)
    for each instruction in open list, i
      for each available location, n
        calculate placement cost for (i, n)
        keep track of n with min placement cost
      keep track of i with highest min placement cost
    schedule i with highest min placement cost

Per-block complexity (i = # of instructions, n = # of ALUs):
• SPS: O(i^2 * n)
• GRST: O(i^2 + i * n)
• Exhaustive search: i!
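The pseudocode can be turned into a small runnable sketch. Everything specific here is an assumption for illustration: one instruction per grid slot, Manhattan distance as the communication latency, anchors given as fixed coordinates just outside the grid, and a simplified cost (arrival time from placed parents, plus execution latency, plus the remaining path below the instruction). It is not the TRIPS scheduler's cost model and has no tie-breaking heuristics.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def sps_schedule(parents, latency, slots, anchors, dist=manhattan):
    """Greedy max-of-mins placement; returns instruction -> location."""
    children = {i: [] for i in parents}
    for i, ps in parents.items():
        for p in ps:
            children[p].append(i)

    def height(i):
        # Longest execution-latency path from i down to a leaf, including i.
        return latency[i] + max((height(c) for c in children[i]), default=0)

    placed = dict(anchors)                     # instruction -> location
    ready = {i: latency[i] for i in anchors}   # cycle a result is available

    def input_cost(i, loc):
        return max((ready[p] + dist(placed[p], loc)
                    for p in parents[i] if p in placed), default=0)

    def output_cost(i, loc):
        return max((height(c) + (dist(loc, placed[c]) if c in placed else 0)
                    for c in children[i]), default=0)

    def cost(i, loc):
        return input_cost(i, loc) + latency[i] + output_cost(i, loc)

    while len(placed) < len(parents):
        free = [s for s in slots if s not in placed.values()]
        candidates = [i for i, ps in parents.items() if i not in placed
                      and (not ps or any(p in placed for p in ps))]
        choices = []
        for i in candidates:
            best_cost, best_loc = min((cost(i, s), s) for s in free)
            choices.append((best_cost, i, best_loc))
        # Schedule the candidate whose best location is most expensive.
        _, i, loc = max(choices, key=lambda t: t[0])
        placed[i] = loc
        ready[i] = input_cost(i, loc) + latency[i]
    return placed


parents = {
    "read_g1": [], "read_g2": [], "muli": ["read_g2"],
    "ld_a": ["muli"], "ld_b": ["muli"], "mul": ["ld_a", "ld_b"],
    "add": ["mul", "read_g1"], "addi": ["read_g2"],
    "br": ["addi"], "write_g1": ["add"],
}
latency = {name: 1 for name in parents}
latency.update({"ld_a": 3, "ld_b": 3, "mul": 3})   # hypothetical latencies
slots = [(r, c) for r in range(4) for c in range(4)]
anchors = {"read_g1": (-1, 1), "read_g2": (-1, 2), "write_g1": (-1, 1)}
placement = sps_schedule(parents, latency, slots, anchors)
```

Placing the instruction with the largest minimum cost first gives the most constrained (most critical) instruction first pick of the good locations, which is the intuition behind "max of the mins".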

Page 39: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

SPS Benefits and Limitations

• Benefits
  • Automatically exploits known communication latencies
  • Designed for spatial scheduling
  • Minimizes critical path length at each step
  • Naturally encompasses four of five GRST heuristics
• Limitations of basic algorithm
  • Does not account for resource contention
  • Uses no global information
  • Minimum communication latencies may be optimistic

Page 40: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Experimental Methodology

• 26 hand-optimized microbenchmarks
  • Extracted from SPEC2000, EEMBC, Livermore Loops, MediaBench, and C libraries
  • Average dynamic instructions fetched/block: 67.3 (ranges from 14.5 to 117.5)
• Cycle-accurate simulator
  • Within 4% of RTL on average
  • Models communication and contention delays
• Comparison points
  • Greedy Scheduling for TRIPS (GRST)
  • Simulated annealing

Page 41: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

SPS Performance

Geometric mean of speedup over GRST: 1.19

[Figure: bar chart of Basic SPS speedup over GRST (y-axis: Speedup, 0.8 to 2; x-axis: hand-coded microbenchmark) for a2time01, ammp_2, art_1, art_2, art_3, bzip2_1, cfar, conv, ct, equake_1, genalg, gzip_1, gzip_2, matrix_1, memchr, memcpy, memset, parser_1, pm, qr, rbtree, sha, strcmp, svd, transpose_GMTI, vadd, and the geometric mean]

Page 42: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

SPS Performance

Geometric mean of speedup over GRST: 1.19

[Figure build: same Basic SPS bar chart as Page 41]

Page 43: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

SPS Performance

Geometric mean of speedup over GRST: 1.19

[Figure build: same Basic SPS bar chart as Page 41]

Page 44: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Outline

• Background
• Spatial Path Scheduling
• Simulated Annealing
• Extending SPS
• Conclusions and Future Work

Page 45: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

How well can we do?

• Simulated annealing
  • Artificial intelligence search technique
  • Uses random perturbations to avoid local optima
  • Approximates a global optimum
• Cost function: simulated cycles
  • Uncertainty makes static cost functions insufficient
  • Best cost function
• Purpose
  • Optimization
  • Discover performance upper bound
  • Tool to improve scheduler
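For readers unfamiliar with the technique, here is a generic simulated annealing loop applied to a toy placement problem. The talk uses simulated cycles as the cost function; this sketch substitutes a static stand-in (total Manhattan distance over dataflow edges) purely so that it can run on its own, and the cooling schedule, `EDGES`, `SLOTS`, and the move function are all hypothetical.

```python
import math
import random


def anneal(initial, cost, neighbor, steps=2000, t0=5.0, alpha=0.995, seed=0):
    """Return (best_state, best_cost) found by simulated annealing."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        delta = cost(cand) - current_cost
        # Always accept improvements; accept regressions with
        # probability exp(-delta / t) to escape local optima.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = cand, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= alpha  # geometric cooling
    return best, best_cost


EDGES = [("muli", "ld_a"), ("muli", "ld_b"), ("ld_a", "mul"), ("ld_b", "mul")]
SLOTS = [(r, c) for r in range(4) for c in range(4)]


def wire_length(placement):
    # Stand-in cost; the talk measures simulated cycles instead.
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in EDGES)


def move_one(placement, rng):
    # Random perturbation: move one instruction to a free slot.
    free = [s for s in SLOTS if s not in placement.values()]
    new = dict(placement)
    new[rng.choice(sorted(new))] = rng.choice(free)
    return new


initial = {"muli": (0, 0), "ld_a": (0, 3), "ld_b": (3, 0), "mul": (3, 3)}
best, best_cost = anneal(initial, wire_length, move_one)
```

Because each evaluation in the talk's setup is a full simulation, annealing is far too slow for a production compiler; the slide positions it as a way to find an upper bound and to guide scheduler design.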

Page 46: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Speedup with Simulated Annealing

Geometric mean of speedup over GRST:
• Basic SPS: 1.19
• Annealed: 1.40

[Figure: bar chart of speedup over GRST (y-axis: Speedup, 0.8 to 2.2; x-axis: hand-coded microbenchmark) for the same 26 microbenchmarks and the geometric mean; series: Basic SPS, Annealed]

Page 47: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Speedup with Simulated Annealing

Geometric mean of speedup over GRST: Basic SPS 1.19, Annealed 1.40

[Figure build: same chart as Page 46]

Page 48: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Speedup with Simulated Annealing

Geometric mean of speedup over GRST: Basic SPS 1.19, Annealed 1.40

[Figure build: same chart as Page 46]

Page 49: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Outline

• Background
• Spatial Path Scheduling
• Simulated Annealing
• Extending SPS
• Conclusions and Future Work

Page 50: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Extending SPS

• Contention
  • Network link contention
  • Local and global ALU contention
• Global register prioritization
• Path volume scheduling

Page 51: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

ALU Contention

What if two instructions are ready to execute on the same ALU at the same time?

[Figure: the example dataflow graph placed on the topology (R1, R2, D0, D2, ctrl), with two instructions mapped to the same ALU]

Page 52: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Local vs. Global ALU Contention

• Local ALU contention
  • Keep track of expected issue time
  • Increase placement cost if conflict occurs
• Global ALU contention
  • Resource utilization in previous/next block
  • Weighting function
• Modify placement cost
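One way to picture the local case is a table of expected issue cycles per ALU. The sketch below is a guess at the bookkeeping, not the algorithm from the talk: the `IssueTracker` class, the one-instruction-per-cycle issue model, and the linear penalty are all assumptions for illustration.

```python
from collections import defaultdict


class IssueTracker:
    """Tracks which cycles each ALU is expected to be busy issuing."""

    def __init__(self, penalty=1):
        self.busy = defaultdict(set)   # ALU -> cycles already claimed
        self.penalty = penalty         # hypothetical cost per stalled cycle

    def delay(self, alu, cycle):
        """Cycles an instruction ready at `cycle` would wait on `alu`."""
        wait = 0
        while cycle + wait in self.busy[alu]:
            wait += 1
        return wait

    def adjusted_cost(self, base_cost, alu, cycle):
        """Placement cost increased by the expected issue conflict."""
        return base_cost + self.penalty * self.delay(alu, cycle)

    def reserve(self, alu, cycle):
        """Claim the first free issue cycle at or after `cycle`."""
        issue = cycle + self.delay(alu, cycle)
        self.busy[alu].add(issue)
        return issue


tracker = IssueTracker()
first = tracker.reserve("E5", 3)                     # issues at cycle 3
cost_same_alu = tracker.adjusted_cost(22, "E5", 3)   # conflict: 22 + 1
cost_other_alu = tracker.adjusted_cost(22, "E6", 3)  # no conflict: 22
```

With this adjustment, two candidate slots that tie on path length are no longer equal if one of them already has an instruction expected to issue in the same cycle.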

Page 53: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Speedup over GRST

Geometric mean of speedup over GRST:
• Basic SPS: 1.19
• SPS extended: 1.31
• Annealed: 1.40

[Figure: bar chart of speedup over GRST (y-axis: Speedup, 0.8 to 2.2; x-axis: hand-coded microbenchmark) for the same 26 microbenchmarks and the geometric mean; series: Basic SPS, SPS extended, Annealed]

Page 54: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Speedup over GRST

Geometric mean of speedup over GRST: Basic SPS 1.19, SPS extended 1.31, Annealed 1.40

[Figure build: same chart as Page 53]

Page 55: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Speedup over GRST

Geometric mean of speedup over GRST: Basic SPS 1.19, SPS extended 1.31, Annealed 1.40

[Figure build: same chart as Page 53]

Page 56: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Related Work

• Scheduling for VLIW [Ellis, Fisher]
• Scheduling for other partitioned architectures
  • Partitioned VLIW [Gilbert, Kailas, Kessler, Özer, Qian, Zalamea]
  • RAW [Lee]
  • Wavescalar [Mercaldi]
• ASIC and FPGA place and route [Paulin]
  • Resource conflicts known statically
  • Substrate may not be fixed
  • Simulated annealing [Betz]

Page 57: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Conclusions and Future Work

• Conclusions
  • General spatial instruction scheduling algorithm
  • Reasons explicitly about anchor points
  • Performance within 4% of annealed results
• Future work
  • Register allocation
  • Memory placement
  • Reliability-aware scheduling

Page 58: Compiling for EDGE Architectures: The TRIPS Prototype Compiler


Questions?

Page 59: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Mapping Instructions to Physical Locations

• Scheduler converts operand format to target format, and assigns IDs
• ID assigned to each instruction indicates physical location
• The microarchitecture can interpret this ID in many different ways
• To schedule well, the scheduler must understand how the microarchitecture translates ID -> physical location

TIL (operand format)      TASL (target format)
read t0, g1               R[1] read, G[1], N[5]
read t1, g2               R[2] read, N[2], N[6]
muli t2, t1, 4            N[2] muli, N[34], N[1]
ld t3, 0(t2)              N[34] ld, N[32]
ld t4, 4(t2)              N[1] ld, N[32]
mul t5, t3, t4            N[32] mul, N[5]
add t6, t5, t0            N[5] add, W[1]
addi t7, t1, 8            N[6] addi, N[0]
br t7                     N[0] br
write g1, t6              W[1] write, G[1]

Page 60: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Mapping Instructions to Physical Locations

[Figure build: same bullets and TASL listing as the previous slide, with the first instruction ID of each execution tile shown on the topology (ctrl at the corner, register banks R0 to R3 across the top, data cache banks D0 to D3 down the side)]

        R0    R1    R2    R3
D0       0     1     2     3
D1      32    33    34    35
D2      64    65    66    67
D3      96    97    98    99

Page 61: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Mapping Instructions to Physical Locations

[Figure build: same bullets and TASL listing, with the second instruction ID of each execution tile shown; the register banks now hold R4 to R7]

        R4    R5    R6    R7
D0       4     5     6     7
D1      36    37    38    39
D2      68    69    70    71
D3     100   101   102   103

Page 62: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Mapping Instructions to Physical Locations

[Figure build: same bullets and TASL listing, with the full set of instruction IDs held by each execution tile and the registers held by each bank]

        R0,R4,… R28     R1,R5,… R29     R2,R6,… R30     R3,R7,… R31
D0      0,4,8,… 28      1,5,9,… 29      2,6,10,… 30     3,7,11,… 31
D1      32,36,… 60      33,37,… 61      34,38,… 62      35,39,… 63
D2      64,68,… 92      65,69,… 93      66,70,… 94      67,71,… 95
D3      96,100,… 124    97,101,… 125    98,102,… 126    99,103,… 127
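The ID grid on this slide follows a simple arithmetic pattern, which the sketch below spells out. The orientation (rows aligned with D0 to D3, columns with the register banks) and the function names are inferred from the slide rather than taken from an ISA manual, so treat them as assumptions.

```python
def instruction_location(inst_id):
    """Map instruction ID N[inst_id] to (row, col, slot) on the 4x4 grid."""
    if not 0 <= inst_id < 128:
        raise ValueError("instruction IDs range from 0 to 127")
    row = inst_id // 32          # IDs 0-31 sit in the D0 row, 32-63 in D1, ...
    col = inst_id % 4            # consecutive IDs step across the columns
    slot = (inst_id % 32) // 4   # which of the tile's 8 instruction slots
    return row, col, slot


def register_bank(reg):
    """Bank holding register `reg`: R0, R4, ... share bank 0, and so on."""
    return reg % 4
```

Applied to the TASL listing, N[2] (muli) and N[34] (the first ld) land in the same column one row apart, and N[1] and N[5] share a tile in different slots, which shows how the scheduler expresses placement purely through its choice of IDs.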

Page 63: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Simulated Annealing Over Time

[Figure: line chart of simulation cycles (y-axis, 60000 to 100000) against annealing iterations (x-axis, 1 to 1680); series: random accepted, random best, guided accepted, guided best]

Page 64: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Simulated Annealing

• Cost function: simulated cycles
• Prune space further with critical path tool

[Figure: "Guided vs. unguided annealing for memset_hand"; simulation cycles (y-axis, 76000 to 83000) against annealing times (x-axis, 1 to 101); series: Random Move, Guided Move]

Page 65: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Contention

• ALU contention
  • Local (within a block): estimate temporal schedule
  • Global (between blocks): probabilistic, use weighting function
• Network link contention
  • Precise measurements too inaccurate
  • Estimate with threshold, weighting function
• Weight network link and global ALU contention based on annealed results

weight = (1 - fullness) * (1 - criticality / concurrency)

Page 66: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Global Register Prioritization

• Problem: any register dependence may be important with speculative execution
• Solution: extend path lengths through registers
• Register prioritization:
  1) Schedule smaller loops before larger loops
  2) Schedule loop-carried dependences first
  3) Extend placement cost through registers to previous/next block

Page 67: Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Path Volume Scheduling

• Problem: the basic SPS algorithm does not account for the number of instructions in the path
• Solution: perform a depth-first search with iterative deepening to find the shortest path that holds all instructions