Hybrid Optimization/Heuristic Instruction Scheduling for ...

33
Hybrid Optimization/Heuristic Instruction Scheduling for Programmable Accelerator Codesign Tony Nowatzki Newsha Ardalani Karthikeyan Sankaralingam Jian Weng PACT 2018, Nov. 3rd

Transcript of Hybrid Optimization/Heuristic Instruction Scheduling for ...

Page 1: Hybrid Optimization/Heuristic Instruction Scheduling for ...

Hybrid Optimization/Heuristic Instruction Scheduling for

Programmable Accelerator Codesign

Tony Nowatzki∗ Newsha Ardalani†

Karthikeyan Sankaralingam‡ Jian Weng∗

∗ † ‡

PACT 2018, Nov. 3rd

Page 2: Hybrid Optimization/Heuristic Instruction Scheduling for ...

2

> CPU

Google TPU

ReconfigurableArchitectures

CoarseGrain

Page 3: Hybrid Optimization/Heuristic Instruction Scheduling for ...

3

ReconfigurableArchitectures

CoarseGrain PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

Systolic Array

What if I don’t exactly want to do dense matrix multiplies

with large matrices?

Google TPU

Systolic CGRA(dedicated PE CGRA)

Problem: How can we map computation to this effectively?

Circuit Switched Network

Processing Element (ALU etc.)

Perfectly-pipelined execution (1 computation per cycle)

Examples: Charm, Camel, Stream-Dataflow, FPCA,

DySER (almost), LSSD, Morphosys, etc …

Page 4: Hybrid Optimization/Heuristic Instruction Scheduling for ...

4

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

+ + + +

+ +

+

>

+

A[8] B[8] c[4]

?

o[1]

× × × × × × × × --

Mapping TimingRouting

Computation Graph Hardware Graph

Page 5: Hybrid Optimization/Heuristic Instruction Scheduling for ...

5

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

+ + + +

+ +

+

>

+

A[8] B[8] c[4]

?

o[1]

× × × × × × × × --

Mapping TimingRouting

Computation Graph Hardware Graph

Page 6: Hybrid Optimization/Heuristic Instruction Scheduling for ...

6

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

+ + + +

+ +

+

>

+

A[8] B[8] c[4]

?

o[1]

× × × × × × × × --

Mapping TimingRouting

Computation Graph Hardware Graph

>

?

Page 7: Hybrid Optimization/Heuristic Instruction Scheduling for ...

7

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

+ + + +

+ +

+

>

+

A[8] B[8] c[4]

?

o[1]

× × × × × × × × --

Mapping TimingRouting

Computation Graph Hardware Graph

?

>

>

Page 8: Hybrid Optimization/Heuristic Instruction Scheduling for ...

8

Mapping TimingRoutingJoint Optimization

(days for solution)

Incremental Optimization

Mapping TimingRouting

Mapping TimingRouting

Our Approach:1. Phase Overlapping2. Hybrid Scheduling

(high area increase or integer factors performance loss)

Joint Heuristic

TimingRouting

(full throughput & seconds to minutes & low area overhead)

Mapping Routing

Technique Overview

Page 9: Hybrid Optimization/Heuristic Instruction Scheduling for ...

9

Outline

• “Systolic” CGRAs• Throughput sensitivity and fractional initiation intervals.

• Techniques for delay matching

• Scheduling Approach 1: Phasing-Overlapping• Tractable Optimization through Overlapping

• Scheduling Approach 2: Hybrid Scheduling• Heuristic to reduce search space

• Optimization to deal with tricky cases

• Evaluation

Page 10: Hybrid Optimization/Heuristic Instruction Scheduling for ...

10

Requirements for Full-Pipelining

1. Sufficient Data Bandwidth: Data has to arrive at a bandwidth of sufficient for one set of inputs / cycle.

2. No Compute Contention: Each instruction gets a dedicated compute unit.

3. No Routing Contention: Each dependence gets a dedicated routing path.

4. No Storage Contention: Each operand is guaranteed a free storage location at each step.

Page 11: Hybrid Optimization/Heuristic Instruction Scheduling for ...

11

Fully-pipelined Systolic CGRA

PE

PE

PE

PE

PE

PE

×

/

+

Must move forward! Can’t be consumed!

Delay FIFO5

2

Mismatch=3

3

1

Mismatch=0

1 2 3 4 5 6Cycle:

Page 12: Hybrid Optimization/Heuristic Instruction Scheduling for ...

12

Fully-pipelined Systolic CGRA

PE

PE

PE

PE

PE

PE

×

/

+

Delay FIFO5

3

1

(2+2)

4

Mismatch=1

1 2 3 4 5 6Cycle:

Page 13: Hybrid Optimization/Heuristic Instruction Scheduling for ...

13

Throughput Formula

• Inverse of initiation interval in compiler terminology.

= FIFO Length

Max. Mismatch + FIFO Length

Delay-FIFOs Help Reduce Latency Mismatch

Provide buffer space which increases throughput

But at the cost of area…

Throughput

Page 14: Hybrid Optimization/Heuristic Instruction Scheduling for ...

14

Alternatives to Delay-FIFOs (1)

PE

PE

PE

PE

PE

PE

×

/

+5

5

Use a longer route: Costs Routing Resources

Page 15: Hybrid Optimization/Heuristic Instruction Scheduling for ...

15

Alternatives to Delay-FIFOs (2)

PE

PE

PE

PE

PE

PE

×

/

+5

5

Pass through a PE: Costs Mapping Resources

Do a NOP

This is also just good for improving routing bandwidth …

Page 16: Hybrid Optimization/Heuristic Instruction Scheduling for ...

16

“Systolic” CGRA Challenge Summary

• Systolic CGRA throughput is very sensitive to timing constraints

• Several ways to balance delays:• Pass-through (affects mapping)

• Long route (affects routing)

• Delay FIFOs (affects timing)

• Mapping / Routing / Timing responsibilities are highly interdependent

• Codesign: Either more delay FIFOs or more advanced scheduler

Page 17: Hybrid Optimization/Heuristic Instruction Scheduling for ...

17

Outline

• Challenges for pipelining “Systolic” CGRAs• Throughput sensitivity and fractional initiation intervals.

• Techniques for delay matching

• Scheduling Approach 1: Phasing-Overlapping• Tractable Optimization through Overlapping

• Scheduling Approach 2: Hybrid Scheduling• Heuristic to reduce search space

• Optimization to deal with tricky cases

• Evaluation

Page 18: Hybrid Optimization/Heuristic Instruction Scheduling for ...

18

Optimization Background• Prior work demonstrates optimization-based spatial

scheduling: Integer Linear Programming [PLDI 2013], SMT [TOPLAS 2014]

• Basic Integer Linear Programming Approach:• Decision Variables:

• Assigning instructions to PEs (binary)

• Assigning dependences to a set of Routes (binary)

• Assigning delays to each PE (integer)

• Linear Constraints: • Limit schedule to be legal

• We added three forms of delay-matching.

Page 19: Hybrid Optimization/Heuristic Instruction Scheduling for ...

19

Optimization Phases

• Joint Optimization: (50% Failure)• Timeout -- Very restricted hardware of systolic CGRA

• Separate Phases: (85% Failure)• Fast, but doesn’t capture inter-dependence!

• In-between Joint: (75% Failure, poor throughput)

Mapping TimingRouting

Mapping TimingRouting

Mapping TimingRouting

Page 20: Hybrid Optimization/Heuristic Instruction Scheduling for ...

20

Our Approach: Phase Overlapping• Insight: Phase Interdependence Matters

• But relationship between adjacent phases is more relevant…

• Phase Overlapping:• Perform Mapping and Routing together• Discard Routing (keep Mapping)• Perform Routing and Timing together

• Good: Only 15% fail, mostly fully-pipelined

• Remaining Problems:• Mapping/routing phase takes time (90% or more)• Sometimes, we get an unlucky mapping/routing (looks like

high potential, but cannot be fixed for timing)

TimingRouting

Mapping Routing

Page 21: Hybrid Optimization/Heuristic Instruction Scheduling for ...

21

Overview

• “Systolic” CGRAs• Throughput sensitivity and fractional initiation intervals.

• Techniques for delay matching

• Scheduling Approach 1: Phasing-Overlapping• Tractable Optimization through Overlapping

• Scheduling Approach 2: Hybrid Scheduling• Heuristic to reduce search space

• Optimization to deal with tricky cases

• Evaluation

Page 22: Hybrid Optimization/Heuristic Instruction Scheduling for ...

22

Hybrid Scheduling: Integrate Heuristic

• Insight: Mitigate the problems overlapped scheduling by trying things repeatedly if we got an unlucky mapping.

• But … Optimization won’t work for mapping/routing• Still too slow (and do we really need it?)

• ILP solver’s don’t tend to generate unique solutions

• Hybrid Approach: Use a heuristic for the mapping/routing phase, to quickly get a plausible mapping, then optimize the routing and timing together.

TimingRouting

Mapping RoutingHeuristic:

Optimization:

Page 23: Hybrid Optimization/Heuristic Instruction Scheduling for ...

23

A Heuristic for Hybrid Scheduling

• Need a heuristic that quickly generates uniqueschedules!

• Basic Approach: Stochastic Scheduling• Iteratively build schedule in topological order (fast)

• Normally be rational: Choose a location for each instruction which minimizing routing resources and timing mismatch.

• Occasionally be irrational: Choose a purposely bad mapping, hoping that it might help later on (unique)

• Repeat a few times to find a good schedule (good)

• Objective function: Minimize total mismatch (to make routing/timing scheduling easier)

Page 24: Hybrid Optimization/Heuristic Instruction Scheduling for ...

24

Intuition for Solution Quality

0

2

4

6

8

10

12

14

16

0 200 400 600 800

Ob

ject

ive:

Mis

+ L

at/5

0

Seconds into scheduling (32-1 reduction in CNN)

Heuristic

Overlapped

Hybrid

Page 25: Hybrid Optimization/Heuristic Instruction Scheduling for ...

25

Outline

• “Systolic” CGRAs• Throughput sensitivity and fractional initiation intervals.

• Techniques for delay matching

• Scheduling Approach 1: Phasing-Overlapping• Tractable Optimization through Overlapping

• Scheduling Approach 2: Hybrid Scheduling• Heuristic to reduce search space

• Optimization to deal with tricky cases

• Evaluation

Page 26: Hybrid Optimization/Heuristic Instruction Scheduling for ...

26

Methodology – Accelerator Modeling

• Softbrain Accelerator• 5x5 Processing Elem (PE)• 64-bit datapath (subword)• Heterogeneous Grid• All FUs are fully pipelined

• Simulation: All blocks simulated at cycle-level in gem5, single “core”

• Area Analysis• Synthesize Chisel-based design in

55nm tech library.• Assumes Control core (single-issue

inorder 16KB I$/D$) + 4KB SPAD

• Workloads (accelerator centric)• MachSuite, CNN/DNN, Dense Linear

Algebra QR/Cholesky/FFT)

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

PE

IO In

terf

ace

Softbrain [ISCA 2017]

Stream SPAD

StreamingCache

CtrlCore

Page 27: Hybrid Optimization/Heuristic Instruction Scheduling for ...

27

Does delay FIFO area matter?

FIFOLength

Scheduling Difficulty

Area (mm2) Overhead

2 Very Hard 0.528 9.67%

3 Hard 0.543 12.90%

7 Medium 0.606 25.85%

15 Easy 0.736 52.86%

Page 28: Hybrid Optimization/Heuristic Instruction Scheduling for ...

28

Methodology – Experimental Setup

• Scheduler Designs:• Joint Optimization

• Overlapped Optimization

• Heuristic Only

• Hybrid (overlap/part heuristic)

• Scheduling Timeout: 20 Minutes

• Problem: How to compare schedulers that across a set of workloads that may fail?• We don’t! Instead, we seed all schedulers with an initial

solution from the heuristic.

• Joint and Overlapped are also hybrid schedulers in the evaluation, just using a more traditional approach.

Page 29: Hybrid Optimization/Heuristic Instruction Scheduling for ...

Performance vs Time vs Complexity

0.7

0.75

0.8

0.85

0.9

0.95

1

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

Joint Overlapped Heuristic Hybrid

Firi

ng

Rat

e (I

tera

tio

ns/

Cyc

le)

Throughput

0

100

200

300

400

500

600

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

Joint Overlapped Heuristic Hybrid

Ave

rage

Sch

ed

ulin

g Ti

me

(Sec

.)

Average Scheduling Time

Page 30: Hybrid Optimization/Heuristic Instruction Scheduling for ...

30

Conclusions

• Systolic CGRAs gain efficiency by simplifying the execution model.

• Consequence is tradeoff between easy scheduling low hardware overhead, and high performance:• Minimizing delay-FIFO size is key factor in reducing area

and increasing scheduler difficulty in finding minimum II.

• Two novel techniques:• Phase Overlapping (capture the phase interdependence)

• Hybrid Scheduling (using heuristic to generate solutions)

• Broadly, codesign is the key to success of future reconfigurable architectures.

Page 31: Hybrid Optimization/Heuristic Instruction Scheduling for ...

31

Thank you

Page 32: Hybrid Optimization/Heuristic Instruction Scheduling for ...

Scheduling-time Sensitivity

0.7

0.75

0.8

0.85

0.9

0.95

1

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

Joint Overlapped Heuristic Hybrid

Firi

ng

Rat

e

(Ite

rati

on

s/C

ycle

)

Throughput 20 min 40 min 1 hr

1

10

100

1000

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

FIFO

: 15

FIFO

: 7

FIFO

: 3

FIFO

: 2

Joint Overlapped Heuristic Hybrid

Avg

. Sch

ed

ulin

g Ti

me

(Sec

.)

Average Scheduling Time 20 min 40 min 1 hr

Page 33: Hybrid Optimization/Heuristic Instruction Scheduling for ...

33

Modeled vs MeasuredThroughput

Heuristic Hybrid Overlapped Joint