Forrest Brewer [email protected]

UCSB CAD and Test Group, ECE/UCSB, Santa Barbara CA 93106

NDFA Based Scheduling

Forrest Brewer, Steve Haynal

University of California

Santa Barbara

Scheduling is Behavioral Synthesis

• Exploits fundamental freedom -- ordering and binding of operations, operands

– Subdivided into DFG transformation, resource allocation, time-scheduling, operation binding, memory binding, communication binding, resource modeling, reallocation...

– Complexity of tasks requires top-down flow -- yet evaluations/constraints are bottom-up

• Behavioral Synthesis difficult to use!

– Seemingly trivial changes cause vast output changes

– Design tradeoffs tied to a particular point language (~VHDL, ~Verilog, Silage, Esterel...)

– No direct control of implementation

– No direct control of binding, mapping

– No distinction between problem statement and constraints

– No canonical representation of design space

• Fundamental problem covers enormous scope

– Universality issues in specification

– How to capture design mapping knowledge?

– How to create verifiable design representation without canonical model?

• Our viewpoint -- wrong problem

Simpler Problem

• Assume Designer creates the design

– Support incremental refinement of design at all levels of representation

– Support incremental design synthesis when possible

– Provide well defined hierarchy on which to place constraints, trial implementations ...

– Provide mechanism for subsystem abstraction, modeling and evaluation at each level

• How to do this?

– Drop representation distinction between logic, module, and sub-system levels

– Drop potential for universality in internal representations

– Create mechanism for automatic design abstraction within designer's design decomposition

– Use efficient representation of fundamental model

– Provide feedback to designer for evaluating both the design itself and the representation

• Where do we start?

– Interface Protocols are key complexity growth problem

– Designer constructs system model with abstract protocols, required data-flows, possible maps

– Generalize scheduling to provide possible sequencing of sub-systems into systems meeting external protocol constraints (models)

Protocol Constrained Scheduling

• Problem: Conventional scheduling algorithms cannot accommodate the typical complex sequencing and timing constraints of modern design.

• Three Problems: Specification, Scheduling, and Problem Scale

• Specification: How to specify the required timing in a concise, explicit way?

• Scheduling: How to systematically exploit mapping freedom while meeting the timing requirements?

• Problem Scale: Problems of interest to industry are enormously complex!

• Idea: Protocol specification is amenable to NDFA modeling -- so create automata-based model to represent Control/Data-flow freedom => All possible implementations exist as sequences of states of the joint automaton

Protocol Specification

• Sequencing complexity of digital system interfaces increasing

• Specification languages Verilog?, VHDL require implicit protocol specification

• Alternative specification via NDFA automata (e.g. PBS, Esterel, Custom point language)

– Representation is finite

– Synthesis can be very efficient -- can handle very complex designs

– Provides mechanism for time sequence specification relatively independent of data-flow control semantics

• Protocol + CDFG semantics + mapping abstractions make a complete model

– No ad-hoc mapping library (beyond control of designer)

– No convenient dependency binding assumptions (to be worked around by designer)

– No encrypting desired sequential FSM in higher level language!

• Designer specifies event sequences he wants

• System evaluates/synthesizes ensemble FSM

Design Representation

• Model System as hierarchy of design frames

• Frames have external protocol specification NDFA, CDFG, and allowed Mappings

• Frames contain instances of other frames' abstractions (abstracted NDFA/CDFG model)

• Resource utilization and sharing restricted to within a design frame

[Figure: a design frame containing an External Protocol, a Control Data Flow Graph, and Sub-frame Models]

Hierarchy of Refinement

• Exact protocol scheduling intractable for practical large problems

• Hierarchy of Refinement

– partition the problem into manageable abstractions

– hides lower level details

– allows systematic high-level pruning of designs before more detailed treatment

– Completed sub-frame designs can be abstracted to high level component models

– allows incremental design change/refinement at any level

– provides mechanism for consistency verification

Protocol Scheduling Implementation

• Represent CDFG model as Causal (NDFA) Automaton

– Generalization of current scheduling model

– Models all valid data flows

– Models code hoisting, unrolling, transformations...

• Represent External Protocol as NDFA automaton

– Very general, efficient model

– Synchronous timing model (can be generalized-- future work)

– Alternative behavior as NDFA alternatives

• CDFG maps I/O operations among sub-frames

• Sub-frames have interface protocols, abstracted CDFG semantics

• Construct ensemble automata model with all valid sequences of events meeting internal and external protocols and causal data-flow constraints

• Need only find complete sub-set of all possible states for solution

Scheduling Solution

• Every schedule is some subset of states of the ensemble automaton

• Must construct causal and complete set of states

• Exact solution strategies:

– Construct all states up to resource bounds

– Depth-first search of states

– Heuristic search -- choose good path, complete schedule automatically

– Prune solution space

– Additional constraints or objectives -- technique works best when highly constrained

• Heuristic strategies:

– Sub-set BDD representation of reachable states

– Incremental search (this is not verification!)

• Possible objectives:

– Communication

– Temporary storage (memory)

– Performance

– Control complexity

DFA Model of Two Stage Pipe

• Input = 1 indicates operands are supplied to the pipe

• Output = 1 indicates operand is produced by the pipe

State   a  b  c  b  d  c  b  d  d  c  a
Input   0  1  0  1  1  0  1  1  1  0  0
Output  0  0  0  1  0  1  1  0  1  1  1

[Figure: state diagram over states a, b, c, d with input/output edge labels 0/0, 1/0, 0/1, 1/1]
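The trace on this slide can be replayed with a small table-driven simulation. This is a minimal sketch: the transition table is read off the State/Input/Output trace (which exercises all eight transitions), the State row is taken to be the state reached after each input, and the occupancy meaning given to a, b, c, d in the comments is an assumption for readability.

```python
# (state, input) -> (next_state, output), read off the trace on this slide.
# The occupancy comments are an assumed interpretation, not taken from the slides.
DFA = {
    ("a", 0): ("a", 0), ("a", 1): ("b", 0),   # a: pipe empty
    ("b", 0): ("c", 0), ("b", 1): ("d", 0),   # b: first stage occupied
    ("c", 0): ("a", 1), ("c", 1): ("b", 1),   # c: second stage occupied
    ("d", 0): ("c", 1), ("d", 1): ("d", 1),   # d: both stages occupied
}

def run_dfa(inputs, state="a"):
    """Return the states reached and the outputs produced for each input bit."""
    states, outputs = [], []
    for bit in inputs:
        state, out = DFA[(state, bit)]
        states.append(state)
        outputs.append(out)
    return states, outputs

inputs = [0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0]
states, outputs = run_dfa(inputs)
print(" ".join(states))
print(outputs)
```

Four states are needed here because the deterministic model must name every combination of stage occupancy explicitly.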

NDFA Protocol for Two Stage Pipe

• Inputs and outputs same as DFA model

• Some transitions produce no outputs

Input   0  1    0    1    1      0    1    1      1      0    0
Output  0  0    0    1    0      1    1    0      1      1    1
State   a  a,b  a,c  a,b  a,b,c  a,c  a,b  a,b,c  a,b,c  a,c  a

[Figure: NDFA over states a, b, c with edge labels -/, 1/, -/, -/1]
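The State row above is a set of simultaneously active NDFA states, which can be reproduced by tracking that set directly. This is a sketch under an assumed reading of the figure: a loops on itself on any input, a starts b when the input is 1, b advances to c on any input, and c retires while emitting output 1.

```python
def step(active, bit):
    """Advance the active-state set by one clock; return (next_set, output)."""
    nxt = {"a"}                        # a -> a on any input (edge "-/"); a is always active
    if bit == 1:
        nxt.add("b")                   # a -> b when operands are supplied (edge "1/")
    if "b" in active:
        nxt.add("c")                   # b -> c on any input (edge "-/")
    out = 1 if "c" in active else 0    # c retires and produces an operand (edge "-/1")
    return nxt, out

def run_ndfa(inputs):
    """Return the active sets (as comma-joined labels) and outputs per clock."""
    active, trace, outputs = {"a"}, [], []
    for bit in inputs:
        active, out = step(active, bit)
        trace.append(",".join(sorted(active)))
        outputs.append(out)
    return trace, outputs

inputs = [0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0]
trace, outputs = run_ndfa(inputs)
print(trace)
print(outputs)
```

Three states suffice because each operation in flight is tracked as its own token, rather than enumerating every occupancy combination as the DFA does.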

Operand Scheduling a CDFG on NDFA Protocol

• CDFG to Schedule:

[Figure: CDFG with operands A, B, C, D, E and three multiplications]

• Two stage NDFA protocol description for component

• Protocol alone is insufficient -- need internal data-flow requirements

• Mapping is trivial (in this case)

• Protocol + CDFG is sufficient -- but also describes information not needed externally

• Solution: Simplify scheduling solution of sub-frame to make abstracted model

Operand Schedule on NDFA Protocol

• Optimal one multiplier schedule (co-execution of protocol and causal automata):

[Figure: the CDFG (operands A, B, C, D, E; three multiplications) alongside the two stage NDFA protocol (states a, b, c)]

Input   0  1,A,B  1,D,E  1,t,C  0      0        0
Output  0  0      0      1,t,C  1,D*E  1,A*B*C  0
State   a  a,b    a,b,c  a,b,c  a,c    a        a

Causal Automaton Formulation of Scheduling

• Scheduling Problem (V, E, C, R)

• vertex v ∈ V is an operation

• edge (u,v) ∈ E is a directed edge representing a data dependency

• hyper-edge {v_c, V_Tc, V_Fc} ∈ C groups a control operation and corresponding subsets of operations

• hyper-edge {bound, (T ⊆ V)} ∈ R represents a resource bound applied to a subset of (mapped) operations

• The edge set is partitioned into a forest of forward edges and a subset of looping edges which point backward

• Scheduling solution is a complete, compatible set of deterministic sequences of vertices such that all dependencies are causal and all resource bounds are met at each state, and the set has sequences for each possible future value of the set of controls.

• In the following, we will discuss minimum latency and maximal throughput as objective functions.

Single-Cycle Operation Modeling Automata

• 00 Operation unscheduled and remains so

• 01 Operation scheduled next cycle

• 11 Operation scheduled and remains so

• 10 Operation scheduled but result lost

[Figure: modeling automaton with states 0 and 1 and transitions 00, 01, 10, 11]

Scheduling Automata:

• State represents current set of available operands and state of modeling protocol automata

• Constraints on transitions

• Representation Compact

• Product of Mapped Modeling automata for each resource protocol

[Figure: product of two-state modeling automata for operations h, i, k, ...]

Resource Bounds

• 01 indicates resource use

• Resource bounds constrain simultaneous 01 transitions

• Iterative constraint on CA

• ROBDD representation:

– 2 · |bound| · |operations|

[Figure: modeling automata for operations h, i, j (transitions 00, 01, 11) and the ROBDD over their 01 transitions for a bound of One Resource]

Dependency Implication

• All transitions in which j is active before all of its predecessors are known are removed

• BDD Complexity is O(|predecessors| * |operations|)

[Figure: dependency graph over operations h, i, j]

Example NFA

• Assume 1 resource

• Transition relation induces graph

• Any path from all operations unknown to all known is a valid schedule

• Shortest paths are minimum latency schedules

[Figure: dependency graph over operations h, i, j, and the induced transition graph unrolled over time steps; each step contains the states 000, i00, ij0, 00h, i0h, ijh]
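The path-based view of scheduling can be illustrated with an explicit-state search over this three-operation example. This is only a sketch: it enumerates states one by one rather than using the symbolic (BDD) representation the talk describes, and it assumes that j depends on i only, which is inferred from the states listed in the figure (no state has j known without i). The names `OPS`, `PRED`, and `BOUND` are illustrative.

```python
from itertools import combinations

OPS = ("i", "j", "h")                          # bit order used in the state labels
PRED = {"i": set(), "j": {"i"}, "h": set()}    # assumption: j depends on i only
BOUND = 1                                      # one resource: at most one 0->1 transition per step

def label(known):
    """Render a state the way the slide does, e.g. 'i0h'."""
    return "".join(op if op in known else "0" for op in OPS)

def successors(known):
    """Next states after one step; idle steps are omitted since they never shorten a schedule."""
    ready = [op for op in OPS if op not in known and PRED[op] <= known]
    for n in range(1, BOUND + 1):
        for fired in combinations(ready, n):
            yield known | frozenset(fired)

def min_latency_schedules():
    """Breadth-first search: return every shortest path from 000 to all-known."""
    goal = frozenset(OPS)
    frontier = [[frozenset()]]
    while frontier:
        done = [path for path in frontier if path[-1] == goal]
        if done:
            return done
        frontier = [path + [nxt] for path in frontier for nxt in successors(path[-1])]
    return []

schedules = min_latency_schedules()
for path in schedules:
    print(" -> ".join(label(s) for s in path))
```

Under these assumptions every shortest path takes three steps, and together the paths visit exactly the six states shown in the figure.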

All Minimum Latency Schedules

• Symbolic reachable state analysis

– Newly reached states are saved each cycle

– Backward pruning preserves transitions used in all shortest paths

[Figure: transition graph over states 000, i00, ij0, 00h, i0h, ijh, with time steps numbered 3 down to 0 as backward pruning proceeds]

• The described construction is exact

• Suitable heuristics are available; because they can use arbitrary subsets of the potential schedules, they are powerful

CDFG Representation

[Figure: example CDFG with operations h1, i1, i2, j1, j2, k1; legend: Operation, Control Dependency, Data Dependency, Fork, Join, Resource Class]

CDFGs: Multiple Control Paths

• Guard automata differentiate control paths

– Before control operation scheduled: Control value unknown

– After control operation scheduled: Control value known

• Guards are implemented as modified operation automata

CDFGs: Multiple Control Paths

• All control paths form ensemble schedule

– Possibly 2^c control paths to schedule (non-looping case)

• Dummy operation identifies when control path terminates

– Only one termination operation

• Ensemble schedule need not be causal!

– Need solution for each control path (Completeness)

– Need compatibility between paths whose control is not resolved (Causality)

– Solution: validation algorithm

– Validation is a path to path property for all control paths in ensemble schedule

– Fixed Point Iteration

CDFG Example

• One green resource

• Shortest paths

• False termination

[Figure: CDFG with operations i, j, c, t and its state graph over 0000, 0i00, c000, 00j0, ci00, 0ij0, c0j0, ci0t, cij0, c0jt, cijt]

Validated CDFG Example

• Validation algorithm ensures control paths don’t bifurcate before control value is known

[Figure: validated state graph over 0000, 0i00, c000, 00j0, ci00, 0ij0, c0j0, cij0, cijt]

Validated CDFG Example

• Validation algorithm ensures control paths don’t bifurcate before control value is known

• Pruned for all shortest paths as before

[Figure: pruned state graph over 0000, 0i00, 00j0, ci00, c0j0, cij0, cijt]

Validation Algorithm

• Validation Proceeds on potential traces

• Re-traverse Automata, Dynamically Modifying Transition Relation based on current available states in each time step: Allow guard computation only for states with matching histories if the guard is true or false.

• Iterate until fixed point on all paths

• Apply the following non-linear filter to each transition:

[Equation: validity filter over state pairs (S_j V_j, S_j' V_j') and guard terms P_k, N_k for k in G; not legible in this transcript]

Selected CDFG Benchmarks

[Bar chart: Operations, Min. Latency, CPU Seconds, and Conditions for benchmarks EWF1, FDCT1, MAHA, KIM, ROTOR; vertical axis 0 to 45]

Large Benchmarks

[Bar chart: Operations, Min. Latency, CPU Seconds, and Conditions for benchmarks EWF3, EWF6, EWF2x2, FDCT1x2, S2R; vertical axis 0 to 250, with one off-scale bar labeled 957]

Comparison of CPU Times

[Bar chart: CPU times for benchmarks EWF1, EWF3, ROTOR; series Current, Radivojevic, Yang, Heuristic; vertical axis 0 to 600; bar labels 1, 392, 512, 15, 176, 3, 103]

Required CPU Seconds

[Bar chart: required CPU seconds for EWF1, EWF3, EWF6, FDCT1 under resource sets 1A 1M, 1A 2M, 2A 2M, 3A 3M, 1A 1PM, 2A 1PM, 2A 2PM, 3A 2PM; vertical axis 0 to 80]

Construction for Looping DFGs

• Use trick: 0/1 representation of the MA could be interpreted as 2 mutually exclusive operand productions

• Schedule from ~known -> known -> ~known where each 0->1 or 1->0 transition requires a resource.

• Since dependencies are on operands, add new dependencies in 1->0 sense as well

• Idea is to remove all transitions which do not have complete set of known or ~known predecessors for respective sense of operation

• So -- get looping DFG automata as nearly same automata as before

– preserve efficient representation

• Selection of “Minimal Latency” solutions is more difficult

Loop construction: resources

• Resources: we now count both 0->1 and 1->0 transitions as requiring a resource.

• Use “Tuple” BDD construction: at most k bits of n BDD

• Despite exponential number of product terms, BDD complexity: O(bound * |V|)

[Figure: BDD over variables A through G for the Resource Bound (at most 4 out of 7), with diagonal dimensions (n-k) and (k+1)]
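The size claim for the "at most k of n" constraint can be checked by building the decision diagram directly. This is a minimal sketch, not the talk's implementation: a node is identified by the hypothetical pair (level, ones), meaning how many variables have been tested and how many of them were 1, and any node whose outcome is already decided collapses into a terminal.

```python
def at_most_k_nodes(n, k):
    """Return the internal nodes (level, ones) of the diagram for 'at most k of n bits set'."""
    nodes = set()

    def visit(level, ones):
        # Terminal 0: bound already exceeded. Terminal 1: bound can no longer be exceeded.
        if ones > k or ones + (n - level) <= k:
            return
        if (level, ones) in nodes:
            return                     # shared node, already expanded
        nodes.add((level, ones))
        visit(level + 1, ones)         # low edge: this variable is 0
        visit(level + 1, ones + 1)     # high edge: this variable is 1

    visit(0, 0)
    return nodes

nodes = at_most_k_nodes(7, 4)
print(len(nodes))                                                  # internal node count
print([sum(1 for lvl, _ in nodes if lvl == i) for i in range(7)])  # nodes per variable
```

For the 4-out-of-7 case this yields 15 internal nodes with per-variable widths 1, 2, 3, 3, 3, 2, 1, never more than (k+1) per level, so the total stays within O(bound * |V|) even though the number of satisfying assignments grows combinatorially.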

Example CA

• State order (v1,v2,v3,v4)

• Path 0,9,C,7,2,9,C,7,2,… is a valid schedule.

• By construction, only 1 instance of any operator can occur in a state.

[Figure: DFG over operations v1, v2, v3, v4]

Present State   Next State
0,1             0,1,8,9
2,3             0,1,2,3,8,9,A,B
4,5,C,D         4,5,6,7,C,D,E,F
6,7,A,B         2,3,6,7,A,B,E,F
8,9             0,1,4,5,8,9,C,D
E,F             6,7,E,F
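The claim that 0,9,C,7,2,9,C,7,2 is a valid schedule can be checked mechanically against the transition table. The sketch below copies the table from the slide, with each state written as one hex digit over (v1,v2,v3,v4); the names `TABLE`, `NEXT`, and `is_valid_path` are illustrative.

```python
# Present-state group -> allowed next states, copied from the slide's table.
TABLE = {
    "01":   "0189",
    "23":   "012389AB",
    "45CD": "4567CDEF",
    "67AB": "2367ABEF",
    "89":   "014589CD",
    "EF":   "67EF",
}

# Expand the grouped rows into one successor set per state.
NEXT = {state: set(successors) for group, successors in TABLE.items() for state in group}

def is_valid_path(path):
    """True if every consecutive pair of states is an allowed transition."""
    return all(b in NEXT[a] for a, b in zip(path, path[1:]))

print(is_valid_path("09C729C72"))   # the path quoted on the slide
print(is_valid_path("0C"))          # a step the table does not allow
```

The repeating segment 9,C,7,2 returns to state 9, which is what lets the schedule loop indefinitely.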

Strategy to Find Maximal Throughput

• CA automata construction simple

• How to find closed subset of paths guaranteeing optimal throughput?

• Could start from known initial state and prune slow paths as before -- but this is not optimal!

• Instead: find all reachable states (without resource bounds)

• Use state set to prune unreachable transitions from CA

• Choose operator at random to be pinned (marked)

• Propagate all states with chosen operator until it appears again in same sense

• Verify closure of constructed paths by Fixed Point iteration

• If set is empty -- add one clock to latency and verify again

• Result is maximal closed set of paths for which optimal throughput is guaranteed

Maximal Throughput Example

[Figure: DFG with operations a, b, c, d, e]

• DFG above has closed 3-cycle solution (2 resources)

• However, average latency is 2.5 cycles

• (a,d) (b,e) (a,c) (b,d) (c,e) (a,d) …

• Requires 5 states to implement optimal throughput instance

• In general, it is possible that a k-cycle closed solution may exist, even if no k-state solution can be found

• Current implementation finds all possible k-cycle solutions
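The 2.5-cycle figure follows from simple counting over the five-state sequence listed above. The sketch below checks only resource usage and iteration counts; the data dependencies of the DFG are not recoverable from the transcript, so they are not modeled.

```python
from collections import Counter

RESOURCES = 2
# The repeating five-state sequence from the slide; each tuple is one clock cycle.
kernel = [("a", "d"), ("b", "e"), ("a", "c"), ("b", "d"), ("c", "e")]

counts = Counter(op for state in kernel for op in state)
iterations = counts["a"]                                 # loop iterations covered by the kernel

assert all(n == iterations for n in counts.values())     # every operation runs equally often
assert all(len(state) <= RESOURCES for state in kernel)  # resource bound holds in each cycle

average_latency = len(kernel) / iterations
print(iterations, average_latency)
```

Five operations on two resources cannot average fewer than 5/2 cycles per iteration, so this five-state sequence covering two iterations meets the resource lower bound, which no whole-number cycle count per iteration could do.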

EWF Looping Benchmarks

[Bar chart: cycles and CPU for EWF 3,2; EWF 2,2; 2P 3,2; 2S 3,2; vertical axis 0 to 35, with one off-scale bar labeled 268]

Synthetic Benchmarks

• Over 100 synthetic benchmarks tested

– Sizes 50 operator, 100 operator, randomly assigned dependency chains, resources

• 32% had no causal schedule

• 35% had all maximum throughput schedules found in 15 minute timeout (1 minute Reachable States, 14 minute Fixed Point)

• 33% Timed Out

– Analysis of timeout cases: most included disconnected independent sub-graphs

– Trial partitioning of the Transition Relation looks very promising on these cases (time/space reduction nearly quadratic!)

Synthetic Loop Benchmarks

[Chart: Cycles, Ln CPU, Resources, Loop Dep, and Ln Nodes for synthetic benchmarks 223-1,4; 262-4,2; 213-5,1; 206-2,5; vertical axis 0 to 80]

Schedule Exploration: Loops

• Idea: Use partial symbolic traversal to find states bounding minimal latency paths

• Latency-- Identify all paths completing cycle in given number of steps

• Repeatability-- Fixed Point Algorithm to eliminate all paths which cannot repeat in given latency

• Validation-- Ensure all possible control paths are present for each remaining path

• Optimization-- Selection of Performance Objective

Kernel Execution Sequence Set

• Path from Loop cut to first repeating states

• Represents candidates for loop kernel

[Figure: paths from the Loop Cut through the Loop Kernel to the first repeating states]

Repeatable Kernel Execution Sequence Set

• Fixed-point prunes non-repeating states

– Only repeatable loop kernels remain

– Paths not all same length

– Average latency <= shortest Repeating Kernel

[Figure: paths from the Loop Cut through the Repeatable Loop Kernel]

Validation I

• Schedule Consists of bundle of compatible paths for each possible future

• Not Feasible to identify all schedules

• Instead, eliminate all states which do not belong to some ensemble schedule

• Fragile since any further pruning requires re-validation

• Double fixed point

Validation II

• Path Divergence -- Control Behavior

– Ensure each path is part of some complete set for each control outcome

– Ensure that each set is Causal

[Figure: control paths diverging within the loop kernel]

Loop Cuts and Kernels

• Method Covers all Conventional Loop Transformations

– Sequential Loop

– Loop winding

– Loop Pipelining

[Figure: position of the Loop Cut and Loop Kernel for each of the three cases]

Results

• Conventional Scheduling:

– 100-500x speedup over ILP

• Control Scheduling:

– Complexity typically pseudo polynomial in number of branching variables

• Cyclic Scheduling:

– Reduced preamble complexity

– Capacity: 200-500 operands in exact implementation

• General Control Dominated Scheduling:

– Implicit formulation of all forms of CDFG transformation

– Exact Solutions with Millions of Control paths

• Protocol Constrained Scheduling:

– Exact for small instances -- needs sensible pruning of domain

MIPS Model

• SimpleScalar (MIPS IV superset) Model

• Trace Probabilities from MediaBench

• Hierarchical Model

– Collection of Instruction Tasks in Flight

– Each Instruction Task is Complete Behavioral Model of Instruction Execution, including all instruction types, hazards, controls, and Contention for Physical Resources

– Additional Sequential Protocols for Memory Subsystem, both Fetch and Load/Store

Processor Composition

• Ordered Fetch/Commit

• 3 Simultaneous Instruction Executions

• Sequencing of Instructions separated from pipeline

• Out of Order Prefetch or Commit can be Modeled

[Figure: three instruction task blocks with ports labeled Bypass, Instruction, PPC and Bypass, Next ins, Next PPC]

PC update: Speculative Fetch

• Speculate Joins to allow early prefetch and address computation

MIPS Transaction Dependencies

MIPS Results: Constraints

• Scenario A

– 1/2 cycle tasks, Single Bypass

– 2 cycle Pipelined Double Word Memory Fetch

– 2 cycle Pipelined Multiply

– 2R/1W Register File, 2 ALUs, 2 port Memory

• Scenario B

– 2 cycle Memory Read/Write/Fetch

– 2R -1R/1W Register File, 1 ALU, 1 port Memory

– Cache 1 cycle hit/3 cycle miss, Deferred Pipeline

MIPS Results: Instruction Mix

• Media Bench Tuning:

– 88% reg-reg, reg-imm, br taken, load single

– 80% branch taken

– 35% Single Bypass Hazard

– 1% Multiple Bypass (Stall in model)

• Two Sets of Priority Mixes

– Mix 1: favors (reg-reg, reg-imm, br-taken)

– Mix 2: favors (load-sw, br-taken)

MIPS Results: Mix 1

• Mix 1 favors reg-reg, reg-imm, and br-taken

MIPS Results: Mix 2

• Mix 2 Favors loads, reg-reg w. branches

Cache and I/O Protocol

• For 3 instructions in flight > 542,000 control paths!

• Schedules still exact -- every optimal sequence is constructed

Conclusions

• NFA protocol modeling shown to be effective representation for generalized scheduling problem

• Efficiency of algorithms so far is comparable or superior to any known exact technique

• Potential for powerful heuristics based on sub-set representation

• First exact solutions for a wide variety of generalized scheduling problems