CALTECH CS137 Spring2002 -- DeHon
CS137: Electronic Design Automation
Day 13: May 20, 2002
Page Generation
(Area and IO Constraints)
[working problem with Eylon Caspi]
Today
• Cover/clustering
  – Minimize weight
  – With area and IO constraints
• Motivation: SCORE page generation
  – Also energy minimization
• Techniques
• Current results
• FPGA/hardware implementation?
Abstract Problem
• Given: Graph (V,E) with a single weight (area) on each node and two weights (IO, cost) on the edges.
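• Cluster nodes into subsets $V_i$ such that:
  – $\sum_i \mathrm{Cost}(V_i)$ is minimized
  – $\mathrm{IO}(V_i) < \text{IO limit}$
  – $A(V_i) < \text{Area limit}$
  – $\mathrm{Cost}(V_i) = \sum \{\, \mathrm{cost}(e) \mid e \in E \text{ s.t. } e_1 \in V_i \text{ and } e_2 \notin V_i \,\}$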
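A minimal sketch of how these quantities could be evaluated for one candidate clustering. It assumes edges are treated as undirected pairs and that both IO(Vi) and Cost(Vi) sum over the edges with exactly one endpoint inside Vi. The names `Edge`, `cluster_metrics`, and `evaluate`, and the sample graph, are illustrative and not from the lecture.

```python
from collections import namedtuple

# Hypothetical edge record: endpoints plus the two edge weights.
Edge = namedtuple("Edge", "u v io cost")

def cluster_metrics(area, edges, cluster):
    """Return (A, IO, Cost) for one cluster given as a set of node names."""
    members = set(cluster)
    a = sum(area[n] for n in members)
    # An edge is cut when exactly one endpoint lies inside the cluster.
    cut = [e for e in edges if (e.u in members) != (e.v in members)]
    return a, sum(e.io for e in cut), sum(e.cost for e in cut)

def evaluate(area, edges, clusters, area_limit, io_limit):
    """Check both limits on every cluster and total the per-cluster cost."""
    feasible, total = True, 0.0
    for c in clusters:
        a, io, cost = cluster_metrics(area, edges, c)
        feasible = feasible and a < area_limit and io < io_limit
        total += cost
    return feasible, total

# Illustrative 4-state graph: cost plays the role of transition probability.
area = {"s0": 40, "s1": 30, "s2": 50, "s3": 20}
edges = [Edge("s0", "s1", 8, 0.9), Edge("s1", "s2", 8, 0.1),
         Edge("s2", "s3", 16, 0.8), Edge("s3", "s0", 8, 0.1)]
clusters = [{"s0", "s1"}, {"s2", "s3"}]
feasible, total = evaluate(area, edges, clusters, area_limit=100, io_limit=32)
```

Under this reading, the sum over clusters counts each cut edge once for each of the two clusters it touches.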
SCORE Compilation
Programming Model:
• Graph of TDF FSMD operators
  – Unlimited size, # IOs
  – No timing constraints
Execution Model:
• Graph of page configs
  – Fixed size, # IOs
  – Timed, single-cycle firing
[Figure: Compile maps a graph of TDF operators, streams, and memory segments to a graph of compute pages, streams, and memory segments]
How Big is an Operator?
• Wavelet Decode
• Wavelet Encode
• JPEG Encode
• MPEG Encode
[Figure: Area for 47 Operators (Before Pipeline Extraction). x-axis: operator (sorted by area); y-axis: area (4-LUTs), 0 to 3500; series: FSM area, DF area]
• JPEG Encode
• JPEG Decode
• MPEG (I)
• MPEG (P)
• Wavelet Encode
• IIR
Clustering is Critical
• Inter-page communication latency may be long
• Inter-page feedback loops are slow
• Cluster to:
  – Fit feedback loops within page
  – Fit feedback loops on device
Pipeline Extraction
• Hoist uncontrolled feed-forward (FF) data-flow out of FSMD
• Benefits:
  – Shrink FSM cyclic core
  – Extracted pipeline has more freedom for scheduling and partitioning
• Example (Extract):
  – Before: state foo(i): acc=acc+2*i
  – After: state foo(two_i): acc=acc+two_i
[Figure: input i feeds a *2 pipeline that produces two_i for the state; labels: DF, CF, pipeline]
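The extraction in the example can be mimicked in software. This sketch assumes a Python generator is an acceptable stand-in for a stream and a loop body for the single FSMD state. The function names are illustrative only.

```python
def fsmd_original(inputs):
    """Original operator: the multiply sits inside the state's action."""
    acc = 0
    for i in inputs:            # state foo(i): acc = acc + 2*i
        acc = acc + 2 * i
    return acc

def times_two_pipeline(inputs):
    """Extracted feed-forward data-flow: needs no FSM state or control."""
    for i in inputs:
        yield 2 * i

def fsmd_residual(two_i_stream):
    """Residual operator: only the cyclic accumulate remains in the FSMD."""
    acc = 0
    for two_i in two_i_stream:  # state foo(two_i): acc = acc + two_i
        acc = acc + two_i
    return acc

samples = [1, 2, 3]
before = fsmd_original(samples)
after = fsmd_residual(times_two_pipeline(samples))
```

Both versions compute the same accumulation, but in the second the multiply can be scheduled or partitioned independently of the state machine.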
Pipeline Extraction – Extractable Area
[Figure: Extractable Data-Path Area for 47 Operators. x-axis: operator (sorted by data-path area); y-axis: area (4-LUTs), 0 to 3500; series: extracted DF area, residual DF area]
• JPEG Encode
• JPEG Decode
• MPEG (I)
• MPEG (P)
• Wavelet Encode
• IIR
Page Generation
• Pipeline extraction
  – Removes dataflow that can be freely extracted from FSMD control
• Still have to partition potentially large FSMs
  – Approach: turn into a clustering problem
State Clustering
• Start: consider each state to be a unit
• Cluster states into page-size sub-FSMDs
  – Inter-page transitions become streams
• Possible clustering goals:
  – Minimize delay (inter-page latency)
  – Minimize IO (inter-page BW)
  – Minimize area (fragmentation)
[Figure labels: IA, IB, OA, OB]
State Clustering to Minimize Inter-Page State Transfer
• Inter-page state transfer is slow
• Cluster to:
  – Contain feedback loops
  – Minimize frequency of inter-page state transfer
• Previously used in:
  – VLIW trace scheduling [Fisher '81]
  – FSM decomposition for low power [Benini/DeMicheli ISCAS '98]
  – VM/cache code placement
  – GarpCC code selection [Callahan '00]
Clustering Problem
• SCORE page
  – Fixed area (# of LUTs)
  – Fixed IO
• Cost on each edge is the probability of taking that state transition
• Clustering goal is to minimize page-to-page transitions
  – Maximize expected transitions within same page
  – Find page-count/page-transition tradeoff curve
Abstract Problem
• Given: Graph (V,E) with a single weight (area) on each node and two weights (IO, cost) on the edges.
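• Cluster nodes into subsets $V_i$ such that:
  – $\sum_i \mathrm{Cost}(V_i)$ is minimized
  – $\mathrm{IO}(V_i) < \text{IO limit}$
  – $A(V_i) < \text{Area limit}$
  – $\mathrm{Cost}(V_i) = \sum \{\, \mathrm{cost}(e) \mid e \in E \text{ s.t. } e_1 \in V_i \text{ and } e_2 \notin V_i \,\}$
• Here:
  – Subsets $V_i$: pages
  – Cost: inter-page communication frequency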
DSM
• Possibly relevant for minimizing delay in DSM
• Previously discussed:
  – Larger area → longer wires, slower
  – Want to cluster logic locally
• Maybe:
  – Cluster common computations together
  – Make distant computation transfer uncommon
Island Packing for Energy
• Modern FPGAs pack a cluster of LUTs into an endpoint
  – e.g. Altera LAB
• Local wiring costs less energy than long wiring
• Covering for energy:
  – Minimize exposed activity factor
  – Same covering problem
Abstract Problem
• Given: Graph (V,E) with a single weight (area) on each node and two weights (IO, cost) on the edges.
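• Cluster nodes into subsets $V_i$ such that:
  – $\sum_i \mathrm{Cost}(V_i)$ is minimized
  – $\mathrm{IO}(V_i) < \text{IO limit}$
  – $A(V_i) < \text{Area limit}$
  – $\mathrm{Cost}(V_i) = \sum \{\, \mathrm{cost}(e) \mid e \in E \text{ s.t. } e_1 \in V_i \text{ and } e_2 \notin V_i \,\}$
• Here:
  – Subsets $V_i$: clusters/islands
  – Cost: switching activity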
First Try
• Use FBB (flow cut) [Wong/cs137a:day7]
• Pick seed element
• Compute mincut
  – On mix of IO, cost edge weights?
• If too small,
  – Cluster in node and repeat
• Else
  – Cluster out node and repeat
Mincut lessons
• Couldn't consistently control IO
  – Non-monotonic results when adjusting weight
• Not clear what to cluster in
Idea #2
• If we had an ordering of nodes
  – (wishful thinking)
• Then easy to know how to include more
  – Just pick the next node
• Order: 1D list of nodes
• Cluster: a contiguous sequence of nodes in list
  – Specify start, finish
From Sequence to Clusters
• Easy to know if a contiguous subsequence
  – Meets area constraints
  – Meets IO constraints
• Cover
  – Set of (non-overlapping) subsequences
  – Include all nodes
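A minimal sketch of the feasibility check over contiguous subsequences. It assumes edges are given as hypothetical `(u, v, io, cost)` tuples, that IO and cost of a slice sum over edges crossing the slice boundary, and that areas are non-negative. The function name and sample data are illustrative.

```python
def feasible_intervals(order, area, edges, area_limit, io_limit):
    """Map (start, end) -> cut cost for each feasible slice order[start:end]."""
    pos = {n: k for k, n in enumerate(order)}
    table = {}
    for start in range(len(order)):
        for end in range(start + 1, len(order) + 1):
            a = sum(area[n] for n in order[start:end])
            if a >= area_limit:
                break  # area only grows as the slice extends
            io = cost = 0
            for u, v, e_io, e_cost in edges:
                inside_u = start <= pos[u] < end
                inside_v = start <= pos[v] < end
                if inside_u != inside_v:  # edge crosses the slice boundary
                    io += e_io
                    cost += e_cost
            # IO is not monotonic in slice length, so keep scanning on failure.
            if io < io_limit:
                table[(start, end)] = cost
    return table

order = ["a", "b", "c", "d"]
area = {"a": 10, "b": 10, "c": 10, "d": 10}
edges = [("a", "b", 1, 5), ("b", "c", 1, 1), ("c", "d", 1, 5)]
table = feasible_intervals(order, area, edges, area_limit=25, io_limit=3)
```

The result is the sparse version of an NxN table: only slices that meet both limits get an entry.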
Feasible Clusters (mult16a)
Covering
• Not clear when to put more or less stuff in a cluster versus leaving it for the next cluster
  – Can't build clusters greedily
• Like the associative/parenthesization problem we saw earlier [day 5]
Parenthesis Matching
• Similar
• But compute from all breaks across a diagonal
  – Not just nearest neighbor
• Hence extra O(N)
Day 5
Dynamic Programming
• For each subsequence start, end
  – Either the area and IO meet the constraints
  – OR want to find a breakpoint between cluster sets
    • Cluster sets start→midpoint, midpoint→end may each be either single or multiple clusters
• Different splits may
  – Minimize number of clusters
  – Minimize cost
  – Keep dominator set [day11]
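A minimal sketch of the covering step. It assumes the feasible slices are already available as a sparse dict from hypothetical `(start, end)` index pairs to cost. Because clusters are contiguous in the order, this sketch uses a left-to-right pass over prefixes, which is a simplification of the split-at-a-midpoint formulation above. In the spirit of the dominator set, it keeps every non-dominated (cluster count, cost) pair. All names are illustrative.

```python
def prune(candidates):
    """Keep only covers not dominated in both cluster count and cost."""
    kept = []
    for cand in sorted(candidates, key=lambda c: (c[0], c[1])):
        # Counts are non-decreasing here, so keep only strict cost improvements.
        if not kept or cand[1] < kept[-1][1]:
            kept.append(cand)
    return kept

def pareto_covers(n, table):
    """Return non-dominated (count, cost, slices) covers of positions 0..n-1."""
    best = [[] for _ in range(n + 1)]   # best[k] covers the prefix order[:k]
    best[0] = [(0, 0, ())]
    for end in range(1, n + 1):
        candidates = []
        for start in range(end):
            if (start, end) not in table:
                continue                # slice violates area or IO limit
            for count, cost, slices in best[start]:
                candidates.append((count + 1,
                                   cost + table[(start, end)],
                                   slices + ((start, end),)))
        best[end] = prune(candidates)
    return best[n]

# Illustrative sparse table of feasible slices and their cut costs.
table = {(0, 1): 5, (0, 2): 1, (1, 2): 6, (1, 3): 10,
         (2, 3): 6, (2, 4): 1, (3, 4): 5}
covers = pareto_covers(4, table)
```

Each returned tuple is one point on a cluster-count versus cost tradeoff curve, together with the slices that achieve it.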
Algorithm
• Compute linear order
• Compute IO, area on each subsequence
  – Think NxN table (but sparse)
• Use dynamic programming to cover
Compute Order?
• Could experiment with various techniques
• Considering: spectral ordering
  – [Hall/cs137a:day7]
• How to weight edges?
  – IO, cost, mix?
  – Try linear mix…vary mix weighting
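A minimal sketch of a spectral ordering with a linear weight mix. The assumptions here are not from the lecture: each edge weight is `(1 - mix) * io + mix * cost`, the graph is connected with at least one edge, and the ordering sorts nodes by the Laplacian eigenvector of the second-smallest eigenvalue, found by shifted power iteration in plain Python. The names `spectral_order`, `mix`, and `iters` are illustrative.

```python
import math

def spectral_order(nodes, edges, mix=0.5, iters=500):
    """Order nodes by the second-smallest Laplacian eigenvector."""
    n = len(nodes)
    idx = {name: k for k, name in enumerate(nodes)}
    w = [[0.0] * n for _ in range(n)]
    for u, v, io, cost in edges:
        weight = (1.0 - mix) * io + mix * cost   # assumed linear mix
        w[idx[u]][idx[v]] += weight
        w[idx[v]][idx[u]] += weight
    degree = [sum(row) for row in w]
    shift = 2.0 * max(degree)   # upper bound on the largest Laplacian eigenvalue
    x = [k - (n - 1) / 2.0 for k in range(n)]    # non-constant start vector
    for _ in range(iters):
        # y = (shift*I - L) x, where L = D - W is the graph Laplacian.
        y = [(shift - degree[i]) * x[i]
             + sum(w[i][j] * x[j] for j in range(n)) for i in range(n)]
        mean = sum(y) / n
        y = [val - mean for val in y]            # remove the constant component
        norm = math.sqrt(sum(val * val for val in y))
        x = [val / norm for val in y]
    return [nodes[k] for k in sorted(range(n), key=lambda k: x[k])]

# A path a-b-c-d-e presented in scrambled order.
nodes = ["c", "a", "e", "b", "d"]
edges = [("a", "b", 1, 1), ("b", "c", 1, 1), ("c", "d", 1, 1), ("d", "e", 1, 1)]
order = spectral_order(nodes, edges)
```

The direction of the resulting order is arbitrary, since an eigenvector and its negation are equally valid. Sweeping `mix` between 0 and 1 is one way to vary the mix weighting.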
Weight Mix
• Why unclear?
  – IO weight is good for clustering connectivity
    • If IOs are limited, allows use of fewer clusters
    • Pack more stuff into page → fewer cases need to transition
  – Cost weight is what we're minimizing
    • Cluster high-cost edges together
    • Hide in page
  – But cost ordering may get less stuff in page if poorly IO clustered…
spp results
• [see HTML]
Versus Weighting (w by 0.01)
Discussion
• Promising results
  – New capability, not clear what to compare to
    • Maybe LUT clustering to validate algorithm
  – Absolutes look promising
• Weighting
  – Not clear how to search for best
  – Maybe should try other ways of weighting?
    • [Michael suggests trying log(trans)]
Spatial/Hdw Implementation?
• Compute linear order
  – Use 1D FDSA?
• Compute IO, area on each subsequence
  – Parallel prefix sum scan
    • One for each start point?
• Use dynamic programming to cover
  – Like parenthesis
  – Maybe 1D and combine with area/IO scan?
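A sequential stand-in for the scans, written as a sketch under assumptions: area uses an ordinary prefix sum, and IO uses one running scan per start point, where extending the slice by a node exposes its edges to outside nodes and hides its edges to nodes already inside. Edges are hypothetical `(u, v, io)` tuples and all names are illustrative.

```python
from itertools import accumulate

def area_prefix(order, area):
    """prefix[k] = total area of order[:k], so a slice costs one subtraction."""
    return [0] + list(accumulate(area[n] for n in order))

def io_scan(order, edges, start):
    """IO of order[start:end] for every end > start, from one running scan."""
    pos = {n: k for k, n in enumerate(order)}
    adj = {n: [] for n in order}
    for u, v, io in edges:
        if u != v:                      # a self-loop never leaves the cluster
            adj[u].append((pos[v], io))
            adj[v].append((pos[u], io))
    running, result = 0, []
    for end in range(start + 1, len(order) + 1):
        for p, io in adj[order[end - 1]]:
            if start <= p < end - 1:
                running -= io           # edge to an earlier member is now internal
            else:
                running += io           # edge to an outside node is newly exposed
        result.append(running)
    return result

order = ["a", "b", "c", "d"]
area = {"a": 10, "b": 10, "c": 10, "d": 10}
edges = [("a", "b", 1), ("b", "c", 1), ("c", "d", 1)]
prefix = area_prefix(order, area)
slice_area = prefix[3] - prefix[1]      # area of order[1:3]
io_from_0 = io_scan(order, edges, 0)
```

Each start point's scan is independent of the others, which is what would let a spatial implementation run one scan per start point side by side.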
Promising Ideas
• Compute good ordering
  – Easy to vary inclusion when you know what's next to include/exclude
• Mix weights
• Cluster to minimize exposed (cut) costs