Flexible Hardware Design at Low Levels of Abstraction
Emil Axelsson
Hardware Description and Verification
May 2009
Why low-level?
gadget a b = case a of
  2 -> thing (b+10)
  3 -> thing (b+20)
  _ -> fixNumber a
Related question: Why is some software written in C? (but difference between high- and low-level is much greater in hardware)
Ideal: Software-like code → magic compiler → chip masks
Why low-level?
Reality: “ASCII schematic” → chain of synthesis tools → chip masks
Reiterate to improve timing/power/area/etc. Very costly and time-consuming.
Each fabrication costs ≈ $1,000,000
Failing abstraction
Realistic flow cannot avoid low-level awareness
Paradox: Modern designs require a higher abstraction level... but modern chip technologies make abstraction harder
Main problem: routing wires dominate signal delays and power consumption
Controlling the wires is key to performance!
Gate vs. wire delay under scaling
[Figure: relative gate and wire delay vs. process technology node [nm]]
Physical design level
Certain high-performance components (e.g. arithmetic) need to be designed at an even lower level
Physical level:
A set of connected standard cells (implemented gates)
Absolute or relative positions of cells (placement)
Shape of connecting wires (routing)
Physical design level
Design by interfacing to physical CAD tools: call automatic tools for certain tasks (mainly routing)
Often done through scripting code: tedious, hard to explore the design space, limited design reuse
Aim of this work: raise the abstraction level of physical design!
Two ways to raise abstraction
Automatic synthesis
+ Powerful abstraction
– May not be optimal for e.g. high-performance arithmetic
– Opaque (hard to control the result)
– Unstable (heuristics-based)
Language-based techniques (higher-order functions, recursion, etc.)
+ Transparent, stable
– Still quite low-level
– Somewhat limited to regular circuits
Our approach
Lava
Gate-level hardware description in Haskell
Parameterized module generators: Haskell programs that generate circuits
Can be smart, e.g. optimize for speed in a given environment
Basic placement expressed through combinators
Used successfully to generate high-performance FPGA cores
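To make the generator idea concrete, here is a minimal sketch in plain Haskell (not the actual Lava API; the gate functions and the ripple-carry generator are hypothetical stand-ins): a generator is an ordinary Haskell function that builds a circuit of any width by recursion.

-- A minimal sketch (not the actual Lava API): gates modeled as pure
-- Boolean functions; a generator is an ordinary recursive function.
type Bit = Bool

xor2, and2, or2 :: (Bit, Bit) -> Bit   -- hypothetical gate primitives
xor2 (a, b) = a /= b
and2 (a, b) = a && b
or2  (a, b) = a || b

-- One full adder: (carryIn, (a, b)) -> (sum, carryOut)
fullAdder :: (Bit, (Bit, Bit)) -> (Bit, Bit)
fullAdder (cin, (a, b)) = (s, cout)
  where
    p    = xor2 (a, b)
    s    = xor2 (p, cin)
    cout = or2 (and2 (a, b), and2 (p, cin))

-- A parameterized module generator: a ripple-carry adder of any width.
rippleAdder :: Bit -> [(Bit, Bit)] -> ([Bit], Bit)
rippleAdder cin []          = ([], cin)
rippleAdder cin (ab : rest) = (s : ss, coutFinal)
  where
    (s, cout)       = fullAdder (cin, ab)
    (ss, coutFinal) = rippleAdder cout rest

In Lava proper, the primitives build netlist nodes rather than computing Booleans directly, so the same style of generator yields circuit descriptions that can be simulated or laid out.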
Wired: Extension to Lava
Finer control over geometry
More accurate performance models: feedback from timing/power analysis enables self-optimizing generators
Wire-awareness (unique to Wired):
Performance analysis based on wire-length estimates
Control routing through “guides” (experimental)
...
Monads in Haskell
Haskell functions are pure
Side-effects can be “simulated” using monads
import Control.Monad.State

add a b = do
  as <- get
  put (a:as)
  return (a+b)

prog = do
  a <- add 5 6
  b <- add a 7
  add b 8

*Main> runState prog []
(26, [18,11,5])    -- (result, side-effect)

Syntactic sugar, expands to a pure program with explicit state passing
Monads can also be used to model e.g. IO, exceptions, non-determinism, etc.
Monad combinators
Haskell has a general and well-understood combinator library for monadic programs
*Main> runState (mapM (add 2) [11..13]) []
([13,14,15],[2,2,2])

*Main> runState (mapM (add 2 >=> add 4) [11..13]) []
([17,18,19],[4,2,4,2,4,2])
Example: Parallel prefix
Given inputs x1, x2, …, xn, compute
y1 = x1
y2 = x1 ∘ x2
…
yn = x1 ∘ x2 ∘ … ∘ xn
for ∘ an associative (but not necessarily commutative) operator
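As a reference point, the specification can be written directly in Haskell (a sketch for checking behavior, not circuit code):

-- List-level specification of parallel prefix: exactly scanl1.
prefix :: (a -> a -> a) -> [a] -> [a]
prefix op = scanl1 op

The operator examples on the next slide follow directly from this definition.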
Parallel prefix
Very central component in microprocessors
Most common use: computing carries in fast adders
Trying different operators:
Addition: prefix (+) [1,2,3,4] = [1, 1+2, 1+2+3, 1+2+3+4] = [1,3,6,10]
Boolean OR: prefix (||) [F,F,F,T,F,T,T,F] = [F,F,F,T,T,T,T,T]
Parallel prefix
Implementation choices (relying on associativity):
prefix (∘) [x1,x2,x3,x4] = [y1,y2,y3,y4]
Serial: y4 = ((x1 ∘ x2) ∘ x3) ∘ x4
Parallel: y4 = (x1 ∘ x2) ∘ (x3 ∘ x4)
Sharing: y4 = y3 ∘ x4
There are many of them...
Sklansky
Brent-Kung
Ladner-Fischer
Parallel prefix: Sklansky
sklansky op [a] = return [a]
sklansky op as  = do
  let k = length as `div` 2
      (ls,rs) = splitAt k as
  ls'  <- sklansky op ls
  rs'  <- sklansky op rs
  rs'' <- sequence [op (last ls', r) | r <- rs']
  return (ls' ++ rs'')
Simplest approach (divide-and-conquer)
Purely structural (no geometry)
Could have been (monadic) Lava
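Since the code is monad-polymorphic, one way to check it against the specification is to instantiate it with the Identity monad (a hypothetical test harness, not part of the talk's code):

import Data.Functor.Identity (runIdentity)

-- Run sklansky with addition as the operator and no circuit effects.
testSklansky :: [Int]
testSklansky = runIdentity (sklansky plus [1,2,3,4])
  where
    plus (a, b) = return (a + b)
-- testSklansky == [1,3,6,10] == scanl1 (+) [1,2,3,4]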
Refinement: Add placement
sklansky op [a] = space cellWidth [a]
sklansky op as  = downwards 1 $ do
  let k = length as `div` 2
      (ls,rs) = splitAt k as
  (ls',rs') <- rightwards 0 $ liftM2 (,) (sklansky op ls) (sklansky op rs)
  rs''      <- rightwards 0 $ sequence [op (last ls', r) | r <- rs']
  return (ls' ++ rs'')
Sklansky with placement
Simple PostScript output allows interactive development of placement
Refinement: Add routing guides
bus = rightwards 0 . mapM bus1
  where bus1 = space 2750 >=> guide 3 500 >=> space 1250

sklanskyIO op = downwards 0 $
      inputList 16 "in"
  >>= bus
  >>= space 1000
  >>= sklansky op
  >>= space 1000
  >>= bus
  >>= output "out"
Reusing standard (monadic) Haskell combinators (nothing Wired-specific)
Sklansky with guides
Refinement: More guides
sklansky op [a] = space cellWidthD [a]
sklansky op as  = downwards 1 $ do
  bus as
  let k = length as `div` 2
      (ls,rs) = splitAt k as
  (ls',rs') <- rightwards 0 $ liftM2 (,) (sklansky op ls) (sklansky op rs)
  rs''      <- rightwards 0 $ sequence [op (last ls', r) | r <- rs']
  bus (ls' ++ rs'')
Sklansky with guides
Experiment: Compaction
Change the base case:
sklansky op [a] = space cellWidthD [a]
→
sklansky op [a] = return [a]
Buses were compacted separately
Export to CAD tool (Cadence SoC Encounter)
Auto-routed in Encounter
Odd rows flipped to share power rails
Simple change in recursive call:
sklansky (flipY.op) ls
Exchanged using the DEF file format
Fast, low-power prefix networks
Mary Sheeran has developed circuit generators in Lava that search for fast, low-power parallel prefix networks
Initially, crude performance models:
Delay: logical depth
Power: number of operators
Still good results
Now using Wired to improve accuracy: static timing/power analysis using models from the cell library
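The two crude models amount to trivial cost functions; a sketch with an assumed network representation (the Node type and the bestBy helper below are hypothetical, not the talk's code):

import Data.List (minimumBy)
import Data.Ord (comparing)

-- Assumed representation: a network is a list of operator nodes,
-- each tagged with its logic depth.
data Node = Node { nodeDepth :: Int }
type Network = [Node]

crudeDelay :: Network -> Int          -- delay = logical depth
crudeDelay = maximum . (0 :) . map nodeDepth

crudePower :: Network -> Int          -- power = number of operators
crudePower = length

-- Keep the candidate that is cheapest under a pluggable cost function;
-- with Wired, the cost can instead be a static timing/power analysis
-- of the placed network.
bestBy :: (net -> Double) -> [net] -> net
bestBy cost = minimumBy (comparing cost)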
Minimal change to search algorithm
prefix f p = memo pm
  where
    pm ([],w)  = perhaps id' ([],w)
    pm ([i],w) = perhaps id' ([i],w)
    pm (is,w) | 2^(maxd (is,w)) < length is = Fail
    pm (is,w) = (bestOn is f . dropFail)
        [ wrpC ds (prefix f p) (prefix p p) | ds <- igen ... ]
      where
        wrpC ds p1 p2 = wrp ds (perhaps id' c) (p1 c1) (p2 c2)
...
Plug in cost functions that analyze the placed network through Wired
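Stripped of memoization and the real constraints, the shape of the search can be sketched like this (a hypothetical simplification, not the actual algorithm): enumerate recursive decompositions and keep the cheapest under the plugged-in cost function.

import Data.List (minimumBy)
import Data.Ord (comparing)

-- A network over n inputs is a leaf or a split into two subnetworks.
data Net = Leaf | Split Int Net Net
  deriving Show

-- Enumerate candidate decompositions (the real generator memoizes
-- subproblems and prunes with depth/fanout constraints).
candidates :: Int -> [Net]
candidates n
  | n <= 1    = [Leaf]
  | otherwise = [ Split k l r
                | k <- [1 .. n-1]
                , l <- candidates k
                , r <- candidates (n-k) ]

-- Keep the best candidate under a pluggable cost function.
search :: (Net -> Double) -> Int -> Net
search cost n = minimumBy (comparing cost) (candidates n)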
85 bits, depth 8
Design exploration
85 inputs, depth 8, varying allowed fanout
At 128 bits, minimum depth is slower than going one deeper (crude delay model fails)
Accurate model consistent with timing report from Encounter
Fanout   Delay [ns]   Power [mW]
7        0.646        15.2
8        0.628        15.7
9        0.624        15.9
10       0.620        16.1
Binary multiplication
44 * 11:

101100 * 001011
          101100
         101100
        000000
       101100
      000000
   + 000000
    000111100100   (= 484)

The shifted rows are the “partial products”
1) Generate the partial products (PPs)
2) Sum the partial products:
   a) Sum until two terms left
   b) Add the two remaining terms
Not in this talk
Column compression multipliers
101100 * 001011
          101100
         101100
        000000
       101100
      000000
   + 000000
Use full adders to compress the bits in each column until only two bits remain
Each full adder produces a carry which is forwarded to the next column
Different strategies for the order in which to process the bits yield very different characteristics (e.g. linear vs. logarithmic depth)
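The compression step can be sketched in plain Haskell (a sketch of the idea, not the talk's Wired code; the column-of-bits representation is an assumption):

type Bit = Bool

-- Full adder: three bits in, (sum, carry) out.
fullAdd :: Bit -> Bit -> Bit -> (Bit, Bit)
fullAdd a b c = ((a /= b) /= c, (a && b) || (a && c) || (b && c))

-- Compress one column: while at least three bits remain, a full adder
-- replaces them by one sum bit here plus one carry for the next column.
compressColumn :: [Bit] -> ([Bit], [Bit])
compressColumn (a:b:c:rest) = (s : col, cy : cys)
  where
    (s, cy)    = fullAdd a b c
    (col, cys) = compressColumn rest
compressColumn bits = (bits, [])

-- One pass over the columns (least significant first), forwarding
-- carries into the next column.
compressOnce :: [[Bit]] -> [[Bit]]
compressOnce = go []
  where
    go cys []
      | null cys  = []
      | otherwise = [cys]
    go cys (col : cols) = col' : go cys' cols
      where (col', cys') = compressColumn (cys ++ col)

-- Iterate until at most two bits remain per column; the two remaining
-- rows are then summed by a fast (e.g. prefix-based) adder.
compress :: [[Bit]] -> [[Bit]]
compress cols
  | all ((<= 2) . length) cols = cols
  | otherwise                  = compress (compressOnce cols)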
High-performance multiplier (HPM)
Multiplier reduction tree with logarithmic logic depth and regular connectivity (Eriksson, Sheeran, et al., ISCAS '06).
Simple scheme:
Process PP signals first
Process full-adder output bits “as late as possible”
Prioritize carry bits
Purely structural version (≈ Lava)
Show code...
Refinement 1
Refinement 2
Refinement 3
Rectangular transform
Using reduction tree in real design
By Kasyab, Ph.D. student in Computer Engineering
Summary
Wire-aware hardware design methods needed
Wired offers flexible hardware design at low levels of abstraction
Sklansky:
At Intel: 1000 lines of scripting code (Perl)
In Wired: <50 lines (though fewer details)
Layout-/wire-aware design exploration
Get Wired
Install the Haskell Platform (to get the Cabal tool): http://hackage.haskell.org/platform/
Install Wired:
> cabal install Wired
Manual download: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Wired