Iterative Layering: Optimizing Arithmetic Circuits by Structuring the Information Flow
Ajay K. Verma1, Philip Brisk2, Paolo Ienne1
International Conference on Computer-Aided Design
November 5, 2009
1 Processor Architecture Laboratory, School of Computer and Communication Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL)
2 Department of Computer Science and Engineering, Bourns College of Engineering, University of California, Riverside
Logic Optimization Strategies
[Figure: Ripple-Carry Adder and Carry-Lookahead Adder]
• Logic synthesis tools
  – Local optimization via Boolean minimization
• Architectural transformation
  – Not with “traditional” logic synthesis
Leading Zero Detector
[Figure: Naïve Implementation and Optimized Implementation [Oklobdzija, TVLSI 1994]]
Optimized implementation: 16% faster, 8% smaller
Outline
• Decomposition Techniques
• Progressive Decomposition and its Shortcomings
  – [Verma et al., DAC 2007]
• Iterative Layering Algorithm
• Experimental Results
• Conclusion
Decomposition
• Optimize the red block locally
• Recursively decompose the remaining circuit
• Input condensation
  – At each step, fewer input bits remain
  – Imposes hierarchy on the circuit
• The result is a well-structured hierarchical circuit
[Figure: Disjoint Decomposition and Non-disjoint Decomposition]
Disjoint Decomposition Example: 8:4 Parallel Counter
[Figure labels: s, c (Full Adder)]
4x4-bit Multiplier
[Figure: X and Y (4 bits each) feed the partial product generator (PPG); 16 partial-product bits feed the reduction tree (Σ)]
Partial products:
y0x0 y1x0 y2x0 y3x0
y0x1 y1x1 y2x1 y3x1
y0x2 y1x2 y2x2 y3x2
y0x3 y1x3 y2x3 y3x3
• The partial product reduction tree has a disjoint decomposition
• The partial product generator requires a non-disjoint decomposition
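As a rough illustration of the two stages named on this slide, the Python sketch below generates the 16 partial-product bits of a 4x4-bit multiplier and then adds them with their weights. The function names are hypothetical, and the plain summation only stands in for a real reduction tree.

```python
def partial_products(x, y, n=4):
    """Return {(i, j): bit} with bit = y_i AND x_j, as in the PPG matrix."""
    return {(i, j): ((y >> i) & 1) & ((x >> j) & 1)
            for i in range(n) for j in range(n)}

def reduce_products(pp):
    """Sum all partial-product bits, each weighted by 2**(i + j)."""
    return sum(bit << (i + j) for (i, j), bit in pp.items())

pp = partial_products(13, 11)
product_value = reduce_products(pp)
```

Each input bit feeds four different partial products, which is why the generator's inputs cannot be split into disjoint groups.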
Compound Circuits
[Figure: two implementations of a compound circuit from g72x. Labels: M1, M2, E1, E2, s1, s2, sign, neg, not, and, xor, out; bus widths 48, 19, 19, 4]
12% faster, 55% larger
Progressive Decomposition [Verma et al., DAC 2007]
• Successfully structured some arithmetic circuits
  – Ripple-carry adder → Inferred parallel prefix adder
  – 3-input ripple-carry adder → Inferred carry-save adder
  – Leading zero detector → Inferred design of [Oklobdzija 1994]
  – Various counters, majority functions, etc. → Inferred carry-free structures based on carry-save addition
• Disjoint decomposition
  – Forget about multipliers
  – Cannot handle compound arithmetic circuits
• Entire algorithm based on Reed-Muller Form
  – Rewrite ‘your’ optimizer, e.g., if you use AIGs or BDDs
  – Exponential size for leading one detector
    • Leading zero detector remains polynomial
Iterative Layering
• Non-disjoint decomposition
  – Yields disjoint decompositions when appropriate
• Not tied to any specific circuit representation
  – Our implementation uses BDDs
• SAT-based functional dependence test [Lee et al., ICCAD 2007]
  – Requires efficient conversion to CNF
  – Functional dependence is inherent to any decomposition
Iterative Layering Outline
• Bricks
  – Definition and algorithmic overview
  – Evaluation metrics
• Brick Enumeration
  – Cofactor enumeration
  – Generate bricks from cofactors
• Brick Selection
  – Problem formulation related to Set Cover
Bricks
• A subcircuit with < k inputs and one output
  – Any functional dependence may exist between a brick and the original expression
  – Kernels and co-kernels are bricks
    • The dependence is disjunctive by definition
Example:
E = ac + ad + bc + bd (7 gates)
Brick: p = a + b (1 gate)
E = pc + pd (4 gates)
E = p(c + d) (3 gates)
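The rewriting in this example can be checked exhaustively. The short Python sketch below compares the flat sum-of-products form with the form built on the brick p = a + b over all 16 input assignments; the function names are illustrative only.

```python
from itertools import product

def E_flat(a, b, c, d):
    # E = ac + ad + bc + bd: four ANDs and three ORs (7 gates)
    return (a & c) | (a & d) | (b & c) | (b & d)

def E_brick(a, b, c, d):
    p = a | b              # the brick (1 gate)
    return p & (c | d)     # E = p(c + d): 3 gates in total

equivalent = all(E_flat(*v) == E_brick(*v)
                 for v in product((0, 1), repeat=4))
```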
Iterative Layering Algorithm
• Enumerate all bricks having < k inputs
  – k = 6 in our implementation
• Evaluate all bricks based on a merit function
• Select a subset of bricks
  – The subset must contain all of the information about the circuit
  – The subset should be optimal w.r.t. some optimization criteria
• The selected bricks form a “layer”
• Stack layers on top of one another to structure the circuit
Information Fitness
• Estimated gate reduction
  – Size of BDD of input expression [Macii et al., GLS-VLSI 1999]
[Figure: f re-expressed as g with brick p]
$\text{Info. Fitness} = \dfrac{\text{Size}(\text{BDD}_f)}{\text{Size}(\text{BDD}_g) + \text{Size}(\text{BDD}_p)}$
Information Coverage
E – expression to optimize
p – brick under consideration
$D = \text{on-set}(E) \times \text{off-set}(E)$
$N = \{(x, y) \in D \mid p(x) \neq p(y)\}$
$\text{Info. Coverage} = \dfrac{|N|}{|D|}$
Intuition:
• Attempt to quantify the functional dependency from p to E
Limitation:
• Requires completely specified truth table
  – Size is exponential in the number of inputs
Our Approach:
• Randomly sample the truth table of E
• Theorem 1 in the paper includes some probabilistic justification
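A minimal sketch of the coverage metric, assuming that N is the set of (on-set, off-set) pairs on which the brick p takes different values. With `samples=None` the full truth table is used. Otherwise a random subset of rows is drawn, which is only a simplified stand-in for the sampling scheme justified by Theorem 1 in the paper. The function name and parameters are hypothetical.

```python
from itertools import product
import random

def info_coverage(E, p, n, samples=None, seed=0):
    """Fraction of (on-set, off-set) pairs of E that brick p distinguishes."""
    points = list(product((0, 1), repeat=n))
    if samples is not None:
        # Sampled estimate: keep only a random subset of truth-table rows.
        points = random.Random(seed).sample(points, samples)
    on = [x for x in points if E(x)]
    off = [y for y in points if not E(y)]
    if not on or not off:
        return 0.0
    distinguished = sum(1 for x in on for y in off if p(x) != p(y))
    return distinguished / (len(on) * len(off))

# Example: E = (a + b)(c + d) with brick p = a + b
E = lambda v: (v[0] | v[1]) & (v[2] | v[3])
p = lambda v: v[0] | v[1]
exact = info_coverage(E, p, 4)
estimate = info_coverage(E, p, 4, samples=12)
```

For this example the on-set has 9 points and the off-set has 7, and p separates 36 of the 63 pairs, so the exact coverage is 36/63.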
Brute Force Cofactor Enumeration
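Enumerate every combination of k input bits
$E = ab \oplus cd \oplus (a \oplus b)(c \oplus d)$, B = {a, b, c}, R = {d}
Enumerate the set of cofactors with respect to R
$S = \{E_{\bar{d}}, E_d\} = \{ab \oplus bc \oplus ac,\ ab \oplus bc \oplus ac \oplus a \oplus b \oplus c\}$
Problem: $|S| = 2^{|R|}$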
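The brute-force enumeration can be sketched in a few lines of Python. The sketch assumes that the operators in the example expression are XORs, which is consistent with the two cofactors listed: setting d = 0 leaves ab ⊕ bc ⊕ ac, and setting d = 1 adds a ⊕ b ⊕ c. The helper name `cofactors` is hypothetical.

```python
from itertools import product

def E(a, b, c, d):
    # Assumed reading of the slide: E = ab XOR cd XOR (a XOR b)(c XOR d)
    return (a & b) ^ (c & d) ^ ((a ^ b) & (c ^ d))

def cofactors(f, n_b, n_r):
    """Map each assignment of R to the truth table of f over B."""
    table = {}
    for r in product((0, 1), repeat=n_r):
        table[r] = tuple(f(*b, *r) for b in product((0, 1), repeat=n_b))
    return table

S = cofactors(E, n_b=3, n_r=1)   # B = {a, b, c}, R = {d}
majority = tuple(int(a + b + c >= 2)
                 for a, b, c in product((0, 1), repeat=3))
```

The loop visits all 2^|R| assignments of R, which is the growth the slide flags as the problem.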
Cofactor Enumeration via Sampling and SAT-based Functional Dependence Testing
1. Generate an initial set of cofactors using random sampling
2. Test if E depends on the cofactors and any remaining variables [Lee et al., ICCAD 2007]
  • SAT = FALSE implies a full dependence
  • SAT = TRUE implies a partial dependence
    – Satisfying assignment of input variables yields one missing cofactor
3. Repeat Step 2 until SAT = FALSE
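The three steps can be sketched as a counterexample-guided loop. The slide does not give the encoding of the dependence test from [Lee et al., ICCAD 2007], so the Python sketch below makes two loud assumptions: an exhaustive search stands in for the SAT solver, and "E depends on the cofactors and the remaining variables" is read as "no two assignments of B agree on every known cofactor yet make E differ for some assignment of R". All function names are hypothetical.

```python
from itertools import product

def cofactor(E, r, n_b):
    """Truth table over B of E with the R variables fixed to r."""
    return tuple(E(b + r) for b in product((0, 1), repeat=n_b))

def missing_assignment(E, known, n_b, n_r):
    """Exhaustive stand-in for the SAT query (assumption, not a real solver).

    Looks for b1, b2, r such that every known cofactor agrees on b1 and b2
    but E(b1, r) != E(b2, r). Returns r, or None if no such triple exists.
    """
    B = list(product((0, 1), repeat=n_b))
    for i1 in range(len(B)):
        for i2 in range(i1 + 1, len(B)):
            if any(c[i1] != c[i2] for c in known):
                continue
            for r in product((0, 1), repeat=n_r):
                if E(B[i1] + r) != E(B[i2] + r):
                    return r
    return None

def enumerate_cofactors(E, n_b, n_r, seeds):
    known = {cofactor(E, r, n_b) for r in seeds}      # step 1: initial set
    while True:
        r = missing_assignment(E, known, n_b, n_r)    # step 2: dependence test
        if r is None:                                 # "SAT = FALSE": done
            return known
        known.add(cofactor(E, r, n_b))                # add one missing cofactor

# Example expression with B = (a, b, c) and R = (d), operators assumed XOR
E = lambda v: (v[0] & v[1]) ^ (v[2] & v[3]) ^ ((v[0] ^ v[1]) & (v[2] ^ v[3]))
found = enumerate_cofactors(E, n_b=3, n_r=1, seeds=[(0,)])
```

Each iteration adds a cofactor that separates a pair no earlier cofactor separated, so the loop terminates. The resulting set is enough to determine E but need not contain every distinct cofactor.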
Brick Computation: Summary
For every combination of at most k input bits
• Generate the cofactors of the remaining bits
  – Random sampling + SAT-based functional dependence testing
• Discard useless cofactors
  – Details are in the paper
• Recursively apply iterative layering with a smaller value of k to generate the bricks from the cofactors
That’s a lot of bricks!
• Which bricks do I really need?
Brick Selection: Overview
Goal: Find a minimal set of bricks that covers all points in on-set(E) × off-set(E)
• Greedy heuristic based on [Johnson, JCSS 1974]
  – Select a brick that maximizes Info.Fitness × Info.Coverage
  – Update Info.Fitness and Info.Coverage for the remaining bricks
  – Stop when E is functionally dependent on the chosen bricks [Lee et al., ICCAD 2007]
• See the paper for details on the data structures used
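The greedy loop above can be sketched as a weighted set cover over (on-set, off-set) pairs. This is a simplified illustration, not the paper's data structures: the fitness values are supplied by the caller and held fixed, coverage is recomputed on the pairs still uncovered, and an exhaustive check replaces the SAT-based dependence test. The function name and the example bricks are hypothetical.

```python
from itertools import product

def greedy_select(E, bricks, fitness, n):
    """Pick bricks until every (on, off) pair of E is separated by some brick."""
    points = list(product((0, 1), repeat=n))
    on = [x for x in points if E(x)]
    off = [y for y in points if not E(y)]
    uncovered = {(x, y) for x in on for y in off}
    total = len(uncovered)
    chosen = []
    while uncovered:
        def merit(name):
            p = bricks[name]
            newly = sum(1 for x, y in uncovered if p(x) != p(y))
            return fitness[name] * newly / total      # fitness x coverage
        candidates = [b for b in bricks if b not in chosen]
        best = max(candidates, key=merit, default=None)
        if best is None or merit(best) == 0:
            raise ValueError("the given bricks do not determine E")
        chosen.append(best)
        p = bricks[best]
        # Keep only the pairs that the new brick fails to separate.
        uncovered = {(x, y) for x, y in uncovered if p(x) == p(y)}
    return chosen

E = lambda v: (v[0] | v[1]) & (v[2] | v[3])
bricks = {
    "p": lambda v: v[0] | v[1],
    "q": lambda v: v[2] | v[3],
    "r": lambda v: v[0] & v[2],
}
fitness = {"p": 1.0, "q": 1.0, "r": 1.0}   # placeholder values
layer = greedy_select(E, bricks, fitness, 4)
```

An empty set of uncovered pairs means no on-set point and off-set point agree on all chosen bricks, so E is a function of those bricks alone.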
Experimental Setup
[Figure: four flows compared: Circuit written by hand; Known Arithmetic Circuits; Progressive Decomposition [Verma et al., DAC 2007]; Iterative Layering]
• Synopsys Design Compiler
  – compile_ultra
  – minimize delay
• Artisan Standard Cells, UMC (90 nm)
Critical Path Delay
[Figure: bar chart of critical path delay in ns, axis from 0 to 1.4.
Arithmetic Components: 16-bit ADD, 12-bit 3ADD, 8x8-bit MUL, 9x9-bit MUL, 10x10-bit MUL, MAX84, 16-bit SHIFT.
MCNC Circuits: 9symml, sqrt8, rd84, rd73, t481.
Compound Arithmetic Circuits: adpcm encoder, G.721 (fmult), SAD.
Series: Original, Progressive Decomposition, Iterative Layering, Library/Manual Implementation.
Annotations: "Optimized for Area, Not Delay"; "Progressive Decomposition Fails".]
Area
[Figure: bar chart of area in μm², axis from 0 to 14000.
Arithmetic Components: 16-bit ADD, 12-bit 3ADD, 8x8-bit MUL, 9x9-bit MUL, 10x10-bit MUL, MAX84, 16-bit SHIFT.
MCNC Circuits: 9symml, sqrt8, rd84, rd73, t481.
Compound Arithmetic Circuits: adpcm encoder, G.721 (fmult), SAD.
Series: Original, Progressive Decomposition, Iterative Layering, Library/Manual Implementation.
Annotations: "Optimized for Area, Not Delay"; "Progressive Decomposition Fails".]
n-bit, k-input MAX Function
Pairwise Comparison of Inputs
½k(k − 1) comparators
Delay: O(log n + log k)
Area: O(k²n)
0.21 ns, 3479 μm²
Iterative Layering
0.22 ns, 1331 μm²
Binary Tree of Comparators
k − 1 comparators
Delay: O(log n · log k)
Area: O(kn)
0.46 ns, 1755 μm²
(Circuit structure was unknown to us)
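The comparator counts of the two reference designs can be illustrated with a small software model. The Python sketch below is not a circuit description: it only counts comparisons, and the function names are hypothetical.

```python
from itertools import combinations

def max_pairwise(values):
    """All-pairs comparison: the maximum is the input that wins k - 1 times."""
    k = len(values)
    wins = [0] * k
    comparators = 0
    for i, j in combinations(range(k), 2):
        comparators += 1
        if values[i] >= values[j]:
            wins[i] += 1
        else:
            wins[j] += 1
    winner = wins.index(k - 1)
    return values[winner], comparators

def max_tree(values):
    """Binary tree of comparators: pairs are reduced level by level."""
    level = list(values)
    comparators = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            comparators += 1
            nxt.append(max(level[i], level[i + 1]))
        if len(level) % 2:
            nxt.append(level[-1])   # odd element passes through unchanged
        level = nxt
    return level[0], comparators

pairwise_result = max_pairwise([22, 59, 62, 61])
tree_result = max_tree([22, 59, 62, 61])
```

For k = 4 the pairwise version uses 6 comparisons, all of which can run in parallel, while the tree uses 3 comparisons arranged in 2 sequential levels.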
8-bit, 4-input MAX Example
[Figure labels: Integer Domination Table, Count Leading 1’s]
Inputs: 0010110 (22), 0111011 (59), 0111110 (62), 0111101 (61)
Replace any all-zero column with ones!
1010110, 1111011, 1111110, 1111101
Count leading 1’s: 001 (1), 100 (4), 110 (6), 101 (5)
Next stage inputs: 001 (1), 100 (4), 110 (6), 101 (5)
Count leading 1’s: 00 (0), 01 (1), 10 (2), 01 (1)
Next stage inputs: 00 (0), 01 (1), 10 (2), 01 (1)
Final stage: 0, 0, 1, 0 (MAX)
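The intermediate values in this example can be reproduced with two small helpers: one that replaces all-zero columns with ones and one that counts leading 1's. The Python sketch below only replays the numbers shown on the slide. It is not claimed to be the complete MAX circuit found by Iterative Layering, and the helper names are hypothetical.

```python
def fill_zero_columns(words, width):
    """Set to 1, in every word, each bit position that is 0 in all words."""
    seen = 0
    for w in words:
        seen |= w
    missing = ~seen & ((1 << width) - 1)
    return [w | missing for w in words]

def leading_ones(word, width):
    """Number of consecutive 1 bits starting at the most significant bit."""
    count = 0
    for bit in range(width - 1, -1, -1):
        if (word >> bit) & 1:
            count += 1
        else:
            break
    return count

def stage(words, width):
    filled = fill_zero_columns(words, width)
    return [leading_ones(w, width) for w in filled]

s1 = stage([22, 59, 62, 61], 7)   # 7-bit patterns as printed on the slide
s2 = stage(s1, 3)
s3 = stage(s2, 2)
```

In this example the only nonzero entry of the last stage sits at the position of 62, the largest input.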
Conclusion
• Iterative Layering structures arithmetic circuits
  – Automatically infers well-known manual designs from the arithmetic literature
  – Fixes shortcomings of Progressive Decomposition
    • Non-disjoint decomposition
    • Usable with any circuit representation
[Figure: circuit classes covered by Progressive Decomposition (PD) and Iterative Layering (IL): ADD, 3-ADD, LZD, MUL, SHFT, MAX, Compound Arithmetic Circuits]