Dejan Markovic, Electrical Engineering Department
University of California, Los Angeles
FPGA-based ASIC Design and Verification
Cisco Green Research Symposium, 5 March 2008
The Issues I am Going to Address
Power efficiency = energy efficiency ~ C·V²
Design complexity
Design re-entry
– Algorithm (Matlab or C)
– Fixed point description
– RTL (behavioral, structural)
– Test vectors for logic analysis
In this talk, I will demonstrate
– Power efficiency of 2.1 GOPS/mW (90nm CMOS)
– 70 GOPS in 3.5 mm²
– FPGA-based design and verification
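The C·V² relation can be made concrete with a small sketch. It assumes the standard first-order switching-energy model E = C·V² per operation; the capacitance and supply voltages below are illustrative values, not figures from the talk.

```python
def switching_energy(c_farads, vdd_volts):
    """First-order switching energy per operation: E = C * V^2 (joules)."""
    return c_farads * vdd_volts ** 2

# Hypothetical numbers for illustration only.
C_EFF = 1e-12  # 1 pF of switched capacitance (assumed)
e_nominal = switching_energy(C_EFF, 1.0)
e_scaled = switching_energy(C_EFF, 0.5)

# Halving VDD cuts the energy per operation to one quarter.
energy_ratio = e_scaled / e_nominal
print(f"E(0.5 V) / E(1.0 V) = {energy_ratio:.2f}")
```

The quadratic dependence is why supply-voltage scaling is the strongest circuit-level knob for energy, at the cost of longer delay.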
Optimization Approach
Power efficiency: circuit-level (C, V)
Performance and area: architectural techniques
Unified Simulink description
[Figure: flow through optimization, hardware design and I/O verification. Energy-delay (E-D) and energy-area (E-A) tradeoff curves at the circuit, micro-architecture and macro-architecture levels; Simulink model of the SVD processing element (PE U-Sigma) with word-size annotations; ASIC and FPGA with GPIO]
Automated environment for optimal hardware design and verification
Circuit-Level Optimization Framework
Goal: find optimal E-D tradeoff for a datapath
Reference design
– Dmin sizing @ Vddmax, Vthref
Sensitivity-based optimization
– Balance sensitivity to all variables
– Variables: gate size, VDD, VTH
[Figure: energy vs. delay curves for topology A and topology B under constraints; curves f(A, B0) and f(A0, B) through the point (A0, B0) at delay D0, with sensitivities SA and SB]
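The balancing idea can be sketched numerically. The example below assumes a toy two-variable model (the functions `energy` and `delay` are hypothetical, with A and B standing for two tuning variables such as gate sizes) and takes the sensitivity of a variable to be the ratio of its energy derivative to its delay derivative. It illustrates the balancing condition only; it is not the optimizer used in the cited work.

```python
def energy(a, b):
    """Toy energy model (assumed): grows linearly with each variable."""
    return a + b

def delay(a, b):
    """Toy delay model (assumed): shrinks as each variable grows."""
    return 1.0 / a + 1.0 / b

def sensitivity(var, a, b, h=1e-6):
    """Energy-delay sensitivity (dE/dx) / (dD/dx) by central differences."""
    if var == "a":
        d_e = energy(a + h, b) - energy(a - h, b)
        d_d = delay(a + h, b) - delay(a - h, b)
    else:
        d_e = energy(a, b + h) - energy(a, b - h)
        d_d = delay(a, b + h) - delay(a, b - h)
    return d_e / d_d

# Unbalanced point: the two sensitivities differ, so energy is being wasted.
s_a = sensitivity("a", 1.0, 3.0)
s_b = sensitivity("b", 1.0, 3.0)

# Balanced point with the same total delay (1/1.5 + 1/1.5 = 1/1 + 1/3).
s_a_bal = sensitivity("a", 1.5, 1.5)
s_b_bal = sensitivity("b", 1.5, 1.5)

print(f"unbalanced: S_A={s_a:.2f}, S_B={s_b:.2f}, E={energy(1.0, 3.0):.1f}")
print(f"balanced:   S_A={s_a_bal:.2f}, S_B={s_b_bal:.2f}, E={energy(1.5, 1.5):.1f}")
```

In this toy model both points have the same delay, but the point with equal sensitivities uses less energy, which is the condition the optimization drives toward.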
Circuit-Level Results: Tree Adder
[Figure: E/Eref vs. D/Dref (both axes 0 to 1.5) for a tree adder with inputs (A0, B0) to (A15, B15), carry-in Cin and sums S0 to S15, alongside an energy map labeled "Optimization". Sensitivity annotations: (SW=22, SVth=22, SVdd=16), (SW=1, SVth=1, SVdd=1) and (SW→∞, SVth=0.2, SVdd=1.5); reference point "ref"; 65% annotation]
E-D space is the key for architecture optimization
[D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz, R.W. Brodersen, JSSC Aug'04]
Scaling Impacts Architecture
[Figure: Eop/Eref @ Vddmax vs. Top/Tref @ Vddmax on log-log axes (x from 0.01 to 10, y from 0.1 to 10), showing the effect of scaling. Architecture: P2 = parallel 2, T2 = time-mux 2. Process: S-Vt, L-Vt, H-Vt]
Simulink to Silicon Mapping
[Figure: flow with Simulink, fixed-point library, MDL, Custom tool 1 and ASIC backend, producing speed, power and area]
MDL to RTL conversion, automated P&R flow
[R. Davis et al., JSSC Mar'02]
Including FPGA Emulation
[Figure: flow with Simulink, hardware library, RTL, Custom tool 2, ASIC backend and FPGA backend, producing speed, power and area]
XSG hardware library, RTL translation scripts
ASIC and FPGA are I/O equivalent
[K. Kuusilinna et al., book chap. in SoC Revolution, KAP 2003]
Closing the Loop: I/O Verification
[Figure: flow with Simulink, hardware library, I/O library, RTL, Custom tool 2, ASIC backend, Custom tool 3 and FPGA backend, producing speed, power and area]
I/O hardware library, automated FPGA flow
FPGA implements ASIC logic analysis
[D. Markovic, C. Chang, B. Richards, H. So, B. Nikolic, R.W. Brodersen, CICC'07]
Design Approach
Unified Simulink design environment
– Enter design once!
– Algorithm verification
– Macro-architecture
– FPGA based ASIC debug
Hardware-equivalent Simulink blocks
– Add, mult, shift, mux…
  ● Word-size, latency
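A hardware-equivalent block is characterized by its word size and latency. The sketch below models a fixed-point adder in that spirit; the class name `FixedPointAdd`, the saturating two's-complement behavior and the (word bits, fraction bits) parameterization are assumptions for illustration, not the behavior of any particular block library.

```python
from collections import deque

def quantize(value, word_bits, frac_bits):
    """Round to a signed fixed-point grid and saturate at the word limits."""
    scale = 1 << frac_bits
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    code = round(value * scale)        # integer code for the value
    code = max(lo, min(hi, code))      # saturate on overflow
    return code / scale

class FixedPointAdd:
    """Adder with a fixed output word size and a pipeline latency in cycles."""

    def __init__(self, word_bits, frac_bits, latency):
        self.word_bits = word_bits
        self.frac_bits = frac_bits
        # One register per cycle of latency, initialized to zero.
        self.pipe = deque([0.0] * latency)

    def step(self, a, b):
        """Advance one clock cycle and return the delayed, quantized sum."""
        result = quantize(a + b, self.word_bits, self.frac_bits)
        if not self.pipe:              # latency 0: purely combinational
            return result
        self.pipe.append(result)
        return self.pipe.popleft()

add = FixedPointAdd(word_bits=8, frac_bits=5, latency=2)
stimulus = [(1.0, 0.5), (0.3, 0.3), (3.0, 3.0), (0.0, 0.0)]
outputs = [add.step(a, b) for a, b in stimulus]
print(outputs)
```

Because the model is cycle-accurate and bit-accurate under these assumptions, the same stimulus can in principle be replayed against an RTL or FPGA implementation and compared sample by sample.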
Block Characterization
[Figure: block energy vs. latency and cycle time for mult and add blocks, with gate sizing and VDD scaling moving the operating point from TClk @ VDDref to TClk @ VDDopt]
Library blocks / macros synthesized @ VDDref: speed, power, area
Pipeline logic scaling
FO4 inv simulation
Goal: balanced logic depth and E/D sensitivity
Methodology for Architecture Selection
Energy-Area-Delay space for architecture comparison
– Time-mux, parallelism, pipelining, VDD scaling, sizing…
[Figure: energy vs. delay and area axes. Datapath: gate sizing and VDD scaling take the initial design to the optimal design. Block-level: parallel, pipeline, time-mux, interleaving and folding (intl, fold) and gate sizing take the initial design to the optimal design]
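Comparing architectures in the energy-area-delay space amounts to discarding candidates that another candidate beats or matches on every axis. The sketch below filters a list of candidates down to the non-dominated set; the candidate names and numbers are invented for illustration and are not results from the talk.

```python
def dominates(p, q):
    """True if p is no worse than q on every axis and better on at least one."""
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))

def pareto_front(candidates):
    """Keep candidates whose (energy, area, delay) no other candidate dominates."""
    return {
        name: point
        for name, point in candidates.items()
        if not any(
            dominates(other, point)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }

# Hypothetical (energy, area, delay) values, normalized to a reference design.
candidates = {
    "reference": (1.00, 1.00, 1.00),
    "parallel-2": (0.60, 2.10, 1.00),
    "time-mux-2": (1.10, 0.55, 1.00),
    "pipelined": (0.80, 1.10, 1.00),
    "oversized": (1.20, 1.30, 1.00),
}

front = pareto_front(candidates)
print(sorted(front))
```

Only the "oversized" entry is dropped here, since the reference design is at least as good on all three axes; the remaining candidates represent genuine tradeoffs for the designer to choose among.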
Example: 4x4 SVD Algorithm
This complexity is hard to optimize in RTL
– 270 adders, 370 multipliers, 8 sqrt, 8 div
– Recursive LMS-based algorithm (nested feedback loops)
\[ w_i(k) = w_i(k-1) + \mu_i \left[ y_i(k)\, y_i^{\dagger}(k)\, w_i(k-1) - \sigma_i^2(k-1)\, w_i(k-1) \right] \]
\[ \sigma_i^2(k) = w_i^{\dagger}(k)\, w_i(k) \]
\[ u_i(k) = w_i(k) / \sigma_i^2(k) \]
\[ y_{i+1}(k) = y_i(k) - \left[ w_i^{\dagger}(k)\, y_i(k)\, w_i(k) \right] / \sigma_i^2(k) \]
(i = 1, 2, 3, 4)
[Figure: cascade of four UΣ LMS stages for Antenna 1 to Antenna 4, with input y1(k) and a Deflation block after each of the first three stages]
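The weight update, the eigenvalue estimate and the deflation step can be sketched in plain Python for a single tracked mode. The 2-element vectors, the step size `mu` and the alternating input pattern are assumptions chosen so that the result is easy to check by hand; this is a floating-point illustration of the recursion, not the fixed-point 4x4 hardware.

```python
def inner(a, b):
    """Hermitian inner product a† b for complex vectors given as lists."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

def lms_step(w, y, mu):
    """One update w(k) = w(k-1) + mu * [y y† w(k-1) - sigma²(k-1) w(k-1)]."""
    sigma2_prev = inner(w, w).real     # sigma²(k-1) = w†(k-1) w(k-1)
    proj = inner(y, w)                 # scalar y† w(k-1)
    return [wi + mu * (yi * proj - sigma2_prev * wi) for wi, yi in zip(w, y)]

def deflate(y, w):
    """Remove the tracked component: y_next = y - (w† y) w / sigma²."""
    sigma2 = inner(w, w).real
    coeff = inner(w, y) / sigma2
    return [yi - coeff * wi for yi, wi in zip(y, w)]

# Two alternating received vectors (assumed data). Their average outer
# product is diag(2, 0.5), so the dominant eigenvalue is 2.
samples = [[2 + 0j, 0j], [0j, 1j]]
w = [1 + 0j, 1 + 0j]
for k in range(5000):
    w = lms_step(w, samples[k % 2], mu=0.005)

sigma2 = inner(w, w).real              # estimate of the dominant eigenvalue
residual = deflate(samples[0], w)      # input passed on to the next stage
print(f"sigma^2 ~ {sigma2:.2f}, |w[1]| ~ {abs(w[1]):.1e}")
```

With a fixed step size the estimate hovers slightly off the exact eigenvalue, since each sample pulls the weights a little; a smaller `mu` reduces this ripple at the cost of slower tracking.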
Energy/Area Optimization
[Figure: energy vs. delay and area; the 16b design, with interleaving (13.8x) and folding (2.6x) annotations]
Starting point: fixed architecture
Energy/Area Optimization
[Figure: energy vs. delay and area; the 16b design after word-size optimization (annotated 30% and 30%), with interleaving (13.8x) and folding (2.6x) annotations]
Step 1: Word-length optimization
Energy/Area Optimization
[Figure: energy vs. delay and area; 16b design, word-size (30% and 30%), then sizing from the initial synthesis (40% and 20%), with interleaving (13.8x) and folding (2.6x) annotations]
Step 2: Gate size & VDD optimization
Energy/Area Optimization
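[Figure: energy vs. delay and area; 16b design, word-size (30% and 30%), sizing from the initial synthesis (40% and 20%), then VDD scaling (7x) to the final design at optimal VDD and W, with interleaving (13.8x) and folding (2.6x) annotations]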
Step 2: Gate size & VDD optimization
Hardware Results (90nm ST Micro)
Result of Energy-Area-Performance Optimization
2.1 GOPS/mW
– 70 GOPS @ 100MHz
– Power = 34mW
20 GOPS/mm²
– 70 GOPS in 3.5mm²
[Figure: Comparison with ISSCC chips. Area efficiency (GOPS/mm²) vs. energy efficiency (GOPS/mW) on log-log axes, with the SVD chip plotted against ISSCC papers labeled 1998 18-6, 2000 4-2, 1999 15-5, 1998 7-6, 2004 18-5, 2000 14-8, 1998 18-3 and 2000 14-5]
Functional test was performed with FPGA
[D. Markovic, B. Nikolic, R.W. Brodersen, JSSC Apr'07]
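The two efficiency figures follow directly from the throughput, power and area quoted on this slide. A quick check, using only those numbers (the variable names are illustrative):

```python
throughput_gops = 70.0   # 70 GOPS @ 100 MHz
power_mw = 34.0          # measured power in mW
area_mm2 = 3.5           # area in mm^2
clock_hz = 100e6

energy_eff = throughput_gops / power_mw          # GOPS/mW
area_eff = throughput_gops / area_mm2            # GOPS/mm^2
ops_per_cycle = throughput_gops * 1e9 / clock_hz

print(f"{energy_eff:.2f} GOPS/mW, {area_eff:.0f} GOPS/mm^2, "
      f"{ops_per_cycle:.0f} operations per clock cycle")
```

The energy efficiency works out to about 2.06 GOPS/mW, which the slide rounds to 2.1, and sustaining 70 GOPS at 100 MHz implies roughly 700 operations per clock cycle.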
FPGA Based ASIC Verification
Goal: use Simulink testbench (TB) for ASIC verification
– Develop custom interface blocks (I/O)
– Place I/O and ASIC RTL into TB model
[Figure: TB + I/O + ASIC = TB containing the I/O and ASIC blocks]
Simulink implicitly provides the testbench
Emulation-based ASIC I/O test
Simulink I/O Test Model for the SVD
Experimental Setup
[Figure: FPGA board and ASIC board, with the GPIO link labeled]
Real-time at-speed ASIC verification
Measured Functionality
[Figure: 4x4 MIMO channel tracking. Eigenvalues σ1², σ2², σ3², σ4² (axis 0 to 12) vs. number of symbols [k] (axis 0 to 32), theoretical and hardware curves]
Up to 10 b/s/Hz with adaptive PSK
From Simulink to Optimized Hardware
Automated Architecture Generation Flow
[Figure: initial DFG (reference, direct mapping) with inputs a, b, c, d, adders, multipliers and delay elements (D, 2D), and the resulting Simulink/SynDSP architectures: Architecture 1 (folding N = 2) and Architecture 2 (folding N = 4)]
ILP Scheduling & Bellman-Ford Retiming: optimal + reduced CPU time
Direct-mapped DFG (Simulink) → Scheduler (C++ / MOSEK) → Architecture Solutions (Simulink/SynDSP) → Hardware (FPGA/ASIC)
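The retiming half of this flow can be illustrated with the classical formulation, in which retiming values r must satisfy a set of difference constraints and Bellman-Ford either finds them or reports a negative cycle. The sketch below is not the scheduler from the talk (the ILP step and the MOSEK solver are not modeled); the 3-node graph, the unit node delays and the hand-derived timing constraints are assumptions.

```python
def solve_difference_constraints(nodes, constraints):
    """Solve r[u] - r[v] <= c for all (u, v, c) with Bellman-Ford.

    Returns a dict of retiming values, or None if the system is infeasible
    (the constraint graph contains a negative cycle).
    """
    r = {n: 0 for n in nodes}          # virtual source at distance 0 to all
    for _ in range(len(nodes)):
        changed = False
        for u, v, c in constraints:
            if r[v] + c < r[u]:        # relax constraint-graph edge v -> u
                r[u] = r[v] + c
                changed = True
        if not changed:
            return r
    return None                        # still relaxing: negative cycle

def retimed_weights(edges, r):
    """Register count on each edge u -> v after retiming: w + r[v] - r[u]."""
    return {(u, v): w + r[v] - r[u] for u, v, w in edges}

# Hypothetical 3-node loop with unit node delays and two registers on c -> a.
nodes = ["a", "b", "c"]
edges = [("a", "b", 0), ("b", "c", 0), ("c", "a", 2)]

# Legality: every edge keeps a non-negative register count.
legal = [(u, v, w) for u, v, w in edges]
# Timing for clock period 2: every 3-node path needs at least one register.
# Derived by hand for this graph as r[u] - r[v] <= W(u, v) - 1.
timing_2 = [("a", "c", -1), ("b", "a", 1), ("c", "b", 1)]

r = solve_difference_constraints(nodes, legal + timing_2)
weights = retimed_weights(edges, r)

# Clock period 1 would need a register on every edge; the loop has only two.
timing_1 = [("a", "b", -1), ("b", "c", -1), ("c", "a", 1)]
infeasible = solve_difference_constraints(nodes, legal + timing_1)

print(weights, infeasible)
```

For the period-2 target the solver moves one of the two registers from edge c→a onto edge a→b, so no register-free path is longer than two nodes; for the period-1 target it returns `None`, since a loop with three unit delays and two registers cannot run that fast.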
Energy-Area-Performance Map
[Figure: surface of valid architectures over normalized energy, area and performance axes (each 0.2 to 1), bounded by constraints; the direct-mapping (reference) point, with moves labeled parallelism, time-mux, retiming and VDD scaling]
Each point on the surface is an optimal architecture automatically generated in Simulink after modified ILP scheduling and retiming
System designer can choose from feasible optimal solutions
It is not just about functionality, but how good a solution is, and how many alternatives exist
Conclusions
Simulink provides the level of abstraction needed for complete ASIC development
– Hardware emulation of algorithms
– Technology-driven architecture selection
– FPGA-based ASIC verification
  ● Logic analysis can be fully ported onto FPGA
Energy-area-delay space is a compact way to compare multiple architectural realizations
– ILP-based formulation automates architecture design
Complex algorithms in 90nm can achieve
– 2.1 GOPS/mW, 20 GOPS/mm²
References
ASIC design and verification
– D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz, and R.W. Brodersen, "Methods for True Energy-Performance Optimization," IEEE J. Solid-State Circuits, vol. 39, no. 8, pp. 1282-1293, Aug. 2004.
– D. Markovic, R.W. Brodersen, and B. Nikolic, "A 70GOPS 34mW Multi-Carrier MIMO Chip in 3.5mm²," in Proc. IEEE Int'l Symp. on VLSI Circuits (VLSI'06), June 2006, pp. 196-197.
– D. Markovic, B. Nikolic, and R.W. Brodersen, "Power and Area Minimization for Multidimensional Signal Processing," IEEE J. Solid-State Circuits, vol. 42, no. 4, pp. 922-934, April 2007.
– D. Markovic, C. Chang, B. Richards, H. So, B. Nikolic, and R.W. Brodersen, "ASIC Design and Verification in an FPGA Environment," in Proc. IEEE Custom Integrated Circuits Conf. (CICC'07), Sept. 2007, pp. 737-740.
More publications available online
– www.ee.ucla.edu/~dejan
Acknowledgments
Funding support
– C2S2 Focus Center Research Program, contract 2003-CT-888
Infrastructure support
– ST Microelectronics, Xilinx (hardware)
– Synplicity, Synopsys, Cadence (software)