
Dejan Markovic
Electrical Engineering Department

University of California, Los Angeles

FPGA-based ASIC Design and Verification

Cisco Green Research Symposium
5 March 2008

The Issues I am Going to Address

Power efficiency = energy efficiency ~ C·V²

Design complexity

Design re-entry
– Algorithm (Matlab or C)
– Fixed-point description
– RTL (behavioral, structural)
– Test vectors for logic analysis

In this talk, I will demonstrate
– Power efficiency of 2.1 GOPS/mW (90 nm CMOS)
– 70 GOPS in 3.5 mm²
– FPGA-based design and verification
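As a rough illustration of the C·V² relation, the sketch below computes switching energy at two supply voltages. The capacitance and the voltages are assumed values picked for the example; they are not figures from the talk.

```python
def switching_energy(capacitance_f: float, vdd: float) -> float:
    """Switching energy, proportional to C * V^2 (joules for farads and volts)."""
    return capacitance_f * vdd ** 2

# Hypothetical values: 1 pF of switched capacitance at 1.0 V and at 0.7 V.
e_nominal = switching_energy(1e-12, 1.0)
e_scaled = switching_energy(1e-12, 0.7)
saving = 1.0 - e_scaled / e_nominal  # fraction of energy saved by lowering VDD
print(f"energy saving at 0.7 V: {saving:.0%}")
```

Because energy is quadratic in the supply, a 30% lower VDD roughly halves the energy in this simple model, which ignores leakage and the accompanying slowdown.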

Optimization Approach

Power efficiency: circuit-level (C, V)
Performance and area: architectural techniques
Unified Simulink description

[Figure: design environment with stages optimization, hardware design, I/O verification. Labels: Circuit, Micro Arch., Macro Arch. with energy (E), area (A), delay (D) tradeoffs; Simulink blocks including PE U-Sigma with signals y [4x1], r [4x4], y [4x4], U [4x4], W [4x4], Sigma; word sizes 12,9; 10,8; 14,9; 8,5; ASIC, FPGA, GPIO.]

Automated environment for optimal hardware design and verification

Circuit-Level Optimization Framework

Goal: find optimal E-D tradeoff for a datapath

Reference design
– Dmin sizing @ Vdd^max, Vth^ref

Sensitivity-based optimization
– Balance sensitivity to all variables
– Variables: gate size, VDD, VTH

[Figure: energy vs. delay with constraints, for topology A and topology B; curves f(A0,B) and f(A,B0) through the point (A0,B0) at delay D0; sensitivities SA and SB.]
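The idea of balancing sensitivities can be sketched on a toy model. Everything in the model is an assumption made for illustration: each tuning variable x_i raises energy linearly (E = Σ c_i·x_i) and lowers delay inversely (D = Σ a_i/x_i), and sensitivity is taken as S_i = −(∂E/∂x_i)/(∂D/∂x_i). This is not the circuit model used in the cited work. Under these assumptions the minimum-energy point for a fixed delay has equal sensitivities across all variables.

```python
import math

def balanced_sizing(a, c, d_target):
    """Minimize E = sum(c_i * x_i) subject to D = sum(a_i / x_i) = d_target.

    Closed form from the Lagrange condition c_i = lambda * a_i / x_i^2.
    """
    root_lambda = sum(math.sqrt(ai * ci) for ai, ci in zip(a, c)) / d_target
    return [root_lambda * math.sqrt(ai / ci) for ai, ci in zip(a, c)]

def energy(x, c):
    return sum(ci * xi for ci, xi in zip(c, x))

def delay(x, a):
    return sum(ai / xi for ai, xi in zip(a, x))

def sensitivities(x, a, c):
    """S_i = -(dE/dx_i) / (dD/dx_i) = c_i * x_i^2 / a_i for this toy model."""
    return [ci * xi ** 2 / ai for ai, ci, xi in zip(a, c, x)]

a = [1.0, 4.0]  # hypothetical delay coefficients
c = [2.0, 1.0]  # hypothetical energy coefficients
x_opt = balanced_sizing(a, c, d_target=2.0)
print(sensitivities(x_opt, a, c))        # entries are equal at the optimum
print(energy(x_opt, c), delay(x_opt, a))
```

Any other point that meets the same delay, for example x = [1.0, 4.0], has unequal sensitivities and a higher energy in this model.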

Circuit-Level Results: Tree Adder

[D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz, R.W. Brodersen, JSSC Aug'04]

[Figure: E/Eref vs. D/Dref, both axes 0 to 1.5, for a tree adder with inputs (A0, B0) through (A15, B15), Cin, and outputs S0 through S15, shown with an energy map. Annotations: ref; SW=22, SVth=22, SVdd=16; SW=1, SVth=1, SVdd=1; SW→∞, SVth=0.2, SVdd=1.5; 65%; optimization.]

E-D space is the key for architecture optimization

Scaling Impacts Architecture

[Figure: Eop/Eref @ Vddmax (0.1 to 10) vs. Top/Tref @ Vddmax (0.01 to 10). Architecture: P2 = parallel 2, T2 = time-mux 2. Process: S-Vt, L-Vt, H-Vt. Annotation: scaling.]
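One way to see why scaling and architecture interact is the following sketch. It assumes an alpha-power-law delay model with hypothetical threshold and exponent values, and it ignores the area and switching overhead of parallelism. A parallel-2 (P2) design can run each unit at half rate for the same throughput, so the supply can drop until the gate delay doubles.

```python
def gate_delay(vdd, vth=0.3, alpha=1.3):
    """Alpha-power-law delay up to a constant factor (model is an assumption)."""
    return vdd / (vdd - vth) ** alpha

def vdd_for_slowdown(slowdown, vdd_ref=1.0, vth=0.3, alpha=1.3):
    """Bisect for the supply at which delay is `slowdown` times the reference."""
    target = slowdown * gate_delay(vdd_ref, vth, alpha)
    lo, hi = vth + 1e-6, vdd_ref  # delay falls monotonically over this range
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gate_delay(mid, vth, alpha) > target:
            lo = mid  # too slow at this supply: raise it
        else:
            hi = mid  # fast enough: try a lower supply
    return hi

# Two parallel units can each take twice as long per operation.
vdd_p2 = vdd_for_slowdown(2.0)
energy_ratio = (vdd_p2 / 1.0) ** 2  # energy per operation scales as V^2
print(f"VDD = {vdd_p2:.3f} V, energy per op = {energy_ratio:.2f} x reference")
```

The result depends on the threshold voltage, which is why the preferred architecture can change between S-Vt, L-Vt, and H-Vt processes.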

Simulink to Silicon Mapping

[Figure: blocks Simulink (fixed-point library), MDL, custom tool 1, ASIC backend; speed, power, area.]

[R. Davis et al., JSSC Mar'02]

MDL to RTL conversion, automated P&R flow

Including FPGA Emulation

[Figure: blocks Simulink (hardware library), RTL, custom tool 2, ASIC backend, FPGA backend; speed, power, area.]

[K. Kuusilinna et al., book chap. in SoC Revolution, KAP 2003]

XSG hardware library, RTL translation scripts

ASIC and FPGA are I/O equivalent

Closing the Loop: I/O Verification

[Figure: blocks Simulink (hardware library, I/O library), RTL, custom tool 2, custom tool 3, ASIC backend, FPGA backend; speed, power, area.]

[D. Markovic, C. Chang, B. Richards, H. So, B. Nikolic, R.W. Brodersen, CICC'07]

I/O hardware library, automated FPGA flow

FPGA implements ASIC logic analysis

Design Approach

Unified Simulink design environment
– Enter design once!
– Algorithm verification
– Macro-architecture
– FPGA-based ASIC debug

Hardware-equivalent Simulink blocks
– Add, mult, shift, mux…
  ● Word-size, latency

Block Characterization

Library blocks / macros synthesized @ VDDref
Pipeline logic scaling
FO4 inv simulation

[Figure: latency vs. cycle time for mult and add blocks; energy under gate sizing and VDD scaling from VDDref; TClk @ VDDref and TClk @ VDDopt; speed, power, area.]

Goal: balanced logic depth and E/D sensitivity

Methodology for Architecture Selection

[Figure: energy, delay, and area axes. Datapath: initial design, gate sizing, VDD scaling, optimal design. Block-level: initial design, parallel, time-mux, intl/fold, pipeline, par/pip, gate sizing, optimal design.]

Energy-Area-Delay space for architecture comparison
– Time-mux, parallelism, pipelining, VDD scaling, sizing…

Example: 4x4 SVD Algorithm

This complexity is hard to optimize in RTL
– 270 adders, 370 multipliers, 8 sqrt, 8 div
– Recursive LMS-based algorithm (nested feedback loops)

$$w_i(k) = w_i(k-1) + \mu_i \left[ y_i(k)\, y_i^{\dagger}(k)\, w_i(k-1) - \sigma_i^2(k-1)\, w_i(k-1) \right]$$
$$\sigma_i^2(k) = w_i^{\dagger}(k)\, w_i(k)$$
$$u_i(k) = w_i(k) / \sigma_i^2(k)$$
$$y_{i+1}(k) = y_i(k) - \left[ w_i^{\dagger}(k)\, y_i(k)\, w_i(k) \right] / \sigma_i^2(k)$$
$$(i = 1, 2, 3, 4)$$

[Figure: input y1(k) feeds four cascaded stages, Antenna 1 through Antenna 4. Each stage is a UΣ LMS block; stages 1 to 3 are each followed by deflation.]
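The recursion above can be sketched for a single stage in plain Python. The sample vector, initial weights, and step size are hypothetical, and the u_i normalization step is left out. The sketch covers only the weight update, the σ² estimate, and the deflation that feeds the next stage.

```python
def inner(a, b):
    """Hermitian inner product a^dagger b of two complex vectors."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

def lms_stage(w_prev, y, mu):
    """One weight update of a stage, then deflation of the input sample.

    Returns (w, sigma2, y_next) per the update, sigma^2, and deflation equations.
    """
    sigma2_prev = inner(w_prev, w_prev).real
    proj = inner(y, w_prev)  # scalar y^dagger(k) w(k-1)
    w = [wi + mu * (yi * proj - sigma2_prev * wi) for wi, yi in zip(w_prev, y)]
    sigma2 = inner(w, w).real
    scale = inner(w, y) / sigma2  # scalar w^dagger(k) y(k) / sigma^2(k)
    y_next = [yi - scale * wi for wi, yi in zip(w, y)]
    return w, sigma2, y_next

# Hypothetical received sample on four antennas and initial weight vector.
y1 = [1 + 1j, 0.5 - 0.2j, -0.3 + 0.8j, 0.1j]
w0 = [1 + 0j, 0j, 0j, 0j]
w1, s2, y2 = lms_stage(w0, y1, mu=0.05)
print(abs(inner(w1, y2)))  # near zero: the deflated sample is orthogonal to w1
```

Deflation removes the component of the sample along the current weight vector, so the next stage tracks a different direction. Each stage also depends on its own previous output, which gives the nested feedback loops mentioned above.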

Energy/Area Optimization

Starting point: fixed architecture

[Figure: energy, area, and delay axes; 16b design; Interl. 13.8x; Fold 2.6x.]

Step 1: Word-length optimization

[Figure: energy, area, and delay axes; 16b design; word-size, annotations 30% and 30%; Interl. 13.8x; Fold 2.6x.]
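A minimal sketch of what a word-length choice means in practice follows. It assumes signed two's-complement values described by a (total bits, fractional bits) pair, which is one plausible reading of labels such as "12,9" and "8,5" in the design figures. The rounding and saturation behavior are also assumptions, not the behavior of any particular Simulink block.

```python
def quantize(value, total_bits, frac_bits):
    """Round to a signed fixed-point grid and saturate at the range limits.

    Uses Python's round(), which resolves exact ties to the even code.
    """
    scale = 1 << frac_bits
    max_code = (1 << (total_bits - 1)) - 1
    min_code = -(1 << (total_bits - 1))
    code = round(value * scale)
    code = max(min_code, min(max_code, code))
    return code / scale

# Quantization error of the same value at three hypothetical word sizes.
for total_bits, frac_bits in [(16, 12), (12, 9), (8, 5)]:
    q = quantize(0.3, total_bits, frac_bits)
    print(total_bits, frac_bits, q, abs(q - 0.3))
```

Shorter words cost less area and energy per operation but raise the quantization error, so word-length optimization looks for the smallest sizes that still meet the algorithm's accuracy target.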

Step 2: Gate size & VDD optimization

[Figure: energy, area, and delay axes; 16b design; word-size; initial synthesis; sizing; annotations 40%, 30%, 30%, 20%; Interl. 13.8x; Fold 2.6x.]

Step 2: Gate size & VDD optimization

[Figure: energy, area, and delay axes; 16b design; word-size; initial synthesis; sizing; VDD scaling, 7x; optim. VDD, W; final design; annotations 40%, 30%, 30%, 20%; Interl. 13.8x; Fold 2.6x.]

Hardware Results (90nm ST Micro)

Result of Energy-Area-Performance Optimization

2.1 GOPS/mW
– 70 GOPS @ 100 MHz
– Power = 34 mW

20 GOPS/mm²
– 70 GOPS in 3.5 mm²

[Figure: Comparison with ISSCC chips. Area efficiency (GOPS/mm²), 0.01 to 100, vs. energy efficiency (GOPS/mW), 0.01 to 10. Points: SVD; 1998 18-6; 2000 4-2; 1999 15-5; 1998 7-6; 2004 18-5; 2000 14-8; 1998 18-3; 2000 14-5.]

[D. Markovic, B. Nikolic, R.W. Brodersen, JSSC Apr'07]

Functional test was performed with FPGA
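The headline efficiency figures follow directly from the measured throughput, power, and area. The short check below uses only numbers quoted on the slide; the operations-per-cycle line is a derived figure, not one stated in the talk.

```python
throughput_gops = 70.0  # 70 GOPS @ 100 MHz (from the slide)
power_mw = 34.0         # measured power (from the slide)
area_mm2 = 3.5          # core area (from the slide)
clock_mhz = 100.0       # clock frequency (from the slide)

energy_eff = throughput_gops / power_mw  # GOPS/mW
area_eff = throughput_gops / area_mm2    # GOPS/mm^2
ops_per_cycle = throughput_gops * 1e9 / (clock_mhz * 1e6)
print(f"{energy_eff:.1f} GOPS/mW, {area_eff:.0f} GOPS/mm^2, {ops_per_cycle:.0f} ops/cycle")
```

At 100 MHz, 70 GOPS corresponds to about 700 operations per clock cycle, consistent with a highly parallel datapath running at a modest clock rate.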

FPGA Based ASIC Verification

Goal: use Simulink testbench (TB) for ASIC verification
– Develop custom interface blocks (I/O)
– Place I/O and ASIC RTL into TB model

[Figure: TB + I/O + ASIC = TB containing I/O and ASIC.]

Simulink implicitly provides the testbench


Emulation-based ASIC I/O test

Simulink I/O Test Model for the SVD

Experimental Setup

[Figure: FPGA board, ASIC board, GPIO.]

Real-time at-speed ASIC verification

Measured Functionality

[Figure: 4x4 MIMO channel tracking. Eigenvalues σ1², σ2², σ3², σ4² (0 to 12) vs. number of symbols [k] (0 to 32); theoretical and hardware curves.]

Up to 10 b/s/Hz with adaptive PSK

From Simulink to Optimized Hardware

Automated Architecture Generation Flow

[Figure: initial DFG (reference, direct mapping) with adders, multipliers, coefficients a, b, c, d, and delays D and 2D; resulting Simulink/SynDSP architectures: Architecture 1 (folding N = 2) and Architecture 2 (folding N = 4).]

ILP Scheduling & Bellman-Ford Retiming: optimal + reduced CPU time

Direct-mapped DFG (Simulink) → Scheduler (C++ / MOSEK) → Architecture solutions (Simulink/SynDSP) → Hardware (FPGA/ASIC)
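To make the retiming step concrete, here is a minimal sketch of the textbook formulation. Retiming legality is written as a system of difference constraints and solved with Bellman-Ford. This is not the scheduler used in the flow above; the three-node loop and the register requirements are hypothetical. An edge u → v with w registers has w + r(v) − r(u) registers after retiming.

```python
def solve_difference_constraints(n, constraints):
    """Find x with x[j] - x[i] <= c for every (j, i, c), or None if infeasible."""
    dist = [0] * n  # same as a virtual source with zero-weight edges to all nodes
    for _ in range(n):
        changed = False
        for j, i, c in constraints:
            if dist[i] + c < dist[j]:
                dist[j] = dist[i] + c
                changed = True
        if not changed:
            return dist
    return None  # still relaxing after n passes: negative cycle, no solution

def retime(num_nodes, edges, min_regs):
    """edges: (u, v, w) with w registers on u -> v; min_regs: minimum per edge."""
    # w + r[v] - r[u] >= m  is equivalent to  r[u] - r[v] <= w - m
    constraints = [(u, v, w - m) for (u, v, w), m in zip(edges, min_regs)]
    return solve_difference_constraints(num_nodes, constraints)

# Hypothetical loop 0 -> 1 -> 2 -> 0 with both registers on the feedback edge.
edges = [(0, 1, 0), (1, 2, 0), (2, 0, 2)]
r = retime(3, edges, min_regs=[1, 0, 0])  # ask for a register on edge 0 -> 1
print([w + r[v] - r[u] for u, v, w in edges])
print(retime(3, edges, min_regs=[1, 1, 1]))  # None: the loop holds only 2 registers
```

Retiming moves registers around a loop but cannot change how many the loop contains, so the second request is infeasible. Bellman-Ford reports that case as a negative cycle.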

Energy-Area-Performance Map

[Figure: surface of valid architectures over energy, area, and performance axes (each 0.2 to 1), bounded by constraints. Labels: direct mapping (reference), parallelism, time-mux, retiming, VDD scaling.]

Each point on the surface is an optimal architecture automatically generated in Simulink after modified ILP scheduling and retiming

System designer can choose from feasible optimal solutions
It is not just about functionality, but how good a solution is, and how many alternatives exist
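Choosing among feasible optimal solutions amounts to keeping the non-dominated points in the energy-area-delay space. The sketch below shows one simple way to filter them. The candidate tuples are made-up normalized values, with lower meaning better in every metric; they are not results from the talk.

```python
def dominates(p, q):
    """True if p is no worse than q in every metric and better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def pareto_front(points):
    """Keep the points that no other point dominates (lower is better)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical normalized (energy, area, delay) tuples for candidate architectures.
candidates = [
    (1.0, 1.0, 1.0),  # reference
    (0.6, 1.8, 1.0),  # lower energy, larger area
    (0.9, 0.5, 2.0),  # smaller area, slower
    (0.9, 0.6, 2.0),  # dominated by the previous entry
    (1.2, 1.1, 1.0),  # dominated by the reference
]
print(pareto_front(candidates))
```

The surviving points are the alternatives a designer would weigh against the system constraints.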

Conclusions

Simulink provides the level of abstraction needed for complete ASIC development
– Hardware emulation of algorithms
– Technology-driven architecture selection
– FPGA-based ASIC verification
  ● Logic analysis can be fully ported onto FPGA

Energy-area-delay space is a compact way of comparing multiple architectural realizations
– ILP-based formulation automates architecture design

Complex algorithms in 90 nm can achieve
– 2.1 GOPS/mW, 20 GOPS/mm²

References

ASIC design and verification
– D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz, and R.W. Brodersen, "Methods for True Energy-Performance Optimization," IEEE J. Solid-State Circuits, vol. 39, no. 8, pp. 1282-1293, Aug. 2004.
– D. Markovic, R.W. Brodersen, and B. Nikolic, "A 70GOPS 34mW Multi-Carrier MIMO Chip in 3.5mm²," in Proc. IEEE Int'l Symp. on VLSI Circuits (VLSI'06), June 2006, pp. 196-197.
– D. Markovic, B. Nikolic, and R.W. Brodersen, "Power and Area Minimization for Multidimensional Signal Processing," IEEE J. Solid-State Circuits, vol. 42, no. 4, pp. 922-934, April 2007.
– D. Markovic, C. Chang, B. Richards, H. So, B. Nikolic, and R.W. Brodersen, "ASIC Design and Verification in an FPGA Environment," in Proc. IEEE Custom Integrated Circuits Conf. (CICC'07), Sept. 2007, pp. 737-740.

More publications available online
– www.ee.ucla.edu/~dejan

Acknowledgments

Funding support
– C2S2 Focus Center Research Program, contract 2003-CT-888

Infrastructure support
– ST Microelectronics, Xilinx (hardware)
– Synplicity, Synopsys, Cadence (software)