1 Intel Labs July 11, 2005 Microprocessor and DSP Technologies for the Nanoscale Era Seminar 2 Ram Kumar Krishnamurthy Microprocessor Research Labs Intel Corporation, Hillsboro, OR [email protected]

1

Intel Labs
July 11, 2005

Microprocessor and DSP Technologies for the Nanoscale Era
Seminar 2

Ram Kumar Krishnamurthy
Microprocessor Research Labs
Intel Corporation, Hillsboro, OR
[email protected]

2

Outline

• General Technology and Circuit Challenges Beyond 65nm:
  • Switching and active leakage energy
  • Leakage tolerance and robustness
  • On-chip interconnect scaling
  • Process parameter variations and tolerance
  • Execution core thermal/power density
• Emerging trends in wireless and embedded DSP industry
• Circuit solutions:
  • Active and standby leakage power reduction strategies
  • Multi-supply design: switching + leakage power benefits
  • Energy-efficient arithmetic circuit technologies
  • HW accelerators for specialized DSP applications

3

Technology Scaling 101

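[Figure: MOSFET cross-section (gate, source, drain, body) with channel length L and oxide thickness Tox, shown before scaling and after scaling to 0.7 L and 0.7 Tox.]

Dimensions: 1 x 1 = 1 before scaling; 0.7 x 0.7 = 0.49 after scaling

Delay: 1 before scaling; 0.7 after scaling

Freq = 1/Delay: 1 before scaling; 1/0.7 = 1.43 after scaling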
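The scaling arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch, assuming ideal scaling with a linear shrink factor of 0.7 per generation; the function name `scale_generation`, the dictionary keys, and the reading of 0.49 as an area ratio are illustrative assumptions, not taken from the talk.

```python
def scale_generation(s=0.7):
    """Return ideal per-generation scaling ratios for a linear shrink factor s."""
    return {
        "length": s,         # L -> 0.7 L (Tox shrinks by the same factor)
        "area": s * s,       # 0.7 x 0.7 = 0.49 (assumed to be the area ratio)
        "delay": s,          # gate delay 1 -> 0.7
        "frequency": 1 / s,  # Freq = 1/Delay = 1/0.7, about 1.43
    }

if __name__ == "__main__":
    for name, ratio in scale_generation().items():
        print(f"{name}: {ratio:.2f}x")
```

Real processes deviate from these ideal ratios, which is the point of the challenges listed in the following slides.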

4

Leakage vs. Switching Power

• From a µP perspective, but true for DSPs too
• Ioff increase 3-5X per generation
• Active leakage power > 50% of total power

[Figure: Power (Watts, 0 to 250) versus technology (250nm, 180nm, 130nm, 90nm, 65nm), showing active power and active leakage power.]

5

On-chip Interconnect Performance

• RC/µm increases 40-60% per generation
• Local inter-gate wires dominate critical-path delays
• Global wire lengths not scaling by 0.7x

[Figure: Delay (ns, log scale from 0.001 to 1.0) versus technology node (250 to 50 nm) for typical gate delay and interconnect delay; annotation: 30% per generation.]

6

Process Variation Tolerance

• Significant variation in IOFF (hence Fmax spread)
• Worsening with process scaling
• Excess leakage dies: lack in robustness
• Low leakage dies: over-designed for robustness

[Figure: Histogram of number of dies (0 to 200) versus normalized IOFF at the fast corner (180nm CMOS, 110°C).]

[Figure: Normalized frequency (0.9 to 1.4) versus normalized leakage (0 to 20); annotations: 30% (frequency) and 20X (leakage).]

Process parameter variation tolerant design techniques

7

DSP Application Demands

[Figure: Capability versus time (2001, 2003, 2005), moving from voice to data and applications.]

• 2G: GSM, PDC, IS-95 (< 50 MIPS)
• 2.5G: GPRS, EDGE, IS-95B (> 100 MIPS)
• 3G: WCDMA (> 200 MIPS)

Figure labels: ASIC DSP Hardware; ASIC DSP Hardware Assist

• Smart cell-phones: $2B in ’02 to $15B in ’06!
• Huge demand for high-performance DSPs

8

[Figure: Handheld capability versus time (2001, 2003, 2005), in three tiers.]

• < 50 MIPS, 4+ MB Flash, 0.5+ MB RAM
• > 100 MIPS, 16+ MB Flash, 8+ MB RAM
• 200+ MIPS, 64+ MB Flash, 16+ MB RAM: Multimedia, Graphics, Enterprise…

Feature labels in the figure: simple user interface; calendar; notepad; email; speech recognition; multimedia; large files and applications; color screen; audio; graphics; secure remote access; full OS and user interface; GUI; browser; suite of apps; OS, services and apps.

• Market is hungry for DSP MIPS (if you deliver, they will use it!)

9

Typical Performance Requirements

[Figure axes: MHz per task versus total required memory.]

• MP3 encode

• ASCII Browser

• Graphical Browser - small screen

• Voice 128-bit encryption and decryption

• MP3 Playback

• MPEG 4 Playback

• MPEG 4 Playback

• Robust handwriting recognition

Total required memory axis: < 4MB, 8MB, 16MB, 32MB, 64MB

MHz per task axis (log scale): 10, 100, 1000

MHz ranges marked in the figure: 8 - 12 MHz; 25 - 50 MHz; 40 - 80 MHz; 80 - 100 MHz; 150 - 200 MHz; Pocket PC: 200 - 300 MHz

10

So, How Do We Meet This Surging Demand Within Given Power Envelope?

• Energy vs. Flexibility Trade-off

[Figure: Energy efficiency in MOPS/mW (log scale, 0.1 to 1000) versus flexibility (coverage). Categories as listed in the figure:]

• Dedicated HW ASIC
• Configurable Processor/Logic (Berkeley’s Pleiades: 10-80 MOPS/mW)
• Digital Signal Processors or other ASIPs (1-2 MOPS/mW)
• Embedded Processors (SA110: 0.4 MOPS/mW)

Courtesy: Prof. J. Rabaey, UC Berkeley
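To make the trade-off concrete, the MOPS/mW figures quoted on this slide can be turned into a power budget for a given workload. The sketch below is illustrative only: the 200 MOPS workload, the function names, and the pairing of each efficiency range with a platform label are assumptions made for the example.

```python
# Efficiency figures quoted on the slide, in MOPS/mW, as (low, high) ranges.
# The platform labels and their pairing with the numbers are illustrative.
EFFICIENCY_MOPS_PER_MW = {
    "embedded processor (SA110)": (0.4, 0.4),
    "DSP / ASIP": (1.0, 2.0),
    "configurable logic (Pleiades)": (10.0, 80.0),
}

def power_mw(workload_mops, efficiency_mops_per_mw):
    """Power in mW needed to sustain a workload at a given efficiency."""
    return workload_mops / efficiency_mops_per_mw

def power_range_mw(workload_mops, platform):
    """Return (best case, worst case) power in mW for a platform."""
    low_eff, high_eff = EFFICIENCY_MOPS_PER_MW[platform]
    # The highest efficiency gives the lowest power, so the bounds swap.
    return power_mw(workload_mops, high_eff), power_mw(workload_mops, low_eff)

if __name__ == "__main__":
    workload = 200.0  # hypothetical 200 MOPS workload
    for platform in EFFICIENCY_MOPS_PER_MW:
        best, worst = power_range_mw(workload, platform)
        print(f"{platform}: {best:.1f} to {worst:.1f} mW")
```

Under these assumptions the same workload costs hundreds of milliwatts on an embedded processor and a few milliwatts on configurable logic, at the price of reduced flexibility.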

11

Energy and Area Efficiency

Courtesy: Prof. Teresa Meng, Stanford

12

MOPS/mW Distinction: General-purpose vs. Dedicated

• DSP functions are more throughput-oriented
• Amenable for parallelism and pipelining (better power-performance optimization)

Courtesy: Prof. B. Brodersen, UC Berkeley, ISSCC’02

13

Emerging Trends in DSP Industry

[Figure: Normalized power efficiency (log scale, 1 to 1000) versus flexibility, showing specialized hardware, programmable DSP, and embedded/µP.]

Specialized hardware: best energy efficiency, enables algorithm tuning + low clock rates

Microprocessors: Best flexibility

Prof. L. Clark, CICC 2002 [2]

14

Emerging Trends in DSP Industry

[Figure: Normalized power efficiency (log scale, 1 to 1000) versus flexibility, showing specialized hardware, programmable DSP, and embedded/µP.]

Microprocessors add specialized HW and coprocessors with DSP functionality

Specialized hardware: best energy efficiency, enables algorithm tuning + low clock rates

Microprocessors: Best flexibility

15

Example Case Study

• 153-380MHz, 53-500mW in 180nm CMOS, 1.0-1.8V
• 5.84M transistors, 36mm²
• Dedicated DES-Encryption and Speech processing accelerators

Encryption and Speech Processing Specialized HW

IBM 32b PowerPC Processor, Nowka et al, ISSCC 2002 [3]

16

Emerging Trends in DSP Industry

[Figure: Normalized power efficiency (log scale, 1 to 1000) versus flexibility, showing specialized hardware, programmable DSP, and embedded/µP.]

DSPs add microcontroller functionality and specialized HW accelerators

Specialized hardware: best energy efficiency, enables algorithm tuning + low clock rates

Microprocessors: Best flexibility

17

Example Case Study

• 600MHz, 4.8GOPS, 718mW in 130nm CMOS, 1.2V
• Dedicated Viterbi and Turbo decoding co-processor HW
• 64M transistors
• Integrated DMA controller, PCI, 1MB L2, 16K I$ & D$

Viterbi and Turbo Co-processors

TI VLIW DSP with 1MB L2 cache, Agarwala et al, ISSCC 2002 [4]
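For comparison with the MOPS/mW scale used earlier in the talk, the throughput and power quoted for this DSP can be converted to the same unit. This is a small illustrative calculation; the helper name `mops_per_mw` is an assumption of this sketch.

```python
def mops_per_mw(gops, power_mw):
    """Convert throughput in GOPS and power in mW to MOPS/mW."""
    return gops * 1000.0 / power_mw

if __name__ == "__main__":
    # 4.8 GOPS at 718 mW, as quoted on the slide.
    print(f"{mops_per_mw(4.8, 718.0):.2f} MOPS/mW")
```

The result is roughly 6.7 MOPS/mW for the quoted operating point.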

18

Specialized Hardware Accelerators

• Specialized (fixed function) hardware is 10-100x more efficient than general purpose processors: Why?
  – Trades hardware for power
  – Allows very low clock rates
  – Essential for some wireless functions
  – Viterbi and Turbo decoding, speech recognition, encryption
• Allows custom algorithms and coefficients to limit power
  – Use shifts instead of multiplies
• Cost is flexibility
  – Fixed algorithms and coefficients
  – As new applications and wireless standards emerge, is this enough?
  – How does this cover the application space?
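The "shifts instead of multiplies" point can be illustrated with a fixed coefficient: when the coefficient is known at design time, the multiplier reduces to a few shifts and adds. The sketch below is a software illustration of that idea, not the hardware from the talk; the function names and the example coefficient 10 are assumptions.

```python
def shift_add_terms(coefficient):
    """Return the bit positions set in a non-negative integer coefficient."""
    return [bit for bit in range(coefficient.bit_length()) if (coefficient >> bit) & 1]

def multiply_by_constant(x, coefficient):
    """Multiply x by a fixed non-negative integer using only shifts and adds."""
    return sum(x << bit for bit in shift_add_terms(coefficient))

if __name__ == "__main__":
    # 10 = 0b1010, so 10*x = (x << 3) + (x << 1): two shifts and one add.
    print(shift_add_terms(10), multiply_by_constant(7, 10))
```

Coefficients with few set bits need few adders, which is one reason fixed coefficients can be chosen to limit power.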

19

Reconfigurable Processors

FPGA – Fine Grain Reconfigurable Fabric

• Fine-grain gate-level functions
• Array of MUXes to implement any N-input boolean function
• Speed sacrificed for generality

Coarse Grain Reconfigurable Fabric

• Moderate grain function blocks
• Collections of Add, Mpy, Mux, …
• Interconnect overhead is moderate to low
• If functions and connectivity are known, can be highly optimized

Courtesy: Prof. F. Kurdahi, UCI
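The fine-grain case, an array of MUXes implementing any N-input boolean function, can be sketched as a lookup table read out through a mux tree. This is a minimal software model under that assumption; the names `make_lut`, `mux2` and `lut_eval` are hypothetical and do not describe any specific FPGA.

```python
from itertools import product

def make_lut(func, n_inputs):
    """Tabulate func over all 2**n_inputs input combinations (the configuration bits)."""
    return [func(*bits) for bits in product((0, 1), repeat=n_inputs)]

def mux2(select, in0, in1):
    """2:1 multiplexer."""
    return in1 if select else in0

def lut_eval(config_bits, inputs):
    """Evaluate the LUT with a tree of 2:1 muxes, one tree level per input."""
    level = list(config_bits)
    # The last input is the fastest-varying index in product(), so it selects first.
    for select in reversed(inputs):
        level = [mux2(select, level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

if __name__ == "__main__":
    majority = lambda a, b, c: int(a + b + c >= 2)
    config = make_lut(majority, 3)
    print(config, lut_eval(config, (1, 0, 1)))
```

Changing the configuration bits changes the function without changing the mux tree, which is the generality that the slide says is paid for in speed.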

20

Generic Reconfigurable Architecture

Registers

Datapath Tiles

Configuration Control

Array of Fine/Coarse Grain Datapath Tiles and Registers

21

How Do Reconfigurable Processors Work?

Execute one algorithm/ protocol at any given time

Each algorithm is ‘configured’ from the building blocks

Time between subsequent configurations: ~1-10ms

Configuration Control unit decides which algorithm to execute when

[Figure: Timeline with ~1-10 ms between configurations; Protocol 1 is active.]

22

How Do Reconfigurable Processors Work?

[Figure: Timeline with ~1-10 ms between configurations; Protocol 2 is active.]

Execute one algorithm/ protocol at any given time

Each algorithm is ‘configured’ from the building blocks

Time between subsequent configurations: ~1-10ms

Configuration Control unit decides which algorithm to execute when

23

How Do Reconfigurable Processors Work?

[Figure: Timeline with ~1-10 ms between configurations; Protocol 3 is active.]

Execute one algorithm/ protocol at any given time

Each algorithm is ‘configured’ from the building blocks

Time between subsequent configurations: ~1-10ms

Configuration Control unit decides which algorithm to execute when
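The time-multiplexing described on these three slides can be modelled as a schedule with one active protocol per configuration slot. The sketch below is a toy model: the round-robin policy, the 10 ms slot length and the function name `schedule` are assumptions, since the talk does not say how the configuration control unit chooses.

```python
def schedule(protocols, slot_ms, total_ms):
    """Return (start_ms, protocol) pairs, with one active protocol per time slot."""
    timeline = []
    for slot, start in enumerate(range(0, total_ms, slot_ms)):
        # Hypothetical policy: rotate through the protocols in order.
        timeline.append((start, protocols[slot % len(protocols)]))
    return timeline

if __name__ == "__main__":
    for start, name in schedule(["Protocol 1", "Protocol 2", "Protocol 3"], 10, 40):
        print(f"t = {start:2d} ms: fabric configured for {name}")
```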

24

Standby Leakage Reduction: Sleep Transistor design

• Motivation: Cut off power supply in sleep-mode
• Insert “sleep” transistor between main supply and functional unit’s supply rails
• Latches tied to main supply rails: retain state

[Figure: Functional unit between Virtual Vcc and Virtual Vss, each connected to the main supply through a sleep transistor.]

Standby leakage benefit for 5% delay penalty

                          MTCMOS   Boosted Sleep   Non-Boosted Sleep
Sleep-TR size             5.1%     2.3%            3.2%
Leakage power reduction   1450X    3130X           11.5X
Virtual supply bounce     60 mV    59 mV           58 mV

25

Switching + Leakage Reduction: Forward Body Bias

[Figure: Circuit with body terminals Vbp and Vbn; labels Vdd, -Ve, +Ve.]

[Figure: Normalized total power (0 to 4) versus frequency (0.6 to 1.4 GHz) for ZBB and FBB 500mV; Vcc: 1, 1.05, 1.1 … 1.5V; 110°C, α=0.1; points marked at 1.1V and 1.2V.]

[Figure: FBB/ZBB leakage ratio (0 to 30) versus frequency (0.6 to 1.4 GHz); 27°C.]

• 20% power reduction at 1GHz
• 8% higher frequency at iso-power
• 20X lower idle-mode leakage

A. Keshavarzi et al, 2002 Symp. VLSI Circuits [6]

26

Multi-Vcc Usage Model

[Figure: Four core supplies (VCCcore1 to VCCcore4) with four voltage regulator modules (VRM1 to VRM4).]

• Optimize performance and power with parallelism and voltage
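The benefit of trading parallelism against voltage follows from the usual switching-power relation, P proportional to C·Vcc²·f. The sketch below compares one core with two slower cores; it assumes perfect parallel speedup, ignores leakage, and uses a hypothetical 0.9 V operating point for the half-speed cores, none of which comes from the talk.

```python
def dynamic_power(cores, vcc, freq, cap=1.0):
    """Relative switching power: cores * C * Vcc^2 * f."""
    return cores * cap * vcc ** 2 * freq

if __name__ == "__main__":
    single = dynamic_power(cores=1, vcc=1.2, freq=1.0)
    # Hypothetical operating point: two cores at half the clock and 0.9 V.
    dual = dynamic_power(cores=2, vcc=0.9, freq=0.5)
    print(f"dual/single power ratio: {dual / single:.2f}")
```

With these numbers the two-core configuration delivers the same aggregate clock cycles per second at a little over half the switching power, because power falls with the square of the supply voltage.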

27

Switching + Leakage Reduction: Multi Supply Design

• Active leakage benefit with lower supply voltage
• Exponential subthreshold and gate leakage reduction

[Figure: Measured leakage in 1.2V, 130nm process: normalized leakage energy (0 to 12) versus VCC (0 to 1.4 V) for subthreshold leakage and gate leakage.]

[Figure: 130nm L1 cache leakage: normalized leakage (0 to 100) versus voltage (0 to 1.5 V) at the worst-case corner and the nominal corner; annotation: 79%.]

R. Krishnamurthy et al, 2002 Symp. VLSI Circuits [7]

28

Adaptive Vcc: Variation-tolerant Circuits

• Motive: change Vcc adaptively to reduce impact of parameter variations
• Large Fmax vs. leakage spread (worsening with scaling)
• Lower Vdd on leakage-limited circuits (subject to stability limits)
• Higher Vdd on speed-limited circuits (subject to reliability limits)

[Figure: Die count (0% to 100%) versus frequency bin (0.85 to 1.05), comparing fixed Vdd (1.05V) with adaptive Vdd (20mV resolution).]

[Figure: Die count (0% to 100%) versus frequency bin (0.85 to 1.05), comparing adaptive Vdd + body bias with adaptive Vdd + WID body bias.]

[Figure: Test die, 4.5 mm x 5.3 mm, with 21 sub-sites within 1 die.]

[Figure: Frequency (0.9 to 1.4) versus leakage (0 to 20); annotations: 30%, 20X.]

J. Tschanz et al, 2002 Symp. VLSI Circuits

29

Viterbi Decoder Organization

• BMU calculates errors for all branches
• PMU accumulates errors and outputs transitions with minimum error
• TBU traces minimum error path back to get best estimate of original input

One of the most performance- and power-critical algorithms in wireless baseband DSP

[Figure: Encoded bits → Branch Metric Unit (BMU) → branch error → Path Metric Unit (PMU) → transitions → Traceback Unit (TBU) → decoded bits.]
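The three-unit organization above maps directly onto a software decoder. The following is a minimal sketch, not the accelerator described in the talk: it assumes a small 4-state, rate-1/2 convolutional code with generators (7, 5) octal and a hard-decision Hamming metric, chosen only to keep the example short, and the function names `bmu`, `pmu` and `tbu` simply mirror the block names.

```python
G = (0b111, 0b101)  # assumed generator polynomials (7, 5) octal, constraint length 3
N_STATES = 4

def parity(x):
    return bin(x).count("1") & 1

def step(state, bit):
    """Return (next_state, output_pair) of the encoder for one input bit."""
    reg = (bit << 2) | state
    return reg >> 1, tuple(parity(reg & g) for g in G)

def encode(bits):
    state, out = 0, []
    for b in bits:
        state, pair = step(state, b)
        out.append(pair)
    return out

def bmu(received, expected):
    """Branch metric: Hamming distance between received and expected pairs."""
    return sum(r != e for r, e in zip(received, expected))

def pmu(metrics, received):
    """Add-compare-select: new path metrics and the chosen (prev_state, bit) per state."""
    new = [float("inf")] * N_STATES
    choice = [None] * N_STATES
    for state in range(N_STATES):
        for bit in (0, 1):
            nxt, expected = step(state, bit)
            cand = metrics[state] + bmu(received, expected)
            if cand < new[nxt]:
                new[nxt], choice[nxt] = cand, (state, bit)
    return new, choice

def tbu(history, final_state):
    """Trace the surviving path backwards and return the decoded bits."""
    bits, state = [], final_state
    for choice in reversed(history):
        state, bit = choice[state]
        bits.append(bit)
    return bits[::-1]

def decode(received_pairs):
    metrics = [0] + [float("inf")] * (N_STATES - 1)  # encoder starts in state 0
    history = []
    for pair in received_pairs:
        metrics, choice = pmu(metrics, pair)
        history.append(choice)
    return tbu(history, metrics.index(min(metrics)))

if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
    print(decode(encode(message)) == message)
```

A 64-state design like the one in the following slides has the same structure with a larger state loop in `pmu`, which is where the add-compare-select hardware is spent.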

30

90nm CMOS Implementation

90nm dual-Vt 7-metal CMOS technology

64-state radix-2 design: 40mW at 500Mbps, 1.2V

[Figure: Layout. ACS block, 230µm x 210µm: 8 ACS, BMUs and PM memory. Traceback block, 260µm x 510µm: TB control, TB memory and path memory.]

M. Anders et al, 2004 VLSI Circuits Symp. [10]

31

Summary

• Fastest reported 64-state Viterbi accelerator
  – Total power at 2 GHz (500Mbps) is 40mW (1.2V)
• Lowest power 802.11a implementation
  – Total power at 216 MHz (54Mbps) is 5mW (0.7V)

Technology               90nm dual-Vt CMOS
ACS area                 230µm x 210µm (0.048 mm²)
Traceback area           260µm x 510µm (0.133 mm²)
Viterbi states           64-state
ACS precision            10 bits
Radix-2 max. TB length   96 symbols

M. Anders et al, 2004 VLSI Circuits Symp. [10]
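The two operating points in this summary can be compared on a per-bit basis. The helper names below are assumptions of this sketch; the inputs are the power, throughput and clock figures quoted on the slide.

```python
def energy_per_bit_pj(power_mw, throughput_mbps):
    """Energy per decoded bit in picojoules: power divided by throughput."""
    return (power_mw * 1e-3) / (throughput_mbps * 1e6) * 1e12

def cycles_per_bit(clock_mhz, throughput_mbps):
    """Clock cycles spent per decoded bit."""
    return clock_mhz / throughput_mbps

if __name__ == "__main__":
    for label, mw, mbps, mhz in [("1.2V", 40.0, 500.0, 2000.0), ("0.7V", 5.0, 54.0, 216.0)]:
        print(f"{label}: {energy_per_bit_pj(mw, mbps):.1f} pJ/bit, "
              f"{cycles_per_bit(mhz, mbps):.0f} cycles/bit")
```

Both points work out to four clock cycles per decoded bit, and to roughly 80 pJ and 93 pJ per bit respectively from the quoted totals.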

32

Streaming Media Accelerators: 32-bit MAC [ISSCC’03]

• 5GHz 32-bit multiply-accumulate unit
• Targeted for special purpose streaming processors/graphics

Die Area       1.32 x 1.57 mm²
Process        90nm CMOS
Interconnect   1 poly, 7 metal
Transistors    230K
Frequency      5GHz
Maximum Vcc    1.5V
Chip Power     1.2W @ 1.2V
Pad Count      75

[Figure: Die photo with multiplier, aligner, accumulate, normalize, FIFOs & scan, and CLK blocks.]

S. Vangal et al, ISSCC’03 [11]

33

32-bit MAC Architecture Overview

• Single-cycle 5GHz 32-bit MAC loop
• New Multiplier and Accumulator ALU circuit techniques

[Figure: MAC block diagram: FIFO A and FIFO B feed the multiplier (x) over 32-bit paths, the accumulator (+) feeds FIFO C; FIFOs are 32 vectors deep; FIFO control and scan registers (scan in, scan out) surround the MAC.]
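Functionally, a multiply-accumulate loop fed from operand FIFOs behaves like the sketch below. It is a behavioural illustration only: the function name `mac_stream`, the use of Python floats, and the choice to record the running accumulator after every step are assumptions, and nothing here models the circuit techniques of the actual design.

```python
from collections import deque

def mac_stream(fifo_a, fifo_b):
    """Drain two operand FIFOs, returning the running accumulator after each step."""
    acc, results = 0.0, deque()
    while fifo_a and fifo_b:
        acc += fifo_a.popleft() * fifo_b.popleft()  # one multiply-accumulate per iteration
        results.append(acc)
    return results

if __name__ == "__main__":
    a = deque([1.0, 2.0, 3.0])
    b = deque([4.0, 5.0, 6.0])
    print(list(mac_stream(a, b)))
```

Each iteration depends on the previous accumulator value, which is why closing this loop in a single cycle sets the clock rate.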

34

TCP/IP Off-load Accelerator [ISSCC’03]

• 10GHz TCP/IP offload accelerator unit
• Targeted for 10Gbps Ethernet packet processing accel.

Core Area      2.23 x 3.54 mm²
Process        90nm dual-VT CMOS
Interconnect   1 poly, 7 metal
Transistors    260K
Frequency      10GHz
Max Vcc        1.5V
Chip power     1.9W @ 1.2V
Pad count      306

[Figure: Die photo with labeled blocks including TCB, exec core, PLL, ROB, ROM, CLB, input seq and send buffer.]

Y. Hoskote et al, ISSCC’03 [12]

35

10GHz TCP/IP Execution Core

• At-speed packet processing execution core for 10Gbps

[Figure: Execution core block diagram: Rcv buffer, working register, TCB, CLB (key, index), ROB, scratch registers, PC, Instr ROM (start address, branch address, next address), IR decode, and a pipelined 10GHz sparse-tree ALU (ALU input, ALU output); connections are labeled 6, 9, 32, 96, 112 and 264.]

36

Conclusions

• Several Technology and Circuit Challenges Beyond 65nm
  • Switching and active leakage energy
  • Leakage tolerance and robustness
  • On-chip interconnect scaling
  • Process parameter variations and tolerance
  • Execution core thermal/power density
• Emerging trends in DSP industry
  • Specialized hardware accelerators and co-processors
  • Reconfigurable engines
• Circuit solutions:
  • Active and standby leakage power reduction strategies
  • Multi-supply design: switching + leakage power benefits
  • Energy-efficient arithmetic circuit technologies
  • DSP HW accelerators for Viterbi, Streaming media, TCP/IP

37

References

[1] R. Krishnamurthy et al, “High-performance and low-power challenges for sub-70nm microprocessor circuits”, IEEE Custom Integrated Circuits Conference 2002, pp. 125-128.

[2] L. Clark et al, “Trends and challenges for wireless embedded DSPs”, IEEE Custom Integrated Circuits Conference 2003, pp. 171-176.

[3] K. Nowka et al, “A 0.9 V to 1.95 V dynamic voltage-scalable and frequency-scalable 32 b PowerPC processor”, ISSCC 2002, pp. 340-341.

[4] S. Agarwala et al, “A 600 MHz VLIW DSP”, ISSCC 2002, pp. 56-57.

Reconfigurable processors:

1. http://brass.cs.berkeley.edu/

2. http://www.eng.uci.edu/comp.arch/

3. http://www.pactcorp.com/xneu/px_xpphw.html

[5] J. Tschanz et al, “Design optimizations of a high performance microprocessor using combinations of dual-Vt allocation and transistor sizing”, Symposium on VLSI Circuits 2002, pp. 218-219.

[6] A. Keshavarzi et al, “Forward body bias for microprocessors in 130nm technology generation and beyond”, Symposium on VLSI Circuits 2002, pp. 312-315.

[7] R. Krishnamurthy et al, “Dual supply voltage clocking for 5 GHz 130 nm integer execution core”, Symposium on VLSI Circuits 2002, pp. 128-129.

[8] S. Mathew et al, “A 4 GHz 130 nm address generation unit with 32-bit sparse-tree adder core”, Symposium on VLSI Circuits 2002, pp. 126-127.

[9] S. Mathew et al, “A 4GHz 300mW 64b integer execution ALU with dual supply voltages in 90nm CMOS”, ISSCC 2004, pp. 162-163.

[10] M. Anders et al, “A 64-state 2GHz 500Mbps 40mW Viterbi accelerator in 90nm CMOS”, Symposium on VLSI Circuits 2004, pp. 174-175.

[11] S. Vangal et al, “A 5 GHz floating point multiply-accumulator in 90 nm dual Vt CMOS”, ISSCC 2003, pp. 334-335.

[12] Y. Hoskote et al, “A 10GHz TCP/IP offload accelerator for 10Gb/s Ethernet in 90nm CMOS”, ISSCC 2003, pp. 258-259.