Intel Labs, July 11, 2005
Microprocessor and DSP Technologies for the Nanoscale Era
Seminar 2
Ram Kumar Krishnamurthy
Microprocessor Research Labs, Intel Corporation, Hillsboro, [email protected]
2
Outline
• General Technology and Circuit Challenges Beyond 65nm:
– Switching and active leakage energy
– Leakage tolerance and robustness
– On-chip interconnect scaling
– Process parameter variations and tolerance
– Execution core thermal/power density
• Emerging trends in wireless and embedded DSP industry
• Circuit solutions:
– Active and standby leakage power reduction strategies
– Multi-supply design: switching + leakage power benefits
– Energy-efficient arithmetic circuit technologies
– HW accelerators for specialized DSP applications
3
Technology Scaling 101
[Figure: MOSFET cross-section (gate, source, drain, body) with gate length L and oxide thickness Tox, next to the scaled device with 0.7 L and 0.7 Tox]
• Dimensions: 1 → 0.7
• Area: 1 × 1 = 1 → 0.7 × 0.7 = 0.49
• Delay: 1 → 0.7
• Freq = 1/Delay: 1 → 1/0.7 = 1.43
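The scaling numbers on this slide follow from simple arithmetic. The sketch below assumes ideal 0.7x linear scaling per generation; the function name `scale` and its parameters are illustrative and do not come from the slides.

```python
def scale(generations, s=0.7):
    """Ideal scaling after a number of process generations (s = linear shrink).

    Assumption: every generation shrinks L and Tox by the same factor s,
    and gate delay tracks the linear shrink.
    """
    dimension = s ** generations      # L and Tox
    area = dimension ** 2             # area scales as length x width
    delay = dimension                 # gate delay follows the linear dimension
    frequency = 1.0 / delay           # frequency is the inverse of delay
    return {"dimension": dimension, "area": area,
            "delay": delay, "frequency": frequency}


one_gen = scale(1)
for name, value in one_gen.items():
    print(f"{name:10s} {value:.2f}")
```

Calling `scale(2)` gives the compounded factors after two generations under the same idealized assumption.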
4
Leakage vs. Switching Power
• From a µP perspective, but true for DSPs too
• Ioff increases 3-5X per generation
• Active leakage power > 50% of total power
[Figure: Active power and active leakage power (Watts, 0-250) vs. technology: 250nm, 180nm, 130nm, 90nm, 65nm]
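To get a feel for how a 3-5X per-generation increase in Ioff compounds, the sketch below multiplies it across the four node transitions shown on the chart (250nm to 65nm). It assumes the quoted rate holds uniformly at every transition, which the slide does not state; the names are illustrative.

```python
def cumulative_growth(per_generation, generations):
    """Compound a per-generation multiplier over several generations."""
    return per_generation ** generations


nodes = ["250nm", "180nm", "130nm", "90nm", "65nm"]
transitions = len(nodes) - 1              # four node-to-node steps

low = cumulative_growth(3, transitions)   # 3X per generation
high = cumulative_growth(5, transitions)  # 5X per generation
print(f"Ioff growth over {transitions} generations: {low}X to {high}X")
```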
5
On-chip Interconnect Performance
• RC/µm increases 40-60% per generation
• Local inter-gate wires dominate critical-path delays
• Global wire lengths not scaling by 0.7x
[Figure: Delay (ns, log scale 0.001-1.0) vs. technology node (250-50 nm) for typical gate delay and interconnect delay; annotation: 30% per generation]
6
Process Variation Tolerance
• Significant variation in IOFF (hence Fmax spread)
• Worsening with process scaling
• Excess leakage dies: lack in robustness
• Low leakage dies: over-designed for robustness
[Figure: Number of dies (0-200) vs. normalized IOFF, fast corner (180nm CMOS, 110°C)]
[Figure: Frequency (0.9-1.4) vs. leakage (0-20); annotations: 30% (frequency), 20X (leakage)]
Process parameter variation tolerant design techniques
7
DSP Application Demands
• 2G: GSM, PDC, IS-95
• 2.5G: GPRS, EDGE, IS-95B
• 3G: WCDMA
[Figure: Capability vs. time (2001, 2003, 2005), rising from < 50 MIPS through > 100 MIPS to > 200 MIPS as usage moves from voice to data and applications; implementation labels: ASIC DSP hardware, ASIC DSP hardware assist]
• Smart cell-phones: $2B in ’02 to $15B in ’06!
• Huge demand for high-performance DSPs
8
Multimedia, Graphics, Enterprise…
[Figure: Capability vs. time (2001, 2003, 2005) in three tiers]
• < 50 MIPS, 4+ MB Flash, 0.5+ MB RAM
• > 100 MIPS, 16+ MB Flash, 8+ MB RAM
• 200+ MIPS, 64+ MB Flash, 16+ MB RAM
• Capabilities shown: simple user interface, calendar, notepad, email; OS, services and apps (full OS, GUI, browser, suite of apps); multimedia, graphics, enterprise (color screen, audio, graphics, speech recognition, multimedia, large files and applications, secure remote access, full OS and user interface, browser, suite of apps)
• Market is hungry for DSP MIPS (if you deliver, they will use it!)
9
Typical Performance Requirements
[Figure: MHz per task (log scale, 10-1000) and total required memory (< 4MB, 8MB, 16MB, 32MB, 64MB) for typical handheld tasks]
• Tasks shown: MP3 encode, ASCII browser, graphical browser (small screen), voice 128-bit encryption and decryption, MP3 playback, MPEG 4 playback (listed twice), robust handwriting recognition
• MHz ranges shown: 8 - 12, 25 - 50, 40 - 80, 80 - 100, 150 - 200, 200 - 300 MHz; additional label: Pocket PC
10
So, How Do We Meet This Surging Demand Within Given Power Envelope?
• Energy vs. Flexibility Trade-off
[Figure: Energy efficiency in MOPS/mW (log scale, 0.1-1000) vs. flexibility (coverage) for dedicated HW/ASIC, configurable processor/logic, digital signal processors or other ASIPs, and embedded processors; annotations: Berkeley’s Pleiades: 10-80 MOPS/mW; 1-2 MOPS/mW; SA110: 0.4 MOPS/mW]
Courtesy: Prof. J. Rabaey, UC Berkeley
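The MOPS/mW figures translate directly into a power budget for a given workload: power equals throughput divided by efficiency. The sketch below uses the efficiency values quoted on the slide; the 200 MOPS workload and all names are hypothetical, chosen only for illustration.

```python
# Efficiency values quoted on the slide (MOPS/mW).
EFFICIENCY_MOPS_PER_MW = {
    "SA110 embedded processor": 0.4,
    "Pleiades, low end": 10.0,
    "Pleiades, high end": 80.0,
}

WORKLOAD_MOPS = 200.0   # hypothetical workload, not from the slides


def power_mw(workload_mops, efficiency_mops_per_mw):
    """Power needed to sustain a workload at a given energy efficiency."""
    return workload_mops / efficiency_mops_per_mw


for name, eff in EFFICIENCY_MOPS_PER_MW.items():
    print(f"{name:26s} {power_mw(WORKLOAD_MOPS, eff):8.1f} mW")
```

Under these assumptions the same workload spans roughly two orders of magnitude in power, which is the trade-off the chart illustrates.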
11
Energy and Area Efficiency
Courtesy: Prof. Teresa Meng, Stanford
12
MOPS/mW Distinction: General-purpose vs. Dedicated
• DSP functions are more throughput-oriented
• Amenable for parallelism and pipelining (better power-performance optimization)
Courtesy: Prof. B. Brodersen, UC Berkeley, ISSCC’02
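One common way to see why throughput-oriented functions benefit from parallelism is the switching-power relation P = C·V²·f: duplicating a unit lets each copy run at a lower clock, which in turn may allow a lower supply voltage. The sketch below is a minimal illustration of that argument; the 0.7x voltage and 10% capacitance overhead are hypothetical values, not figures from the slides.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Switching power, P = C * V^2 * f (all quantities normalized)."""
    return capacitance * voltage ** 2 * frequency


def parallel_power_ratio(n_units, v_scaled, overhead=0.0):
    """Power of n parallel units at f/n and reduced voltage,
    relative to one unit at nominal voltage and full clock rate.

    Assumptions: throughput is unchanged, the lower clock rate is
    reachable at v_scaled, and 'overhead' is extra switched capacitance
    for routing and multiplexing.
    """
    baseline = dynamic_power(1.0, 1.0, 1.0)
    parallel = dynamic_power(n_units * (1.0 + overhead), v_scaled, 1.0 / n_units)
    return parallel / baseline


ratio = parallel_power_ratio(2, v_scaled=0.7, overhead=0.1)
print(f"2-way parallel design uses {ratio:.3f}x the baseline power")
```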
13
Emerging Trends in DSP Industry
[Figure: Normalized power efficiency (log scale, 1-1000) vs. flexibility for specialized hardware, programmable DSP, and embedded/µP]
• Specialized hardware: best energy efficiency, enables algorithm tuning + low clock rates
• Microprocessors: best flexibility
Prof. L. Clark, CICC 2002 [2]
14
Emerging Trends in DSP Industry
[Figure: Normalized power efficiency (log scale, 1-1000) vs. flexibility for specialized hardware, programmable DSP, and embedded/µP]
• Microprocessors add specialized HW and coprocessors with DSP functionality
• Specialized hardware: best energy efficiency, enables algorithm tuning + low clock rates
• Microprocessors: best flexibility
15
Example Case Study
Encryption and Speech Processing Specialized HW
• 153-380MHz, 53-500mW in 180nm CMOS, 1.0-1.8V
• 5.84M transistors, 36mm²
• Dedicated DES-encryption and speech processing accelerators
IBM 32b PowerPC Processor, Nowka et al, ISSCC 2002 [3]
16
Emerging Trends in DSP Industry
[Figure: Normalized power efficiency (log scale, 1-1000) vs. flexibility for specialized hardware, programmable DSP, and embedded/µP]
• DSPs add microcontroller functionality and specialized HW accelerators
• Specialized hardware: best energy efficiency, enables algorithm tuning + low clock rates
• Microprocessors: best flexibility
17
Example Case Study
Viterbi and Turbo Co-processors
• 600MHz, 4.8GOPS, 718mW in 130nm CMOS, 1.2V
• Dedicated Viterbi and Turbo decoding co-processor HW
• 64M transistors
• Integrated DMA controller, PCI, 1MB L2, 16K I$ & D$
TI VLIW DSP with 1MB L2 cache, Agarwala et al, ISSCC 2002 [4]
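The headline numbers for this DSP can be combined into two derived figures: operations per clock cycle and energy efficiency in MOPS/mW. The short calculation below uses only the values quoted on the slide; the variable names are illustrative.

```python
# Values quoted on the slide for the 130nm VLIW DSP.
clock_mhz = 600.0
throughput_gops = 4.8
power_mw = 718.0

throughput_mops = throughput_gops * 1000.0
ops_per_cycle = throughput_mops / clock_mhz     # operations issued per clock
mops_per_mw = throughput_mops / power_mw        # energy efficiency

print(f"{ops_per_cycle:.1f} ops/cycle, {mops_per_mw:.2f} MOPS/mW")
```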
18
Specialized Hardware Accelerators
• Specialized (fixed function) hardware is 10-100x more efficient than general purpose processors: Why?
– Trades hardware for power
– Allows very low clock rates
– Essential for some wireless functions
– Viterbi and Turbo decoding, speech recognition, encryption
• Allows custom algorithms and coefficients to limit power
– Use shifts instead of multiplies
• Cost is flexibility
– Fixed algorithms and coefficients
– As new applications and wireless standards emerge, is this enough?
– How does this cover the application space?
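"Use shifts instead of multiplies" refers to replacing a multiplication by a fixed coefficient with a few shifts and adds, one per set bit of the coefficient. The sketch below shows the idea in software for a non-negative integer coefficient; the function name is hypothetical, and a hardware implementation would wire the shifts rather than loop.

```python
def shift_add_multiply(x, coeff):
    """Multiply integer x by a fixed non-negative coefficient
    using only shifts and adds (one add per set bit of coeff)."""
    if coeff < 0:
        raise ValueError("coefficient must be non-negative in this sketch")
    result = 0
    shift = 0
    while coeff:
        if coeff & 1:
            result += x << shift    # add x * 2^shift for this set bit
        coeff >>= 1
        shift += 1
    return result


# x * 10 becomes (x << 3) + (x << 1): two shifts and one add.
print(shift_add_multiply(7, 10))
```

Coefficients with few set bits are cheapest, which is one reason fixed-function designs can choose coefficients to limit power.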
19
Reconfigurable Processors
FPGA – Fine Grain Reconfigurable Fabric
• Fine-grain gate-level functions
• Array of MUXes to implement any N-input boolean function
• Speed sacrificed for generality
Coarse Grain Reconfigurable Fabric
• Moderate grain function blocks
• Collections of Add, Mpy, Mux, …
• Interconnect overhead is moderate to low
• If functions and connectivity are known, can be highly optimized
Courtesy: Prof. F. Kurdahi, UCI
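The fine-grain case can be illustrated with a lookup table built from a tree of 2:1 multiplexers: 2^N configuration bits hold the truth table, and each input selects between halves of the remaining candidates. The sketch below is a software model of that idea; the class and helper names are hypothetical and not taken from any FPGA toolchain.

```python
class MuxLUT:
    """N-input lookup table evaluated as a tree of 2:1 multiplexers."""

    def __init__(self, n_inputs, config_bits):
        if len(config_bits) != 1 << n_inputs:
            raise ValueError("need 2**n_inputs configuration bits")
        self.n_inputs = n_inputs
        self.config = list(config_bits)

    def evaluate(self, inputs):
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        level = self.config
        for bit in inputs:
            # One mux level: each 2:1 mux picks one of a pair of candidates.
            level = [level[2 * i + bit] for i in range(len(level) // 2)]
        return level[0]


def program_lut(n_inputs, fn):
    """Fill the configuration bits from any boolean function of n inputs."""
    config = []
    for index in range(1 << n_inputs):
        bits = [(index >> k) & 1 for k in range(n_inputs)]
        config.append(int(fn(*bits)))
    return MuxLUT(n_inputs, config)


majority = program_lut(3, lambda a, b, c: (a + b + c) >= 2)
xor3 = program_lut(3, lambda a, b, c: a ^ b ^ c)
print(majority.evaluate([1, 1, 0]), xor3.evaluate([1, 1, 1]))
```

The same structure implements any function by changing only the configuration bits, which is the generality the slide says is paid for in speed.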
20
Generic Reconfigurable Architecture
[Figure: Array of fine/coarse grain datapath tiles and registers, with a configuration control block]
21
How Do Reconfigurable Processors Work?
• Execute one algorithm/protocol at any given time
• Each algorithm is ‘configured’ from the building blocks
• Time between subsequent configurations: ~1-10ms
• Configuration Control unit decides which algorithm to execute when
[Figure: Timeline with the fabric configured for Protocol 1; ~1-10 ms between configurations]
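The time-multiplexing described here can be sketched as a simple timeline: the whole fabric is given to one protocol per slot, and the configuration control picks the next protocol at each slot boundary. The sketch below assumes slot lengths within the ~1-10 ms range quoted on the slide and ignores reconfiguration overhead, which the slides do not quantify; all names are hypothetical.

```python
def build_timeline(slots):
    """Return (start_ms, end_ms, protocol) tuples for a sequence of slots.

    slots: list of (protocol_name, slot_length_ms). Only one protocol
    occupies the fabric at any time. Reconfiguration overhead is
    assumed to be zero in this sketch.
    """
    timeline = []
    now = 0.0
    for protocol, slot_ms in slots:
        if not 1.0 <= slot_ms <= 10.0:
            raise ValueError("slot length outside the ~1-10 ms range")
        timeline.append((now, now + slot_ms, protocol))
        now += slot_ms
    return timeline


timeline = build_timeline([("Protocol 1", 5.0), ("Protocol 2", 2.0), ("Protocol 3", 10.0)])
for start, end, protocol in timeline:
    print(f"{start:5.1f} to {end:5.1f} ms: {protocol}")
```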
22
How Do Reconfigurable Processors Work?
[Figure: Same timeline, with the fabric reconfigured for Protocol 2 after ~1-10 ms]
23
How Do Reconfigurable Processors Work?
[Figure: Same timeline, with the fabric reconfigured for Protocol 3 after ~1-10 ms]
24
Standby Leakage Reduction: Sleep Transistor Design
• Motivation: Cut off power supply in sleep-mode
• Insert “sleep” transistor between main supply and functional unit’s supply rails
• Latches tied to main supply rails: retain state
[Figure: Functional unit between virtual Vcc and virtual Vss, each connected to the main rail through a sleep transistor]
Standby leakage benefit for 5% delay penalty:
                          MTCMOS   Boosted Sleep   Non-Boosted Sleep
Sleep-TR size             5.1%     2.3%            3.2%
Leakage power reduction   1450X    3130X           11.5X
Virtual supply bounce     60 mV    59 mV           58 mV
25
Switching + Leakage Reduction: Forward Body Bias
[Figure: Inverter with body-bias terminals Vbp (PMOS) and Vbn (NMOS), marked −Ve and +Ve]
[Figure: Normalized total power (0-4) vs. frequency (0.6-1.4 GHz) for ZBB and FBB 500mV; Vcc: 1, 1.05, 1.1 … 1.5V; 110°C, α=0.1; points marked at 1.1V and 1.2V]
[Figure: FBB/ZBB leakage ratio (0-30) vs. frequency (0.6-1.4 GHz), 27°C]
• 20% power reduction at 1GHz
• 8% higher frequency at iso-power
• 20X lower idle-mode leakage
A. Keshavarzi et al, 2002 Symp. VLSI Circuits [6]
26
Multi-Vcc Usage Model
[Figure: Four core supplies, VCCcore1 to VCCcore4, each driven by its own regulator, VRM1 to VRM4]
• Optimize performance and power with parallelism and voltage
27
Switching + Leakage Reduction: Multi Supply Design
• Active leakage benefit with lower supply voltage
• Exponential subthreshold and gate leakage reduction
[Figure: Measured leakage in 1.2V, 130nm process: normalized leakage energy (0-12) vs. VCC (0-1.4 V) for subthreshold and gate leakage, at worst-case and nominal corners; annotation: 79%]
[Figure: 130nm L1 cache leakage: normalized leakage (0-100) vs. voltage (0-1.5 V)]
R. Krishnamurthy et al, 2002 Symp. VLSI Circuits [7]
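The two benefits of a lower supply can be compared with a toy model: switching energy scales as Vcc², while leakage current is taken to grow exponentially with Vcc. The sketch below is illustrative only; the exponential constant `v0` is a hypothetical fitting parameter and is not derived from the measured curves on the slide.

```python
import math


def switching_energy(vcc, capacitance=1.0):
    """Switching energy per transition, E = C * Vcc^2 (normalized)."""
    return capacitance * vcc ** 2


def leakage_power(vcc, i0=1.0, v0=0.3):
    """Toy leakage model: current grows as exp(Vcc / v0).

    v0 = 0.3 V is a hypothetical constant chosen for illustration.
    """
    return vcc * i0 * math.exp(vcc / v0)


def savings(v_low, v_nominal=1.2):
    """Fractional switching and leakage savings when moving to v_low."""
    sw = 1.0 - switching_energy(v_low) / switching_energy(v_nominal)
    lk = 1.0 - leakage_power(v_low) / leakage_power(v_nominal)
    return sw, lk


sw_saving, lk_saving = savings(0.9)
print(f"switching saving: {sw_saving:.1%}, leakage saving: {lk_saving:.1%}")
```

With these assumed constants the leakage saving exceeds the quadratic switching saving, which is the qualitative point of the slide.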
28
Adaptive Vcc: Variation-tolerant Circuits
• Motive: change Vcc adaptively to reduce impact of parameter variations
• Large Fmax vs. leakage spread (worsening with scaling)
• Lower Vdd on leakage-limited circuits (subject to stability limits)
• Higher Vdd on speed-limited circuits (subject to reliability limits)
[Figure: Die count (0-100%) per frequency bin (0.85-1.05): fixed Vdd (1.05V) vs. adaptive Vdd (20mV resolution)]
[Figure: Die count (0-100%) per frequency bin (0.85-1.05): adaptive Vdd + body bias vs. adaptive Vdd + WID body bias]
[Figure: Test chip, 4.5 mm x 5.3 mm, 21 sub-sites within 1 die]
[Figure: Frequency (0.9-1.4) vs. leakage (0-20); annotations: 30%, 20X]
J. Tschanz et al, 2002 Symp. VLSI Circuits
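The adaptive-Vdd idea can be sketched as a per-die search: step the supply in 20 mV increments and keep the lowest setting that meets the frequency target while staying under a leakage limit. The sketch below is not the method of the cited paper; the linear frequency model, cubic leakage model, limits, and all names are hypothetical placeholders.

```python
def pick_vdd(speed_factor, leak_factor, f_target=1.0, leak_limit=1.5,
             v_min=0.9, v_max=1.2, step=0.02, v_ref=1.05):
    """Return the lowest Vdd (in 20 mV steps) that meets f_target, or None.

    Hypothetical models, normalized to v_ref = 1.05 V:
      frequency = speed_factor * (Vdd / v_ref)
      leakage   = leak_factor  * (Vdd / v_ref) ** 3
    """
    n_steps = int(round((v_max - v_min) / step))
    for i in range(n_steps + 1):
        vdd = round(v_min + i * step, 2)
        freq = speed_factor * vdd / v_ref
        leak = leak_factor * (vdd / v_ref) ** 3
        if freq >= f_target:
            # Raising Vdd further only increases leakage, so stop here.
            return vdd if leak <= leak_limit else None
    return None


slow_die = pick_vdd(speed_factor=0.95, leak_factor=0.5)    # speed-limited die
leaky_die = pick_vdd(speed_factor=1.10, leak_factor=1.8)   # leakage-limited die
print(slow_die, leaky_die)
```

In this toy model the slow die is rescued by a higher Vdd, while the fast, leaky die passes the leakage limit only because its Vdd is lowered below 1.05 V.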
29
Viterbi Decoder Organization
One of the most performance and power critical algorithms in wireless baseband DSP
[Figure: Encoded bits → Branch Metric Unit (BMU) → branch error → Path Metric Unit (PMU) → transitions → Traceback Unit (TBU) → decoded bits]
• BMU calculates errors for all branches
• PMU accumulates errors and outputs transitions with minimum error
• TBU traces minimum error path back to get best estimate of original input
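The three units map naturally onto three functions. The sketch below is a minimal hard-decision software decoder for a small 4-state, rate-1/2 code with generators 7 and 5 (octal); that code is an assumption chosen for brevity, not the 64-state design discussed in these slides, and all function names are illustrative.

```python
K = 3                          # constraint length (assumed small example code)
GENERATORS = (0b111, 0b101)    # rate-1/2 generator polynomials (assumed)
N_STATES = 1 << (K - 1)
INF = float("inf")


def parity(x):
    return bin(x).count("1") & 1


def branch_output(state, bit):
    """Encoder output and next state for one (state, input bit) branch."""
    reg = (bit << (K - 1)) | state
    return tuple(parity(reg & g) for g in GENERATORS), reg >> 1


def encode(bits):
    state, symbols = 0, []
    for b in bits:
        out, state = branch_output(state, b)
        symbols.append(out)
    return symbols


def branch_metrics(received):
    """BMU: Hamming error of every branch against one received symbol."""
    return {(s, b): sum(e != r for e, r in zip(branch_output(s, b)[0], received))
            for s in range(N_STATES) for b in (0, 1)}


def add_compare_select(path_metrics, bm):
    """PMU: accumulate errors and keep the minimum-error transition per state."""
    new_metrics = [INF] * N_STATES
    decisions = [None] * N_STATES
    for s in range(N_STATES):
        for b in (0, 1):
            nxt = branch_output(s, b)[1]
            candidate = path_metrics[s] + bm[(s, b)]
            if candidate < new_metrics[nxt]:
                new_metrics[nxt] = candidate
                decisions[nxt] = (s, b)
    return new_metrics, decisions


def traceback(history, final_state):
    """TBU: follow the stored transitions backwards to recover the input bits."""
    bits, state = [], final_state
    for decisions in reversed(history):
        state, bit = decisions[state]
        bits.append(bit)
    return bits[::-1]


def viterbi_decode(symbols):
    metrics = [0] + [INF] * (N_STATES - 1)     # encoder starts in state 0
    history = []
    for received in symbols:
        metrics, decisions = add_compare_select(metrics, branch_metrics(received))
        history.append(decisions)
    best = min(range(N_STATES), key=lambda s: metrics[s])
    return traceback(history, best)


message = [1, 0, 1, 1, 0, 0, 1, 0]
noisy = encode(message)
noisy[2] = (noisy[2][0] ^ 1, noisy[2][1])      # inject one channel bit error
decoded = viterbi_decode(noisy)
print(decoded == message)
```

A hardware design would run the add-compare-select step for all states in parallel and bound the traceback length; this sketch stores the full history for clarity.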
30
90nm CMOS Implementation
• 90nm dual-Vt 7-metal CMOS technology
• 64-state radix-2 design: 40mW at 500Mbps, 1.2V
[Figure: Layout with 8 ACS units, BMUs, PM memory, path memory, TB control and TB memory; ACS: 230µm x 210µm; traceback: 260µm x 510µm]
M. Anders et al, 2004 VLSI Circuits Symp. [10]
31
Summary
• Fastest reported 64-state Viterbi accelerator
– Total power at 2 GHz (500Mbps) is 40mW (1.2V)
• Lowest power 802.11a implementation
– Total power at 216 MHz (54Mbps) is 5mW (0.7V)
Technology               90nm dual-Vt CMOS
ACS area                 230µm x 210µm (0.048 mm²)
Traceback area           260µm x 510µm (0.133 mm²)
Viterbi states           64-state
ACS precision            10 bits
Radix-2 max. TB length   96 symbols
M. Anders et al, 2004 VLSI Circuits Symp. [10]
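The two operating points can be compared on a per-bit basis, and the quoted block areas can be checked from their dimensions. The calculation below uses only numbers from this summary; the helper names are illustrative.

```python
def energy_per_bit_pj(power_mw, rate_mbps):
    """Energy per decoded bit in pJ (mW / Mbps = nJ per bit)."""
    return power_mw / rate_mbps * 1000.0


def cycles_per_bit(clock_mhz, rate_mbps):
    """Clock cycles spent per decoded bit."""
    return clock_mhz / rate_mbps


def area_mm2(width_um, height_um):
    return width_um * height_um * 1e-6


fast = energy_per_bit_pj(40.0, 500.0)     # 2 GHz, 1.2V operating point
slow = energy_per_bit_pj(5.0, 54.0)       # 216 MHz, 0.7V operating point
print(f"{fast:.1f} pJ/bit at 500Mbps, {slow:.1f} pJ/bit at 54Mbps")
print(cycles_per_bit(2000.0, 500.0), cycles_per_bit(216.0, 54.0))
print(round(area_mm2(230, 210), 3), round(area_mm2(260, 510), 3))
```

Both operating points work out to four clock cycles per decoded bit.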
32
Streaming Media Accelerators: 32-bit MAC [ISSCC’03]
• 5GHz 32-bit multiply-accumulate unit
• Targeted for special purpose streaming processors/graphics
Die Area       1.32 x 1.57 mm²
Process        90nm CMOS
Interconnect   1 poly, 7 metal
Transistors    230K
Frequency      5GHz
Maximum Vcc    1.5V
Chip Power     1.2W @ 1.2V
Pad Count      75
[Figure: Die photo with multiplier, aligner, accumulate, normalize, FIFOs & scan, and CLK blocks]
S. Vangal et al, ISSCC’03 [11]
33
32-bit MAC Architecture Overview
[Figure: FIFO A and FIFO B (32 vectors deep, 32 bits wide) feed the multiplier (x) and accumulator (+) of the MAC; results go to FIFO C; FIFO control and scan registers with Scan In / Scan Out]
• Single-cycle 5GHz 32-bit MAC loop
• New Multiplier and Accumulator ALU circuit techniques
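Functionally, the datapath drains two operand FIFOs, multiplies and accumulates one pair per cycle, and writes results to an output FIFO. The sketch below models only that behavior in software; the accumulation length and all names are hypothetical, and it says nothing about the circuit techniques of the actual design.

```python
from collections import deque


def mac_stream(fifo_a, fifo_b, fifo_c, vector_len):
    """Pop operand pairs from two FIFOs, accumulate vector_len products,
    and push each accumulated sum into the output FIFO.

    vector_len is a hypothetical parameter for this sketch.
    """
    while len(fifo_a) >= vector_len and len(fifo_b) >= vector_len:
        acc = 0.0
        for _ in range(vector_len):
            acc += fifo_a.popleft() * fifo_b.popleft()   # one MAC per cycle
        fifo_c.append(acc)


fifo_a = deque([1.0, 2.0, 3.0, 4.0])
fifo_b = deque([0.5, 0.5, 2.0, 1.0])
fifo_c = deque()
mac_stream(fifo_a, fifo_b, fifo_c, vector_len=2)
print(list(fifo_c))
```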
34
TCP/IP Off-load Accelerator [ISSCC’03]
• 10GHz TCP/IP offload accelerator unit
• Targeted for 10Gbps Ethernet packet processing accel.
Core Area      2.23 x 3.54 mm²
Process        90nm dual-VT CMOS
Interconnect   1 poly, 7 metal
Transistors    260K
Frequency      10GHz
Max Vcc        1.5V
Chip power     1.9W @ 1.2V
Pad count      306
[Figure: Die photo with TCB, exec core, PLL, ROB, ROM, CLB, input seq and send buffer blocks]
Y. Hoskote et al, ISSCC’03 [12]
35
10GHz TCP/IP Execution Core
[Figure: Block diagram with rcv buffer, working register, TCB, key CLB, ROB, scratch registers, PC, instr ROM, IR decode and a pipelined ALU (10GHz sparse-tree ALU); signals: index, ALU input, ALU output, branch address, start address, next address; bus widths shown: 264, 6, 9, 32, 32, 112, 96]
• At-speed packet processing execution core for 10Gbps
36
Conclusions
• Several Technology and Circuit Challenges Beyond 65nm
– Switching and active leakage energy
– Leakage tolerance and robustness
– On-chip interconnect scaling
– Process parameter variations and tolerance
– Execution core thermal/power density
• Emerging trends in DSP industry
– Specialized hardware accelerators and co-processors
– Reconfigurable engines
• Circuit solutions:
– Active and standby leakage power reduction strategies
– Multi-supply design: switching + leakage power benefits
– Energy-efficient arithmetic circuit technologies
– DSP HW accelerators for Viterbi, Streaming media, TCP/IP
37
References
[1] R. Krishnamurthy et al, “High-performance and low-power challenges for sub-70nm microprocessor circuits”, IEEE Custom Integrated Circuits Conference 2002, pp. 125-128.
[2] L. Clark et al, “Trends and challenges for wireless embedded DSPs”, IEEE Custom Integrated Circuits Conference 2003, pp. 171-176.
[3] K. Nowka et al, “A 0.9 V to 1.95 V dynamic voltage-scalable and frequency-scalable 32 b PowerPC processor”, ISSCC 2002, pp. 340-341.
[4] S. Agarwala et al, “A 600 MHz VLIW DSP”, ISSCC 2002, pp. 56-57.
Reconfigurable processors:
1. http://brass.cs.berkeley.edu/
2. http://www.eng.uci.edu/comp.arch/
3. http://www.pactcorp.com/xneu/px_xpphw.html
[5] J. Tschanz et al, “Design optimizations of a high performance microprocessor using combinations of dual-Vt allocation and transistor sizing”, Symposium on VLSI Circuits 2002, pp. 218-219.
[6] A. Keshavarzi et al, “Forward body bias for microprocessors in 130nm technology generation and beyond”, Symposium on VLSI Circuits 2002, pp. 312-315.
[7] R. Krishnamurthy et al, “Dual supply voltage clocking for 5 GHz 130 nm integer execution core”, Symposium on VLSI Circuits 2002, pp. 128-129.
[8] S. Mathew et al, “A 4 GHz 130 nm address generation unit with 32-bit sparse-tree adder core”, Symposium on VLSI Circuits 2002, pp. 126-127.
[9] S. Mathew et al, “A 4GHz 300mW 64b integer execution ALU with dual supply voltages in 90nm CMOS”, ISSCC 2004, pp. 162-163.
[10] M. Anders et al, “A 64-state 2GHz 500Mbps 40mW Viterbi accelerator in 90nm CMOS”, Symposium on VLSI Circuits 2004, pp. 174-175.
[11] S. Vangal et al, “A 5 GHz floating point multiply-accumulator in 90 nm dual Vt CMOS”, ISSCC 2003, pp. 334-335.
[12] Y. Hoskote et al, “A 10GHz TCP/IP offload accelerator for 10Gb/s Ethernet in 90nm CMOS”, ISSCC 2003, pp. 258-259.