2/8/06 D&T Seminar 1
Multi-Core Parallelism for Low-Power Design
Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University
http://www.eng.auburn.edu/[email protected]
SIA Roadmap for Processors (1999)
Year                     1999   2002   2005   2008   2011   2014
Feature size (nm)        180    130    100    70     50     35
Logic transistors/cm²    6.2M   18M    39M    84M    180M   390M
Clock (GHz)              1.25   2.1    3.5    6.0    10.0   16.9
Chip size (mm²)          340    430    520    620    750    900
Power supply (V)         1.8    1.5    1.2    0.9    0.6    0.5
High-perf. power (W)     90     130    160    170    175    183
Source: http://www.semichips.org
ISSCC, Feb. 2001, Keynote
“Ten years from now, microprocessors will run at 10 GHz to 30 GHz and be capable of processing 1 trillion operations per second -- about the same number of calculations that the world's fastest supercomputer can perform now.
“Unfortunately, if nothing changes these chips will produce as much heat, for their proportional size, as a nuclear reactor. . . .”
Patrick P. Gelsinger, Senior Vice President, General Manager, Digital Enterprise Group, Intel Corp.
VLSI Chip Power Density
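[Figure: power density (W/cm², logarithmic scale from 1 to 10,000) versus year (1970 to 2010) for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, and P6, with reference levels marked for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun’s surface.]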
Source: Intel
Power Dissipation in CMOS Logic (0.25µ)
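$P_{total}(0 \to 1) = C_L V_{DD}^2 + t_{sc} V_{DD} I_{peak} + V_{DD} I_{leakage}$
Approximate share of each term, in order: 75% (dynamic switching), 5% (short circuit), 20% (leakage)
[Figure: CMOS gate with supply $V_{DD}$ driving a load capacitance $C_L$.]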
Low-Power Datapath Architecture
• Lower supply voltage
  – This slows down circuit speed
  – Use parallel computing to gain the speed back
• Works well when threshold voltage is also lowered.
• About 60% reduction in power obtainable.
• Reference: A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Boston: Kluwer Academic Publishers (now Springer), 1995.
A Reference Datapath
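[Figure: reference datapath with an input register, combinational logic, and an output register, clocked by CK; total switched capacitance labeled $C_{ref}$.]
Supply voltage = $V_{ref}$
Total capacitance switched per cycle = $C_{ref}$
Clock frequency = $f$
Power consumption: $P_{ref} = C_{ref} V_{ref}^2 f$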
A Parallel Architecture
[Figure: N copies of the combinational logic (Copy 1, Copy 2, ..., Copy N), each with its own input register clocked at f/N; an N-to-1 multiplexer and an output register clocked at f; a multiphase clock generator and mux control driven by CK.]
Each copy processes every Nth input and operates at reduced voltage.
Supply voltage: $V_N \le V_1 = V_{ref}$
N = degree of parallelism
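Power
$P_N = P_{proc} + P_{overhead}$
$P_{proc} = N (C_{inreg} + C_{comb}) V_N^2 \frac{f}{N} + C_{outreg} V_N^2 f$
$\qquad = (C_{inreg} + C_{comb} + C_{outreg}) V_N^2 f$
$\qquad = C_{ref} V_N^2 f$
$P_{overhead} = C_{overhead} V_N^2 f \approx \delta C_{ref} (N - 1) V_N^2 f$
$P_N = [1 + \delta (N - 1)] C_{ref} V_N^2 f$
$\frac{P_N}{P_1} = [1 + \delta (N - 1)] \frac{V_N^2}{V_{ref}^2}$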
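The power ratio derived on this slide can be evaluated directly. The following is a minimal sketch, not part of the original seminar; the function name and the sample values (two copies at 2.9 V against a 5 V reference, δ = 0.15) are assumptions chosen only for illustration.

```python
def parallel_power_ratio(n, v_n, v_ref, delta):
    """Return P_N / P_1 = [1 + delta * (N - 1)] * (V_N / V_ref)**2.

    n      -- degree of parallelism N (number of logic copies)
    v_n    -- reduced supply voltage V_N of the parallel design
    v_ref  -- supply voltage V_ref of the reference (N = 1) design
    delta  -- overhead capacitance per added copy, as a fraction of C_ref
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    overhead_factor = 1.0 + delta * (n - 1)
    return overhead_factor * (v_n / v_ref) ** 2


# Illustrative (assumed) operating point: N = 2, V_N = 2.9 V, V_ref = 5 V, delta = 0.15.
ratio = parallel_power_ratio(n=2, v_n=2.9, v_ref=5.0, delta=0.15)
print(f"P_N / P_1 = {ratio:.3f}")
```

With N = 1 and V_N = V_ref the function returns 1, as the reference design requires.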
Voltage vs. Speed
Delay of a gate, $T \approx \frac{C_L V_{ref}}{I} = \frac{C_L V_{ref}}{k (W/L) (V_{ref} - V_t)^2}$
where
I is the saturation current
k is a technology parameter
W/L is the width-to-length ratio of the transistor
$V_t$ is the threshold voltage
[Figure: normalized gate delay T (0.0 to 4.0) versus supply voltage for 1.2μ CMOS, with operating points marked at $V_{ref}$ = 5 V (N = 1), $V_2$ = 2.9 V (N = 2), and $V_3$ (N = 3).]
Voltage reduction slows down as we get closer to $V_t$.
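Under the delay expression above, an N-way parallel design can tolerate a gate delay N times that of the reference, which fixes the reduced supply voltage. The sketch below solves that condition in closed form. It is an illustration under the slide's simplified delay model, with an assumed threshold voltage of 0.8 V; it need not reproduce the values marked on the plot (such as V2 = 2.9 V), since this model is only an approximation.

```python
import math


def supply_voltage_for_parallelism(n, v_ref, v_t):
    """Solve V / (V - Vt)^2 = n * V_ref / (V_ref - Vt)^2 for V > Vt.

    The left side is the gate delay at supply V (up to a constant factor);
    the right side is n times the delay at the reference supply.
    """
    if n < 1 or v_t < 0 or v_ref <= v_t:
        raise ValueError("require n >= 1 and 0 <= v_t < v_ref")
    k = n * v_ref / (v_ref - v_t) ** 2
    # k*(V - Vt)^2 = V  is the quadratic  k*V^2 - (2*k*Vt + 1)*V + k*Vt^2 = 0.
    # Its discriminant simplifies to 4*k*Vt + 1; the larger root is the one above Vt.
    return ((2.0 * k * v_t + 1.0) + math.sqrt(4.0 * k * v_t + 1.0)) / (2.0 * k)


# Assumed values for illustration: V_ref = 5 V, Vt = 0.8 V.
for n in (1, 2, 3):
    v_n = supply_voltage_for_parallelism(n, v_ref=5.0, v_t=0.8)
    print(f"N = {n}: V_N = {v_n:.2f} V")
```

For Vt = 0 the solution reduces to V_N = V_ref / N, and for N = 1 it returns V_ref.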
Increasing Multiprocessing
[Figure: $P_N/P_1$ (0.0 to 1.0) versus N (1 to 12) for 1.2μ CMOS with $V_{ref}$ = 5 V; curves for $V_t$ = 0 V (extreme case), $V_t$ = 0.4 V, and $V_t$ = 0.8 V.]
Extreme Cases: Vt = 0
Delay, $T \propto 1/V_{ref}$
For N processing elements, delay = $NT$ → $V_N = V_{ref}/N$
$\frac{P_N}{P_1} = [1 + \delta (N - 1)] \frac{1}{N^2} \to \frac{1}{N}$
For negligible overhead, $\delta \to 0$:
$\frac{P_N}{P_1} \approx \frac{1}{N^2}$
For $V_t > 0$, power reduction is less and there will be an optimum value of N.
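The claim that an optimum N exists for Vt > 0 can be explored numerically. The sketch below combines the power ratio [1 + δ(N – 1)](V_N/V_ref)² with the simplified gate-delay model from the Voltage vs. Speed slide and scans N from 1 to 12. The overhead δ = 0.3 and the threshold Vt = 0.8 V are hypothetical values chosen for illustration, and the function names are not from the seminar.

```python
import math


def scaled_voltage(n, v_ref, v_t):
    """Supply voltage at which gate delay is n times the delay at v_ref.

    Uses the simplified model: delay proportional to V / (V - Vt)^2.
    """
    k = n * v_ref / (v_ref - v_t) ** 2
    return ((2.0 * k * v_t + 1.0) + math.sqrt(4.0 * k * v_t + 1.0)) / (2.0 * k)


def power_ratio(n, v_ref, v_t, delta):
    """P_N / P_1 = [1 + delta*(N - 1)] * (V_N / V_ref)^2 at the scaled voltage."""
    v_n = scaled_voltage(n, v_ref, v_t)
    return (1.0 + delta * (n - 1)) * (v_n / v_ref) ** 2


def best_parallelism(v_ref, v_t, delta, n_max=12):
    """Return the N in 1..n_max that minimizes P_N / P_1."""
    return min(range(1, n_max + 1), key=lambda n: power_ratio(n, v_ref, v_t, delta))


# Assumed values for illustration: V_ref = 5 V, Vt = 0.8 V, delta = 0.3.
for n in range(1, 13):
    print(f"N = {n:2d}: P_N/P_1 = {power_ratio(n, 5.0, 0.8, 0.3):.3f}")
print("Lowest-power N in this scan:", best_parallelism(5.0, 0.8, 0.3))
```

With Vt = 0 and δ = 0 the ratio is exactly 1/N², so the scan simply picks the largest N; with Vt > 0 and a nonzero overhead the minimum moves to an interior value of N.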
Example: Multiplier Core
• Specification:
  • 200 MHz clock
  • 15 W dissipation @ 5 V
  • Low-voltage operation, VDD ≥ 1.5 volts
  • Relative clock rate = $\frac{(V_{DD} - 0.5)^2}{20.25}$
• Problem:
  • Integrate multiplier core on an SoC
  • Power budget for multiplier ~ 5 W
A Multicore Design
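[Figure: five multiplier cores (Multiplier Core 1, Multiplier Core 2, ..., Multiplier Core 5) with registers, each core clocked at 40 MHz; a 5-to-1 mux between the cores and the output; a multiphase clock generator and mux control driven by the 200 MHz clock CK; input and output at 200 MHz.]
Core clock frequency = 200/N MHz; N should divide 200.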
How Many Cores?
• For N cores:
  • Clock frequency = 200/N MHz
  • Supply voltage, $V_{DDN} = 0.5 + (20.25/N)^{1/2}$ volts
  • Assuming 10% overhead per core, power dissipation = $15\,[1 + 0.1(N - 1)]\,(V_{DDN}/5)^2$ watts
Design Tradeoffs
Number of cores, N   Clock (MHz)   Core supply VDDN (volts)   Total power (watts)
1                    200           5.00                       15.0
2                    100           3.68                       8.94
4                    50            2.75                       5.90
5                    40            2.51                       5.29
8                    25            2.10                       4.50
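The rows of the Design Tradeoffs table can be recomputed from the expressions on the "How Many Cores?" slide. This is a minimal sketch; the function names are illustrative, and the constants (0.5 V threshold, 20.25, 15 W at 5 V, 10% overhead per core) are taken from the slides. The computed values agree with the table to within a few hundredths; the small differences are probably due to rounding of the voltage in the table.

```python
def core_clock_mhz(n):
    """Clock frequency of each core when n cores share the 200 MHz input rate."""
    return 200.0 / n


def core_supply_voltage(n):
    """V_DDN = 0.5 + sqrt(20.25 / n), from relative clock rate (V_DD - 0.5)^2 / 20.25 = 1/n."""
    return 0.5 + (20.25 / n) ** 0.5


def total_power_watts(n, overhead_per_core=0.10):
    """Total power = 15 * [1 + overhead*(n - 1)] * (V_DDN / 5)^2 watts."""
    v = core_supply_voltage(n)
    return 15.0 * (1.0 + overhead_per_core * (n - 1)) * (v / 5.0) ** 2


print("N  Clock (MHz)  VDDN (V)  Power (W)")
for n in (1, 2, 4, 5, 8):
    print(f"{n}  {core_clock_mhz(n):11.0f}  {core_supply_voltage(n):8.2f}  {total_power_watts(n):9.2f}")
```

Note that every listed N keeps the supply above the 1.5 V minimum from the specification, and that N = 8 is the only listed configuration that falls below the 5 W budget.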
Power Reduction in Processors
• Just about everything is used.
• Hardware methods:
  • Voltage reduction for dynamic power
  • Dual-threshold devices for leakage reduction
  • Clock gating, frequency reduction
  • Sleep mode
• Architecture:
  • Instruction set
  • Hardware organization
• Software methods
Parallel Architecture
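[Figure: a single processor clocked at f (input to output) compared with two processors in parallel, each clocked at f/2, with input and output at rate f.]
Single processor: Capacitance = C, Voltage = V, Frequency = f, Power = $CV^2f$
Two parallel processors: Capacitance = 2.2C, Voltage = 0.6V, Frequency = 0.5f, Power = $0.396\,CV^2f$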
Pipeline Architecture
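[Figure: a single processor clocked at f (input to output) compared with a two-stage pipeline in which the processor is split into two halves (½ Proc.) separated by registers, all clocked at f.]
Single processor: Capacitance = C, Voltage = V, Frequency = f, Power = $CV^2f$
Two-stage pipeline: Capacitance = 1.2C, Voltage = 0.6V, Frequency = f, Power = $0.432\,CV^2f$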
Approximate Trend
              n-parallel proc.   n-stage pipeline proc.
Capacitance   nC                 C
Voltage       V/n                V/n
Frequency     f/n                f
Power         $CV^2f/n^2$        $CV^2f/n^2$
Chip area     n times            10-20% increase
Source: G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers, 1998.
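Both the specific figures on the Parallel Architecture and Pipeline Architecture slides and the idealized trend in this table follow from the same relation, power = capacitance × voltage² × frequency. The sketch below checks them with all quantities normalized to the single-processor values C, V, and f; the function names are illustrative and not from the seminar.

```python
def relative_power(cap_factor, volt_factor, freq_factor):
    """Dynamic power relative to C*V^2*f, given scale factors for C, V, and f."""
    return cap_factor * volt_factor ** 2 * freq_factor


# Figures quoted on the two architecture slides.
parallel_example = relative_power(2.2, 0.6, 0.5)   # two processors at f/2
pipeline_example = relative_power(1.2, 0.6, 1.0)   # two-stage pipeline at f
print(f"Parallel: {parallel_example:.3f} CV^2f")
print(f"Pipeline: {pipeline_example:.3f} CV^2f")


def ideal_parallel(n):
    """Idealized n-parallel trend: capacitance nC, voltage V/n, frequency f/n."""
    return relative_power(n, 1.0 / n, 1.0 / n)


def ideal_pipeline(n):
    """Idealized n-stage pipeline trend: capacitance C, voltage V/n, frequency f."""
    return relative_power(1.0, 1.0 / n, 1.0)


# Both idealized cases reduce to 1/n^2 of the original power.
for n in (2, 4):
    print(n, ideal_parallel(n), ideal_pipeline(n))
```

The idealized trend for n = 2 gives 0.25, lower than the 0.396 and 0.432 quoted on the slides, because those examples include extra capacitance and scale the voltage only to 0.6V rather than V/2.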
Multicore Processors
[Figure: performance based on SPECint2000 and SPECfp2000 benchmarks versus year (2000, 2004, 2008), comparing multicore and single-core processors.]
Computer, May 2005, p. 12
Multicore Processors
• D. Geer, “Chip Makers Turn to Multicore Processors,” Computer, vol. 38, no. 5, pp. 11-13, May 2005.
• A. Jerraya, H. Tenhunen and W. Wolf, “Multiprocessor Systems-on-Chips,” Computer, vol. 38, no. 7, pp. 36-40, July 2005; this special issue contains three more articles on multicore processors.
• S. K. Moore, “Winner: Multimedia Monster – Cell’s Nine Processors Make It a Supercomputer on a Chip,” IEEE Spectrum, vol. 43, no. 1, pp. 20-23, January 2006.
Cell - Cell Broadband Engine Architecture
[Photo, left to right: Atsushi Kameyama, Toshiba; James Kahle, IBM; Masakazu Suzoki, Sony. © IEEE Spectrum, January 2006]
Nine-processor chip: 192 Gflops
Cell’s Nine-Processor Chip
[Figure: Cell’s nine-processor chip. © IEEE Spectrum, January 2006]
Eight identical processors: f = 5.6 GHz (max), 44.8 Gflops
Amdahl’s Law
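[Figure: execution time normalized to the interval 0 to 1, divided into a serial fraction S and a parallelizable fraction P = 1 – S.]
$\text{Speedup} = \frac{1}{S + (1 - S)/N}$
where N = number of parallel processors
Example: S = 0.6, N = 10, Speedup = 1.56
         S = 0.6, N = ∞, Speedup = 1.67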
Gene Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” AFIPS Conference Proceedings, (30), pp. 483-485, 1967.
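The two example speedups on this slide can be checked with a few lines of code. This is a minimal sketch; the function name is illustrative, and the infinite-processor case is represented with Python's `math.inf`.

```python
import math


def amdahl_speedup(serial_fraction, n_processors):
    """Speedup = 1 / (S + (1 - S) / N) for serial fraction S and N processors."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    denominator = serial_fraction + (1.0 - serial_fraction) / n_processors
    # A fully parallel workload on infinitely many processors has no finite bound.
    return math.inf if denominator == 0.0 else 1.0 / denominator


print(f"{amdahl_speedup(0.6, 10):.2f}")        # prints 1.56
print(f"{amdahl_speedup(0.6, math.inf):.2f}")  # prints 1.67
```

With S = 0.6 the speedup can never exceed 1/S ≈ 1.67, no matter how many processors are added.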