Lecture 16
Spring 2007
Transcript of Lecture 16

Heterogeneous Systems
(Thanks to Wen-Mei Hwu for many of the figures)
What are Heterogeneous Systems?
• Programmable -- not restricted to one particular application, though they may be heavily optimized for a class of applications
• Multi-core -- multiple, independent execution units on a chip
– Some people are starting to use the term "many-core" for architectures with enough cores that you must use a non-sequential programming model to get full performance out of the system
• Heterogeneous -- the cores are different
– Cores can be optimized for specific types of applications
– Work can be scheduled for performance or for power
Why are they Interesting?
• Embedded applications have tough performance and power requirements
– Example: the GSM decoder requires 10 Minst/second in software
• The Motorola V70 GSM cell phone has a power budget of approximately 0.8 watts total when in use
– Includes both encode and decode
– Includes microphone, speaker, and radio
Application-Specific Integrated Circuits
[Figure: ASIC organization -- a CPU provides control while input data flows through custom logic blocks and buffers to produce output data]
Why Not Keep Using ASICs?
• Decreasing product cycles
• Design time/cost
– Transistors/chip rising at 50%/year
– Transistors/designer-day rising at only 10%/year
– Re-usable cores are helping some, but not enough
– Mask cost greater than $1M: need to fabricate many chips to justify a design
• Lack of flexibility
– More and more, consumers want multifunction devices (e.g., a cell phone with a camera)
– This increases design time and cost
Why Heterogeneous Systems?
• Different parts of programs have different requirements
– Control-intensive portions need good branch predictors, speculation, and big caches to achieve good performance
– Data-processing portions need lots of ALUs and have simpler control flow
• Power consumption
– Features like branch prediction and out-of-order execution tend to have very high power/performance ratios
– Applications often have time-varying performance requirements
• Observation: much of the performance and power advantage of ASICs comes from application-specific memory, not application-specific processing
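The observation above can be sketched in code. In the toy C fragment below (all names are hypothetical, chosen for illustration), a long array is streamed through a small software-managed buffer standing in for an accelerator's scratchpad SRAM, so all intermediate traffic stays local rather than hitting shared DRAM or a cache:

```c
#include <assert.h>
#include <string.h>

#define TILE 64          /* hypothetical local-buffer size, in elements */

/* Toy kernel: process one tile entirely out of the "scratchpad". */
static void process_tile(int *local, int n) {
    for (int i = 0; i < n; i++)
        local[i] *= 2;
}

/* Stream a long array through a small local buffer, the way an
 * accelerator with application-specific memory would, instead of
 * letting every individual access go to shared memory. */
void stream_through_scratchpad(int *data, int len) {
    int local[TILE];                         /* stand-in for scratchpad SRAM */
    for (int off = 0; off < len; off += TILE) {
        int n = (len - off < TILE) ? (len - off) : TILE;
        memcpy(local, data + off, n * sizeof(int));  /* bulk load  */
        process_tile(local, n);                      /* compute locally */
        memcpy(data + off, local, n * sizeof(int));  /* bulk store */
    }
}
```

The point is that the expensive resource -- bandwidth to shared memory -- is used only for two bulk transfers per tile, while the per-element accesses all hit the small local array.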
Changing Memory to Communication
[Figure: the GSM decoder's post-filter, shown first as a sequence of calls (Weight_Ai, Weight_Ai, Residu, Copy, Set_zero, Syn_filt, the correlation code below, preemphasis, Syn_filt, agc) running on a CPU, with every intermediate array (res2, m_syn, F_g3, F_g4, Az_4, synth, syn, Ap3, Ap4, h, tmp, tmp1, tmp2) passing through DRAM between steps; then remapped onto PEs (Weight_Ai, Copy+Set_zero, Residu, Syn_filt, Corr0/Corr1, preemph, agc, Syn_filt), where the same arrays become direct producer-consumer communication. The correlation step in the middle is:

    tmp = h[0] * h[0];
    for (i = 1; i < 22; i++) tmp = tmp + h[i] * h[i];
    tmp1 = tmp >> 8;
    tmp = h[0] * h[1];
    for (i = 1; i < 21; i++) tmp = tmp + h[i] * h[i+1];
    tmp2 = tmp >> 8;
    if (tmp2 <= 0) tmp2 = 0;
    else tmp2 = tmp2 * MU;
    tmp2 = tmp2 / tmp1;
]
View from source code:
• Note how memory operations dominate
• Note the presence of "expensive" instructions
Not as Easy as it Looks
[Figure: dataflow between Residu, preemphasis, and Syn_filt through the array res -- one stage accesses res[0:39] in increasing order while another reads it as [39:0], so res must sit in memory (MEM) between them over time.]
The order of access to data may make transforming memory operations into communication hard.
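A minimal C sketch of the problem, with hypothetical stage names standing in for the filter functions above: the producer writes the array in increasing index order, but the consumer walks it backwards, so the first element the consumer needs is the last one produced.

```c
#include <assert.h>

#define N 40

/* Producer stage (in the spirit of Residu): fills res in
 * increasing index order -- res[0] first, res[N-1] last. */
void produce(int *res) {
    for (int i = 0; i < N; i++)
        res[i] = i * 3;               /* arbitrary toy values */
}

/* Consumer stage: walks res in DECREASING index order.  The first
 * element it needs, res[N-1], is the LAST one the producer writes,
 * so the two stages cannot be chained through a FIFO: the entire
 * N-element array must be materialized in memory between them. */
int consume_reversed(const int *res, int *first_needed) {
    *first_needed = res[N - 1];       /* not available until produce() ends */
    int sum = 0;
    for (int i = N - 1; i >= 0; i--)
        sum += res[i];
    return sum;
}
```

With a FIFO, the consumer would stall until the producer finished the whole array -- which is exactly the same serialization as going through memory.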
Compilers to the Rescue!
• Remove anti-dependence by array renaming
• Apply loop reversal to match producer/consumer I/O
• Convert array access to inter-component communication
[Figure: the same Residu / preemphasis / Syn_filt pipeline after transformation -- the output array renamed to res2 and the loops aligned so the stages stream data directly instead of through memory.]
Enabled by: interprocedural pointer analysis + array dependence testing + array access pattern summaries + interprocedural memory data flow.
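The three transformations in the bullets above compose as follows, sketched in C with a hypothetical one-slot channel (fifo_put/fifo_get) for illustration: after renaming removes the anti-dependence and loop reversal makes producer and consumer both walk i = 0..N-1, each element can be handed over as soon as it is produced.

```c
#include <assert.h>

#define N 40

/* Hypothetical one-slot channel replacing the 40-element array
 * (the "convert array access to inter-component communication" step). */
static int slot;
static void fifo_put(int v) { slot = v; }
static int  fifo_get(void)  { return slot; }

/* After renaming (res -> res2) and loop reversal, producer and
 * consumer iterate in the same direction, so element i can be
 * consumed immediately after it is produced -- no array in memory. */
int fused_pipeline(void) {
    int acc = 0;
    for (int i = 0; i < N; i++) {
        fifo_put(i * i);        /* producer: element i of the renamed array */
        acc += fifo_get();      /* consumer: uses element i right away */
    }
    return acc;
}
```

In hardware, the one-slot channel would be a small FIFO between two PEs; the compiler analyses listed above are what prove the rewrite is legal.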
Heterogeneous Processor Vision
[Figure: a general-purpose processor (GPP), a memory transfer module (MTM), and main memory, connected to several accelerators (ACC), each with its own local memory.]
• The memory transfer module schedules system-wide bulk data movement
• The general-purpose processor orchestrates activity
• Accelerators can use scheduled, streaming communication...
• ...or can operate on locally-buffered data pushed to them in advance
• Accelerated activities and their associated private data are localized for bandwidth, power, and efficiency
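One way the MTM and an accelerator could cooperate is classic double buffering, sketched below in plain C (all names hypothetical; the memcpy stands in for a scheduled bulk DMA): the MTM stages chunk k+1 into one local buffer while the accelerator computes on chunk k in the other.

```c
#include <assert.h>
#include <string.h>

#define CHUNK 32

/* Toy accelerator kernel: reduce one locally-buffered chunk. */
static int acc_kernel(const int *local, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += local[i];
    return s;
}

/* Double-buffered pipeline: the "MTM" (here just memcpy) pushes the
 * next chunk into one local buffer while the "accelerator" computes
 * on the other.  Assumes len is a multiple of CHUNK. */
int run_pipeline(const int *data, int len) {
    int bufs[2][CHUNK];
    int total = 0, cur = 0;
    int nchunks = len / CHUNK;
    memcpy(bufs[cur], data, CHUNK * sizeof(int));        /* prime buffer 0 */
    for (int k = 0; k < nchunks; k++) {
        int nxt = 1 - cur;
        if (k + 1 < nchunks)                             /* MTM: stage next chunk */
            memcpy(bufs[nxt], data + (k + 1) * CHUNK, CHUNK * sizeof(int));
        total += acc_kernel(bufs[cur], CHUNK);           /* accelerator computes */
        cur = nxt;
    }
    return total;
}
```

On real hardware the two memcpy calls and the kernel would overlap in time; here they are sequential, but the buffer hand-off logic is the same.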
Intel Network Processor -- Existing Example
[Figure: Intel network processor block diagram -- an XScale core, a hash engine, scratch-pad SRAM, CSRs, a PCI interface, receive and transmit FIFOs (RFIFO/TFIFO), an SPI-4/CSIX line interface, sixteen microengines, four QDR SRAM channels, and three RDRAM channels.]
STI Cell Processor -- Emerging Example
[Figure: STI Cell block diagram -- a Power Processor Element (PPE, a simplified 64-bit PowerPC with VMX) and eight Synergistic Processing Elements (SPE1-SPE8) connected by the Element Interconnect Bus (EIB), the chip's internal communication system; dual memory controllers drive two 12.8 GB/sec memory busses to RAM, and dual configurable high-speed I/O channels run at 38.4 GB/sec each.]
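The per-link rates quoted in the figure imply the chip's aggregate off-chip bandwidth; the trivial helpers below (hypothetical names) just do that arithmetic.

```c
#include <assert.h>

/* Aggregate the per-link bandwidth figures quoted on the slide:
 * two 12.8 GB/sec memory busses and two configurable 38.4 GB/sec
 * I/O channels. */
double cell_memory_bw_gbs(void) { return 2 * 12.8; }   /* 25.6 GB/sec to RAM   */
double cell_io_bw_gbs(void)     { return 2 * 38.4; }   /* 76.8 GB/sec off-chip */
```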
Overview of the Rest of the Semester
• This is the last formal lecture
– If we haven't covered it already, we can't really expect you to use it on your projects
• Final project proposal due Tuesday in class
• I'll be in my office (208 CSL) during class on 3/27 to provide an opportunity to discuss project issues
• Quiz 2 is 3/29
• Final project demos are 5/3