Lecture 16
Spring 2007
Transcript of Lecture 16

Heterogeneous Systems
(Thanks to Wen-Mei Hwu for many of the figures)
What are Heterogeneous Systems?
• Programmable -- not restricted to one particular application, though they may be heavily optimized for a class of applications
• Multi-core -- multiple, independent execution units on a chip
– Some people are starting to use the term "many-core" for architectures with enough cores that you must use a non-sequential programming model to get full performance out of the system
• Heterogeneous -- the cores are different
– Cores can be optimized for specific types of applications
– Work can be scheduled for performance or for power
Why are they Interesting?
• Embedded applications have tough performance and power requirements
– Example: the GSM decoder requires 10 Minst/second in software
• The Motorola V70 GSM cell phone has a power budget of approximately 0.8 watts total when in use
– Includes both encode and decode
– Includes microphone, speaker, and radio
Application-Specific Integrated Circuits
[Figure: ASIC organization -- a CPU provides control while input data flows through custom logic blocks and buffers to produce output data]
Why Not Keep Using ASICs?
• Decreasing product cycles
• Design time/cost
– Transistors/chip rising at 50%/year
– Transistors/designer-day rising at only 10%/year
– Re-usable cores are helping some, but not enough
– Mask cost greater than $1M: need to fabricate many chips to justify a design
• Lack of flexibility
– More and more, consumers want multifunction devices (e.g., a cell phone with a camera)
– This increases design time and cost
Why Heterogeneous Systems?
• Different parts of programs have different requirements
– Control-intensive portions need good branch predictors, speculation, and big caches to achieve good performance
– Data-processing portions need lots of ALUs and have simpler control flow
• Power consumption
– Features like branch prediction and out-of-order execution tend to have very high power/performance ratios
– Applications often have time-varying performance requirements
• Observation: much of the performance and power advantage of ASICs comes from application-specific memory, not application-specific processing
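The observation above can be sketched in code. In the toy C fragment below (all names are hypothetical, chosen for illustration), a long array is streamed through a small software-managed buffer standing in for an accelerator's scratchpad SRAM, so all intermediate traffic stays local rather than hitting shared DRAM or a cache:

```c
#include <assert.h>
#include <string.h>

#define TILE 64          /* hypothetical local-buffer size, in elements */

/* Toy kernel: process one tile entirely out of the "scratchpad". */
static void process_tile(int *local, int n) {
    for (int i = 0; i < n; i++)
        local[i] *= 2;
}

/* Stream a long array through a small local buffer, the way an
 * accelerator with application-specific memory would, instead of
 * letting every individual access go to shared memory. */
void stream_through_scratchpad(int *data, int len) {
    int local[TILE];                         /* stand-in for scratchpad SRAM */
    for (int off = 0; off < len; off += TILE) {
        int n = (len - off < TILE) ? (len - off) : TILE;
        memcpy(local, data + off, n * sizeof(int));  /* bulk load  */
        process_tile(local, n);                      /* compute locally */
        memcpy(data + off, local, n * sizeof(int));  /* bulk store */
    }
}
```

The point is that the expensive resource -- bandwidth to shared memory -- is used only for two bulk transfers per tile, while the per-element accesses all hit the small local array.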
Changing Memory to Communication
[Figure: the GSM decoder's post-filter, shown first as a sequence of calls (Weight_Ai, Weight_Ai, Residu, Copy, Set_zero, Syn_filt, the correlation code below, preemphasis, Syn_filt, agc) running on a CPU, with every intermediate array (res2, m_syn, F_g3, F_g4, Az_4, synth, syn, Ap3, Ap4, h, tmp, tmp1, tmp2) passing through DRAM between steps; then remapped onto PEs (Weight_Ai, Copy+Set_zero, Residu, Syn_filt, Corr0/Corr1, preemph, agc, Syn_filt), where the same arrays become direct producer-consumer communication. The correlation step in the middle is:

    tmp = h[0] * h[0];
    for (i = 1; i < 22; i++) tmp = tmp + h[i] * h[i];
    tmp1 = tmp >> 8;
    tmp = h[0] * h[1];
    for (i = 1; i < 21; i++) tmp = tmp + h[i] * h[i+1];
    tmp2 = tmp >> 8;
    if (tmp2 <= 0) tmp2 = 0;
    else tmp2 = tmp2 * MU;
    tmp2 = tmp2 / tmp1;
]
View from source code:
• Note how memory operations dominate
• Note the presence of "expensive" instructions
Not as Easy as it Looks
[Figure: dataflow between Residu, preemphasis, and Syn_filt through the array res -- one stage accesses res[0:39] in increasing order while another reads it as [39:0], so res must sit in memory (MEM) between them over time.]
The order of access to data may make transforming memory operations into communication hard.
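A minimal C sketch of the problem, with hypothetical stage names standing in for the filter functions above: the producer writes the array in increasing index order, but the consumer walks it backwards, so the first element the consumer needs is the last one produced.

```c
#include <assert.h>

#define N 40

/* Producer stage (in the spirit of Residu): fills res in
 * increasing index order -- res[0] first, res[N-1] last. */
void produce(int *res) {
    for (int i = 0; i < N; i++)
        res[i] = i * 3;               /* arbitrary toy values */
}

/* Consumer stage: walks res in DECREASING index order.  The first
 * element it needs, res[N-1], is the LAST one the producer writes,
 * so the two stages cannot be chained through a FIFO: the entire
 * N-element array must be materialized in memory between them. */
int consume_reversed(const int *res, int *first_needed) {
    *first_needed = res[N - 1];       /* not available until produce() ends */
    int sum = 0;
    for (int i = N - 1; i >= 0; i--)
        sum += res[i];
    return sum;
}
```

With a FIFO, the consumer would stall until the producer finished the whole array -- which is exactly the same serialization as going through memory.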
Compilers to the Rescue!
• Remove anti-dependence by array renaming
• Apply loop reversal to match producer/consumer I/O
• Convert array access to inter-component communication
[Figure: the same Residu / preemphasis / Syn_filt pipeline after transformation -- the output array renamed to res2 and the loops aligned so the stages stream data directly instead of through memory.]
Enabled by: interprocedural pointer analysis + array dependence testing + array access pattern summaries + interprocedural memory data flow.
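The three transformations in the bullets above compose as follows, sketched in C with a hypothetical one-slot channel (fifo_put/fifo_get) for illustration: after renaming removes the anti-dependence and loop reversal makes producer and consumer both walk i = 0..N-1, each element can be handed over as soon as it is produced.

```c
#include <assert.h>

#define N 40

/* Hypothetical one-slot channel replacing the 40-element array
 * (the "convert array access to inter-component communication" step). */
static int slot;
static void fifo_put(int v) { slot = v; }
static int  fifo_get(void)  { return slot; }

/* After renaming (res -> res2) and loop reversal, producer and
 * consumer iterate in the same direction, so element i can be
 * consumed immediately after it is produced -- no array in memory. */
int fused_pipeline(void) {
    int acc = 0;
    for (int i = 0; i < N; i++) {
        fifo_put(i * i);        /* producer: element i of the renamed array */
        acc += fifo_get();      /* consumer: uses element i right away */
    }
    return acc;
}
```

In hardware, the one-slot channel would be a small FIFO between two PEs; the compiler analyses listed above are what prove the rewrite is legal.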
Heterogeneous Processor Vision
[Figure: a general-purpose processor (GPP), a memory transfer module (MTM), and main memory, connected to several accelerators (ACC), each with its own local memory.]
• The memory transfer module schedules system-wide bulk data movement
• The general-purpose processor orchestrates activity
• Accelerators can use scheduled, streaming communication...
• ...or can operate on locally-buffered data pushed to them in advance
• Accelerated activities and their associated private data are localized for bandwidth, power, and efficiency
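One way the MTM and an accelerator could cooperate is classic double buffering, sketched below in plain C (all names hypothetical; the memcpy stands in for a scheduled bulk DMA): the MTM stages chunk k+1 into one local buffer while the accelerator computes on chunk k in the other.

```c
#include <assert.h>
#include <string.h>

#define CHUNK 32

/* Toy accelerator kernel: reduce one locally-buffered chunk. */
static int acc_kernel(const int *local, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += local[i];
    return s;
}

/* Double-buffered pipeline: the "MTM" (here just memcpy) pushes the
 * next chunk into one local buffer while the "accelerator" computes
 * on the other.  Assumes len is a multiple of CHUNK. */
int run_pipeline(const int *data, int len) {
    int bufs[2][CHUNK];
    int total = 0, cur = 0;
    int nchunks = len / CHUNK;
    memcpy(bufs[cur], data, CHUNK * sizeof(int));        /* prime buffer 0 */
    for (int k = 0; k < nchunks; k++) {
        int nxt = 1 - cur;
        if (k + 1 < nchunks)                             /* MTM: stage next chunk */
            memcpy(bufs[nxt], data + (k + 1) * CHUNK, CHUNK * sizeof(int));
        total += acc_kernel(bufs[cur], CHUNK);           /* accelerator computes */
        cur = nxt;
    }
    return total;
}
```

On real hardware the two memcpy calls and the kernel would overlap in time; here they are sequential, but the buffer hand-off logic is the same.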
Intel Network Processor -- Existing Example
[Figure: Intel network processor block diagram -- an XScale core, a hash engine, scratch-pad SRAM, CSRs, a PCI interface, receive and transmit FIFOs (RFIFO/TFIFO), an SPI-4/CSIX line interface, sixteen microengines, four QDR SRAM channels, and three RDRAM channels.]
STI Cell Processor -- Emerging Example
[Figure: STI Cell block diagram -- a Power Processor Element (PPE, a simplified 64-bit PowerPC with VMX) and eight Synergistic Processing Elements (SPE1-SPE8) connected by the Element Interconnect Bus (EIB), the chip's internal communication system; dual memory controllers drive two 12.8 GB/sec memory busses to RAM, and dual configurable high-speed I/O channels run at 38.4 GB/sec each.]
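The per-link rates quoted in the figure imply the chip's aggregate off-chip bandwidth; the trivial helpers below (hypothetical names) just do that arithmetic.

```c
#include <assert.h>

/* Aggregate the per-link bandwidth figures quoted on the slide:
 * two 12.8 GB/sec memory busses and two configurable 38.4 GB/sec
 * I/O channels. */
double cell_memory_bw_gbs(void) { return 2 * 12.8; }   /* 25.6 GB/sec to RAM   */
double cell_io_bw_gbs(void)     { return 2 * 38.4; }   /* 76.8 GB/sec off-chip */
```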
Overview of the Rest of the Semester
• This is the last formal lecture
– If we haven't covered it already, we can't really expect you to use it on your projects
• Final project proposal due Tuesday in class
• I'll be in my office (208 CSL) during class on 3/27 to provide an opportunity to discuss project issues
• Quiz 2 is 3/29
• Final project demos are 5/3