1 Advanced VLSI Course Class Presentation An Asynchronous Array of Simple Processors for DSP...
-
Upload
roy-nelson -
Category
Documents
-
view
227 -
download
0
Transcript of 1 Advanced VLSI Course Class Presentation An Asynchronous Array of Simple Processors for DSP...
1
Advanced VLSI Course Class Presentation
An Asynchronous Array of Simple Processors for DSP Applications
Rasoul Yousefi
Electrical and Computer Engineering DepartmentUniversity of Tehran
2
Copyright
• Slides used– Zhiyi Yu, Michael Meeuwsen, Ryan Apperson, Omar Sattari,
Michael Lai, Jeremy Webb, Eric Work, Tinoosh Mohsenin, Mandeep Singh, Bevan Baas “An Asynchronous Array of Simple rocessors for DSP Applications”, VLSI Computation Lab, ECE Department,University of California, Davis, USA,2006 IEEE
International Solid-State Circuits Conference
3
Outline
• The idea • Project objectives• Key features of the AsAP processor• Design of the AsAP processor• Results• Trends
4
The idea
• Using simple processor cores with small instruction and data memory to dramatically reduce area and power while increasing performance
FIFO 032
words ALUMAC
Control
IMem 64
words
DMem128
words
OSC
Staticconfig
Dynamicconfig
Output
FIFO 132
words
[1]
5
Outline
• The idea
• Project objectives• Key features of the AsAP processor• Design of the AsAP processor• Results• Trends
6
Project Objectives
• Programming flexibility• Increasing performance• Reducing power • Reducing area• Suitable for future
fabrication technologiesP
erf
orm
ance
&E
ner
gy
effi
cie
ncy
Programming flexibility
ASIC
FPGA
Prog.DSP
AsAP
Performance, Area, Power improvement !!! [1]
7
Outline
• The idea• Project objectives
• Key features of the AsAP processor• Design of the AsAP processor• Results• Trends
8
Key Features of the AsAP Processor
Chip multiprocessor High performance
Small memory& simple processor
High energy efficiencyGlobally asynchronous
locally synchronous (GALS)Technology scalability
Nearest neighbor communication
9
High Performance Through Chip Multiprocessor
• Increasing the clock frequency is challenging
• Parallelism is more promising– Instruction level parallelism (VLIW, Superscalar)– Data level parallelism (SIMD)– Task level parallelism
Higher clock frequency
Increased design complexity & lower energy efficiency
Deeper pipeline
10
Task Level Parallelism• Well suited for DSP applications
scramcoding
inter-leave
mod.map
loadfftinter-leave
IFFTwin-dow
up-sampfilter
scaleclip
up-sampfilter
in out
trainingpilots
802.11a/g wireless LAN (54 Mbps, 5 GHz) baseband transmit path
Task1Task2Task3
ABC
Memory
Task1 Task2A B C
Proc. Proc.1 Proc.2
Task3
Proc.3
Improves performance andpotentially reduces memory size
…
• Widely available in many DSP applications
… …
[1]
[1]
11
Memory Size in Modern Processors• Memory occupies much of the area in modern processors
— this reduces area available for the core and consumes large amounts of power
• Memories frequently contain the critical path • Area and energy dissipation can be dramatically decreased
while speed can be increased with smaller memories
0%
20%
40%
60%
80%
100%
Are
a b
reak
do
wn other
mem
core
TI_C64x Itanium SPARC BlGe/L [ISSCC 02, 05] [from1]
13
Small Memory Requirements for DSP Tasks
• The memory required for common DSP tasks is quite small
• Several hundred words of memory are sufficient for many DSP tasks
Task IMem
(words)
DMem
(words)
N-pt FIR 6 2N
8-pt DCT 40
16
8x8 2-D DCT 154 72
Conv. coding (k = 7)
29 14
Huffman encode 200 330
N-pt convolution 29 2N
64-pt complex FFT 97 192
Bubble sort 20 1
N merge sort 50 N
Square root 62 15
Exponential 108 32
[1]
14
• A system with many processors can use individual processors or groups of processors to compute individual tasks and intermediate data can be transmitted between processors with a small memory requirement
• So it is possible to use small memory
15
GALS Clocking Style
• The challenge of globally synchronous systems– Design difficulty when using high clock frequencies,
long clock wires caused by large chip sizes, and large circuit parameter variations
– High clock power consumption and lack of flexibility to independently control clock frequencies
• How about totally asynchronous design style– Lack of EDA tool support– Design complexity and circuit overhead
• The GALS compromise– Synchronous blocks each operating in their own
independent clock domain
16
The GALS architecture basics
• Power breakdown in a high-performance CPU[2]
Datapath Memory
ControlI/O
Clock
[Adopted from 2]
17
The GALS architecture basics
• Eliminating global clock Eliminating a major source of power consumption[2]
• Frequency of each SB can be tailored to the local needs Reducing average frequency and overall power consumption[2]
SB1 SB2
SB3DataHandshake protocol [Adopted from 2]
18
On Chip Communication
• Methods to avoid global wires– Network on chip (NOC)– Local communication– Nearest neighbor communication
nearest local
[1]
19
Outline
• The idea• Project objectives• Key features of the AsAP processor
• Design of the AsAP processor• Results• Trends
20
AsAP Block Diagram• GALS array of identical processors
– Each processor is a reduced complexity programmable DSP with small memories
– Each processor can receive data from any two neighbors and send data to any of its four neighbors
FIFO 032
words ALUMAC
Control
IMem 64
words
DMem128
words
OSC
Staticconfig
Dynamicconfig
Output
FIFO 132
words
[1]
21
IFetch
PC
Inst.Mem
Decode Mem.Read
SrcSelect
EXE 1 EXE 2 EXE 3 ResultSelect
WriteBack
FIFO 0 RD
FIFO 1 RD
DataMemRead
DCMemRead
De-code
Addr.Gens.
ALU DataMemWrite
DCMemWrite
Proc Output
Single Processor Architecture• 54 32-bit instructions• 9-stage pipeline
Bypass
Multiply Accumulator
×ACC
+
[1]
22
resethalt
clk
Programmable Clock Oscillator
• Standard cell based• Configurable frequency
– Delay tunable stages using 7 parallel tri-state inverters
– 1 to 128 clock divider• Results
– 1.66 MHz – 702 MHz– Max gap: 0.08 MHz
(1.66 – 500 MHz)
… … … …
…
…
… …
stage_sel
..
...
...
...
.
Clock divider
[1]
23
Write side Read sideeast
westsouth Write
logic
FIFO 0
Readlogic
east
north
west
south
CPU
FIFO 1
Processor
east
northwest
south
Inter-processor Communication
• Each processor contains two dual-clock FIFOs
• Rd/Wr in separate clock domains
north
Binary Gray & Sync.
addr addr
SRAM
[1]
24
Advantages of theAsAP Clocking System
• Simplified clock tree design – The maximum span is < 1 mm in 0.18 µm
• Scalable – easy to add processors• Improved energy efficiency
– Clock halts in 9 cycles (processor dissipates leakage power only) and restores in < 1 cycle according to work availability
– 53% and 65% power saving– Independent clock and voltage scaling (individual
processor voltage scaling not implemented in this version)
25
Physical Design
• 0.18 µm TSMC• Standard cell• Design flow
– Completely synthesized, except oscillator– Macro memory blocks used for four main memories– Completely auto placed and routed
• To simplify physical design, clock gating not implemented in this version
26
Transistors:
1 Proc 230,000
Chip 8.5 million
Max speed: 475 MHz @ 1.8 V
Area: 1 Proc 0.66 mm²
Chip 32.1 mm²
Power (1 Proc @ 1.8V, 475 MHz):
Typical application 32 mW
Typical 100% active 84 mW
Worst case 144 mW
Power (1 Proc @ 0.9V, 116 MHz):
Typical application 2.4 mW
SingleProcessor
OSC FIFOs
DMemIMem
810 µm5.65 mm
810 µm
5.68 mm
Chip Micrograph of the 6 x 6 Array
[1]
27
Outline
• The idea• Project objectives• Key features of the AsAP processor• Design of the AsAP processor
• Results• Trends
28
Area Evaluation
• Most of AsAP’s area is for the core (66%)• Each processor requires a very small area; more
than 20x smaller than others
0%
20%
40%
60%
80%
100%
Are
a b
reak
do
wn
comm mem core
Pro
cess
or
area
(m
m²)
20x
All scaled to 0.13 µm[ISSCC 05, 00; ISCA04; CMPON96]
0.1
1
10
100 7230
8.3 7
0.34
29
TI RAW Fujitsu BlGe/ AsAP C64x 4-VLIW L
TI CELL/ Fujitsu RAW ARM AsAP C64x SPE 4-VLIW
[1] [1]
29
Power and Performance
0.01
0.1
1
10 100 1000 10000
AsAP
CELL/SPE
Fujitsu-4VLIW
ARM
RAW
TI C64x
Po
wer
/ C
lock
fre
qu
ency
/ S
cale
*
(mW
/MH
z)
Peak performance density (MOPS/mm²)
All scaled to 0.13 µm
* Assume 2 ops/cycle for CELL/SPE and 3.3 ops/cycle for TI C64x
Note: word widths and workload not factored
[ISSCC 05, 00; ISCA04; CMPON96]
Higher performance
÷ area, and
Lower estimated
energy
[1]
30
Pad
Scram
Conv. Code
PuncInter-
leave 1
Inter-leave 2
Mod.Map
PilotInsert
Train
IFFT BR
IFFT Mem
IFFT BF
IFFTOutput
GI/Wind.
GI/Wind.
IFFT Mem
IFFT BF
FIRFIR
IFFT BF
IFFT Mem
OutputSync
IFFT
Data bits
ToD/A
converter
802.11a/802.11g Wireless Transmitter Implementation
• 22 processors• Fully functional on chip• 407 mW @ 300 MHz
30% of 54 Mb/s• Code unscheduled and
lightly optimized• ~10x performance and
35x – 75x lower energy dissipation than 8-way VLIW TI C62x
[SIPS 04; ICC 02] [1]
31
Summary
• Scalable programmable processing array– Many processors on a single chip– Reduced complexity processors with small memories– GALS clocking style– Nearest neighbor communication
• Results– 0.18 µm, 475 MHz @ 1.8 V
• 32 mW application power• 84 mW 100% active power• 2.4 mW application power @ 116 MHz and 0.9 V
– High performance density: 475 MOPS in 0.66 mm²– Well suited for future fabrication technologies
32
Trends
• Clock Gating
• Dynamic voltage scaling but adds level-conversion for asynchronous communication interface[3]
• CAD tool – CAD tools for asynchronous design is mature,
but not commercially strong yet.[3]
33
References1.Zhiyi Yu, Michael Meeuwsen, Ryan Apperson, Omar Sattari, Michael Lai, Jeremy Webb, Eric Work, Tinoosh Mohsenin,
Mandeep Singh, Bevan Baas, “An Asynchronous Array of
Simple rocessors for DSP Applications”, VLSI Computation Lab, ECE Department,University of California, Davis, USA,2006 IEEE International Solid-State Circuits Conference
2.A.Hemani , T.Meincke , S.Kumar, A.Postula, T.olsson, P.Nilsson,
J.Oberg , P.Ellervee,D.Lundqvist, “Lowering power consumption in clock by using Globally Asynchronous Locally Synchronous design style”
3.Anoop Iyer ,Diana Marculescu, “Power and Performance Evaluation of Globally Asynchronous Locally Synchronous Processors”, Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA.02)