www.eecs.umich.edu/~sdrg
SODA: A Low-power Architecture For Software Radio
Yuan Lin¹, Hyunseok Lee¹, Mark Woh¹, Yoav Harel¹, Scott Mahlke¹, Trevor Mudge¹, Chaitali Chakrabarti², Krisztian Flautner³
¹ Advanced Computer Architecture Lab, University of Michigan
² Department of Electrical Engineering, Arizona State University
³ ARM, Ltd.
Anatomy of 3G Cellular Phone
[Figure: block diagram of a 3G cellular phone. Blocks: Bluetooth (DSP+ASICs), GPS (DSP+ASICs), W-CDMA baseband processor (GPP+DSP+ASICs), analog frontend (ASICs), application processor (GPP+DSP), power manager, camera, keyboard, display, speaker, microphone.]
[Figure detail: baseband protocol stack. The Transport, Network, Link, and MAC layers run on the GPP; the PHY layer runs on DSP+ASICs. The final build of the slide shows the PHY layer on DSPs only.]
Software Defined Radio (SDR): Use of software routines instead of ASICs for physical layer operations of wireless protocol systems
[Figure: SDR-based phone. Blocks: SDR baseband processor (GPP+DSPs), application processor (GPP+DSP), camera, keyboard, display, speaker, microphone, and analog ASICs for GPS, Bluetooth, W-CDMA, and 802.11.]
Advantages of Software Defined Radio
• Multi-mode operations
• Lower costs
  – Faster time to market
  – Prototyping and bug fixes
  – Chip volumes
  – Longevity of platforms
• Protocol complexity favors software-dominated solutions
• Enables future wireless communication innovations
  – Cognitive radio
[Figure: one SDR platform serving multiple protocols: UWB, EDGE, 802.16a, Bluetooth, 802.11b, WCDMA, 802.11n.]
Why is SDR Challenging?
[Figure: peak performance (Gops, log scale from 1 to 1000) versus power (Watts, log scale from 0.1 to 100), with power-efficiency lines at 1, 10, and 100 Mops/mW; better power efficiency lies toward higher performance at lower power. Labels: General Purpose Processors, Embedded DSPs, High-end DSPs, Pentium M, TI C6x, IBM Cell, Mobile SDR Requirements.]
• SDR Design Objectives for 3G and WiFi
  – Throughput requirements
    • 40 Gops peak throughput
  – Power budget
    • 100 mW ~ 500 mW peak power
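The two design objectives above imply a target power efficiency that can be compared with the Mops/mW lines in the figure. The following is a minimal sketch of that arithmetic; the function and variable names are illustrative and do not come from the slides.

```python
# Power efficiency implied by the slide's SDR design objectives.
PEAK_THROUGHPUT_OPS = 40e9      # 40 Gops peak throughput (from the slide)
POWER_BUDGET_W = (0.1, 0.5)     # 100 mW to 500 mW peak power (from the slide)


def mops_per_mw(ops_per_second: float, watts: float) -> float:
    """Convert a throughput/power pair into Mops per mW."""
    mops = ops_per_second / 1e6
    milliwatts = watts * 1e3
    return mops / milliwatts


# One efficiency figure per end of the power budget.
efficiency_range = [mops_per_mw(PEAK_THROUGHPUT_OPS, w) for w in POWER_BUDGET_W]
print(efficiency_range)
```

Under these numbers the requirement works out to roughly 80 to 400 Mops/mW, which is at or beyond the 100 Mops/mW line in the figure.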
The Anatomy of Wireless Protocols
1. Filtering: suppress signals outside frequency band
2. Modulation: map source information onto signal waveforms
3. Channel Estimation: Estimate channel condition for transceivers
4. Error Correction: correct errors induced by noisy channel
[Figure: W-CDMA Physical Layer Processing. Transmitter: upper layers, channel encoder, interleaver, spreader, scrambler, LPF-Tx, D/A, frontend. Receiver: frontend, A/D, LPF-Rx, searcher, multiple descrambler/despreader paths feeding a combiner, deinterleaver, channel decoder (turbo/viterbi), upper layers.]
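The spreader and despreader blocks in the W-CDMA chain can be illustrated with a toy direct-sequence example. This is a sketch under simplifying assumptions: the 8-chip code below is hypothetical (not a W-CDMA OVSF or scrambling code), the signal is real-valued, and the channel is noiseless.

```python
def spread(bits, code):
    """Map bits {0,1} to symbols {+1,-1} and multiply each by the chip code."""
    chips = []
    for b in bits:
        symbol = 1 - 2 * b
        chips.extend(symbol * c for c in code)
    return chips


def despread(chips, code):
    """Correlate each code-length block with the code and decide by sign."""
    n = len(code)
    bits = []
    for i in range(0, len(chips), n):
        corr = sum(x * c for x, c in zip(chips[i:i + n], code))
        bits.append(0 if corr > 0 else 1)
    return bits


# Hypothetical 8-chip code, chosen only for illustration.
CODE = [1, -1, 1, -1, 1, 1, -1, -1]

tx_bits = [1, 0, 1, 1, 0]
tx_chips = spread(tx_bits, CODE)
rx_bits = despread(tx_chips, CODE)
print(rx_bits == tx_bits)
```

The inner correlation is a multiply-accumulate over a fixed-length vector, which is the kind of operation that maps naturally onto a wide SIMD datapath.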
SDR – Application Specific Design
• Wireless protocols are systems of DSP algorithms
  – System-level
    • Example: Specification of W-CDMA DCH channel
  – Algorithm-level
    • Example: Implementation of a 64 point FFT
System Level Design Decisions
System Characteristics                                      SODA Architectural Decisions
1. Algorithm macro-pipelining with streaming computation    1. Multi-core system
                                                            2. Communication through DMA
2. Multiple periodic real-time deadlines                    3. Deterministic hardware behavior
                                                            4. Compile-time algorithm mapping and scheduling
3. Low streaming throughput between algorithms              5. Low throughput interconnect
4. Heterogeneous inter-algorithm communication              6. Multi-level scratchpad memories
SODA System Architecture
• 4 PEs
  – static kernel mapping and scheduling
  – SIMD+Scalar units
• 1 ARM GPP controller
  – scalar algorithms and protocol controls
[Figure: SODA system architecture. An ARM controller, four PEs, and a global memory. Each PE contains a local memory and an execution unit. PE detail labels: SIMD RF, SIMD MEM, scalar RF, scalar MEM, WtoS & StoW, DMA, Scalar ALU, SIMD ALU.]
SODA Memory Organization
[Figure: SODA system architecture. An ARM controller, four PEs (each with a local memory and an execution unit), and a global memory.]
• 2-level scratchpad memories
  – 12KB local scratchpad memory for stream queues
  – 64KB global scratchpad memory for large buffers
• Low-throughput shared bus
  – 200MHz 32-bit bus
  – inter-PE communication using DMA
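The bus and memory figures above give a feel for how long DMA transfers take. The sketch below assumes one 32-bit word moves per bus cycle with no arbitration or setup overhead; that is an idealization for illustration, not a figure from the slides.

```python
# Idealized bandwidth of the shared bus described on the slide.
BUS_CLOCK_HZ = 200e6             # 200 MHz (from the slide)
BUS_WIDTH_BITS = 32              # 32-bit bus (from the slide)
LOCAL_MEM_BYTES = 12 * 1024      # 12KB local scratchpad (from the slide)
GLOBAL_MEM_BYTES = 64 * 1024     # 64KB global scratchpad (from the slide)

# Assumption: one bus word is transferred every cycle.
bus_bytes_per_second = BUS_CLOCK_HZ * BUS_WIDTH_BITS / 8


def transfer_time_us(num_bytes: int) -> float:
    """Time in microseconds to move num_bytes over the idealized bus."""
    return num_bytes / bus_bytes_per_second * 1e6


local_fill_us = transfer_time_us(LOCAL_MEM_BYTES)
global_fill_us = transfer_time_us(GLOBAL_MEM_BYTES)
print(bus_bytes_per_second, local_fill_us, global_fill_us)
```

Under that assumption the bus peaks at 800 MB/s, so refilling a whole local scratchpad takes on the order of 15 microseconds.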
DSP Algorithm Characteristics
• 8 to 16-bit precision
• Vector operations
  – long vectors
  – constant vector size
• Static data movement patterns
• Scalar operations

Protocol   Algorithm           Type of Computation   Vector Width
W-CDMA     Filter              Vector                64
W-CDMA     Modulation          Vector                2560
W-CDMA     Channel Est.        Vector                320
W-CDMA     Error Correction    Mixed                 8 or 256
802.11a    Filter              Vector                33
802.11a    Modulation (FFT)    Vector                64
802.11a    Channel Est.        Mixed                 16
802.11a    Error Correction    Mixed                 64
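The vector widths in the table can be related to the 32-wide SIMD unit that the PE slides describe. The sketch below counts how many 32-lane passes one vector operation needs and how full the lanes are; the helper names are illustrative, and the "8 or 256" entry is represented by its larger value as a simplification.

```python
import math

SIMD_WIDTH = 32  # SIMD lanes per PE, as stated on the PE slides

# Vector widths copied from the table above.
VECTOR_WIDTHS = {
    ("W-CDMA", "Filter"): 64,
    ("W-CDMA", "Modulation"): 2560,
    ("W-CDMA", "Channel Est."): 320,
    ("W-CDMA", "Error Correction"): 256,   # table lists "8 or 256"
    ("802.11a", "Filter"): 33,
    ("802.11a", "Modulation (FFT)"): 64,
    ("802.11a", "Channel Est."): 16,
    ("802.11a", "Error Correction"): 64,
}


def simd_passes(width: int) -> int:
    """Number of SIMD-wide passes needed to cover a vector of this width."""
    return math.ceil(width / SIMD_WIDTH)


def lane_utilization(width: int) -> float:
    """Fraction of lanes doing useful work across those passes."""
    return width / (simd_passes(width) * SIMD_WIDTH)


for (protocol, algorithm), width in VECTOR_WIDTHS.items():
    print(protocol, algorithm, simd_passes(width), round(lane_utilization(width), 2))
```

Most widths are multiples of 32 and fill the lanes completely; the 33-wide 802.11a filter and the 16-wide channel estimation are the cases that leave lanes idle in this simple model.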
SODA PE Architecture
[Figure: SODA PE block diagram. Labels: SIMD pipeline (32-way SIMD, RF, ID, EX with 16-bit ALU and 16-bit multiplier, WB), 32x16bit SSN, SIMD Memory (8KB), 512bit port, Scalar pipeline (scalar RF, ID, EX with 16-bit ALU, WB), Scalar Memory (4KB), AGU pipeline (AGU RF, 12-bit AGU ALU), Inst. Mem. 4KB, DMA, 16bit BUS, Scalar to Vector, Vector to Scalar Stage 1 and Stage 2.]
2 issue LIW (400MHz)
- SIMD + (Scalar or AGU)
DMA:
- mem-to-mem transfer
- access global memory
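A rough upper bound on raw SIMD throughput follows from the clock rate and SIMD width above. The sketch assumes every lane completes one 16-bit operation per cycle and ignores the scalar/AGU issue slot; both are assumptions made for illustration, and the PE count of 4 is taken from the system architecture slide.

```python
# Back-of-envelope peak lane throughput for the SODA PE array.
CLOCK_HZ = 400e6     # PE clock (from the slide)
SIMD_LANES = 32      # 32-way SIMD (from the slide)
NUM_PES = 4          # from the system architecture slide

# Assumption: one operation per lane per cycle, SIMD slot always issued.
lane_ops_per_pe = CLOCK_HZ * SIMD_LANES
lane_ops_system = lane_ops_per_pe * NUM_PES

print(lane_ops_per_pe / 1e9, "G lane-ops/s per PE")
print(lane_ops_system / 1e9, "G lane-ops/s for 4 PEs")
```

This counts individual lane operations, which is a different unit from whole LIW instructions, so the result should not be compared directly with throughput numbers quoted in LIW ops.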
SODA PE SIMD Pipeline
[Figure: SODA PE block diagram with one SIMD lane expanded. Lane detail: RF (16-bit, 16 entries, 2 read / 1 write port); EX stage with a 16-bit multiplier, a 40-bit ACC, and a 16-bit ALU; 16-bit WB stage.]
SODA PE SIMD Pipeline
[Figure: SODA PE block diagram with the SIMD pipeline and SIMD memory highlighted.]
SIMD:
- 32 wide
- predicated exec.
- predicated neg.
Memory:
- 512bit port
- 1 read port
- 1 write port
- 8 KBytes
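Predicated execution and predicated negation across 32 lanes can be modeled in a few lines. This is a behavioral sketch only: the function names are invented for illustration, and wrapping on overflow is an assumption, since the slides do not say whether the 40-bit accumulator in the lane diagram wraps or saturates.

```python
LANES = 32       # SIMD width (from the slide)
ACC_BITS = 40    # accumulator width (from the lane diagram)
DATA_BITS = 16   # lane datapath width (from the lane diagram)


def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer into a signed two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value


def predicated_mac(acc, a, b, pred):
    """Per lane: acc += a * b where the predicate is set, else keep acc."""
    return [wrap_signed(x + y * z, ACC_BITS) if p else x
            for x, y, z, p in zip(acc, a, b, pred)]


def predicated_neg(v, pred):
    """Per lane: negate the value where the predicate is set."""
    return [wrap_signed(-x, DATA_BITS) if p else x for x, p in zip(v, pred)]


a = list(range(LANES))
b = [2] * LANES
pred = [i % 2 == 0 for i in range(LANES)]   # enable even lanes only

acc = predicated_mac([0] * LANES, a, b, pred)
neg = predicated_neg(a, pred)
print(acc[:6], neg[:6])
```

Predication lets all 32 lanes follow one instruction stream while individual lanes opt out, which avoids branching inside vector kernels.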
SODA PE SIMD Shuffle Network
[Figure: SODA PE block diagram with the 32x16bit SIMD shuffle network (SSN) highlighted.]
SIMD Shuffle Network
- Shuffle Exchange (SE)
- Inverse Shuffle Exchange (ISE)
- Exchange Only (EX)
- Iterative Feedback
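The shuffle, inverse shuffle, and exchange patterns named above can be written as plain permutations. The sketch below uses the textbook perfect-shuffle definition; how SODA's SSN encodes its controls is not given in the slides, so the function names and the per-pair swap mask are assumptions.

```python
def shuffle(v):
    """Perfect shuffle: interleave the lower and upper halves of v."""
    half = len(v) // 2
    out = []
    for lo, hi in zip(v[:half], v[half:]):
        out += [lo, hi]
    return out


def inverse_shuffle(v):
    """Inverse perfect shuffle: even positions first, then odd positions."""
    return v[0::2] + v[1::2]


def exchange(v, swap):
    """Swap adjacent pair (2k, 2k+1) wherever swap[k] is true."""
    out = list(v)
    for k, do_swap in enumerate(swap):
        if do_swap:
            out[2 * k], out[2 * k + 1] = out[2 * k + 1], out[2 * k]
    return out


v8 = list(range(8))
print(shuffle(v8))
print(inverse_shuffle(shuffle(v8)) == v8)

# Iterative feedback: repeating the shuffle log2(N) times restores the order.
v32 = list(range(32))
w = v32
for _ in range(5):
    w = shuffle(w)
print(w == v32)
```

Feeding the output back through the same stage is what lets a single shuffle-exchange layer build up more complex permutations over several iterations.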
SODA PE Scalar Pipeline
[Figure: SODA PE block diagram with the scalar pipeline, scalar memory, and the scalar/vector transfer units highlighted.]
Scalar:
- One 16-bit datapath
- No mult unit
Scalar memory:
- 16bit port
- 1 read/write port
- 4 KBytes
Scalar-to-Vector
Vector-to-Scalar
W-CDMA Mapping On SODA
[Figure: W-CDMA physical layer processing (transmitter and receiver chains) mapped onto the ARM controller, four PEs, and global memory. WCDMA Receiver kernels: 2 LPF-Rx, De-scrambler, Despreader, Combiner, Searcher, De-interleaver, Turbo Decoder. WCDMA Transmitter kernels, as labeled on the slide: 4 LPF-Rx, Scrambler, Spreader, Turbo Encoder, Interleaver. Other kernels: Misc. Control, Power Control, PN Code TX/RX. Buffers shown: 10 Bytes, 1K Bytes (four), 2K Bytes, 20 KBytes (two), and a FIFO Queue of 12.5 KBytes.]
SDR Performance Distribution
[Figure: bar chart of computations (Mcycles, axis from 0 to 700) for filter, modulation, channel estimation, and error correction. Series: W-CDMA (2Mbps) and 802.11a (24Mbps).]
• 802.11a has a higher number of total computational cycles
• W-CDMA requires more computational cycles per bit
Power Consumption at 180nm
[Figure: bar chart of power (mW, axis from 0 to 1400) in 180nm for PE Data Memory, PE SIMD RF, PE SIMD ALUs, PE SIMD Pipeline, PE Others, Global Memory, and System Others. Series: W-CDMA (2Mbps) and 802.11a (24Mbps).]
• Wide SIMD requires a higher number of pipeline registers
• 802.11a consumes more power than W-CDMA
• 8-bit W-CDMA computation versus 16-bit 802.11a computation
Summary
• Key features of SODA
  – Multi-PE with scratchpad memories
  – Low throughput shared bus
  – 2-issue LIW: SIMD+(Scalar or AGU)
  – 32-wide SIMD processing
  – SIMD shuffle network

SDR Hardware Requirements                                SODA Results
Comp. requirements: 10 ~ 100 GOPS                        W-CDMA & 802.11a: 1.3 ~ 2 GOPS (with SODA LIW ops)
Sub-watt power budget: ~ 0.2 Watt for cellular phones    180nm: ~ 3 Watts (area: 26.6 mm²)
                                                         90nm (est.): ~ 0.5 Watt (area: 6.7 mm²)
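The 180nm and 90nm figures in the table can be checked against simple scaling arithmetic. The sketch below compares the reported area ratio with the ideal quadratic shrink and computes the remaining gap to the ~0.2 Watt budget; the variable names are illustrative, and the quadratic-area rule is a standard first-order approximation rather than something stated in the slides.

```python
# Reported SODA results (from the summary table).
AREA_180NM_MM2 = 26.6
AREA_90NM_MM2 = 6.7       # estimated
POWER_180NM_W = 3.0
POWER_90NM_W = 0.5        # estimated
POWER_TARGET_W = 0.2      # cellular phone budget

# First-order assumption: area scales with the square of the feature size.
ideal_area_scale = (180 / 90) ** 2
reported_area_scale = AREA_180NM_MM2 / AREA_90NM_MM2

power_reduction = POWER_180NM_W / POWER_90NM_W
gap_to_target = POWER_90NM_W / POWER_TARGET_W

print(ideal_area_scale, round(reported_area_scale, 2))
print(power_reduction, gap_to_target)
```

The reported area ratio is close to the ideal factor of 4, and the 90nm power estimate is still about 2.5 times the ~0.2 Watt budget.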
Conclusion & Future Work
• Conclusion
  – 2G and 3G SDR solutions are achievable in 90nm
  – Optimization opportunities at the algorithm, software and hardware levels
• Future Work
  – SDR for Idle mode operation (ISLPED ’06)
  – SODA for 4G protocols
  – Application-specific language for SDR
  – Compiler for SODA
Questions?
• www.eecs.umich.edu/~sdrg
Backup Slides
Different Levels of Software Radio
Tier     Name                              Description
Tier 0   Hardware Radio (HR)               Implemented using hardware components. Cannot be modified.
Tier 1   Software Controlled Radio (SCR)   Only control functions are implemented in software: inter-connects, power levels, etc.
Tier 2   Software Defined Radio (SDR)      Software control of a variety of modulation techniques, wide-band or narrow-band operation, security functions, etc.
Tier 3   Ideal Software Radio (ISR)        Programmability extends to the entire system with analog conversion only at the antenna.
Tier 4   Ultimate Software Radio (USR)     Defined for comparison purposes only.
<source: http://www.sdrforum.org>
Power Methodology
• Our flow sequence was
  – Design Compiler and Silicon Ensemble
    • For initial floorplan estimation
  – Physical Compiler
    • For placement and optimization
  – Silicon Ensemble
    • Routing
• We optimized for power and delay
• Blocks like memory were generated with Artisan Memory Generators
• We used the Synopsys IP blocks as much as possible to get better compiled blocks
[Figure: W-CDMA execution timeline on SODA over one W-CDMA frame (15 slots, 10 msec; 1 slot = 0.67 msec). Rows: ARM (PN / Power Ctrl.), PE1 (FIR / Mod.), PE2 (Searcher), PE3 (Turbo), PE4 (TX). Input arrives from the A/D. SYNC points are marked between P1 and ARM, and between P1, P2, and ARM. Real-time critical paths: Searcher (5 msec) and Power Control (0.67 msec).]

Kernel timings shown in the figure:
PN Code          <0.1 msec
Power Control    <0.1 msec
Deinterleaver    0.2 msec
Interleaver      0.2 msec
Turbo Encoder    0.2 msec
FIR (Tx)         8 msec
Modulation       1 msec
FIR (Rx)         0.3 msec
Demodulation     0.1 msec
Turbo Decoder    10 msec
Searcher         5 msec
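The slot length and the two real-time critical paths in the timeline can be checked with a few lines of arithmetic. This sketch reads "<0.1 msec" as an upper bound of 0.1 msec and treats each listed kernel time as the work that must finish within its critical-path deadline; both readings are assumptions, since the figure does not spell them out.

```python
# Frame structure from the timeline figure.
FRAME_MS = 10.0
SLOTS_PER_FRAME = 15
slot_ms = FRAME_MS / SLOTS_PER_FRAME   # the figure rounds this to 0.67 msec

# Kernel times from the figure; "<0.1 msec" is taken as 0.1 (assumption).
POWER_CONTROL_MS = 0.1
SEARCHER_MS = 5.0

# Critical-path deadlines from the figure.
POWER_CONTROL_DEADLINE_MS = slot_ms
SEARCHER_DEADLINE_MS = 5.0


def slack_ms(exec_ms: float, deadline_ms: float) -> float:
    """Time left before the deadline once the kernel has run."""
    return deadline_ms - exec_ms


power_control_slack = slack_ms(POWER_CONTROL_MS, POWER_CONTROL_DEADLINE_MS)
searcher_slack = slack_ms(SEARCHER_MS, SEARCHER_DEADLINE_MS)
print(round(slot_ms, 3), round(power_control_slack, 3), searcher_slack)
```

Under these readings power control has ample slack inside a slot, while the searcher uses its whole 5 msec window.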