Post on 12-Jan-2016
description
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
High Performance Mobile Computing Using Flexible Wide SIMD Processors
Scott Mahlke
in collaboration withMark Woh, Sangwon Seo, Amir Hormati, Yoonseo Choi, Trevor Mudge, Chaitali
Chakrabarti (ASU), Krisztian Flautner (ARM Ltd.)
Advanced Computer Architecture LaboratoryUniversity of Michigan
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
The Old Mobile PhoneThe Modern Mobile Phone
2
Video Recording
Video Editing
Higher Data Rates
3D Rendering
Advanced Image Processing
• Future phones are becoming more complex
• Richer applications require both more performance and more flexibility
• Modern phones look like Franken-chips
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
Power/Performance Requirements for Multiple Systems
3
Different applications have different power/performance characteristics!
We need to design keeping each application in mind!
(Not GPP but Domain Specific Processor)
1
10
100
1000
10000
0.1 1 10 100
Better
Pow
er Efficiency
1 Mops/mW
10 Mops/mW100 Mops/mW
1000 Mops/mW
SODA(65nm)
SODA (90nm)
TI C6X
Imagine
VIRAM Pentium M
IBM Cell
Pe
rfo
rma
nce
(G
op
s)
Power (Watts)
3G Wireless
4G Wireless
Mobile HDVideo
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
4G Wireless Basics
• Three kernels make up the majority of the work► FFT – Extract Data from Signals► STBC – Combine Data into More Reliable Stream► LDPC – Error Correction on Data Stream
4
Ant A/D
FFTSymbol Timing / Guard Detect
Channel Estimator
MIMODecoder
(STBC)
Channel Decoder(LDPC)
Recovered Data
Ant D/A IFFTGuard Insertion
Channel Encoder(LDPC)
S/PTransmitted
Data
Modulator
Demodulator NTT DoCoMo 4G test setup
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
High Definition Video (H.264) Basics
5
Inverse Transform
Standard Basis Patterns
Residual Data
5
3 2
Basis Coefficients
“Past” FramesCurrentFrame
Future Frames
Motion Compensation
Form Prediction
ReconstructedMacroblock
Current Uncoded Block
Dec
od
ed B
lock
Decoded Block
Prediction Based on Neighbors
Intra-Prediction
Modules
Entropy decoder
Inverse Transform
MotionCompensation
Intra Prediction
Deblocking Filter
Algorithms
Context-adaptive VLD
4x4 integer kernel
Power Profiling*
Half: 6-tap filterQuarter: 6-tap / bilinear filterEighth: 4-tap filter
Directional spatial prediction
In-loop adaptive filter
4%
29%
10%
34%
Misc Kernels 23%
4CIF@30fps
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
Mobile Signal Processing Algorithm Characteristics
6
• Algorithms have different SIMD widths► From very large to very small
• Though SIMD width varies all algorithms can exploit it► Large percentage of work can be SIMDized
• Larger SIMD width tend to have less TLP
Problems with traditional SIMD• High register file power• Large data movement/alignment cost• Inconsistent lane utilization• SIMD implies single thread
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
So, What’s the Right Solution?
• Alternatives► More processors, less lanes?► Configurable: Hardware can be SIMD or MIMD?► Franken chip?
• SIMD is the answer! It provides high performance and power efficiency► Low control cost► More area-efficient scaling► Single thread context► Simpler memory system design –cache coherence
more manageable
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
A Closer Look at SIMD: Power Breakdown
• Register file power disproportionately high in a traditional SIMD architecture
SODA WCDMAMEM3%
REG37%
ALU+MULT11%
CONTROL38%
INTERCONNECT2%
SCALAR PIPE9%
MEM
REG
ALU+MULT
CONTROL
INTERCONNECT
SCALAR PIPE
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
Register File Accesses
9
• Many of the register file access do not have to go back to the main register file
Lots of power wasted on unneeded register file access!
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science 10
LDPC – Scaling Performance with SIMD Width
Increasing SIMD width reduces Memory andShuffle Operations thus reduces power
Extra hardware power consumptionoutweighs reduced operations
• SIMD loses effectiveness when lanes cannot be put to productive use• SIMD on distributed data (SIMdD)• Efficient data rearrangement critical to success of SIMD
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
Data Alignment Issues
• H.264 Intra-prediction has 9 different prediction modes► Each prediction mode requires a specific permutation
11
B C D
A B C D B C D A B C D A B C D
A B C D
A A A A B B B B C C C C D D D D
A B C D
A B C D B C D C
E
E D E
F G
GF D E F
A
A B C D E F G H I J
A B C D G H I JG H I J A B C D
A B C D E F G
A
A B C D E F G H I J
A B C D E F G H I J
A B C D E F G H I J
K L
K L
A B B C C D DD D D DE F F G G
CDE EF FG GHI J JK KLI
CE F G H I J KI J K L D E FD
F F FEG G HH I D E G C D E F0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
16
164
4
16x16 luma MBprediction modes for each 4x4 block
a b c d
e f g h
i j k l
m n o p
I
J
K
L
A B C DX
Traditional SIMD machines take too long or cost too much to do this
Good news – small fixed number patterns per kernel
Intra-Prediction
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
4G/H.264 Summary
• Lots of different sized parallelism► From 4 wide to 96 wide to 1024 wide SIMD
• Which means many different SIMD widths need to be supported
• TLP (disjoint SIMD) often available• Very short-lived values• Lots of potential for instruction fusings
(beyond pairwise)• Limited set of shuffle patterns required for
each kernel
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
AnySP: Push SIMD
But, Increase the Inherent Flexibility and Efficiency
13
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
CROSSBAR
Scalar Pipeline
Bank 0
Bank 1
Bank 2
Bank 15
L1ProgramMemory
Controller
Bank 3
Bank 4
16-bit 8-wide 16 entry
SIMD RF-0
16-bit 8-wide 16 entry
SIMD RF-1
16-bit 8-wide 16 entry
SIMD RF-2
16-bit 8-wide 16 entry
SIMD RF-7
Scalar Memory Buffer
Multi-SIMD Datapath
AGU Group 0 Pipeline
AGU Group 7 Pipeline
AGU Group 1 Pipeline
Multi-Bank Local Memory
DMA
To Inter-PE
Bus
8 Groups of 8-wide
SIMD(64 Total Lanes)
8-wide SIMD FFU-0
8-wide SIMDFFU-1
8-wide SIMDFFU-2
8-wide SIMD FFU-7
SwizzleNetwork
16-bit 8-wide 4-entry
Buffer-0
16-bit 8-wide 4-entry
Buffer-1
16-bit 8-wide 4-entry
Buffer-2
16-bit 8-wide 4-entry
Buffer-7
Multi-Output Adder Tree
AnySP PE
AnySP Architecture – High Level
14
8 Groups of 8-Wide Flexible Function
Units
128x128 16bit Swizzle Network
16 Banked Memory with SRAM-based Crossbar
Multiple Output Adder Tree
Temporary Buffer and Bypass Network
Datapath AGU and Scalar Pipeline
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
Multi-Width SIMD Support
15
16-bit 8-wide 16 entry
SIMD RF-0
16-bit 8-wide 16 entry
SIMD RF-1
16-bit 8-wide 16 entry
SIMD RF-2
16-bit 8-wide 16 entry
SIMD RF-7
Multi-SIMD Datapath
AGU Group 0
AGU Group 7
AGU Group 1
8-wide SIMD FFU-0
8-wide SIMDFFU-1
8-wide SIMDFFU-2
8-wide SIMD FFU-7
SwizzleNetwork
16-bit 8-wide 4-entry
Buffer-0
16-bit 8-wide 4-entry
Buffer-1
16-bit 8-wide 4-entry
Buffer-2
16-bit 8-wide 4-entry
Buffer-7
Multi-Output Adder Tree
CROSSBAR
Bank 0
Bank 1
Bank 2
Bank 15
Bank 3
Bank 4
Multi-Bank Local Memory
16-bit 8-wide 16 entry
SIMD RF-3
8-wide SIMDFFU-3
16-bit 8-wide 4-entry
Buffer-3
AGU Group 2
AGU Group 3
Normal 64-Wide SIMD mode – all lanes share one AGU
16-bit 8-wide 16 entry
SIMD RF-0
16-bit 8-wide 16 entry
SIMD RF-1
16-bit 8-wide 16 entry
SIMD RF-2
16-bit 8-wide 16 entry
SIMD RF-7
Multi-SIMD Datapath
AGU Group 0
AGU Group 7
AGU Group 1
8-wide SIMD FFU-0
8-wide SIMDFFU-1
8-wide SIMDFFU-2
8-wide SIMD FFU-7
SwizzleNetwork
16-bit 8-wide 4-entry
Buffer-0
16-bit 8-wide 4-entry
Buffer-1
16-bit 8-wide 4-entry
Buffer-2
16-bit 8-wide 4-entry
Buffer-7
Multi-Output Adder Tree
CROSSBAR
Bank 0
Bank 1
Bank 2
Bank 15
Bank 3
Bank 4
Multi-Bank Local Memory
16-bit 8-wide 16 entry
SIMD RF-3
8-wide SIMDFFU-3
16-bit 8-wide 4-entry
Buffer-3
AGU Group 2
AGU Group 3
16-bit 8-wide 16 entry
SIMD RF-0
16-bit 8-wide 16 entry
SIMD RF-1
16-bit 8-wide 16 entry
SIMD RF-2
16-bit 8-wide 16 entry
SIMD RF-7
Multi-SIMD Datapath
AGU Group 0
AGU Group 7
AGU Group 1
8-wide SIMD FFU-0
8-wide SIMDFFU-1
8-wide SIMDFFU-2
8-wide SIMD FFU-7
SwizzleNetwork
16-bit 8-wide 4-entry
Buffer-0
16-bit 8-wide 4-entry
Buffer-1
16-bit 8-wide 4-entry
Buffer-2
16-bit 8-wide 4-entry
Buffer-7
Multi-Output Adder Tree
CROSSBAR
Bank 0
Bank 1
Bank 2
Bank 15
Bank 3
Bank 4
Multi-Bank Local Memory
16-bit 8-wide 16 entry
SIMD RF-3
8-wide SIMDFFU-3
16-bit 8-wide 4-entry
Buffer-3
AGU Group 2
AGU Group 3
Each 8-wide SIMD Group works on different memory locations of the same 8-wide code – AGU Offsets
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
Using SIMD Lanes for Deeper Subgraphs
16
Flexible Functional Unit allows us to
1.Exploit Pipeline-parallelism by joining two lanes together2.Handle register bypass and the temporary buffer3.Join multiple pipelines to process deeper subgraphs4.Fuse Instruction Pairs
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science 17
SRAM-based Crossbar
Controller
16 Bit Input
16 Bit Input
16 Bit Input
16 Bit Ouput
16 Bit Output
16 Bit Output
Cell(0,0)
Cell(31,0) Cell(31,31)
Cell(0,1) Cell(0,31)
Cell(1,0)
Multiple SRAM cells replace MUX of traditonal crossbar
Each cell stores configuration information
The controller selects the specific configuration based on the instruction parameter
Each cell can store up to 6 different configurations
Power reduced by 50% for 128x128 crossbar
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
AnySP vs SIMD-based Architecture
• SIMD width doubled• But that only provides half the performance gain, other half due to flexibility
features
18
0.0
0.5
1.0
1.5
2.0
2.5Baseline 64-Wide Multi-SIMD Swizzle Network Flexible Functional Unit Buffer + Bypass
No
rma
lized
Sp
eed
up
FFT 1024ptRadix-2
FFT 1024ptRadix-4
STBC LDPC H.264Intra
Prediction
H.264Deblocking
Filter
H.264Inverse
Transform
H.264Motion
Compensation
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
AnySP Energy-Delay vs SIMD-based Architecture
• Comparison based on 90nm synthesis results
• Flexibility increases utilization of datapath and hence its efficiency
19
0.0
0.2
0.4
0.6
0.8
1.0
No
rma
lize
d E
ne
rgy
-De
lay SIMD- based Architecture AnySP
FFT 1024ptRadix-2
FFT 1024ptRadix-4
STBC LDPC H.264Intra
Prediction
H.264Deblocking
Filter
H.264Inverse
Transform
H.264Motion
Compenstation
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
PE
System
Total
Components Units
SIMD Data Mem (32KB)SIMD Register File (16x1024bit)SIMD ALUs, Multipliers, and SSN
SIMD Pipeline+Clock+Routing
Intra-processor InterconnectScalar/AGU Pipeline & Misc.
ARM (Cortex-M3)Global Scratchpad Memory (128KB)
Inter-processor Bus with DMA
90nm (1V @300MHz)
4444
44
111
AreaAreamm2
Area%
9.76 38.78%3.17 12.59%4.50 17.88%1.18 4.69%
0.94 3.73%1.22 4.85%
0.6 2.38%1.8 7.15%1.0 3.97%
25.17 100%
Est.65nm (0.9V @ 300MHz) 13.1445nm (0.8V @ 300MHz) 6.86
4G + H.264 DecoderPower
mWPower
%
102.88 7.24%299.00 21.05%448.51 31.58%233.60 16.45%
93.44 6.58%134.32 9.46%
2.5 <1%10 <1%
1.5 <1%
1347.03 100%
1091.09862.09
SIMD Buffer (128B)SIMD Adder Tree
44
0.82 3.25%0.18 <1%
84.09 5.92%10.43 <1%
AnySP Power Breakdown
• We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm
20
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
Conclusions
• Scaling traditional SIMD for mobile applications► Wide-SIMD hardware under-utilized► Large fraction of power on non-computation
• AnySP design► Can possibly meet the requirements of 100Mbps 4G
and HD video on the same platform @45nm
• Flexibility/Efficiency improvements► Increase SIMD utilization (FFUs, multiple short vectors)► Reduce register file power (bypass buffer)► More efficient data shuffling (SRAM-based crossbar)
21
University of MichiganElectrical Engineering and Computer Science
University of MichiganElectrical Engineering and Computer Science
Questions
• For more information► http://cccp.eecs.umich.edu