High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of Michigan, Electrical Engineering and Computer Science

Scott Mahlke

in collaboration with Mark Woh, Sangwon Seo, Amir Hormati, Yoonseo Choi, Trevor Mudge, Chaitali Chakrabarti (ASU), Krisztian Flautner (ARM Ltd.)

Advanced Computer Architecture Laboratory, University of Michigan

The Old Mobile Phone / The Modern Mobile Phone

Video Recording

Video Editing

Higher Data Rates

3D Rendering

Advanced Image Processing

• Future phones are becoming more complex

• Richer applications require both more performance and more flexibility

• Modern phones look like Franken-chips

Power/Performance Requirements for Multiple Systems

Different applications have different power/performance characteristics!

We need to design keeping each application in mind!

(Not GPP but Domain Specific Processor)

[Figure: performance vs. power (Watts, axis from 0.1 to 100), with power-efficiency lines at 1, 10, 100, and 1000 Mops/mW. Processors plotted: SODA (65nm), SODA (90nm), TI C6X, Imagine, VIRAM, Pentium M, IBM Cell. Application requirements marked: 3G Wireless, 4G Wireless, Mobile HD Video.]

4G Wireless Basics

• Three kernels make up the majority of the work
  ► FFT – Extract Data from Signals
  ► STBC – Combine Data into More Reliable Stream
  ► LDPC – Error Correction on Data Stream

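[Figure: 4G transceiver block diagram. Receive-path blocks: antenna, A/D, Symbol Timing / Guard Detect, FFT, Channel Estimator, MIMO Decoder (STBC), Demodulator, Channel Decoder (LDPC), Recovered Data. Transmit-path blocks: Transmitted Data, Channel Encoder (LDPC), Modulator, S/P, IFFT, Guard Insertion, D/A, antenna. Photo: NTT DoCoMo 4G test setup.]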
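To make the STBC kernel concrete, here is a minimal sketch of the Alamouti scheme, the simplest space-time block code, for two transmit antennas and one receive antenna. The slides do not say which STBC the 4G workload uses, so the choice of Alamouti, the noise-free channel, and the gains `h1` and `h2` are all assumptions for illustration.

```python
def alamouti_encode(s1, s2):
    """Return the (antenna 1, antenna 2) symbols for two time slots."""
    return [(s1, s2), (-s2.conjugate(), s1.conjugate())]


def alamouti_receive(slots, h1, h2):
    """Noise-free single receive antenna: sum of both transmit paths."""
    return [h1 * a + h2 * b for a, b in slots]


def alamouti_combine(r1, r2, h1, h2):
    """Combine two received samples into estimates of s1 and s2."""
    gain = abs(h1) ** 2 + abs(h2) ** 2
    s1_hat = (h1.conjugate() * r1 + h2 * r2.conjugate()) / gain
    s2_hat = (h2.conjugate() * r1 - h1 * r2.conjugate()) / gain
    return s1_hat, s2_hat


h1, h2 = 0.8 + 0.3j, -0.2 + 0.9j   # hypothetical channel gains
s1, s2 = 1 + 1j, -1 + 1j           # two example symbols
r1, r2 = alamouti_receive(alamouti_encode(s1, s2), h1, h2)
est1, est2 = alamouti_combine(r1, r2, h1, h2)
```

Every symbol pair goes through the same handful of complex multiply-adds, which is why this kind of kernel maps naturally onto SIMD lanes.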

High Definition Video (H.264) Basics

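[Figure: Inverse Transform. Basis Coefficients applied to Standard Basis Patterns yield the Residual Data.]

[Figure: Motion Compensation. A prediction is formed from "Past" Frames and Future Frames for the Current Frame, giving the Reconstructed Macroblock.]

[Figure: Intra-Prediction. The Current Uncoded Block is predicted from neighboring Decoded Blocks (Prediction Based on Neighbors).]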

Modules               Algorithms
Entropy decoder       Context-adaptive VLD
Inverse Transform     4x4 integer kernel
Motion Compensation   Half: 6-tap filter; Quarter: 6-tap / bilinear filter; Eighth: 4-tap filter
Intra Prediction      Directional spatial prediction
Deblocking Filter     In-loop adaptive filter

[Figure: Power Profiling* by module at 4CIF@30fps. Misc Kernels: 23%.]
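The half-sample 6-tap filter listed for motion compensation can be sketched as follows. The taps (1, -5, 20, 20, -5, 1) with rounding and a shift by 5 are the standard H.264 luma half-sample filter; the function names and the one-dimensional row layout are assumptions for illustration, not the AnySP implementation.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264 luma half-sample filter taps (sum = 32)


def clip8(x):
    """Clamp to the 8-bit pixel range."""
    return max(0, min(255, x))


def half_pel_row(pixels):
    """Half-sample value at the centre of each 6-pixel window of a row."""
    out = []
    for i in range(len(pixels) - 5):
        acc = sum(t * p for t, p in zip(TAPS, pixels[i:i + 6]))
        out.append(clip8((acc + 16) >> 5))  # round, then divide by 32
    return out


flat = half_pel_row([100] * 8)                 # constant row stays constant
edge = half_pel_row([0, 0, 0, 255, 255, 255])  # value halfway across an edge
```

Each output is an independent dot product over a sliding window, so neighbouring outputs can be computed in parallel lanes once the input pixels are aligned.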

Mobile Signal Processing Algorithm Characteristics

• Algorithms have different SIMD widths
  ► From very large to very small

• Though SIMD width varies, all algorithms can exploit it
  ► Large percentage of work can be SIMDized

• Larger SIMD widths tend to have less TLP

Problems with traditional SIMD
• High register file power
• Large data movement/alignment cost
• Inconsistent lane utilization
• SIMD implies single thread

So, What’s the Right Solution?

• Alternatives
  ► More processors, fewer lanes?
  ► Configurable: Hardware can be SIMD or MIMD?
  ► Franken chip?

• SIMD is the answer! It provides high performance and power efficiency
  ► Low control cost
  ► More area-efficient scaling
  ► Single thread context
  ► Simpler memory system design – cache coherence more manageable

A Closer Look at SIMD: Power Breakdown

• Register file power disproportionately high in a traditional SIMD architecture

[Figure: SODA WCDMA power breakdown. MEM 3%, REG 37%, ALU+MULT 11%, CONTROL 38%, INTERCONNECT 2%, SCALAR PIPE 9%.]

Register File Accesses

• Many of the register file accesses do not have to go back to the main register file

Lots of power wasted on unneeded register file access!
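As a rough illustration of why short-lived values make main register file accesses avoidable, the sketch below counts how many source reads still reach the register file when recently produced values can be forwarded instead. The instruction format, the register names, and the `bypass_depth` parameter are hypothetical; this is a toy model, not the measurement behind the slide.

```python
def count_rf_reads(program, bypass_depth):
    """Count source reads that must go to the main register file.

    program: list of (dest, src1, src2) register names in issue order.
    A read is served by the bypass path when its producer is one of the
    previous `bypass_depth` instructions.
    """
    rf_reads = 0
    for i, (_, *srcs) in enumerate(program):
        recent = {dest for dest, *_ in program[max(0, i - bypass_depth):i]}
        rf_reads += sum(1 for s in srcs if s not in recent)
    return rf_reads


# Hypothetical dependence chain with short-lived temporaries t0..t2.
program = [
    ("t0", "a", "b"),
    ("t1", "t0", "c"),
    ("t2", "t1", "t0"),
    ("d", "t2", "a"),
]
no_bypass = count_rf_reads(program, 0)
one_deep = count_rf_reads(program, 1)
two_deep = count_rf_reads(program, 2)
```

In this toy chain, forwarding from the two most recent results halves the number of register file reads.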


LDPC – Scaling Performance with SIMD Width

Increasing SIMD width reduces memory and shuffle operations, and thus reduces power

Extra hardware power consumption outweighs the reduced operations

• SIMD loses effectiveness when lanes cannot be put to productive use
• SIMD on distributed data (SIMdD)
• Efficient data rearrangement critical to success of SIMD
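The first point can be quantified with a simple utilization estimate: a vector of length N on a W-lane machine needs ceil(N/W) passes, and only N of the resulting lane-slots do useful work. The vector lengths below are illustrative assumptions, and the model ignores shuffle and memory overheads.

```python
import math


def lane_utilization(vector_len, simd_width):
    """Fraction of lane-slots doing useful work for one vector operation."""
    passes = math.ceil(vector_len / simd_width)
    return vector_len / (passes * simd_width)


# Illustrative vector lengths on a 64-lane machine.
narrow = lane_utilization(4, 64)      # a 4-wide kernel leaves most lanes idle
medium = lane_utilization(96, 64)     # needs two passes, second half empty
wide = lane_utilization(1024, 64)     # divides evenly across the lanes
```

The same 4-wide kernel on an 8-lane group would reach 50% utilization, which is the motivation for splitting a wide datapath into narrower groups.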

Data Alignment Issues

• H.264 Intra-prediction has 9 different prediction modes
  ► Each prediction mode requires a specific permutation

[Figure: shuffle patterns for the intra-prediction modes, mapping neighboring pixels (labeled A through L) onto the 16 positions (a–p, numbered 0–15) of a 4x4 block. One of the simpler patterns: A B C D → A A A A B B B B C C C C D D D D. Caption: 16x16 luma MB prediction modes for each 4x4 block.]

Traditional SIMD machines take too long or cost too much to do this

Good news – a small, fixed number of patterns per kernel
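Because each kernel needs only a few fixed patterns, a shuffle can be modeled as a lookup into a small table of index vectors. The sketch below shows two such patterns for a 4x4 block held row-major in 16 lanes; the names `TILE` and `REPLICATE` and the table layout are assumptions for illustration.

```python
def shuffle(src, pattern):
    """Apply a permutation/replication pattern: out[i] = src[pattern[i]]."""
    return [src[j] for j in pattern]


# Two hypothetical entries in a per-kernel pattern table.
TILE = [0, 1, 2, 3] * 4                  # A B C D repeated in every row
REPLICATE = [i // 4 for i in range(16)]  # A A A A B B B B C C C C D D D D

PATTERNS = {"tile": TILE, "replicate": REPLICATE}

neighbors = ["A", "B", "C", "D"]
tiled = shuffle(neighbors, PATTERNS["tile"])
replicated = shuffle(neighbors, PATTERNS["replicate"])
```

Selecting a prediction mode then reduces to selecting a table entry, which is the property a preconfigured shuffle network can exploit.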

Intra-Prediction

4G/H.264 Summary

• Lots of differently sized parallelism
  ► From 4 wide to 96 wide to 1024 wide SIMD

• Which means many different SIMD widths need to be supported

• TLP (disjoint SIMD) often available
• Very short-lived values
• Lots of potential for instruction fusion (beyond pairwise)
• Limited set of shuffle patterns required for each kernel

AnySP: Push SIMD

But, Increase the Inherent Flexibility and Efficiency

[Figure: AnySP PE block diagram. L1 Program Memory and Controller; Scalar Pipeline with Scalar Memory Buffer; AGU Pipeline; Multi-Bank Local Memory (Bank 0 to Bank 15) behind a Crossbar; Multi-SIMD Datapath with 16-bit 8-wide 16-entry SIMD register files (RF-0 to RF-7), 8 groups of 8-wide SIMD FFUs (FFU-0 to FFU-7, 64 total lanes), a Swizzle Network, 16-bit 8-wide 4-entry buffers (Buffer-0 to Buffer-7), and a Multi-Output Adder Tree; a link labeled "To Inter-PE".]

AnySP Architecture – High Level

8 Groups of 8-Wide Flexible Functional Units

128x128 16-bit Swizzle Network

16 Banked Memory with SRAM-based Crossbar

Multiple Output Adder Tree

Temporary Buffer and Bypass Network

Datapath AGU and Scalar Pipeline

Multi-Width SIMD Support

[Figure: Multi-SIMD Datapath. SIMD RF-0 to RF-7, 8-wide SIMD FFU-0 to FFU-7, the Swizzle Network, and Buffer-0 to Buffer-7, connected through the Crossbar to memory Banks 0 to 15 and driven by AGU Groups 0 to 7.]

Normal 64-Wide SIMD mode – all lanes share one AGU


Each 8-wide SIMD Group works on different memory locations of the same 8-wide code – AGU Offsets
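The difference between the two modes can be sketched as an address-generation model: in 64-wide mode one AGU supplies a single base for all lanes, while in multi-SIMD mode each 8-wide group runs the same 8-wide code at its own offset. The function names, the consecutive-word layout, and the example offsets are assumptions; the actual AGU behaviour is not specified in the slides.

```python
GROUPS, LANES = 8, 8  # 8 groups of 8-wide SIMD, 64 lanes total


def wide_addresses(base):
    """64-wide mode: one shared AGU base, lanes read 64 consecutive words."""
    return [[base + g * LANES + l for l in range(LANES)] for g in range(GROUPS)]


def multi_simd_addresses(base, group_offsets):
    """Multi-SIMD mode: same 8-wide code, each group adds its own offset."""
    return [[base + off + l for l in range(LANES)] for off in group_offsets]


wide = wide_addresses(0)
# Hypothetical offsets: each group works on its own block, 100 words apart.
split = multi_simd_addresses(0, [g * 100 for g in range(GROUPS)])
```

With per-group offsets, eight independent short vectors can be processed at once instead of padding each one out to 64 lanes.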

Using SIMD Lanes for Deeper Subgraphs

Flexible Functional Unit allows us to

1. Exploit pipeline parallelism by joining two lanes together
2. Handle register bypass and the temporary buffer
3. Join multiple pipelines to process deeper subgraphs
4. Fuse instruction pairs


SRAM-based Crossbar

[Figure: SRAM-based crossbar. A grid of cells, Cell(0,0) through Cell(31,31), with 16-bit inputs and 16-bit outputs, driven by a Controller.]

Multiple SRAM cells replace the MUXes of a traditional crossbar

Each cell stores configuration information

The controller selects the specific configuration based on the instruction parameter

Each cell can store up to 6 different configurations

Power reduced by 50% for 128x128 crossbar
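A behavioural sketch of this idea: the crossbar holds a few preloaded routing configurations, and an instruction parameter picks one of them. The class name, the list-based routing format, and the error handling are assumptions; only the limit of 6 stored configurations comes from the slide.

```python
class SramCrossbar:
    """Toy model of a crossbar with a small set of stored configurations."""

    MAX_CONFIGS = 6  # from the slide: up to 6 configurations per cell

    def __init__(self, width):
        self.width = width
        self.configs = []  # configs[k][out] = input index routed to output

    def store(self, pattern):
        """Preload one routing pattern and return its configuration id."""
        if len(self.configs) >= self.MAX_CONFIGS:
            raise ValueError("configuration storage is full")
        if len(pattern) != self.width or any(
            not 0 <= i < self.width for i in pattern
        ):
            raise ValueError("pattern does not match crossbar width")
        self.configs.append(list(pattern))
        return len(self.configs) - 1

    def route(self, config_id, inputs):
        """Route inputs to outputs using a previously stored configuration."""
        return [inputs[i] for i in self.configs[config_id]]


xbar = SramCrossbar(4)
reverse_id = xbar.store([3, 2, 1, 0])
broadcast_id = xbar.store([0, 0, 0, 0])
reversed_out = xbar.route(reverse_id, ["a", "b", "c", "d"])
broadcast_out = xbar.route(broadcast_id, ["a", "b", "c", "d"])
```

The trade-off is generality for cost: only preloaded patterns are available at run time, which fits kernels that need a limited set of shuffles.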

AnySP vs SIMD-based Architecture

• SIMD width doubled
• But that only provides half the performance gain; the other half is due to flexibility features

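[Figure: per-kernel bar chart comparing AnySP with the SIMD-based architecture. Legend: Baseline, 64-Wide Multi-SIMD, Swizzle Network, Flexible Functional Unit, Buffer + Bypass. Kernels: FFT 1024pt Radix-2, FFT 1024pt Radix-4, STBC, LDPC, H.264 Intra Prediction, H.264 Deblocking Filter, H.264 Inverse Transform, H.264 Motion Compensation.]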

AnySP Energy-Delay vs SIMD-based Architecture

• Comparison based on 90nm synthesis results

• Flexibility increases utilization of datapath and hence its efficiency

[Figure: energy-delay per kernel, SIMD-based Architecture vs. AnySP. Kernels: FFT 1024pt Radix-2, FFT 1024pt Radix-4, STBC, LDPC, H.264 Intra Prediction, H.264 Deblocking Filter, H.264 Inverse Transform, H.264 Motion Compensation.]

4G + H.264 Decoder, 90nm (1V @ 300MHz)

Components                            Area (mm2)   Area %    Power (mW)   Power %
SIMD Data Mem (32KB)                  9.76         38.78%    102.88       7.24%
SIMD Register File (16x1024bit)       3.17         12.59%    299.00       21.05%
SIMD ALUs, Multipliers, and SSN       4.50         17.88%    448.51       31.58%
SIMD Pipeline+Clock+Routing           1.18         4.69%     233.60       16.45%
SIMD Buffer (128B)                    0.82         3.25%     84.09        5.92%
SIMD Adder Tree                       0.18         <1%       10.43        <1%
Intra-processor Interconnect          0.94         3.73%     93.44        6.58%
Scalar/AGU Pipeline & Misc.           1.22         4.85%     134.32       9.46%
System:
ARM (Cortex-M3)                       0.6          2.38%     2.5          <1%
Global Scratchpad Memory (128KB)      1.8          7.15%     10           <1%
Inter-processor Bus with DMA          1.0          3.97%     1.5          <1%

Total, 90nm (1V @ 300MHz)             25.17        100%      1347.03      100%
Est. 65nm (0.9V @ 300MHz)             13.14                  1091.09
Est. 45nm (0.8V @ 300MHz)             6.86                   862.09

AnySP Power Breakdown

• We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm
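The 65nm and 45nm power totals in the table match the 90nm total scaled by the square of the supply voltage at a fixed 300MHz clock, to within rounding. The sketch below reproduces that arithmetic. That the authors used exactly this rule is an assumption based on the numbers; the model also ignores leakage and capacitance changes.

```python
P_90NM_MW = 1347.03  # total power from the table, 1.0 V @ 300 MHz


def scale_power(p_ref_mw, v_ref, v_new):
    """Scale dynamic power by (V_new / V_ref)^2 at a fixed clock frequency."""
    return p_ref_mw * (v_new / v_ref) ** 2


p_65nm = scale_power(P_90NM_MW, 1.0, 0.9)  # table lists 1091.09 mW
p_45nm = scale_power(P_90NM_MW, 1.0, 0.8)  # table lists 862.09 mW
under_one_watt = p_45nm < 1000.0
```

Under this rule the 45nm estimate lands below 1000 mW, consistent with the claim of under 1 Watt.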

Conclusions

• Scaling traditional SIMD for mobile applications
  ► Wide-SIMD hardware under-utilized
  ► Large fraction of power on non-computation

• AnySP design
  ► Can possibly meet the requirements of 100Mbps 4G and HD video on the same platform @45nm

• Flexibility/Efficiency improvements
  ► Increase SIMD utilization (FFUs, multiple short vectors)
  ► Reduce register file power (bypass buffer)
  ► More efficient data shuffling (SRAM-based crossbar)

Questions

• For more information
  ► http://cccp.eecs.umich.edu
