The World’s Fastest DSP Core: Breaking the 100 GMAC/s Barrier

Hot Chips 23 - August 2011

Chris Rowen, Dan Nicolaescu, Rajiv Ravindran, David Heine, Grant Martin, James Kim, Dror Maydan, Nupur Andrews, Bill Huffman, Vakis Papaparaskeva, Shay Gal-On, Peter Nuth, Pushkar Patwardhan, and Manish Paradkar

Tensilica Inc.

Copyright © 2011 Tensilica, Inc.

Outline

• Introduction: 4G Wireless Processing for LTE and LTE-Advanced
• BBE64 Fundamental Goals
• Instruction Set Design:
  • SIMD/VLIW
  • Programmer State and Register Organization
  • Operations
  • Example: Select and FIR operations
  • Execution Rate Summary
• BBE64: Where Does the Hardware Go?
• BBE64 Software
  • SIMD-Width Independent Programming Model
  • LTE-Advanced Execution Profile
  • Kernel Examples
• Implementation and Verification Timelines
• Tools for Processor Automation


4G Rollout

Evolution of Major Cellular Standards

[Timeline chart, 2006 to 2012, one track per standards family]

WiMAX Evolution:
• Fixed WiMAX
• Mobile WiMAX Wave 1: DL 23 Mbps, UL 4 Mbps (10 MHz, 3:1 TDD)
• Mobile WiMAX Wave 2: DL 46 Mbps, UL 4 Mbps (10 MHz, 3:1 TDD)

CDMA2000 Evolution:
• EVDO Rev 0: DL 2.4 Mbps, UL 153 Kbps (in 1.25 MHz)
• EVDO Rev 1: DL 3.1 Mbps, UL 1.8 Mbps (in 1.25 MHz)
• EVDO Rev 2: DL 14.7 Mbps, UL 4.9 Mbps (in 5 MHz)
• EVDO Rev 3: DL 100 Mbps, UL 50 Mbps (in 20 MHz)

3GPP GSM EDGE Evolution:
• EDGE: DL 474 Kbps, UL 474 Kbps
• Enhanced EDGE: DL 1.3 Mbps, UL 653 Kbps
• Evolved EDGE: DL 1.89 Mbps, UL 947 Kbps

3GPP UMTS Evolution:
• HSDPA: DL 14.4 Mbps, UL 384 Kbps (in 5 MHz)
• HSUPA/HSDPA: DL 14.4 Mbps, UL 5.76 Mbps (in 5 MHz)
• HSPA Evolution: DL 42 Mbps, UL 11.5 Mbps (in 5 MHz)

3GPP Long Term Evolution:
• LTE (Rel 8): DL 150 Mbps, UL 50 Mbps (in 20 MHz)
• LTE-Advanced: DL 1 Gbps, UL 500 Mbps (5-10x LTE)

Source: Rysavy Research, “Mobile Broadband: EDGE, HSPA & LTE”, 2006


Product Context: Specialized and General Baseband DSP Cores

[Chart: performance (GMACs, log scale, 1 to 100) versus core size (mm2, without memory, log scale, 0.01 to 1.0) in a 28nm standard cell process. Cores plotted: Xtensa µDSPs, user-specified DSPs, ConnX D2 (dual-MAC DSP), ConnX Vectra LX DSP, ConnX Turbo16MS, ConnX BSP3, ConnX SSP16, ConnX BBE16, ConnX BBE64 (handset), and ConnX BBE64-128 (100 GMACs/sec, infrastructure).]


BBE64 Family Design Goals and Philosophy

• World-leading DSP performance for baseband PHY in handsets (aka “User Equipment” or “UE”) and infrastructure (aka base stations or “eNodeB”)
• Up to 1 GHz in an available 28nm fast standard cell process x 128 MACs
• Combine SIMD, VLIW and configurable instruction set features for a large application “sweet spot”
• Leverage the high memory system bandwidth of Xtensa LX4: 1024b per cycle
• Good control code performance with multi-issue base ops, including multiple load/store per cycle
• Offer both a broad range of built-in options and user-defined extensions to fit. Two initial base configurations: BBE64-UE, BBE64-128
• Leverage advanced vectorizing compilers, C scalar/vector data types, operator overloading and optimized intrinsics to eliminate the need for assembly coding
• ConnX BBE16 upward compatibility
• Fully synthesizable RTL, with complete system modeling, verification and back-end flow environment
• Core as building block for leading-edge UE and infrastructure SoC platforms
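
The headline numbers follow from simple arithmetic on the goals above: 128 MACs per cycle at up to 1 GHz, and 1024b of memory bandwidth per cycle. The sketch below only restates that arithmetic as peak figures; the function names are hypothetical and sustained rates on real kernels will be lower.

```c
#include <assert.h>

/* Peak MAC rate in GMAC/s: MACs issued per cycle times the clock in GHz. */
static double peak_gmacs(unsigned macs_per_cycle, double clock_ghz)
{
    return macs_per_cycle * clock_ghz;
}

/* Peak local-memory bandwidth in GB/s for a bus moving bits_per_cycle bits. */
static double peak_mem_gbytes(unsigned bits_per_cycle, double clock_ghz)
{
    return (bits_per_cycle / 8.0) * clock_ghz;
}
```

With the slide's figures, `peak_gmacs(128, 1.0)` gives 128 GMAC/s and `peak_mem_gbytes(1024, 1.0)` gives 128 GB/s, which is where the "100 GMAC/s barrier" in the title comes from.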



ConnX BBE64 Block Diagram

Optimized Architecture for DSP Applications
• 4-way VLIW x 32-way SIMD = 128 DSP ops/cycle
• 16/24/96b 4-issue VLIW – almost any instruction in any slot
• 128 MAC ops/cycle for matrix and filter functions (BBE64-128)
• Guard-bits on all DSP data for numerical accuracy
• Protected pipeline: interlocks/bypasses for robustness
• Support for all data types from C
  • Complex/real
  • Scalar/vector
  • Fractional/integer

High Bandwidth Configurable Memory Subsystem Interface
• Dual load/stores with dual 512b memory interfaces
• Full bandwidth on packed and unaligned data vectors
• DMA support for local data memory
• Extensible with special memory ports and direct-connect data queues: 4 x 640b per cycle

[Block diagram, recoverable labels:
• Instruction Memory Interface: Local Memory or Cache (1-4 ways), 96 bits wide, feeding a 4-Way VLIW Instruction Decoder
• Issue slot labels: Load/Store, Load/Store/ALU, MAC/ALU, MAC/ALU
• General Register File (x 32 bits) with 32b ALUs
• Vector Register File: 16 x 640 bits (32 x 20 real, 16 x 20 complex, 64 x 10 real, 32 x 10 complex)
• Wide vector register file: 8 x 640 bits (16 x 40 real, 8 x 40 complex)
• Boolean register file: 8 x 64 bits (64 x 1 Boolean)
• Computation Unit: 32-Way SIMD ALUs and 64-Way MACs
• Data Load/Store Unit 0 and Unit 1 (16/32/64/128/512 bit to 640 bit), each with Align/Pack
• Data Memory Interface: two ports, each 512 bits wide, to the Local Data RAM Banks]


BBE64 Pipeline

[Pipeline diagram, stage labels: I AdrGen, I Data/tag, I Align, Decode / Reg Read, Exec / AdrGen, L1 data/tag, L1 Align, WB; DSP path: DSP Reg Read, DSP Ex1, DSP Ex2, DSP WB]

• Two pipeline options:
  • 9 stage pipeline – higher MHz or larger memories
  • 7 stage pipeline – lower power and area
• Wide static in-order issue
• No branch prediction, but zero-overhead loops and SIMD predication
• Simple length encoding enables single-stage instruction decode and register specifier extraction
• DSP operations start with load return: zero load-use bubbles
• Simplified ALU/MAC operations allow DSP pipe reduction to two stages + write back for reduced regfile cost – fewer values in flight, better utilization of slots


Typical Data Parallelism in SIMD/VLIW Ops

VLIW Instruction typically encodes 1-2 load/store, 2-3 ALU and MAC operations

ALU or MAC operation typically works on 16 40b or 32 20b operands

Each 80b may represent 4 20b real, 2 20b+20b complex, 2 40b real, or 1 40b+40b complex

Two 18b x 18b multipliers per 20b element – paired real multiply or half a complex multiply
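
The multiplier count above can be read off the arithmetic of a complex product: it needs four real multiplies, and a complex value occupies two 20b elements with two multipliers each. The sketch below illustrates that in plain C under stated assumptions: `cplx_acc`, `paired_real_mul` and `complex_mul` are hypothetical names, operands are 16b rather than 18b for simplicity, and the 64-bit result type stands in for the guarded accumulators.

```c
#include <assert.h>
#include <stdint.h>

/* Wide result so that the sum of two 16b x 16b products cannot overflow. */
typedef struct { int64_t re, im; } cplx_acc;

/* Paired real multiply: two independent products, one per multiplier. */
static void paired_real_mul(int16_t a0, int16_t b0,
                            int16_t a1, int16_t b1, int64_t out[2])
{
    out[0] = (int64_t)a0 * b0;
    out[1] = (int64_t)a1 * b1;
}

/* Complex multiply: four real products. The two products that form re
 * (or the two that form im) are the "half a complex multiply" that one
 * element's pair of multipliers can supply. */
static cplx_acc complex_mul(int16_t ar, int16_t ai, int16_t br, int16_t bi)
{
    cplx_acc p;
    p.re = (int64_t)ar * br - (int64_t)ai * bi;
    p.im = (int64_t)ar * bi + (int64_t)ai * br;
    return p;
}
```

The imaginary part of a product of two full-scale inputs reaches 2^31, one more than a 32-bit signed result can hold, which is one reason wider containers with guard bits are useful.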


Baseline DSP instruction set summary

550 distinct DSP ops above base Xtensa ISA. All instructions prefixed with “BBE_”.

Load:
• LV<m>[SU][TF].{I,IP,IC,X,XP} : load vector of 16b elements
• LP<m>[SU][TF].{I,IP,IC,X,XP} : load complex pair of 16b elements
• LS<m>[SU][TF].{I,IP,IC,X,XP} : load scalar 16b element
• LA<m>[SU][TF].{I,IP,IC,X,XP} : load unaligned vector of 16b elements
• Plus specialty loads for guard-bits, Boolean vectors, compressed data

Store:
• SV<m>[SU][TF].{I,IP,IC,X,XP} : store vector of 16b elements
• SP<m>[SU][TF].{I,IP,IC,X,XP} : store complex pair of 16b elements
• SS<m>[SU][TF].{I,IP,IC,X,XP} : store scalar 16b element
• SA<m>[SU][TF].{I,IP,IC,X,XP} : store unaligned vector of 16b elements
• SV<m>PACK[QSR].{I,IP,IC,X,XP} : pack 40b elements to 16b and store vector
• SP<m>PACK[QSR].{I,IP,IC,X,XP} : pack 40b elements to 16b and store pair
• SS<m>PACK[QSR].{I,IP,IC,X,XP} : pack 40b elements to 16b and store scalar
• Plus specialty stores for guard-bits, Boolean vectors, expanded data, bit-reversed addressing

<m> = 16X32, 32X16

{I,IP,IC,X,XP} addressing modes:
• .I: base register + immed offset
• .IP: base register + immed offset, post-update
• .X: base register + register index
• .XP: base register + register index, post-update
• .IC: base register + increment, post-update in circular buffer
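
The five addressing modes can be modelled as small address-generation functions. This is a minimal sketch, not the documented hardware behaviour: it assumes that "post-update" means the access uses the old base and the base register is advanced afterwards, and that the circular mode wraps the base inside a [begin, end) window. All function names and the window parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* .I : access at base + immediate; the base register is not modified. */
static uint32_t addr_i(uint32_t base, int32_t imm)
{
    return base + (uint32_t)imm;
}

/* .IP : access at the current base, then add the immediate to the base. */
static uint32_t addr_ip(uint32_t *base, int32_t imm)
{
    uint32_t addr = *base;
    *base += (uint32_t)imm;
    return addr;
}

/* .X : access at base + index register; the base register is not modified. */
static uint32_t addr_x(uint32_t base, int32_t index)
{
    return base + (uint32_t)index;
}

/* .XP : access at the current base, then add the index register to the base. */
static uint32_t addr_xp(uint32_t *base, int32_t index)
{
    uint32_t addr = *base;
    *base += (uint32_t)index;
    return addr;
}

/* .IC : access at the current base, then advance the base by inc and wrap it
 * inside the circular buffer [begin, end) (assumed window representation). */
static uint32_t addr_ic(uint32_t *base, int32_t inc, uint32_t begin, uint32_t end)
{
    uint32_t addr = *base;
    int64_t len = (int64_t)end - (int64_t)begin;
    int64_t off;

    assert(len > 0);
    off = ((int64_t)addr - (int64_t)begin + inc) % len;
    if (off < 0)
        off += len;               /* keep the offset non-negative after a backward step */
    *base = begin + (uint32_t)off;
    return addr;
}
```

The post-update forms let a loop stream through memory without a separate pointer-increment operation, which keeps issue slots free for ALU and MAC work.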


Baseline DSP instruction set highlights

ALU:
• ABS<s> – absolute value
• [R]ADD<s>[TF] – add
• AND<s> – bitwise and
• CONJ<s> – conjugate complex
• EQ<s>[C] – set Boolean on equal
• LT<s> – set Boolean on less than
• LE<s> – set Boolean on less than or equal
• [R]MAX[U]<s>[TF] – maximum
• [R]MIN[U]<s>[TF] – minimum
• MOV<s>[TF][C] – move conditional
• NAND<s>[TF] – bitwise not and
• NEG<s>[TF] – negate
• NEQ<s>[C] – set Boolean on not equal
• [R]NSA[U]<s>[C] – normalize shift amt
• OR<s> – bitwise or
• SAT<s>[SU] – saturate to memory size
• SL[ALS]<s>{B,BR} – shift left
• SR[ALS]<s>{B,BR,I} – shift right
• SUB<s>[TF] – subtract
• XOR<s> – bitwise xor

Multiply:
• MAG32X18C{PACKQ,PACKS} – complex magnitude
• MUL32X18{,C,J,JC}{,PACKQ,PACKS} – multiply
• MULA32X18{,C,J}{,PACKQ,PACKS} – multiply-add
• MULSIGN<s>
• POLY_STEP32X20.STEP – polynomial series expansion with parallel table lookup

Data Organization:
• PACK[LQRS]32X40 – pack 40b to 20b
• UNPACK32X20 – unpack from 20b to 40b
• SEL<s>[I] – select elements from vector pair
• SHFL<s>[I] – select elements from vector
• Plus similar operations for Boolean vectors

<s> = 16X40, 32X20
All operations prefixed with “BBE_”
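
The pack and saturate operations narrow wide, guarded results back to the memory data size. The sketch below shows one plausible version of that step for a single element: a Q30 product sum held in a wide accumulator is rounded, shifted down to Q15 and saturated to 16 bits. The slides do not define the rounding or shift behaviour of the individual PACK variants, so the round-to-nearest rule, the 15-bit shift and the function names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Floor division by 2^15, written without relying on the
 * implementation-defined right shift of negative values. */
static int64_t floor_div_2_15(int64_t x)
{
    return x >= 0 ? x / 32768 : -((-x + 32767) / 32768);
}

/* Narrow a wide accumulator holding Q30 product sums to a saturated Q15 value. */
static int16_t pack_q15_sat(int64_t acc)
{
    int64_t r = floor_div_2_15(acc + (1 << 14));  /* add half an LSB, then drop 15 bits */

    if (r > INT16_MAX)
        return INT16_MAX;                         /* clamp positive overflow */
    if (r < INT16_MIN)
        return INT16_MIN;                         /* clamp negative overflow */
    return (int16_t)r;
}
```

Because the accumulator is wider than the products, several multiply-adds can be summed before this narrowing step, and overflow is handled once at the end by saturation instead of corrupting intermediate sums.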


Programmer’s Model: Registers

• Large vector register file supports deep software pipelining and reduced memory traffic
• Partitioned register file for added bandwidth with less register bloat (narrow/wide operands)
• Unalignment registers enable full bandwidth streaming of arbitrary alignment and size
• Vector Booleans for flexible SIMD and VLIW predication
• Aggregate regfile bandwidth at 1 GHz: >12 Tbps

[Register diagram]

Base Xtensa ISA (user):
• General Registers (AR): 16x32 (or windowed to 64x32)
• LOOP: 3x32
• SA

BBE64 ISA Extension:
• Vectors (VR): 16x640
• Wide Vectors (WR): 8x640
• Unalign (UR): 4x512
• Boolean Vector (VBR): 8x32
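
Vector Booleans let each SIMD lane decide independently whether an operation takes effect, so data-dependent branches are not needed inside a vector. A minimal sketch of the idea in portable C follows; the lane count of 32 matches the 32-way SIMD described in the slides, while the function names and the use of `bool` arrays as the Boolean register are illustrative assumptions, not the BBE64 intrinsics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LANES 32

/* Lane-wise compare: mask[i] is true where a[i] < b[i] (models an LT-style op). */
static void vec_lt(const int32_t a[LANES], const int32_t b[LANES], bool mask[LANES])
{
    for (int i = 0; i < LANES; i++)
        mask[i] = a[i] < b[i];
}

/* Predicated move: dst[i] takes src[i] only in lanes where mask[i] is true. */
static void vec_mov_true(int32_t dst[LANES], const int32_t src[LANES],
                         const bool mask[LANES])
{
    for (int i = 0; i < LANES; i++)
        if (mask[i])
            dst[i] = src[i];
}

/* Clamp every lane to an upper limit using a mask instead of a
 * data-dependent branch per element. */
static void vec_clamp_max(int32_t v[LANES], const int32_t limit[LANES])
{
    bool over[LANES];

    vec_lt(limit, v, over);        /* over[i] = limit[i] < v[i] */
    vec_mov_true(v, limit, over);  /* replace only the lanes that exceed the limit */
}
```

On a machine without branch prediction, this compare-then-predicate pattern keeps the in-order pipeline full regardless of the data.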


Register File Organization

[Diagram: the VEC RF and WVEC RF register files, each with its own clocked stages]

Register file partitioning reduces abstracted bypass structure


Data Reorganization Key to SIMD: Selection

Example: operation BBE_SEL32X20 {out vec c, in vec h, in vec l, in vec s}

[Diagram: each output element is produced by a 64:1 mux over the elements of Vector h and Vector l; the 32 mux select fields come from Vector s]

Options:
• Select Immediate (45 patterns)
• Shuffle (Single Input Vector)
• Shuffle Immediate (75 patterns)
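
The select operation can be pictured as 32 independent 64:1 multiplexers, one per output element, each steered by one field of the select vector. The sketch below is a scalar model of that behaviour. The index layout (0 to 31 picks from `l`, 32 to 63 picks from `h`) is an assumption, as are the function name and the use of `int32_t` to hold each 20b element.

```c
#include <assert.h>
#include <stdint.h>

#define SEL_LANES 32

/* Scalar model of a two-input vector select:
 * c[i] = element s[i] of the 64-element concatenation {h, l}. */
static void sel32(int32_t c[SEL_LANES],
                  const int32_t h[SEL_LANES], const int32_t l[SEL_LANES],
                  const uint8_t s[SEL_LANES])
{
    for (int i = 0; i < SEL_LANES; i++) {
        unsigned idx = s[i] & 63u;                        /* 6-bit selector for a 64:1 mux */
        c[i] = (idx < SEL_LANES) ? l[idx] : h[idx - SEL_LANES];
    }
}
```

With `s[i] = i + 1` the result is the input window shifted by one element across the vector pair, which is the kind of data movement a sliding filter window needs.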


Accelerated Operation Examples: Real FIR

[Diagram: data selected from a vector pair feeds 2-4 banks of 32 multipliers; the products are summed into 2 x 16 40b accumulator vectors; shifted data is written back for the next step. Coefficients are replicated in a companion operation to reduce fan-out.]
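
A scalar reference makes the data flow of the FIR operation easier to follow: each step multiplies one coefficient, replicated to all lanes, by a window of input data shifted by one sample, and adds the products into a vector of wide accumulators. The sketch below assumes 32 output lanes, 16b data and 64-bit accumulators standing in for the 40b ones; `fir_block` and `FIR_LANES` are hypothetical names, and the hardware operation processes several taps per issue rather than one.

```c
#include <assert.h>
#include <stdint.h>

#define FIR_LANES 32

/* Compute FIR_LANES consecutive outputs of a real FIR filter:
 *   acc[i] = sum over k of coef[k] * x[i + k]
 * x must hold at least FIR_LANES + taps - 1 samples. */
static void fir_block(const int16_t *x, const int16_t *coef, int taps,
                      int64_t acc[FIR_LANES])
{
    for (int i = 0; i < FIR_LANES; i++)
        acc[i] = 0;

    for (int k = 0; k < taps; k++)              /* one replicated coefficient per step */
        for (int i = 0; i < FIR_LANES; i++)     /* all lanes run in parallel in hardware */
            acc[i] += (int32_t)coef[k] * x[i + k];
}
```

The `x[i + k]` access is the shifted window; on the vector machine that window comes from selecting across a pair of data vectors rather than from reloading memory for every tap.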


Where Does the Hardware Go?

[Pie chart: share of core hardware by function]
• VEC regfile: 21%
• Vector Multiply Ops: 20%
• Vector Load/Store Ops: 11%
• Vector Select/Shuffle Ops: 10%
• WVEC regfile: 9%
• Base Processor (2x512b, 96b): 6%
• Vector Shift Ops: 5%
• Vector Pack Ops: 3%
• ALIGN regfile: 3%
• Vector Move/Convert Ops: 3%
• AR Regfile: 2%
• Boolean Ops: 2%
• Vector Arithmetic Ops: 2%
• Vector Logical Ops: 1%
• Extended Base Ops: 1%
• VBOOL regfile: 1%

Observations:
• Register files are large: about 36% in total across VEC, WVEC, ALIGN, AR and VBOOL
• As expected, multipliers dominate execution unit size (20%)
• Select and shift are significant (10% and 5%)
• Base processor and extensions for 3-way base ISA execution are very small
  • Invites more aggressive control code resource allocation

Preliminary – subject to change


Relative Performance on Basic Metrics

[Bar chart: BBE64-128 performance relative to BBE16 (BBE16 = 1) for each performance metric; vertical axis from 0 to 9; series: BBE16 and BBE64-128]

Preliminary – subject to change


Vector-Length Independent Programming

BBE64 is one of a family of compatible machines:
• 4-way to 32-way SIMD
• Common code base for HW, verification, libraries, software tools

Allows common code across DSP versions:
• Much code is automatically vectorizable by the compiler to a version’s vector length
• Some code is most conveniently expressed in vector form
  • Compiler still does operation selection, software pipelining, scheduling, register allocation etc.

Simple case:

    xb_vecNx20 vin; xb_vecNx40 vout = 0;
    for (i = 0; i < N/BBE_SIMD; i++) {
        vout += vin[i] * vin[i];
    }
    *out_p = BBE_RADDN_2X40(vout);

Where explicit vector representation makes sense:
• Data marshaling by selecting operands within vectors
• Parallel operations with vector compression and expansion
• Explicit vector reductions
• Some special-purpose operations
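
The simple case above depends on Xtensa vector types and BBE_ intrinsics, so it cannot be compiled off-target. The sketch below is a portable emulation of the same pattern, offered only to show why the source does not need to change with the vector length: a lane-wise accumulate over chunks of `SIMD_WIDTH` elements followed by one reduction. `SIMD_WIDTH` is a hypothetical stand-in for `BBE_SIMD`, and the 64-bit lanes stand in for the 40b accumulator type.

```c
#include <assert.h>
#include <stdint.h>

#ifndef SIMD_WIDTH
#define SIMD_WIDTH 32   /* hypothetical stand-in for BBE_SIMD; 4 to 32 in the family */
#endif

/* Sum of squares of n 16b samples; n must be a multiple of SIMD_WIDTH. */
static int64_t sum_of_squares(const int16_t *in, int n)
{
    int64_t vout[SIMD_WIDTH] = {0};   /* one wide accumulator per lane */
    int64_t sum = 0;

    assert(n % SIMD_WIDTH == 0);

    for (int i = 0; i < n / SIMD_WIDTH; i++)          /* one vector per iteration */
        for (int lane = 0; lane < SIMD_WIDTH; lane++) {
            int32_t v = in[i * SIMD_WIDTH + lane];
            vout[lane] += (int64_t)v * v;             /* vout += vin[i] * vin[i] */
        }

    for (int lane = 0; lane < SIMD_WIDTH; lane++)     /* the final reduce-add */
        sum += vout[lane];
    return sum;
}
```

Rebuilding with a different `SIMD_WIDTH` changes the trip count and the number of lanes but not the loop body, which mirrors how one source can target different members of the family.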


Profiling BBE64 for LTE-Advanced 2x2 MIMO x 100MHz 1Gbps

[Stacked bar chart: BBE64 MHz (axis 0 to 700) for two scenarios, Heterogeneous SDR UE and All-in-One SDR UE, broken down by kernel. Legend entries in chart order: Upsample; Frequency rotation; ZC generation; System Control; iFFT; Precoding DFT; Mapper; Channel interleaver; Scrambler; PRBS; PRBS; Rate-matching Selection; Rate-matching Collection; Sub-Block Interleaver; Turbo Interleaver; CTC; CRC24 CB; Control; Control Channel CRC; FEC decoder (Viterbi); MIMO Detector; Rate dematch; Descrambler; PRBS (Gold-31 generator); Soft Demodulation (output 8-bit LLR); Resource de-mapper; MIMO Detector; Channel Estimation (Time); Channel Estimation (Freq); Channel Estimation (RS); FFT]

• User Equipment is power-obsessed, but flexibility is increasingly important: want to run 3G, WiFi on same hardware
• Many possible baseband PHY subsystem design approaches for LTE-Advanced. Two styles:
  • Heterogeneous SDR: offload transmit bit-level processing and simple DSP (e.g. FFT) to specialized engines – lower MHz, better power
  • All-In-One SDR: do as much as possible on one core – simpler programming, more flexibility

Streamlined compute scenario

Preliminary – subject to change


Simple Code Example: 4x4 Complex Matrix Mul

Inner loop: compiler-generated code with vectorization, software pipelining and op-merging:

    loopgtz a4,.LBB34_mm_auto_opt_4x4_stream_complex
    { bbe_lv32x16s.ip v0,a2,512
      nop
      bbe_mula32x18cpackq v5,v11,v0
      bbe_mula32x18cpackq v6,v15,v0 }
    { bbe_lv32x16s.i v0,a2,1536
      bbe_lv32x16s.i v3,a2,3584
      bbe_mul32x18cpackq v1,v8,v0
      bbe_mul32x18cpackq v2,v12,v0 }
    { bbe_lv32x16s.i v0,a2,5632
      bbe_lv32x16s.ip v4,a2,512
      bbe_mula32x18cpackq v1,v9,v0
      bbe_mula32x18cpackq v2,v13,v0 }
    { bbe_lv32x16s.i v3,a2,1536
      bbe_lv32x16s.i v7,a2,3584
      bbe_mula32x18cpackq v1,v10,v3
      bbe_mula32x18cpackq v2,v14,v3 }
    { bbe_sv32x16s.ip v5,a3,512
      bbe_lv32x16s.i v0,a2,5632
      bbe_mula32x18cpackq v1,v11,v0
      bbe_mula32x18cpackq v2,v15,v0 }
    { nop
      bbe_sv32x16s.i v6,a3,1536
      bbe_mul32x18cpackq v5,v8,v4
      bbe_mul32x18cpackq v6,v12,v4 }
    { bbe_sv32x16s.ip v1,a3,512
      nop
      bbe_mula32x18cpackq v5,v9,v3
      bbe_mula32x18cpackq v6,v13,v3 }
    { bbe_sv32x16s.i v2,a3,1536
      nop
      bbe_mula32x18cpackq v5,v10,v7
      bbe_mula32x18cpackq v6,v14,v7 }

Scalar C code (with DSP-extended scalar types – e.g. complex fractions):

    static xb_cq15 a1[4][4][NSAMPLES];
    static xb_cq15 b1[4][4][NSAMPLES];
    static xb_cq15 c1[4][4][NSAMPLES];

    void mm_auto_opt_4x4_stream_complex() {
        int i, j, h;
        for (i = 0; i < 4; i++) {
            /* The column loop is unrolled by 4: each pass computes columns j..j+3. */
            for (j = 0; j < 4; j += 4) {
                /* One independent 4x4 product per sample h. */
                for (h = 0; h < NSAMPLES; h++) {
                    c1[i][j][h]   = (xb_cq4_15)(a1[i][0][h] * b1[0][j][h])   + (xb_cq4_15)(a1[i][1][h] * b1[1][j][h]) +
                                    (xb_cq4_15)(a1[i][2][h] * b1[2][j][h])   + (xb_cq4_15)(a1[i][3][h] * b1[3][j][h]);
                    c1[i][j+1][h] = (xb_cq4_15)(a1[i][0][h] * b1[0][j+1][h]) + (xb_cq4_15)(a1[i][1][h] * b1[1][j+1][h]) +
                                    (xb_cq4_15)(a1[i][2][h] * b1[2][j+1][h]) + (xb_cq4_15)(a1[i][3][h] * b1[3][j+1][h]);
                    c1[i][j+2][h] = (xb_cq4_15)(a1[i][0][h] * b1[0][j+2][h]) + (xb_cq4_15)(a1[i][1][h] * b1[1][j+2][h]) +
                                    (xb_cq4_15)(a1[i][2][h] * b1[2][j+2][h]) + (xb_cq4_15)(a1[i][3][h] * b1[3][j+2][h]);
                    c1[i][j+3][h] = (xb_cq4_15)(a1[i][0][h] * b1[0][j+3][h]) + (xb_cq4_15)(a1[i][1][h] * b1[1][j+3][h]) +
                                    (xb_cq4_15)(a1[i][2][h] * b1[2][j+3][h]) + (xb_cq4_15)(a1[i][3][h] * b1[3][j+3][h]);
                }
            }
        }
    }


Simple Code Example: 2x2 Real Matrix Mul

16 packed matrices @ 8 multiplies every 3 cycles

Inner loop: compiler-generated code with vectorization, software pipelining and op-merging:

    loopgtz a4,.LBBx
    { bbe_lv32x16s.ip v0,a2,64
      bbe_lv32x16s.ip v1,a3,64
      nop
      nop }
    { bbe_sv32x16packq.ip wv0,wv1,a9,64
      bbe_lv32x16s.ip v4,a2,64
      bbe_selmm32x20r v5,v2,v5,v4,v4,0
      bbe_mul32x18pm wv0,wv1,v2,v3,2 }
    { bbe_sv32x16packq.ip wv2,wv3,a9,64
      bbe_lv32x16s.ip v5,a3,64
      bbe_selmm32x20r v2,v3,v1,v0,v0,0
      bbe_mul32x18pm wv2,wv3,v5,v2,2 }

Length-independent Vector C code:

    for (i = 0; i < n*4/BBE_SIMD; ++i) {
        BBE_SELMMNXQ4_15(select_vec1, select_vec2, coeff[i], data[i], data[i], BBE_SELIMMR_2X2_X_2X2_STEP_1);
        vout_temp = BBE_MULNXQ4_15PM(select_vec1, select_vec2, MATMUL_REAL_REPLICATION_INDEX_2);
        out_p[i] = vout = BBE_PACKNXQ9_30Q(vout_temp);
    }


BBE64 Development Timeline

[Gantt chart by quarter, 2009 to 2011, with four tracks: Xtensa LX4 base technology development; BBE64 ISA development; BBE16 DSP enhancements; BBE64 DSP development]


BBE64 Verification Convergence

1400 test suites @ 3000 tests per suite

• Average of 3K directed and random data sets per test suite
• Several test suites per instruction
• Tests and C reference developed independently of implementation
• Tools guarantee orthogonality of instruction implementations to reduce instruction interaction testing
• Many tests adapted from BBE16

[Chart: number of test suites (0 to 1500) versus calendar days (0 to 75), with two series: Passing Test Suites and Total Test Suites]
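
The structure of one such test suite can be sketched in a few lines: a C reference model, an independently written implementation, and a driver that feeds both the same directed corner cases and reproducible random data, then counts mismatches. Everything in the sketch is illustrative. A 16b saturating add stands in for an instruction under test, `sat_add_dut` stands in for the hardware or simulator result, and the generator and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* C reference model: widen, add, clamp. */
static int16_t sat_add_ref(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Stand-in for the implementation under test, written independently. */
static int16_t sat_add_dut(int16_t a, int16_t b)
{
    if (a > 0 && b > INT16_MAX - a) return INT16_MAX;
    if (a < 0 && b < INT16_MIN - a) return INT16_MIN;
    return (int16_t)(a + b);
}

/* Small deterministic generator so a failing data set can be reproduced. */
static int16_t next_random16(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return (int16_t)((int32_t)(*state >> 16) - 32768);
}

typedef int16_t (*binop16)(int16_t, int16_t);

/* Run directed corner cases plus n_random random pairs; return the mismatch count. */
static unsigned run_suite(binop16 ref, binop16 dut, unsigned n_random, uint32_t seed)
{
    static const int16_t corners[] = { INT16_MIN, -1, 0, 1, INT16_MAX };
    const unsigned n_corners = sizeof corners / sizeof corners[0];
    unsigned mismatches = 0;

    for (unsigned i = 0; i < n_corners; i++)
        for (unsigned j = 0; j < n_corners; j++)
            mismatches += ref(corners[i], corners[j]) != dut(corners[i], corners[j]);

    for (unsigned k = 0; k < n_random; k++) {
        int16_t a = next_random16(&seed);
        int16_t b = next_random16(&seed);
        mismatches += ref(a, b) != dut(a, b);
    }
    return mismatches;
}
```

At the scale quoted on the slide, 1400 suites of about 3000 data sets each come to roughly 4.2 million individual checks.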


Configurability and Extensibility: LX4 Block Diagram

[Block diagram, recoverable labels:
• Key: Base ISA Feature; Configurable Function; Optional Function; Optional & Configurable Function; Designer-Defined Features (TIE); External RTL & Peripherals
• Instruction side: Instruction Fetch / Decode; Inst. Memory Management, Protection & Error Recovery; Instruction RAM x2; Instruction ROM; Instruction Cache
• Execution: Base ISA Execution Pipeline; Base Register File; Base ALU; Optional Functional Units; VLIW (FLIX) Parallel Execution pipelines; Designer-Defined Functional Units; Register Files; Processor State
• Data side: Data Load/Store Unit; Designer-Defined Dual Load/Store Unit; Data Memory Management, Protection & Error Recovery; Data RAM x2; Data ROM; Data Cache; Prefetch; XLMI Local Memory Interface
• Processor Controls: Trace Port; JTAG Tap Control; On-Chip Debug; Data Address Watch Registers; Instruction Address Watch Registers; Exception Support; Exception Registers; Timers; Interrupt Control
• External Interface: Processor Interface Control; Write Buffer; PIF Bridge; Bus Bridge AHB-Lite/AXI; System Bus; RAM; DMA; Devices
• Designer-Defined Queues, Ports & Lookups: QIF32; GPIO32; connections to RTL, FIFO, Memory, Xtensa]


[Flow diagram: Xtensa Processor Generator]

Inputs to the Xtensa Processor Generator:
• Set/Choose Configuration options
• Designer-Defined Instructions (TIE)

Processor Generator Outputs:
• System Modeling / Design: Instruction Set Simulator (ISS); Fast Function Simulator (TurboXim); XTSC SystemC System Modeling; XTMP C-based System Modeling; Pin Level cosimulation; Xenergy Energy Estimator
• Software Tools: GNU Software Toolkit (Assembler, Linker, Debugger, Profiler); Xtensa C/C++ (XCC) Compiler; C Software Libraries; Xplorer IDE (Graphical User Interface to all tools); Operating Systems
• Hardware: EDA scripts; RTL; Synthesis; Block Place & Route; Verification; Chip Integration / Co-verification; To Fab / FPGA

Software Development and System Development loop:
• Application Source C/C++, Compile, Executable, Profile using ISS
• Then choose a different configuration, or develop new instructions

Tools for DSP Automation:
• Slot Assignment Exploration and Sizing
• Select Immediate Generators
• C Prototype Generators
• Test Automation


Wrap-up

• The BBE64 kit is in the hands of lead customers – expect silicon in 2012
• Processor automation tools reduced cost of DSP development, especially in enabling rapid exploration of HW cost vs. SW benefit for many options
• Achieving a balanced design still required:
  1. Broad algorithm analysis – BBE16 was a good foundation
  2. Dramatic upgrades in infrastructure (wider instruction and data, regfile optimization)
• Most consciously scalable processor design in Tensilica history – enables easy generation of HW, testing, software development tools, DSP software across a wide range of configurations and vector lengths
• Pushes the envelope on VLSI design – significant impact and accommodation of:
  • High instruction set complexity
  • High fan-out in large data-path
• Proving ground for exploitation of 28nm
