The World’s Fastest DSP Core: Breaking the 100 GMAC/s Barrier
Hot Chips 23 - August 2011
Chris Rowen, Dan Nicolaescu, Rajiv Ravindran, David Heine, Grant Martin, James Kim, Dror Maydan, Nupur Andrews, Bill Huffman, Vakis Papaparaskeva, Shay Gal-On, Peter Nuth, Pushkar Patwardhan, and Manish Paradkar
Tensilica Inc.
Copyright © 2011 Tensilica, Inc.
Outline
• Introduction: 4G Wireless Processing for LTE and LTE-Advanced
• BBE64 Fundamental Goals
• Instruction Set Design:
  • SIMD/VLIW
  • Programmer State and Register Organization
  • Operations
  • Example: Select and FIR operations
  • Execution Rate Summary
• BBE64: Where Does the Hardware Go?
• BBE64 Software
  • SIMD-Width Independent Programming Model
  • LTE-Advanced Execution Profile
  • Kernel Examples
• Implementation and Verification Timelines
• Tools for Processor Automation
4G Rollout
Evolution of Major Cellular Standards
[Timeline chart, 2006 to 2012, one track per standard family]
WiMAX Evolution
• Fixed WiMAX
• Mobile WiMAX Wave 1: DL 23 Mbps, UL 4 Mbps (10 MHz, 3:1 TDD)
• Mobile WiMAX Wave 2: DL 46 Mbps, UL 4 Mbps (10 MHz, 3:1 TDD)
CDMA2000 Evolution
• EVDO Rev 0: DL 2.4 Mbps, UL 153 Kbps (in 1.25 MHz)
• EVDO Rev 1: DL 3.1 Mbps, UL 1.8 Mbps (in 1.25 MHz)
• EVDO Rev 2: DL 14.7 Mbps, UL 4.9 Mbps (in 5 MHz)
• EVDO Rev 3: DL 100 Mbps, UL 50 Mbps (in 20 MHz)
3GPP GSM EDGE Evolution
• EDGE: DL 474 Kbps, UL 474 Kbps
• Enhanced EDGE: DL 1.3 Mbps, UL 653 Kbps
• Evolved EDGE: DL 1.89 Mbps, UL 947 Kbps
3GPP UMTS Evolution
• HSDPA: DL 14.4 Mbps, UL 384 Kbps (in 5 MHz)
• HSUPA/HSDPA: DL 14.4 Mbps, UL 5.76 Mbps (in 5 MHz)
• HSPA Evolution: DL 42 Mbps, UL 11.5 Mbps (in 5 MHz)
3GPP Long Term Evolution
• LTE (Rel 8): DL 150 Mbps, UL 50 Mbps (in 20 MHz)
• LTE-Advanced: DL 1 Gbps, UL 500 Mbps (5-10x LTE)
Source: Rysavy Research, “Mobile Broadband: EDGE, HSPA & LTE”, 2006
Product Context: Specialized and General Baseband DSP Cores
[Scatter chart. Y-axis: Performance (GMACs), log scale, 1 to 100. X-axis: Core Size (mm2) without memory, log scale, 0.01 to 1.0. 28nm standard cell process.]
Cores plotted:
• ConnX BBE64-128: 100 GMACs/sec (infrastructure)
• ConnX BBE64 (handset)
• ConnX BBE16
• ConnX Vectra LX DSP
• ConnX D2 Dual-MAC DSP
• ConnX Turbo16MS
• ConnX BSP3
• ConnX SSP16
• Xtensa µDSPs
• User-specified DSPs
BBE64 Family Design Goals and Philosophy
• World-leading DSP performance for baseband PHY in handsets (aka “User Equipment” or “UE”) and infrastructure (aka base stations or “eNodeB”)
• Up to 1GHz in available 28nm fast standard cell process x 128 MAC
• Combine SIMD, VLIW and configurable instruction set features for a large application “sweet spot”
• Leverage high memory system bandwidth of Xtensa LX4: 1024b per cycle
• Good control code performance with multi-issue base ops, including multiple load/store per cycle
• Offer both a broad range of built-in options and user-defined extensions to fit; two initial base configurations: BBE64-UE, BBE64-128
• Leverage advanced vectorizing compilers, C scalar/vector data types, operator overloading and optimized intrinsics to eliminate the need for assembly coding
• ConnX BBE16 upward compatibility
• Fully synthesizable RTL, with complete system modeling, verification and back-end flow environment
• Core as building block for leading-edge UE and infrastructure SoC platforms
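The headline rate in the title can be checked from two figures on this slide: 128 MACs per cycle at up to 1 GHz, and 1024b of memory bandwidth per cycle. The sketch below is only that arithmetic; the function names are hypothetical and the results are peak upper bounds, not measured throughput.

```c
#include <assert.h>

/* Peak MAC rate in GMAC/s: MACs issued per cycle times clock in GHz. */
double peak_gmacs(unsigned macs_per_cycle, double clock_ghz) {
    assert(clock_ghz > 0.0);
    return macs_per_cycle * clock_ghz;
}

/* Peak local-memory bandwidth in GB/s: bits per cycle / 8, times clock in GHz. */
double peak_mem_gbytes(unsigned bits_per_cycle, double clock_ghz) {
    assert(clock_ghz > 0.0);
    return (bits_per_cycle / 8.0) * clock_ghz;
}
```

With the slide's numbers, `peak_gmacs(128, 1.0)` gives 128 GMAC/s, which is where the "100 GMAC/s barrier" claim comes from, and `peak_mem_gbytes(1024, 1.0)` gives 128 GB/s.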
ConnX BBE64 Block Diagram
Optimized Architecture for DSP Applications
• 4-way VLIW x 32-way SIMD: 128 DSP ops/cycle
• 16/24/96b 4-issue VLIW: almost any instruction in any slot
• 128 MAC ops/cycle for matrix and filter functions (BBE64-128)
• Guard bits on all DSP data for numerical accuracy
• Protected pipeline: interlocks/bypasses for robustness
• Support for all data types from C
  • Complex/real
  • Scalar/vector
  • Fractional/integer
High Bandwidth Configurable Memory Subsystem Interface
• Dual load/stores with dual 512b memory interfaces
• Full bandwidth on packed and unaligned data vectors
• DMA support for local data memory
• Extensible with special memory ports and direct-connect data queues: 4 x 640b per cycle
[Block diagram labels:]
• Instruction Memory Interface: Local Memory or Cache (1-4 ways), 96 bits wide
• 4 Way VLIW Instruction Decoder
• Issue slot labels: Load Store; Load Store ALU; MAC ALU; MAC ALU
• General Register File (x 32 bits) with 32b ALUs, 32-bit datapaths
• Data Load/Store Unit 0 and Data Load/Store Unit 1 (16/32/64/128/512 bit to 640 bit), each with Align/Pack
• Vector Register File: 16 x 640 bits (32 x 20 real, 16 x 20 complex, 64 x 10 real, 32 x 10 complex)
• 8 x 640 bits (16 x 40 real, 8 x 40 complex)
• 8 x 64 bits (64 x 1 Boolean)
• Computation Unit: 32 Way SIMD ALUs and 64 Way MACs, 640-bit datapaths
• Data Memory Interface: two 512-bit-wide ports to Local Data RAM Banks
BBE64 Pipeline
[Pipeline diagram. Stage labels: I AdrGen, I Data/tag, I Align, Decode/Reg Read, Exec/AdrGen, L1 data/tag, L1 Align, WB; DSP Reg Read, DSP Ex1, DSP Ex2, DSP WB]
• Two pipeline options:
  • 9 stage pipeline: higher MHz or larger memories
  • 7 stage pipeline: lower power and area
• Wide static in-order issue
• No branch prediction, but zero-overhead loops and SIMD predication
• Simple length encoding enables single-stage instruction decode and register specifier extraction
• DSP operations start with load return: zero load-use bubbles
• Simplified ALU/MAC operations allow DSP pipe reduction to two stages + write back for reduced regfile cost: fewer values in flight, better utilization of slots
Typical Data Parallelism in SIMD/VLIW Ops
VLIW Instruction typically encodes 1-2 load/store, 2-3 ALU and MAC operations
ALU or MAC operation typically works on 16 40b or 32 20b operands
Each 80b may represent 4 20b real, 2 20b+20b complex, 2 40b real, or 1 40b+40b complex
Two 18b x 18b multipliers per 20b element – paired real multiply or half a complex multiply
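The "half a complex multiply" remark follows from the usual four-multiply form of a complex product: each output part (real or imaginary) needs two real multiplies, which matches two multipliers per 20b element. A minimal portable sketch, assuming plain C integers in place of the 18b/20b/40b hardware formats; all names here are hypothetical and no rounding, packing or saturation is modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accumulator pair; int64_t stands in for a guard-bit accumulator. */
typedef struct { int64_t re, im; } cplx_acc_t;

/* Real-part lane: two multiplies, i.e. half of a complex multiply. */
int64_t cmul_re(int32_t a_re, int32_t a_im, int32_t b_re, int32_t b_im) {
    return (int64_t)a_re * b_re - (int64_t)a_im * b_im;
}

/* Imaginary-part lane: the other two multiplies. */
int64_t cmul_im(int32_t a_re, int32_t a_im, int32_t b_re, int32_t b_im) {
    return (int64_t)a_re * b_im + (int64_t)a_im * b_re;
}

/* Full complex product: four real multiplies spread over two lanes. */
cplx_acc_t cmul(int32_t a_re, int32_t a_im, int32_t b_re, int32_t b_im) {
    cplx_acc_t r;
    r.re = cmul_re(a_re, a_im, b_re, b_im);
    r.im = cmul_im(a_re, a_im, b_re, b_im);
    return r;
}

/* Number of elements of a given width that fit in one 640b vector register. */
unsigned lanes_640(unsigned element_bits) {
    assert(element_bits > 0 && element_bits <= 640);
    return 640u / element_bits;
}
```

`lanes_640(20)` and `lanes_640(40)` reproduce the "32 20b" and "16 40b" operand counts quoted above.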
Baseline DSP instruction set summary
550 distinct DSP ops above base Xtensa ISA
Load:
• LV<m>[SU][TF].{I,IP,IC,X,XP} : load vector of 16b elements
• LP<m>[SU][TF].{I,IP,IC,X,XP} : load complex pair of 16b elements
• LS<m>[SU][TF].{I,IP,IC,X,XP} : load scalar 16b element
• LA<m>[SU][TF].{I,IP,IC,X,XP} : load unaligned vector of 16b elements
• Plus specialty loads for guard bits, Boolean vectors, compressed data
Store:
• SV<m>[SU][TF].{I,IP,IC,X,XP} : store vector of 16b elements
• SP<m>[SU][TF].{I,IP,IC,X,XP} : store complex pair of 16b elements
• SS<m>[SU][TF].{I,IP,IC,X,XP} : store scalar 16b element
• SA<m>[SU][TF].{I,IP,IC,X,XP} : store unaligned vector of 16b elements
• SV<m>PACK[QSR].{I,IP,IC,X,XP} : pack 40b elements to 16b and store vector
• SP<m>PACK[QSR].{I,IP,IC,X,XP} : pack 40b elements to 16b and store pair
• SS<m>PACK[QSR].{I,IP,IC,X,XP} : pack 40b elements to 16b and store scalar
• Plus specialty stores for guard bits, Boolean vectors, expanded data, bit-reversed addressing
<m> = 16X32, 32X16
All instructions prefixed with “BBE_”
{I,IP,IC,X,XP} addressing modes:
• .I: base register + immed offset
• .IP: base register + immed offset, post-update
• .X: base register + register index
• .XP: base register + register index, post-update
• .IC: base register + increment, post-update in circular buffer
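The five addressing-mode suffixes can be modeled as small address calculations. This is a sketch under stated assumptions, not the ISA definition: it assumes that "post-update" means the access uses the old base and the base is advanced afterwards, that offsets are non-negative byte counts, and that the circular buffer is described by two bounds (`cbegin`, `cend`), which are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t base;    /* address register contents (byte address) */
    size_t cbegin;  /* hypothetical circular-buffer start (inclusive) */
    size_t cend;    /* hypothetical circular-buffer end (exclusive) */
} addr_state_t;

/* .I : access at base + immediate; base unchanged. */
size_t ea_i(const addr_state_t *s, size_t imm) { return s->base + imm; }

/* .IP: access at base, then base += immediate (assumed post-update order). */
size_t ea_ip(addr_state_t *s, size_t imm) {
    size_t ea = s->base;
    s->base += imm;
    return ea;
}

/* .X : access at base + index register; base unchanged. */
size_t ea_x(const addr_state_t *s, size_t index) { return s->base + index; }

/* .XP: access at base, then base += index register. */
size_t ea_xp(addr_state_t *s, size_t index) {
    size_t ea = s->base;
    s->base += index;
    return ea;
}

/* .IC: access at base, then advance by inc and wrap inside [cbegin, cend). */
size_t ea_ic(addr_state_t *s, size_t inc) {
    assert(s->cbegin < s->cend && inc <= s->cend - s->cbegin);
    size_t ea = s->base;
    size_t next = s->base + inc;
    if (next >= s->cend) {
        next = s->cbegin + (next - s->cend);  /* wrap past the end */
    }
    s->base = next;
    return ea;
}
```

The post-update forms are what let a streaming loop walk through memory without separate pointer-increment instructions, as the `.ip` loads and stores in the matrix-multiply listings later in the deck do.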
Baseline DSP instruction set highlights
ALU:
• ABS<s> - absolute value
• [R]ADD<s>[TF] - add
• AND<s> - bitwise and
• CONJ<s> - conjugate complex
• EQ<s>[C] - set Boolean on equal
• LT<s> - set Boolean on less than
• LE<s> - set Boolean on less than or equal
• [R]MAX[U]<s>[TF] - maximum
• [R]MIN[U]<s>[TF] - minimum
• MOV<s>[TF][C] - move conditional
• NAND<s>[TF] - bitwise not and
• NEG<s>[TF] - negate
• NEQ<s>[C] - set Boolean on not equal
• [R]NSA[U]<s>[C] - normalize shift amt
• OR<s> - bitwise or
• SAT<s>[SU] - saturate to memory size
• SL[ALS]<s>{B,BR} - shift left
• SR[ALS]<s>{B,BR,I} - shift right
• SUB<s>[TF] - subtract
• XOR<s> - bitwise xor
Multiply:
• MAG32X18C{PACKQ,PACKS} - complex magnitude
• MUL32X18{,C,J,JC}{,PACKQ,PACKS} - multiply
• MULA32X18{,C,J}{,PACKQ,PACKS} - multiply-add
• MULSIGN<s>
• POLY_STEP32X20.STEP - polynomial series expansion with parallel table lookup
Data Organization:
• PACK[LQRS]32X40 - pack 40b to 20b
• UNPACK32X20 - unpack from 20b to 40b
• SEL<s>[I] - select elements from vector pair
• SHFL<s>[I] - select elements from vector
• Plus similar operations for Boolean vectors
<s> = 16X40, 32X20
All operations prefixed with “BBE_”
Programmers Model: Registers
• Large vector register file supports deep software pipelining and reduced memory traffic
• Partitioned register file for added bandwidth with less register bloat (narrow/wide operands)
• Unalignment registers enable full bandwidth streaming of arbitrary alignment and size
• Vector Booleans for flexible SIMD and VLIW predication
• Aggregate regfile bandwidth at 1GHz: >12Tbps
Base Xtensa ISA (user):
• General Registers (AR): 16x32 (or windowed to 64x32)
• LOOP: 3x32
• SA
BBE64 ISA Extension:
• Vectors (VR): 16x640
• Wide Vectors (WR): 8x640
• Unalign (UR): 4x512
• Boolean Vector (VBR): 8x32
Register File Organization
[Diagram: clocked VEC RF and WVEC RF register file structures]
Register file partitioning reduces abstracted bypass structure
Data Reorganization Key to SIMD: Selection
Example: operation BBE_SEL32X20 {out vec c, in vec h, in vec l, in vec s}
[Diagram: each output element is picked by a 64:1 mux from the elements of Vector h and Vector l; the 32 mux select fields come from Vector s]
Options:
• Select Immediate (45 patterns)
• Shuffle (Single Input Vector)
• Shuffle Immediate (75 patterns)
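The select operation can be read as 32 independent 64:1 multiplexers over the concatenated vector pair. Below is a minimal scalar model in portable C. It assumes that select values 0..31 address Vector l and 32..63 address Vector h; the slide does not state the ordering, so that mapping and the name `sel32` are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 32  /* 32 x 20b elements per 640b vector */

/* c[i] = element s[i] of the 64-element pair {h : l}.
 * Assumed ordering: indices 0..31 pick from l, 32..63 pick from h.
 * c must not alias h or l in this scalar model. */
void sel32(int32_t c[LANES], const int32_t h[LANES],
           const int32_t l[LANES], const uint8_t s[LANES]) {
    assert(c != h && c != l);
    for (int i = 0; i < LANES; i++) {
        unsigned idx = s[i] & 63u;  /* 6-bit select field drives a 64:1 mux */
        c[i] = (idx < LANES) ? l[idx] : h[idx - LANES];
    }
}
```

With `s[i] = i + 1` the result is the pair shifted by one element, which is the kind of sliding window a FIR filter needs across a vector boundary.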
Accelerated Operation Examples: Real FIR
[Dataflow diagram, top to bottom:]
• Data selected from vector pair
• 2-4 banks of 32 multipliers
• Coefficients replicated in companion operation to reduce fan-out
• Adders feeding 2 x 16 40b accumulator vectors
• Write back shifted data
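The dataflow above corresponds to a FIR computed 32 outputs at a time: for each tap, a 32-element window is taken from a pair of input vectors, multiplied by one replicated coefficient, and added into 32 accumulators. The sketch below is a scalar reference in portable C under those assumptions; `fir32` is a hypothetical name, `int64_t` stands in for the 40b accumulators, and no saturation or rounding is modeled.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 32

/* acc[n] = sum over k of coef[k] * x[n + k], for n = 0..31.
 * x holds a vector pair (64 samples), so ntaps must be <= 33
 * for every window to stay inside the pair. */
void fir32(int64_t acc[LANES], const int16_t x[2 * LANES],
           const int16_t *coef, int ntaps) {
    assert(ntaps >= 0 && ntaps <= LANES + 1);
    for (int n = 0; n < LANES; n++) {
        acc[n] = 0;
    }
    for (int k = 0; k < ntaps; k++) {        /* one tap per step */
        for (int n = 0; n < LANES; n++) {    /* 32 lanes in parallel on hardware */
            /* window shifted by k, coefficient k replicated to all lanes */
            acc[n] += (int64_t)coef[k] * x[n + k];
        }
    }
}
```

The inner loop is what one bank of 32 multipliers would do in a single step; with 2-4 banks, several taps would be processed per cycle.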
Where Does the Hardware Go?
[Pie chart of the hardware breakdown:]
• VEC regfile: 21%
• Vector Multiply Ops: 20%
• Vector Load/Store Ops: 11%
• Vector Select/Shuffle Ops: 10%
• WVEC regfile: 9%
• Base Processor (2x512b, 96b): 6%
• Vector Shift Ops: 5%
• Vector Pack Ops: 3%
• ALIGN regfile: 3%
• Vector Move/Convert Ops: 3%
• AR Regfile: 2%
• Boolean Ops: 2%
• Vector Arithmetic Ops: 2%
• Vector Logical Ops: 1%
• Extended Base Ops: 1%
• VBOOL regfile: 1%
• Register files are large: the five regfile slices (VEC, WVEC, ALIGN, AR, VBOOL) add up to 36% of the breakdown
• As expected, multipliers dominate execution unit size
• Select and shift are significant
• Base processor and extensions for 3-way base ISA execution are very small
o Invites more aggressive control code resource allocation
Preliminary – subject to change
Relative Performance on Basic Metrics
[Bar chart: BBE64 performance relative to BBE16 (BBE16 = 1), y-axis 0 to 9, x-axis: Performance Metric; series: BBE16, BBE64-128]
Preliminary – subject to change
Vector-Length Independent Programming
BBE64 is one of a family of compatible machines:
• 4-way to 32-way SIMD
• Common code base for HW, verification, libraries, software tools
Allows common code across DSP versions
Much code is automatically vectorizable by compiler to a version’s vector length
Some code is most conveniently expressed in vector form
• Compiler still does operation selection, software pipelining, scheduling, register allocation etc.
Simple case:
  xb_vecNx20 vin; xb_vecNx40 vout = 0;
  for (i = 0; i < N/BBE_SIMD; i++) {
    vout += vin[i] * vin[i];
  }
  *out_p = BBE_RADDN_2X40(vout);
Where explicit vector representation makes sense:
• Data marshaling by selecting operands within vectors
• Parallel operations with vector compression and expansion
• Explicit vector reductions
• Some special-purpose operations
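The "simple case" above is a sum of squares: per-lane multiply-accumulate over N/BBE_SIMD vectors, then one reduction add across lanes. The portable C sketch below emulates that structure with the SIMD width as a run-time parameter, to show why the same source gives the same result at any width. `sum_squares` and its parameters are hypothetical, and `int64_t` stands in for the 40b accumulator type.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SIMD 32  /* widest family member: 32-way SIMD */

/* Sum of squares of n samples, organized as `simd` parallel lanes.
 * n must be a multiple of simd, mirroring the N/BBE_SIMD loop bound. */
int64_t sum_squares(const int16_t *in, int n, int simd) {
    assert(simd > 0 && simd <= MAX_SIMD && n % simd == 0);
    int64_t vout[MAX_SIMD] = {0};            /* one accumulator per lane */
    for (int i = 0; i < n / simd; i++) {     /* one vector per iteration */
        for (int lane = 0; lane < simd; lane++) {
            int16_t v = in[i * simd + lane];
            vout[lane] += (int64_t)v * v;    /* vout += vin[i] * vin[i] */
        }
    }
    int64_t sum = 0;                         /* reduction add across lanes */
    for (int lane = 0; lane < simd; lane++) {
        sum += vout[lane];
    }
    return sum;
}
```

Calling it with `simd = 4` and `simd = 32` on the same data returns the same value; only the number of loop iterations changes, which is the point of writing the loop bound in terms of the SIMD width.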
Profiling BBE64 for LTE-Advanced 2x2 MIMO x 100MHz 1Gbps
[Stacked bar chart: BBE64 MHz (0 to 700) for two scenarios, Heterogeneous SDR UE and All in One SDR UE. Stack components as listed in the legend:]
• Upsample
• Frequency rotation
• ZC generation
• System Control
• iFFT
• Precoding DFT
• Mapper
• Channel interleaver
• Scrambler
• PRBS
• PRBS
• Rate-matching.Selection
• Rate-matching.Collection
• Sub-Block Interleaver
• Turbo Interleaver
• CTC
• CRC24 CB
• Control
• Control Channel CRC
• FEC decoder (Viterbi)
• MIMO Detector
• Rate dematch
• Descrambler
• PRBS (Gold-31 generator)
• Soft Demodulation - output 8-bit LLR
• Resource de-mapper
• MIMO Detector
• Channel Estimation - Time
• Channel Estimation - Freq
• Channel Estimation - RS
• FFT
• User Equipment is power-obsessed, but flexibility is increasingly important: want to run 3G, WiFi on same hardware
• Many possible baseband PHY subsystem design approaches for LTE-Advanced. Two styles:
  • Heterogeneous SDR: offload transmit bit-level processing and simple DSP (e.g. FFT) to specialized engines; lower MHz, better power
  • All-In-One SDR: do as much as possible on one core; simpler programming, more flexibility
Streamlined compute scenario
Preliminary – subject to change
Simple Code Example: 4x4 Complex Matrix Mul
Inner loop of compiler-generated code, with vectorization, software pipelining and op-merging:
loopgtz a4,.LBB34_mm_auto_opt_4x4_stream_complex
{bbe_lv32x16s.ip v0,a2,512    nop    bbe_mula32x18cpackq v5,v11,v0    bbe_mula32x18cpackq v6,v15,v0}
{bbe_lv32x16s.i v0,a2,1536    bbe_lv32x16s.i v3,a2,3584    bbe_mul32x18cpackq v1,v8,v0    bbe_mul32x18cpackq v2,v12,v0}
{bbe_lv32x16s.i v0,a2,5632    bbe_lv32x16s.ip v4,a2,512    bbe_mula32x18cpackq v1,v9,v0    bbe_mula32x18cpackq v2,v13,v0}
{bbe_lv32x16s.i v3,a2,1536    bbe_lv32x16s.i v7,a2,3584    bbe_mula32x18cpackq v1,v10,v3    bbe_mula32x18cpackq v2,v14,v3}
{bbe_sv32x16s.ip v5,a3,512    bbe_lv32x16s.i v0,a2,5632    bbe_mula32x18cpackq v1,v11,v0    bbe_mula32x18cpackq v2,v15,v0}
{nop    bbe_sv32x16s.i v6,a3,1536    bbe_mul32x18cpackq v5,v8,v4    bbe_mul32x18cpackq v6,v12,v4}
{bbe_sv32x16s.ip v1,a3,512    nop    bbe_mula32x18cpackq v5,v9,v3    bbe_mula32x18cpackq v6,v13,v3}
{bbe_sv32x16s.i v2,a3,1536    nop    bbe_mula32x18cpackq v5,v10,v7    bbe_mula32x18cpackq v6,v14,v7}
Scalar C code (with DSP-extended scalar types, e.g. complex fractions):
static xb_cq15 a1[4][4][NSAMPLES];
static xb_cq15 b1[4][4][NSAMPLES];
static xb_cq15 c1[4][4][NSAMPLES];
void mm_auto_opt_4x4_stream_complex() {
  int i, j, h;
  for (i = 0; i < 4; i++) {
    /* j loop is unrolled by 4: each pass computes columns j .. j+3 of row i */
    for (j = 0; j < 4; j += 4) {
      for (h = 0; h < NSAMPLES; h++) {
        c1[i][j][h] = (xb_cq4_15)(a1[i][0][h] * b1[0][j][h]) + (xb_cq4_15)(a1[i][1][h] * b1[1][j][h]) +
                      (xb_cq4_15)(a1[i][2][h] * b1[2][j][h]) + (xb_cq4_15)(a1[i][3][h] * b1[3][j][h]);
        c1[i][j+1][h] = (xb_cq4_15)(a1[i][0][h] * b1[0][j+1][h]) + (xb_cq4_15)(a1[i][1][h] * b1[1][j+1][h]) +
                        (xb_cq4_15)(a1[i][2][h] * b1[2][j+1][h]) + (xb_cq4_15)(a1[i][3][h] * b1[3][j+1][h]);
        c1[i][j+2][h] = (xb_cq4_15)(a1[i][0][h] * b1[0][j+2][h]) + (xb_cq4_15)(a1[i][1][h] * b1[1][j+2][h]) +
                        (xb_cq4_15)(a1[i][2][h] * b1[2][j+2][h]) + (xb_cq4_15)(a1[i][3][h] * b1[3][j+2][h]);
        c1[i][j+3][h] = (xb_cq4_15)(a1[i][0][h] * b1[0][j+3][h]) + (xb_cq4_15)(a1[i][1][h] * b1[1][j+3][h]) +
                        (xb_cq4_15)(a1[i][2][h] * b1[2][j+3][h]) + (xb_cq4_15)(a1[i][3][h] * b1[3][j+3][h]);
      }
    }
  }
}
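For checking a kernel like this on a host machine, a plain reference in standard C can be useful. The sketch below computes the same per-sample 4x4 complex product, c[i][j][h] = sum over k of a[i][k][h] * b[k][j][h], using `double complex` from `<complex.h>` in place of the fixed-point `xb_cq15` types. It is an assumed host-side model, not Tensilica code; `mm4x4_stream` and the `NSAMPLES` value are hypothetical, and fixed-point rounding and saturation are not reproduced.

```c
#include <assert.h>
#include <complex.h>

#define NSAMPLES 8  /* hypothetical stream length for the sketch */

/* Per-sample 4x4 complex matrix product over a stream of NSAMPLES matrices. */
void mm4x4_stream(double complex c[4][4][NSAMPLES],
                  double complex a[4][4][NSAMPLES],
                  double complex b[4][4][NSAMPLES]) {
    assert(c != a && c != b);  /* output must not overwrite the inputs */
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            for (int h = 0; h < NSAMPLES; h++) {
                double complex acc = 0.0;
                for (int k = 0; k < 4; k++) {
                    acc += a[i][k][h] * b[k][j][h];  /* row i of a times column j of b */
                }
                c[i][j][h] = acc;
            }
        }
    }
}
```

The sample index h is the innermost data dimension, so consecutive samples of the same matrix entry are contiguous in memory; that layout is what lets a vectorizing compiler map the h loop onto SIMD lanes.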
Simple Code Example: 2x2 Real Matrix Mul
16 packed matrices @ 8 multiplies every 3 cycles
Inner loop of compiler-generated code, with vectorization, software pipelining and op-merging:
loopgtz a4,.LBBx
{bbe_lv32x16s.ip v0,a2,64    bbe_lv32x16s.ip v1,a3,64    nop    nop}
{bbe_sv32x16packq.ip wv0,wv1,a9,64    bbe_lv32x16s.ip v4,a2,64    bbe_selmm32x20r v5,v2,v5,v4,v4,0    bbe_mul32x18pm wv0,wv1,v2,v3,2}
{bbe_sv32x16packq.ip wv2,wv3,a9,64    bbe_lv32x16s.ip v5,a3,64    bbe_selmm32x20r v2,v3,v1,v0,v0,0    bbe_mul32x18pm wv2,wv3,v5,v2,2}
Length-independent Vector C code:
for (i = 0; i < n*4/BBE_SIMD; ++i) {
  BBE_SELMMNXQ4_15(select_vec1, select_vec2, coeff[i], data[i], data[i], BBE_SELIMMR_2X2_X_2X2_STEP_1);
  vout_temp = BBE_MULNXQ4_15PM(select_vec1, select_vec2, MATMUL_REAL_REPLICATION_INDEX_2);
  out_p[i] = vout = BBE_PACKNXQ9_30Q(vout_temp);
}
BBE64 Development Timeline
[Gantt chart on a quarterly axis spanning 2009 to 2011, with four activities:]
• Xtensa LX4 base technology development
• BBE64 ISA development
• BBE16 DSP enhancements
• BBE64 DSP development
BBE64 Verification Convergence
1400 test suites @ 3000 tests per suite
• Average of 3K directed and random data sets per test suite
• Several test suites per instruction
• Tests and C reference developed independently of implementation
• Tools guarantee orthogonality of instruction implementations to reduce instruction interaction testing
• Many tests adapted from BBE16
[Line chart: Number of Test Suites (0 to 1500) vs. Calendar Days (0 to 75); series: Passing Test Suites, Total Test Suites]
Configurability and Extensibility
LX4 Block Diagram
[Block diagram of the Xtensa LX4 processor. Blocks shown:]
• Instruction Fetch / Decode
• Base ISA Execution Pipeline: Base Register File, Base ALU, Optional Functional Units
• VLIW (FLIX) Parallel Execution pipelines: Designer-Defined Functional Units, Register Files, Processor State
• Data Load/Store Unit; Designer-Defined Dual Load/Store Unit
• Inst. Memory Management, Protection & Error Recovery: Instruction RAM x2, Instruction ROM, Instruction Cache
• Data Memory Management, Protection & Error Recovery: Data RAM x2, Data ROM, Data Cache, Prefetch
• XLMI Local Memory Interface
• External Interface: Processor Interface Control, Write Buffer, PIF, Bridge
• Bus Bridge (AHB-Lite/AXI) to System Bus: RAM, DMA, Devices
• Designer-Defined Queues, Ports & Lookups: QIF32, GPIO32, connecting to RTL, FIFO, Memory, Xtensa
• Processor Controls: Trace Port, JTAG Tap Control, On-Chip Debug, Data Address Watch Registers, Instruction Address Watch Registers, Timers, Interrupt Control, Exception Support, Exception Registers
KEY: Base ISA Feature; Configurable Function; Optional Function; Optional & Configurable Function; Designer-Defined Features (TIE); External RTL & Peripherals
[Flow diagram: the Xtensa Processor Generator takes "Set/Choose Configuration options" and "Designer-Defined Instructions (TIE)" and produces the outputs below]
Processor Generator Outputs
System Modeling / Design
• Instruction Set Simulator (ISS)
• Fast Function Simulator (TurboXim)
• XTSC SystemC System Modeling
• XTMP C-based System Modeling
• Pin Level cosimulation
• Xenergy Energy Estimator
Software Tools
• GNU Software Toolkit (Assembler, Linker, Debugger, Profiler)
• Xtensa C/C++ (XCC) Compiler
• C Software Libraries
• Xplorer IDE: Graphical User Interface to all tools
• Operating Systems
Hardware
• EDA scripts
• RTL
• Synthesis
• Block Place & Route
• Verification
• Chip Integration / Co-verification
• To Fab / FPGA
Software Development
• Application Source C/C++, Compile, Executable, Profile using ISS
• Then: Choose different configuration - or - Develop new instructions
System Development
Tools for DSP Automation
• Slot Assignment Exploration and Sizing
• Select Immediate Generators
• C Prototype Generators
• Test Automation
Wrap-up
• The BBE64 kit is in the hands of lead customers; expect silicon in 2012
• Processor automation tools reduced cost of DSP development, especially in enabling rapid exploration of HW cost vs. SW benefit for many options
• Achieving a balanced design still required:
  1. Broad algorithm analysis - BBE16 was a good foundation
  2. Dramatic upgrades in infrastructure (wider instruction and data, regfile optimization)
• Most consciously scalable processor design in Tensilica history: enables easy generation of HW, testing, software development tools, DSP software across a wide range of configurations and vector lengths
• Pushes the envelope on VLSI design: significant impact and accommodation of
  • High instruction set complexity
  • High fan-out in large data-path
  • Proving ground for exploitation of 28nm