Software Configurable Processor - Stanford University

© 2005 Stretch Inc. All rights reserved.

A Software Configurable Processor Architecture

Ricardo E. Gonzalez

© 2005 Stretch Inc. All rights reserved. 2

Embedded System Design Dilemma

COMPUTEPERFORMANCE

SYSTEM TIME-TO-MARKET

GPP

FPGA

ASICASSP

DSPMEDIA

SECURITYETC.

SW HW

?


Solving Compute Intensive Problems

Processor with embedded programmable logic

– Lots of compute resources

Utilize fabric by extending the ISA– Design your own FU– Add processor state

Simple programming model– Issue instructions to perform

computation

Tailored for high-throughput applications

– Compute intensive


Compute Intensive Markets & Applications

• Sonar/Radar• Biometrics

• CAT Scan• Ultrasound

• Printers• Scanners

• Encryption/Decryption• Network Security

• Broadcast Equipment• Audio Studio


877868

629379

275181

164122

6945

4037.7

28.319.5

15.313.61312.711.411.210.210.29.58.78.37.97.16.86.86.35.85.15.14.84.74.63.43.332.9

0 100 200 300 400 500 600 700 800 900 1000

Stretch S5000 - 300 MHz - C OPTIntrins ity Fas tMATH - 2 GHz - ASM OPT

TI C6416 - 720 MHz - ASM OPTTI C6416 - 720 MHz - C OPT

BOPS Manta v2.0 - - MHz - ASM OPTBOPS Manta v2.0 - - MHz - C OPT

Motorola MPC7447 - 1.3GHz - Altivec OPTMotorola MPC7455 - 1GHz - Altivec OPT

TI C6203 - 300 MHz - ASM OPTTI C6203 - 300 MHz - C OPT

LSI 402ZX - 200 MHz - ASM OPTMotorola MPC7447 - 1.3GHz

Motorola MPC7455-1GHzTI TMS320C6416-720IBM 440GX - 667 MHz

Intrins ity Fas tMATH 2GHzPMC Sierra RM7000C - 625MHz

Motorola PowerPC 7400-500IBM 440GP-500

IBM PowerPC 750CX - 500MIPS 20Kc 600 MHz

Motorola MPC755-400IBM 440GP-500

AMD K6-IIIE+ 550/ACRMotorola MPC8245-400

AMD K6-2E+ 500/ACRIBM 405GPr - 400 MHz

NEC VR5500 - 400TI TMS320C6203-300AMD K6-2E 400/ACR

Motorola PowerPC 603e - 300AMD Au1100 - 396 MHz

NEC VR5500 - 300Infineon Carm el 10xx-170

IBM 405GPr - 266 MHzStretch S5000 - 300 MHz

NEC VR5000 - 250LSI402ZX - 200

IDT79RC64575-250Toshiba TMPR4927ATB-200

Proc

esso

r

EEMBC Telemark ScoreEEMBC Telemark

OPTIMIZED

OUT-OF-BOX

Embedded Microprocessor Benchmark ConsortiumEEMBC creates, certifies and publishes performance benchmark results

Embedded Microprocessor Benchmark ConsortiumEEMBC creates, certifies and publishes performance benchmark results

S5000 Performance

With ISEF Acceleration


Application Acceleration Process


RGB → YCbCr

Color Conversion Function:

Void RGB2YCBCR (WR A, WR *B) {char Y[4], Cb[4], Cr[4];char R[4], G[4], B[4];

R[0] = A(23,16); G[0] = A(15,8); B[0] = A(7,0); /* and so on. . . */

for (i = 0; i < 4; i++) {Y[i] = (77*R[i] + 150*G[i] + 29*B[i]) >> 8;Cb[i] = (32768 - 43*R[i] - 85*G[i] + (B[i] << 7)) >> 9;Cr[i] = (32768 + (R[i] << 7) - 107*G[i] - 21*B[i]) >> 9;

}*B = (Y[3],Cr[3]+Cr[2],Y[2],Cb[3]+Cb[2],Y[1],Cr[1]+Cr[0],Y[0],Cb[1]+Cb[0]);

}

Program Loop:for (…) {

WRGET0(&A, 12); /* Load 12 bytes (4 RGB pixels) */RGB2YCBCR(A, &B); /* Convert 4 pixels */WRPUT0(B, 8); /* Store 8 bytes (4 YCbCr pixels */

}

SE_FUNC /* Tells Stretch C-Compiler to reduce this function to an instruction */


Programming Flow


Development & Debug

Debug

Programmers develop & debug using familiar tools

Faster debug cycle

Port & accelerate apps within Stretch IDE

No hardware design experience required!


Stretch S5 Development Flow

Start DoneApplication

code Compile ExecuteFast

enough?Correctoutput?

ProfileSelect

hot-spot

Create orModify

ExtensionInstructions

Modifyapplication

Y

NN

Y

C/C++Gnu-basedOptimizingcompiler

Targetsimulator

or chipGUI Debugger

Easy touse GUIOne EI does the work of

many operationsand iterations


Under The Hood


S5 Datapath and Key Parameters

Wide Register File32 128-bit entries

Load/store unit128-bit load/storeAuto increment/decrementImmediate, indirect, circularVariable-byte load/storeVariable-bit load/store

ISEF3 inputs and 2 outputsPipelined, interlocked32 16-bit MACs and 256 ALUsBit-sliced for arbitrary bit-widthAllow control logicAllow internal processor state

RISC ProcessorTensilica – Xtensa V32 KB I & D CacheOn-Chip Memory, MMU24 Channels of DMA, FPU


Instruction Set Architecture

Based on Xtensa ISASingle opcode space

– Different for each applicationHardware “caches” EI subset

– # instructions not limited by hardware

Xtensa V WR loads/stores EI’s

Always present

Xtensa V WR Loads/stores EI’s

Cached set

App 1

App 2


Matching Operating Frequency

IR

Decoder WRF

ISEF

Two isochronous clock domains– Processor runs @ higher MHZ– Programmable clock ratio (1:1→1:4)– WR acts as gateway between them– Constraints EI issue rate

Clock ratio invisible to programmer

Table-driven EI decode

ISEF clock domain


S5 Pipeline

Standard 5-stage RISC pipelineExtension instruction may be much longer

– Fully pipelinedCompiler schedules allinstructionsHardware interlocks ensure correctness

– Instruction groups subdivided by use/def requirements

– Interlocks driven by config table in Instruction Unit

I Cache

Decode

Addr gen

D Cache

Write back

AR WR

ALU Forward

DP RAM Forward

Stage 0

Stage 1

Stage 2

Stage n-1

Write back

I

R

E

M

W


Pipeline Schedule

Extension Instructions may have different latenciesInitiation interval of EI a function of:

– ISEF clock ratio– WR and state use/def

Optimal schedule typically has initiation interval > 1

– No performance penalty for lower clock rate!

Statically scheduled– Pipeline behavior is completely

known at compile time

U

D

D

U

U

D

D

U

LD n+2

OP n

ST n-4

LD n+3

OP n+1

ST n-3

DLD 0

DLD 1Startuptransient

Steadystate

Time

InitiationInterval


ISEF Architecture

Array of 64-bit ALUs– May be broken at 4-bit or 1-bit boundaries– Conditional ALU operation: Y = C ? (A op1 B) : (A op2 B)– Registers to implement state, pipelining

Array of 4x8 multipliers, cascadable to 32x32– Programmable pipelining

Programmable routing fabricExecution clock divided from processor clock: 1:1, 1:2, 1:3, 1:4Up to 16 instructions per ISEF configurationMany configurations per executableReconfiguration

– User directed or on demand– ~100 µsec for complete reconfiguration

ALUArray

MultiplierArray

Routing Fabric

ISEF Block

ALUArray

MultiplierArray

Routing Fabric

ISEF Block

WR

ALUArray

MultiplierArray

Routing Fabric

ISEF Block


Dynamic ISEF Configuration

Two ISEFsEach holds up to 16 instructionsConfigured via system bus Managed by DMAIndependently configurable

On-Demand ConfigurationAuto managed by HW and OSHandled like cache miss

Configuration PreloadingManaged by applicationHandled like instruction pre-fetch

TaskConfig

On-Demand Configuration Configuration Preloading


ISEF Bitstream Generation

ELFC/C++

opt

map

plac

e

rout

e

retim

e

bits

tream

xt-x

cc

cc

linke

r


C Compiler Tradeoffs


Examples of Extension Instructions


RGB2YCbCr: SCP Instruction Efficiency

R G B R G B R G B R G B

0 1 2 3RGB 96 bits

YCbCr 64 bits Y

Y +* *+*

>>

Cb Cr

Cr* * *+

++

>>

Cb* * *+

++

>>

0 1 2 3Y Y Cb Y Cr

*+* * * * * * * *++>> +>>

++>>

* * * * * * * * *++

+>>

++

+>>

++

+>>

* * * * * * * * *++

+>>

++

+>>

++

+>>

SCALAR PROCESSOR

4W VLIW PROCESSOR

8W SIMD PROCESSOR

SCP


Flexibility in FIR Implementations

0011

22334455

6677

8899

101112

13

CC

-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 00 11 22 33 44 55 66 77 88 99 10 11 12 13 14 15

XX

0011223344556677889910111213

CC

-13-12 -11-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 00 11 22 33 44 55 66 77 88 99 10 11 12 13 14 15XX

01

2345

67

89

101112

13

C

-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

X

8-MACs/Cycle 16-MACs/Cycle

32-MACs/CycleS5000 Strengths

– Up to 32-MAC/Cycle– Flexibility to trade-off ISEF usage

vs. compute speed– Wide Load & Stores feed the

compute array


Sample Applications


H.264 Encoder Challenges

ME

DeblockingFilter

MC

Intra Predict

ForwardTransform

InverseTransform

Quantization

InverseQuantization

Compute Intensive Tasks


H.264 Encode


802.16d (WiMAX) Challenges

FIR Freq/TimeCorrection

Sync

SoftDemapperDeinterleaver

FECViterbi

FECR-S

De-RandomizerFFT

Compute Intensive Tasks


802.16d Transmit (PHY)


802.16d Receiver (PHY)


Summary

C/C++ Programmers can tailor the processor for their application– Continue to develop & debug in a familiar environment (IDE/gdb)

Application-specific instruction provide 10-100X speedup– Modest porting effort

C/C++ programmers can easily design high-performance systems– Enabling a new system design methodology

Software Configurable Processor - Stanford University

Documents

Transcript of Software Configurable Processor - Stanford University