Software Configurable Processor - Stanford University
Transcript of Software Configurable Processor - Stanford University
© 2005 Stretch Inc. All rights reserved.
A Software Configurable Processor Architecture
Ricardo E. Gonzalez
© 2005 Stretch Inc. All rights reserved. 2
Embedded System Design Dilemma
COMPUTEPERFORMANCE
SYSTEM TIME-TO-MARKET
GPP
FPGA
ASICASSP
DSPMEDIA
SECURITYETC.
SW HW
?
© 2005 Stretch Inc. All rights reserved. 3
Solving Compute Intensive Problems
Processor with embedded programmable logic
– Lots of compute resources
Utilize fabric by extending the ISA– Design your own FU– Add processor state
Simple programming model– Issue instructions to perform
computation
Tailored for high-throughput applications
– Compute intensive
© 2005 Stretch Inc. All rights reserved. 4
Compute Intensive Markets & Applications
• Sonar/Radar• Biometrics
• CAT Scan• Ultrasound
• Printers• Scanners
• Encryption/Decryption• Network Security
• Broadcast Equipment• Audio Studio
© 2005 Stretch Inc. All rights reserved. 5
877868
629379
275181
164122
6945
4037.7
28.319.5
15.313.61312.711.411.210.210.29.58.78.37.97.16.86.86.35.85.15.14.84.74.63.43.332.9
0 100 200 300 400 500 600 700 800 900 1000
Stretch S5000 - 300 MHz - C OPTIntrins ity Fas tMATH - 2 GHz - ASM OPT
TI C6416 - 720 MHz - ASM OPTTI C6416 - 720 MHz - C OPT
BOPS Manta v2.0 - - MHz - ASM OPTBOPS Manta v2.0 - - MHz - C OPT
Motorola MPC7447 - 1.3GHz - Altivec OPTMotorola MPC7455 - 1GHz - Altivec OPT
TI C6203 - 300 MHz - ASM OPTTI C6203 - 300 MHz - C OPT
LSI 402ZX - 200 MHz - ASM OPTMotorola MPC7447 - 1.3GHz
Motorola MPC7455-1GHzTI TMS320C6416-720IBM 440GX - 667 MHz
Intrins ity Fas tMATH 2GHzPMC Sierra RM7000C - 625MHz
Motorola PowerPC 7400-500IBM 440GP-500
IBM PowerPC 750CX - 500MIPS 20Kc 600 MHz
Motorola MPC755-400IBM 440GP-500
AMD K6-IIIE+ 550/ACRMotorola MPC8245-400
AMD K6-2E+ 500/ACRIBM 405GPr - 400 MHz
NEC VR5500 - 400TI TMS320C6203-300AMD K6-2E 400/ACR
Motorola PowerPC 603e - 300AMD Au1100 - 396 MHz
NEC VR5500 - 300Infineon Carm el 10xx-170
IBM 405GPr - 266 MHzStretch S5000 - 300 MHz
NEC VR5000 - 250LSI402ZX - 200
IDT79RC64575-250Toshiba TMPR4927ATB-200
Proc
esso
r
EEMBC Telemark ScoreEEMBC Telemark
OPTIMIZED
OUT-OF-BOX
Embedded Microprocessor Benchmark ConsortiumEEMBC creates, certifies and publishes performance benchmark results
Embedded Microprocessor Benchmark ConsortiumEEMBC creates, certifies and publishes performance benchmark results
S5000 Performance
With ISEF Acceleration
© 2005 Stretch Inc. All rights reserved.
Application Acceleration Process
© 2005 Stretch Inc. All rights reserved. 7
RGB → YCbCr
Color Conversion Function:
Void RGB2YCBCR (WR A, WR *B) {char Y[4], Cb[4], Cr[4];char R[4], G[4], B[4];
R[0] = A(23,16); G[0] = A(15,8); B[0] = A(7,0); /* and so on. . . */
for (i = 0; i < 4; i++) {Y[i] = (77*R[i] + 150*G[i] + 29*B[i]) >> 8;Cb[i] = (32768 - 43*R[i] - 85*G[i] + (B[i] << 7)) >> 9;Cr[i] = (32768 + (R[i] << 7) - 107*G[i] - 21*B[i]) >> 9;
}*B = (Y[3],Cr[3]+Cr[2],Y[2],Cb[3]+Cb[2],Y[1],Cr[1]+Cr[0],Y[0],Cb[1]+Cb[0]);
}
Program Loop:for (…) {
WRGET0(&A, 12); /* Load 12 bytes (4 RGB pixels) */RGB2YCBCR(A, &B); /* Convert 4 pixels */WRPUT0(B, 8); /* Store 8 bytes (4 YCbCr pixels */
}
SE_FUNC /* Tells Stretch C-Compiler to reduce this function to an instruction */
© 2005 Stretch Inc. All rights reserved. 8
Programming Flow
© 2005 Stretch Inc. All rights reserved. 9
Development & Debug
Debug
Programmers develop & debug using familiar tools
Faster debug cycle
Port & accelerate apps within Stretch IDE
No hardware design experience required!
© 2005 Stretch Inc. All rights reserved. 10
Stretch S5 Development Flow
Start DoneApplication
code Compile ExecuteFast
enough?Correctoutput?
ProfileSelect
hot-spot
Create orModify
ExtensionInstructions
Modifyapplication
Y
NN
Y
C/C++Gnu-basedOptimizingcompiler
Targetsimulator
or chipGUI Debugger
Easy touse GUIOne EI does the work of
many operationsand iterations
© 2005 Stretch Inc. All rights reserved.
Under The Hood
© 2005 Stretch Inc. All rights reserved. 12
S5 Datapath and Key Parameters
Wide Register File32 128-bit entries
Load/store unit128-bit load/storeAuto increment/decrementImmediate, indirect, circularVariable-byte load/storeVariable-bit load/store
ISEF3 inputs and 2 outputsPipelined, interlocked32 16-bit MACs and 256 ALUsBit-sliced for arbitrary bit-widthAllow control logicAllow internal processor state
RISC ProcessorTensilica – Xtensa V32 KB I & D CacheOn-Chip Memory, MMU24 Channels of DMA, FPU
© 2005 Stretch Inc. All rights reserved. 13
Instruction Set Architecture
Based on Xtensa ISASingle opcode space
– Different for each applicationHardware “caches” EI subset
– # instructions not limited by hardware
Xtensa V WR loads/stores EI’s
Always present
Xtensa V WR Loads/stores EI’s
Cached set
App 1
App 2
© 2005 Stretch Inc. All rights reserved. 14
Matching Operating Frequency
IR
Decoder WRF
ISEF
Two isochronous clock domains– Processor runs @ higher MHZ– Programmable clock ratio (1:1→1:4)– WR acts as gateway between them– Constraints EI issue rate
Clock ratio invisible to programmer
Table-driven EI decode
ISEF clock domain
© 2005 Stretch Inc. All rights reserved. 15
S5 Pipeline
Standard 5-stage RISC pipelineExtension instruction may be much longer
– Fully pipelinedCompiler schedules allinstructionsHardware interlocks ensure correctness
– Instruction groups subdivided by use/def requirements
– Interlocks driven by config table in Instruction Unit
I Cache
Decode
Addr gen
D Cache
Write back
AR WR
ALU Forward
DP RAM Forward
Stage 0
Stage 1
Stage 2
Stage n-1
Write back
I
R
E
M
W
© 2005 Stretch Inc. All rights reserved. 16
Pipeline Schedule
Extension Instructions may have different latenciesInitiation interval of EI a function of:
– ISEF clock ratio– WR and state use/def
Optimal schedule typically has initiation interval > 1
– No performance penalty for lower clock rate!
Statically scheduled– Pipeline behavior is completely
known at compile time
U
D
D
U
U
D
D
U
LD n+2
OP n
ST n-4
LD n+3
OP n+1
ST n-3
DLD 0
DLD 1Startuptransient
Steadystate
Time
InitiationInterval
© 2005 Stretch Inc. All rights reserved. 17
ISEF Architecture
Array of 64-bit ALUs– May be broken at 4-bit or 1-bit boundaries– Conditional ALU operation: Y = C ? (A op1 B) : (A op2 B)– Registers to implement state, pipelining
Array of 4x8 multipliers, cascadable to 32x32– Programmable pipelining
Programmable routing fabricExecution clock divided from processor clock: 1:1, 1:2, 1:3, 1:4Up to 16 instructions per ISEF configurationMany configurations per executableReconfiguration
– User directed or on demand– ~100 µsec for complete reconfiguration
ALUArray
MultiplierArray
Routing Fabric
ISEF Block
ALUArray
MultiplierArray
Routing Fabric
ISEF Block
WR
ALUArray
MultiplierArray
Routing Fabric
ISEF Block
© 2005 Stretch Inc. All rights reserved. 18
Dynamic ISEF Configuration
Two ISEFsEach holds up to 16 instructionsConfigured via system bus Managed by DMAIndependently configurable
On-Demand ConfigurationAuto managed by HW and OSHandled like cache miss
Configuration PreloadingManaged by applicationHandled like instruction pre-fetch
TaskConfig
On-Demand Configuration Configuration Preloading
© 2005 Stretch Inc. All rights reserved. 19
ISEF Bitstream Generation
ELFC/C++
opt
map
plac
e
rout
e
retim
e
bits
tream
xt-x
cc
cc
linke
r
© 2005 Stretch Inc. All rights reserved. 20
C Compiler Tradeoffs
© 2005 Stretch Inc. All rights reserved.
Examples of Extension Instructions
© 2005 Stretch Inc. All rights reserved. 22
RGB2YCbCr: SCP Instruction Efficiency
R G B R G B R G B R G B
0 1 2 3RGB 96 bits
YCbCr 64 bits Y
Y +* *+*
>>
Cb Cr
Cr* * *+
++
>>
Cb* * *+
++
>>
0 1 2 3Y Y Cb Y Cr
*+* * * * * * * *++>> +>>
++>>
* * * * * * * * *++
+>>
++
+>>
++
+>>
* * * * * * * * *++
+>>
++
+>>
++
+>>
SCALAR PROCESSOR
4W VLIW PROCESSOR
8W SIMD PROCESSOR
SCP
© 2005 Stretch Inc. All rights reserved. 23
Flexibility in FIR Implementations
0011
22334455
6677
8899
101112
13
CC
-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 00 11 22 33 44 55 66 77 88 99 10 11 12 13 14 15
XX
0011223344556677889910111213
CC
-13-12 -11-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 00 11 22 33 44 55 66 77 88 99 10 11 12 13 14 15XX
01
2345
67
89
101112
13
C
-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
X
8-MACs/Cycle 16-MACs/Cycle
32-MACs/CycleS5000 Strengths
– Up to 32-MAC/Cycle– Flexibility to trade-off ISEF usage
vs. compute speed– Wide Load & Stores feed the
compute array
© 2005 Stretch Inc. All rights reserved.
Sample Applications
© 2005 Stretch Inc. All rights reserved. 25
H.264 Encoder Challenges
ME
DeblockingFilter
MC
Intra Predict
ForwardTransform
InverseTransform
Quantization
InverseQuantization
Compute Intensive Tasks
© 2005 Stretch Inc. All rights reserved. 26
H.264 Encode
© 2005 Stretch Inc. All rights reserved. 27
802.16d (WiMAX) Challenges
FIR Freq/TimeCorrection
Sync
SoftDemapperDeinterleaver
FECViterbi
FECR-S
De-RandomizerFFT
Compute Intensive Tasks
© 2005 Stretch Inc. All rights reserved. 28
802.16d Transmit (PHY)
© 2005 Stretch Inc. All rights reserved. 29
802.16d Receiver (PHY)
© 2005 Stretch Inc. All rights reserved. 30
Summary
C/C++ Programmers can tailor the processor for their application– Continue to develop & debug in a familiar environment (IDE/gdb)
Application-specific instruction provide 10-100X speedup– Modest porting effort
C/C++ programmers can easily design high-performance systems– Enabling a new system design methodology