mpsoc2004 vissers final - MPSoC 2018 · MPsoC, July 5 - 9, 2004 Kees Vissers 25 ... [bytestream] To...
Transcript of mpsoc2004 vissers final - MPSoC 2018 · MPsoC, July 5 - 9, 2004 Kees Vissers 25 ... [bytestream] To...
A qualitative analysis of the benefits of LUTs, Processors, embedded memory and interconnect in MPSoCplatforms
Kees VissersXilinx Research
Kees VissersMPsoC, July 5 - 9, 2004 2
OUTLINE
• Historical Perspective• Conventional FPGAs• Applications and Programming• Future Directions
Kees VissersMPsoC, July 5 - 9, 2004 3
FPGA, the last 10 years
1/91 1/92 1/93 1/94 1/95 1/96 1/97 1/98 1/99 1/00 1/01 1/02 1/03 1/04
Year
Price: 300x
Speed: 20x
Capacity: 200x
1
10
100
1000
CapacitySpeedPrice Virtex-II Pro
10x
1x
Spartan-3
Virtex-II Pro
MemoryBandwidth
4000x- 6000xcustomer value!
Kees VissersMPsoC, July 5 - 9, 2004 4
The market
Num
ber o
f Des
igns
(tho
usan
ds)
25
50
75
100
1999 2000 2001 2002 2003 2004 2005
FPGA
ASIC
• Design Starts$
Mill
ions
010,
20,
40,
50,
60,
70,
2001 2002 2003 2004 2005 2006 2007
ASICASSP
→ 2007 ASIC/ASSP = 80 B$→ NRE = 30M$→ R&D = 20% revenue→ Revenue = 150M$→ ~500 successful ASSP/ASIC
30,
Kees VissersMPsoC, July 5 - 9, 2004 5
The ultimate solution
• Glueless interfaces• Can handle all external memory, e.g DDR RAM,
QDR RAM, SRAM• Can do all the processing• Is fully programmable/reconfigurable in all
aspects: I/O, function, memory architecture, signal level, etc
ONLY Platform FPGAs can do this.
Kees VissersMPsoC, July 5 - 9, 2004 6
Processors: Itanium 2 6M
• 130 W (107 W typical)• 3 Levels of Cache: 16k + 16k, 256K, 6M• 130nm• 1.5 GHz• 95 percent of the 7,877 pins are for power • 1,322 Specint base2000 for a single processor!• Next generation: 1.7GHz and 9MByte cache!Ref: IEEE Micro April 2004, vol. 24, Issue 2, pp. 10-18: ITANIUM 2 PROCESSOR 6M: HIGHER FREQUENCY AND
LARGER L3 CACHE, Rusu et. al, Intel.
Kees VissersMPsoC, July 5 - 9, 2004 7
Processors: DSP
• TI C6414: 2 levels of cache• Philips PNX1500 (Trimedia core): 2 levels of
cache• Often specific DMA, Bus architecture and
optimized main memory architecture: does not easily fit the C programming paradigm.
Conclusion: the memory model is a problem
Kees VissersMPsoC, July 5 - 9, 2004 8
ALU
A Von Neumann-style Computation Example
256 Tap FIR Filter Example
256 Tap FIR Filter yn = Σ ckxn-kK=0
255
RegisterFile
r1 =
x0
r0 =
r1*
#c0
r1 =
x1
r0 =
r1*
#c1
...
...
...
...
...
y0 =
r1
...
r0 =
#0
x0
x1
x2
x3
...
x253
x2
54
x255
y0
InstructionDecode
Bottle-necks!
Kees VissersMPsoC, July 5 - 9, 2004 9
• For highest speed, use as many compute elements as problem allows.
• Capable of producing an output sample in one “instruction cycle”.
• Call this “Spatial computing”. (see DeHon)
FIR Implementation in RC
....
c255
xn
ynReg Reg
c254 c1Reg
c0
....
Kees VissersMPsoC, July 5 - 9, 2004 10
Full Parallel
Semi-Parallel
Serial
• Tailor a solution to optimize throughput, precision and cost
Reg Reg
Customizing Architecture in RC
Kees VissersMPsoC, July 5 - 9, 2004 11
Architectural EfficiencySpatial vs. Temporal Computing
FPGA (c=w=1) “Processor” (c=1024, w=64)
Figures courtesy of André DeHon, California Institute of Technology
Kees VissersMPsoC, July 5 - 9, 2004 12
OUTLINE
• Historical Perspective• Conventional FPGAs• Applications and Programming• Technology Impact• Financial Impact• Future Directions
Kees VissersMPsoC, July 5 - 9, 2004 13
4
16 words x 1 bit m
emory
[A,B,C,D]
F
• A 4-input lookup table(LUT) can implement any function of 4 inputs.
• For example, a 1-bit adder needs 2 LUTs:
A⊕B⊕Ci
A.B.Ci
AB
Ci
Co
S
Building an FPGA: Logic First
Kees VissersMPsoC, July 5 - 9, 2004 14
Out
In4
FF
CE RST
M16 words x 1 bit m
emory
M
M
ClkCERst
Add FF to make a Logic Cell
Kees VissersMPsoC, July 5 - 9, 2004 15
4
FF
CE RST
M16 words x 1 bit m
emory
Carry M
M
M
DinWECin
Cout
• Make LUT RAM a user resource.
• Fast carry ripple to neighbor.
Arithmetic, Distributed RAM
Kees VissersMPsoC, July 5 - 9, 2004 16
4
4
4
4
4
4
4
4
40
• Group logic cells to reduce overhead.
• Add H, V routing channels with switchboxes.
• Add input, output MUXing between logic and routing.
Add Interconnect
Kees VissersMPsoC, July 5 - 9, 2004 17
4
4
4
4
4
4
4
4
40
4
4
4
4
4
4
4
4
40
4
4
4
4
4
4
4
4
40
4
4
4
4
4
4
4
4
40
Build an Array
Kees VissersMPsoC, July 5 - 9, 2004 18
Putting the ‘R’ in RC • Fine-grained FPGAs are the platform of choice for
Reconfigurable Computing.
State
Configuration
User Logic Configuration RAM
Kees VissersMPsoC, July 5 - 9, 2004 19
Add Bells & Whistles HardProcessor
I/O
BRAM
Gigabit Serial
Multiplier
ProgrammableTermination
Z
VCCIO
Z
Z
ImpedanceControl Clock
Mgmt
18 Bit
18 Bit36 Bit
Kees VissersMPsoC, July 5 - 9, 2004 20
Problem definition
• Perform processing on a stream of data• Sampling rates in the order of 100KHz to 100MHz• per sample 1000 – 100,000 operationsPerform 1010-1011 operations/sec, like add, multiply
etc.
Note: performance is a required part of the solution!
Kees VissersMPsoC, July 5 - 9, 2004 21
So when do I need what?
• Processor + Cache• LUT• Embedded Memory• Bus• Reconfigurable interconect• Internal Memory• External Memory• I/O
Kees VissersMPsoC, July 5 - 9, 2004 22
OUTLINE
• Historical perspective• Conventional FPGAs• Applications and Programming• Technology and Financial Impact• Future Directions
Kees VissersMPsoC, July 5 - 9, 2004 23
JPEG2000 Encoder: Functional Block Diagram
2-D DWT
BitModele
r
Inputdata
Compresseddata“MQ”
Entropy
Coder
Header/Stream Creatio
n
Tier-2CodingTier-1 Coding
Quantizer
DC-Shift
&Comp.Trans.
⇑MQ Coder:
Create codewordsusing arithmetic
coding
Kees VissersMPsoC, July 5 - 9, 2004 24
87.8%
6.6%4.0%1.7%
Interface2-D DWTTier-1 CodingTier-2 Coding
JPEG2000: Run-Time Profiling
* Run on XRL’s JPEG2000 ANSI C code
Kees VissersMPsoC, July 5 - 9, 2004 25
Tier 1 Coder
JPEG2000 Encoder on Multimedia Board
ZBT Memor
y
Multi-Memory / Multi-Port ZBT Interface
Frame Capture
Multi-Pass DWT
Buffer 1 Buffer 2 Buffer 3
RGB
DMA 32 bit
MicroBlaze/ OPB
Bus
Processor
Memory INSTRU
CT
32 32
Data OPB Bus
Interrupt request (IR)IR IR
32Arbite
r/ Bridge
Tier 1 LogicTier 1 CoderTier-1
CoderYCrCb
ZBT Memor
y
ZBT Memor
y
Processor
Memory DATA32
3232
Kees VissersMPsoC, July 5 - 9, 2004 26
Tier-1 Coder: Block Diagram[Original Design]
BRAMs4096 x
Nb
[sign-magnitud
e]
FIFO2048 x
8b
From Off-Chip DWT Buffer
“MQ”Arithmetic Coder
Bit Modeling
FIFO1024 x
32b
[bytestream]
To Off-Chip Comp. Buffer
32
1
5
2
1
5
2
BRAMs2 x
1024 x 4b
BRAM1024 x
16b
Tier-1 Coding Control
Dist. ROM
47 x 29b
Probability/Next State Table
Significance/ Process Flags
Magnitude Bit Planes
D
CX
Flags
Dist. RAM
19 x 7b
Index/MPS Table
MemoryFPGA Fabric
BRAM1024 x
4b
Sign-Mag
Convert&
Sticky Bits
N-1
NN N
Sign Bit Plane
Kees VissersMPsoC, July 5 - 9, 2004 27
Technology ApplicationPerformance
Input Data Rate(Msamples/sec)
CompressionRatio
SystemClock Rate
(MHz)FPGA1 640 x 480
@ 30 fps18.4 10:1 54
DSPProcessor2 352 x 240
@ 8 fps
1.35 10:1 600
ASIC3 720 x 480@ 30 fps
20.7 N/A 27
Sources: 1Schumacher 2Aware Inc. 3Yamauchi
2-D DWT Arithmetic Encoder
Bit Modeler
JPEG2000 Example
Kees VissersMPsoC, July 5 - 9, 2004 28
Power Dissipation (in Watts)
Technology Device ExternalMemory Total
Relative PowerEfficiency(Relative
MOPS/mW)FPGA 3.5 3.5 7.0 1DSP
Processor 1.7 2.4 4.1 0.13
ASIC 2.0 0.2 2.2 3.6
• FPGA has 8x efficiency of processor– 14 x throughput– 1/10 x system clock
• More on-board memory would help. Look for evolving memory hierarchies.
Power Efficiency Results
Kees VissersMPsoC, July 5 - 9, 2004 29
Job Farm w/ Eight MicroBlazes
MicroBlaze
0
UARTOPBTimer
BlockRam
HWSemaphore
MicroBlaze
1
BlockRam
MicroBlaze
2
BlockRam
MicroBlaze
3
BlockRam
MicroBlaze
4
BlockRam
MicroBlaze
5
BlockRam
MicroBlaze
6
BlockRam
MicroBlaze
7
BlockRam
ZBT Memory
Job Farm InformationJob 1 Job 2 Job 3 Job 4Job 5 Job 6 • • • Job N
ImageData
Kees VissersMPsoC, July 5 - 9, 2004 30
Multiple MicroBlaze DWT Results
1
1.5
2
2.5
3
3.5
4
1 2 3 4 5 6 7 8
Processors
Spee
dup
Kees VissersMPsoC, July 5 - 9, 2004 31
Job Farm
• Make sure the ‘bus’ or the internal network can handle the required bandwidth
• Make sure that the latency is under control: drive towards multi-threading/multi context
• The granularity of the communication matters
Kees VissersMPsoC, July 5 - 9, 2004 32
OUTLINE
• Historical perspective• Conventional FPGAs• Applications and Programming• Future Directions
Kees VissersMPsoC, July 5 - 9, 2004 33
System Generator for DSP• Library-based, visual data flow• Polymorphic operators• Arbitrary precision fixed-point• Bit and cycle true modeling
• Seamlessly integrated with Simulink and MATLAB
– Test bench and data analysis
• Automatic code generation– Synthesizable VHDL– IP cores– HDL test bench– Project and constraint files
Kees VissersMPsoC, July 5 - 9, 2004 34
System Generator for DSP
System Generator for DSP v3.1 Allows Systems Designers to target external simulation engines- Hardware in the loop- HDL co-simulation
System Generator for DSP v3.1 Allows Systems Designers to target external simulation engines- Hardware in the loop- HDL co-simulation
HDL co-simulation using ModelSimAllows the user automatically invoke ModelSim and simulate his Verilog/VHDLdirectly from Simulink.
HDL co-simulation using ModelSimAllows the user automatically invoke ModelSim and simulate his Verilog/VHDLdirectly from Simulink.
Hardware in the loopco-simulationAllows the user to simulate part of his system on actual hardware. This means acceleration & verification
Hardware in the loopco-simulationAllows the user to simulate part of his system on actual hardware. This means acceleration & verification
Kees VissersMPsoC, July 5 - 9, 2004 35
Documented Reference Designs
• 16-QAM receiver, including LMS based equalizer and carrier recovery loop
• A/D and delta-sigma D/A conversion• Concatenated FEC codec for DVB• Custom FIR filter reference library• Digital down converter for GSM• 2D discrete wavelet transform (DWT)
filter• 2D filtering using a 5x5 operator• Color space conversion• CORDIC reference design• Polyphase MAC-based FIR• Streaming FFT/IFFT• BER Tester using AWGN model
Kees VissersMPsoC, July 5 - 9, 2004 36
OUTLINE
• Historical Perspective• Conventional FPGAs• Applications and Programming• Technology and Financial Impact• Future Directions
Kees VissersMPsoC, July 5 - 9, 2004 37
The mix revisited
• Processor + Cache• LUT• Embedded Memory• Bus• Reconfigurable interconect• Internal Memory• External Memory• I/O
Kees VissersMPsoC, July 5 - 9, 2004 38
Virtex-4 LXLogic Platform
Virtex-4 SXDSP Platform
Virtex-4 FXConnectivity
Platform
The Future : Domain Optimized Programmable Platforms
• Programmable – mass market of one
• Parallelism – performance and power
• Scalable – ride Moore’s Law
• Distributed memory – data transfer bottleneck
• Regular – manufacturable
• Domain optimized – cost
• Hyper-programming – more degrees of freedom, – spatial computing
Processing DomainCompute and IO
performance
Logic DomainLogic density
DSP DomainDSP performance
Kees VissersMPsoC, July 5 - 9, 2004 39
The Characteristics
500MHz DCM DigitalClock Management
500MHz FlexibleLogic Architecture
500MHz multi-portDistributed RAM
500MHzProgrammable DSP
Execution Units
450MHz PowerPC™Processor
WithAuxiliary Processor Unit
1Gbps DifferentialI/O
0.6-11.1GbpsSerial Transceivers
Kees VissersMPsoC, July 5 - 9, 2004 40
The tool we need
Decode H
eaders(V
OL, V
OP)
DM
UX
Motion Decoder
TextureVL
DecoderVLD
1
1
1
8FIFO
FIFO
ObjectFIFO
Motion Vectors
InverseScan Inverse
Quantisation / IDCT
8
Inverse AC DC
Prediction
Coeffs and DC
MotionCompensation
ObjectFIFOMotion
Vectors8×8
Block
Not Coded
8x8 Block
8×8 Block
Current FrameMemory
Controller
External SRAM
Current Frame
8×8 Block
8×8 Block
Display Controller
MPEG-4 Decoder
InputInputstream stream
Output Image Output Image Blocks Blocks
ObjectFIFO
Software Orchestrator
DCT Coeff
FIFO
Kees VissersMPsoC, July 5 - 9, 2004 41
The author gratefully acknowledges the contributions of material, insight and feedback from:
David Parlour, Steve Trimberger, Robert Turney, Paul Schumacher, Rick Ballantyne, Mark Paluszkiewicz-Xilinx Research Labs
André DeHon - California Institute of Technology
Kristof Denolf, Adrian Chirila-Rus, Bart Vanhoof, Jan Bormans - IMEC
Acknowledgements
Kees VissersMPsoC, July 5 - 9, 2004 42
University donations
•http://www.xilinx.com/univ/xup/ubroch/qform.htm
•http://www.xilinx.com/univ/ML310/ml310_mainpage.html