ISICT 2005
Supercomputing going Reconfigurable
Reiner Hartenstein
TU Kaiserslautern
Jan. 4-6, 2005, Capetown, South Africa
© 2004, [email protected] http://hartenstein.de
My relations to South Africa
October 1981: reporting about the KARL language and CHDL 1981 at Kaiserslautern
http://hartenstein.de/Star.jpg
Univ. Stellenbosch: an early KARL licensee
http://hartenstein.de/KARL-users.html
>> The methodology gap <<
•The methodology gap
•The wrong Roadmap
•HPC goes reconfigurable
•Coarse grain is the way to go
•Curricula are obsolete
•Final remarks
http://www.uni-kl.de
FPGA with island architecture
rLB: reconfigurable logic box
reconfigurable interconnect fabrics
logic design issue: far from computing mind set
software to configware migration: speed-up examples
application example | method | speed-up factor | platform
16 tap FIR filter | straight forward | x16 (MOPS/mW) | PACT Xtreme 4-by-4 array [2003]
grid-based DRC** (1-metal 1-poly nMOS***, 256 reference patterns) | multiple aspects | > x1000 (computation time) | MoM anti machine with DPLA* [1983]
migrate several simple application examples | hi level synthesis | x7 – x46 (compute time) | CPU 2 FPGA [FPGA 2004]
from fastest DSP: 10 gMACs to 1 teraMAC | not spec. | x100 (compute time) | DSP 2 FPGA [Xilinx 2004, 2)]
key issue: algorithmic cleverness
*) DPLA: MPC fabr. via E.I.S. multi univ. project
**) Design Rule Check
***) for 10-metal 3-poly cMOS expected: >> x10,000
2) Wim Roelandts
FPGAs are mainstream
FPGA market is worth $3 billion
Predicted to grow to $5 billion by 2007
fastest growing segment of IC market
soft hardware: morphware [DARPA]
my talk is not about FPGAs (reconfigurable logic)
it's about coarse grain: Reconfigurable Computing
>> The wrong roadmap <<
•The methodology gap
•The wrong Roadmap
•HPC goes reconfigurable
•Coarse grain is the way to go
•Curricula are obsolete
•Final remarks
http://www.uni-kl.de
data are moved around by software
(slower than CPU clock by 2 orders of magnitude)
i.e. by memory-cycle-hungry instruction streams which fully hit the memory wall
extremely unbalanced
stolen from Bob Colwell
CPU
The wrong roadmap
But HPC and supercomputing have stubbornly avoided configware use for the past 20 years
Educational deficits are one reason
An exception: Splash - it has been discarded
Adopting a new mindset: the brain hurts
Configware methodology moves data around more efficiently
It's not Software, it's Configware
The programming sources for morphware and Reconfigurable Computing are fundamentally different:
Software to Configware migration includes a time to space conversion
structural programming instead of procedural ... data-stream-based instead of instruction-stream-based
no instruction fetch
Configware execution does not need instruction fetch ...
... because a (super) instruction fetch happens before run time: (re)configuration,
which saves memory cycles, and ...
... more performance benefit comes from other acceleration mechanisms which stem from time to space migration by Reconfigurable Computing
You don't believe ?
path of least resistance*: avoiding a paradigm shift
Many researchers seem never to stop working on sophisticated solutions for marginal improvements ...
... continuously ignoring methodologies promising speed-ups by orders of magnitude ...
... and instead continue to bang their heads against the memory wall
blinders to ignore the impact of morphware
*) [Michel Dubois]
the data-stream-based approach has no von Neumann bottle-neck
… understand only this parallelism solution: the instruction-stream-based approach (von Neumann bottle-necks)
... cannot cope with this one
Living dangerously … me talking to HPC people:
„the wrong roadmap the past 20 years“
the Hardware / Software Chasm:
typical programmers don't understand function evaluation without machine mechanisms (counters, state registers)
It's the gap between the procedural (instruction-stream-based) and the structural (data-stream-based) mind set
accelerators / µprocessor
>> HPC goes Reconfigurable <<
•The methodology gap
•The wrong Roadmap
•HPC goes reconfigurable
•Coarse grain is the way to go
•Curricula are obsolete
•Final remarks
http://www.uni-kl.de
By the way ...
... the oldest and largest conference in the field:
15th International Conference on Field-Programmable Logic and Applications (FPL)
Aug. 24 – 26 2005, Tampere, Finland
http://fpl.org
in 2004: 288 submissions !
accel. / µProc.: ... going into every type of application
they all work on high performance
another conference ...
http://hartenstein.de/raw05.html
April 4 – 5, 2005, Denver, Colorado, in conjunction with IPDPS 2005 !
99 submissions !
accel. / µProc.: ... going into every type of application
they all work on high performance
more conferences ...
FPL series: 15th at Tampere, Finland
RAW series: 12th at Denver, Colorado
DRS W. on Dynamically Reconf. Systems, Innsbruck, Austria 2005
W. Archit. Res. using FPGA platforms in conj. w. HPCA-11, SFO, Febr 2005
PACT'98 w. RC, Paris 1998
MAPLD (mil appl. ...) USA only - 8th
FPGA, Monterey, California
FCCM, Napa, California, April 2005
FPT 2004 Int'l Conf.
ERSA, Las Vegas, Nevada, USA
RHPC - in conj. w. HPC Asia
ARC 2005 Int'l w. Applied RC, Algarve, Portugal, February 22, 2005
ReConFig04 Univ. of Colima, Mexico
DFG w. 04 at DaimlerChrysler, Germany
ICES'05 Int'l C. on Evolvable Systems, From Biology To Hardware
special tracks in other conferences: ICECS, DATE, DAC, others
Research programs: DARPA, EU, DFG ...
EuroGP - European Conference on Genetic Programming
GECCO - Genetic Evolutionary Computation Conference
CEC - Congress on Evolutionary Computation
SEAL - Asia-Pacific Conf. on Simulated Evolution And Learning
EA - International Conference on Artificial Evolution, 7th, October 26-28 2005, Lille, France
ECML - European Conference on Machine Learning
IEEE Conference on Evolutionary Computation
International Conference on Evolutionary Programming (EP)
European Conference on Artificial Evolution (AE)
EUNITE - European Symp. on Intelligent Technologies, Hybrid Systems and their Implementation in Smart Adaptive Systems
EUROGEN - Evolutionary Methods for Design, Optimisation and Control with Applications to Industrial Problems
ACDM - International Conference on Adaptive Computing in Design and Manufacture
EvoRobot - European Workshop on Evolutionary Robotics
.......
First Indications of Change
10th RAW at IPDPS, Nice, France, April 2003: after a decade of non-overlap: first IPDPS people coming
HPC Asia 2004 - 7th Int‘l Conference on High Performance Computing, July 20-22, 2004 Omiya Sonic City, Tokyo Area, Japan: Workshop on Reconfigurable Systems f. HPC (RHPC) + keynote address*
HPCA-11, 11th International Symposium on High-Performance Computer Architecture, San Francisco, Febr. 12-16, 2005: topic area explicitly: Embedded and reconfigurable architectures
SBAC-PAD 2004 - 16th Symposium on Computer Architecture and High Performance Computing, Foz do Iguacu, PR, Brazil, October 27-29, 2004: topic area explicitly: Reconfigurable Systems
*) keynote speaker:
PARS & Speed-up, Basel, Switzerland, March 2003: keynote address*
IPDPS, Santa Fe, NM, USA, April 2004: keynote address*
PDP’04, La Coruna, Spain, Febr. 2004: keynote address*
Cray XD1
•FPGAs as programmable co-processors for parallelization
•6 optional DSPs as accelerators
•Xilinx Virtex-II Pro FPGAs cooperate with AMD Opteron
•FPGAs dynamically programmable by Configware.
•Cray provides a configware library with special algorithms for search and sort, DSP, and encryption.
•Obtains a speed-up of >100 in genome sequencing.
>> Coarse grain is the way to go <<
•The methodology gap
•The wrong Roadmap
•HPC goes reconfigurable
•Coarse grain is the way to go
•Curricula are obsolete
•Final remarks
http://www.uni-kl.de
rDPA (coarse grain) vs. FPGA (fine grain)
roughly: area efficiency (transistors/chip, orders of magnitude)
hardwired: 4
(coarse grain) rDPA: 4
FPGA: 2
µProc: 0
roughly: performance (MOPS/mW, orders of magnitude)
hardwired: 3
(coarse grain) rDPA: 3
FPGA: 2
DSP: 1
µProc: 0
Coarse grain is about computing, not logic
Example: mapping onto rDPA by DPSS: based on simulated annealing
SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]
array size: 10 x 16 = 160 rDPUs
rDPU: reconfigurable function block, e. g. 32 bits wide
no CPU
Legend: rDPU not used; used for routing only (rout thru only); operator and routing; port location marker; backbus connect
branching example: time-to-space migration
S = R + (if C then A else B endif);
on a very simple CPU (C = 1):
step | memory cycles | nanoseconds
if C then read A:
read instruction | 1 | 100
instruction decoding
read operand* | 1 | 100
operate & register transfers
if not C then read B:
read instruction | 1 | 100
instruction decoding
add & store:
read instruction | 1 | 100
instruction decoding
operate & register transfers
store result | 1 | 100
total | 5 | 500
*) if no intermediate storage in register file
section of a very large pipe network (figure: inputs A, B, R, C; operators "=1" and "+"; output S)
Clock 200 MHz (5 nanosec)
no memory cycles: speed-up factor = 100
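The numbers on this slide can be checked with a small sketch. It only restates the slide's own figures (5 memory cycles of 100 ns each on the CPU, one 5 ns clock step in the pipe network). The function names and the modelling of the network as a selector feeding an adder are assumptions made for illustration, not taken from the talk.

```python
# Illustrative cost model; all names are hypothetical, numbers come from the slide.
MEMORY_CYCLE_NS = 100  # one memory cycle on the "very simple CPU"
PIPE_CLOCK_NS = 5      # 200 MHz clock of the pipe network


def cpu_time_ns(memory_cycles: int = 5) -> int:
    """Time spent in memory cycles for S = R + (A if C else B) on the CPU."""
    return memory_cycles * MEMORY_CYCLE_NS


def pipe_network(r: int, a: int, b: int, c: bool) -> int:
    """Spatial version: C selects A or B, an adder combines the result with R.

    No instruction is fetched and no memory cycle is spent at run time.
    """
    selected = a if c else b  # selector (assumed multiplexer)
    return r + selected       # adder


speedup = cpu_time_ns() / PIPE_CLOCK_NS
print(speedup)
```

The ratio is computed from memory cycles only, as on the slide; decoding and register transfers are ignored on both sides.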
commercial rDPA example: PACT XPP - XPU128
XPP128 rDPA
• Full 32 or 24 Bit Design working silicon
• 2 Configuration Hierarchies
• Evaluation Board available, and
• XDS Development Tool with Simulator
[Figure: (r)DPA built from rDPUs, each with PAE core, ALU, ALU Ctrl and CFG; buses not shown]
© PACT AG, Munich http://pactcorp.com
XPP64A: Coarse-grain Reconfigurable Array
64 ALU-PAEs
16 RAM-PAEs
24-bit architecture
Split (12,12)-bit opcodes
complex addition
complex multiplication
conditional sign-flip
JTAG debug interface
Synthesis, P&R, Layout from ACCENT
0.13µm silicon from STMicro, Crolles
Available since March 2003
XPP64A: Platform Development Board
- SDR Board in Debug Phase -> XPP64A Chips from STMicro Fab
- Assembly & Test / Available since March 2003
SMeXPP: Multimedia Co-Processor Concept
Games, Music, Videos
SMeXPP
Camera
Baseband-Processor
Radio-Interface
Audio-Interface
SD/MMC Cards
LCD DISPLAY
• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise Reduction and Artifact Removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable Displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness Enhancement
• Shadow Enhancement
• Differentiation
• Programmable deinterlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes
>> Curricula are obsolete <<
•The methodology gap
•The wrong Roadmap
•HPC goes reconfigurable
•Coarse grain is the way to go
•Curricula are obsolete
•Final remarks
http://www.uni-kl.de
Completely wrong mind set
beef up old architecture principles by new technology?
The key problem, the memory wall, cannot be solved by new CPU technology
We need a 2nd machine paradigm (a 2nd mind set ...)
The vN paradigm is not a communication paradigm
Its monopoly creates a completely wrong mind set
We need an architectural communication paradigm
But we need both paradigms: a dichotomy
von Neumann is not the common model
mainframe age: von Neumann instruction-stream-based machine: CPU (program counter, DPU), RAM memory, von Neumann bottleneck; software
microprocessor age: CPU (instruction-stream-based, software) + accelerator co-processors (data-stream-based, hardware)
configware age: CPU (software) + reconfigurable accelerator (morphware, configware); software/configware co-compiler
Why a new machine paradigm ???
The anti machine as the 2nd machine paradigm is the key to curricular innovation
... a Trojan horse to introduce the structural domain to the procedural-only mind set of programmers
RAM-based platform needed for:
• flexibility, programmability
• avoiding the need of specific silicon (mask cost: > 2 mio $ - growing)
2nd machine paradigm needed as a common model:
• to avoid the need of circuit expertise
• needed to educate zillions of programmers
[Figure: CPU (program counter, DPU) next to rDPA with asM (data counter, memory bank), programmed by flowware]
>> Final Remarks <<
•The methodology gap
•The wrong Roadmap
•HPC goes reconfigurable
•Coarse grain is the way to go
•Curricula are obsolete
•Final remarks
http://www.uni-kl.de
CS Education
You cannot teach Hardware to a Programmer*) efficiently
But to a Hardware Guy or gal you always can teach Programming
procedural: have not (stems from vN monopoly)
structural: have (hardware guy or gal: natural)
Growth Rate of Embedded Software
[Figure: growth factor (1 to 2) over time (0, 10, 12, 18 months): Embedded software [DTI*] (~2.5/yr) vs. [Moore's law] (1.4/year)]
*) Department of Trade and Industry, London
>10 times more programmers will write embedded applications than computer software by 2010
already today, more than 98% of all microprocessors are used within embedded systems
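To get a feeling for the two growth rates quoted in the figure, a minimal sketch of compound growth follows. The rates (about 2.5 per year for embedded software, 1.4 per year for Moore's law) are the slide's; the function names and the idea of reporting their ratio as a "gap" are assumptions made for this example.

```python
# Compound-growth sketch; function names are hypothetical.
def growth_factor(rate_per_year: float, years: float) -> float:
    """Total growth after `years` at a constant yearly factor."""
    return rate_per_year ** years


def gap(years: float, software_rate: float = 2.5, moore_rate: float = 1.4) -> float:
    """How much faster embedded software grows than Moore's law over `years`."""
    return growth_factor(software_rate, years) / growth_factor(moore_rate, years)


for n in (1, 2, 3):
    print(n, round(gap(n), 2))
```

Under these assumed constant rates the gap itself grows exponentially, which is the point the figure makes graphically.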
de facto Duality of RAM-based platforms
We now have 2 types of programmable platforms
RAM-based platform | „running“ on it | machine paradigm
CPU | software | von Neumann etc.: instruction-stream-based (traditional)
morphware (FPGA, rDPA ..) | configware | anti machine: data-stream-based (2nd paradigm)
other labels in the figure: synthesis, flowware
CW has become mainstream ...
morphware: fastest growing sector of the IC market
[Gordon Bell] ... going into every type of application
The HPC scene believed to be smart, when smiling about us CW guys
Others experienced that the brain hurts when trying the paradigm shift
[Gordon Bell] .... the brain hurts
Nick Tredennick's Paradigm Shifts explain the differences
Software Engineering (CPU): resources: fixed; algorithm: variable (software); 1 programming source needed
Configware Engineering: resources: variable (configware); algorithm: variable (flowware); 2 programming sources needed
Compilation: Software vs. Configware
Software Engineering: source program -> software compiler -> software code
Configware Engineering: source „program“ -> configware compiler (mapper: placement & routing; data scheduler) -> configware code + flowware code
Compilation: Software vs. Flowware
Software Engineering: source program -> software compiler -> software code
Flowware Engineering (for hardwired anti machine): source „program“ -> flowware compiler (data scheduler) -> flowware code
[Figure: DPA with input data streams and output data streams; axes: time, port #]
Flowware defines: ... which data item at which time at which port
Flowware programs data streams
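One way to picture "which data item at which time at which port" is a list of (time, port, item) triples. The sketch below is an illustration only: the function name and the fixed per-port skew (as in a systolic-style schedule) are assumptions, not the MoPL language or any tool mentioned in this talk.

```python
# Hypothetical flowware-style schedule: (time step, port number, data item).
from typing import Dict, List, Tuple


def flowware_schedule(streams: Dict[int, List[int]], skew: int = 1) -> List[Tuple[int, int, int]]:
    """Item k of the stream at port p is scheduled at time k + p * skew."""
    schedule = []
    for port, items in streams.items():
        for k, item in enumerate(items):
            schedule.append((k + port * skew, port, item))
    return sorted(schedule)


example = flowware_schedule({0: [10, 11], 1: [20, 21]})
for time, port, item in example:
    print(f"t={time} port={port} item={item}")
```

The schedule is fixed before run time, so no instruction stream is needed to decide where each item goes.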
data streams*: not new
1980: data streams (Kung, Leiserson: systolic arrays)
1989: data-stream-based Xputer architecture
1990: rDPU (Rabaey)
1994: Flowware Language MoPL (Becker et al.)
1995: super systolic array (rDPA) + DPSS tool (Kress)
1996+: Streams-C language, SCCC (Los Alamos), SCORE, ASPRC, Bee (UC Berkeley), DSP-C, Brook, ...
1996: configware / software partitioning compiler (Becker)
*) please, don't confuse with „data flow“
>> Dual Machine Paradigms <<
•HPC
•Embedded Computing
•The wrong Roadmap
•Configware Engineering
•Dual Machine Paradigms
•Speed-up Examples
•Final Remarks
http://www.uni-kl.de
von Neumann vs. anti machine
instruction stream machine (von Neumann etc.): CPU (program counter, DPU), RAM memory, von Neumann bottleneck
data stream machine (anti machine): (r)DPA without sequencer (no CPU !), surrounded by asM units (data counter, memory bank)
asM: auto-sequencing Memory
asMA: auto-sequencing Memory Array
Behavior of the Counter
[Figure: CPU with program counter and DPU, programmed by software; asM units (data counter, memory bank) exchanging data streams with the (r)DPA, programmed by flowware]
Counters: the same micro architecture ?
instruction stream machine (von Neumann etc.): CPU with program counter and DPU
data stream machine (anti machine): asM with data counter and memory bank
yes, is possible, but for data counters ...
... a much better AGU methodology is available*
AGU: address generator unit
*) for history of AGUs see Herz et al.: Proc. ICECS 2002, Dubrovnik, Croatia
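What a data counter does can be sketched in a few lines: it produces an address sequence from a handful of parameters instead of from an instruction stream. The 2-D block scan below is an assumed, generic example; the function name and parameters are hypothetical and do not reproduce any specific AGU from the cited literature.

```python
# Hypothetical address generator: a 2-D block scan driven only by parameters.
from typing import Iterator


def agu_2d_scan(base: int, rows: int, cols: int, row_stride: int, col_stride: int = 1) -> Iterator[int]:
    """Yield the memory addresses of a rows x cols block, row by row."""
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c * col_stride


addresses = list(agu_2d_scan(base=100, rows=2, cols=3, row_stride=16))
print(addresses)
```

In hardware the two nested loops become counters and adders, so address computation costs no memory cycles.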
Software / Configware Co-Compilation
Juergen Becker's CoDe-X, 1996
High level PL source -> Analyzer / Profiler -> Partitioner (Resource Parameters, supporting different platforms)
Partitioner -> SW compiler -> SW code („vN" machine paradigm)
Partitioner -> CW compiler -> CW Code + FW Code (anti machine paradigm)
Better solutions by Configware instead of software
Memory cycles minimized, e.g.: no instruction fetch at run time & other effects
Memory access for data: caches do not help anyhow
Loop xforms: no intra-stream data memory cycles
Complex address computation: no memory cycles
No cache misses!
methodologies not new: high level synthesis (1980+), loop transformations (1970+), many other areas
Why the speed-up ...
... although the FPGA clock is slower by x3 or even more
(most know-how from „high level synthesis“ discipline)
moving operator to the data stream (before run time)
support operations: no clock nor memory cycle
decisions without memory cycles nor clock cycles
most „data fetch“ without memory cycle
HPC experts coming ...
example: N-body problem going configware; paper already at FPL 1999 (http://fpl.org)
Simulation of Star Clusters: x10 speed-up by supercomputer-to-morphware migration (also molecular biology et al.)
Astrophysics by Rainer Spurzem, University of Heidelberg
Configware by Reinhard Maenner, University of Mannheim, HPC pioneer since 1976 (Physics Dept Heidelberg)
ARI, Astronomisches Rechen-Institut, founded 1700 in Berlin (Gottfried Kirch), moved 1945 to Heidelberg by August Kopff
August Kopff
18th Director, Astronomisches Rechen-Institut (ARI) 1924 - 1954
discovered the Kopff comet, Koenigstuhl Observatory, Heidelberg, Germany, 1906
discovered the asteroid 631 Philippina, 21 March 1907,
which became the first asteroid ever visited by a spacecraft - on the Galileo mission to Jupiter
The Galileo spacecraft's 14-year odyssey came to an end on Sunday, Sept. 21, 2003
Copyright © 1996 by Masayuki Suzuki
Conclusions
RC has become mainstream in all kinds of applications
CS education deficits: a curricular revision is overdue
... by a merger with the embedded systems mind set
We need an academic grass roots movement, for ....
... free material & tools for undergraduate lab courses to program and emulate small SW/CW/HW examples
all know-how needed readily available: get involved !
State-of-the-art FPGA [courtesy Xilinx Corp.]
500 MHz Flexible Soft Logic Architecture: 200K Logic Cells
500 MHz Programmable DSP Execution Units
0.6-11.1 Gbps Serial Transceivers
500 MHz PowerPC™ Processors (680 DMIPS) with Auxiliary Processor Unit
1 Gbps Differential I/O
500 MHz multi-port Distributed 10 Mb SRAM
500 MHz DCM Digital Clock Management
The Platform FPGA [courtesy Xilinx Corp.]
MGTs, I/Os, Memory, PowerPC, Soft Logic
[Figure: µP, DSP, DSP Accelerator, Custom Logic, Internal Memory, External Memory Port, Communication Port]
Virtex-4 – First Embodiment of ASMBL
One Family – Multiple Platforms
Domain A, Domain B, ... Domain x -> Platform A, Platform B, ... Platform x
Logic Domain: Highest logic density -> Virtex-4 LX Logic Platform
DSP Domain: Highest DSP performance -> Virtex-4 SX Signal Processing Platform
Embedded Processing Domain: Embedded Processors, High-speed Serial I/O -> Virtex-4 FX Full Featured Platform
Legend: Logic, DSP, Memory, CPUs, Gbps I/O
Comparison of Hardware Solutions
General Purpose Processors
• CISC (Complex Instruction Set Computing)
• RISC (Reduced Instruction Set Computing)
Special Purpose Processors
• Microcontroller
• DSPs (Digital Signal Processors)
• Application-Specific Instruction Set Processors (ASIPs)
Programmable Hardware
• FPGA (Field-Programmable Gate Arrays)
• FPFA (Field-Programmable Function Arrays)
Application-Specific Integrated Circuits (ASICs)
Performance, Flexibility, Power Consumption
Reconfigurable Hardware (FPL)
•PALs, PLAs: -> 10 - 100 Gate Equivalents
•Fine-grain: FPGAs, FPLDs
– Altera MAX Family -> FPLD (EEPROM)
– Actel Programmable Gate-Array -> FPGA (Anti-Fuse)
– Xilinx Logical Cell Array -> FPGA (RAM-based)
•Several Thousand to multiple Millions Gate Equivalents -> e. g. Xilinx Virtex XCV2000 -> 2 Millions Gate Equivalents
•Coarse-grain: Field Programmable Functional Arrays = FPFAs
– Company PACT XPP Technologies AG (Munich, Germany): Xtreme Processing Platform: XPP Architecture
– Universitaet Karlsruhe (TH): New multi-grain adaptive Architectures
[Figure, shown twice: density (system gates) vs. year (1997 to 2004) for XC4085XL, XC40250XV and Virtex; scale 180k, 250k, 500k, 1M, 2M, 8M, 10M; markers "10M Gates in 2002" and "10M Gates in 2004"]
Terminology
Configurable: Type of flexible Computations, where only one or a few Instructions per Processing Element are loaded and the Execution is performed in the Dimensions Time and Space (-> Area) concurrently
Programmable: Type of flexible Computations, where a Sequence of Instructions is loaded and executed in the Dimension Time by using one or several Processing Elements
Reconfigurable: General Term, which expresses the Feature of a Hardware Architecture to be configured more than once (-> Technology dependent)
Dynamically Reconfigurable: Type of Reconfiguration, which realizes Modifications of Configurations during Run-time of the System. This is also called run-time reconfiguration, on-the-fly reconfiguration or in-circuit reconfiguration
Xilinx Virtex FPGA: Logic Realization
Collection of programmable “Gates” embedded in a flexible Interconnect Network … a “user programmable” Alternative to Gate Arrays
Programmable Gate ??
Solution: 2-LUT, a memory (Mem) with inputs In1, In2 and output Out
In | Out
00 | 0
01 | 1
10 | 1
11 | 0
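The 2-LUT idea, a tiny memory addressed by the two inputs, can be sketched as follows. The table contents [0, 1, 1, 0] are the slide's (they happen to implement XOR); the helper name `make_lut2` is hypothetical and this is a software model, not a vendor API.

```python
# Software model of a 2-input look-up table (2-LUT); names are hypothetical.
from typing import Callable, Sequence


def make_lut2(table_bits: Sequence[int]) -> Callable[[int, int], int]:
    """Return a 2-input gate whose behaviour is stored in a 4-entry memory."""
    if len(table_bits) != 4:
        raise ValueError("a 2-LUT needs exactly 4 table entries")

    def lut(in1: int, in2: int) -> int:
        # The two inputs form the memory address: In1 is the high bit.
        return table_bits[(in1 << 1) | in2]

    return lut


xor_gate = make_lut2([0, 1, 1, 0])  # contents from the slide
and_gate = make_lut2([0, 0, 0, 1])  # same hardware, different contents
print([xor_gate(a, b) for a in (0, 1) for b in (0, 1)])
```

Changing the four stored bits changes the gate, which is what "programmable" means at this level.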
Xilinx Virtex-II Pro FPGA: Architecture
•It contains PowerPC® 405 RISC CPU (PPC405) Cores
•FPGA Fabric based on Virtex-II Architecture
Integration Process
[Figure: substrate with layers Poly and Metal 1 to Metal 9; hard IP-block (PowerPC)]
[Figure: PowerPC Core, On Chip Memory Controller, BRAM blocks, Embedded RAM]
Source: Xilinx Virtex-II-Pro Documentation
PACT XPP will boost RISC cores
•The PACT XPP Technology will allow RISC IP manufacturers to conquer new markets
– RISC manufacturers will be able to extend their road map
– RISC manufacturers will regain significance in silicon occupancy
[Figure: performance from control flow to data flow: MIPS, ARM, DSP, FPGA, ASIC, DP; PACT XPP with RISC core]
FPGA Performance vs. KressArray
[Figure: transistors/chip (1 to 100 000 000, log scale) vs. year (1980 to 2010): memory, KressArray (physical, logical), FPGA physical, FPGA routed, FPGA logical; annotations "> 10 000" and "< 10"]
e. g. KressArray: 18 bit PU cell, 8 NNports, 4 buses, 0.18 µm CMOS: 0.06 mm² area
Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld
Computer Performance Trend History
[Figure: performance (0.1 to 1000, log scale) vs. year (1965 to 2005): Supercomputers, Mainframes, Minicomputers, Microprocessors] [Hennessy, Jouppi, 1991]
Software to Configware Migration
this talk will illustrate the performance benefit which may be obtained from Reconfigurable Computing
stressing the coarse grain Reconfigurable Computing (RC) point of view, this talk hardly mentions FPGAs
(But coarse grain may always be mapped onto FPGAs)
Software to Configware Migration is the most important source of speed-up
Hardware is just frozen Configware
Reconfigurable Computing (1)
•Goal: software flexibility with hardware performance
– Microprocessors for any software, but slow
– ASICs fully specialized, but very fast
– Reconfigurable computing to bridge this gap
• Reconfiguration to execute a wide variety of applications on the same hardware platform
• Hardware yields higher performance than software
Microprocessors: Highest Flexibility, Performance?
ASICs: Highest Performance, Lowest Flexibility
Reconfigurable Computing: High Flexibility, High Performance
Reconfigurable Computing (2)
• Configuration
– “Programming” the hardware
– Implements desired functionality in hardware
•For example, if addition necessary, adder made in hardware
• Reconfiguration
– Changing the configuration of the device
– Typical overhead: ~ 3 to 100 ms
– Two types: static and dynamic
[Figure: Reconfigurable Hardware with functionality (a design) -> Reconfiguration -> Reconfigurable Hardware with new functionality (a new design)]
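The quoted overhead of roughly 3 to 100 ms matters only relative to how long the configured hardware then runs. A minimal sketch of that trade-off follows; the formula (software time divided by reconfiguration time plus accelerated time) and all names are assumptions made for illustration, and the example workload numbers are invented, not measurements.

```python
# Hypothetical break-even model for reconfiguration overhead.
def effective_speedup(software_time_ms: float, raw_speedup: float, reconfig_ms: float) -> float:
    """Speed-up seen by the application once reconfiguration time is included."""
    accelerated_ms = software_time_ms / raw_speedup
    return software_time_ms / (reconfig_ms + accelerated_ms)


# Invented example: a 1 s software task, x100 raw speed-up, 100 ms reconfiguration.
print(round(effective_speedup(1000.0, 100.0, 100.0), 2))
print(effective_speedup(1000.0, 100.0, 0.0))
```

Under this simple model a long-running task amortizes the overhead, while a short task can lose most of the raw speed-up to it.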
Types of Reconfiguration
• Static reconfiguration
– Configure the device once for the application
– Do not reconfigure until the application is finished or do not reconfigure at all
• Hardware still flexible in the design phase
– Application: high performance computing that needs hardware performance without the high cost of designing an ASIC
• Dynamic/run-time reconfiguration
– Hardware is reconfigured while the application is executing.
– Partial reconfiguration
• Part of the reconfigurable hardware is reconfigured while the rest stays the same and continues to execute
– Duration between reconfigurations varies
• Long duration: e.g. router updating tables and/or protocols, cell phone switching protocols
• Short duration: e.g. regular expression matching, DSP application switching between stages
FPGA Basics
• FPGA consists of
– Matrix of programmable logic cells
•Implement any logic function
– AND, OR, NOT, etc
•Can also implement storage
– Registers or small SRAMs
•Groups of cells make up higher level structures
– Adders, multipliers, etc.
– Programmable interconnects
•Connect the logic cells to one another
– Embedded features
•ASICs within the FPGA fabric for specific functions
– Hardware multipliers, dedicated memory, microprocessors
• FPGAs are SRAM-based
– Configure device by writing to configuration memory
[Figure labels: Logic Cells, Interconnects]
History of Machine Models
(coordinates by Makimoto's wave: 1957, 1967, 1977, 1987, 1997, 2007)
mainframe age: mainframe (compile); procedural mind set: instruction-stream-based
computer age (PC age): µProc. (compile) + accel. (design, by hardware guys); structural mind set: data-stream-based
e. g. GRAPE, RIKEN institute
Flowware: not new
(Makimoto's wave: 1957, 1967, 1977, 1987, 1997, 2007)
mainframe age: mainframe (compile)
computer age (PC age): µProc. (compile) + accel. (design)
morphware age: µProc. + rDPA
data stream* ... Flowware: around 1980
*) no confusion, please: no „dataflow machine“ !!!
3rd machine model became mainstream
(Makimoto's wave: 1957, 1967, 1977, 1987, 1997, 2007)
mainframe age: mainframe (compile)
computer age (PC age): µProc. (compile) + accel. (design); instruction-stream-based; most CS curricula & HPC are still here
morphware age: µProc. + rDPA (programmable)
symbiosis of machine models
(Makimoto's wave: 1957, 1967, 1977, 1987, 1997, 2007)
mainframe age: mainframe (compile)
computer age (PC age): µProc. (compile) + accel. (design)
morphware age: µProc. + rDPA (co-compiler): symbiosis
replace PC by PS
From Software to Configware Industry
(timeline: 1957, 1967, 1977, 1987, 1997, 2007)
1) computer age (PC age): µProc. (compile): Software Industry
Software Industry's Secret of Success: procedural personalization via RAM-based Machine Paradigm
2) morphware age: rDPA: Growing Configware Industry
structural personalization: RAM-based anti machine
Repeat Success Story by a 2nd Machine Paradigm !
Earth Simulator
5120 Processors, 5000 pins each
ES: 20 TFLOPS
Crossbar weight: 220 t, 3000 km of cable
moving data around inside the Earth Simulator
Hardware / Configware / Software Partitioning skills urgently needed
Algorithm partitioning: HW, CW, SW
to cope with each of it: SW, CW, HW
or: to cope with any combination of co-design: SW/HW, SW/CW, CW/HW, SW/CW/HW
Software to Configware Migration is the most important source of speed-up
Hardware is just frozen Configware
typical CS graduates: the „havenots“
Today, „typical“ CS graduates are unqualified for this labor market
… cannot cope with Hardware / Configware / Software partitioning issues
… cannot implement Configware
The „havenots“
Configware engineering as a qualification for programming embedded systems*:
The „havenots“ are our typical CS graduates
Configware methodology to move data around more efficiently:
„havenots“ are found in the HPC community
*) also programming for HPC !