Is it time for Von Neumann and Harvard to Retire?
Presented by: Allan Cantle, CEO
www.nallatech.com
Commercial In Confidence. Copyright ©2005, Nallatech.
Agenda
» History & Commercial Realities of FPGA Computing
» Thoughts on the future possibilities for FPGA Computing
  » FPGA Coprocessor vs FPGA Main Processor
  » Optimizing Spatial & Temporal Demands of Computing Problems
  » Homogeneous vs Heterogeneous vs Polymorphic Computing
  » Coarse Grain vs Fine Grain Architectures
  » Distributed Parallel Processing vs Clustered Parallel Processing
» Should Von Neumann and Harvard Architectures be retired?
» Summary
Introduction
» FPGAs: a 20-year history!
  » From “glue logic” beginnings to complete “Systems on a Chip” today
  » Mathematical functions in FPGAs since 1993
  » FPGAs pervasive in virtually all electronic equipment
» Many different perspectives on FPGA capability
  » This leads to confusion in the marketplace and mixed messages
FPGA Perceptions
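[Figure: FPGA perceptions chart. One axis runs from "Easy to Program / Flexible" to "Difficult to Program / Fixed Function", the other from "Low Performance" to "High Performance"; the Microprocessor, DSP, ASIC and FPGA are placed on it.]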
When an FPGA is viewed as an ASIC, the observer historically saw:
» Low performance
» Wasted transistors
» Higher power consumption
» Good for prototyping
» Never use in production
» High cost
These were the original views of FPGAs, and they have STUCK in many people's minds.
When an FPGA is viewed as a DSP, today the observer sees:
» Co-processor
» High performance
» I/O interface
» Niche
When an FPGA is viewed as a processor, Nallatech sees:
» Main processor
» High performance
» Floating point
» Lower power consumption
» Increased performance density
» Immature tools
Have It Your Way!
[Figure: the same perceptions chart (Easy to Program / Flexible vs Difficult to Program / Fixed Function; Low vs High Performance), showing the ASIC, the Microprocessor and the FPGA.]
HPEC vs HPC
» Earlier FPGA adoption within the HPEC community
» FPGA-based HPEC is a volume-based commercial reality today
» High Performance Embedded Computing (HPEC)
  » Users have an appreciation of the underlying hardware technology
  » Low level of programming abstraction
  » Applications are often severely SWAP (size, weight and power) restricted
» High Performance Computing (HPC)
  » Users are far more software-centric
  » Programming is achieved using high-level software languages
  » Exclusive use of floating-point arithmetic
  » Focus on ease of implementation of complex algorithms
1993 – Simple Maths Functions within FPGA
» Simple mathematical functions
  » 2-bit arithmetic function per logic slice
  » Effective for 1D pipelined data paths
  » Very basic functions: highly repetitive and data intensive
  » Schematic-based hardware design
» XC4006 Series FPGAs
  » 256 logic slices
  » 128 max user I/O
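To make "2-bit arithmetic function per logic slice" concrete, here is a minimal Python model of a wider adder built by chaining such slices. It assumes each slice adds two bits of each operand plus a carry-in and passes its carry-out to the next slice (a ripple-carry chain); the function names are illustrative and not taken from any Xilinx tool.

```python
def slice_add2(a2: int, b2: int, carry_in: int) -> tuple[int, int]:
    """Model one logic slice: add two 2-bit values plus a carry-in.

    Returns (2-bit sum, carry-out)."""
    total = a2 + b2 + carry_in          # at most 3 + 3 + 1 = 7
    return total & 0b11, total >> 2


def ripple_add(a: int, b: int, width: int) -> tuple[int, int]:
    """Add two unsigned `width`-bit integers using a chain of 2-bit slices.

    `width` must be even; one slice handles each 2-bit digit.
    Returns (sum modulo 2**width, final carry-out)."""
    assert width % 2 == 0, "width must be a multiple of 2"
    result, carry = 0, 0
    for k in range(width // 2):          # one slice per 2-bit digit
        a2 = (a >> (2 * k)) & 0b11
        b2 = (b >> (2 * k)) & 0b11
        s2, carry = slice_add2(a2, b2, carry)
        result |= s2 << (2 * k)
    return result, carry


if __name__ == "__main__":
    s, c = ripple_add(200, 100, width=8)  # an 8-bit adder uses 4 slices
    print(s, c)                           # 44 1  (300 = 256 + 44)
```

In hardware the carry chain between slices sets the critical path; in the model it is simply the loop-carried `carry` variable.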
Nallatech’s Adoption of FPGAs for HPEC
» 1990, pre-Nallatech: microprocessor only
» 1993, professional consulting services: microprocessor based, FPGA coprocessor
Nallatech’s Early FPGA HPEC Computing Experience - 1993
» Real-time, ultra-low-latency imaging simulator
» Embedded distributed processing system
» Floating point, matrix transformations, convolution, sensor interfacing, etc.
[Figure: simulator block diagram. Processing stages: Target Image Generation, Background Image Processing, Image Composition, Convolution, Gain + Offset Control and Sensor Interface, implemented on a mix of T9, i860, C80 and FPGA nodes (multiplicities from x1 to x6).]
1998 – Virtex FPGA Family
» Revolutionary Xilinx Virtex FPGA family introduced
  » 32 x 4 Kbit block RAMs + other mathematical features introduced
  » Allowed 2D mathematical algorithms to be implemented
    » Excellent for image processing algorithms
  » Significant DSP capability
» Virtex
  » 12,000 logic slices
  » 804 max user I/O
  » 32 4 Kbit block RAMs
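Block RAMs made 2D image algorithms practical largely because they can hold the previous image rows as line buffers, so a raster-scanned pixel stream completes a new 2D window on every step. The sketch below models that idea in Python. It assumes a 3x3 kernel, row-major input and valid-only output (no border handling); the function name is illustrative and it is not Nallatech's or Xilinx's implementation.

```python
from collections import deque


def stream_conv3x3(pixels, width, kernel):
    """Apply a 3x3 kernel to a raster-scanned image, one pixel per step.

    Two line buffers (the role block RAMs play in hardware) hold the previous
    two rows, so every incoming pixel completes a new 3x3 window. Yields one
    result per fully valid window (no border padding), in raster order. The
    kernel is applied without flipping, i.e. as a correlation."""
    line1 = deque(maxlen=width)           # pixels from row y-1
    line2 = deque(maxlen=width)           # pixels from row y-2
    window = [[0] * 3 for _ in range(3)]  # 3x3 register window
    for idx, p in enumerate(pixels):
        x, y = idx % width, idx // width
        # Same column, one and two rows up (0 until the buffers have filled).
        up1 = line1[0] if len(line1) == width else 0
        up2 = line2[0] if len(line2) == width else 0
        line1.append(p)                   # oldest row y-1 entry drops out...
        line2.append(up1)                 # ...and moves into the row y-2 buffer
        for r in range(3):                # shift the window one column left
            window[r][0], window[r][1] = window[r][1], window[r][2]
        window[0][2], window[1][2], window[2][2] = up2, up1, p
        if y >= 2 and x >= 2:             # window lies fully inside the image
            yield sum(kernel[r][c] * window[r][c]
                      for r in range(3) for c in range(3))


image = range(16)                         # 4x4 image holding 0..15
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(list(stream_conv3x3(image, 4, box)))   # [45, 54, 81, 90]
```

Only two rows plus nine window registers are stored, regardless of image height, which is why a handful of on-chip RAMs is enough.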
Nallatech’s Adoption of FPGAs for HPEC
» 1990, pre-Nallatech: microprocessor only
» 1993, professional consulting services: microprocessor based, FPGA coprocessor
» 1998, DIME Product Family, FPGA-centric computing architecture: FPGA based, microprocessor coprocessor
2001 – Virtex-II FPGA Family
» Virtex-II FPGA introduced, followed by Virtex-II Pro in 2003
  » 444 18x18 multipliers & 18 Kbit block RAMs introduced
  » Gbit serial I/O communications & PowerPC processors introduced
  » Complex floating-point algorithm implementation now possible
» Virtex-II / Pro
  » 44,000 logic slices
  » 444 18 Kbit BRAMs
  » 444 18x18 multipliers
  » 2 PowerPC processors
  » 20 Gbit I/O
  » 1164 max user I/O
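Floating-point multiplication needs significand products wider than 18 bits (24 bits for single precision, counting the hidden bit). One common way to build them, shown here as a sketch rather than Xilinx's own implementation, is to tile the product out of smaller multipliers: an 18x18 signed block can multiply 17-bit unsigned pieces, and the shifted partial products are summed. The names `LIMB_BITS`, `split_limbs` and `tiled_multiply` are illustrative.

```python
LIMB_BITS = 17  # an 18x18 signed multiplier handles 17x17-bit unsigned operands


def split_limbs(x: int, n_limbs: int) -> list[int]:
    """Split unsigned x into n_limbs LIMB_BITS-wide pieces, least significant first."""
    mask = (1 << LIMB_BITS) - 1
    return [(x >> (LIMB_BITS * i)) & mask for i in range(n_limbs)]


def tiled_multiply(a: int, b: int, width: int) -> tuple[int, int]:
    """Multiply two unsigned `width`-bit integers from 17x17-bit partial products.

    Returns (product, number of multiplier blocks this naive tiling uses)."""
    n = -(-width // LIMB_BITS)           # ceil(width / LIMB_BITS) pieces per operand
    a_limbs, b_limbs = split_limbs(a, n), split_limbs(b, n)
    product, blocks = 0, 0
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            # Each partial product maps onto one hardware multiplier; the
            # shifts and the final sum would be done in adders in the fabric.
            product += (ai * bj) << (LIMB_BITS * (i + j))
            blocks += 1
    return product, blocks


mantissa = (1 << 24) - 1                 # largest 24-bit single-precision significand
p, blocks = tiled_multiply(mantissa, mantissa, 24)
print(p == mantissa * mantissa, blocks)  # True 4
```

The same naive tiling of a 53-bit double-precision significand uses 16 blocks instead of 4, one reason double precision is much more expensive in multiplier resources.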
What This Means for HPC
                                  Microprocessor                            FPGA
                                  Itanium 2                                 Virtex-II Pro 2VP100
Technology                        0.13 micron                               0.13 micron
Clock speed                       1.6 GHz                                   180 MHz
Internal memory bandwidth         102 GBytes/s                              7.5 TBytes/s
Processing units                  5 FPU (2 MACs + 1 FPU) + 6 MMU            212 FPU or 300+ integer units or …
                                  + 6 integer units
Power consumption                 130 W                                     15 W
Peak performance                  8 GFLOPS                                  38 GFLOPS
Sustained performance             ~2 GFLOPS                                 ~19 GFLOPS
I/O / external memory bandwidth   6.4 GBytes/s                              67 GBytes/s
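The peak figures in the Itanium 2 / Virtex-II Pro comparison are consistent with multiplying the number of floating-point units by the clock rate (5 × 1.6 GHz = 8 GFLOPS; 212 × 180 MHz ≈ 38 GFLOPS). The short Python check below reproduces that arithmetic and the sustained performance per watt. It assumes one floating-point result per unit per clock and treats the table's approximate "~" values as exact.

```python
# Values copied from the comparison table (2005 figures).
CHIPS = {
    # name: (processing units, clock in Hz, sustained GFLOPS, power in W)
    "Itanium 2": (5, 1.6e9, 2.0, 130.0),
    "Virtex-II Pro 2VP100": (212, 180e6, 19.0, 15.0),
}


def peak_gflops(units: int, clock_hz: float) -> float:
    """Peak rate, assuming each unit completes one floating-point op per clock."""
    return units * clock_hz / 1e9


def gflops_per_watt(gflops: float, watts: float) -> float:
    return gflops / watts


for name, (units, clock_hz, sustained, watts) in CHIPS.items():
    print(f"{name}: peak {peak_gflops(units, clock_hz):.2f} GFLOPS, "
          f"sustained {gflops_per_watt(sustained, watts):.3f} GFLOPS/W")

fpga, cpu = CHIPS["Virtex-II Pro 2VP100"], CHIPS["Itanium 2"]
ratio = gflops_per_watt(fpga[2], fpga[3]) / gflops_per_watt(cpu[2], cpu[3])
print(f"sustained performance per watt advantage: {ratio:.0f}x")  # 82x
```

On these numbers the FPGA's advantage per watt (about 82x) is much larger than its raw sustained advantage (about 9.5x), because it delivers more throughput at roughly a ninth of the power.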
Nallatech’s Adoption of FPGAs for HPEC
» 1990, pre-Nallatech: microprocessor only
» 1993, professional consulting services: microprocessor based, FPGA coprocessor
» 1998, DIME Product Family, FPGA-centric computing architecture: FPGA based, microprocessor coprocessor
» 2001, DIME-II Product Family, FPGA-centric computing architecture: FPGA with embedded microprocessor
FPGA Computing: The Whole Solution
Layers: Hardware Platform, System Software, Systems Communications
» Inter-FPGA communication
» Abstracts hardware architecture
» System management and control
» APIs
» COTS hardware
» Modular
» Multiple-FPGA systems
FPGA Communications and Tool Support
[Figure: a PCI host connected over a physical link to FPGAs, with network nodes (N1 to N6) and R, RE and BB elements spanning them. Supported design entry: VHDL, MATLAB, block flows, C flows and Viva, plus processors, memory and open third-party component support.]
Accelerated Hardware Implementation
[Figure: the same PCI host, FPGA network and design-flow diagram as on the previous slide.]
Commercial Realities for HPC
» Not a panacea, as with all parallelisation
» Translation of legacy code
[Figure: execution time (0% to 100%) of a legacy C program run entirely on the processor, compared with a uP / FPGA partition in which the FPGA-translated part executes on the FPGA and the rest remains uP-executed; bandwidth and latency considerations apply to the partition.]
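As with any parallelisation, the overall gain of a uP / FPGA partition is bounded by the part that stays on the processor (Amdahl's law). The sketch assumes a fraction `f` of the original run time is moved to the FPGA and accelerated `s` times, and adds host-to-FPGA data movement as a simple extra term expressed as a fraction of the original run time; that last term is a simplifying assumption, not a full bandwidth/latency model.

```python
def partition_speedup(f: float, s: float, transfer: float = 0.0) -> float:
    """Overall speedup of a uP / FPGA partition (Amdahl's law).

    f: fraction of the original run time moved to the FPGA (0..1)
    s: speedup of that fraction on the FPGA
    transfer: extra host<->FPGA data-movement time, as a fraction of the
              original run time (a simplified bandwidth/latency cost)
    """
    if not 0.0 <= f <= 1.0 or s <= 0.0 or transfer < 0.0:
        raise ValueError("need 0 <= f <= 1, s > 0, transfer >= 0")
    new_time = (1.0 - f) + f / s + transfer   # uP time + FPGA time + I/O time
    return 1.0 / new_time


for f in (0.5, 0.9, 0.99):
    print(f, round(partition_speedup(f, s=100.0), 1))
```

Even with the ported code running 100 times faster, moving 90% of the run time yields only about a 9x overall speedup; however fast the FPGA, the bound is 1 / (1 - f).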
Commercial Realities for HPC
» Maturity of high-level languages
  » Good progress has been made in FPGA compilers
  » Often a trade-off between performance and ease of use
» Parallelising of code
  » Fine-grain parallelism is critical
  » Still not automatic
  » Taking implied parallelism from serial code is NOT good enough
  » HPC software engineers are well qualified to deal with this
» Development & debug time
  » Comparable to programming in assembler vs C
  » Biggest hurdle is the synthesis times: hours to days!
  » This is where tools can make a significant impact, if they have no bugs!
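A small example of why implied parallelism from serial code is not enough: a loop such as `acc += x[i]` chains every addition onto the previous one, so its critical path grows with the data length, while an adder tree over the same values has only about log2(n) levels of dependent additions. The sketch counts both. With a pipelined floating-point adder of, say, L cycles latency (a hypothetical figure), the critical path is roughly depth × L cycles. Reordering floating-point additions can change rounding, which is one reason C compilers generally will not make this transformation unless told reassociation is acceptable.

```python
def chain_sum(xs):
    """Serial `acc += x` loop: each addition waits for the previous one.

    Returns (sum, dependent additions on the critical path)."""
    acc = xs[0]
    for x in xs[1:]:
        acc = acc + x
    return acc, len(xs) - 1


def tree_sum(xs):
    """Pairwise (adder-tree) reduction of the same values.

    All additions within one level are independent, so hardware can perform
    them in parallel. Returns (sum, number of levels on the critical path)."""
    level, depth = list(xs), 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:               # an odd element passes straight through
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth


values = list(range(1, 1025))
print(chain_sum(values), tree_sum(values))   # (524800, 1023) (524800, 10)
```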
Commercial Realities for HPC
» No real industry standardisation
  » Requires expertise to “brew your own solutions”
  » Difficulty for beginners
» Bang for buck, for floating-point implementations
  » NRE today WILL be more expensive: ~5-10 times
  » Can approach 100 times the performance for 1/2th the cost
  » Significantly reduced SWAP, >200 times
  » Results in significant cost of ownership savings
Nallatech’s Adoption of FPGAs for HPEC
» 1990, pre-Nallatech: microprocessor only
» 1993, professional consulting services: microprocessor based, FPGA coprocessor
» 1998, DIME Product Family, FPGA-centric computing architecture: FPGA based, microprocessor coprocessor
» 2001, DIME-II Product Family, FPGA-centric computing architecture: FPGA with embedded microprocessor
» 2003, FPGA-based HPC solutions: FPGA
Algorithm Acceleration
Seismic Processing
» Kirchhoff algorithm
» Single-precision floating point
» 64 times faster than a 2 GHz Pentium 4
» 200 times less power consumption
Smith-Waterman
» Dynamic programming algorithm used in biological sequencing
» 155 times faster than a Sun Fire 280R processing unit
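Smith-Waterman suits fine-grained FPGA parallelism because of its dependency pattern: each score cell needs only its left, upper and upper-left neighbours, so all cells on one anti-diagonal are independent and a systolic array can compute a whole diagonal per step. The Python sketch fills the matrix in that wavefront order. It assumes a linear gap penalty and illustrative scoring values (production tools typically use substitution matrices and affine gaps) and returns only the best local score, without traceback.

```python
def smith_waterman_wavefront(a: str, b: str, match: int = 2,
                             mismatch: int = -1, gap: int = -1) -> int:
    """Best local-alignment score (linear gap penalty), filled by anti-diagonals.

    Cells with the same i + j depend only on the two previous anti-diagonals,
    so every cell on a diagonal could be computed in the same clock by a
    systolic array of processing elements."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for d in range(2, n + m + 1):                         # anti-diagonal i + j
        for i in range(max(1, d - m), min(n, d - 1) + 1):  # independent cells
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,    # (mis)match
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            best = max(best, H[i][j])
    return best


print(smith_waterman_wavefront("TGTTACGG", "GGTTGACTA",
                               match=3, mismatch=-3, gap=-2))   # 13
```

The inner loop is the part a systolic design unrolls into hardware: one processing element per column, all working on the same diagonal at once.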
Algorithm Acceleration
Real Time Video Processing
» Single-precision floating-point calculations
» 36 GFLOPS + 40 GOPS sustained performance on a single PCI card
» >200 times power reduction over Xeon
Gravity Simulation
» N-body computation
» Single-precision floating point
» 20 GFLOPS sustained performance
» 100 times faster than a 2.4 GHz Pentium 4 CPU
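The N-body kernel behind the gravity simulation is dominated by an O(N²) loop of independent pairwise interactions, which is what lets an FPGA pipeline evaluate many of them per clock. The following is a direct-summation sketch in Python, not Nallatech's implementation; the units (G = 1) and the softening length `eps`, which avoids the singularity when two bodies coincide, are assumptions.

```python
import math


def accelerations(pos, mass, G=1.0, eps=1e-3):
    """Direct O(N^2) gravitational accelerations with Plummer softening.

    pos: list of (x, y, z) positions; mass: list of masses.
    Every (i, j) interaction is independent of every other one."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = pos[i]
        for j in range(n):
            if i == j:
                continue
            dx, dy, dz = pos[j][0] - xi, pos[j][1] - yi, pos[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps * eps   # softened distance^2
            f = G * mass[j] / (r2 * math.sqrt(r2))        # G m_j / r^3
            acc[i][0] += f * dx
            acc[i][1] += f * dy
            acc[i][2] += f * dz
    return acc


a = accelerations([(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0], eps=0.0)
print(a[0][0], a[1][0])   # 0.25 -0.25
```

A time-stepping integrator (leapfrog, for example) would call this once per step; that outer loop is omitted here.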
A Young Solar System
Colliding Galaxies
FPGA Coprocessor vs FPGA Main Processor
FPGAs as Coprocessors
» Accelerator / offload engine for microprocessor-based solutions
» Advantages
  » Easy to conceptualise
  » Pragmatic approach
  » Possible large performance improvement for least effort
  » Port small functions of compute-intensive legacy code
  » Rest of code remains on existing host
  » Benefit from existing host interfaces
» Disadvantages
  » Only applicable to certain functions
  » Need to consider bandwidth / latency requirements
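The bandwidth / latency disadvantage can be made concrete with a break-even estimate: offloading pays only once the per-item time saved outweighs the fixed call latency, and never pays if moving the data alone takes longer than the host needs to compute it. The sketch assumes transfer and compute do not overlap, and every figure in the example is hypothetical.

```python
import math


def host_time(n: int, host_s_per_item: float) -> float:
    return n * host_s_per_item


def offload_time(n: int, latency_s: float, bytes_per_item: float,
                 bandwidth_bytes_per_s: float, fpga_s_per_item: float) -> float:
    """Fixed call latency + data transfer + FPGA compute (no overlap assumed)."""
    return (latency_s + n * bytes_per_item / bandwidth_bytes_per_s
            + n * fpga_s_per_item)


def break_even_items(host_s_per_item, latency_s, bytes_per_item,
                     bandwidth_bytes_per_s, fpga_s_per_item):
    """Smallest batch for which offloading is no slower than the host,
    or None if transfer + FPGA time per item already exceeds host time."""
    gain_per_item = host_s_per_item - (bytes_per_item / bandwidth_bytes_per_s
                                       + fpga_s_per_item)
    if gain_per_item <= 0:
        return None                      # bandwidth-bound: offload never pays
    return math.ceil(latency_s / gain_per_item)


# Hypothetical: host 10 ns/item, FPGA 1 ns/item, 8 bytes/item over 1 GB/s,
# 10 us fixed latency per offload call.
print(break_even_items(10e-9, 10e-6, 8, 1e9, 1e-9))
```

With these figures each item saves about 1 ns, so the batch must reach roughly 10,000 items before the 10 µs latency is recovered; at 16 bytes per item the link alone costs more than the host compute and the function returns `None`.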
FPGAs as Main Processor
» The FPGA takes on the complete compute function
» Advantages
  » Build the computing architecture around the algorithmic problem
  » Can provide another order of magnitude increase in performance
  » Can go back to first principles
  » Don't have to port optimised processor code to the FPGA
» Disadvantages
  » Rarely start with a clean sheet of paper
  » Tool maturity
  » Only practical for relatively straightforward algorithms
Main vs Co-processor: Recommendations
» A co-processor approach is most applicable today:
  » For HPC
  » Whenever a man-machine interface is required
  » Whenever low-performance industry-standard interfaces are required
  » If you need to work with legacy code
  » For quick wins
» A main processor approach is recommended today:
  » For standalone embedded applications (HPEC)
  » When starting with a clean sheet of paper
  » If ultimate performance is a prerequisite
  » When the best power/performance ratio is required from your system
  » When development times are relaxed
Optimizing Spatial & Temporal Demands of Computing Problems
Spatial & Temporal Definitions
» There are several perspectives on the meaning of spatial and temporal:
  1. Cluster of microprocessors
     » Temporal = function runs within one processor
     » Spatial = function spread across many microprocessor nodes
  2. Traditional embedded computing hardware
     » Temporal = using a microprocessor
     » Spatial = implementing dedicated ASIC accelerators
  3. FPGAs
     » Temporal = using the same logic resources for multiple functions
     » Spatial = parallelising and pipelining a function across the FPGA fabric
» The ultimate aim is to ensure that no processor goes idle
  » and that you utilise all the available resources
Complexity and Speed
» Any application will be constructed from several functions
» Each function will have varying degrees of complexity
» Each function will also have varying demands on its execution time
[Figure: functions placed by complexity (simple to complex) against execution-time demand (can execute slowly to must execute quickly), with regions labelled low compute, medium compute and compute intensive.]
» Fully spatial implementation: traditionally an ASIC; in an FPGA, all parallel / pipelined
» Balanced spatial and temporal implementation: traditionally a vector processor; in an FPGA, partially parallel with partial reuse
» Fully temporal implementation: traditionally a microprocessor; in an FPGA, a soft microprocessor
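One way to read the chart is as a sizing rule: the work a function must do and the time it is allowed determine how many parallel operators it needs, and that number places it on the spatial / temporal spectrum. The sketch assumes each operator completes one operation per cycle and ignores pipeline fill latency; the function names and example figures are hypothetical.

```python
def units_needed(ops_per_call: int, deadline_cycles: int) -> int:
    """Fewest parallel operators that finish `ops_per_call` operations within
    `deadline_cycles`, assuming each operator completes one operation per cycle
    and ignoring pipeline fill latency."""
    if ops_per_call < 1 or deadline_cycles < 1:
        raise ValueError("ops_per_call and deadline_cycles must be >= 1")
    return -(-ops_per_call // deadline_cycles)   # ceiling division


def classify(ops_per_call: int, deadline_cycles: int) -> str:
    p = units_needed(ops_per_call, deadline_cycles)
    if p == 1:
        return "fully temporal: one operator reused for every operation"
    if p >= ops_per_call:
        return "fully spatial: one operator per operation, all parallel / pipelined"
    reuse = -(-ops_per_call // p)
    return f"balanced: {p} operators, each reused up to {reuse} times"


# Hypothetical functions: (operations per call, cycles available per call)
for ops, deadline in [(1000, 5000), (1000, 50), (1000, 1)]:
    print(ops, deadline, "->", classify(ops, deadline))
```

Tightening the deadline moves the same function from a single reused operator (the soft-microprocessor end) towards a fully parallel pipeline (the ASIC-like end).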
Homogeneous Computing vs Heterogeneous Computing vs Polymorphic Computing
Direction of Computing
» Cell Processor (IBM, Sony, Toshiba): heterogeneous computing on silicon, with a PPC core and eight cells
» SGI: heterogeneous architecture, with uP, GP, FPGA and ASSP elements on global shared memory
» Intel: 16 processors per die, homogeneous on silicon
» DARPA: polymorphic computing
Polymorphic Computing
[Figure: SWEPT efficiency (Size, Weight, Energy, Performance, Time) against application data type (symbolic, vector/streaming, bit level) for a processor, an FPGA and a polymorphic architecture.]
Polymorphic Computing
[Figure: the same SWEPT efficiency chart for the processor and FPGA, annotated with the Cell Processor, RP (e.g. Clearspeed, Elixent, Picochip), "Support for FPU?" and "Support for DSP".]
What is a Polymorphic Processor?
A processing element that can morph into a:
» Microprocessor
» DSP unit
» Floating point operator
» Integer operator
» Logical operator
A Polymorphic FPGA
» The FPGA is the closest concept to polymorphic computing
  » It can morph into the different operators
  » However, it cannot perform them all with equal efficiency
» Is a polymorphic coarse-grained FPGA possible?
[Figure: a proposed FPGA combining polymorphic processing elements with traditional FPGA fabric.]
A polymorphic processing element can morph into:
  • Microprocessor
  • Integer operator
  • DP/SP floating point operator
  • Logical operator
  • Text operator
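As a software analogy only, a polymorphic processing element can be pictured as one element whose configuration, rather than its wiring, selects the operator it implements. The mode names and the class below are hypothetical, and the sketch says nothing about the efficiency trade-off noted above; single precision is emulated by rounding through IEEE 754 binary32 with Python's `struct` module.

```python
import struct
from typing import Callable


def _to_single(x: float) -> float:
    """Round a Python (double-precision) float to IEEE single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Hypothetical operator configurations; names are illustrative only.
MODES: dict[str, Callable] = {
    "integer_add": lambda a, b: (int(a) + int(b)) & 0xFFFFFFFF,  # 32-bit wrap
    "logical_and": lambda a, b: int(a) & int(b),
    # The product of two 24-bit significands is exact in double precision,
    # so rounding it once to single gives a correctly rounded SP multiply.
    "sp_float_mul": lambda a, b: _to_single(_to_single(a) * _to_single(b)),
    "dp_float_mul": lambda a, b: float(a) * float(b),
}


class PolymorphicElement:
    """A processing element whose configuration selects its operator."""

    def __init__(self, mode: str):
        self.configure(mode)

    def configure(self, mode: str) -> None:
        if mode not in MODES:
            raise ValueError(f"unsupported mode {mode!r}")
        self.mode = mode
        self._op = MODES[mode]

    def __call__(self, a, b):
        return self._op(a, b)


pe = PolymorphicElement("integer_add")
print(pe(0xFFFFFFFF, 1))     # 0 (wraps at 32 bits)
pe.configure("logical_and")
print(pe(0b1100, 0b1010))    # 8
```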
Coarse Grain Architectures vs Fine Grain Architectures
Coarse vs Fine Grain
» Cluster = ultimate in coarse-grain parallelism
» ASIC = ultimate in fine-grain parallelism
» FPGA = programmable fine-grain parallelism
» The finer the grain, the more exactly you can make the architecture fit the problem
» However, fine-grain programmable FPGAs are a sub-optimal solution, as they suffer from:
  » Inefficient transistor utilisation on coarser-grain operations
  » A slower clock frequency, which could be improved with coarser granularity
Distributed Parallel Processing (DPP) vs Cluster Parallel Processing (CPP)
DPP & CPP Definitions
Distributed Parallel Processing
» Processing power is distributed to where it is needed
» Direct communications built as needed
» Computing architecture designed to fit the application
Cluster Parallel Processing
» Regular processor architecture
» Regular communications infrastructure
» Application must be designed to fit the computer architecture
[Figure: cluster of server nodes joined by a communications infrastructure.]
Application Implementation
» Example application consisting of 8 algorithms
» Need to map onto hardware for real-time implementation
» Algorithms each have different characteristics
[Figure: the application's eight algorithms, Algorithm A to Algorithm H.]
Fitting Application to Cluster
[Figure: Algorithms A to H of the application mapped onto the nodes of a cluster joined by a communications infrastructure.]
Fitting Application to Distributed FPGA Computer
» VME blade form factor
» Five high-density platform FPGAs
» High-speed external analog interfaces
» High-speed synchronous SRAM memory
» Gigabit Ethernet interface
Same Application Implemented on FPGAs
[Figure: five FPGAs on a VME board with a Gbit Ethernet interface, with the algorithms distributed across them.]
» C (PicoBlaze uP): Algorithm B
» VHDL: Algorithm A, Algorithm D
» C (MicroBlaze uP): Algorithm C, Algorithm F
» Verilog: Algorithm G
» MATLAB or Simulink: Algorithm H, Algorithm E
Communications network to connect algorithms
[Figure: the same five FPGAs, VME interface and Gbit Ethernet, joined by a network of N, B, R and E elements that connects the algorithms.]
So, should Von Neumann and Harvard Architectures be retired?
Should they Retire?
» Von Neumann and Harvard provide highly efficient use of silicon real estate whilst still being capable of executing any computational function.
» Therefore, perhaps they should still live on
  » However, this will be less and less as a hard chip implementation
  » The intelligent compiler will instantiate a Von Neumann or Harvard-like architecture when it is the most efficient way to execute an algorithm
Von Neumann and Harvard will live on as part of the intelligence within tomorrow's compilers.
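The argument rests on one property: a small datapath, time-shared by a stored program, can execute any function. The toy accumulator machine below illustrates that in Python. Its instruction set is hypothetical and it models no real processor; instructions and data are held in two separate memories, as in a Harvard architecture, whereas a Von Neumann machine would keep both in one.

```python
def run_harvard(program, data, max_steps=10_000):
    """Execute `program` on a toy accumulator machine.

    Instructions live in `program` and operands in `data`: two separate
    memories, as in a Harvard architecture. Returns the final data memory."""
    data = list(data)            # copy so the caller's list is untouched
    acc, pc = 0, 0
    for _ in range(max_steps):
        op, arg = program[pc]
        pc += 1
        if op == "LOAD":
            acc = data[arg]
        elif op == "STORE":
            data[arg] = acc
        elif op == "ADD":
            acc += data[arg]
        elif op == "SUB":
            acc -= data[arg]
        elif op == "JNZ":        # jump to instruction `arg` if acc != 0
            if acc != 0:
                pc = arg
        elif op == "HALT":
            return data
        else:
            raise ValueError(f"unknown opcode {op!r}")
    raise RuntimeError("step limit reached")


# Sum 1..n (n >= 1). data[0] = counter n, data[1] = total, data[2] = constant 1.
SUM_PROGRAM = [
    ("LOAD", 1), ("ADD", 0), ("STORE", 1),   # total += counter
    ("LOAD", 0), ("SUB", 2), ("STORE", 0),   # counter -= 1
    ("JNZ", 0),                              # loop while counter != 0
    ("HALT", None),
]

print(run_harvard(SUM_PROGRAM, [10, 0, 1])[1])   # 55
```

The same single adder performs every addition in turn, which is exactly the temporal, silicon-efficient end of the spectrum discussed earlier.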
Summary
» FPGA computing is not new
  » 12 years accelerating maths functions
» Floating point & tools make FPGAs viable for the HPC community
» No coherent industry standardisation
» Code development WILL take longer
» Significant potential savings
  » Price/performance
  » SWAP
  » Cost of ownership
And Finally...
SGI & Nallatech have formally agreed on a Strategic Collaborative Arrangement.
This brings together 12 years of expertise in delivering real FPGA computing solutions from Nallatech with the Global Shared Memory MPP computing from SGI.
Customers now have a path to scale from a commodity cluster with FPGAs all the way up to a massive HPC system with thousands of processors and thousands of FPGAs.
Thank you for your attention
www.nallatech.com
Copyright © 2005 Nallatech Limited. All rights reserved. Nallatech, the Nallatech logo, the triangles device and “The High Performance FPGA Solutions Company" are trademarks of Nallatech Limited. All other trademarks acknowledged.