ISICT 2005: Supercomputing going Reconfigurable. Reiner Hartenstein, TU Kaiserslautern. Jan. 4-6, 2005, Cape Town, South Africa.

ISICT 2005

Supercomputing going

Reconfigurable

Reiner Hartenstein

TU Kaiserslautern

Jan. 4-6, 2005, Capetown, South Africa

© 2004, [email protected] http://hartenstein.de

My relations to South Africa

October 1981: reporting about the KARL language and CHDL 1981 at Kaiserslautern http://hartenstein.de/Star.jpg

Un. Stellenbosch: an early KARL licensee

http://hartenstein.de/KARL-users.html


>> The methodology gap <<

http://www.uni-kl.de

•The methodology gap
•The wrong Roadmap
•HPC goes reconfigurable
•Coarse grain is the way to go
•Curricula are obsolete
•Final remarks


TU Kaiserslautern

FPGA with island architecture

rLB: reconfigurable logic box

reconfigurable interconnect fabrics

logic design issue: far from computing mind set

Reiner Hartenstein


software to configware migration: speed-up examples

application example | method | speed-up factor | platform
grid-based DRC** (1-metal 1-poly nMOS***, 256 reference patterns) | multiple aspects | > x1000 (computation time) | MoM anti machine with DPLA* [1983]
migrate several simple application examples | hi level synthesis | x7 – x46 (compute time) | CPU 2 FPGA [FPGA 2004]
16 tap FIR filter | straight forward | x16 MOPS/mW | PACT Xtreme 4-by-4 array [2003]
from fastest DSP: 10 gMACs to 1 teraMAC | not spec. | x100 (compute time) | DSP 2 FPGA [Xilinx 2004, 2)]

key issue: algorithmic cleverness

*) DPLA: MPC fabr. via E.I.S. multi univ. project
**) Design Rule Check
***) for 10-metal 3-poly cMOS expected: >> x10,000
2) Wim Roelandts


FPGAs are mainstream

FPGA market is worth $3 billion

Predicted to grow to $5 billion by 2007

fastest growing segment of the IC market

soft hardware: morphware [DARPA]

my talk is not about FPGAs (reconfigurable logic)

it's about coarse grain: Reconfigurable Computing


>> The wrong roadmap <<





TU Kaiserslautern

data are moved around by software

(slower than CPU clock by 2 orders of magnitude)

i.e. by memory-cycle-hungry instruction streams which fully hit the memory wall

extremely unbalanced

stolen from Bob Colwell

CPU


The wrong roadmap

But HPC and supercomputing have stubbornly avoided configware use for the past 20 years

Educational deficits are one reason

An exception: Splash - it has been discarded

Adopting a new mindset: the brain hurts

Configware methodology moves data around more efficiently


It's not Software, it's Configware

The programming sources for morphware and Reconfigurable Computing are fundamentally different: Software to Configware migration includes a time to space conversion

structural programming instead of procedural: data-stream-based instead of instruction-stream-based
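
The time to space conversion can be illustrated with a small FIR filter. The following is only a software sketch under stated assumptions, not the configware flow described in this talk: `fir_procedural` and `fir_spatial` are hypothetical names, and the "spatial" version merely simulates a transposed-form pipeline in which every tap owns a multiplier and a register, so that one loop iteration stands for one clock tick.

```python
def fir_procedural(h, xs):
    """Instruction-stream style: taps are visited one after another (time domain)."""
    ys = []
    for n in range(len(xs)):
        acc = 0
        for k in range(len(h)):          # sequential multiply-accumulate
            if n - k >= 0:
                acc += h[k] * xs[n - k]
        ys.append(acc)
    return ys


def fir_spatial(h, xs):
    """Data-stream style: transposed-form FIR, one multiplier and register per tap."""
    regs = [0] * (len(h) - 1)            # pipeline registers between the taps
    ys = []
    for x in xs:                         # one iteration models one clock tick
        products = [hk * x for hk in h]  # all multipliers work in the same tick
        ys.append(products[0] + (regs[0] if regs else 0))
        # each register takes its own tap product plus the old value of its neighbour
        regs = [products[k + 1] + (regs[k + 1] if k + 1 < len(regs) else 0)
                for k in range(len(regs))]
    return ys


if __name__ == "__main__":
    taps = [1, 2, 3]
    samples = [1, 0, 0, 0, 5, -1]
    print(fir_procedural(taps, samples))
    print(fir_spatial(taps, samples))
```

Both functions compute the same output stream; the difference is that the second one spreads the work over space (one operator per tap) instead of over time.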


no instruction fetch

Configware execution does not need instruction fetch, which saves memory cycles, because a (super) instruction fetch happens before run time: (re)configuration.

More performance benefit comes from other acceleration mechanisms, which stem from time to space migration by Reconfigurable Computing.

You don't believe?


path of least resistance*: avoiding a paradigm shift

Many researchers seem never to stop working on sophisticated solutions for marginal improvements, continuously ignoring methodologies promising speed-ups by orders of magnitude.

Instead, they continue to bang their heads against the memory wall.

blinders to ignore the impact of morphware

*) [Michel Dubois]


… understand only this parallelism solution: the instruction-stream-based approach (von Neumann bottlenecks)

... cannot cope with this one: the data-stream-based approach (has no von Neumann bottleneck)


TU Kaiserslautern

Living dangerously …

me talking to HPC people: „the wrong roadmap for the past 20 years“


TU Kaiserslautern

the Hardware / Software Chasm:

typical programmers don't understand function evaluation without machine mechanisms (counters, state registers)

It's the gap between procedural (instruction-stream-based) and structural (data-stream-based) mind set

[Figure: µprocessor and accelerators]


>> HPC goes Reconfigurable <<





By the way ...

... the oldest and largest conference in the field:

15th International Conference on Field-Programmable Logic and Applications (FPL)

Aug. 24 – 26, 2005, Tampere, Finland

http://fpl.org

in 2004: 288 submissions !

... going into every type of application

they all work on high performance


another conference ...

http://hartenstein.de/raw05.html

April 4 – 5, 2005, Denver, Colorado, in conjunction with IPDPS 2005 !

99 submissions !

... going into every type of application

they all work on high performance


more conferences ...

FPL series: 15th at Tampere, Finland
RAW series: 12th at Denver, Colorado
DRS W. on Dynamically Reconf. Systems, Innsbruck, Austria 2005
W. Archit. Res. using FPGA platforms in conj. w. HPCA-11, SFO, Febr 2005
PACT'98 w. RC, Paris 1998
MAPLD (mil appl. ...) USA only - 8th
FPGA, Monterey, California
FCCM, Napa, California, April 2005
FPT 2004 Int'l Conf.
ERSA, Las Vegas, Nevada, USA
RHPC - in conj. w. HPC Asia
ARC 2005 Int'l w. Applied RC, Algarve, Portugal, February 22, 2005
ReConFig04 Univ. of Colima, Mexico
DFG w. 04 at DaimlerChrysler, Germany
ICES'05 Int'l C. on Evolvable Systems, From Biology To Hardware
special tracks in other conferences: ICECS, DATE, DAC, others
Research programs: DARPA, EU, DFG ...

EuroGP - European Conference on Genetic Programming
GECCO - Genetic Evolutionary Computation Conference
CEC - Congress on Evolutionary Computation
SEAL - Asia-Pacific Conf. on Simulated Evolution And Learning
EA - International Conference on Artificial Evolution, 7th, October 26-28 2005, Lille, France
ECML - European Conference on Machine Learning
IEEE Conference on Evolutionary Computation
International Conference on Evolutionary Programming (EP)
European Conference on Artificial Evolution (AE)
EUNITE - European Symp. on Intelligent Technologies, Hybrid Systems and their Implementation in Smart Adaptive Systems
EUROGEN - Evolutionary Methods for Design, Optimisation and Control with Applications to Industrial Problems
ACDM - International Conference on Adaptive Computing in Design and Manufacture
EvoRobot - European Workshop on Evolutionary Robotics
.......


TU Kaiserslautern

First Indications of Change

10th RAW at IPDPS, Nice, France, April 2003: after a decade of non-overlap: first IPDPS people coming

HPC Asia 2004 - 7th Int‘l Conference on High Performance Computing, July 20-22, 2004 Omiya Sonic City, Tokyo Area, Japan: Workshop on Reconfigurable Systems f. HPC (RHPC) + keynote address*

HPCA-11, 11th International Symposium on High-Performance Computer Architecture, San Francisco, Febr. 12-16, 2005: topic area explicitly: Embedded and reconfigurable architectures

SBAC-PAD 2004 - 16th Symposium on Computer Architecture and High Performance Computing, Foz do Iguacu, PR, Brazil, October 27-29, 2004: topic area explicitly: Reconfigurable Systems

*) keynote speaker:

PARS & Speed-up, Basel, Switzerland, March 2003: keynote address*

IPDPS, Santa Fe, NM, USA, April 2004: keynote address*

PDP’04, La Coruna, Spain, Febr. 2004: keynote address*


Cray XD1

•FPGAs as programmable co-processors for parallelization

•6 optional DSPs as accelerators

•Xilinx Virtex-II Pro FPGAs cooperate with AMD Opteron

•FPGAs dynamically programmable by Configware.

•Cray provides a configware library with special algorithms for Search and Sort, DSP, and Encryption.

•Obtains a speed-up of >100 in genome sequencing.


>> Coarse grain is the way to go <<

•The methodology gap
•The wrong Roadmap
•HPC goes reconfigurable
•Coarse grain is the way to go
•Curricula are obsolete
•Final remarks


rDPA (coarse grain) vs. FPGA (fine grain)

roughly: area efficiency (transistors/chip, orders of magnitude)
hardwired: 4
(coarse grain) rDPA: 4
FPGA: 2
µProc: 0

roughly: performance (MOPS/mW, orders of magnitude)
hardwired: 3
(coarse grain) rDPA: 3
FPGA: 2
DSP: 1
µProc: 0


Coarse grain is about computing, not logic

Example: mapping onto rDPA by DPSS: based on simulated annealing

SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]

array size: 10 x 16 = 160 rDPUs

rDPU: reconfigurable function block, e. g. 32 bits wide

Legend: rDPU not used | used for routing only (rout thru only) | operator and routing | port location marker | backbus connect

no CPU
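
To give an impression of what "mapping by simulated annealing" means, here is a minimal, generic placement sketch. It is not the DPSS tool: the cost function (total Manhattan wire length), the linear cooling schedule and all names (`wire_length`, `anneal_placement`) are assumptions chosen for illustration only.

```python
import math
import random


def wire_length(positions, nets):
    """Sum of Manhattan distances between connected operators."""
    return sum(abs(positions[a][0] - positions[b][0]) +
               abs(positions[a][1] - positions[b][1]) for a, b in nets)


def anneal_placement(n_ops, nets, rows, cols, steps=5000, t_start=5.0, seed=0):
    """Place n_ops operators on a rows x cols array; returns (best, best_cost, initial_cost)."""
    rng = random.Random(seed)
    slots = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(slots)                    # slots[:n_ops] hold the operators, the rest is unused
    cost = wire_length(slots, nets)
    initial_cost = cost
    best, best_cost = slots[:n_ops], cost
    for step in range(steps):
        temp = t_start * (1.0 - step / steps) + 1e-9   # linear cooling schedule
        i = rng.randrange(n_ops)          # an operator ...
        j = rng.randrange(len(slots))     # ... swaps with another operator or an unused cell
        if i == j:
            continue
        slots[i], slots[j] = slots[j], slots[i]
        delta = wire_length(slots, nets) - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost += delta                 # accept the move
            if cost < best_cost:
                best, best_cost = slots[:n_ops], cost
        else:
            slots[i], slots[j] = slots[j], slots[i]    # reject: undo the swap
    return best, best_cost, initial_cost


if __name__ == "__main__":
    chain = [(i, i + 1) for i in range(7)]             # an 8-operator pipe network
    placement, cost, start = anneal_placement(8, chain, rows=4, cols=4)
    print(start, "->", cost, placement)
```

The example maps an 8-stage pipe network onto a 4 x 4 array; a real mapper would also have to account for routing resources, ports and backbus connections.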


branching example: time-to-space migration

S = R + (if C then A else B endif);

on a very simple CPU (C = 1):

step | memory cycles | nanoseconds
if C then read A | |
  read instruction | 1 | 100
  instruction decoding | |
  read operand* | 1 | 100
  operate & register transfers | |
if not C then read B | |
add & store | |
  operate & register transfers | |
  store result | 1 | 100
total | 5 | 500

*) if no intermediate storage in register file

section of a very large pipe network: inputs A, B, R, C; a selector (=1) controlled by C feeds the adder (+), which delivers S

Clock: 200 MHz (5 nanosec)

no memory cycles: Speed-up factor = 100
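
The arithmetic behind the speed-up factor can be written down as a short sketch. The constants are taken from the slide (5 memory cycles of 100 ns each on the CPU, a 200 MHz clock for the pipe network); the function names and the assumption that the pipe network delivers one result per clock in steady state (pipeline fill time ignored) are mine.

```python
MEMORY_CYCLE_NS = 100    # one memory cycle on the very simple CPU
CPU_MEMORY_CYCLES = 5    # total memory cycles per evaluation (from the slide)
PIPE_CLOCK_NS = 5        # 200 MHz clock of the pipe network


def select_add(r, a, b, c):
    """S = R + (if C then A else B endif)."""
    return r + (a if c else b)


def cpu_time_ns(n_items):
    """Instruction-stream version: every evaluation pays all memory cycles."""
    return n_items * CPU_MEMORY_CYCLES * MEMORY_CYCLE_NS


def pipe_time_ns(n_items):
    """Data-stream version: assumed one result per clock, no memory cycles."""
    return n_items * PIPE_CLOCK_NS


if __name__ == "__main__":
    n = 1000
    print(select_add(10, 1, 2, True))
    print(cpu_time_ns(n) / pipe_time_ns(n))
```

With these numbers the ratio is 500 ns / 5 ns per result, i.e. the factor of 100 quoted above.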


commercial rDPA example: PACT XPP - XPU128

XPP128 rDPA

• Full 32 or 24 Bit Design, working silicon
• 2 Configuration Hierarchies
• Evaluation Board available, and
• XDS Development Tool with Simulator

[Figure: (r)DPA built from rDPUs, each a PAE core with ALU and ALU Ctrl, plus CFG blocks; buses not shown]

© PACT AG, Munich http://pactcorp.com


XPP64A: Coarse-grain Reconfigurable Array

64 ALU-PAEs

16 RAM-PAEs

24-bit architecture

Split (12,12)-bit opcodes: complex addition, complex multiplication, conditional sign-flip

JTAG debug interface

Synthesis, P&R, Layout from ACCENT

0.13 µm silicon from STMicro, Crolles

Available since March 2003


TU Kaiserslautern

XPP64A: Platform Development Board

- SDR Board in Debug Phase -> XPP64A Chips from STMicro Fab

- Assembly & Test / Available since March 2003


SMeXPP: Multimedia Co-Processor Concept

[Figure: SMeXPP co-processor connected to Games, Music, Videos, Camera, Baseband Processor, Radio Interface, Audio Interface, SD/MMC Cards and LCD Display]

• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise Reduction and Artifact Removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable Displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness Enhancement
• Shadow Enhancement
• Differentiation
• Programmable deinterlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes


>> Curricula are obsolete <<

•The methodology gap

•The wrong Roadmap

•HPC goes reconfigurable

•Coarse grain is the way to go

•Curricula are obsolete

•Final remarks

http://www.uni-kl.de


TU Kaiserslautern

Completely wrong mind set

The key problem, the memory wall, cannot be solved by new CPU technology

We need a 2nd machine paradigm (a 2nd mind set ...)

The vN paradigm is not a communication paradigm

Its monopoly creates a completely wrong mind set

We need an architectural communication paradigm

But we need both paradigms: a dichotomy

beef up old architecture principles by new technology?


von Neumann is not the common model

[Figure: von Neumann instruction-stream-based machine: CPU (DPU with program counter) and RAM memory, connected through the von Neumann bottleneck]

mainframe age: CPU (instruction-stream-based), software

microprocessor age: CPU (instruction-stream-based, software) plus accelerator co-processors (data-stream-based, hardware)

configware age: CPU plus reconfigurable accelerator (morphware); a software/configware co-compiler produces software and configware


Why a new machine paradigm ???

The anti machine as the 2nd machine paradigm is the key to curricular innovation

... a Trojan horse to introduce the structural domain to the procedural-only mind set of programmers

RAM-based platform needed for:
• flexibility, programmability
• avoiding the need of specific silicon (mask cost: > 2 mio $ - growing)

2nd machine paradigm needed as a common model:
• to avoid the need of circuit expertise
• needed to educate zillions of programmers

[Figure: CPU (DPU with program counter) next to rDPA with asM (memory bank with data counter), programmed by flowware]


>> Final Remarks <<

•The methodology gap


•HPC goes reconfigurable

•Coarse grain is the way to go

•Curricula are obsolete

•Final remarks

http://www.uni-kl.de


TU Kaiserslautern

http://configware.org/


TU Kaiserslautern

END


CS Education

You cannot teach Hardware to a Programmer* efficiently

But to a Hardware guy or gal you always can teach Programming

*) stems from vN monopoly

procedural: have not

structural: have

hardware guy or gal: natural


Growth Rate of Embedded Software

[Figure: growth factor (1 to 2) over time (0, 10, 12, 18 months): embedded software [DTI*] (~2.5/yr) vs. Moore's law (1.4/year)]

>10 times more programmers will write embedded applications than computer software by 2010

already today, more than 98% of all microprocessors are used within embedded systems

*) Department of Trade and Industry, London
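
How quickly the two growth rates quoted on this slide diverge can be checked with a few lines. This is plain compound growth; the function name `growth_factor` and the five-year horizon are assumptions made for illustration.

```python
def growth_factor(rate_per_year, years):
    """Compound growth: the size multiplies by rate_per_year every year."""
    return rate_per_year ** years


if __name__ == "__main__":
    years = 5
    embedded = growth_factor(2.5, years)   # embedded software, ~2.5/yr
    moore = growth_factor(1.4, years)      # Moore's law as quoted, 1.4/year
    print(round(embedded, 1), round(moore, 1), round(embedded / moore, 1))
```

The gap between the two curves itself grows exponentially, by a factor of 2.5 / 1.4 per year.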


de facto Duality of RAM-based platforms

We now have 2 types of programmable platforms

property | traditional | 2nd paradigm
RAM-based platform | CPU | morphware (FPGA, rDPA ..)
„running“ on it | software | configware
machine paradigm | von Neumann etc.: instruction-stream-based | anti machine: data-stream-based

synthesis

flowware


TU Kaiserslautern

[Gordon Bell]

... going into every type of application

[Gordon Bell]

.... the brain hurts

CW has become mainstream ...

Others experienced, that the brain hurts, when trying the paradigm shift

The HPC scene believed to be smart, when smiling about us CW guys

morphware: fastest growing sector of the IC market


Nick Tredennick's Paradigm Shifts explain the differences

Software Engineering (CPU): resources: fixed; algorithm: variable (software); 1 programming source needed

Configware Engineering: resources: variable (configware); algorithm: variable (flowware); 2 programming sources needed


Compilation: Software vs. Configware

Software Engineering: source program -> software compiler -> software code

Configware Engineering: source „program“ -> configware compiler (mapper: placement & routing; data scheduler) -> configware code + flowware code

Compilation: Software vs. Flowware

Software Engineering: source program -> software compiler -> software code

Flowware Engineering (for hardwired anti machine): source „program“ -> flowware compiler (data scheduler) -> flowware code


[Figure: DPA with input data streams and output data streams („data streams“); axes: port # and time]

Flowware defines: ... which data item at which time at which port

Flowware programs data streams
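
The idea that flowware fixes "which data item at which time at which port" can be sketched as a simple schedule table. The example below is an illustration only: `StreamEvent`, `skewed_schedule` and `items_at` are hypothetical names, and the skew by one time step per port is the classic systolic-array input pattern, not a format defined in this talk.

```python
from collections import namedtuple

# one entry of a flowware-like schedule: data item, time step and port
StreamEvent = namedtuple("StreamEvent", ["time", "port", "item"])


def skewed_schedule(matrix):
    """Feed row i of the matrix into port i, delayed by i time steps."""
    events = []
    for port, row in enumerate(matrix):
        for k, item in enumerate(row):
            events.append(StreamEvent(time=port + k, port=port, item=item))
    return sorted(events)


def items_at(events, time):
    """Which item arrives at which port at the given time step."""
    return {e.port: e.item for e in events if e.time == time}


if __name__ == "__main__":
    schedule = skewed_schedule([[1, 2, 3], [4, 5, 6]])
    for t in range(4):
        print(t, items_at(schedule, t))
```

No instruction stream is involved: the schedule alone determines the order in which data arrive at the array ports.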


data streams*: not new

1980: data streams (Kung, Leiserson: systolic arrays)

1989: data-stream-based Xputer architecture

1990: rDPU (Rabaey)

1994: Flowware Language MoPL (Becker et al.)

1995: super systolic array (rDPA) + DPSS tool (Kress)

1996+: Streams-C language, SCCC (Los Alamos), SCORE, ASPRC, Bee (UC Berkeley), DSP-C, Brook, ...

1996: configware / software partitioning compiler (Becker)

*) please, don't confuse with „data flow“


>> Dual Machine Paradigms <<

•HPC

•Embedded Computing


•Configware Engineering

•Dual Machine Paradigms

•Speed-up Examples

•Final Remarks

http://www.uni-kl.de


von Neumann vs. anti machine

instruction stream machine (von Neumann etc.): CPU (DPU with program counter), connected to RAM memory through the von Neumann bottleneck

data stream machine (anti machine): (r)DPA without sequencer (no CPU !), surrounded by asM blocks

asM: auto-sequencing Memory (memory bank with data counter)

asMA: auto-sequencing Memory Array


Behavior of the Counter

CPU (DPU with program counter): programmed by software

asM (memory bank with data counter): programmed by flowware (data streams)

(r)DPA: programmed by flowware




Counters: the same micro architecture ?

instruction stream machine (von Neumann etc.): CPU (DPU with program counter)

data stream machine (anti machine): asM (memory bank with data counter)

yes, is possible, but for data counters a much better AGU methodology is available*

*) for history of AGUs see Herz et al.: Proc. ICECS 2002, Dubrovnik, Croatia

AGU: address generator unit
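
What a data counter or address generator does can be sketched in software. The generator below is a generic illustration, not one of the AGU designs surveyed in the cited paper; `agu_scan` and its parameters are hypothetical. It produces the address sequence of a two-dimensional scan purely from a few parameters, which in hardware means that address computation costs no instruction fetches and no memory cycles.

```python
def agu_scan(base, rows, cols, row_stride, col_stride=1):
    """Yield the addresses of a rows x cols window, scanned row by row."""
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c * col_stride


if __name__ == "__main__":
    # a 2 x 3 window inside a memory whose rows are 16 words apart
    print(list(agu_scan(base=100, rows=2, cols=3, row_stride=16)))
```

Other scan patterns (column-wise, strided, zigzag) only change the parameters or the nesting, not the data path that consumes the stream.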


Software / Configware Co-Compilation

Juergen Becker's CoDe-X, 1996

High level PL source -> Partitioner (with Analyzer / Profiler and Resource Parameters, supporting different platforms)

Partitioner -> SW compiler -> SW code (“vN" machine paradigm)

Partitioner -> CW compiler -> CW Code + FW Code (anti machine paradigm)


Better solutions by Configware instead of software

Memory cycles minimized, e.g.: no instruction fetch at run time & other effects

Memory access for data: caches do not help anyhow

Loop xforms: no intra-stream data memory cycles

Complex address computation: no memory cycles

No cache misses!

methodologies not new: high level synthesis (1980+), loop transformations (1970+), many other areas


Why the speed-up ...

... although FPGA is clock slower by x 3 or even more

(most know-how from „high level synthesis“ discipline)

moving operator to the data stream (before run time)

support operations: no clock nor memory cycle

decisions without memory cycles nor clock cycles

most „data fetch“ without memory cycle


HPC experts coming ...

Simulation of Star Clusters: x10 speed-up by supercomputer-to-morphware migration (also molecular biology et al.)

example: N-body problem going configware; paper already at FPL 1999 http://fpl.org

Astrophysics by Rainer Spurzem, University of Heidelberg

Configware by Reinhard Maenner, University of Mannheim, HPC pioneer since 1976 (Physics Dept Heidelberg)

ARI, Astronomisches Rechen-Institut, founded 1700 in Berlin (Gottfried Kirch), moved 1945 to Heidelberg by August Kopff


TU Kaiserslautern

August Kopff

18th Director, Astronomisches Rechen-Institut (ARI) 1924 - 1954

discovered the Kopff comet, Koenigstuhl Observatory, Heidelberg, Germany, 1906

Copyright © 1996 by Masayuki Suzuki

The Galileo spacecraft's 14-year odyssey came to an end on Sunday, Sept. 21, 2003

discovered the asteroid 631 Philippina, 21 March 1907,

which became the first asteroid ever visited by a spacecraft - on the Galileo mission to Jupiter


Conclusions

RC has become mainstream in all kinds of applications

CS education deficits: a curricular revision is overdue

... by a merger with the embedded systems mind set

We need an academic grass roots movement, for free material & tools for undergraduate lab courses to program and emulate small SW/CW/HW examples

all know-how needed readily available: get involved !


State-of-the-art FPGA [courtesy Xilinx Corp.]

• 200K Logic Cells
• 500MHz Flexible Soft Logic Architecture
• 500MHz Programmable DSP Execution Units
• 500MHz PowerPC™ Processors (680 DMIPS) with Auxiliary Processor Unit
• 500MHz multi-port Distributed 10 Mb SRAM
• 500MHz DCM Digital Clock Management
• 0.6-11.1 Gbps Serial Transceivers
• 1 Gbps Differential I/O


The Platform FPGA [courtesy Xilinx Corp.]

[Figure: platform FPGA combining MGTs, I/Os, Memory, PowerPC, Soft Logic and DSP; blocks: µP, DSP Accelerator, Custom Logic, Internal Memory, External Memory Port, Communication Port]


Virtex-4 – First Embodiment of ASMBL

One Family – Multiple Platforms

Domain A -> Platform A, Domain B -> Platform B, ... Domain x -> Platform x

Logic Domain (highest logic density): Virtex-4 LX, Logic Platform

DSP Domain (highest DSP performance): Virtex-4 SX, Signal Processing Platform

Embedded Processing Domain (embedded processors, high-speed serial I/O): Virtex-4 FX, Full Featured Platform

Legend: Logic, DSP, Memory, CPUs, Gbps I/O


Comparison of Hardware Solutions

General Purpose Processors
• CISC (Complex Instruction Set Computing)
• RISC (Reduced Instruction Set Computing)

Special Purpose Processors
• Microcontroller
• DSPs (Digital Signal Processors)
• Application-Specific Instruction Set Processors (ASIPs)

Programmable Hardware
• FPGA (Field-Programmable Gate Arrays)
• FPFA (Field-Programmable Function Arrays)

Application-Specific Integrated Circuits (ASICs)

[Figure: axes labelled Performance, Flexibility, Power Consumption]


Reconfigurable Hardware (FPL)

•PALs, PLAs: -> 10 - 100 Gate Equivalents

•Fine-grain: FPGAs, FPLDs
– Altera MAX Family -> FPLD (EEPROM)
– Actel Programmable Gate-Array -> FPGA (Anti-Fuse)
– Xilinx Logical Cell Array -> FPGA (RAM-based)

•Several Thousand to multiple Million Gate Equivalents -> e. g. Xilinx Virtex XCV2000 -> 2 Million Gate Equivalents

•Coarse-grain: Field Programmable Functional Arrays = FPFAs
– Company PACT XPP Technologies AG (Munich, Germany): Xtreme Processing Platform: XPP Architecture
– Universitaet Karlsruhe (TH): New multi-grain adaptive Architectures

[Figure: FPGA density (system gates) over the years 1997 to 2004 for XC4085XL, XC40250XV and Virtex; axis ticks 180k, 250k, 500k, 1M, 2M, 8M, 10M; 10M gates marked "in 2002" in one version of the chart and "in 2004" in the other]


Terminology

Configurable: type of flexible computation in which only one or a few instructions per processing element are loaded, and the execution is performed in the dimensions time and space (-> area) concurrently

Programmable: type of flexible computation in which a sequence of instructions is loaded and executed in the dimension time by using one or several processing elements

Reconfigurable: general term which expresses the feature of a hardware architecture to be configured more than once (-> technology dependent)

Dynamically Reconfigurable: type of reconfiguration which realizes modifications of configurations during the run time of the system. This is also called run-time reconfiguration, on-the-fly reconfiguration or in-circuit reconfiguration


Xilinx Virtex FPGA: Logic Realization

Collection of programmable “Gates” embedded in a flexible Interconnect Network … a “user programmable” Alternative to Gate Arrays

Programmable Gate ??

Solution: 2-LUT (Mem with inputs In1, In2 and output Out)

In | Out
00 | 0
01 | 1
10 | 1
11 | 0
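
The look-up table idea is easy to mimic in software. The sketch below is an illustration under my own naming (`make_lut` is hypothetical, and treating In1 as the high address bit is an assumption): the "configuration" is just the four memory bits, and the truth table shown on the slide (00 -> 0, 01 -> 1, 10 -> 1, 11 -> 0) turns the same structure into an XOR gate.

```python
def make_lut(config_bits):
    """Build a 2-input LUT; config_bits[i] is the output for input pattern i."""
    table = tuple(config_bits)           # the LUT memory, written at configuration time

    def lut(in1, in2):
        return table[(in1 << 1) | in2]   # the inputs only address the memory
    return lut


if __name__ == "__main__":
    xor_gate = make_lut([0, 1, 1, 0])    # the truth table from the slide
    and_gate = make_lut([0, 0, 0, 1])    # same structure, different configuration
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_gate(a, b), and_gate(a, b))
```

Changing the function means rewriting four bits, not redesigning a circuit, which is the essence of RAM-based reconfigurability.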


Xilinx Virtex-II Pro FPGA: Architecture

•It contains PowerPC® 405 RISC CPU (PPC405) cores

•FPGA fabric based on Virtex-II architecture

[Figure: integration process: substrate, poly and metal layers 1 to 9, with the hard IP-block (Power PC) embedded in the layer stack; Power PC core with on-chip memory controller, BRAM blocks and embedded RAM]

Source: Xilinx Virtex-II-Pro Documentation


PACT XPP will boost RISC cores

•The PACT XPP Technology will allow RISC IP manufacturers to conquer new markets
– RISC manufacturers will be able to extend their road map
– RISC manufacturers will regain significance in silicon occupancy

[Figure: performance over the range from control flow to data flow; labels: MIPS, ARM, DSP, FPGA, ASIC, DP, PACT XPP with RISC Core]

FPGA Performance vs. KressArray

[Figure: transistors/chip (1 to 100 000 000, logarithmic) over the years 1980 to 2010; curves: memory, KressArray (physical, logical), FPGA physical, FPGA routed, FPGA logical; gaps marked "> 10 000" and "< 10"]

e. g. KressArray: 18 bit PU cell, 8 NN ports, 4 buses, 0.18 µm CMOS: 0.06 mm² area

Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld


Computer Performance Trend History [Hennessy, Jouppi, 1991]

[Figure: performance (0.1 to 1000, logarithmic) over year (1965 to 2005) for Supercomputers, Mainframes, Minicomputers and Microprocessors]


Software to Configware Migration

This talk illustrates the performance benefit which may be obtained from Reconfigurable Computing (RC).

Stressing the coarse grain point of view, this talk hardly mentions FPGAs (but coarse grain may always be mapped onto FPGAs).

Software to Configware Migration is the most important source of speed-up

Hardware is just frozen Configware


Reconfigurable Computing (1)

•Goal: software flexibility with hardware performance
– Microprocessors for any software, but slow
– ASICs fully specialized, but very fast
– Reconfigurable computing to bridge this gap
• Reconfiguration to execute a wide variety of applications on the same hardware platform
• Hardware yields higher performance than software

Microprocessors: Highest Flexibility, Performance?

Reconfigurable Computing: High Flexibility, High Performance

ASICs: Highest Performance, Lowest Flexibility


Reconfigurable Computing (2)

• Configuration
– “Programming” the hardware
– Implements desired functionality in hardware
•For example, if addition is necessary, an adder is made in hardware

• Reconfiguration
– Changing the configuration of the device
– Typical overhead: ~ 3 to 100 ms
– Two types: static and dynamic

[Figure: Reconfigurable Hardware with functionality (a design) -> Reconfiguration -> Reconfigurable Hardware with new functionality (a new design)]


Types of Reconfiguration

• Static reconfiguration
– Configure the device once for the application
– Do not reconfigure until the application is finished or do not reconfigure at all
• Hardware still flexible in the design phase
– Application: high performance computing that needs hardware performance without the high cost of designing an ASIC

• Dynamic/run-time reconfiguration
– Hardware is reconfigured while the application is executing.
– Partial reconfiguration
• Part of the reconfigurable hardware is reconfigured while the rest stays the same and continues to execute
– Duration between reconfigurations varies
• Long duration: e.g. router updating tables and/or protocols, cell phone switching protocols
• Short duration: e.g. regular expression matching, DSP application switching between stages


FPGA Basics

• FPGA consists of
– Matrix of programmable logic cells
•Implement any logic function: AND, OR, NOT, etc.
•Can also implement storage: registers or small SRAMs
•Groups of cells make up higher level structures: adders, multipliers, etc.
– Programmable interconnects
•Connect the logic cells to one another
– Embedded features
•ASICs within the FPGA fabric for specific functions: hardware multipliers, dedicated memory, microprocessors

• FPGAs are SRAM-based
– Configure device by writing to configuration memory

[Figure: Logic Cells and Interconnects]


History of Machine Models

[Figure: time axis 1957, 1967, 1977, 1987, 1997, 2007 (coordinates by Makimoto's wave)]

mainframe age: mainframe (compile); procedural mind set: instruction-stream-based

computer age (PC age): µProc. (compile), procedural mind set: instruction-stream-based; plus accel. (design by hardware guys), structural mind set: data-stream-based

e. g. GRAPE, RIKEN institute


Flowware: not new

[Figure: time axis 1957 to 2007 (Makimoto's wave): mainframe age: mainframe (compile); then µProc. (compile) plus accel. (design); morphware age: µProc. plus rDPA]

data stream* ... Flowware: around 1980

*) no confusion, please: no „dataflow machine“ !!!


3rd machine model became mainstream

[Figure: time axis 1957 to 2007 (Makimoto's wave): mainframe age: mainframe (compile); then µProc. (compile) plus accel. (design), instruction-stream-based; morphware age: µProc. plus rDPA, programmable]

most CS curricula & HPC are still here


symbiosis of machine models

[Figure: time axis 1957 to 2007 (Makimoto's wave): mainframe age: mainframe (compile); then µProc. (compile) plus accel. (design); morphware age: µProc. plus rDPA, co-compiler, symbiosis]

replace PC by PS


From Software to Configware Industry

1) Software Industry's Secret of Success: procedural personalization via RAM-based machine paradigm (µProc., compile)

2) Repeat Success Story by a 2nd Machine Paradigm ! Structural personalization: RAM-based anti machine (rDPA, morphware age)

[Figure: time axis 1957 to 2007: Software Industry, followed by the Growing Configware Industry]


Earth Simulator

5120 Processors, 5000 pins each

ES 20: TFLOPS

Crossbar weight: 220 t, 3000 km of cable, moving data around inside the Earth Simulator


Hardware / Configware / Software Partitioning skills urgently needed

Algorithm partitioning into HW, CW, SW

to cope with each of it: SW, CW, HW

or: to cope with any combination of co-design: SW/HW, SW/CW, CW/HW, SW/CW/HW

Software to Configware Migration is the most important source of speed-up

Hardware is just frozen Configware


typical CS graduates: the „havenots“

Today, „typical“ CS graduates are unqualified for this labor market

… cannot cope with Hardware / Configware / Software partitioning issues

… cannot implement Configware


The „havenots“

Configware methodology to move data around more efficiently: „havenots“ are found in the HPC community

Configware engineering as a qualification for programming embedded systems*: the „havenots“ are our typical CS graduates

*) also programming for HPC !

