(keynote) (from HPC to) New Horizons of Very High Performance Computing (VHPC): Hurdles and Chances...


(keynote)(from HPC to)

New Horizons of Very High Performance Computing

(VHPC): Hurdles and Chances

Reiner Hartenstein

TU Kaiserslautern

Rhodes Island, Greece, April 25-26, 2006

© 2006, reiner@hartenstein.de http://hartenstein.de

Reconfigurable Supercomputing

(VHPC) going commercial

Cray XD1

Silicon Graphics RASC

… and other vendors

… it's a paradigm shift!


The Pervasiveness of RC

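[Figure: number of Google hits for “FPGA and ….” search phrases (search terms not shown), in two groups labelled ECE-savvy scene and Math/SW-savvy scene. First group of hit counts: 162,000; 127,000; 158,000; 113,000; 171,000; 194,000. Second group of hit counts: 1,620,000; 915,000; 398,000; 272,000; 647,000; 1,490,000.]

unqualified for RC ?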


world-wide a mass movement

Methodology ?

reminds me of the mass migration of lemmings

terminology chaos

not really a sense of direction

an urgent need to get organized

>> Outline <<

• Reconfigurable Computing Paradox

• The Supercomputing Paradox

• We are using the wrong model

• Coarse-grained Reconfigurable Devices

• Super Pentium for Desktop Supercomputer

http://www.uni-kl.de

The Reconfigurable Computing Paradox

poor FPGA technology:

very poor effective integration density

„very power-hungry“ [Rick Kornfeld*]

lower clock frequencies, and more expensive.

poor tools:

very poor application development support

Languages and tools unacceptable for software people

most hardware experts (86%**) hate their tools

poor education:

RC education: extremely poor, or none

ignored by CS curricula

… teach like for a 50 year old mainframe …

However, brilliant results everywhere

what paradox ?

*) personal communication **) DeHon ‘98


Computing Curricula 2004 fully ignores Reconfigurable Computing

Joint Task Force for

FPGA & synonyms: 0 hits

not even here

(Google: 10 million hits)

Education ?


Computing Curricula v.2005: no changes other than „… FPGA, etc.“ (not really mentioning that it‘s missing)

Completed ?

Taskforce activity completed ?

Next task force in 2020 or later ?


End of this week: brainstorming session at DARPA:

(urgently needed – overdue! )

Tools ?


fine-grained RC: 1st DeHon‘s Law [1996: Ph. D, MIT]

[Figure: log-scale plot of transistors / microchip (10^0 to 10^9) over the years 1980 to 2010. The Gordon Moore curve (microprocessor) is compared with FPGA physical, FPGA routed, and FPGA logical density. Annotated technology overhead: reconfigurability overhead, wiring overhead, routing congestion; overall overhead >> 10 000.]

immense area inefficiency


published speed-up factors

[Figure: log-scale chart of relative performance (10^0 to 10^9) over the years 1980 to 2010. Curves: Microprocessor (8080 to Pentium 4), Memory, x1.25 / yr (Moore), and FPGA (X 2/yr); growth-rate labels 7%/yr and 50%/yr; an annotation of 10 000; markers for the pre-FPGA era, DPLA, and the MoM Xputer architecture. Application areas: DSP and wireless; Image processing, Pattern matching, Multimedia; Bioinformatics; Astrophysics.]

http://xputers.informatik.uni-kl.de/faq-pages/fqa.html

Speed-up factors shown in the chart:

Reed-Solomon Decoding: 2400
Viterbi Decoding: 400
FFT: 100
MAC: 1000
crypto: 1000
real-time face detection: 6000
video-rate stereo vision: 900
pattern recognition: 730
SPIHT wavelet-based image compression: 457
Smith-Waterman pattern matching: 288
BLAST: 52
protein identification: 40
molecular dynamics simulation: 88
Los Alamos traffic simulation: 47
GRAPE: 20
2-D FIR filter [TU-KL]: 39.4
Lee Routing (by TU-KL): 160
Grid-based DRC (no FPGA: DPLA on MoM by TU-KL): 2000
Grid-based DRC („fair comparison“): 15000


pre FPGA era: Why DPLA* was so good

Close to Moore because of small overhead (wiring, programmability, routing)

Large arrays of canonical boolean expressions

PLA layout ~similar to RAM / ROM layout:

Mid-'80s: first very tiny FPGAs available

*) designed by TU-KL, fabricated by E.I.S. German multi university project

GAG: Generic Address Generator, to avoid address computation overhead

ASM: Auto-Sequencing Memory

ASM means: no instruction streams needed for address computation; a generalization of DMA [M. Herz et al.: ICECS 2003, Dubrovnik]

(anti-von-Neumann machine paradigm)

Data Counter instead of Program Counter

Generalization of the DMA

ASM: Auto-Sequencing Memory (data counter, GAG, RAM)

GAG & enabling technology: published 1989 [by TU-KL]

Survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik]

*) IMEC & TU-KL

**) patented by TI, 1995

Storage Scheme optimization methodology, etc.

ASM means: no instruction streams needed for address computation
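The data-counter idea can be illustrated with a minimal Python sketch. It assumes a hypothetical `generic_address_generator` that scans a rectangular window of a row-major 2-D array; the function names, parameters, and scan pattern are illustrative assumptions and are not taken from the GAG publications. The point is only that the address sequence follows from a few configured parameters, so no instruction stream is spent on address computation.

```python
from typing import Iterator, List


def generic_address_generator(base: int, row_stride: int,
                              rows: int, cols: int) -> Iterator[int]:
    """Yield the addresses of a rows x cols window in row-major memory.

    Hypothetical stand-in for a GAG: configured once with four parameters,
    it then steps a data counter without per-address instructions.
    """
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c


def auto_sequencing_memory(ram: List[int],
                           addresses: Iterator[int]) -> Iterator[int]:
    """Hypothetical ASM: RAM plus address generator, emitting a data stream."""
    for address in addresses:
        yield ram[address]


# A 4 x 4 image stored row by row (values 100..115).
ram = list(range(100, 116))
# Stream out the 2 x 3 window whose top-left element sits at address 5.
stream = list(auto_sequencing_memory(ram, generic_address_generator(5, 4, 2, 3)))
print(stream)
```

Changing the four parameters changes the scan pattern; the consumer of the stream stays the same.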


Thousands or Millions of $ for free

Application migration [from supercomputer] results not only in massive speed-ups.

Electricity bills reduced by an order of magnitude, and even more you may get for free … up to millions of dollars per year

(also a matter of national energy policy)

Google

Amsterdam

NY

Reconfigurable Scientific Computing

How do software types program the FPGAs ?

Hiring a good student from the EE Dept. ?

Because of missing RC education: far away from optimum solutions ?

Much higher speedup achievable ?

1 or 2 more orders of magnitude ? 100,000 ? 1,000,000 ?


By education: better speed-up factors ?

[Figure: the same chart of published speed-up factors as before (same curves, applications, and values; source http://xputers.informatik.uni-kl.de/faq-pages/fqa.html), now annotated with the question: tools & edu available ?]


The Supercomputing Paradox

Growing listed Teraflops

Often limited sustained Teraflops

Almost stalled application implementation progress

Increasing number of processors running in parallel

COTS processor decreasing cost

Very high total cost of the Tera(?)flops

promising technology

poor results

Scientists waiting for affordable compute capacity

The Law of More

programmer productivity shrinking with growing number of processors


Why traditional supercomputing / HPC failed

instruction-stream-based: memory-cycle-hungry

the wrong way in which the data are moved around

because of the wrong multi-core interconnect architecture

extremely unbalanced

stolen from Bob Colwell

CPU


Earth Simulator

5120 Processors, 5000 pins each

ES: 20 TFLOPS

Crossbar weight: 220 t, 3000 km of thick cable

moving data around inside the Earth Simulator


Bringing together data and processor

by Software

Moving data to the processor:

moving the grand piano


coarse-grained RC: Hartenstein‘s Law [1996: ISIS, Austin, TX]

[Figure: log-scale plot of transistors / microchip (10^0 to 10^9) over the years 1980 to 2010. The Gordon Moore curve is compared with rDPA physical and rDPA logical density (e.g. KressArray family); FPGA routed density is shown for contrast, with an overhead of >> 10 000.]

area efficiency very close to Moore‘s law


higher speed-up factors by coarse-grained?

[Figure: the same chart of published speed-up factors as before (same curves, applications, and values; source http://xputers.informatik.uni-kl.de/faq-pages/fqa.html), now annotated with the question: Coarse-grained arrays ?]


Coarse grain is about computing, not logic

SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]

array size: 10 x 16 = 160 rDPUs

rDPU: reconfigurable Data Path Unit, e. g. 32 bits wide

Legend: rDPU not used; used for routing only (route-through only); operator and routing; port location marker; backbus connect

no CPU


SW-to-coarse-grained-CW migration example

[Figure: array of rDPUs; of the mapped example only the output label S and a + operator are legible.]

Compare it to software solution on CPU

S = R + (if C then A else B endif);

[Figure: the rDPU solution (inputs A, B, R, C; decision =1; adder +; output S; Clock 200) next to a still empty table of memory cycles and nanoseconds for a very simple CPU with C = 1. Table rows: if C then read A (read instruction, instruction decoding, read operand*, operate & register transfers); if not C then read B (read instruction, instruction decoding); add & store (read instruction, instruction decoding, operate & register transfers, store result); total.]

hypothetical branching example to illustrate software-to-configware migration

S = R + (if C then A else B endif);

simple conservative CPU example, C = 1:

step                            memory cycles   nanoseconds
if C then read A
  read instruction                    1             100
  instruction decoding
  read operand*                       1             100
  operate & reg. transfers
if not C then read B
  read instruction                    1             100
  instruction decoding
add & store
  read instruction                    1             100
  instruction decoding
  operate & reg. transfers
  store result                        1             100
total                                 5             500

*) if no intermediate storage in register file

configware solution (inputs A, B, R, C; decision =1; adder +; output S): clock 200 MHz (5 nanosec), no memory cycles: speed-up factor = 100
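The arithmetic behind the speed-up factor can be written out as a short Python sketch. The numbers are the ones on the slide; the one added assumption is that the configware data path delivers its result within a single 200 MHz clock period and needs no memory cycles, which is how the slide's figure of 100 appears to be derived.

```python
MEMORY_CYCLE_NS = 100        # from the table: one memory cycle costs 100 ns
CONFIGWARE_CLOCK_MHZ = 200   # from the slide: 200 MHz, i.e. 5 ns per clock

# Memory cycles per step on the simple CPU when C = 1 (B is never read).
cpu_memory_cycles = {
    "read instruction (if C then read A)": 1,
    "read operand A": 1,
    "read instruction (if not C then read B)": 1,
    "read instruction (add & store)": 1,
    "store result": 1,
}

cpu_time_ns = sum(cpu_memory_cycles.values()) * MEMORY_CYCLE_NS
# Assumption: one clock period per result on the configware side.
configware_time_ns = 1000 / CONFIGWARE_CLOCK_MHZ
speed_up = cpu_time_ns / configware_time_ns

print(f"CPU: {cpu_time_ns} ns, configware: {configware_time_ns} ns, "
      f"speed-up: {speed_up:.0f}")
```

Only memory cycles are counted on the CPU side, as in the table; decoding and register transfers are treated as free, which makes the comparison conservative.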


Why the speed-up? What‘s the difference?

moving the locality of operation into the route of the data stream by P&R

instead of moving data by instruction streams


The wrong mind set ....

S = R + (if C then A else B endif);

[Figure: section of a very large pipe network [Ulrich Nageldinger] with inputs A, B, R, C, a decision (=1), and an adder (+). Legend: rDPU not used; used for routing only (route-through only); operator and routing; port location marker; backbus connect.]

not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm

„but you can‘t implement decisions!“

We need Reconfigurable Computing Education
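How a decision can sit inside a pipe network may be easier to see in a small software model. The Python sketch below is only an illustration under assumptions: the stage names `select`, `add`, and `pipe` are hypothetical, and Python generators stand in for the rDPUs and their connections. Each stage is a fixed operator placed in the path of the streams; the decision becomes a multiplexer stage fed by the C stream rather than a branch in an instruction stream.

```python
from typing import Iterable, Iterator


def select(c: Iterable[int], a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    """Multiplexer stage: forwards the A item where C is 1, otherwise the B item."""
    for ci, ai, bi in zip(c, a, b):
        yield ai if ci == 1 else bi


def add(x: Iterable[int], y: Iterable[int]) -> Iterator[int]:
    """Adder stage: element-wise sum of two streams."""
    for xi, yi in zip(x, y):
        yield xi + yi


def pipe(a, b, r, c) -> Iterator[int]:
    """S = R + (if C then A else B endif), wired as two chained stages."""
    return add(r, select(c, a, b))


A = [1, 2, 3, 4]
B = [10, 20, 30, 40]
R = [100, 200, 300, 400]
C = [1, 0, 1, 0]
S = list(pipe(A, B, R, C))
print(S)
```

The wiring in `pipe` is fixed once, like a placement and routing result; afterwards only data items move through it.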


The new paradigm: how the data are traveling

not transport-triggered: old hat

pipeline, or chaining

super systolic array

no, not by instruction execution

DPU DPU DPU

vN Move Processor

instruction-driven

+ instruction-driven

[Jack Lipovski, EUROMICRO, Nice, 1975]

P&R: move locality of operation, not data !


Data streams (pipe network)

H. T. Kung paradigm (systolic array)

[Figure: a DPA fed by input data streams and emitting output data streams; each stream is drawn as a diagram of time versus port #, with data items (x) and empty slots (-).]

„data streams“: define ... which data item at which time at which port

implemented by distributed memory

ASM: Auto-Sequencing Memory (data counter, GAG, RAM); many ASM blocks are placed around the DPA

50 & more on-chip ASM are feasible
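The definition "which data item at which time at which port" can be made concrete with a small Python sketch. It assumes the skewed input pattern commonly used for systolic arrays, in which row i of a matrix enters port i delayed by i time steps; this pattern and the helper names `skewed_schedule` and `render` are illustrative assumptions, not taken from the slides.

```python
from typing import Dict, List, Tuple

Schedule = Dict[Tuple[int, int], int]   # (time, port) -> data item


def skewed_schedule(matrix: List[List[int]]) -> Schedule:
    """Row i enters port i, delayed by i time steps (assumed skew)."""
    schedule: Schedule = {}
    for port, row in enumerate(matrix):
        for j, item in enumerate(row):
            schedule[(port + j, port)] = item
    return schedule


def render(schedule: Schedule, ports: int, steps: int) -> List[str]:
    """One text line per time step; '-' marks an empty slot at a port."""
    lines = []
    for t in range(steps):
        cells = [f"{schedule[(t, p)]:>3}" if (t, p) in schedule else "  -"
                 for p in range(ports)]
        lines.append(" ".join(cells))
    return lines


matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
schedule = skewed_schedule(matrix)
for line in render(schedule, ports=3, steps=5):
    print(line)
```

In the terms used on the slides, such a schedule is what each ASM block would replay at its port.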


The Generalization of the Systolic Array

only for applications with regular data dependencies

remedy?

discard algebraic synthesis methods

[R. Kress]: use optimization algorithms, e. g.: simulated annealing

Achievement: also non-linear and non-uniform pipes, and even wilder pipe structures possible

reconfigurability makes sense

Kress-Kung paradigm: super systolic array
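To give a feel for mapping by optimization instead of algebraic synthesis, here is a generic simulated-annealing placement sketch in Python. It is not the KressArray mapper: the cost function (total Manhattan wire length), the cooling schedule, and all names are assumptions chosen for brevity. It places the operators of S = R + (if C then A else B endif) on a 3 x 3 grid.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def wire_length(placement: Dict[str, Cell], edges: List[Tuple[str, str]]) -> int:
    """Total Manhattan distance over all producer -> consumer connections."""
    return sum(abs(placement[u][0] - placement[v][0]) +
               abs(placement[u][1] - placement[v][1]) for u, v in edges)


def anneal_placement(nodes: List[str], edges: List[Tuple[str, str]],
                     rows: int, cols: int, steps: int = 2000, seed: int = 0):
    """Return (best placement, best cost, initial cost)."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(cells)
    grid: Dict[Cell, Optional[str]] = {cell: None for cell in cells}
    placement: Dict[str, Cell] = {}
    for node, cell in zip(nodes, cells):      # random initial placement
        placement[node] = cell
        grid[cell] = node
    cost = initial_cost = wire_length(placement, edges)
    best, best_cost = dict(placement), cost
    temperature = 2.0
    for _ in range(steps):
        node, target = rng.choice(nodes), rng.choice(cells)
        source = placement[node]
        if target == source:
            continue
        other = grid[target]                  # operator at target, or None
        placement[node], grid[target], grid[source] = target, node, other
        if other is not None:
            placement[other] = source
        new_cost = wire_length(placement, edges)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost                   # accept the move
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        else:                                 # reject: undo the swap
            placement[node], grid[source], grid[target] = source, node, other
            if other is not None:
                placement[other] = target
        temperature = max(0.01, temperature * 0.998)
    return best, best_cost, initial_cost


nodes = ["A", "B", "C", "R", "mux", "add", "S"]
edges = [("A", "mux"), ("B", "mux"), ("C", "mux"),
         ("mux", "add"), ("R", "add"), ("add", "S")]
best, best_cost, initial_cost = anneal_placement(nodes, edges, rows=3, cols=3)
print(initial_cost, "->", best_cost, best)
```

Because the search only needs a cost function, nothing requires the resulting pipes to be linear or uniform, which is the achievement the slide points to.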


Here is the common model

[Figure: the CPU with software code (instruction-stream-based) next to accelerators, reconfigurable and hardwired, with configware code (data-stream-based).]

it’s not von Neumann

the vN monopoly in our curricula is severely harmful

the tail is wagging the dog

we need dual paradigm education


A potential Pentium successor

Discard most caches

have 64* cores, 0.5 - 1 GHz

with clever interconnect for:

concurrent processes,

and for multithreading,

and, for Kung-Kress pipe network

*) CPU mode / DPU mode capability

The Desk-top Supercomputer!

“Super Pentium” configuration example

[Figure: example configuration of the 64-core array, with 60 cores shown as rDPU and 4 cores shown as CPU.]


e. g.: ~ 8 x 8 rDPA: all feasible under 500 MHz

[Figure: SMeXPP device with rDPA, connected to Games, Music, Videos, Camera, Baseband Processor, Radio Interface, Audio Interface, SD/MMC Cards, and LCD display.]

• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise Reduction and Artifact Removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable Displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness Enhancement
• Shadow Enhancement
• Differentiation
• Programmable de-interlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes

World TV & game console & multi media center

http://pactcorp.com


Dual Paradigm Application Development

[Figure: a high level language source enters a software/configware co-compiler, which produces software code (instruction-stream-based) for the CPU and configware code (data-stream-based) for the accelerators, reconfigurable and hardwired.]

Software / Configware Co-Compilation

Juergen Becker’s CoDe-X, 1996

[Figure: a C language source enters the Partitioner, which feeds the SW compiler (targeting the CPU) and the CW compiler (targeting the rDPU array). The CW compiler performs Placement & Routing (Move the Locality of Operation). Resource Parameters support different platforms.]


Bringing together data and processor

Move the stool

by Configware

Place the location of execution into the data pipe


Conclusions (1): Hurdles

Obstacles are:

unbelievably disastrous tools market:

unbelievably ignorant curricula:

enabling technologies available, partly decades old, but not used

transdisciplinary models neither available nor taught in CS, nor elsewhere

fragmentation into application-domain-specific cultures and trick boxes

… teach like for a 50 year old mainframe …


Conclusions (2): Future Work

The CS discipline must recognize and accept its strategic role and its responsibility toward all its application disciplines: embedded and scientific computing.

The monopoly of the von-Neumann-based mind set in CS education:

heavily stalls progress in R&D, not only in HPC

causes high cost in R&D, not only in supercomputing

The von-Neumann-only-based mind set in CS urgently needs to go, in favor of the dual paradigm common model

CS graduates are not qualified for our job market


Conclusions (3): Chances

New horizons: chances are brilliant


thank you


Backup:


Co-Compiler Enabling Technology

is available from academia

only a small team needed for commercial re-implementation

on the road map to the Personal Supercomputer

Compilation: Software vs. Configware

Software Engineering: source program (C, FORTRAN, MATLAB), software compiler, software code

Configware Engineering: source „program“, configware compiler, consisting of a mapper (placement & routing) producing configware code and a data scheduler producing flowware code


Nick Tredennick’s Paradigm Shifts explain the differences

Software Engineering (CPU):

resources: fixed

algorithm: variable (software)

1 programming source needed

Configware Engineering:

resources: variable (configware)

algorithm: variable (flowware)

2 programming sources needed


Co-Compilation

[Figure: Software / Configware Co-Compiler. A source in C, FORTRAN, or MATLAB enters the automatic SW / CW partitioner (simulated annealing), which feeds the software compiler, producing software code, and the configware compiler. The configware compiler consists of a mapper, producing configware code, and a data scheduler, producing flowware code; simulated annealing is annotated a second time in the figure.]


Co-Compiler for Hardwired Kress/Kung Machine [e. g. Brodersen]

[Figure: Software / Flowware Co-Compiler. The source enters the automatic SW / CW partitioner, which feeds the software compiler, producing software code, and the flowware compiler (data scheduler), producing flowware code.]

The first archetype machine model

Software Industry’s Secret of Success: simple basic Machine Paradigm

mainframe, CPU

compile or assemble: procedural personalization

personalization: RAM-based

instruction-stream-based mind set

“von Neumann”

The 2nd archetype machine model

Configware Industry’s Secret of Success: simple basic Machine Paradigm

accelerator reconfigurable

compile: structural personalization

personalization: RAM-based

data-stream-based mind set

“Kress-Kung”


„Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack“ [Herb Riley, R. Associates]
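As a rough plausibility check, the quoted saving can be converted back into energy and average power. The Python sketch below uses only the numbers in the quote plus one assumption, namely that the rack runs around the clock all year; the quote itself does not state the duty cycle.

```python
SAVINGS_USD_PER_YEAR = 10_000   # from the quote
PRICE_USD_PER_KWH = 0.07        # from the quote: 7 cents per kWh
PROCESSORS_PER_RACK = 64        # from the quote
HOURS_PER_YEAR = 24 * 365       # assumption: continuous operation

energy_kwh = SAVINGS_USD_PER_YEAR / PRICE_USD_PER_KWH
average_power_kw = energy_kwh / HOURS_PER_YEAR
watts_per_processor = average_power_kw * 1000 / PROCESSORS_PER_RACK

print(f"{energy_kwh:,.0f} kWh per year saved")
print(f"{average_power_kw:.1f} kW average power saved per rack")
print(f"{watts_per_processor:.0f} W saved per processor")
```

Under that assumption the saving corresponds to roughly 16 kW per rack, or about 255 W per processor.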

modern FPGA bestsellers:

The new model is reality: FPGA fabrics, together with several µprocessors, many memory banks, and other IP cores, on the same COTS microchip


DSP platform FPGA [courtesy Xilinx Corp.]

500 MHz Flexible Soft Logic Architecture

200K Logic Cells

500 MHz Programmable DSP Execution Units

0.6-11.1 Gbps Serial Transceivers

500 MHz PowerPC™ Processors (680 DMIPS) with Auxiliary Processor Unit

1 Gbps Differential I/O

500 MHz multi-port Distributed 10 Mb SRAM

500 MHz DCM Digital Clock Management