Mihai Budiu Microsoft Research – Silicon Valley Girish Venkataramani, Tiberiu Chelcea, Seth Copen...

Post on 27-Mar-2015


Mihai Budiu
Microsoft Research – Silicon Valley

Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein

Carnegie Mellon University

Spatial Computation: Computing without General-Purpose Processors

Outline

• Intro: Problems of current architectures

• Compiling Application-Specific Hardware

• ASH Evaluation

• Conclusions

Resources

[Figure: performance (log scale, 1 to 1000) versus year, 1980 to 2000]

• We do not worry about not having hardware resources
• We worry about being able to use hardware resources

[Intel]


Complexity

[Figure: chip size and designer productivity versus year, 1981 to 2009, log scale from 10^4 to 10^10]

ALUs

Cannot rely on global signals (clock is a global signal)

Gate: 5 ps; wire: 20 ps

Automatic translation C → HW

Simple, short, unidirectional interconnect

No interpretation

Distributed control, asynchronous

Simple hw, mostly idle

Our Proposal: Application-Specific Hardware

• ASH addresses these problems
• ASH is not a panacea
• ASH “complementary” to CPU

[Diagram: CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation), with a cache ($) and Memory]

Paper Content

• Automatic translation of C to hardware dataflow machines

• High-level comparison of dataflow and superscalar

• Circuit-level evaluation -- power, performance, area

Outline

• Problems of current architectures

• CASH: Compiling Application-Specific Hardware

• ASH Evaluation

• Conclusions

Application-Specific Hardware

[Diagram: C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw]

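Computation Dataflow

Program:

x = a & 7;
...
y = x >> 2;

IR: [Diagram: a and 7 feed an & node; its result x and the constant 2 feed a >> node]

Circuits: [Diagram: a → &7 → >>2]

No interpretation

Program       IR               Circuits
Operations    Nodes            Pipeline stages
Variables     Def-use edges    Channels (wires)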
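The mapping above (operations become nodes, variables become def-use edges) can be illustrated with a small software model. This is only a sketch: the `node` struct, the `op` enum, and the function names are hypothetical and are not taken from the CASH compiler. It also interprets the graph in software, which the hardware itself does not do; the point is only to show the graph shape for the two statements on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds for this sketch (not the CASH compiler's IR). */
enum op { OP_INPUT, OP_CONST, OP_AND, OP_SHR };

struct node {
    enum op op;
    int lhs, rhs;    /* indices of producer nodes (def-use edges); -1 if unused */
    unsigned value;  /* preset for OP_INPUT/OP_CONST, computed otherwise */
};

/* Evaluate nodes in index order; producers must precede their consumers. */
static unsigned eval_graph(struct node *g, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        switch (g[i].op) {
        case OP_INPUT:
        case OP_CONST:
            break;  /* value is already present */
        case OP_AND:
            g[i].value = g[g[i].lhs].value & g[g[i].rhs].value;
            break;
        case OP_SHR:  /* shift amounts in this sketch are small constants */
            g[i].value = g[g[i].lhs].value >> g[g[i].rhs].value;
            break;
        }
    }
    return g[n - 1].value;
}

/* Graph for: x = a & 7; y = x >> 2; returns y. */
unsigned run_example(unsigned a)
{
    struct node g[] = {
        { OP_INPUT, -1, -1, a },  /* a */
        { OP_CONST, -1, -1, 7 },
        { OP_AND,    0,  1, 0 },  /* x: edges from a and 7 */
        { OP_CONST, -1, -1, 2 },
        { OP_SHR,    2,  3, 0 },  /* y: edges from x and 2 */
    };
    return eval_graph(g, sizeof g / sizeof g[0]);
}
```

Each array index plays the role of a def-use edge: the `>>` node names the `&` node as its producer, just as y uses x in the program.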

Basic Computation = Pipeline Stage

[Diagram: an adder (+) followed by a latch, with data, valid, and ack signals]
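A rough software model of such a stage, assuming a simple handshake in which the stage accepts new inputs only when its latch is empty and the consumer empties the latch by taking the result. The names `stage`, `stage_push`, and `stage_pop` are hypothetical, and the real circuit-level protocol is not specified on the slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One adder stage with an output latch (simplified model, not the real circuit). */
struct stage {
    bool full;      /* latch holds a result the consumer has not taken: "valid" */
    unsigned data;  /* latched result */
};

/* Producer side: offer two operands. Returns true ("ack") if accepted. */
bool stage_push(struct stage *s, unsigned a, unsigned b)
{
    assert(s != NULL);
    if (s->full)
        return false;  /* previous result not consumed yet: no ack */
    s->data = a + b;   /* the "+" operation */
    s->full = true;    /* raise valid toward the consumer */
    return true;
}

/* Consumer side: take the result if one is valid. */
bool stage_pop(struct stage *s, unsigned *out)
{
    assert(s != NULL && out != NULL);
    if (!s->full)
        return false;  /* nothing valid yet */
    *out = s->data;
    s->full = false;   /* frees the latch so the next push can be acked */
    return true;
}
```

In this model no global clock or controller is involved: each stage decides locally, from its own latch state, whether it can accept or deliver data.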

Distributed Control Logic

[Diagram: + and - pipeline stages exchanging rdy and ack signals over short, local wires, contrasted with a global FSM]

MUX: Forward Branches

if (x > 0)
    y = -x;
else
    y = b*x;

[Diagram: dataflow graph with inputs x, b, and 0; nodes *, -, >, and !; output y]

Conditionals ⇒ Speculation

SSA = no arbitration
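A minimal sketch of what speculation means for this example: both arms are computed unconditionally and the predicate only selects a result. The function names are hypothetical, and the arithmetic is widened to `long long` (an addition for this sketch) so that evaluating an arm that the branching code would have skipped cannot overflow.

```c
#include <assert.h>

/* Branching version, following the slide's code (result widened to long long). */
long long y_branch(int x, int b)
{
    if (x > 0)
        return -(long long)x;
    else
        return (long long)b * x;
}

/* Speculative version: both arms are computed, then a MUX selects one. */
long long y_mux(int x, int b)
{
    int p = x > 0;                   /* ">" node: the predicate */
    long long t = -(long long)x;     /* "-" node, computed even when p is false */
    long long e = (long long)b * x;  /* "*" node, computed even when p is true */
    return p ? t : e;                /* MUX: p selects t, otherwise e */
}

/* Sanity check: speculation must not change the result. */
void check_same(int x, int b)
{
    assert(y_branch(x, b) == y_mux(x, b));
}
```

Because each value has a single definition (SSA), the two arms write to separate wires and the MUX is the only place where they meet, so no arbitration between writers is needed.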

Memory Access

[Diagram: LD, ST, and LD operations connected through a pipelined, arbitrated network to a monolithic memory; labels: local communication, global structures]

Future work: fragment this!

Outline

• Problems of current architectures

• Compiling ASH

• ASH Evaluation

• Conclusions

Evaluating ASH

[Diagram: C → CASH core → Verilog back-end → Synopsys, Cadence P/R (commercial tools) → ASIC, with a Mem block]

• Mediabench kernels (1 hot function/benchmark)
• 180nm std. cell library, 2V (~1999 technology)
• ModelSim (Verilog simulation) for performance numbers

Compile Time

[Diagram: tool flow C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC (with Mem), annotated with: 200 lines; 20 seconds; 10 seconds; 20 minutes; 1 hour]

ASH Area

[Figure: area in square mm (0 to 8) per benchmark: adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e; bars split into Mem access and Datapath. Reference labels: P4: 217; minimal RISC core]

ASH vs 600MHz CPU [.18 μm]

[Figure: bar chart, y-axis "Times slower" (0 to 4); bar labels: 0.60, 0.77, 0.53, 0.48, 1.87, 0.70, 1.93, 1.35, 1.52, 1.55, 3.65, 3.57, 1.23]

Bottleneck: Memory Protocol

[Diagram: LD and ST operations, LSQ, Memory]

• Enabling dependent operations requires a round trip to memory.
• Limit study: round trip in zero time ⇒ up to 5x speed-up.
• Exploring novel memory access protocols.

Power

DSP: 110
μP: 4000
Xeon [+cache]: 67000

[Figure: bar chart, y-axis Power [mW] (0.0 to 45.0); bar labels: 34.4, 21.8, 9.3, 9.3, 13.0, 29.7, 42.5, 23.6, 22.5, 28.3, 25.2, 25.2, 21.6]

Energy-delay vs. Wattch

[Figure: bar chart, log-scale y-axis "Energy-delay vs superscalar (times better)", 1 to 10000]

Energy Efficiency

[Figure: Energy Efficiency [Operations/nJ] on a log scale from 0.01 to 1000; categories shown: General-purpose DSP, Dedicated hardware, ASH media kernels, FPGA, Microprocessors, Asynchronous μP; a 1000x gap is marked]

Outline

Problems of current architectures

+ Compiling ASH

+ Evaluation

= Related work, Conclusions

Related Work

• Optimizing compilers

• High-level synthesis

• Reconfigurable computing

• Dataflow machines

• Asynchronous circuits

• Spatial computation

We target an extreme point in the design space: no interpretation, fully distributed computation and control

ASH Design Point

• Design an ASIC in a day

• Fully automatic synthesis to layout

• Fully distributed control and computation (spatial computation)
  – Replicate computation to simplify wires

• Energy/op rivals custom ASIC

• Performance rivals superscalar

• E×t (energy-delay product) 100 times better than any processor

Conclusions

Spatial computation strengths

Feature                  Advantages
No interpretation        Energy efficiency, speed
Spatial layout           Short wires, no contention
Asynchronous             Low power, scalable
Distributed              No global signals
Automatic compilation    Designer productivity

Backup Slides

• Absolute performance
• Control logic
• Exceptions
• Leniency
• Normalized area
• Loops
• ASH weaknesses
• Splitting memory
• Recursive calls
• Leakage
• Why not compare to…
• Targeting FPGAs

Absolute Performance

[Figure: bar chart, y-axis "Megaoperations per second" (0 to 9000); series: MOPSall, MOPSspec, MOPS]

Pipeline Stage

[Diagram: an = operator with a register (Reg) and a block labeled C; signals: rdyin, ackout, rdyout, ackin, datain, dataout]

Exceptions

• Strictly speaking, C has no exceptions

• In practice, exceptions are hard to accommodate in hardware implementations

• An advantage of software flexibility: the PC is a single point of execution control

[Diagram: CPU (low-ILP computation + OS + VM + exceptions) and ASH (high-ILP computation), with $$$ and Memory]

Critical Paths

if (x > 0)
    y = -x;
else
    y = b*x;

[Diagram: dataflow graph with inputs x, b, and 0; nodes *, -, >, and !; output y]

Lenient Operations

if (x > 0)
    y = -x;
else
    y = b*x;

[Diagram: dataflow graph with inputs x, b, and 0; nodes *, -, >, and !; output y]

Solves the problem of unbalanced paths
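A sketch of the idea, assuming that "lenient" here means an operator may fire as soon as the inputs that determine its result have arrived, without waiting for the rest. The `signal` struct and both function names are hypothetical. With x > 0, the lenient MUX can output -x while the slower multiplier is still busy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A value that may not have arrived yet. */
struct signal {
    bool ready;
    long long value;
};

/* Strict MUX: waits for the predicate and both data inputs. */
bool strict_mux(struct signal p, struct signal t, struct signal e, long long *out)
{
    assert(out != NULL);
    if (!(p.ready && t.ready && e.ready))
        return false;
    *out = p.value ? t.value : e.value;
    return true;
}

/* Lenient MUX: fires once the predicate and the selected input are ready. */
bool lenient_mux(struct signal p, struct signal t, struct signal e, long long *out)
{
    assert(out != NULL);
    if (!p.ready)
        return false;
    struct signal chosen = p.value ? t : e;  /* only this input matters */
    if (!chosen.ready)
        return false;
    *out = chosen.value;
    return true;
}
```

Under this assumption the output delay follows the selected path only, so a long multiply on the untaken arm no longer sits on the critical path.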


Normalized Area

[Figure: normalized area per benchmark: adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, and avg; two series, Lines/sq mm and sq mm/kbyte, on axes 0 to 120 and 0 to 2.5]

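Control Flow ⇒ Data Flow

[Diagram: three node types: Merge (label), with data inputs; Gateway, with data and predicate inputs; Split (branch), with data and predicate p; a ! node is also shown]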

Loops

int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;

[Diagram: dataflow loop with nodes i, +1, < 100, *, +, and sum, initial values 0 and 0, a ! node, and ret]
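One way to read the loop diagram, sketched as a software model: the loop-carried values i and sum each circulate on their own back edge, and the negated loop predicate lets sum leave through ret. This reading, the `loop_state` struct, and the function names are assumptions made for illustration, not the compiler's actual representation.

```c
#include <assert.h>

/* Loop-carried values; each one circulates on its own back edge. */
struct loop_state {
    int i;
    int sum;
};

/* One trip around the loop. Returns 0 when the exit predicate fires. */
static int step(struct loop_state *s, int bound)
{
    int p = s->i < bound;  /* "<" node: loop predicate */
    if (!p)
        return 0;          /* "!" node: sum is steered to ret (assumed reading) */
    int sq = s->i * s->i;  /* "*" node */
    s->sum = s->sum + sq;  /* "+" node on the sum back edge */
    s->i = s->i + 1;       /* "+1" node on the i back edge */
    return 1;
}

/* Sum of i*i for 0 <= i < bound; the slide uses bound = 100. */
int sum_squares(int bound)
{
    assert(bound >= 0 && bound <= 1000);  /* keeps sum within int range */
    struct loop_state s = { 0, 0 };
    while (step(&s, bound)) {
        /* all work happens in step() */
    }
    return s.sum;
}
```

Calling `sum_squares(100)` reproduces the slide's loop and returns 328350.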

ASH Weaknesses

• Both branch and join not free
• Static dataflow (no re-issue of same instr)
• Memory is “far”
• Fully static
  – No branch prediction
  – No dynamic unrolling
  – No register renaming
• Calls/returns not lenient

Branch Prediction

for (i=0; i < N; i++) {
    ...
    if (exception) break;
}

Callouts: "Predicted not taken. Effectively a noop for CPU!"; "Predicted taken."

[Diagram: dataflow graph with nodes i, +, 1, <, &, !, and exception; the ASH crit path and the CPU crit path are marked; label: result available before inputs]

Memory Partitioning

• MIT RAW project: Babb FCCM ’99, Barua HiPC ’00, Lee ASPLOS ’00

• Stanford SpC: Semeria DAC ’01, TVLSI ’02

• Illinois FlexRAM: Fraguela PPoPP ’03

• Hand-annotations #pragma

Recursion

[Diagram: around a recursive call, live values are saved to and restored from a stack]

Leakage Power

$P_s = k \cdot \text{Area} \cdot e^{-V_T}$

• Employ circuit-level techniques

• Cut power supply of idle circuit portions
  – most of the circuit is idle most of the time
  – strong locality of activity

Why Not Compare To…

• In-order processor
  – Worse in all metrics than superscalar, except power
  – We beat it in all metrics, including performance

• DSP
  – We expect roughly the same results as for superscalar (Wattch maintains high IPC for these kernels)

• ASIC
  – No available tool-flow supports C to the same degree

• Asynchronous ASIC
  – We compared with a Balsa synthesis system
  – We are 15 times better in E×t compared to the resulting ASIC

• Async processor
  – We are 350 times better in E×t than Amulet (scaled to .18)

Compared to Next Talk

Engine [180nm]    Performance [MIPS]    E/instruction [pJ]
SNAP/LE           28                    24
SNAP/LE           240                   218
ASH               1100                  20

Why not target FPGA

• Do not support asynchronous circuits

• Very inefficient in area, power, delay

• Too fine-grained for datapath circuits

• We are designing an async FPGA
