
Tomorrow’s Computing Engines

February 3, 1998
Symposium on High-Performance Computer Architecture

William J. Dally
Computer Systems Laboratory
Stanford University
[email protected]


Focus on Tomorrow, not Yesterday

Generals tend to always fight the last war.

Computer architects tend to always design the last computer:

• old programs
• old technology assumptions


Some Previous “Wars” (1/3)

• MARS Router (1984)
• Torus Routing Chip (1985)
• Network Design Frame (1988)
• Reliable Router (1994)


Some Previous “Wars” (2/3)

[Images: MDP Chip, J-Machine, Cray T3D, MAP Chip]


Some Previous “Wars” (3/3)

[Images]


Tomorrow’s Computing Engines

• Driven by tomorrow’s applications - media

• Constrained by tomorrow’s technology


90% of Desktop Cycles will Be Spent on ‘Media’ Applications by 2000

• Quote from Scott Kirkpatrick of IBM (talk abstract)
• Media applications include
– video encode/decode
– polygon & image-based graphics
– audio processing: compression, music, speech recognition/synthesis
– modulation/demodulation at audio and video rates
• These applications involve stream processing
• So do
– radar processing: SAR, STAP, MTI ...


Typical Media Kernel: Image Warp and Composite

• Read 10,000 pixels from memory
• Perform 100 16-bit integer operations on each pixel
• Test each pixel
• Write the 3,000 result pixels that pass back to memory

• Little reuse of data fetched from memory
– each pixel is used once
• Little interaction between pixels
– very insensitive to operation latency
• Challenge is to maximize bandwidth
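To make the access pattern concrete, here is a minimal sketch of such a kernel in Python with NumPy. The threshold and the arithmetic are invented stand-ins (the slide gives only the counts); the point is the pattern: each pixel is read once, receives roughly 100 cheap 16-bit integer operations, is tested, and only the survivors are written back.

```python
import numpy as np

def warp_and_composite(pixels, threshold=45_000):
    """Toy warp-and-composite kernel: read each pixel once, apply ~100
    16-bit integer operations, test it, and emit only the pixels that
    pass. The arithmetic is a stand-in for real warp/composite math."""
    x = pixels.astype(np.uint16)
    for _ in range(25):               # 4 ops per pass: ~100 int ops/pixel
        x = (x * 3 + 7) & 0xFFFF      # multiply-add with 16-bit wraparound
        x ^= x >> 4                   # cheap mixing step
    mask = x > threshold              # per-pixel test
    return x[mask]                    # write back only the survivors

src = np.random.randint(0, 65536, size=10_000, dtype=np.uint16)
out = warp_and_composite(src)
print(f"read {src.size} pixels, wrote {out.size}")  # ~3,000 of 10,000 pass
```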


Telepresence: A Driving Application

[Pipeline: Acquire 2D Images → Extract Depth (3D Images) → Segmentation → Model Extraction → Compression → Channel → Decompression → Rendering → Display of the 3D Scene]

Most kernels: latency insensitive, with a high ratio of arithmetic to memory references


Tomorrow’s Technology is Wire Limited

• Lots of devices
• A little faster
• Slow wires


Technology scaling makes communication the scarce resource

1997: 0.35 µm process, 64Mb DRAM, 16 64-bit FP processors at 400MHz; an 18mm die spans 12,000 wire tracks and is crossed in 1 clock

2007: 0.10 µm process, 4Gb DRAM, 1K 64-bit FP processors at 2.5GHz; a 32mm die spans 90,000 wire tracks and is crossed in 20 clocks


On-chip wires are getting slower

For a wire of fixed length $y$, with per-unit-length resistance $R$ and capacitance $C$, gate delay $t_g$, and feature-size scaling $s = 0.5$ per generation:

• $x_2 = s\,x_1$ (0.5×)
• $R_2 = R_1/s^2$ (4×)
• $C_2 = C_1$ (1×)
• $t_w = R C y^2$, so $t_{w2} = R_2 C_2 y^2 = t_{w1}/s^2$ (4×)
• $t_{w2}/t_{g2} = t_{w1}/(t_{g1} s^3)$ (8×)
• $v = \frac{1}{2}(t_g R C)^{-1/2}$ m/s, so $v_2 = v_1 s^{1/2}$ (0.7×)
• $v\,t_g = \frac{1}{2}(t_g/RC)^{1/2}$ m/gate, so $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35×)
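The slide's figures collect into one short derivation; the only step not written on the slide is that gate delay is assumed to scale with feature size ($t_{g2} = s\,t_{g1}$), which its 8× relative-delay number implies:

```latex
\begin{align*}
  &\text{Scaling per generation } (s = 0.5)\text{:}\quad
    x_2 = s\,x_1,\quad R_2 = R_1/s^{2},\quad C_2 = C_1,\quad t_{g2} = s\,t_{g1}.\\
  &t_w = R\,C\,y^{2}
    \;\Rightarrow\; t_{w2} = \frac{R_1}{s^{2}}\,C_1\,y^{2} = \frac{t_{w1}}{s^{2}}
    \quad(4\times)\\
  &\frac{t_{w2}}{t_{g2}} = \frac{t_{w1}/s^{2}}{s\,t_{g1}}
    = \frac{1}{s^{3}}\,\frac{t_{w1}}{t_{g1}}
    \quad(8\times)\\
  &v = \tfrac{1}{2}\,(t_g R C)^{-1/2}
    \;\Rightarrow\; v_2 = v_1\,\sqrt{s}
    \quad(0.7\times)\\
  &v\,t_g = \tfrac{1}{2}\,\sqrt{t_g/(R C)}
    \;\Rightarrow\; v_2\,t_{g2} = v_1\,t_{g1}\,s^{3/2}
    \quad(0.35\times)
\end{align*}
```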


Bandwidth and Latency of Modern VLSI

[Figure: bandwidth and latency vs. size on log-log axes (size from 10¹ to 10⁵), with the chip boundary marked: bandwidth drops and latency climbs by orders of magnitude once communication crosses the chip boundary.]


Architecture for Locality: Exploit high on-chip bandwidth

[Diagram: off-chip RAM reaches the chip through pin bandwidth of 2GB/s; on chip, a vector register file supplies 50GB/s, and a switch connects 104 32-bit ALUs at 500GB/s.]


Tomorrow’s Computing Engines

• Aimed at media processing
– stream based
– latency tolerant
– low-precision
– little reuse
– lots of conditionals
• Use the large number of devices available on future chips
• Make efficient use of scarce communication resources
– bandwidth hierarchy
– no centralized resources
• Approach the performance of a special-purpose processor


Why do Special-Purpose Processors Perform Well?

• Lots (100s) of ALUs
• Fed by dedicated wires/memories


Care and Feeding of ALUs

[Diagram: a conventional datapath: instruction pointer and instruction cache supply instruction bandwidth into the IR; a register file supplies data bandwidth into the ALU. The 'feeding' structure dwarfs the ALU itself.]


Three Key Problems

• Instruction bandwidth
• Data bandwidth
• Conditional execution


A Bandwidth Hierarchy

[Diagram: four SDRAM banks feed a streaming memory system at 1.6GB/s; the streaming memory feeds a vector register file at 50GB/s; the register file feeds the ALU clusters, 13 ALUs per cluster, at 500GB/s.]

• Solves the data bandwidth problem
• Matched to the bandwidth curve of the technology
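A back-of-the-envelope check that these ratios are balanced for a kernel like the warp above. This is a sketch using the slide's bandwidths; the 16-bit sample size and the ~3 operands touched per ALU operation are illustrative assumptions, not slide numbers.

```python
# Back-of-the-envelope balance check, using the slide's bandwidths. The
# 16-bit sample size and ~3 operands touched per ALU op are assumptions.
mem_bw = 1.6e9    # SDRAM -> streaming memory, bytes/s
vrf_bw = 50e9     # vector register file, bytes/s
alu_bw = 500e9    # ALU clusters via the switch, bytes/s

bytes_per_pixel = 2        # one 16-bit sample, fetched from memory once
ops_per_pixel = 100        # from the "Typical Media Kernel" slide
operand_bytes_per_op = 3 * 2   # ~3 16-bit operands per op (assumption)

ops_memory_can_feed = mem_bw / bytes_per_pixel * ops_per_pixel
ops_clusters_can_feed = alu_bw / operand_bytes_per_op
print(f"memory tier sustains  {ops_memory_can_feed:.1e} ops/s")
print(f"cluster tier sustains {ops_clusters_can_feed:.1e} ops/s")
# Both land near 8e10 ops/s: at ~100 ops per pixel fetched once, the
# ~1:30:300 hierarchy is balanced, leaving the 50GB/s register file
# (vrf_bw) for staging intermediate streams between kernels.
```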


A Streaming Memory System

[Diagram: two address generators issue index (IX) and data (D) streams through a crossbar to the SDRAM banks; a reorder queue in front of each bank lets accesses issue out of order to avoid bank conflicts while results return in stream order.]


Streaming Memory Performance

[Figure: bank queue effectiveness: cycles per access vs. reorder queue size (1, 2, 4, 8, 16, 32, 64, infinite), falling from roughly 1.75 cycles/access with no reordering toward 1.0 with a deep queue.]

• Exploit latency insensitivity for improved bandwidth

• 1.75:1 performance improvement from a relatively short reorder queue
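A toy model of why a modest reorder window recovers most of the lost bandwidth. The bank count, bank-busy time, and random access stream below are invented, and real SDRAM timing is considerably more complex.

```python
import random

def cycles_per_access(queue_size, n_accesses=20_000, n_banks=8,
                      t_bank_busy=4, seed=1):
    """Toy streaming-memory model: each access ties up one of n_banks
    banks for t_bank_busy cycles, and at most one access issues per
    cycle. queue_size == 1 forces strict in-order issue; a deeper
    window may issue any pending access whose bank is free."""
    rng = random.Random(seed)
    pending = [rng.randrange(n_banks) for _ in range(n_accesses)]
    bank_free = [0] * n_banks       # cycle at which each bank frees up
    cycle = issued = 0
    while issued < n_accesses:
        window = pending[issued:issued + queue_size]
        for i, bank in enumerate(window):
            if bank_free[bank] <= cycle:        # bank is idle: issue it
                pending[issued + i] = pending[issued]   # swap to front
                pending[issued] = bank
                bank_free[bank] = cycle + t_bank_busy
                issued += 1
                break
        cycle += 1                  # one issue slot per cycle
    return cycle / n_accesses

for q in (1, 2, 4, 8, 16, 32, 64):
    print(f"reorder queue {q:2d}: {cycles_per_access(q):.2f} cycles/access")
```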


Compound Vector Operations: 1 Instruction Does Lots of Work

[Diagram: a compound vector instruction such as "LD Vd Vx" drives the memory system's address generators, while compute instructions index a microprogram in the control store through a µIP; each 300-bit µinstruction carries an (Op Ra Rb) slot per ALU, operating on vector registers V0..V7 in the VRF.]

One 50-bit compound vector instruction expands to:
µinstruction (300b) × 20 µinst/op × 1000 elements/vector = 6 × 10⁶ bits of control
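An interpreter-style sketch of the mechanism; the opcode, microprogram, and register layout are invented for illustration, but the shape is the slide's: one short instruction sequences a microprogram over every element of a vector.

```python
# Sketch of a compound vector operation: one short instruction invokes a
# microprogram that the sequencer loops over every vector element. The
# opcode "VMAC", this 3-step microprogram, and the register layout are
# invented (the slide's real compound ops run ~20 uinstructions per op).
vrf = {f"V{i}": list(range(1000)) for i in range(8)}  # 1000-element vectors

microcode = {
    "VMAC": [            # (op, dst, src_a, src_b) micro-instruction slots
        ("mul", "V2", "V0", "V1"),
        ("add", "V3", "V2", "V0"),
        ("min", "V3", "V3", "V1"),
    ],
}
ALU_OPS = {"mul": lambda a, b: a * b, "add": lambda a, b: a + b, "min": min}

def run_compound(opcode):
    """One compound vector instruction: microprogram x every element."""
    for i in range(1000):                          # 1000 elements/vector
        for op, dst, sa, sb in microcode[opcode]:  # uinsts per element
            vrf[dst][i] = ALU_OPS[op](vrf[sa][i], vrf[sb][i])

run_compound("VMAC")
# The amplification the slide computes: one 50-bit instruction stands in
print("for", 300 * 20 * 1000, "bits of uinstruction issue")  # 6,000,000
```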


Scheduling by Simulated Annealing

• List scheduling assumes global communication
– does poorly when communication is exposed
• View scheduling as a CAD problem (place and route)
– generate a naïve 'feasible' schedule
– iteratively improve the schedule by moving operations

[Diagram: ready operations are placed one at a time onto an ALUs × time grid.]
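A minimal sketch of scheduling as annealing, assuming an invented toy dependence graph, cost function, and cooling schedule; the slide fixes only the overall approach (start from a naïve feasible schedule, then iteratively move operations, occasionally accepting uphill moves).

```python
import math, random

# Scheduling as place-and-route annealing over an (ALU, cycle) grid.
random.seed(0)
N_ALUS, N_OPS = 4, 24
deps = [(i, i + 1) for i in range(0, N_OPS - 1, 2)]  # producer -> consumer

def cost(place):
    """Energy: schedule length plus penalties for dependence violations
    and for two operations sharing one (ALU, cycle) slot."""
    length = max(c for _, c in place.values()) + 1
    violations = sum(1 for a, b in deps if place[b][1] <= place[a][1])
    slots = list(place.values())
    conflicts = len(slots) - len(set(slots))
    return length + 50 * (violations + conflicts)

place = {op: (0, op) for op in range(N_OPS)}  # naive feasible: serial, ALU 0
temp = 10.0
while temp > 0.01:
    for _ in range(200):
        op = random.randrange(N_OPS)
        old = place[op]
        place[op] = (random.randrange(N_ALUS), random.randrange(N_OPS))
        delta = cost(place) - cost({**place, op: old})
        if delta > 0 and random.random() > math.exp(-delta / temp):
            place[op] = old                  # Metropolis: reject uphill move
    temp *= 0.9                              # cool

length = max(c for _, c in place.values()) + 1
print("schedule length:", length, "| residual penalty:", cost(place) - length)
```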


Typical Annealing Schedule

[Figure: cost vs. annealing iteration over roughly 18,000 moves; the schedule improves from 166 to 13, with a visible shift at the point marked "energy function changed".]


Conventional Approaches to Data-Dependent Conditional Execution

[Diagram: three ways to run "A; if (x > 0) then B, J else C, K":

• Data-dependent branch: execute A, branch on x > 0 to B or C, then continue with J or K; each branch serializes the wide machine.
• Speculation: predict the branch and issue past it; a mispredict ("whoops") throws away the issued work, a speculative loss of D × W ≈ 1000 operation slots (pipeline depth times machine width).
• Predication: compute y = (x > 0), then issue B and J under "if y" and C and K under "if ~y"; nesting gives an exponentially decreasing duty factor.]


Zero-Cost Conditionals

• Most approaches to conditional operations are costly
– branching control flow: dead issue slots on mispredicted branches
– predication (SIMD select, masked vectors): a large fraction of execution 'opportunities' go idle
• Conditional vectors
– append an element to an output stream depending on a case variable

[Diagram: a switch reads a result stream and a {0,1} case stream and appends each element to output stream 0 or output stream 1 according to its case bit.]
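A minimal sketch of the conditional-vector mechanism; the stream contents are invented, but the routing step is the one the slide describes.

```python
# Conditional vectors: route each element of a result stream to one of two
# dense output streams according to its case bit; no issue slots go idle.
result_stream = [17, -3, 42, 0, -8, 99, 5, -1]
case_stream = [1 if x > 0 else 0 for x in result_stream]  # the x > 0 test

out0, out1 = [], []                 # output stream 0 and output stream 1
for value, case in zip(result_stream, case_stream):
    (out1 if case else out0).append(value)

# Each dense stream can now run at full duty factor, e.g. the "then"
# kernel over out1 and the "else" kernel over out0.
print("stream 0 (case 0):", out0)
print("stream 1 (case 1):", out1)
```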


Application Sketch - Polygon Rendering

[Diagram: rendering as a chain of stream kernels: a Vertex kernel maps triangles (V1, V2, V3) to shaded vertices (X, Y, RGB, UV); a Span kernel maps vertices to spans (Y, X1, X2 with RGB and UV at the endpoints); a Pixel kernel interpolates spans into pixels (X, Y, RGB, UV); texture lookup replaces UV with RGB to produce textured pixels (X, Y, RGB).]
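As a sketch of the span and pixel stages as stream kernels: flat-topped triangle, integer coordinates, no shading or clipping; everything here is a simplification, and only the vertex → span → pixel stream decomposition comes from the slide.

```python
# Toy span and pixel kernels for a flat-topped triangle.
def spans_from_triangle(x_top_l, x_top_r, x_bot, y_top, y_bot):
    """Span kernel: walk the triangle's edges, emitting one
    (y, x1, x2) span per scanline."""
    for y in range(y_top, y_bot + 1):
        t = (y - y_top) / max(1, y_bot - y_top)   # edge interpolation
        x1 = round(x_top_l + t * (x_bot - x_top_l))
        x2 = round(x_top_r + t * (x_bot - x_top_r))
        yield (y, x1, x2)

def pixels_from_spans(spans):
    """Pixel kernel: expand each span into an (x, y) pixel stream."""
    for y, x1, x2 in spans:
        for x in range(x1, x2 + 1):
            yield (x, y)

pixels = list(pixels_from_spans(spans_from_triangle(2, 8, 5, 0, 4)))
print(len(pixels), "pixels, first few:", pixels[:5])
```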


Status

• Working simulator of Imagine
• Simple kernels running on the simulator
– FFT
• Applications being developed
– depth extraction, video compression, polygon rendering, image-based graphics
• Circuit/layout studies underway


Acknowledgements

• Students/Staff
– Don Alpert (Intel)
– Chris Buehler (MIT)
– J.P. Grossman (MIT)
– Brad Johanson
– Ujval Kapasi
– Brucek Khailany
– Abelardo Lopez-Lagunas
– Peter Mattson
– John Owens
– Scott Rixner

• Helpful Suggestions
– Henry Fuchs (UNC)
– Pat Hanrahan
– Tom Knight (MIT)
– Marc Levoy
– Leonard McMillan (MIT)
– John Poulton (UNC)


Conclusion

• Work toward tomorrow’s computing engines
• Targeted toward media processing
– streams of low-precision samples
– little reuse
– latency tolerant
• Matched to the capabilities of communication-limited technology
– explicit bandwidth hierarchy
– explicit communication between units
– communication exposed
• Insight, not numbers