MOS High Performance Arithmetic - arith23.gforge.inria.fr/slides/Horowitz.pdf


MOS High Performance Arithmetic

Mark Horowitz

Stanford University


Arithmetic Is Important

Then

Now (Tegra K1)


What Is Hard?

9999999 + 1
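This sum is hard because the carry created at the lowest digit has to travel through every other digit before the result is known. A minimal Python sketch of that dependency follows; the function name `longest_carry_chain` is illustrative and not from the slides.

```python
def longest_carry_chain(a: int, b: int, base: int = 10) -> int:
    """Return the longest run of consecutive digit positions that produce a carry."""
    carry = 0
    run = 0        # length of the carry chain currently in progress
    longest = 0
    while a or b or carry:
        digit_sum = a % base + b % base + carry
        carry = 1 if digit_sum >= base else 0
        run = run + 1 if carry else 0
        longest = max(longest, run)
        a //= base
        b //= base
    return longest


print(longest_carry_chain(9999999, 1))   # 7: the carry ripples through all seven digits
print(longest_carry_chain(1234567, 1))   # 0: no digit produces a carry
```

The worst case grows linearly with operand width, which is the delay that fast adder structures try to shorten.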


Proof:

2nd Gen

1st Gen

And Getting The Data You Need

But we didn’t notice this until much later …

To notice this problem
• We need many advances in technology

3rd Gen – Relays (Z3 1940)

http://history-computer.com/ModernComputer/Relays/Zuse.html

A few adds/sec

4th Gen – Tubes (ENIAC 1945)

5000 adds/sec

5th Gen – Transistors (TX-0, Transistor 1 1953)

TX-0

Modern Era – IC, 1961

Image from State of the Art © Stan Augarten


Moore’s Law

Number of components on IC doubles every year
Later modified to doubling every 18 to 24 months

From Electronics, Volume 38, Number 8, April 19, 1965


ECL Computers


Microprocessor – MOS Processor (1974)

4004

nMOS 1978

8086

CMOS 1985 – To Present

80386

CMOS (Arithmetic) Design

What makes a good design?


Try to Balance 4 Parameters

Area

Performance

Power

Design Time


The Good News …

By the time CMOS came along

• There had been a lot of work on arithmetic
  • Booth coding
  • Wallace trees
  • Ling coding
  • Tree adders
  • Manchester carry chains
  • SRT division
  • …

The Bad News

The best logical design depends on technology

Remember the carry dependency?
• For relays a Manchester carry chain is the best
• All the delay is in changing the relay state
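The carry dependency can be written with per-bit propagate (P) and generate (G) signals. The sketch below models only the logic of such a chain, not the relay or pass-switch circuit; `pg_add` and its parameters are names chosen here for illustration.

```python
def pg_add(a: int, b: int, width: int = 8, cin: int = 0) -> int:
    """Add two unsigned integers using per-bit propagate/generate signals."""
    carry = cin
    result = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        p = ai ^ bi               # propagate: pass the incoming carry along
        g = ai & bi               # generate: create a carry regardless of the input
        result |= (p ^ carry) << i
        carry = g | (p & carry)   # the serial dependency the chain implements
    return result | (carry << width)


print(pg_add(0xFF, 0x01))   # 256: the carry propagates through all eight bits
```

In a Manchester chain the P signals act as series switches, so once they are set the carry passes through with little added delay, which matches the point that the cost is in changing the switch state.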

[Figure: carry chain with propagate/generate inputs P0, G0, P1, G1, P2, G2]

More Bad News

The metrics you are optimizing work in opposition

[Figure: plot of Energy vs. Performance]

Just to Make Life More Complicated

Your metrics change w/ technology scaling


Must Use Technology Independent Metrics

Performance
• In terms of a FO4 delay

[Figure: Fanout=4 inverter delay at TT, 90% Vdd, 125C. Gate delay (ps), 0 to 700, vs. technology Ldrawn (um), 1.2 down to 0.2, with trend line 500 * Ldrawn. Inset labels: 1, 4, 16, FO4]
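The trend line in the plot gives a rule of thumb: one FO4 delay is roughly 500 ps per micron of drawn gate length under the stated worst-case conditions. A small sketch, with hypothetical helper names, of how a measured delay can be restated in FO4 units:

```python
def fo4_delay_ps(l_drawn_um: float) -> float:
    """Rule-of-thumb FO4 inverter delay in ps: 500 * Ldrawn (um), as on the plot."""
    return 500.0 * l_drawn_um


def delay_in_fo4(delay_ps: float, l_drawn_um: float) -> float:
    """Express a circuit delay as a technology-independent count of FO4 delays."""
    return delay_ps / fo4_delay_ps(l_drawn_um)


print(fo4_delay_ps(0.25))          # 125.0 ps per FO4 at 0.25 um
print(delay_in_fo4(1000.0, 0.25))  # 8.0: a 1 ns path is 8 FO4 at 0.25 um
```

Quoting delays this way lets designs from different process generations be compared directly.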


Area and Energy

Area
• Measure linear dimensions in “features”

Energy
• CMOS energy is C ∗ V²
• Normalize by ∗


Dennard’s Scaling

The triple play (α is the linear scaling factor):
• Get more gates, 1/L² → 1/α²
• Gates get faster, CV/i → α
• Energy per switch, CV² → α³

Dennard, JSSC, pp. 256-268, Oct. 1974
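The three scaling results can be checked with a few lines of arithmetic. This sketch assumes ideal constant-field scaling, where length, voltage, capacitance, and current all shrink by the same factor `alpha` (a symbol chosen here; the function name is also illustrative).

```python
def dennard_scale(alpha: float) -> dict:
    """Ideal scaling when L, V, C and i are all multiplied by alpha (< 1)."""
    return {
        "gate_density": 1 / alpha ** 2,   # gate area goes as L^2
        "gate_delay": alpha,              # CV/i = (alpha * alpha) / alpha
        "switch_energy": alpha ** 3,      # C * V^2 = alpha * alpha^2
    }


s = dennard_scale(0.5)
print(s)   # {'gate_density': 4.0, 'gate_delay': 0.5, 'switch_energy': 0.125}

# Power density = energy per switch * gates per area / delay stays constant.
print(s["switch_energy"] * s["gate_density"] / s["gate_delay"])   # 1.0
```

The constant power density in the last line is what made the scaling so attractive while supply voltage could keep dropping.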


Three Eras of CMOS Arithmetic Design

Getting Going
• Area constrained

Party time!
• Performance constrained

The hangover
• Power constrained

GETTING STARTED


Life in the 80’s

Just learning how to design complex chips
• Chips had 100K transistors
• Almost no CAD tools
• Worried about getting the design done
• And getting all the functions to fit on chip

Getting the design to fit was job one
• Getting it to go fast was job two

Main Effects on Arithmetic Circuits

Merged Function Blocks
• ALU

[Figure: ALU built around a lookup table; labels: Lookup Table, A, B, P, F]

Precharge Logic

[Figure: three gate styles side by side: Pseudo-nMOS, CMOS, Pre-Charge. Labels: inputs A and B, dual pMOS network, static current, device widths W, W, 2W, 2W. Timing: precharge and evaluate phases; non-overlapping (good, but not always possible)]

Carry Chains, and Carry Skip Adders

[Figure: four carry stages (C0 to C3) between Cin and Cout, with a skip path controlled by P0*P1*P2*P3. Sum logic labels: PG, Cin=0 XOR, Cin=1 XOR, Mux]
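A behavioral sketch of the carry-skip idea, assuming 4-bit blocks as in the figure: when every bit in a block propagates (P0*P1*P2*P3 is true), the block's carry-out is simply its carry-in, so hardware can bypass the ripple path. The function and parameter names are illustrative only.

```python
def carry_skip_add(a: int, b: int, width: int = 16, block: int = 4, cin: int = 0) -> int:
    """Add two unsigned integers block by block, modelling the skip path."""
    carry = cin
    result = 0
    for base in range(0, width, block):
        block_in = carry
        all_propagate = 1
        for i in range(base, min(base + block, width)):
            ai = (a >> i) & 1
            bi = (b >> i) & 1
            p = ai ^ bi
            g = ai & bi
            result |= (p ^ carry) << i
            carry = g | (p & carry)   # ripple path inside the block
            all_propagate &= p
        if all_propagate:
            # Skip path: the block's carry-out equals its carry-in.
            carry = block_in
    return result | (carry << width)


print(carry_skip_add(0x0FFF, 0x0001))   # 4096
```

In software the skip is functionally redundant, since the ripple path produces the same carry. In hardware it shortens the worst-case delay because the carry does not need to ripple through fully propagating blocks.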


Iterating Structures

Main processor
• Just use instructions (micro-code) and ALU

Co-processor
• Also used iterating structures
• But built these structures for multiply or division
• Often asynchronous

MIPS R3010 Multiplier

Clocked by internal oscillator, not external clock

[Figure: multiplier datapath; labels: CSA, CSA]

A Self-Timed Pipeline

[Figure: four-stage self-timed pipeline. Each stage has dual-rail inputs (in+, in-) and outputs (out+, out-), a local clk, and a C-element (C). Control nodes: 1 1 1 1 1. All four stages in precharge]

A Self-Timed Pipeline

Data enters at the far left and the NOR gate flips
• This activates the C-element
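The slides do not give the C-element circuit, so the following is only the standard textbook behavior of a Muller C-element: the output changes when both inputs agree and otherwise holds its previous value. The class name is an illustrative choice.

```python
class CElement:
    """Muller C-element: output follows the inputs when they agree, else holds."""

    def __init__(self, initial: int = 0):
        self.out = initial

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.out = a   # both inputs agree: output takes their value
        return self.out    # inputs disagree: output keeps its old state


c = CElement()
print(c.update(1, 0))   # 0: holds, only one input has arrived
print(c.update(1, 1))   # 1: both inputs high, output rises
print(c.update(0, 1))   # 1: holds until both inputs fall
print(c.update(0, 0))   # 0: both inputs low, output falls
```

This hold behavior is what lets each stage wait for both its data and its neighbor's state before switching between precharge and evaluate.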

[Figure: four-stage self-timed pipeline. Control nodes: 0 1 1 1 1. All four stages in precharge]

A Self-Timed Pipeline

First logic block goes into evaluate

[Figure: four-stage self-timed pipeline. Control nodes: 0 1 1 1 1. First stage in evaluate, remaining three in precharge]

A Self-Timed Pipeline

[Figure: four-stage self-timed pipeline. Control nodes: 0 0 1 1 1. First stage in evaluate, remaining three in precharge]

A Self-Timed Pipeline

Second block goes into evaluate
Primary inputs are deasserted, flipping the first NOR gate

[Figure: four-stage self-timed pipeline. Control nodes: 1 0 1 1 1. First two stages in evaluate, last two in precharge]

Division - SRT

Ted Williams
• Completely self-timed

PARTY TIME!


Performance, Performance, Performance

Scaling provided
• Enough transistors
• Low energy, and fast gates

Goal was to find the fastest structures
• Lots of dual rail domino logic
• Started to build full array/trees
• Many of the trees were regular (4:2 adder) for designer sanity

Ling Adder Implementation

Sam Naffziger (HP, 1996) presented a 64b adder
• 7 FO4 delay (< 1 ns): pretty darn fast
• 0.5 µm CMOS

From VLSI lecture notes in early 2000s

Kogge Stone Adders

[Figure: 64-bit prefix tree from inputs cin, (g0, t0) through (g62, t62), via levels H4/I4 and H16/I16, to H64 outputs]
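A behavioral sketch of a Kogge-Stone parallel-prefix adder, using plain generate/propagate signals rather than the Ling-style H/I signals labelled in the figure. The function name and the word-level bit tricks are choices made here for brevity, not taken from the slides.

```python
def kogge_stone_add(a: int, b: int, width: int = 64, cin: int = 0) -> int:
    """Add two unsigned `width`-bit integers with a Kogge-Stone prefix carry tree."""
    mask = (1 << width) - 1
    p = (a ^ b) & mask            # per-bit propagate
    g = (a & b) & mask            # per-bit generate
    g |= p & cin                  # fold the carry-in into bit 0's generate
    group_g, group_p = g, p
    dist = 1
    while dist < width:           # about log2(width) levels
        # Combine each group with the group `dist` bits below it.
        group_g |= group_p & (group_g << dist) & mask
        group_p &= (group_p << dist) & mask
        dist *= 2
    carries = ((group_g << 1) | cin) & mask   # carry into each bit position
    total = p ^ carries
    carry_out = (group_g >> (width - 1)) & 1
    return total | (carry_out << width)


print(kogge_stone_add(2**64 - 1, 1) == 2**64)   # True
```

Each pass of the loop corresponds to one level of the tree, so a 64-bit add needs six combine levels instead of a 64-step ripple.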


Alignment Shifter

Build full shifter


Even Fuse Multiplier and Adder Together

IBM Power 6 FMA

• 5 GHz 7-stage in 65nm

• Dependent unrounded results forwarded making dependent latency 6 cycles instead of 7

• (6,6,7) design


Life Was Good, For a While

http://cpudb.stanford.edu/

THE HANGOVER


But You Have to Pay Eventually

http://cpudb.stanford.edu/

The Power Limit

[Figure: power density in Watts/mm²]

http://cpudb.stanford.edu/


Power Increased Because We Were Greedy

[Figure annotations: 10x too large; Clever]

http://cpudb.stanford.edu/

This Power Problem Is Not Going Away: P = C * Vdd² * f

http://cpudb.stanford.edu/

[Figure annotation: L^0.6]
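A quick numeric sketch of the switching-power formula shows why supply voltage matters so much. The capacitance, voltage, and frequency values are made-up round numbers for illustration, and the function name is hypothetical.

```python
def dynamic_power_watts(c_farads: float, vdd_volts: float, f_hz: float) -> float:
    """Switching power P = C * Vdd^2 * f."""
    return c_farads * vdd_volts ** 2 * f_hz


base = dynamic_power_watts(1e-9, 1.0, 2e9)     # 1 nF switched at 1.0 V, 2 GHz
scaled = dynamic_power_watts(1e-9, 0.7, 2e9)   # same chip with Vdd lowered to 0.7 V
print(scaled / base)                           # about 0.49
```

Because power goes with the square of Vdd, a supply voltage that stops scaling with feature size leaves frequency and switched capacitance as the only remaining knobs.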


Think About It


32 bit CMOS Adder Design Space

[Figure: energy in pJ (1 to 10) vs. delay in 100ps (1 to 10) for 32-bit adders. Legend: dual rail Sklansky; Ling; static Sklansky; domino Sklansky; Ling w/ 2bit sum select]

Performance Metrics

Normally think of delay of unit
• But that only matters if there is a dependent op

Many applications have many non-dependent ops
• These are throughput based systems
• Adding units improves performance

The Rise of Multi-Core Processors

http://cpudb.stanford.edu/

The Stagnation of Multi-Core Processors

http://cpudb.stanford.edu/

Throughput Based Designs

For applications with abundant parallelism
• Leveraging parallelism helps energy efficiency

But when do you stop?
• Lower performance is almost always lower energy

Minimum energy designs
• Sea of very slow processors
• Meters of silicon area

What to optimize?

Optimize Energy/Op vs. Area/Throughput


Floating Point Optimization
180nm – ITRS 10nm

In This Space the Details Matter

Implementation of Booth Mux
• More important than whether Booth 2, or Booth 3

How you wire the CSA array
• Is more important than the type of counter

Most fancy adder tricks
• Produce worse designs

Built an FP Generator in 2013

https://sites.google.com/a/stanford.edu/fpgen/home

FMA Output


CMA vs. FMA

[Figure panels: For Latency; For Throughput]

Have A Shiny Ball, Now What?


Today FP Units are Not the Problem

[Figure labels: 8 cores; L1/reg/TLB; L2; L3]

Rough Energy Numbers (45nm)

Integer
  Add      8 bit: 0.03 pJ    32 bit: 0.1 pJ
  Mult     8 bit: 0.2 pJ     32 bit: 3 pJ

FP
  FAdd     16 bit: 0.4 pJ    32 bit: 0.9 pJ
  FMult    16 bit: 1 pJ      32 bit: 4 pJ

Memory
  Cache (64 bit)    8 KB: 10 pJ    32 KB: 20 pJ    1 MB: 100 pJ
  DRAM              1.3-2.6 nJ

Instruction Energy Breakdown (Add): 70 pJ
  I-Cache access: 25 pJ
  Register file access: 6 pJ
  Control
  Add
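A few ratios computed from these numbers make the point concrete. The dictionary keys and the helper `ratio` are names invented for this sketch, and it reads the 70 pJ figure as the total energy of one add instruction.

```python
# Rough 45nm energy numbers from the slide, in picojoules.
ENERGY_PJ = {
    "int_add_32": 0.1,
    "int_mult_32": 3.0,
    "fp_add_32": 0.9,
    "fp_mult_32": 4.0,
    "cache_8kb_64b": 10.0,
    "dram_low": 1300.0,    # 1.3 nJ
    "dram_high": 2600.0,   # 2.6 nJ
}
ADD_INSTRUCTION_PJ = 70.0   # assumed to be the total for one add instruction


def ratio(expensive: str, cheap: str) -> float:
    """How many `cheap` operations cost as much energy as one `expensive` one."""
    return ENERGY_PJ[expensive] / ENERGY_PJ[cheap]


print(ratio("dram_low", "int_add_32"))        # about 13000 adds per DRAM access
print(ratio("cache_8kb_64b", "int_add_32"))   # about 100 adds per 8KB cache access
print(ENERGY_PJ["int_add_32"] / ADD_INSTRUCTION_PJ)   # the add is well under 1% of the instruction
```

On these figures the arithmetic itself is a tiny share of the energy; fetching instructions and moving data dominate.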


What Is Going On Here?

[Figure: energy efficiency (MOPS/mW), scale 0.1 to 10000, for entries 1 to 20, grouped as CPUs, CPUs+GPUs, GP DSPs, and Dedicated. Annotation: ~1000x]

The Truth: It’s More About the Algorithm than the Hardware

[Figure labels: All Algorithms; GPU Alg]

Highly Local Computation Model


Compose These Cores into a Pipeline

Program in space, not time
• Makes building programmable hardware more difficult

Great, But Can A User Program It?

Frankencamera 4

[Figure labels: User code; Cool images]

Goals

Have user code in an image-friendly language
• Language should facilitate writing image/vision processing

Analyze/compile the language for different targets
• CPU / GPU / FPGA

Create not just the hardware bit file
• But also the hardware drivers and application level API

How: Constructors to Encode Domain Knowledge

Encapsulate domain knowledge in the system

Build constructor from lower level constructors

Clean interfaces are critical

Reuse both constructor and most of the configuration file


Halide Language

Language for creating fast image processing apps
Separate algorithm from schedule
Target CPU and GPU

What Halide Does For You

Tiled

Fused

Vectorized

Multithreaded

11x faster
• And not readable

Architecture Template: Stencil Functions and Line Buffers

Stencil functions consume sliding windows of data
• Huge locality

To capture this locality need to buffer a few lines
• Line buffer is the hardware buffer block.
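A software sketch of the line-buffer idea, assuming a 3x3 stencil over a row-by-row pixel stream. Only the last three rows are held at any time, which stands in for the hardware buffer block; the function name and the choice of a weighted-sum stencil are assumptions for illustration.

```python
from collections import deque


def stencil3x3_stream(rows, width, kernel):
    """Yield one output row per input row once three rows have been buffered."""
    lines = deque(maxlen=3)   # line buffer: only the last three rows are kept
    for row in rows:
        lines.append(list(row))
        if len(lines) < 3:
            continue          # not enough rows buffered yet
        out = []
        for x in range(1, width - 1):
            acc = 0
            for dy in range(3):
                for dx in range(3):
                    acc += kernel[dy][dx] * lines[dy][x + dx - 1]
            out.append(acc)
        yield out


image = [[1] * 4 for _ in range(4)]
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(list(stencil3x3_stream(image, 4, box)))   # [[9, 9], [9, 9]]
```

Each input pixel is read from the stream once and reused for up to nine outputs from the buffer, which is the locality the slide refers to.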


Design Flow


Performance Results

Performance compared to Nvidia TK1


Energy Results


Conclusions

Designing the best arithmetic unit depends on:
• Technology and constraints
• Finding the right metrics is critical

Details matter
• Must assess performance/area/energy of your idea
• Generators (procedural knowledge) are a good approach to do this

Key to performance scaling in the future is the memory
• Need applications with high locality