
MOS High Performance Arithmetic

Mark Horowitz

Stanford University
horowitz@ee.stanford.edu

Arithmetic Is Important

2

Then

Now (Tegra K1)

What Is Hard?

9999999 + 1

3

Proof:

4

2nd Gen

1st Gen

And Getting The Data You Need

But we didn’t notice this until much later …

To notice this problem:
• We need many advances in technology

5

3rd Gen – Relays (Z3 1940)

6

http://history-computer.com/ModernComputer/Relays/Zuse.html

A few adds/sec

4th Gen – Tubes (Eniac 1945)

7

5000 Adds/sec

5th Gen – Transistors (TX-0, Transistor 1 1953)

8

TX-0

Modern Era – IC, 1961

Image from State of the Art © Stan Augarten

9

Moore’s Law

Number of components on an IC doubles every year
• Later modified to doubling every 18 to 24 months

From Electronics, Volume 38, Number 8, April 19, 1965

10

ECL Computers

11

Microprocessor – MOS Processor (1974)

12

4004

nMOS 1978

13

8086

CMOS 1985 – To Present

14

80386

CMOS (Arithmetic) Design

What makes a good design?

15

Try to Balance 4 Parameters

Area

Performance

Power

Design Time

16

The Good News …

By the time CMOS came along

• There had been a lot of work on arithmetic
• Booth coding
• Wallace trees
• Ling coding
• Tree adders
• Manchester carry chains
• SRT division
• …

17

The Bad News

The best logical design depends on technology

Remember the carry dependency?
• For relays a Manchester carry chain is the best

• All the delay is in changing the relay state

18

[Figure: Manchester carry chain built from per-bit propagate (P0–P2) and generate (G0–G2) signals]

More Bad News

The metrics you are optimizing work in opposition

19

[Figure: energy vs. performance trade-off curve]

Just to Make Life More Complicated

Your metrics change w/ technology scaling

20

Must Use Technology Independent Metrics

Performance
• In terms of an FO4 delay

21

[Figure: FO4 = delay of an inverter driving four copies of itself (1x → 4x → 16x chain); gate delay (ps) vs. technology Ldrawn (µm) at TT, 90% Vdd, 125°C follows roughly 500 ps × Ldrawn in µm]
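As a rough sketch of how this metric is used in practice (the ~500 ps per µm rule of thumb is read off the plot above; the adder numbers below are hypothetical, for illustration only):

    def fo4_delay_ps(l_drawn_um: float) -> float:
        """Approximate fanout-of-4 inverter delay (ps) for a given drawn length."""
        return 500.0 * l_drawn_um  # TT corner, 90% Vdd, 125C, per the plot above

    def delay_in_fo4(delay_ps: float, l_drawn_um: float) -> float:
        """Express an absolute delay in technology-independent FO4 units."""
        return delay_ps / fo4_delay_ps(l_drawn_um)

    # Hypothetical example: a 900 ps adder in a 0.25 um process is ~7.2 FO4,
    # so it stays ~7 FO4 (while getting faster in ps) as the process shrinks.
    print(delay_in_fo4(900.0, 0.25))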

Area and Energy

Area
• Measure linear dimensions in “features”

Energy
• CMOS energy is C · V²
• Normalize by the C · V² of an inverter (sketched below)

22
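A minimal sketch of that normalization, with made-up capacitance and voltage numbers purely for illustration:

    def switching_energy(c_farads: float, vdd_volts: float) -> float:
        """CMOS switching energy E = C * V^2 (joules)."""
        return c_farads * vdd_volts ** 2

    # Hypothetical values: a datapath block switching 2 pF vs. a ~1 fF inverter.
    block_energy = switching_energy(2.0e-12, 1.0)
    inv_energy = switching_energy(1.0e-15, 1.0)

    # Report energy in "inverter switches" so the number survives process changes.
    print(block_energy / inv_energy)  # ~2000 inverter energies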

23

Dennard’s Scaling

The triple play:
• Get more gates: density scales as 1/L²
• Gates get faster: delay (CV/I) scales as L
• Energy per switch: CV² scales as L³

Dennard, JSSC, pp. 256-268, Oct. 1974
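A small sketch of the arithmetic behind the triple play, using the textbook constant-field scaling rules (the scale factor S < 1 is the only input; the 0.7x example is a generic process generation, not a number from the slides):

    def dennard_scale(s: float) -> dict:
        """Constant-field scaling: dimensions, Vdd, and current all scale by s."""
        return {
            "gate density": 1 / s**2,      # more gates per unit area
            "gate delay": s,               # CV/I, with C ~ s, V ~ s, I ~ s
            "energy per switch": s**3,     # CV^2
            "power density": 1.0,          # (energy/delay) * density is unchanged
        }

    print(dennard_scale(0.7))
    # ~2x the gates, ~1.4x faster, ~3x less energy per switch, constant W/mm^2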

Three Eras of CMOS Arithmetic Design

Getting Going• Area constrained

Party time!• Performance Constrained

The hangover• Power Constrained

24

GETTING STARTED

25

Life in the 80’s

Just learning how to design complex chips
• Chips had 100K transistors
• Almost no CAD tools

• Worried about getting the design done
• And getting all the functions to fit on chip

Getting the design to fit was job one
• Getting it to go fast was job two

26

Main Effects on Arithmetic Circuits

Merged Function Blocks
• ALU

27

[Figure: ALU built around a lookup table with inputs A, B, P and output F]

Precharge Logic

28

[Figure: static CMOS (dual pMOS network), pseudo-nMOS (draws static current), and CMOS precharge gates; the precharged gate alternates precharge and evaluate clock phases — non-overlapping phases are good, but not always possible]
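A behavioral sketch of the precharge/evaluate idea — a toy Python model of a dynamic 2-input NAND node (my illustration, not a circuit from the slide):

    def dynamic_nand(phase: str, a: bool, b: bool, node: bool) -> bool:
        """New state of a precharged NAND output node after one clock phase."""
        if phase == "precharge":
            return True                     # precharge pMOS pulls the node high
        if phase == "evaluate":
            # Series nMOS pull-down discharges the node only when a AND b;
            # otherwise the node simply holds its charge (no static current).
            return node and not (a and b)
        raise ValueError(phase)

    node = dynamic_nand("precharge", a=False, b=False, node=False)  # node -> 1
    node = dynamic_nand("evaluate", a=True, b=True, node=node)      # node -> 0
    print(node)  # False: NAND(1, 1)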

Carry Chains and Carry-Skip Adders

29

[Figure: 4-bit carry-skip block — carry cells C0–C3 ripple from Cin to Cout, with a skip mux controlled by the group propagate P0·P1·P2·P3; each bit cell forms P and G and selects its sum from the Cin = 0 / Cin = 1 XOR terms with a mux]
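A bit-level sketch of the carry-skip idea in Python, using 4-bit blocks as in the figure (written from the standard formulation, so it shows the logic rather than the exact cell above):

    def carry_skip_add(a: int, b: int, width: int = 16, block: int = 4) -> int:
        """Add two width-bit numbers with a ripple carry plus per-block skip mux."""
        carry, result = 0, 0
        for lo in range(0, width, block):
            block_p = True
            c = carry                       # carry entering this block
            for i in range(lo, lo + block):
                ai, bi = (a >> i) & 1, (b >> i) & 1
                p, g = ai ^ bi, ai & bi     # propagate, generate
                result |= (p ^ c) << i      # sum bit
                c = g | (p & c)             # ripple within the block
                block_p &= bool(p)
            # Skip mux: if every bit propagates (P0*P1*P2*P3), the block's
            # carry-in bypasses the ripple chain straight to its carry-out.
            carry = carry if block_p else c
        return result & ((1 << width) - 1)

    print(carry_skip_add(0x1234, 0x0FFF) == (0x1234 + 0x0FFF) & 0xFFFF)  # True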

Iterating Structures

Main processor
• Just use instructions (microcode) and the ALU

Co-processor
• Also used iterating structures
• But built these structures for multiply or division
• Often asynchronous

30

MIPS R3010 Multiplier

Clocked by internal oscillator, not external clock

31

[Figure: multiplier datapath built from carry-save adder (CSA) stages]

A Self-Timed Pipeline

[Figure: self-timed pipeline of precharged dual-rail stages (in+/in−, out+/out−); a completion-detect NOR on each stage feeds a C-element that drives the stage's precharge/evaluate control (clk); all NOR outputs start at 1 (empty)]

32

A Self-Timed Pipeline

Data enters at the far left and the NOR gate flips
• This activates the C-element

[Figure: same pipeline; the first NOR output has flipped to 0]

33

A Self-Timed Pipeline

First logic block goes into evaluate

[Figure: same pipeline; the first stage is now in evaluate]

34

A Self-Timed Pipeline

[Figure: same pipeline; the first stage's dual-rail outputs have fired, flipping the second NOR to 0]

35

A Self-Timed Pipeline

Second block goes into evaluate
Primary inputs are deasserted, flipping the first NOR gate

[Figure: same pipeline; the second stage is in evaluate and the first NOR has returned to 1]

36
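The two primitives these slides lean on, modelled behaviorally in Python (a rough sketch written from the description above, not the actual circuit):

    def completion_nor(out_plus: bool, out_minus: bool) -> bool:
        """Dual-rail completion detect: 1 while the stage is empty (both rails
        low), 0 once either rail has fired and the stage holds data."""
        return not (out_plus or out_minus)

    def c_element(a: bool, b: bool, prev: bool) -> bool:
        """Muller C-element: the output follows the inputs when they agree,
        otherwise it holds its previous value."""
        return a if a == b else prev

    # The step on slide 33: data arrives on the first stage's dual-rail inputs,
    # its completion NOR flips 1 -> 0, and that change "activates" the
    # C-element controlling the stage's precharge/evaluate line.
    print(completion_nor(False, False))            # 1: stage still empty
    print(completion_nor(True, False))             # 0: data has arrived
    print(c_element(a=False, b=True, prev=True))   # inputs disagree -> holds
    print(c_element(a=False, b=False, prev=True))  # inputs agree -> follows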

Division - SRT

Ted Williams
• Completely self-timed

37

PARTY TIME!

38

Performance, Performance, Performance

Scaling provided
• Enough transistors
• Low energy and fast gates

Goal was to find the fastest structures
• Lots of dual-rail domino logic
• Started to build full array/trees

• Many of the trees were regular (4:2 adders, sketched below) for designer sanity

39
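For reference, a behavioral sketch of the 4:2 compressor those regular trees are built from (standard two-full-adder formulation, in Python for illustration):

    def full_adder(a: int, b: int, c: int) -> tuple:
        s = a ^ b ^ c
        carry = (a & b) | (a & c) | (b & c)
        return s, carry

    def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int) -> tuple:
        """4:2 compressor: five input bits -> sum, carry (weight 2), and a
        carry-out that feeds the neighboring column's cin."""
        s1, cout = full_adder(x1, x2, x3)
        total, carry = full_adder(s1, x4, cin)
        return total, carry, cout

    # Sanity check: the outputs always encode the arithmetic sum of the inputs.
    for bits in range(32):
        x1, x2, x3, x4, cin = ((bits >> i) & 1 for i in range(5))
        s, c, co = compressor_4_2(x1, x2, x3, x4, cin)
        assert x1 + x2 + x3 + x4 + cin == s + 2 * (c + co)
    print("4:2 compressor check passed")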

Ling Adder Implementation

Sam Naffziger (HP, 1996) presented a 64b adder
• 7 FO4 delay (< 1 ns): pretty darn fast
• 0.5 µm CMOS

From VLSI lecture notes in the early 2000s

40

Kogge Stone Adders

41

[Figure: 64-bit Kogge-Stone prefix tree — (g0, t0) through (g62, t62) plus cin combine through H4/I4 and H16/I16 levels into the final H64 pseudo-carries]
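A compact sketch of the Kogge-Stone structure in Python — a plain (g, p) parallel-prefix carry tree rather than the Ling (g, t) form in the figure — just to show the log-depth combine pattern:

    def kogge_stone_add(a: int, b: int, width: int = 64, cin: int = 0) -> int:
        g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate
        p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate
        G, P = g[:], p[:]
        stride = 1
        while stride < width:                   # log2(width) prefix levels
            newG, newP = G[:], P[:]
            for i in range(stride, width):
                newG[i] = G[i] | (P[i] & G[i - stride])   # combine with i-stride
                newP[i] = P[i] & P[i - stride]
            G, P, stride = newG, newP, stride * 2
        # Carry into bit i comes from the prefix over bits [0, i-1] plus cin.
        carries = [cin] + [G[i] | (P[i] & cin) for i in range(width - 1)]
        result = sum((p[i] ^ carries[i]) << i for i in range(width))
        return result & ((1 << width) - 1)

    x, y = 0x0123456789ABCDEF, 0x1111111111111111
    print(kogge_stone_add(x, y) == (x + y) & (2**64 - 1))  # True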

Alignment Shifter

Build full shifter

42

Even Fuse Multiplier and Adder Together

IBM Power 6 FMA

• 5 GHz 7-stage in 65nm

• Dependent unrounded results are forwarded, making dependent latency 6 cycles instead of 7

• (6,6,7) design

43

Life Was Good, For a While

44
http://cpudb.stanford.edu/

THE HANGOVER

45

But You Have to Pay Eventually

46
http://cpudb.stanford.edu/

The Power Limit

47
http://cpudb.stanford.edu/

[Figure: power density in Watts/mm²]

48

Power Increased Because We Were Greedy

[Figure: cpudb power/frequency trend, annotated “10x too large” and “Clever”]

http://cpudb.stanford.edu/

This Power Problem Is Not Going Away: P = C · Vdd² · f

49
http://cpudb.stanford.edu/

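A tiny numerical sketch of what P = C · Vdd² · f implies (illustrative numbers, not from the plot):

    def dynamic_power(c_farads: float, vdd: float, f_hz: float) -> float:
        """Dynamic power P = C * Vdd^2 * f (watts)."""
        return c_farads * vdd ** 2 * f_hz

    # Hypothetical chip switching 1 nF per cycle. Frequency tracks Vdd roughly
    # linearly near nominal, so buying clock rate with voltage costs ~Vdd^3.
    base = dynamic_power(1e-9, 1.0, 2.0e9)     # 2.0 W at 1.0 V, 2.0 GHz
    pushed = dynamic_power(1e-9, 1.2, 2.4e9)   # ~3.5 W at 1.2 V, 2.4 GHz
    print(pushed / base)                        # ~1.7x the power for 1.2x the speed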

Think About It

50

32 bit CMOS Adder Design Space

51

[Figure: energy (pJ) vs. delay (units of 100 ps) for 32-bit adder implementations — static Sklansky, domino Sklansky, dual-rail Sklansky Ling, and Ling with 2-bit sum select]

Performance Metrics

Normally we think of the delay of a unit
• But that only matters if there is a dependent op

Many applications have many non-dependent ops
• These are throughput-based systems
• Adding units improves performance

52

The Rise of Multi-Core Processors

http://cpudb.stanford.edu/
53

The Stagnation of Multi-Core Processors

http://cpudb.stanford.edu/
54

Throughput Based Designs

For applications with abundant parallelism
• Leveraging parallelism helps energy efficiency

But when do you stop?
• Lower performance is almost always lower energy

Minimum energy designs:
• A sea of very slow processors
• Meters of silicon area

What to optimize?

55

Optimize Energy/Op vs. Area/Throughput

56

Floating Point Optimization: 180nm – ITRS 10nm

57

In This Space the Details Matter

Implementation of the Booth mux
• More important than whether you use Booth-2 or Booth-3

How you wire the CSA array
• Is more important than the type of counter

Most fancy adder tricks
• Produce worse designs

58

Built an FP Generator in 2013

59

https://sites.google.com/a/stanford.edu/fpgen/home

FMA Output

60

CMA vs. FMA

61

For Latency For Throughput

Have A Shiny Ball, Now What?

62

Today FP Units are Not the Problem

[Figure: die photo labeled with 8 cores, L1/reg/TLB, L2, and L3]

63

Rough Energy Numbers (45nm)

Integer
• Add: 8 bit 0.03 pJ, 32 bit 0.1 pJ
• Mult: 8 bit 0.2 pJ, 32 bit 3 pJ

FP
• FAdd: 16 bit 0.4 pJ, 32 bit 0.9 pJ
• FMult: 16 bit 1 pJ, 32 bit 4 pJ

Memory
• Cache (64-bit access): 8 KB 10 pJ, 32 KB 20 pJ, 1 MB 100 pJ
• DRAM: 1.3–2.6 nJ

Instruction Energy Breakdown (~70 pJ per instruction)
• I-Cache access: 25 pJ
• Register file access: 6 pJ
• Control
• Add

64
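A quick back-of-the-envelope using the 45 nm numbers above, which is the point of the next few slides: the add itself is a rounding error next to the per-instruction overhead.

    add_32b = 0.1         # pJ, the 32-bit integer add itself
    icache_access = 25.0  # pJ, instruction fetch
    regfile_access = 6.0  # pJ, register file read/write
    instruction = 70.0    # pJ, total energy per instruction (incl. control)

    print(f"the add is {100 * add_32b / instruction:.2f}% of the instruction energy")
    print(f"fetch + registers alone are {icache_access + regfile_access:.0f} pJ")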

What Is Going On Here?

[Figure: energy efficiency (MOPS/mW) across chips — CPUs, general-purpose DSPs, and dedicated hardware; dedicated designs are roughly 1000x more efficient than CPUs + GPUs]

65

The Truth: It’s More About the Algorithm than the Hardware

[Figure: the space of all algorithms vs. the subset that maps well to GPUs]

66

Highly Local Computation Model

67

Highly Local Computation Model

68

Highly Local Computation Model

69

Compose These Cores into a Pipeline

Program in space, not time
• Makes building programmable hardware more difficult

70

Great, But Can A User Program It?

[Figure: Frankencamera 4 — user code in, cool images out]

71

Goals

Have user code in an image-friendly language
• The language should facilitate writing image/vision processing

Analyze/compile the language for different targets
• CPU / GPU / FPGA

Create not just the hardware bit file
• But also the hardware drivers and application-level API

72

How: Constructors to Encode Domain Knowledge

Encapsulate domain knowledge in the system

Build constructor from lower level constructors

Clean interfaces are critical

Reuse both constructor and most of the configuration file

73

Halide Language

Language for creating fast image processing apps
Separate algorithm from schedule
Target CPU and GPU

74

What Halide Does For You

Tiled

Fused

Vectorized

Multithreaded

11x faster
• And not readable

75

Architecture Template: Stencil Functions and Line Buffers

Stencil functions consume sliding windows of data
• Huge locality

To capture this locality you need to buffer a few lines
• The line buffer is the hardware block that does this (see the sketch below)

76
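A small Python sketch of the line-buffer idea (my own illustration of the concept, not the generated hardware): keep only the last three image rows on chip and slide a 3x3 window across them.

    from collections import deque

    def blur3x3(rows, width):
        """Stream image rows through a 3-row line buffer, yielding 3x3 box blurs."""
        buf = deque(maxlen=3)               # the line buffer: only 3 rows on chip
        for row in rows:
            buf.append(row)
            if len(buf) < 3:
                continue                    # still filling the buffer
            yield [
                sum(buf[j][x + i] for j in range(3) for i in (-1, 0, 1)) // 9
                for x in range(1, width - 1)
            ]

    image = [[x + 10 * y for x in range(6)] for y in range(5)]
    for out_row in blur3x3(image, width=6):
        print(out_row)                      # one output row per input row after warm-up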

Design Flow

77

Performance Results

Performance compared to Nvidia TK1

78

Energy Results

79

Conclusions

Designing the best arithmetic unit depends on:
• Technology and constraints
• Finding the right metrics is critical

Details matter
• Must assess performance/area/energy of your idea
• Generators (procedural knowledge) are a good approach to do this

Key to performance scaling in the future is the memory
• Need applications with high locality

80