
MOS High Performance Arithmetic

Mark Horowitz

Stanford University
horowitz@ee.stanford.edu

Arithmetic Is Important

2

Then

Now (Tegra K1)

What Is Hard?

9999999 + 1

3

Proof:

4

2nd Gen

1st Gen

And Getting The Data You Need

But we didn’t notice this until much later …

To notice this problem:
• We need many advances in technology

5

3rd Gen – Relays (Z3 1940)

6

http://history-computer.com/ModernComputer/Relays/Zuse.html

A few adds/sec

4th Gen – Tubes (Eniac 1945)

7

5000 Adds/sec

5th Gen – Transistors (TX-0, Transistor 1 1953)

8

TX-0

Modern Era – IC, 1961

Image from State of the Art © Stan Augarten

9

Moore’s Law

Number of components on an IC doubles every year
• Later modified to doubling every 18 to 24 months

From Electronics, Volume 38, Number 8, April 19, 1965

10

ECL Computers

11

Microprocessor – MOS Processor (1974)

12

4004

nMOS 1978

13

8086

CMOS 1985 – To Present

14

80386

CMOS (Arithmetic) Design

What makes a good design?

15

Try to Balance 4 Parameters

Area

Performance

Power

Design Time

16

The Good News …

By the time CMOS came along

• There had been a lot of work on arithmetic
• Booth coding
• Wallace trees
• Ling coding
• Tree adders
• Manchester carry chains
• SRT division
• …

17

The Bad News

The best logical design depends on technology

Remember the carry dependency?
• For relays a Manchester carry chain is the best

• All the delay is in changing the relay state

18

[Figure: Manchester carry chain built from per-bit propagate (P0–P2) and generate (G0–G2) signals]

More Bad News

The metrics you are optimizing work in opposition

19

[Figure: energy vs. performance trade-off curve]

Just to Make Life More Complicated

Your metrics change w/ technology scaling

20

Must Use Technology Independent Metrics

Performance
• In terms of an FO4 delay

21

[Figure: FO4 = delay of an inverter driving four copies of itself (1x → 4x → 16x chain); gate delay (ps) vs. technology Ldrawn (µm) at TT, 90% Vdd, 125°C follows roughly 500 ps × Ldrawn in µm]
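As a rough sketch of how this metric is used in practice (the ~500 ps per µm rule of thumb is read off the plot above; the adder numbers below are hypothetical, for illustration only):

    def fo4_delay_ps(l_drawn_um: float) -> float:
        """Approximate fanout-of-4 inverter delay (ps) for a given drawn length."""
        return 500.0 * l_drawn_um  # TT corner, 90% Vdd, 125C, per the plot above

    def delay_in_fo4(delay_ps: float, l_drawn_um: float) -> float:
        """Express an absolute delay in technology-independent FO4 units."""
        return delay_ps / fo4_delay_ps(l_drawn_um)

    # Hypothetical example: a 900 ps adder in a 0.25 um process is ~7.2 FO4,
    # so it stays ~7 FO4 (while getting faster in ps) as the process shrinks.
    print(delay_in_fo4(900.0, 0.25))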

Area and Energy

Area
• Measure linear dimensions in “features”

Energy
• CMOS energy is C · V²
• Normalize by the C · V² of an inverter (sketched below)

22
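A minimal sketch of that normalization, with made-up capacitance and voltage numbers purely for illustration:

    def switching_energy(c_farads: float, vdd_volts: float) -> float:
        """CMOS switching energy E = C * V^2 (joules)."""
        return c_farads * vdd_volts ** 2

    # Hypothetical values: a datapath block switching 2 pF vs. a ~1 fF inverter.
    block_energy = switching_energy(2.0e-12, 1.0)
    inv_energy = switching_energy(1.0e-15, 1.0)

    # Report energy in "inverter switches" so the number survives process changes.
    print(block_energy / inv_energy)  # ~2000 inverter energies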

23

Dennard’s Scaling

The triple play:
• Get more gates: density scales as 1/L²
• Gates get faster: delay (CV/I) scales as L
• Energy per switch: CV² scales as L³

Dennard, JSSC, pp. 256-268, Oct. 1974
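A small sketch of the arithmetic behind the triple play, using the textbook constant-field scaling rules (the scale factor S < 1 is the only input; the 0.7x example is a generic process generation, not a number from the slides):

    def dennard_scale(s: float) -> dict:
        """Constant-field scaling: dimensions, Vdd, and current all scale by s."""
        return {
            "gate density": 1 / s**2,      # more gates per unit area
            "gate delay": s,               # CV/I, with C ~ s, V ~ s, I ~ s
            "energy per switch": s**3,     # CV^2
            "power density": 1.0,          # (energy/delay) * density is unchanged
        }

    print(dennard_scale(0.7))
    # ~2x the gates, ~1.4x faster, ~3x less energy per switch, constant W/mm^2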

Three Eras of CMOS Arithmetic Design

Getting Going• Area constrained

Party time!• Performance Constrained

The hangover• Power Constrained

24

GETTING STARTED

25

Life in the 80’s

Just learning how to design complex chips
• Chips had 100K transistors
• Almost no CAD tools

• Worried about getting the design done
• And getting all the functions to fit on chip

Getting the design to fit was job one
• Getting it to go fast was job two

26

Main Effects on Arithmetic Circuits

Merged Function Blocks
• ALU

27

[Figure: ALU built around a lookup table with inputs A, B, P and output F]

Precharge Logic

28

[Figure: static CMOS (dual pMOS network), pseudo-nMOS (draws static current), and CMOS precharge gates; the precharged gate alternates precharge and evaluate clock phases — non-overlapping phases are good, but not always possible]
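A behavioral sketch of the precharge/evaluate idea — a toy Python model of a dynamic 2-input NAND node (my illustration, not a circuit from the slide):

    def dynamic_nand(phase: str, a: bool, b: bool, node: bool) -> bool:
        """New state of a precharged NAND output node after one clock phase."""
        if phase == "precharge":
            return True                     # precharge pMOS pulls the node high
        if phase == "evaluate":
            # Series nMOS pull-down discharges the node only when a AND b;
            # otherwise the node simply holds its charge (no static current).
            return node and not (a and b)
        raise ValueError(phase)

    node = dynamic_nand("precharge", a=False, b=False, node=False)  # node -> 1
    node = dynamic_nand("evaluate", a=True, b=True, node=node)      # node -> 0
    print(node)  # False: NAND(1, 1)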

Carry Chains and Carry-Skip Adders

29

[Figure: 4-bit carry-skip block — carry cells C0–C3 ripple from Cin to Cout, with a skip mux controlled by the group propagate P0·P1·P2·P3; each bit cell forms P and G and selects its sum from the Cin = 0 / Cin = 1 XOR terms with a mux]
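A bit-level sketch of the carry-skip idea in Python, using 4-bit blocks as in the figure (written from the standard formulation, so it shows the logic rather than the exact cell above):

    def carry_skip_add(a: int, b: int, width: int = 16, block: int = 4) -> int:
        """Add two width-bit numbers with a ripple carry plus per-block skip mux."""
        carry, result = 0, 0
        for lo in range(0, width, block):
            block_p = True
            c = carry                       # carry entering this block
            for i in range(lo, lo + block):
                ai, bi = (a >> i) & 1, (b >> i) & 1
                p, g = ai ^ bi, ai & bi     # propagate, generate
                result |= (p ^ c) << i      # sum bit
                c = g | (p & c)             # ripple within the block
                block_p &= bool(p)
            # Skip mux: if every bit propagates (P0*P1*P2*P3), the block's
            # carry-in bypasses the ripple chain straight to its carry-out.
            carry = carry if block_p else c
        return result & ((1 << width) - 1)

    print(carry_skip_add(0x1234, 0x0FFF) == (0x1234 + 0x0FFF) & 0xFFFF)  # True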

Iterating Structures

Main processor
• Just use instructions (microcode) and the ALU

Co-processor
• Also used iterating structures
• But built these structures for multiply or division
• Often asynchronous

30

MIPS R3010 Multiplier

Clocked by internal oscillator, not external clock

31

[Figure: multiplier datapath built from carry-save adder (CSA) stages]

A Self-Timed Pipeline

[Figure: self-timed pipeline of precharged dual-rail stages (in+/in−, out+/out−); a completion-detect NOR on each stage feeds a C-element that drives the stage's precharge/evaluate control (clk); all NOR outputs start at 1 (empty)]

32

A Self-Timed Pipeline

Data enters at the far left and the NOR gate flips
• This activates the C-element

[Figure: same pipeline; the first NOR output has flipped to 0]

33

A Self-Timed Pipeline

First logic block goes into evaluate

[Figure: same pipeline; the first stage is now in evaluate]

34

A Self-Timed Pipeline

[Figure: same pipeline; the first stage's dual-rail outputs have fired, flipping the second NOR to 0]

35

A Self-Timed Pipeline

Second block goes into evaluate
Primary inputs are deasserted, flipping the first NOR gate

[Figure: same pipeline; the second stage is in evaluate and the first NOR has returned to 1]

36
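The two primitives these slides lean on, modelled behaviorally in Python (a rough sketch written from the description above, not the actual circuit):

    def completion_nor(out_plus: bool, out_minus: bool) -> bool:
        """Dual-rail completion detect: 1 while the stage is empty (both rails
        low), 0 once either rail has fired and the stage holds data."""
        return not (out_plus or out_minus)

    def c_element(a: bool, b: bool, prev: bool) -> bool:
        """Muller C-element: the output follows the inputs when they agree,
        otherwise it holds its previous value."""
        return a if a == b else prev

    # The step on slide 33: data arrives on the first stage's dual-rail inputs,
    # its completion NOR flips 1 -> 0, and that change "activates" the
    # C-element controlling the stage's precharge/evaluate line.
    print(completion_nor(False, False))            # 1: stage still empty
    print(completion_nor(True, False))             # 0: data has arrived
    print(c_element(a=False, b=True, prev=True))   # inputs disagree -> holds
    print(c_element(a=False, b=False, prev=True))  # inputs agree -> follows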

Division - SRT

Ted Williams
• Completely self-timed

37

PARTY TIME!

38

Performance, Performance, Performance

Scaling provided
• Enough transistors
• Low energy and fast gates

Goal was to find the fastest structures
• Lots of dual-rail domino logic
• Started to build full array/trees

• Many of the trees were regular (4:2 adders, sketched below) for designer sanity

39
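For reference, a behavioral sketch of the 4:2 compressor those regular trees are built from (standard two-full-adder formulation, in Python for illustration):

    def full_adder(a: int, b: int, c: int) -> tuple:
        s = a ^ b ^ c
        carry = (a & b) | (a & c) | (b & c)
        return s, carry

    def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int) -> tuple:
        """4:2 compressor: five input bits -> sum, carry (weight 2), and a
        carry-out that feeds the neighboring column's cin."""
        s1, cout = full_adder(x1, x2, x3)
        total, carry = full_adder(s1, x4, cin)
        return total, carry, cout

    # Sanity check: the outputs always encode the arithmetic sum of the inputs.
    for bits in range(32):
        x1, x2, x3, x4, cin = ((bits >> i) & 1 for i in range(5))
        s, c, co = compressor_4_2(x1, x2, x3, x4, cin)
        assert x1 + x2 + x3 + x4 + cin == s + 2 * (c + co)
    print("4:2 compressor check passed")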

Ling Adder Implementation

Sam Naffziger (HP, 1996) presented a 64b adder
• 7 FO4 delay (< 1 ns): pretty darn fast
• 0.5 µm CMOS

From VLSI lecture notes in the early 2000s

40

Kogge Stone Adders

41

[Figure: 64-bit Kogge-Stone prefix tree — (g0, t0) through (g62, t62) plus cin combine through H4/I4 and H16/I16 levels into the final H64 pseudo-carries]
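A compact sketch of the Kogge-Stone structure in Python — a plain (g, p) parallel-prefix carry tree rather than the Ling (g, t) form in the figure — just to show the log-depth combine pattern:

    def kogge_stone_add(a: int, b: int, width: int = 64, cin: int = 0) -> int:
        g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate
        p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate
        G, P = g[:], p[:]
        stride = 1
        while stride < width:                   # log2(width) prefix levels
            newG, newP = G[:], P[:]
            for i in range(stride, width):
                newG[i] = G[i] | (P[i] & G[i - stride])   # combine with i-stride
                newP[i] = P[i] & P[i - stride]
            G, P, stride = newG, newP, stride * 2
        # Carry into bit i comes from the prefix over bits [0, i-1] plus cin.
        carries = [cin] + [G[i] | (P[i] & cin) for i in range(width - 1)]
        result = sum((p[i] ^ carries[i]) << i for i in range(width))
        return result & ((1 << width) - 1)

    x, y = 0x0123456789ABCDEF, 0x1111111111111111
    print(kogge_stone_add(x, y) == (x + y) & (2**64 - 1))  # True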

Alignment Shifter

Build full shifter

42

Even Fuse Multiplier and Adder Together

IBM Power 6 FMA

• 5 GHz 7-stage in 65nm

• Dependent unrounded results are forwarded, making dependent latency 6 cycles instead of 7

• (6,6,7) design

43

Life Was Good, For a While

44
http://cpudb.stanford.edu/

THE HANGOVER

45

But You Have to Pay Eventually

46
http://cpudb.stanford.edu/

The Power Limit

47
http://cpudb.stanford.edu/

[Figure: power density in Watts/mm²]

48

Power Increased Because We Were Greedy

[Figure: cpudb power/frequency trend, annotated “10x too large” and “Clever”]

http://cpudb.stanford.edu/

This Power Problem Is Not Going Away: P = C · Vdd² · f

49
http://cpudb.stanford.edu/

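A tiny numerical sketch of what P = C · Vdd² · f implies (illustrative numbers, not from the plot):

    def dynamic_power(c_farads: float, vdd: float, f_hz: float) -> float:
        """Dynamic power P = C * Vdd^2 * f (watts)."""
        return c_farads * vdd ** 2 * f_hz

    # Hypothetical chip switching 1 nF per cycle. Frequency tracks Vdd roughly
    # linearly near nominal, so buying clock rate with voltage costs ~Vdd^3.
    base = dynamic_power(1e-9, 1.0, 2.0e9)     # 2.0 W at 1.0 V, 2.0 GHz
    pushed = dynamic_power(1e-9, 1.2, 2.4e9)   # ~3.5 W at 1.2 V, 2.4 GHz
    print(pushed / base)                        # ~1.7x the power for 1.2x the speed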

Think About It

50

32 bit CMOS Adder Design Space

51

[Figure: energy (pJ) vs. delay (units of 100 ps) for 32-bit adder implementations — static Sklansky, domino Sklansky, dual-rail Sklansky Ling, and Ling with 2-bit sum select]

Performance Metrics

Normally we think of the delay of a unit
• But that only matters if there is a dependent op

Many applications have many non-dependent ops
• These are throughput-based systems
• Adding units improves performance

52

The Rise of Multi-Core Processors

http://cpudb.stanford.edu/
53

The Stagnation of Multi-Core Processors

http://cpudb.stanford.edu/
54

Throughput Based Designs

For applications with abundant parallelism
• Leveraging parallelism helps energy efficiency

But when do you stop?
• Lower performance is almost always lower energy

Minimum energy designs:
• A sea of very slow processors
• Meters of silicon area

What to optimize?

55

Optimize Energy/Op vs. Area/Throughput

56

Floating Point Optimization: 180nm – ITRS 10nm

57

In This Space the Details Matter

Implementation of the Booth mux
• More important than whether you use Booth-2 or Booth-3

How you wire the CSA array
• Is more important than the type of counter

Most fancy adder tricks
• Produce worse designs

58

Built an FP Generator in 2013

59

https://sites.google.com/a/stanford.edu/fpgen/home

FMA Output

60

CMA vs. FMA

61

For Latency For Throughput

Have A Shiny Ball, Now What?

62

Today FP Units are Not the Problem

[Figure: die photo labeled with 8 cores, L1/reg/TLB, L2, and L3]

63

Rough Energy Numbers (45nm)

Integer
• Add: 8 bit 0.03 pJ, 32 bit 0.1 pJ
• Mult: 8 bit 0.2 pJ, 32 bit 3 pJ

FP
• FAdd: 16 bit 0.4 pJ, 32 bit 0.9 pJ
• FMult: 16 bit 1 pJ, 32 bit 4 pJ

Memory
• Cache (64-bit access): 8 KB 10 pJ, 32 KB 20 pJ, 1 MB 100 pJ
• DRAM: 1.3–2.6 nJ

Instruction Energy Breakdown (~70 pJ per instruction)
• I-Cache access: 25 pJ
• Register file access: 6 pJ
• Control
• Add

64
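A quick back-of-the-envelope using the 45 nm numbers above, which is the point of the next few slides: the add itself is a rounding error next to the per-instruction overhead.

    add_32b = 0.1         # pJ, the 32-bit integer add itself
    icache_access = 25.0  # pJ, instruction fetch
    regfile_access = 6.0  # pJ, register file read/write
    instruction = 70.0    # pJ, total energy per instruction (incl. control)

    print(f"the add is {100 * add_32b / instruction:.2f}% of the instruction energy")
    print(f"fetch + registers alone are {icache_access + regfile_access:.0f} pJ")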

What Is Going On Here?

[Figure: energy efficiency (MOPS/mW) across chips — CPUs, general-purpose DSPs, and dedicated hardware; dedicated designs are roughly 1000x more efficient than CPUs + GPUs]

65

The Truth: It’s More About the Algorithm than the Hardware

[Figure: the space of all algorithms vs. the subset that maps well to GPUs]

66

Highly Local Computation Model

67

Highly Local Computation Model

68

Highly Local Computation Model

69

Compose These Cores into a Pipeline

Program in space, not time
• Makes building programmable hardware more difficult

70

Great, But Can A User Program It?

[Figure: Frankencamera 4 — user code in, cool images out]

71

Goals

Have user code in an image-friendly language
• The language should facilitate writing image/vision processing

Analyze/compile the language for different targets
• CPU / GPU / FPGA

Create not just the hardware bit file
• But also the hardware drivers and application-level API

72

How: Constructors to Encode Domain Knowledge

Encapsulate domain knowledge in the system

Build constructor from lower level constructors

Clean interfaces are critical

Reuse both constructor and most of the configuration file

73

Halide Language

Language for creating fast image processing apps
Separate algorithm from schedule
Target CPU and GPU

74

What Halide Does For You

Tiled

Fused

Vectorized

Multithreaded

11x faster
• And not readable

75

Architecture Template: Stencil Functions and Line Buffers

Stencil functions consume sliding windows of data
• Huge locality

To capture this locality you need to buffer a few lines
• The line buffer is the hardware block that does this (see the sketch below)

76
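A small Python sketch of the line-buffer idea (my own illustration of the concept, not the generated hardware): keep only the last three image rows on chip and slide a 3x3 window across them.

    from collections import deque

    def blur3x3(rows, width):
        """Stream image rows through a 3-row line buffer, yielding 3x3 box blurs."""
        buf = deque(maxlen=3)               # the line buffer: only 3 rows on chip
        for row in rows:
            buf.append(row)
            if len(buf) < 3:
                continue                    # still filling the buffer
            yield [
                sum(buf[j][x + i] for j in range(3) for i in (-1, 0, 1)) // 9
                for x in range(1, width - 1)
            ]

    image = [[x + 10 * y for x in range(6)] for y in range(5)]
    for out_row in blur3x3(image, width=6):
        print(out_row)                      # one output row per input row after warm-up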

Design Flow

77

Performance Results

Performance compared to Nvidia TK1

78

Energy Results

79

Conclusions

Designing the best arithmetic unit depends on:
• Technology and constraints
• Finding the right metrics is critical

Details matter
• Must assess performance/area/energy of your idea
• Generators (procedural knowledge) are a good approach to do this

Key to performance scaling in the future is the memory
• Need applications with high locality

80