JOHN HENNESSY - DARPA


Transcript of JOHN HENNESSY - DARPA

Page 1: JOHN HENNESSY - DARPA

John Hennessy, Chairman, Alphabet

Page 2: JOHN HENNESSY - DARPA

The End of Moore’s Law & Faster General Purpose Computing, and a New Golden Age

John Hennessy

Stanford University

July 2018

Page 3: JOHN HENNESSY - DARPA

OUTLINE

• What’s happened and why?

• Opportunities

• Research possibilities


Page 4: JOHN HENNESSY - DARPA

The End of an Era

• 40 years of stunning progress in microprocessor design

• 1.4x annual performance improvement for 40+ years ≈ 10^6 x faster (throughput)! (see the sketch below)

• Three architectural innovations:

• Width: 8→16→32→64 bit (~4x)

• Instruction level parallelism:

• 4-10 cycles per instruction to 4+ instructions per cycle (~10-20x)

• Multicore: one processor to 32 cores (~32x)

• Clock rate: 3 MHz to 4 GHz (through technology & architecture)

• Made possible by IC technology:

• Moore’s Law: growth in transistor count

• Dennard Scaling: power/transistor shrinks as speed & density increase

• Energy expended per computation decreases
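A quick sketch of the compounding arithmetic above, in Python, using only the numbers on this slide:

```python
# 1.4x per year, compounded over 40 years, is on the order of 10^6.
annual_gain = 1.4
years = 40
print(f"{annual_gain ** years:.1e}")  # ~7.0e+05, i.e. roughly a million-fold

# Clock rate alone contributed a large factor:
print(f"{4e9 / 3e6:.0f}x")  # 3 MHz -> 4 GHz is ~1333x
```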


Page 5: JOHN HENNESSY - DARPA

THREE CHANGES CONVERGE

• Technology

• End of Dennard scaling: power becomes the key constraint

• Slowdown in Moore’s Law: transistors still cost (even when unused)

• Architectural

• Limitations and inefficiencies in exploiting instruction level parallelism end the uniprocessor era.

• Amdahl’s Law and its implications end the “easy” multicore era

• Application focus shifts

• From the desktop to individual mobile devices, ultrascale cloud computing, and IoT: new constraints.


Page 6: JOHN HENNESSY - DARPA

UNIPROCESSOR PERFORMANCE

(SINGLE CORE)

Performance = highest SPECInt by year; from Hennessy & Patterson [2018].

Page 7: JOHN HENNESSY - DARPA

MOORE’S LAW IN DRAMS

Page 8: JOHN HENNESSY - DARPA

THE TECHNOLOGY SHIFTS

MOORE’S LAW SLOWDOWN IN INTEL PROCESSORS

Cost/transistor is slowing down faster, due to fab costs.

[Chart: Intel transistor density (log scale) vs. Moore’s Law (1975 version), 1974-2014.]

Page 9: JOHN HENNESSY - DARPA

TECHNOLOGY, ENERGY, AND

DENNARD SCALING

[Chart: technology feature size (nm) and relative power per nm^2, 2000-2020. Power consumption based on models in Esmaeilzadeh [2011].]

Energy scaling for a fixed task is better, since there are more & faster transistors.

Page 10: JOHN HENNESSY - DARPA

ENERGY EFFICIENCY IS THE NEW METRIC

Battery lifetime determines effectiveness!

LCD is the biggest; the CPU is close behind. “Always on” assistants are likely to increase CPU demand.


Page 11: JOHN HENNESSY - DARPA

AND IN THE CLOUD

Capital Costs

• Shell and land

• Power + cooling

• Servers

• Networking equipment

Effective Operating Cost (with amortization)

• Amortized servers

• Amortized network

• Power infrastructure + electricity

• Other amortized infrastructure

• Staff

Page 12: JOHN HENNESSY - DARPA

END OF DENNARD SCALING IS A CRISIS

• Energy consumption has become more important to users

• For mobile, IoT, and large clouds

• Processors have reached their power limit

• Thermal dissipation is maxed out (chips turn off to avoid overheating!)

• Even with better packaging: heat and battery are limits.

• Architectural advances must increase energy efficiency

• Reduce power or improve performance for same power

• But, the dominant architectural techniques have reached limits in energy efficiency!

• 1982-2005: Instruction level parallelism

• Compiler and processor find parallelism

• 2005-2017: Multicore

• Programmer identifies parallelism

• Caches: diminishing returns (small incremental improvements).


Page 13: JOHN HENNESSY - DARPA

Instruction Level Parallelism Era

1982-2005

• Instruction level parallelism achieves significant performance advantages

• Pipelining: 5 stages to 15+ stages to allow faster clock rates (energy cost neutralized by Dennard scaling)

• Multiple issue: <1 instruction/clock to 4+ instructions/clock

• Significant increase in transistors to increase issue rate

• Why did it end?

• Diminishing returns in efficiency


Page 14: JOHN HENNESSY - DARPA

Getting More ILP

• Branches and memory aliasing are a major limit:

• 4 instructions/clock × a 15-deep pipeline means more than 60 instructions “in flight”

• Speculation was introduced to allow this

• Speculation involves predicting program behavior

• Predict branches & predict matching memory addresses

• If the prediction is accurate, execution proceeds

• If the prediction is inaccurate, undo the work and restart

• How good must branch prediction be? Very good! (arithmetic checked in the sketch below)

• 15-deep pipeline: ~4 branches in flight; a 94% chance that all are correct requires ≈98.7% per-branch accuracy

• 60 instructions in flight: ~15 branches; a 90% chance requires ≈99% per-branch accuracy
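A minimal sketch of the arithmetic behind these figures, assuming (as a simplification) that branch predictions are independent:

```python
# If n branches are in flight and each is predicted correctly with
# probability p, then all n are correct with probability p**n.
# Inverting: the per-branch accuracy needed for a target success rate.
def per_branch_accuracy(all_correct: float, branches: int) -> float:
    return all_correct ** (1.0 / branches)

print(per_branch_accuracy(0.94, 4))   # ~0.985, roughly the 98.7% above
print(per_branch_accuracy(0.90, 15))  # ~0.993, the ~99% above
```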


Page 15: JOHN HENNESSY - DARPA

INTEL CORE I7: Theoretical CPI = 0.25

Achieved CPI

[Chart: achieved CPI by benchmark, on a scale of 0 to 3.]

Data collected by Professor Lu Peng and student Ying Zhang at LSU.

Page 16: JOHN HENNESSY - DARPA

WASTED WORK ON THE INTEL CORE I7

[Chart: work wasted as a fraction of total work, by benchmark, on a scale of 0% to 40%.]

Data collected by Professor Lu Peng and student Ying Zhang at LSU.

Page 17: JOHN HENNESSY - DARPA

The Multicore Era

2005 - 2017

• Make the programmer responsible for identifying parallelism via threads

• Exploit the threads on multiple cores

• Increase cores if more transistors: easy scaling!

• Energy ≈ Transistor count ≈ Active cores

• So, we need Performance ≈ Active cores

• But, Amdahl’s Law says that this is highly unlikely (see the sketch below)
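A minimal sketch of Amdahl’s Law, the formula plotted on the next slide: with serial fraction s, the speedup on n processors is 1 / (s + (1 - s) / n).

```python
# Amdahl's Law: speedup is capped by the serial fraction s.
def amdahl_speedup(s: float, n: int) -> float:
    return 1.0 / (s + (1.0 - s) / n)

for s in (0.01, 0.02, 0.04):
    print(f"serial {s:.0%}: 64 cores -> {amdahl_speedup(s, 64):.1f}x")
# Even 1% serial time caps 64 cores at ~39x, far short of 64x.
```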


Page 18: JOHN HENNESSY - DARPA

AMDAHL’S LAW LIMITS PERFORMANCE GAINS FROM PARALLEL PROCESSING

[Chart: speedup vs. processor count (up to 64) for serial fractions of 1%, 2%, 4%, 6%, 8%, and 10%. Title: Speedup versus % “Serial” Processing Time.]

Page 19: JOHN HENNESSY - DARPA

SECOND CHALLENGE TO HIGHER PERFORMANCE FROM MULTICORE: END OF

DENNARD SCALING

• End of Dennard scaling means multicore scaling ends

• Full scaling will mean “dark silicon,” with cores OFF.

• Example

• Today: 22nm process, largest multicore

• Intel E7-8890: 24-core, 2.2 GHz, TDP = 165W (power limited)

• In 2019/2020: 11 nm process yields

• 96 cores @ 4.9 GHz

• Expected power consumption would be 295W

Power Limit   Active Cores

165 W         54/96

180 W         59/96

200 W         65/96
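A back-of-envelope sketch consistent with the table above, assuming (my simplification, not the talk’s model) that the projected 295 W is split evenly across the 96 cores:

```python
# If 96 cores flat out would draw ~295 W, a cap of P watts supports
# roughly P / (295 / 96) active cores.
PER_CORE_W = 295 / 96  # ~3.07 W per core (even-split assumption)

for cap_w in (165, 180, 200):
    print(f"{cap_w} W -> {round(cap_w / PER_CORE_W)}/96 active cores")
# 165 W -> 54/96, 180 W -> 59/96, 200 W -> 65/96, matching the table
```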

Page 20: JOHN HENNESSY - DARPA

PUTTING THE CHALLENGES TOGETHER

DENNARD SCALING + AMDAHL’S LAW

[Chart: speedup vs. processor count (up to 100) under combined power limits and Amdahl’s Law, for serial fractions of 1%, 2%, and 4%. Title: Speedup versus % “Serial” Processing Time.]

Page 21: JOHN HENNESSY - DARPA

What OPPORTUNITIES Are Left?

▪ SW-centric

- Modern scripting languages are interpreted, dynamically typed, and encourage reuse

- Efficient for programmers but not for execution

▪ HW-centric

- The only path left is Domain Specific Architectures

- Just do a few tasks, but do them extremely well

▪ Combination

- Domain Specific Languages & Architectures


Page 22: JOHN HENNESSY - DARPA

WHAT’S THE OPPORTUNITY?

Matrix Multiply: relative speedup over a Python version (18-core Intel)

from: “There’s Plenty of Room at the Top,” Leiserson et al., to appear.

[Chart: successive optimizations give speedups of 50X, 7X, 20X, and 9X, compounding to 63,000X over the Python baseline.]
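An illustrative sketch of the gap the chart quantifies: the same matrix multiply as interpreted Python loops versus a call into optimized BLAS via NumPy. Timings vary by machine, and the 63,000X figure comes from the cited paper, not from this snippet:

```python
import time
import numpy as np

n = 128
A = np.random.rand(n, n)
B = np.random.rand(n, n)

def matmul_python(X, Y):
    """Naive triple-loop matrix multiply on Python lists."""
    size = len(X)
    Z = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            acc = 0.0
            for k in range(size):
                acc += X[i][k] * Y[k][j]
            Z[i][j] = acc
    return Z

t0 = time.perf_counter()
matmul_python(A.tolist(), B.tolist())
t1 = time.perf_counter()
A @ B  # dispatches to an optimized, vectorized BLAS routine
t2 = time.perf_counter()
print(f"pure Python: {t1 - t0:.4f}s  NumPy/BLAS: {t2 - t1:.6f}s")
```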


Page 23: JOHN HENNESSY - DARPA

DOMAIN SPECIFIC ARCHITECTURES (DSAS)

• Achieve higher efficiency by tailoring the architecture to the characteristics of the domain

• Not one application, but a domain of applications (different from a strict ASIC)

• Requires more domain-specific knowledge than general purpose processors need

• Design DSAs and processors for targeted environments

• More variability than in GP processors

• Examples:

• Neural network processors for machine learning

• GPUs for graphics, virtual reality

• Some good news: demand for higher performance focused on such domains


Page 24: JOHN HENNESSY - DARPA

WHY DSAS CAN WIN (NO MAGIC)

TAILOR THE ARCHITECTURE TO THE DOMAIN

• More effective use of parallelism for a specific domain:

• SIMD vs. MIMD

• VLIW vs. Speculative, out-of-order

• More effective use of memory bandwidth

• User controlled versus caches

• Processor + memory structures versus traditional

• Eliminate unneeded accuracy

• IEEE replaced by lower precision FP

• 32-bit/64-bit integers to 8-16 bits (sketched below)

• Domain specific programming model matches application to the processor architecture
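A minimal sketch of “eliminate unneeded accuracy”: generic symmetric 8-bit quantization of floating-point values. This illustrates the idea only; it is not any particular accelerator’s scheme:

```python
# Map floats into 8-bit integers [-127, 127] with one shared scale,
# trading precision for 4x less storage and cheaper arithmetic.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.82, -0.41, 0.05, -1.20]
q, s = quantize_int8(weights)
print(q)                 # [87, -43, 5, -127]: 8 bits each instead of 32
print(dequantize(q, s))  # close to the originals, small rounding error
```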


Page 25: JOHN HENNESSY - DARPA

DOMAIN SPECIFIC LANGUAGES

DSAs require targeting high-level operations to the architecture

● Hard to start with a C- or Python-like language and recover structure

● Need matrix, vector, or sparse matrix operations

● Domain Specific Languages specify these operations:

○ OpenGL, TensorFlow, P4

● If DSL programs retain architecture independence, interesting compiler challenges will exist (see the sketch below):

○ XLA

“XLA - TensorFlow, Compiled”, XLA Team, March 6, 2017
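A hedged sketch of the point: expressing computation as whole-tensor DSL operations (TensorFlow here, one of the DSLs named above) preserves structure that a compiler such as XLA can map onto a DSA. The jit_compile=True flag is current TF 2.x usage, assumed here for illustration rather than taken from the talk:

```python
import tensorflow as tf

@tf.function(jit_compile=True)  # ask XLA to compile this function
def dense_layer(x, w, b):
    # Whole-tensor ops keep the matrix structure visible to the compiler,
    # unlike equivalent element-by-element Python loops.
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([8, 128])
w = tf.random.normal([128, 64])
b = tf.zeros([64])
print(dense_layer(x, w, b).shape)  # (8, 64)
```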


Page 26: JOHN HENNESSY - DARPA

Deep learning is causing a machine learning revolution

From “A New Golden Age in Computer Architecture: Empowering the Machine-Learning Revolution.” Dean, J., Patterson, D., & Young, C. (2018). IEEE Micro, 38(2), 21-29.


Page 27: JOHN HENNESSY - DARPA

RESEARCH OPPORTUNITIES: DSAS

• DSAs: especially those focused on areas outside of industry’s focus

• National defense and security applications

• Computational materials science and chemistry

• Coevolution of DSLs and DSAs:

• Optimizing the mapping to a DSA.

• Portability (HW independence) and performance


Page 28: JOHN HENNESSY - DARPA

Research Opportunity: Cheaper & Faster

Hardware Development

▪ Goal: make HW development more like software:

▪ Prototyping, reuse, abstraction

▪ New CAD tools

▪ Open HW stacks (ISA to IP libraries)

▪ Fast prototyping: FPGAs, small chips, etc.

▪ Role of ML in CAD?


Page 29: JOHN HENNESSY - DARPA

Research Opportunity: New Technology

▪ Silicon:

▪ Extend Dennard scaling and Moore’s Law

▪ New methods for efficient energy scaling

▪ Secure supply chains

▪ Packaging:

▪ Overcome TDP limits for high end

▪ Tighter integration = more performance & less power

▪ Integrated III-Vs for optical interconnect.

▪ Beyond Si:

▪ Carbon nanotubes?

▪ Quantum?


Page 30: JOHN HENNESSY - DARPA

CONCLUDING THOUGHTS:

EVERYTHING OLD IS NEW AGAIN

• Dave Kuck, software architect for Illiac IV (circa 1975)

“What I was really frustrated about was the fact that, with Illiac IV, programming the machine was very difficult and the architecture probably was not very well suited to some of the applications we were trying to run. The key idea was that I did not think we had a very good match in Illiac IV between applications and architecture.”

• Achieving cost-performance in this era of DSAs will require matching the applications, languages, and architecture, and reducing design cost.

• Information technology (from computing to electronics) is the most important economic and security asset for any nation: combine SW, HW, and creativity to compete by running faster.

Dave Kuck, ACM Oral History
