INVITED PLENARY TALK FOR VLSI TECHNOLOGY SYMPOSIUM 2011 TECHNOLOGY … · 2011-08-28 · VLSI...

INVITED PLENARY TALK FORINVITED PLENARY TALK FORVLSIVLSI TECHNOLOGY SYMPOSIUM 2011TECHNOLOGY SYMPOSIUM 2011

TECHNOLOGY IMPACTS FROM THETECHNOLOGY IMPACTS FROM THE

VLSI VLSI TECHNOLOGY SYMPOSIUM 2011TECHNOLOGY SYMPOSIUM 2011

TECHNOLOGY IMPACTS FROM THE TECHNOLOGY IMPACTS FROM THE NEW WAVE OF ARCHITECTURES NEW WAVE OF ARCHITECTURES FOR MEDIAFOR MEDIA--RICH WORKLOADSRICH WORKLOADS

Samuel NaffzigerAMD Corporate Fellow

August 26th, 2011 (Original presentation June 14th , 2011)

1 VLSI Technology Symposium | June 2011 | Public

OutlineOutline

Introduction

The new workloads and demands on computationThe new workloads and demands on computation

Characteristics of serial and parallel computation

The Accelerated Processing Unit (APU) architectureThe Accelerated Processing Unit (APU) architecture

APU architecture implications for technology

Summary


Now: Parallel/DataNow: Parallel/Data--DenseDense

The Big Experience/Small Form Factor ParadoxThe Big Experience/Small Form Factor ParadoxMid 2000sMid 2000sTechnologyTechnology Mid 1990sMid 1990s Now: Parallel/DataNow: Parallel/Data--DenseDense

16:9 @ 7 megapixels

HD video flipcams, phones,

4:3 @ 1.2 megapixels

Digital cameras, SD webcams (1-5

TechnologyTechnology Mid 1990sMid 1990s

DisplayDisplay 4:3 @ 0.5 megapixel

E il fil & webcams (1GB)

3D Internet apps and HD video online, social networking w/HD files

SD webcams (1-5 MB files)

WWW and streaming SD video

ContentContent Email, film & scanners

OnlineOnline Text and low res photos

3D Blu-ray HD

Multi-touch, facial/gesture/voice recognition + mouse & keyboard

DVDs

Mouse & keyboardMouse & keyboard

MultimediaMultimedia CD-ROM

InterfaceInterface Mouse & keyboard

All day computing (8+ Hours)All day computing (8+ Hours)

Immersive and Immersive and interactive performanceinteractive performance ss

33--4 Hours4 HoursStandardStandard--definition definition

InternetInternet

Battery Battery Life*Life* 1-2 Hours

rmrm tors

tors pp

Wor

kloa

dsW

orkl

oadsFor

For

Fact

Fact

Early Early Internet and Multimedia Internet and Multimedia ExperiencesExperiences

3 VLSI Technology Symposium| June 2011 | Public

*Resting battery life as measured with industry standard tests.

Focusing on the experiences that matterFocusing on the experiences that matterConsumer PC Usage New Experiences

Email

Web browsing

Office productivity

New Experiences

Accelerated Internet Accelerated Internet Office productivity

Listen to music

Online chat

Watching online video

Photo editing

Accelerated Internet Accelerated Internet and HD Videoand HD Video

Photo editing

Personal finances

Taking notes

Online web-based games

Simplified Content Simplified Content ManagementManagement

Social networking

Calendar management

Locally installed games

Educational apps

ImmersiveImmersiveGamingGaming

Video editing

Internet phone

0% 20% 40% 60% 80% 100%

4 VLSI Technology Symposium| June 2011 | Public

Source: IDC's 2009 Consumer PC Buyer Survey

People Prefer Visual CommunicationsPeople Prefer Visual Communications

Visual Visual PerceptionPerceptionVerbal Verbal PerceptionPerception

Words are processedWords are processedat only 150 wordsat only 150 wordsper minuteper minute

Pictures and videoare processed 400 to

2000 times faster

Rich visual experiencesM lti l t tAugmenting Today’s Content:Augmenting Today’s Content: Multiple content sources Multi-Display Stereo 3D


The Emerging World of New Data Rich Applications The Emerging World of New Data Rich Applications

The Ultimate VisualThe Ultimate Visual

Communicating

The Ultimate Visual Experience™

Fast Rich Web content, favorite HD Movies, games with realistic

graphics

The Ultimate Visual Experience™

Fast Rich Web content, favorite HD Movies, games with realistic

graphics • IM, Email, Facebook• Video Chat, NetMeeting

Using photos• Viewing& Sharing• Search, Recognition, Labeling? • Advanced Editing

graphicsgraphics

Gaming

• Advanced Editing

Using video• DVD, BLU-RAY™, HD • Search, Recognition, Labeling • Advanced Editing & Mixing g

• Mainstream Games• 3D gamesMusic

• Listening and Sharing• Editing and Mixing• Composing and compositing


ArcSoftTotalMedia®

Theatre 5

ArcSoftMedia

Converter® 7

CyberLinkMedia

Espresso 6

CyberLinkPower

Director 9

Corel VideoStudioPro

Corel Digital Studio2010

InternetExplorer 9

Microsoft® PowerPoint® 2010

Windows Live

Essentials

CodemastersF1 2010Nuvixa

Be Present

ViVuDesktop

TelepresenceViewdle

Uploader

New Workload Examples: New Workload Examples: Changing Consumer BehaviorChanging Consumer Behavior

24 hoursof video

Approximately

9 billionvideo files owned are of video

uploaded to YouTube

every minute

video files owned are

high-definition

50 million +digital media files

1000 imagesd g ta ed a es

added to personal content libraries

every day

imagesare uploaded to Facebook

every second

7 | 2011 VLSI Symposium| June 2011 | Public

What Are the Implications for Computation?What Are the Implications for Computation?Insatiable demand for highInsatiable demand for high bandwidth processing–Visual image processing–Natural user interfaces–Massive data mining for

associative searchesassociative searches, recognition

Some of these compute needs b ffl d d tcan be offloaded to servers,

some must be done on the mobile device– Similar compute needs and

massive growth in both spaces

How must CPU architecture change to d l ith th t d ?


pdeal with these trends?

Serial ComputationSerial Computation

Transistors(thousands)

35 Years of Microprocessor Trend DataSerial Code

(thousands)

Single-threadPerformance(SpecINT)

i 0

…Conditionalbranches

Frequency(MHz)

Typical Power(Watts)

i=0i++

load x(i)fmulstore

cmp i (16)bc

Number ofCores

Loops, branches and conditional evaluation

Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten


Parallel ComputationParallel Computation

GFLOPs Trendi=0

…

DataParallel Code

6000

7000

8000

P)

GFLOPs Trend

GPU

i++load x(i)

fmulstore

cmp i (1000000)bc

…

Loop 1M times for 1M pieces

of data

3000

4000

5000

eak GFlop

s (SPFP CPU

AMD projections

…

i,j=0i++j++

load x(i,j)fmulstore

0

1000

2000

2005 2006 2007 2008 2009 2010 2011 2012 2013 2014P projections

…cmp j (100000)

bccmp i (100000)

bc

2D array representingvery large dataset

Years

…


GPU/CPU Design DifferencesGPU/CPU Design Differences

L t f i t ti littl d t

CPU (Serial compute) GPU (parallel compute)

Lots of instructions little data

• Out of order exec, Branch prediction

Few instructions lots of data

• Single Instruction Multiple Data• Extensive fine threading capability• Few hardware threads

Weak performance gains through density

• Extensive fine-threading capability

Nearly linear performance gains with density

Maximize speed with fast devices

Maximize density with cool devices


Three Eras of Processor PerformanceThree Eras of Processor Performance

Single-Core Era

Enabled by:

Multi-Core Era

Enabled by: Moore’s Law

HeterogeneousSystems Era

Enabled by: M ’ L Moore’s Law

Voltage & Process Scaling Micro Architecture

Constrained by:P

Moore’s Law Desire for Throughput 20 years of SMP arch

Constrained by:Power

Moore’s Law Abundant data parallelism Power efficient GPUs

Temporarily constrained by:P i d lPower

ComplexityPowerParallel SW availabilityScalability

Programming modelsCommunication overheadsWorkloads

erfo

rman

ce

?o rform

ance

o

plic

atio

n an

ce

ingl

e-th

read

P

we arehere

Thro

ughp

ut P

e we arehere

Targ

eted

App

Per

form

a

we arehere

o


S Time

T

Time(# of Processors)

Time(Data-parallel exploitation)

Heterogeneous Computing with an APU ArchitectureHeterogeneous Computing with an APU Architecture

2010 G2010 G (“ ) f(“ ) f 20112011 (“ ) f(“ ) f

CPU CPU CoresCores

~17 GB/sec~17 GB/sec~17 GB/sec~17 GB/sec

2010 IGP2010 IGP--based based (“Danube”) Platform(“Danube”) Platform 2011 APU2011 APU--based based (“Llano”) Platform(“Llano”) Platform

CoresCores

~~7 GB/sec7 GB/sec

UNBUNB

MC

MC

DDR3 DIMMDDR3 DIMMMemoryMemory

CPU ChipCPU Chip

DDR3 DIMMDDR3 DIMMMemoryMemory

APU ChipAPU Chip

CPU CPU CoresCores

UVDUVD

UN

B / M

CU

NB

/ MC

GPUGPU UVDUVD

SB FunctionsSB Functions

FCH ChipFCH Chip

Graphics requires memory Graphics requires memory BW to bring full capabilities BW to bring full capabilities

to lifeto life~27 GB/sec~27 GB/sec

~27 GB/sec~27 GB/secPCIe

GPUGPU

OptionalOptional

PCIe®®

Bandwidth pinch points and latency Bandwidth pinch points and latency hold back the GPU capabilitieshold back the GPU capabilities

Integration Provides ImprovementIntegration Provides Improvement Eliminate power and latency of extra chip Eliminate power and latency of extra chip

GPUGPU

crossingcrossing 3X 3X bandwidth between GPU and Memory!bandwidth between GPU and Memory! Same Same sized GPU is substantially more sized GPU is substantially more effectiveeffective

P ffi i d d h l f b hP ffi i d d h l f b h


Power efficient, advanced technology for both Power efficient, advanced technology for both CPU and GPUCPU and GPU

The Challenges of Integration

erfo

rman

ceP

erDensity

Thick, fast metalBig devices

Dense, thin metal, small devicesBig devices devices

CPU flop area = 2 14

GPU flop area = 1 0

Flop count for 4 Llano CPU cores=0.66M

Flop count for Llano GPU =3.5M


area 2.14 area 1.0

How to Balance the Metal Stack?Cu Resistivity

22.12.22.32.42.5

uohm

-cm

)

rman

ce With barrierWithout barrier

1.51.61.71.81.92

0 0 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 1Res

istiv

ity (

Per

for

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Line Width (um)Density

With the 20nm node, even local metal will be seeing large RC increase compromises more diffi lt

Add metal layers?Add metal layers? Thin, dense layers for the GPUThin, dense layers for the GPU

difficultyy

Thick, low resistance layers for the CPUThick, low resistance layers for the CPU Cost issues?Cost issues? Via resistance?Via resistance?


Via resistance?Via resistance?Technology improvements in BEOL are requiredTechnology improvements in BEOL are required

R R vsvs C?C?

Given the grim RC prognosis, should we be re-shaping either the aspect ratio or stack

M1 M1 M1 – M2

M5 – M6

M5 – M1

composition?Maybe. However, there are times when

M1 – M1 M1 M2

However, there are times when RC is important, but there are also many times when only C matters 7000

8000

9000

m

Track Availability vs Distance

GPU Stack:8-1X; 1-2X

Moreover, metal stack aspect ratio is more or less maxed out, so that leaves stack

3000

4000

5000

6000

7000

able

trac

ks/1

00um CPU Stack:

2-1X; 2-1.3X; 4-2X; 1-4X

composition Different products will emphasize different metal 0

1000

2000

3000

0 20 40 60 80 100 120 140

Ava

ila


pstacks

0 20 40 60 80 100 120 140

Norm Dist/ps

The growth in metal layer countThe growth in metal layer count

14Number of CPU Metal Levels vs Technology Node

10

12

els

6

8

Met

al L

eve

0

2

4

010 100 1000

Technology Node


Factors driving growth in Metal LayersFactors driving growth in Metal LayersI t t i t f b i liInterconnect requirements from basic scaling–Transistor count N scales as S2 (with fixed die size)Total interconnect length (in lambda) scales N>1 because of semi-global and global routes Therefore interconnect length (in mm) global and global routes. Therefore, interconnect length (in mm) increases at a rate <1/S

Non-scaling design rules–In order to achieve tight pitch, more restrictive design rules In order to achieve tight pitch, more restrictive design rules are imposed that significantly reduce the routeability of metal layers: Unidirectional metal, increased overlap requirements, restrictive T2T

d T2L land T2L rulesEach metal layer is “worth less” in terms of routeability: need more metal layers

Reverse scalingReverse scaling–Long distance routes require lower RC than can be accommodated by scaled metal

–So move routes to thicker layers but fewer tracks available


–So, move routes to thicker layers, but fewer tracks available, so pressure on layer count

Factors driving increase in Metal LayersFactors driving increase in Metal Layers

Electromigration/Power Supply GridElectromigration/Power Supply Grid–As cross section scales with S2, and current increases as Vdddrops, so current densities increase dramaticallyHigher Via R Metal resistances significantly degrade Higher Via R, Metal resistances significantly degrade Drives improved E-M sophistication, process techniques (alloys/barriers), denser power networksPower Gating and Power Islands may drive the need for multipleo e Gat g a d o e s a ds ay d e t e eed o u t p esupply grids More metal consumed by power supply grid

All of the above can have the effect of increasing the number of metal layers– But it can be a tradeoff of Metal layers vs die size and/or route timeBut it can be a tradeoff of Metal layers vs die size and/or route time


Device OptimizationAPUVtMix

rman

ce

CPU GPU

0.8

1

1.2

APU Vt Mix

LC‐HVT

Per

for

0.2

0.4

0.6HVT

RVT

LC‐RVT

LVTDevice Ioff

To achieve breakthrough APU performance, the Llano GPU

0

Llano CPU Llano GPU

LVTDevice Ioff

p ,has ~5X the flops and ~5X the device count of the CPUs

200

250

FPG vs. Ioff

desired device range

Speed vs. Leakage

ed

A broader device suite is

Broader span of devices required

0

50

100

150

FPG

device range

RO

spe

e

LVT LC-RVTRVT HVT LC-HVT

suite is required


0

175 50 20 4.3 2.7 0.5 0.4

Ioff (nA/um room temp)

Power Transfers

110.0110.0

100.0

105.0

100.0

105.0

90.0

95.0

90.0

95.0

85.0

Balanced workload

85.0

GPU-centric data parallel workload

V l i i i l bliV l i i i l bliVoltage range is critical to enabling Voltage range is critical to enabling the efficient power transfers that the efficient power transfers that make for compelling APU make for compelling APU performanceperformance


performance performance

Power Transfers

110.0110.0

100.0

105.0

100.0

105.0

90.0

95.0

90.0

95.0

85.0

CPU-centric serial orkload

85.0

Balanced workloadV l i i i l bliV l i i i l bli workloadVoltage range is critical to enabling Voltage range is critical to enabling

the efficient power transfers that the efficient power transfers that make for compelling APU make for compelling APU performanceperformance


performance performance

Operating Voltage RangeE/op vs V

Operating voltage Operating voltage requirements:requirements: Low voltage necessary forLow voltage necessary for

2

2.5

E/op vs. V

Low voltage necessary for Low voltage necessary for power efficiencypower efficiency High voltage necessary for High voltage necessary for 0.5

1

1.5

g g yg g ya snappy user experience a snappy user experience enabled by turbo modeenabled by turbo mode

0

0.7V 0.8V 0.9V 1.0V 1.1V 1.2V 1.3V


Power Density Limited GPU

Operating Voltage Challenges

0.952V

0.915V

0 886V0.900V

0.950V

1.000V

3.5

4

4.5

5

age

40nm to 14nm

Frequency

To maintain cost effective To maintain cost effective performance growth with performance growth with technology node the GPUtechnology node the GPU 0.886V

0.805V

0 750V

0.800V

0.850V

1

1.5

2

2.5

3

Nom

inal Volta

Frequency

Power Density

Voltage

technology node, the GPU technology node, the GPU must:must:

–– Hold power density Hold power density constantconstant

0.700V

0.750V

0

0.5

40nm 28nm 20nm 14nm

constantconstant–– Exploit density gains to add Exploit density gains to add

compute units compute units Juniper FrequencyData40nm GPU Frequency Data

This necessarily drives This necessarily drives operating voltage downoperating voltage down

This would be good for energy This would be good for energy 700MHz

800MHz

900MHz

1000MHz

Juniper Frequency Data

1

2

3

4

5

6

40nm GPU Frequency Data

efficiency except …efficiency except …–– Variation impacts are much Variation impacts are much

greater at low voltagegreater at low voltage 300MHz

400MHz

500MHz

600MHz

6

7

8

9

10

11

12

13Frequency


0MHz

100MHz

200MHz

0.85V 0.90V 0.95V 1.00V 1.10V 1.15V

14

15

16

17

18

spread increases at low voltage

The Operating Voltage Challenge

Many barriers to maintaining both high Many barriers to maintaining both high and low voltage as technology scalesand low voltage as technology scales

FD devices should enable FD devices should enable maintaining the functional maintaining the functional range for a generation or tworange for a generation or twoand low voltage as technology scalesand low voltage as technology scales

TDDB vs. SCE controlTDDB vs. SCE control ULK breakdown vs. denser pitchesULK breakdown vs. denser pitches

V i ti t lV i ti t l

g gg gWill turbo modes be too Will turbo modes be too

compromised?compromised?What’s next?What’s next? Variation controlVariation control What s next?What s next?

95.0

100.0

105.0

110.0

Poly

85.0

90.0

Fin


BOXFin

Cost issuesCost issues


1000 R = k1/NA

Lithography evolutionLithography evolution

R k1/NA is saturating

NA is saturating

k1 limit is about 0.25

i-linei-lineg-line 430nmg-line 430nm

100

"1:1"

i-line 365nmi-line

365nmKrF 248nmKrF 248nmArF

193 nmimmersion

430nm430nmArF193 nm

10 NA~1.35 NA< 0.8


1010 100 1000

Technology Node or min Feat. size

NA 1.35 NA 0.8

Scaling implicationsScaling implications

R = k1/NA . – stuck at 193nm for now, NA at 1.35, and k1 limit at 0.25–Reducing k1 to <0 3 has very considerable cost:Reducing k1 to <0.3 has very considerable cost: –Much OPC and RDR needed to achieve tight pitches

K1=0.36

- Net, a significant erosion of pitch-based scaling entitlement- Scale factors are proprietary … but block area scaling > pitch scaling^2!

Fundamental pitch limitation for 193nm lithography is ~ 80nm


Pitch splittingPitch splitting

Decomposing a layer into two effectively 72nm 144nm

doubles pitch, resolving k1 issue and allowing complex shapes

Decomposition requires Decomposition requires significant CAD effort to break the patterns into two printable layersp y


Pitch splittingPitch splitting

Decomposing a layer into two effectively 72nm 144nm

doubles pitch, resolving k1 issue and allowing complex shapes

Decomposition requires Decomposition requires significant CAD effort to break the patterns into two printable layersp y

However, now have within-layer overlay issues, and min space can be a Vmax issue or a Cap issue

4σ min space ~ 16.5nm (PS @ 72nm) versus ~28nm ( @ 80nm)

a Vmax issue, or a Cap issue


4σ max space ~ 44.8nm (PS @ 72nm) versus ~41.3nm (D@ 80nm) ~Ccap variation: +85% / -30% over nominal for PS @ 72nm

~Ccap variation: +25% / -15% over nominal for SE @ 80nm

Why do we care?Why do we care?

Foundries have settled on a 28nm node with a ~4:3 M1X:Poly pitch ratioFoundries have settled on a 28nm node with a ~4:3 M1X:Poly pitch ratio–Typical Design rules

assuming 0.7x scalingDesign Rule 28nm Desired 20nm

Contacted Poly Pitch ~113nm ~80nm

20nm node CPP is doable –but probably want >80nm for margin and gate oversize capability

Contacted Poly Pitch 113nm 80nmM1X Pitch ~90nm ~64nm

but probably want >80nm for margin and gate oversize capabilityDesired 1X metal scaling to 20nm is below pitch split limitCan get “true” scaling and pitch split 1X metals

–GPU’s have up to 8 1X metals–CPU’s have 2-5 1X metals

Choice: significant cost adder for “true” scaling, or reduced cost and reduced scaling


Other cost considerationsOther cost considerationsMOL: Conventional contacts at <90nm CPP don’t work, and a more complex scheme is required, analogous to LI used by Intel at 32nm (+2 masks)

BEOL Options: Only scale 1X metals to ~80nm pitch, get reduced scaling but lower cost

Add metal layers at 80nm pitch to recover scaling; increased cost and cycle time

Use some combination of pitch split and non-pitch split layers to obtain greater scaling at higher cost

Key questions to resolve:

Additional cost of pitch split layers– Additional cost of pitch split layers

– Additional defectivity of pitch split layers (~64 vs ~80nm pitch)

– Whether or not to pitch split vias


p p

Relative cost experimentRelative cost experiment


What about EUV?What about EUV?At = 13.5nm, EUV should make lithography simple, and eliminate the

f O C ?need for pitch splitting, as well as most OPC. Right?Maybe:

– Very expensive capital equipment– Complex, expensive reflective masks– Very low throughput due to illuminator

output >10X below requirementsoutput 10X below requirements– Very high power requirements

Th i b l blThese issues may be solvable, unlikely by the leading edge of 14nm

EUV

Other forms of advanced lithography such as MEBL look attractive, but are even further behind EUV.


3D Integration to the Rescue?

DRAM

TIM (Thermal Interface Material)

Heat Sink

DRAMi

Through GPU Die

Analog Die (SB, Power)Metal Layers

Metal LayersDRAMMicro-

bumps

ThroughSilicon Vias

(TSVs) CPU DieMetal Layers

G U eMetal Layers

Package Substrate

South Bridge


3D Integration to the Rescue?

Stacking offers many attractive benefits Stacking offers many attractive benefitsHigher bandwidth to local memory

E bl ll l d i l t di t b i th iEnables parallel and serial compute die to be in their own separate optimized technology – interconnect speed vs. density, device optimization etc.

Allows IO and southbridge content to remain in older, more analog-friendly technology


3D Integration Challenges Economical 3D stacking in high volume manufacturing presents co o ca 3 stac g g o u e a u actu g p ese tsmany challenges

Benefits must exceed the additional costs of TSVs, and yield fallout

Logistics of testing and assembling die from multiple sources can be immense

Countless mechanical and thermal issues to solve in high volume mfgCountless mechanical and thermal issues to solve in high volume mfg

DRAM

Clearly 3D provides compelling solutions to many problems, but the TIM (Thermal Interface Material)

Heat Sink

DRAMy

barriers to entry mean heavy R&D $$ and partnerships required

ThroughSilicon

Vias(TSVs)

CPU DieMetal Layers

GPU DieMetal Layers

Analog Die (SB, Power)Metal Layers

Metal LayersDRAMDie to

Die Vias


p p qPackage Substrate

South Bridge

SummarySummary

Insatiable demand for high bandwidth computation–Visual image processing–Natural user interfacesNatural user interfaces–Massive data mining for associate searches, recognition

Some of these compute needs can be offloaded to servers, some must be done on the mobile devicesome must be done on the mobile device–Similar compute needs and massive growth in both spaces–Combined serial and parallel computation architectures are

key in both spaceskey in both spacesHuge technology challenges to meeting this opportunity

–Interconnect scaling is hitting a wall that must be overcomeA broad device suite is necessary that operates efficiently at–A broad device suite is necessary that operates efficiently at low voltage while enabling high speed for response time

–Cost issues present a very real barrier to further scaling3D integration offers a promising long term solution


–3D integration offers a promising long term solution

INVITED PLENARY TALK FOR VLSI TECHNOLOGY SYMPOSIUM 2011 TECHNOLOGY … · 2011-08-28 · VLSI...

Documents

Transcript of INVITED PLENARY TALK FOR VLSI TECHNOLOGY SYMPOSIUM 2011 TECHNOLOGY … · 2011-08-28 · VLSI...