How’s the Parallel Computing Revolution Going?

Kathryn S. McKinley, The University of Texas at Austin

Page 1: How’s the  Parallel Computing Revolution  Going?

How’s the Parallel Computing Revolution Going?

Kathryn S. McKinley
The University of Texas at Austin

Page 2: How’s the  Parallel Computing Revolution  Going?

20th Century Simplicity

Hardware: software does not change, it just runs faster

Page 3: How’s the  Parallel Computing Revolution  Going?

20th Century Simplicity

Software: hardware does not change, it just runs faster

Page 4: How’s the  Parallel Computing Revolution  Going?

How could they pretend?

Page 5: How’s the  Parallel Computing Revolution  Going?

20th Century Virtuous Cycle

Hardware Capabilities & Complexity

sequential interface

Software Capabilities & Complexity

Sequential interface hid explosion in capability & complexity

Page 6: How’s the  Parallel Computing Revolution  Going?

20th Century Languages: insufficient for software complexity

Native Programming Languages

Page 7: How’s the  Parallel Computing Revolution  Going?

21st Century Managed Language Revolution

PHP

Page 8: How’s the  Parallel Computing Revolution  Going?

20th Century Virtuous Cycle

Hardware Capabilities & Complexity

sequential interface

Managed Languages

Software Capabilities

Page 9: How’s the  Parallel Computing Revolution  Going?

Processor Evolution

Processor  Year  Clock    Process  Transistors  Die area  Cores
Power 4    2001  1.3 GHz  130nm    174M         267 mm2   2 cores
Power 5    2005  2.3 GHz  90nm     276M         389 mm2   2 cores
i7         2008  2.7 GHz  45nm     731M         263 mm2   4 cores x 2 SMT
i5         2010  3.4 GHz  32nm     382M         81 mm2    2C x 2T

Page 10: How’s the  Parallel Computing Revolution  Going?

Processor Evolution: why multicore?

Page 11: How’s the  Parallel Computing Revolution  Going?

Processor Evolution: why multicore?

On-chip power constraints & wire delay slowed clock scaling

Page 12: How’s the  Parallel Computing Revolution  Going?

20th Century Virtuous Cycle

Hardware Capabilities & Complexity

✗ sequential interface

Managed Languages

Software Capabilities

Page 13: How’s the  Parallel Computing Revolution  Going?

21st Century Virtuous Cycle?

Parallel Hardware Capabilities

Parallel interface

? Managed Languages

Software Capabilities

Page 14: How’s the  Parallel Computing Revolution  Going?

21st Century Virtuous Cycle?

parallel interface: combines time and space, wicked to program

[Diagram: cache hierarchies of three processors]
Pentium 4 w/ SMT: 8KB L1, 512KB L2
Core 2 Quad: 32KB L1 per CPU, 4MB L2 per pair of CPUs, system bus
Core i7: 32KB L1 and 256KB L2 per CPU, 8MB L3

Page 15: How’s the  Parallel Computing Revolution  Going?

How is this new virtuous cycle going?

Page 16: How’s the  Parallel Computing Revolution  Going?

What should we measure?

Page 17: How’s the  Parallel Computing Revolution  Going?

• performance, power, energy
• native languages, managed languages
• sequential & parallel programs
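The three metrics are linked: energy is average power multiplied by run time, and performance is usually reported relative to a reference configuration. The sketch below illustrates that arithmetic only. The function names and every number in it are hypothetical and do not come from the talk.

```python
def energy_joules(avg_power_watts: float, seconds: float) -> float:
    """Energy is average power integrated over the run: E = P_avg * t."""
    return avg_power_watts * seconds


def relative_to(value: float, reference: float) -> float:
    """Normalize a measurement to a reference configuration."""
    return value / reference


# Hypothetical measurements (not from the talk): average watts and seconds.
baseline = {"watts": 40.0, "seconds": 100.0}  # e.g. one core, one thread
scaled = {"watts": 80.0, "seconds": 40.0}     # e.g. more cores enabled

baseline_energy = energy_joules(baseline["watts"], baseline["seconds"])
scaled_energy = energy_joules(scaled["watts"], scaled["seconds"])

# Performance is inversely proportional to run time, so speedup = t_ref / t.
speedup = relative_to(baseline["seconds"], scaled["seconds"])
energy_ratio = relative_to(scaled_energy, baseline_energy)
```

In this made-up case the second configuration draws twice the power but finishes 2.5 times sooner, so it uses less energy overall. Power and energy can move in opposite directions, which is one reason to report both.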

Page 18: How’s the  Parallel Computing Revolution  Going?

How do we measure power?

Page 19: How’s the  Parallel Computing Revolution  Going?

Measured Power, Performance & Scaling

Esmaeilzadeh et al.

Page 20: How’s the  Parallel Computing Revolution  Going?

Looking Back on the Language & Hardware Revolutions: Measured Power, Performance, and Scaling
ASPLOS 2011

Stephen M. Blackburn, Australian National University
Kathryn S. McKinley, University of Texas at Austin
Hadi Esmaeilzadeh, University of Washington
Ting Cao, Australian National University
Xi Yang, Australian National University

Page 21: How’s the  Parallel Computing Revolution  Going?

Workload: 4 groups weighted equally

61 benchmarks from 6 suites

• Native Non-Scalable: SPECcpu 2006
• Native Scalable: PARSEC 2008
• Java Non-Scalable: SPECjvm98, JBB’05, DaCapo’06
• Java Scalable: DaCapo’09
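Weighting the four groups equally keeps a suite with many benchmarks from dominating the overall result. The slide does not say which averages are used, so the sketch below assumes an arithmetic mean within each group and again across groups. The function name and all scores are hypothetical.

```python
from statistics import mean


def equally_weighted_score(groups: dict[str, list[float]]) -> float:
    """Average each group first, then average the group means.

    Assumption: arithmetic means at both levels; the talk may use another mean.
    """
    group_means = [mean(scores) for scores in groups.values()]
    return mean(group_means)


# Hypothetical normalized scores; group sizes deliberately differ.
example = {
    "Native Non-Scalable": [1.0, 1.0, 1.0, 1.0],
    "Native Scalable": [3.0],
    "Java Non-Scalable": [1.0, 2.0],
    "Java Scalable": [2.0, 3.0],
}

group_weighted = equally_weighted_score(example)

# For contrast: a flat mean over all benchmarks lets the largest group dominate.
flat_mean = mean(score for scores in example.values() for score in scores)
```

With these numbers the group-weighted score is higher than the flat mean, because the four-benchmark group no longer counts four times as much as the single-benchmark group.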

Page 22: How’s the  Parallel Computing Revolution  Going?

Intel Processors: 5 technology generations from similar price points

Processor  Year  Process  Transistors  Die area  Cores x threads
Pentium 4  2003  130nm    55M          131 mm2   1C x 2T
Core 2 D   2006  65nm     291M         143 mm2   2C
i7         2008  45nm     731M         263 mm2   4C x 2T
Atom       2008  45nm     47M          36 mm2    1C x 2T
Core 2 D   2009  45nm     228M         82 mm2    2C
Atom D     2009  45nm     176M         87 mm2    2C x 2T + GPU
i5         2010  32nm     382M         81 mm2    2C x 2T

Page 23: How’s the  Parallel Computing Revolution  Going?

TDP & Measured Power

[Figure: measured power (W, log scale) versus TDP (W, log scale) for P4 (130), C2D (65), C2Q (65), i7 (45), Atom (45), C2D (45), AtomD (45), i5 (32)]

Page 24: How’s the  Parallel Computing Revolution  Going?

Measured Power vs Performance

[Figure: power (W) versus performance / reference performance for 2003 Pentium 4 (130), 2006 Core 2 Duo (65), 2008 Core 2 Duo (45), 2008 i7 (45), 2010 i5 (32)]

Page 25: How’s the  Parallel Computing Revolution  Going?

How is this new virtuous cycle going for native non-scalable?

Page 26: How’s the  Parallel Computing Revolution  Going?

Native Non-Scalable Performance

[Figure: performance / 1C1T performance for the 2C1T, 4C1T, and 4C2T configurations, per benchmark: 470.lbm, 465.tonto, 437.leslie3d, 435.gromacs, 434.zeusmp, 462.libquantum, 464.h264ref, 445.gobmk, 458.sjeng, 459.GemsFDTD, 416.gamess, 444.namd, 436.cactusADM, 400.perlbench, 454.calculix, 401.bzip2, 447.dealII, 483.xalancbmk, 482.sphinx3, 456.hmmer, 471.omnetpp, 453.povray, 429.mcf, 473.astar, 403.gcc, 450.soplex, 433.milc]

Page 27: How’s the  Parallel Computing Revolution  Going?

Native Non-Scalable Energy

[Figure: energy / 1C1T energy for the 2C1T, 4C1T, and 4C2T configurations, per benchmark: 470.lbm, 465.tonto, 437.leslie3d, 435.gromacs, 434.zeusmp, 462.libquantum, 464.h264ref, 445.gobmk, 458.sjeng, 459.GemsFDTD, 416.gamess, 444.namd, 436.cactusADM, 400.perlbench, 454.calculix, 401.bzip2, 447.dealII, 483.xalancbmk, 482.sphinx3, 456.hmmer, 471.omnetpp, 453.povray, 429.mcf, 473.astar, 403.gcc, 450.soplex, 433.milc]

Page 28: How’s the  Parallel Computing Revolution  Going?

How is this new virtuous cycle going for Java single threaded?

Page 29: How’s the  Parallel Computing Revolution  Going?

Java Single Threaded Performance

[Figure: performance / 1C1T performance for the 2C1T, 4C1T, and 4C2T configurations, per benchmark: antlr, fop, luindex, _209_db, bloat, _228_jack, _213_javac, _202_jess, _222_mpegaudio, _201_compress]

Page 30: How’s the  Parallel Computing Revolution  Going?

Java Single Threaded Energy

[Figure: energy / 1C1T energy for the 2C1T, 4C1T, and 4C2T configurations, per benchmark: antlr, fop, luindex, _209_db, bloat, _228_jack, _213_javac, _202_jess, _222_mpegaudio, _201_compress]

Page 31: How’s the  Parallel Computing Revolution  Going?

How is this new virtuous cycle going for native scalable?

Page 32: How’s the  Parallel Computing Revolution  Going?

Native Scalable Performance

[Figure: performance / 1C1T performance for the 2C1T, 4C1T, and 4C2T configurations, per benchmark: ferret, swaptions, blackscholes, raytrace, fluidanimate, x264, facesim, bodytrack, streamcluster, vips, canneal]

Page 33: How’s the  Parallel Computing Revolution  Going?

Native Scalable Energy

[Figure: energy / 1C1T energy for the 2C1T, 4C1T, and 4C2T configurations, per benchmark: ferret, swaptions, blackscholes, raytrace, fluidanimate, x264, facesim, bodytrack, streamcluster, vips, canneal]

Page 34: How’s the  Parallel Computing Revolution  Going?

How is this new virtuous cycle going for Java scalable?

Page 35: How’s the  Parallel Computing Revolution  Going?

Java Multithreaded Performance

[Figure: performance / 1C1T performance for the 2C1T, 4C1T, and 4C2T configurations, per benchmark: sunflow, tomcat, xalan, lusearch, eclipse, pjbb2005, _227_mtrt, tradebeans, jython, batik, avrora, pmd, h2]

Page 36: How’s the  Parallel Computing Revolution  Going?

Java Multithreaded Energy

[Figure: energy / 1C1T energy for the 2C1T, 4C1T, and 4C2T configurations, per benchmark: sunflow, tomcat, xalan, lusearch, eclipse, pjbb2005, _227_mtrt, tradebeans, jython, batik, avrora, pmd, h2]

Page 37: How’s the  Parallel Computing Revolution  Going?

Is there hope?

Page 38: How’s the  Parallel Computing Revolution  Going?

parallel interface: combines time and space, wicked to program

[Diagram: cache hierarchies of three processors]
Pentium 4 w/ SMT: 8KB L1, 512KB L2
Core 2 Quad: 32KB L1 per CPU, 4MB L2 per pair of CPUs, system bus
Core i7: 32KB L1 and 256KB L2 per CPU, 8MB L3

Page 39: How’s the  Parallel Computing Revolution  Going?

Vision

• Algorithms must be space and time efficient
• Scalable Runtimes
  – Runtime & application parallelism & concurrency
  – CMP aware runtime improves application scalability
• Communication
  – Cache coherency is expensive and performance sensitive
  – Memory bandwidth scaling is problematic
• Heterogeneity
  – Move non-critical path off power-hungry cores
  – Smarter, more aggressive analysis
• Specialization?
  – Tuned cores? Special purpose cores?

Page 40: How’s the  Parallel Computing Revolution  Going?

Managed Languages: Challenges & Opportunities

Page 41: How’s the  Parallel Computing Revolution  Going?

Must start with a scalable managed runtime

Page 42: How’s the  Parallel Computing Revolution  Going?

Sequential Managed Programs

[Diagram: Application and Managed Runtime sharing a single core over time]

• Profiling
• Dynamic Analysis
• Compilation
• Garbage Collection
• Other Helper Threads
• …

Page 43: How’s the  Parallel Computing Revolution  Going?

Steps towards scalability

Step 1. Parallel application

[Diagram: application threads on Core 0 to Core 7 over time, with unused cores]

Each thread has different running time

Page 44: How’s the  Parallel Computing Revolution  Going?

Steps towards scalability

Step 2. Parallel runtime

[Diagram: managed application threads and runtime threads on Core 0 to Core 7 over time]

Runtime waits for all application threads to pause

Page 45: How’s the  Parallel Computing Revolution  Going?

Steps towards scalability

Step 3. Parallel & concurrent runtime

[Diagram: managed application threads and runtime threads on Core 0 to Core 7 over time]

Managed runtime on application’s critical path may perturb performance

Page 46: How’s the  Parallel Computing Revolution  Going?

Steps towards scalability: Ideal model

Step 4. Minimize perturbation

[Diagram: application threads and analysis threads on Core 0 to Core 7 over time]

Offload work to concurrent runtime threads

Whole runtime task taken off critical path
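As a rough illustration of offloading, the sketch below has an application thread hand profiling events to a separate analysis thread through a queue, so the application only pays for the enqueue. This is not the mechanism of any particular managed runtime. The function name and the event format are hypothetical, and in CPython the two threads do not execute bytecode in parallel, so it shows the structure only.

```python
import queue
import threading
from collections import Counter


def run_with_offloaded_profiling(events):
    """Count events on a helper thread while the caller only enqueues them."""
    samples = queue.Queue()
    counts = Counter()

    def analysis_worker():
        # Runtime-side work: consume samples until the sentinel arrives.
        while True:
            item = samples.get()
            if item is None:
                break
            counts[item] += 1

    worker = threading.Thread(target=analysis_worker)
    worker.start()

    # Application-side critical path: record the event and move on.
    for event in events:
        samples.put(event)

    samples.put(None)  # sentinel: no more samples
    worker.join()      # wait so the caller can read the final counts
    return counts


profile = run_with_offloaded_profiling(["f", "g", "f"])
```

The design choice is that the analysis itself (here a trivial count) never runs on the thread that produces events, which is the "taken off critical path" property named on the slide.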

Page 47: How’s the  Parallel Computing Revolution  Going?

Steps towards scalability: Ideal model

Step 4. Minimize perturbation

[Diagram: application threads and analysis threads on Core 0 to Core 7 over time]

Worst case is parallel & concurrent

Page 48: How’s the  Parallel Computing Revolution  Going?

Scalable VM Services

• Profiling (feedback directed optimization)
  – Concurrent analysis
  – More invasive analysis on low-power cores
  – J. Ha et al. OOPSLA’09, Bond et al. PLDI’10, etc.
• GC
  – High performance parallel & concurrent GC
  – High performance mostly non-moving GC
  – Reduced synchronization overheads
  – Distributed & scratchpad GC
  – Blackburn et al. PLDI’10, CACM’08, PLDI’08, SIGMETRICS’04, etc.
• JIT
  – Concurrent, parallel JIT
  – Cost-benefit shift with low-power cores
  – Ha et al. PESPMA’09
• Architecture
  – Tuned and/or specialized cores for runtime services
  – Coherence tailored for restricted, common case of GC

Page 49: How’s the  Parallel Computing Revolution  Going?

Today

• Profiling (feedback directed optimization)
  – Concurrent analysis
  – More invasive analysis on low-power cores
  – J. Ha et al. OOPSLA’09, Bond et al. PLDI’10, etc.
• GC
  – High performance parallel & concurrent GC
  – High performance mostly non-moving GC
  – Reduced synchronization overheads
  – Distributed & scratchpad GC
  – Blackburn et al. PLDI’10, CACM’08, PLDI’08, SIGMETRICS’04, etc.
• JIT
  – Concurrent, parallel JIT
  – Cost-benefit shift with low-power cores
  – Ha et al. PESPMA’09
• Architecture
  – Tuned and/or specialized cores for runtime services
  – Coherence tailored for restricted, common case of GC

Page 50: How’s the  Parallel Computing Revolution  Going?

Garbage Collection

Page 51: How’s the  Parallel Computing Revolution  Going?

Isn’t Garbage Collection retro?

canonical algorithms

• Mark-Sweep (McCarthy, 1960)
• Mark-Compact (Styger, 1967)
• Semi-Space (Cheney, 1970)

Page 52: How’s the  Parallel Computing Revolution  Going?

Programmer Productivity

Page 53: How’s the  Parallel Computing Revolution  Going?

Programmer Productivity & Performance?

Page 54: How’s the  Parallel Computing Revolution  Going?

GC Fundamentals: Algorithmic Components

Allocation
• Bump Allocation
• Free List

Identification
• Tracing (implicit)
• Reference Counting (explicit)

Reclamation
• Sweep-to-Free
• Compact
• Evacuate

Page 55: How’s the  Parallel Computing Revolution  Going?

GC Fundamentals: Canonical Garbage Collectors

• Mark-Sweep [McCarthy 1960]: Free-list + trace + sweep-to-free
• Mark-Compact [Styger 1967]: Bump allocation + trace + compact
• Semi-Space [Cheney 1970]: Bump allocation + trace + evacuate
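The two allocation components differ mainly in where new objects land. A minimal sketch follows, assuming a heap modeled as integer addresses and a first-fit search for the free list. The class names and the first-fit policy are illustrative choices, not taken from the talk.

```python
class BumpAllocator:
    """Allocate by advancing a cursor through one contiguous range."""

    def __init__(self, start: int, size: int):
        self.cursor = start
        self.limit = start + size

    def alloc(self, nbytes: int):
        if self.cursor + nbytes > self.limit:
            return None  # out of space; a collector would run here
        addr = self.cursor
        self.cursor += nbytes  # consecutive objects are adjacent in memory
        return addr


class FreeListAllocator:
    """Allocate from a list of (address, size) holes using first fit."""

    def __init__(self, holes):
        self.holes = list(holes)

    def alloc(self, nbytes: int):
        for i, (addr, size) in enumerate(self.holes):
            if size >= nbytes:
                if size == nbytes:
                    del self.holes[i]
                else:
                    # Carve the object off the front of the hole.
                    self.holes[i] = (addr + nbytes, size - nbytes)
                return addr
        return None  # no hole is large enough

    def free(self, addr: int, nbytes: int):
        # Sweep-to-free returns dead cells to the list (no coalescing here).
        self.holes.append((addr, nbytes))
```

Bump allocation places objects allocated together next to each other, while a free list places them wherever a hole fits, which scatters related objects across the heap.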

Page 56: How’s the  Parallel Computing Revolution  Going?

Performance Pathologies: Mark-Sweep, Mark-Compact, Semi-Space

[Figure: panels Garbage Collection, Total Performance, and Mutator (time versus space) and Minimum Heap (space), for SemiSpace, MarkCompact, and MarkSweep. Geometric mean of DaCapo’06, jvm98, and jbb2000 on 2.4GHz Core 2 Duo]

• Mark-Sweep: poor locality
• Semi-Space: space inefficient
• Mark-Compact: expensive multi-pass

Page 57: How’s the  Parallel Computing Revolution  Going?

Can we have space and time efficiency?

Page 58: How’s the  Parallel Computing Revolution  Going?

Mark-Region
PLDI 2008

Kathryn S. McKinley, University of Texas at Austin
Stephen M. Blackburn, Australian National University

Page 59: How’s the  Parallel Computing Revolution  Going?

Mark-Region with Sweep-To-Region

Reclamation: Sweep-to-Free, Compact, Evacuate, Sweep-to-Region

• Mark-Sweep: Free-list + trace + sweep-to-free
• Mark-Compact: Bump allocation + trace + compact
• Semi-Space: Bump allocation + trace + evacuate
• Mark-Region: Bump + trace + sweep-to-region

Page 60: How’s the  Parallel Computing Revolution  Going?

Naïve Mark-Region

• Contiguous allocation into regions
  – Excellent locality
  – Objects cannot span regions
• Simple mark phase
  – Mark objects and their region
• Free unmarked regions
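The naïve scheme can be sketched in a few lines. This is an illustrative model only: tracing is not implemented, so the set of live objects is passed in by the caller, and the class names, region sizes, and `(region, offset)` object handles are all hypothetical.

```python
class Region:
    def __init__(self, size: int):
        self.size = size
        self.cursor = 0      # bump pointer within the region
        self.marked = False


class NaiveMarkRegionHeap:
    def __init__(self, num_regions: int, region_size: int):
        self.regions = [Region(region_size) for _ in range(num_regions)]
        self.current = 0     # index of the region being allocated into

    def alloc(self, nbytes: int):
        """Bump-allocate; objects never span regions. Returns (region, offset)."""
        while self.current < len(self.regions):
            region = self.regions[self.current]
            if region.cursor + nbytes <= region.size:
                offset = region.cursor
                region.cursor += nbytes
                return (self.current, offset)
            self.current += 1  # object does not fit: move to the next region
        return None

    def collect(self, live_objects):
        """Mark the region of each live object, then free unmarked regions."""
        for region in self.regions:
            region.marked = False
        for region_index, _offset in live_objects:
            self.regions[region_index].marked = True
        freed = []
        for i, region in enumerate(self.regions):
            if not region.marked:
                region.cursor = 0  # the whole region becomes reusable
                freed.append(i)
        self.current = 0
        return freed
```

One live object keeps its whole region marked, so the dead space around it cannot be reclaimed. That is the false-marking cost that motivates the question of region size.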

Page 61: How’s the  Parallel Computing Revolution  Going?

Region Size? Lines and Blocks

Large Regions
✓ More contiguous allocation
✗ Fragmentation (false marking)

Small Regions
✗ Increased metadata o/h
✗ Constrained object sizes
✗ Fragmentation (can’t fill blocks)

Lines & Blocks
• Block: N pages; line: approx 1 cache line
• TLB locality, cache locality
• Block > 4 X max object size
✓ Less fragmentation: objects span lines
✓ Fast common case: lines marked with objects

[Diagram: blocks containing free lines and recyclable lines]

Page 62: How’s the  Parallel Computing Revolution  Going?

Allocation Policy (Recycling)

• Recycle partially marked blocks first
  – Minimizes fragmentation
  – Maximizes sharing of freed blocks
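Line marking, the sweep into regions, and the recycling policy can be sketched together. The line size and lines-per-block constants below are placeholders, the live set is supplied by the caller instead of a tracer, and every line an object touches is marked exactly. A real collector may mark more conservatively, so treat this as a simplified model and not as the Immix implementation.

```python
LINE_BYTES = 128       # hypothetical line size
LINES_PER_BLOCK = 8    # hypothetical block size, in lines


def mark_lines(live_objects, num_blocks: int):
    """live_objects: iterable of (block, start_byte, size_bytes)."""
    marks = [[False] * LINES_PER_BLOCK for _ in range(num_blocks)]
    for block, start, size in live_objects:
        first = start // LINE_BYTES
        last = (start + size - 1) // LINE_BYTES
        for line in range(first, last + 1):  # objects may span lines
            marks[block][line] = True
    return marks


def sweep_to_region(marks):
    """Classify each block by how many of its lines are marked."""
    free, recyclable, full = [], [], []
    for block, lines in enumerate(marks):
        marked = sum(lines)
        if marked == 0:
            free.append(block)          # no live data: whole block reusable
        elif marked == len(lines):
            full.append(block)          # no room left
        else:
            recyclable.append(block)    # unmarked lines can be reused
    return free, recyclable, full


def allocation_order(free, recyclable):
    """Recycle partially marked blocks first, then use free blocks."""
    return recyclable + free


marks = mark_lines([(0, 0, 200), (2, 0, 1024)], num_blocks=3)
free, recyclable, full = sweep_to_region(marks)
order = allocation_order(free, recyclable)
```

Filling the holes in partially marked blocks before touching free blocks keeps completely free blocks available for later use, which is the intent of the recycling policy on the slide.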

Page 63: How’s the  Parallel Computing Revolution  Going?

Immix Mark-Region

• Parallel
• Opportunistic defragmentation
• Overflow allocation
• Implicit marking

Page 64: How’s the  Parallel Computing Revolution  Going?

Immix Mark-Region: Bump Allocation + Trace + Sweep-to-Region

[Figure: panels Garbage Collection, Total Performance, and Mutator (time versus space) and Minimum Heap (space), for MarkSweep, MarkCompact, SemiSpace, and Immix. Geometric mean of DaCapo’06, jvm98, and jbb2000 on 2.4GHz Core 2 Duo]

✓ Simple, very fast collection
✓ Space efficient
✓ Good locality
✓ Excellent performance

Page 65: How’s the  Parallel Computing Revolution  Going?

A Better Space-Time Tradeoff

Space & time efficiency: why now?

Page 66: How’s the  Parallel Computing Revolution  Going?

The Present

Parallel interface: combines space & time, wicked to program

[Diagram: cache hierarchies of three processors]
Pentium 4 w/ SMT: 8KB L1, 512KB L2
Core 2 Quad: 32KB L1 per CPU, 4MB L2 per pair of CPUs, system bus
Core i7: 32KB L1 and 256KB L2 per CPU, 8MB L3

Page 67: How’s the  Parallel Computing Revolution  Going?

The Future: A parallel ecosystem?

space time efficiency

Parallel software stack
• runtime
• applications
• algorithms

Page 68: How’s the  Parallel Computing Revolution  Going?

Software Challenges and Opportunities

• Communication (efficient coherency)
• Analysis (off critical path, new analyses)
• GC (concurrent, parallel, high throughput)
• JIT (concurrent, parallel, more aggressive)
• Heterogeneity (exploit it)
• Memory (PCM, bandwidth limits)

Page 69: How’s the  Parallel Computing Revolution  Going?

Hardware Challenges and Opportunities

• Heterogeneity
  – Tune cores to specific workloads?
  – Specialize for workloads?
• Coherence
  – SMT coherency does not scale
  – Software guarantees for simplified protocols?
• Memory/Cache
  – Optimize access behavior of managed languages

Page 70: How’s the  Parallel Computing Revolution  Going?

The Future?

Thank you