Performance and Energy-Efficiency: Device Technology, Parallelism, Algorithms, and Beyond

Phillip Stanley-Marbell
Joint work with Victoria Caparrós Cabezas, Rik Jongerius, and Ronald Luijten
IBM Research – Zurich

© 2011 IBM Corporation

PST / Zurich / IBM
Aveiro, Portugal

Vehicles of Performance Growth

๏ Enablers of performance growth over the last 5 decades
– Complex instructions (CISC): more co-dependent work per cycle
– Simple instructions / hardware (RISC): ride technology scaling curve, faster cycles
– Parallelism (superscalar, multithread, multicore): more independent work per cycle

๏ Limits to performance vehicles
– CISC: uncommon denominators; complicated compilation
– RISC: no longer feasible to run at faster cycles; power-limited scaling
– Parallelism: available parallelism at instruction, thread, task, and data levels

vehicle |ˈvēəkəl; ˈvēˌhikəl| noun 1 a thing used to express, embody, or fulfill something

[Figure: timeline annotations “2011”, “2016”, “(today)”, and “(SKA1 construction)”]

Device Technology, Parallelism, and Beyond

๏ This talk
– Device technology trends, spanning the last 30 years, for >150 real designs
– Tools for characterizing parallelism
– Challenges to scaling via parallelism
– Picking the right point in the low-power versus high-performance spectrum
– Opportunities and challenges for the next “performance vehicle”: algorithms

1 ox: single-thread performance
1024 chickens: parallelism
...
1 tractor: better algorithms

“If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?” (Seymour Cray)

Outline

๏ Context
๏ Technology Trends
๏ Algorithms and Parallelism
๏ Low-Power versus High-Performance Tradeoff
๏ Concurrency Limits and Performance Beyond Parallelism
๏ Summary

Technology, Supply Voltage, and Power Delivery Limits (1 of 2)

Data: from an analysis of ~150 HW publications in IEEE ISSCC and IEEE JSSC Journal, from 1980–2010

[Figure: three log-scale plots with mean values marked: transistors per die versus approximate process generation (Leff, Ldrawn, M1 half-pitch; nm); transistors per mm² versus approximate process generation; transistors per mm² versus year (1980–2010)]

๏ Density scaling continues (transistors per unit area)
– However, power per unit area is not scaling
– Power delivery and cooling for large-transistor-count designs increasingly difficult

Technology, Supply Voltage, and Power Delivery Limits (2 of 2)

Data: from an analysis of ~150 HW publications in IEEE ISSCC and IEEE JSSC Journal, from 1980–2010

๏ Power per unit area not scaling because supply voltages stagnated
– Due to concerns with sub-threshold leakage current
– Lower supply voltages also lead to higher supply currents at constant power

๏ Power and cooling challenges limit clock speeds, # active transistors
– Industry-wide shift to increased performance via parallelism and heterogeneity
– Parallelism can be harnessed at different granularities...
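The point that lower supply voltages mean higher supply currents at constant power follows from I = P / V. A minimal sketch of that arithmetic is given here; the 100 W power budget and the voltage steps are illustrative assumptions, not values from the slides.

```python
def supply_current_a(power_w: float, vdd_v: float) -> float:
    """Supply current (amperes) drawn at a given power (watts) and supply voltage (volts)."""
    if vdd_v <= 0:
        raise ValueError("supply voltage must be positive")
    return power_w / vdd_v


if __name__ == "__main__":
    POWER_BUDGET_W = 100.0  # hypothetical constant power budget
    for vdd in (5.0, 3.3, 1.5, 1.0):  # hypothetical supply voltages
        amps = supply_current_a(POWER_BUDGET_W, vdd)
        print(f"Vdd = {vdd:.1f} V -> I = {amps:.1f} A")
```

Under these assumptions, dropping Vdd from 5 V to 1 V at a fixed power budget raises the supply current fivefold, which is the pressure on power delivery that the slide describes.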

[Figure: two log-scale plots with mean values marked: peak supply current (amperes) and supply voltage Vdd (V), each versus approximate process generation (Leff, Ldrawn, M1 half-pitch; nm)]

Challenge: Systems and Architectures Tailored to Workloads

๏ Mapping application parallelism to hardware concurrency
– Appropriate form and amount of hardware concurrency depend on applications

๏ Hardware properties, e.g., microarchitectures
• Power7: 8 cores × 4-way SMT × 12-wide superscalar (8-issue, but 12 FUs)
• ARM Cortex A8: 1 core × 1 thread × 2-wide superscalar
• Power A2 cores: 1 core × 4-way SMT × 2-wide superscalar

๏ Granularities of parallelism in software
– Available data-, task-, thread- (TLP) and instruction-level parallelism (ILP)
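As a rough illustration of how the "cores × SMT × width" figures above translate into hardware concurrency, the sketch below tabulates the three designs. The class and property names are hypothetical, and multiplying cores by superscalar width is a simplification: SMT threads share a core's functional units, so the two derived numbers are upper bounds, not achievable throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Microarchitecture:
    name: str
    cores: int
    smt_ways: int           # hardware threads per core
    superscalar_width: int  # width per core, as quoted on the slide

    @property
    def hardware_threads(self) -> int:
        """Upper bound on thread-level parallelism the hardware can hold."""
        return self.cores * self.smt_ways

    @property
    def aggregate_width(self) -> int:
        """Upper bound on per-cycle instruction-level slots across all cores."""
        return self.cores * self.superscalar_width


# Figures taken from the slide; Power7 uses the quoted 12 functional units.
DESIGNS = [
    Microarchitecture("Power7", cores=8, smt_ways=4, superscalar_width=12),
    Microarchitecture("ARM Cortex A8", cores=1, smt_ways=1, superscalar_width=2),
    Microarchitecture("Power A2 core", cores=1, smt_ways=4, superscalar_width=2),
]

if __name__ == "__main__":
    for design in DESIGNS:
        print(f"{design.name}: {design.hardware_threads} hardware threads, "
              f"aggregate width {design.aggregate_width}")
```

Comparing these bounds against an application's measured TLP and ILP gives a first indication of whether a given design has too little or too much concurrency for that workload.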

Application Parallelism Characterization Framework

[Figure: analysis pipeline with labels “Stage 1”, “Stage 2”, “Execution-driven dataflow analysis”, and “Post-execution analysis”. An input program binary plus an input dataset yield an instruction control-flow-, register- and memory-dependence trace (e.g., addu r3,r2,r6; sw r0,0(xx); addu r2,r2,r5; addiu r4,r4,1; sw r0,0(r3)). Derived results: task-level partitioning to minimize value communication; inter-task data movement; inter-dependent basic blocks with potential data-level parallelism; thread-level parallelism; inter-thread data movement; instruction-level parallelism; per-instruction data movement]

V. Caparrós Cabezas and P. Stanley-Marbell, “Parallelism and Data Movement Characterization for Contemporary Application Classes”, 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), June 2011

๏ Input
– Compiled program binaries and their associated input datasets

๏ Output
– Best-case instruction- (ILP) and thread-level parallelism (TLP), and potential data-level parallelism
– Communication-minimizing MPMD code partitioning
– Data movement characterization
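To make "best-case ILP" concrete, the sketch below computes a dataflow-limit estimate from a register-dependence trace: instruction count divided by the length of the longest chain of true (read-after-write) dependences. This is not the framework's implementation. It assumes unit latency, perfect register renaming, and ignores control and memory dependences; the trace format and the example trace are hypothetical, loosely modeled on the MIPS-style snippet in the figure.

```python
def best_case_ilp(trace):
    """Dataflow-limit ILP: instructions / critical-path length.

    `trace` is a list of (opcode, dest_register_or_None, [source_registers]).
    Only read-after-write register dependences are tracked.
    """
    if not trace:
        return 0.0
    ready = {}  # register -> depth at which its most recent value is produced
    critical_path = 0
    for _opcode, dest, sources in trace:
        # An instruction can issue one step after its latest-producing source.
        depth = 1 + max((ready.get(src, 0) for src in sources), default=0)
        if dest is not None:
            ready[dest] = depth
        critical_path = max(critical_path, depth)
    return len(trace) / critical_path


# Hypothetical trace for illustration only.
EXAMPLE_TRACE = [
    ("addu", "r3", ["r2", "r6"]),
    ("addu", "r2", ["r2", "r5"]),
    ("addiu", "r4", ["r4"]),
    ("sw", None, ["r0", "r3"]),     # depends on the first addu through r3
    ("addu", "r7", ["r3", "r2"]),   # depends on both earlier addu results
]

if __name__ == "__main__":
    print(best_case_ilp(EXAMPLE_TRACE))
```

In this example the five instructions form dependence chains of length two, so the estimate is 2.5 instructions per step. A fully serial chain would give 1.0.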

Characterization of Applications Spanning Berkeley “Dwarfs”

[Figure: ILP and TLP on log-scale axes for each application (DLA1–DLA4, SLA1–SLA3, SM1–SM3, SG1–SG2, UG1, MR1, CL1–CL3, GT1–GT2, BT1–BT4, FSM1–FSM4) and their arithmetic mean; legend: Dwarfs 1–9, 11, and 13]

26 Applications

Dwarf 1: Dense Linear Algebra (DLA): JPEG (encode, decode) [b], kmeans [c], pca [c]
Dwarf 2: Sparse Linear Algebra (SLA): basicmath [b], bitcount [b], matrix_multiply [c]
Dwarf 3: Spectral Methods (SM): lame [b], FFT [b], IFFT [b]
Dwarf 4: N-Body Methods (NB): ammp [a]
Dwarf 5: Structured Grids (SG): JPEG (encode, decode) [b]
Dwarf 6: Unstructured Grids (UG): equake [a]
Dwarf 7: MapReduce (MR): mesa [a]
Dwarf 8: Combinational Logic (CL): rijndael (encode, decode) [b], sha [b]
Dwarf 9: Graph Traversal (GT): qsort [b], Dijkstra [b]
Dwarf 10: Dynamic Programming (DP): None
Dwarf 11: Backtrack / Branch+Bound (BT): vpr [a], mcf [a], art [a], susan [b]
Dwarf 12: Graphical Models (GM): None
Dwarf 13: Finite State Machine (FSM): gzip [a], gcc [a], parser [a], stringsearch [b]

[a] SPEC 2000
[b] MiBench
[c] Phoenix MapReduce Suite

How Low-Power? How Parallel?

[Figure: “Core-O-Meter” dial marked 0, 2, 4, 6, 8, 10, and “11?”, with “Wimpy” Core and “Brawny” Core labels]

๏ How low-power and restricted-performance should one go? [Barroso 2010]
– Single-thread performance still matters for multi-core [Hill and Marty 2008]
– When cores are too simple, cost becomes dominated by packaging costs

[Barroso 2010] T. Mudge and U. Holzle. Challenges and Opportunities for Extremely Energy-Efficient Processors. IEEE Micro, (3) 4, 2010

[Hill and Marty 2008] M. D. Hill and M. R. Marty. Amdahl's Law in the Multicore Era. IEEE Computer, (41) 7, 2008
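The claim that single-thread performance still matters can be illustrated with the symmetric-multicore speedup model from the cited Hill and Marty paper. The sketch below assumes, as that paper does for illustration, that a core built from r base-core equivalents (BCEs) delivers sqrt(r) times the performance of a 1-BCE core; the function and parameter names are mine.

```python
import math


def symmetric_speedup(parallel_fraction: float, total_bces: int, bces_per_core: float) -> float:
    """Speedup of a symmetric multicore relative to a single 1-BCE core.

    parallel_fraction: fraction f of the work that is parallelizable.
    total_bces:        chip budget n, in base-core equivalents.
    bces_per_core:     resources r spent on each core (n / r cores in total).
    """
    core_perf = math.sqrt(bces_per_core)  # assumed perf(r) = sqrt(r)
    serial_time = (1.0 - parallel_fraction) / core_perf
    parallel_time = parallel_fraction * bces_per_core / (core_perf * total_bces)
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    f, n = 0.975, 256  # illustrative values
    for r in (1, 2, 4, 8, 16, 64, 256):
        print(f"r = {r:3d} BCEs/core -> speedup {symmetric_speedup(f, n, r):6.2f}")
```

With these illustrative values, 256 of the simplest cores (r = 1) are outperformed by fewer, moderately larger cores, because the serial fraction runs at the speed of a single core.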

Platforms

๏ Candidate processors for scale-out systems: Atom, PowerPC, ARM
– Three hardware reference designs; all processors implemented in 45nm CMOS
– All three systems running Linux distributions based on kernel 2.6.32

Platform                   | General-Purpose CPU Cores  | Clock    | Cache Hierarchy
Atom™ D510MO               | 2 x86-64                   | 1.66 GHz | 32 K/24 K L1 I/D (per core) and 1 M L2
Freescale™ P2020RDB        | 2 Power Architecture® e500 | 1.0 GHz  | 32 K/32 K L1 I/D (per core), 512 K shared L2
TI DM3730 (Beagleboard-xM) | 1 ARM® Cortex™-A8          | 1.0 GHz  | 32 K/32 K L1 I/D, 256 K L2

Common Properties:
• Identical 4GB flash disk
• Lab-grade power meter, 1 measurement per second, sub-mA resolution
• Power measured for whole platform

Application-Driven Whole-System Power Analysis

๏ Serial workload
– Essentially tied between ARM (30.8kJ) and Atom (33.1kJ): 7.8% difference

[Figure: one plot per platform for the serial workload, annotated with run time and total energy. ARM: ≈9100 s, Etotal ≈ 31 kJ. Atom: ≈2030 s, Etotal ≈ 33 kJ. PPC: ≈7900 s, Etotal ≈ 104 kJ]

P. Stanley-Marbell and V. Caparrós Cabezas, “Performance, Power, and Thermal Analysis of Low-Power Processors for Scale-Out Systems”, in IEEE High-Performance and Power-Aware Computing (HPPAC 2011).
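The run times and energies quoted for the serial workload imply an average whole-platform power for each system, since average power is total energy divided by run time. A small sketch of that arithmetic follows; it uses the approximate times from the slide annotations and the energies quoted in the slide text (the PPC energy is the rounded annotation), so the results are only approximate.

```python
# (run time in seconds, total energy in joules) for the serial workload,
# as quoted on the slide; run times are approximate.
SERIAL_RUNS = {
    "ARM": (9100.0, 30.8e3),
    "Atom": (2030.0, 33.1e3),
    "PPC": (7900.0, 104.0e3),
}


def average_power_w(time_s: float, energy_j: float) -> float:
    """Average power (watts) over a run of the given duration and energy."""
    return energy_j / time_s


if __name__ == "__main__":
    for platform, (time_s, energy_j) in SERIAL_RUNS.items():
        print(f"{platform}: {average_power_w(time_s, energy_j):.1f} W average")
```

On these figures the ARM platform averages a little over 3 W and the Atom platform roughly 16 W, yet they use nearly the same total energy because the Atom finishes about 4.5 times sooner.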

Application-Driven Whole-System Power Analysis

๏ “Throughput” workload
– All 16 applications launched simultaneously; on Atom and PPC, utilize 2 cores
– PPC: has limited FPU; perf. limited by floating-point-intensive Equake benchmark

๏ Highest-average-power system is the most energy-efficient
– Atom platform uses least energy (15.4kJ), vs. ARM (30.8kJ) and PPC (86.8kJ)

[Figure: one plot per platform for the throughput workload, annotated with run time and total energy. Atom: ≈820 s, Etotal ≈ 15 kJ. PPC: ≈6400 s, Etotal ≈ 86 kJ]
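To put the throughput-workload energies side by side, the sketch below normalizes each platform's energy to the lowest one. The dictionary layout and function name are illustrative; the energy values are those quoted in the slide text.

```python
# Total energy (kJ) for the throughput workload, as quoted on the slide.
THROUGHPUT_ENERGY_KJ = {"Atom": 15.4, "ARM": 30.8, "PPC": 86.8}


def energy_relative_to_best(energy_by_platform):
    """Map each platform to its energy divided by the lowest energy observed."""
    best = min(energy_by_platform.values())
    return {name: energy / best for name, energy in energy_by_platform.items()}


if __name__ == "__main__":
    ratios = energy_relative_to_best(THROUGHPUT_ENERGY_KJ)
    for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
        print(f"{name}: {ratio:.2f}x the energy of the best platform")
```

By this measure the ARM platform uses about twice, and the PPC platform between five and six times, the energy of the Atom platform for the same set of applications.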

Interplay Between Power and Bandwidth: Pins

Data: from an analysis of ~150 HW publications in IEEE ISSCC and IEEE JSSC Journal, from 1980–2010

[Figure: supply pin count versus peak aggregate supply current (amperes), log-log axes, with fitted trend $\text{Supply Pin Count} = 58.8915 \cdot \text{current}^{0.53786}$]

[Figure: total pin count versus peak aggregate supply current (amperes), log-log axes, with fitted trend $\text{Total Pin Count} = 215.151 \cdot \text{current}^{0.364102}$]

๏ Bandwidth and power are both limiters; both require more pins
– Limited pin count growth (doubling every ~6 years)
– Supply pin count growing more rapidly than total pin count
– Tradeoff: more supply pins give lower resistive losses, but restrict I/O bandwidth
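The "doubling every ~6 years" figure can be checked against the trend fitted on this slide's pin-count-versus-year plot, which reads as Total Pin Count = 10^(0.0476779 · year − 92.6952). The sketch below evaluates that fit; the constant and function names are mine, and extrapolating the fit to 2020 is an illustration rather than a result stated on the slide.

```python
import math

# Coefficients of the log-linear fit read from the slide's plot.
LOG10_SLOPE_PER_YEAR = 0.0476779
LOG10_INTERCEPT = -92.6952


def total_pin_count(year: float) -> float:
    """Total package pin count predicted by the fitted trend for a given year."""
    return 10.0 ** (LOG10_SLOPE_PER_YEAR * year + LOG10_INTERCEPT)


def doubling_time_years() -> float:
    """Years needed for the fitted pin count to double."""
    return math.log10(2.0) / LOG10_SLOPE_PER_YEAR


if __name__ == "__main__":
    print(f"Doubling time: {doubling_time_years():.1f} years")
    for year in (1990, 2000, 2010, 2020):
        print(f"{year}: about {total_pin_count(year):.0f} pins")
```

The doubling time works out to a little over six years, matching the bullet above, which is slow compared with the growth in transistor density discussed earlier in the talk.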

[Figure: total pin count versus year (1980–2020), log-scale vertical axis, mean values marked, with fitted trend $\text{Total Pin Count} = 10^{0.0476779 \cdot \text{year} - 92.6952}$]

Interplay Between Power and Bandwidth: Pins

[Figure: two plots. (1) Total pins and supply pins versus supply current (amperes); the gap between the two curves is the pins available for signals. (2) Memory interface pins versus number of on-chip threads (log-log axes), compared against the projected total pin count in 2020 (from Figure 2), the projected supply pin count in 2020 (from Figures 1 and 7), and the ITRS 2:1 supply:signal pin count in 2020]

๏ Projections
– From hardware trends of previous slide, and application memory reference analyses

๏ Takeaway message
– In addition to constraints on available parallelism in applications, existing packaging techniques and their improvement trends may not support >1000 cores circa 2020

P. Stanley-Marbell, V. Caparrós Cabezas, and R. Luijten, “Pinned to the Walls: Impact of Packaging on the Memory and Power Walls”, to appear, IEEE/ACM International Symposium on Low-Power Electronics and Design (ISLPED 2011).
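The shrinking gap between total pins and supply pins can be reproduced from the two fitted trends on the previous slide (supply pins = 58.8915 · I^0.53786 and total pins = 215.151 · I^0.364102, with I the peak aggregate supply current in amperes). The sketch below is only an extrapolation of those fits, with names of my choosing; it is not the projection method of the cited paper.

```python
SUPPLY_COEFF, SUPPLY_EXP = 58.8915, 0.53786
TOTAL_COEFF, TOTAL_EXP = 215.151, 0.364102


def supply_pins(current_a: float) -> float:
    """Fitted supply pin count at a given peak supply current (amperes)."""
    return SUPPLY_COEFF * current_a ** SUPPLY_EXP


def total_pins(current_a: float) -> float:
    """Fitted total pin count at a given peak supply current (amperes)."""
    return TOTAL_COEFF * current_a ** TOTAL_EXP


def signal_pins(current_a: float) -> float:
    """Pins left over for signals: the gap between the two fitted curves."""
    return total_pins(current_a) - supply_pins(current_a)


def crossover_current_a() -> float:
    """Current at which the fitted supply pin count equals the fitted total."""
    return (TOTAL_COEFF / SUPPLY_COEFF) ** (1.0 / (SUPPLY_EXP - TOTAL_EXP))


if __name__ == "__main__":
    for amps in (100.0, 300.0, 1000.0):
        print(f"{amps:6.0f} A: about {signal_pins(amps):.0f} pins left for signals")
    print(f"Fits cross near {crossover_current_a():.0f} A")
```

Because the supply-pin fit has the larger exponent, the two curves eventually meet; beyond that current the extrapolated fits leave no pins for signals, which is one way to read the packaging limit in the takeaway message.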

Page 33: Performance and Energy-Efficiency: Device Technology ... · ๏ This talk – Device technology trends, spanning the last 30 years, for >150 real designs – Tools for characterizing

© 2011 IBM CorporationPST / Zurich / IBM

Performance and Energy-Efficiency: Device Technology, Parallelism, Algorithms, and Beyond

Aveiro, Portugal

Beyond Parallelism: Algorithms versus Compute ProblemsOld: “Binaries” capture algorithmsApplication specifies an algorithm for solving problem, without being specific about what compute problem is

19

P. Stanley-Marbell “Sal/Svm—An Assembly Language and Virtual Machine for Computing with Non-Enumerated Sets”, ACM Virtual Machines and Intermediate Languages (VMIL 2010).

Beyond Parallelism: Algorithms versus Compute Problems

๏ Old: “Binaries” capture algorithms
– Application specifies an algorithm for solving a problem, without being specific about what the compute problem is
– Compile → 110010000000000000011001010100110001 → Answer
– The binary is an instruction sequence for a stored-program computer (essentially an efficient Turing Machine implementation)

๏ New: “Binaries” capture compute problems
– Problem specification (Sal): S1a = (x : U1[1] ≠ 1) : U1.
– Parts labeled in the specification: set free variable, free variable's type, set's predicate expression, set's universe / type
– Problem specification → (via Salc compiler) → Svm machine state representing the problem → Answer

๏ Svm machine state representing the problem
– U registers: <1, 5, ..., 94>; <"en", ..., "bij">; ...; {2.1, ..., -1.3}
– S registers: P1:U3; (P1 | P7) : U1; ...; true:U2
– P registers: ... (diagram shows =, ^, 2)
– Runtime symbol table (Variable, Scope, Value): (qx, 14, 2); ...; (mquantvar, 2, "hello")

P. Stanley-Marbell “Sal/Svm—An Assembly Language and Virtual Machine for Computing with Non-Enumerated Sets”, ACM Virtual Machines and Intermediate Languages (VMIL 2010).
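
The contrast between capturing an algorithm and capturing a compute problem can be sketched in a few lines. This is not Sal syntax and does not model Svm; it is a hypothetical illustration in which a problem is stated as a universe plus a predicate, and two interchangeable strategies produce the same answer. All names (`SetProblem`, `solve_by_filtering`, `solve_by_chunks`) are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class SetProblem:
    """A set described by a universe and a predicate, not by an algorithm."""
    universe: Sequence[int]            # hypothetical stand-in for a universe
    predicate: Callable[[int], bool]   # hypothetical stand-in for a predicate


def solve_by_filtering(problem: SetProblem) -> List[int]:
    """Strategy 1: test every element of the universe in order."""
    return [x for x in problem.universe if problem.predicate(x)]


def solve_by_chunks(problem: SetProblem, chunk_size: int = 2) -> List[int]:
    """Strategy 2: split the universe into independent chunks, then combine."""
    items = list(problem.universe)
    result: List[int] = []
    for start in range(0, len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        result.extend(x for x in chunk if problem.predicate(x))
    return result


# The problem statement: all members of the universe that are not equal to 1.
problem = SetProblem(universe=range(0, 5), predicate=lambda x: x != 1)

members = solve_by_filtering(problem)
members_chunked = solve_by_chunks(problem)
print(members, members_chunked)
```

Because the problem statement does not fix an evaluation order, whatever executes it is free to pick either strategy, or another one entirely.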

Outline

๏ Context

๏ Technology Trends

๏ Algorithms and Parallelism

๏ Low-Power versus High-Performance Tradeoff

๏ Concurrency Limits and Performance Beyond Parallelism

๏ Summary

Summary

๏ Context
– Continued performance increases desirable for, e.g., future SKA backend

๏ Technology trends
– Performance growth no longer through clock speed increases; power-limited

๏ Parallelism as a performance vehicle
– Presented framework for characterizing available parallelism (ILP, TLP) in applications

๏ Parallelism, performance, and low-power tradeoffs
– Characterization of not just average power, but energy-to-solution for whole systems

๏ Potential limits to scaling via parallelism
– New approaches desirable for improved bandwidth to support parallelism
– Alternatives to single-algorithm scaling: algorithmic choice

Backup Slides