© 2011 IBM Corporation
Performance and Energy-Efficiency: Device Technology, Parallelism, Algorithms, and Beyond
Phillip Stanley-Marbell. Joint work with Victoria Caparrós Cabezas, Rik Jongerius, and Ronald Luijten
IBM Research—Zurich
[Timeline: 2011 (today) to 2016 (SKA1 construction)]
© 2011 IBM Corporation / PST / Zurich / IBM
Performance and Energy-Efficiency: Device Technology, Parallelism, Algorithms, and Beyond
Aveiro, Portugal
Vehicles of Performance Growth
๏ Enablers of performance growth over the last 5 decades
– Complex instructions (CISC): more co-dependent work per cycle
– Simple instructions / hardware (RISC): ride technology scaling curve, faster cycles
– Parallelism (superscalar, multithread, multicore): more independent work per cycle
๏ Limits to performance vehicles
– CISC: uncommon denominators—complicated compilation
– RISC: no longer feasible to run at faster cycles—power-limited scaling
– Parallelism: available parallelism at instruction, thread, task, and data levels
vehicle |ˈvēəkəl; ˈvēˌhikəl| noun: 1. a thing used to express, embody, or fulfill something
๏ This talk
– Device technology trends, spanning the last 30 years, for >150 real designs
– Tools for characterizing parallelism
– Challenges to scaling via parallelism
– Picking the right point in the low-power versus high-performance spectrum
– Opportunities and challenges for the next “performance vehicle”: algorithms
Device Technology, Parallelism, and Beyond
If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?
— Seymour Cray
1 ox: single-thread performance. 1024 chickens: parallelism. 1 tractor: better algorithms.
Outline
๏ Context
๏ Technology Trends
๏ Algorithms and Parallelism
๏ Low-Power versus High-Performance Tradeoff
๏ Concurrency Limits and Performance Beyond Parallelism
๏ Summary
Technology, Supply Voltage, and Power Delivery Limits (1 of 2)
Data: from an analysis of ~150 HW publications in IEEE ISSCC and IEEE JSSC Journal, from 1980–2010
[Plots: transistor density (transistors per mm²) and transistors per die versus approximate process generation (Leff ≈ Ldrawn ≈ M1 half-pitch, nm), and transistor density versus year, 1980–2010; mean values marked.]
๏ Density scaling continues (transistors per unit area)
– However, power per unit area is not scaling
– Power delivery and cooling for large-transistor-count designs increasingly difficult
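The "power per unit area is not scaling" point can be made concrete with a small sketch (mine, not from the talk) of the classic CMOS dynamic-power relation:

```python
# A small sketch (not from the talk): dynamic CMOS power is P = C * Vdd^2 * f.
# Under classical (Dennard) scaling by a linear factor s < 1, capacitance C and
# Vdd scale by s, frequency by 1/s, and area by s^2, so power per unit area
# stays flat. With Vdd held constant (post-Dennard), it grows ~1/s^2 per step.

def power_density_ratio(s, vdd_scales):
    """Power-per-area change across one generation of linear shrink s (< 1)."""
    c, f, area = s, 1.0 / s, s * s
    vdd = s if vdd_scales else 1.0
    return (c * vdd ** 2 * f) / area

dennard = power_density_ratio(0.7, vdd_scales=True)        # -> 1.0 (flat)
post_dennard = power_density_ratio(0.7, vdd_scales=False)  # -> ~2.04x per step
```

The second case is exactly the regime the supply-voltage data on the next slide shows.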
Technology, Supply Voltage, and Power Delivery Limits (2 of 2)
Data: from an analysis of ~150 HW publications in IEEE ISSCC and IEEE JSSC Journal, from 1980–2010
๏ Power per unit area not scaling because supply voltages stagnated
– Due to concerns with sub-threshold leakage current
– Lower supply voltages also lead to higher supply currents at constant power
๏ Power and cooling challenges limit clock speeds, # active transistors
– Industry-wide shift to increased performance via parallelism and heterogeneity
– Parallelism can be harnessed at different granularities...
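The constant-power arithmetic behind the higher-supply-current point is a one-liner; this sketch (my illustration, not the talk's) also shows why resistive delivery losses make it painful:

```python
# A sketch of the constant-power arithmetic: at fixed power P, supply current
# I = P / Vdd, and resistive delivery loss goes as I^2 * R, so lowering Vdd at
# constant power raises both the current and (quadratically) the loss.

def supply_current(power_w, vdd_v):
    return power_w / vdd_v

def delivery_loss_w(power_w, vdd_v, r_ohm):
    i = supply_current(power_w, vdd_v)
    return i * i * r_ohm

# 100 W at 5 V draws 20 A; the same 100 W at 1 V draws 100 A, so for the same
# delivery-network resistance the I^2*R loss grows 25x.
```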
[Plots: supply voltage Vdd (V) and peak supply current (A) versus approximate process generation (Leff ≈ Ldrawn ≈ M1 half-pitch, nm); mean values marked.]
Challenge: Systems and Architectures Tailored to Workloads
๏ Mapping application parallelism to hardware concurrency
– Appropriate form and amount of hardware concurrency depends on applications
๏ Hardware properties, e.g., microarchitectures
• Power7: 8 cores × 4-way SMT × 12-wide superscalar (8-issue, but 12 FUs)
• ARM Cortex-A8: 1 core × 1 thread × 2-wide superscalar
• Power A2: 1 core × 4-way SMT × 2-wide superscalar
๏ Granularities of parallelism in software
– Available data-, task-, thread- (TLP) and instruction-level parallelism (ILP)
V. Caparrós Cabezas and P. Stanley-Marbell, “Parallelism and Data Movement Characterization for Contemporary Application Classes”, 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), June 2011
[Figure: two-stage analysis framework. Stage 1: execution-driven dataflow analysis of the input program binary and its input dataset, producing an instruction control-flow-, register-, and memory-dependence trace. Stage 2: post-execution analysis, yielding (1) task-level partitioning to minimize value communication, (2) inter-task data movement, (3) inter-dependent basic blocks with potential data-level parallelism, (4) thread-level parallelism, (5) inter-thread data movement, (6) instruction-level parallelism, and (7) per-instruction data movement.]
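The ILP part of the post-execution analysis can be sketched in a few lines (a toy simplification of the idea, not the paper's actual tooling): with unit-latency instructions and unlimited functional units, best-case ILP is the instruction count divided by the longest dependence chain in the trace.

```python
# Toy sketch of dependence-trace ILP analysis (my simplification): best-case
# ILP = instruction count / critical-path length of the dependence DAG,
# assuming unit-latency instructions and unlimited functional units.

def best_case_ilp(trace):
    """trace: list of (dest, [sources]) operand names, in program order."""
    depth = {}           # operand name -> depth of the chain producing it
    critical_path = 0
    for dest, srcs in trace:
        d = 1 + max((depth.get(s, 0) for s in srcs), default=0)
        depth[dest] = d
        critical_path = max(critical_path, d)
    return len(trace) / critical_path

# Two independent two-instruction chains: 4 instructions, critical path 2.
trace = [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r3"])]  # ILP = 2.0
```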
Application Parallelism Characterization Framework
๏ Input
– Compiled program binaries and their associated input datasets
๏ Output
– Best-case instruction-level (ILP) and thread-level parallelism (TLP), and potential data-level parallelism
– Communication-minimizing MPMD code partitioning
– Data movement characterization
V. Caparrós Cabezas and P. Stanley-Marbell, “Parallelism and Data Movement Characterization for Contemporary Application Classes”, 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), June 2011
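The "communication-minimizing partitioning" output can be illustrated with a toy scoring function (my illustration, not the framework's algorithm): given dependence edges between instructions, a candidate task assignment is scored by the number of values that must move between tasks.

```python
# Toy sketch: each dependence edge (producer, consumer) whose endpoints land
# in different tasks moves one value between tasks; a communication-minimizing
# partitioner searches for the assignment minimizing this count.

def inter_task_values(edges, task_of):
    return sum(1 for producer, consumer in edges
               if task_of[producer] != task_of[consumer])

edges = [(0, 1), (1, 2), (0, 3), (3, 4), (2, 4)]
balanced_cut = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}  # 2 values cross tasks
poor_cut = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A"}      # 4 values cross tasks
```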
Characterization of Applications Spanning Berkeley “Dwarfs”
Dwarf 1: Dense Linear Algebra (DLA): JPEG (encode, decode)[b], kmeans[c], pca[c]
Dwarf 2: Sparse Linear Algebra (SLA): basicmath[b], bitcount[b], matrix_multiply[c]
Dwarf 3: Spectral Methods (SM): lame[b], FFT[b], IFFT[b]
Dwarf 4: N-Body Methods (NB): ammp[a]
Dwarf 5: Structured Grids (SG): JPEG (encode, decode)[b]
Dwarf 6: Unstructured Grids (UG): equake[a]
Dwarf 7: MapReduce (MR): mesa[a]
Dwarf 8: Combinational Logic (CL): rijndael (encode, decode)[b], sha[b]
Dwarf 9: Graph Traversal (GT): qsort[b], Dijkstra[b]
Dwarf 10: Dynamic Programming (DP): none
Dwarf 11: Backtrack / Branch+Bound (BT): vpr[a], mcf[a], art[a], susan[b]
Dwarf 12: Graphical Models (GM): none
Dwarf 13: Finite State Machine (FSM): gzip[a], gcc[a], parser[a], stringsearch[b]
[a] SPEC 2000, [b] MiBench, [c] Phoenix MapReduce Suite. (26 applications)
[Plots: best-case ILP (log scale, 1–1000) and TLP (1–10) for the 26 applications (DLA1–4, SLA1–3, SM1–3, SG1–2, UG1, MR1, CL1–3, GT1–2, BT1–4, FSM1–4) and their arithmetic mean, colored by dwarf (Dwarfs 1–9, 11, 13).]
[Graphic: a “Core-O-Meter” gauge from 0 to 11(?), with a “Wimpy” core at the low end and a “Brawny” core at the high end.]
How Low-Power? How Parallel?
๏ How low-power and restricted-performance should one go? [Barroso 2010]
– Single-thread performance still matters for multicore [Hill and Marty 2008]
– When cores are too simple, cost becomes dominated by packaging costs
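The Hill and Marty point can be made quantitative with the symmetric-multicore model from their cited article: a chip budget of n base-core equivalents (BCEs) is spent on n/r cores of r BCEs each, with single-core performance modeled as sqrt(r).

```python
import math

# Hill & Marty's symmetric-multicore Amdahl model (from the cited 2008 IEEE
# Computer article): n BCEs spent on n/r cores of r BCEs each, single-core
# performance sqrt(r), parallel fraction f.

def symmetric_speedup(f, n, r):
    perf = math.sqrt(r)
    return 1.0 / ((1.0 - f) / perf + f * r / (perf * n))

# Even at f = 0.975 and n = 256, modestly brawny cores (r = 4) beat all-wimpy
# cores (r = 1), because the serial fraction runs on a single core.
```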
[Barroso 2010] T. Mudge and U. Hölzle. Challenges and Opportunities for Extremely Energy-Efficient Processors. IEEE Micro, 30(4), 2010.
[Hill and Marty 2008] M. D. Hill and M. R. Marty. Amdahl's Law in the Multicore Era. IEEE Computer, 41(7), 2008.
Platforms
๏ Candidate processors for scale-out systems: Atom, PowerPC, ARM
– Three hardware reference designs; all processors implemented in 45 nm CMOS
– All three systems running Linux distributions based on kernel 2.6.32
Platform / General-Purpose CPU (Cores, Clock) / Cache Hierarchy:
• Atom™ D510MO: 2 x86-64 cores, 1.66 GHz; 32 K/24 K L1 I/D (per core) and 1 M L2
• Freescale™ P2020RDB: 2 Power Architecture® e500 cores, 1.0 GHz; 32 K/32 K L1 I/D (per core), 512 K shared L2
• TI DM3730 (Beagleboard-xM): 1 ARM® Cortex™-A8 core, 1.0 GHz; 32 K/32 K L1 I/D, 256 K L2
Common properties:
• Identical 4 GB flash disk
• Lab-grade power meter, 1 measurement per second, sub-mA resolution
• Power measured for whole platform
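The measurement reduction implied by this setup is straightforward; the sketch below (an assumption matching the setup described, not the authors' published code) turns once-per-second whole-platform power samples into energy-to-solution:

```python
# Sketch: with power sampled once per second, energy-to-solution is the
# discrete integral of the samples over the run.

def energy_joules(power_samples_w, dt_s=1.0):
    return sum(power_samples_w) * dt_s

# e.g., a platform averaging 3.4 W over a ~9100 s run consumes ~31 kJ.
```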
Application-Driven Whole-System Power Analysis
๏ Serial workload
– Essentially tied between ARM (30.8 kJ) and Atom (33.1 kJ): 7.8% difference
P. Stanley-Marbell and V. Caparrós Cabezas, “Performance, Power, and Thermal Analysis of Low-Power Processors for Scale-Out Systems”, IEEE High-Performance and Power-Aware Computing (HPPAC 2011).
[Power traces (serial workload): ARM ≈9100 s, Etotal ≈ 31 kJ; Atom ≈2030 s, Etotal ≈ 33 kJ; PPC ≈7900 s, Etotal ≈ 104 kJ.]
Application-Driven Whole-System Power Analysis
๏ “Throughput” workload
– All 16 applications launched simultaneously; on Atom and PPC, utilize 2 cores
– PPC: has limited FPU; performance limited by floating-point-intensive equake benchmark
๏ Highest-average-power system is the most energy-efficient
– Atom platform uses least energy (15.4 kJ), vs. ARM (30.8 kJ) and PPC (86.8 kJ)
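The slide's conclusion comes down to energy-to-solution being average power times time-to-solution; the average powers below are my back-of-envelope values derived from the reported energies and run times, not separately measured numbers:

```python
# Energy-to-solution = average power x time-to-solution, so the
# highest-average-power platform can still use the least energy.

def energy_kj(avg_power_w, time_s):
    return avg_power_w * time_s / 1000.0

atom_kj = energy_kj(18.8, 820)   # ~15.4 kJ, despite ~5.5x the ARM's power
arm_kj = energy_kj(3.4, 9100)    # ~30.9 kJ
```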
15
Etotal ≈ 15 kJ
≈820s
Atom
Atom
Etotal ≈ 15 kJ
≈820s
Interplay Between Power and Bandwidth: Pins
Data: from an analysis of ~150 HW publications in IEEE ISSCC and IEEE JSSC Journal, from 1980–2010
[Plots: total pin count versus year, 1980–2020, with fit Total Pin Count ≈ 10^(0.0476779·year − 92.6952); supply pin count versus peak aggregate supply current, fit ≈ 58.8915·current^0.53786; total pin count versus peak aggregate supply current, fit ≈ 215.151·current^0.364102; mean values marked.]
๏ Bandwidth and power are both limiters; both require more pins
– Limited pin-count growth (doubling every ~6 years)
– Supply pin count growing more rapidly than total pin count
– Tradeoff between more supply pins for lower resistive losses, and restricted I/O bandwidth
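These bullets follow directly from the fitted trend curves shown in the plots; a short sketch using those fits makes the ~6-year doubling and the supply-pin squeeze explicit:

```python
import math

# Fits from the plots: total pins vs. year, Total = 10^(0.0476779*year -
# 92.6952); supply pins vs. peak supply current, 58.8915*I^0.53786; total pins
# vs. current, 215.151*I^0.364102.

def total_pins_by_year(year):
    return 10 ** (0.0476779 * year - 92.6952)

def supply_pins(current_a):
    return 58.8915 * current_a ** 0.53786

def total_pins(current_a):
    return 215.151 * current_a ** 0.364102

doubling_years = math.log10(2) / 0.0476779  # ~6.3 years per doubling

# Because the supply-pin exponent (0.538) exceeds the total-pin exponent
# (0.364), growing supply currents squeeze the pins left over for signals.
```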
[Plots: total pins and supply pins versus supply current, with the gap between the curves indicating pins available for signals; memory-interface pins required versus number of on-chip threads, compared against the projected 2020 total pin count, the projected 2020 supply pin count, and an ITRS 2:1 supply:signal pin count in 2020.]
Interplay Between Power and Bandwidth: Pins
๏ Projections
– From hardware trends of previous slide, and application memory-reference analyses
๏ Takeaway message
– In addition to constraints on available parallelism in applications, existing packaging techniques and their improvement trends may not support >1000 cores circa 2020
P. Stanley-Marbell, V. Caparrós Cabezas, and R. Luijten “Pinned to the Walls—Impact of Packaging on the Memory and Power Walls”, to appear, IEEE/ACM International Symposium on Low-Power Electronics and Design (ISLPED 2011).
Beyond Parallelism: Algorithms versus Compute Problems
๏ Old: “Binaries” capture algorithms
– The application specifies an algorithm for solving the problem, without being specific about what the compute problem is
– Compile → instruction sequence for a stored-program computer (essentially an efficient Turing Machine implementation) → Answer
๏ New: “Binaries” capture compute problems
– Problem specification (Sal), e.g., S1a = (x : U1[1] ≥ 1) : U1.
  (annotated with: the set's free variable, the free variable's type, the set's universe / type, and the set's predicate expression)
– Compile (via the Salc compiler) → Svm machine state representing the problem → Answer
[Figure: Svm machine state representing the problem, comprising U registers (universes, e.g., <1, 5, ..., 94>, <"en", ..., "bij">, {2.1, ..., -1.3}), S registers (sets, e.g., P1:U3, (P1 | P7):U1, true:U2), P registers (predicate expression trees), and a runtime symbol table with variable, scope, and value columns.]
P. Stanley-Marbell, “Sal/Svm—An Assembly Language and Virtual Machine for Computing with Non-Enumerated Sets”, ACM Virtual Machines and Intermediate Languages (VMIL 2010).
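A toy illustration of the "binaries capture compute problems" idea (mine, not Sal/Svm itself): the shipped artifact is a predicate over a universe, and the runtime, like Svm, is free to choose how to find the satisfying set.

```python
# Toy illustration (not Sal/Svm): the "binary" is a predicate over a universe;
# the runtime chooses the solution strategy.

def solve(universe, predicate):
    """A naive 'virtual machine': here it enumerates, but nothing in the
    problem specification forces this particular algorithm."""
    return {x for x in universe if predicate(x)}

# Rough analogue of the slide's S1a = (x : U1[1] >= 1) : U1, over a small
# finite universe of pairs (real Sal universes need not be enumerated):
U1 = [(a, b) for a in range(3) for b in range(3)]
S1a = solve(U1, lambda x: x[1] >= 1)   # pairs whose second component is >= 1
```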
Summary
๏ Context
– Continued performance increases desirable for, e.g., future SKA backend
๏ Technology trends
– Performance growth no longer through clock-speed increases; power-limited
๏ Parallelism as a performance vehicle
– Presented framework for characterizing available parallelism (ILP, TLP) in applications
๏ Parallelism, performance, and low-power tradeoffs
– Characterization of not just average power, but energy-to-solution for whole systems
๏ Potential limits to scaling via parallelism
– New approaches desirable for improved bandwidth to support parallelism
– Alternatives to single-algorithm scaling: algorithmic choice
Backup Slides