© 2011 IBM Corporation
Performance and Energy-Efficiency: Device Technology, Parallelism, Algorithms, and Beyond
Phillip Stanley-Marbell. Joint work with Victoria Caparrós Cabezas, Rik Jongerius, and Ronald Luijten
IBM Research—Zurich
[Timeline: 2011 (today) to 2016 (SKA1 construction)]
© 2011 IBM Corporation / PST / Zurich / IBM
Performance and Energy-Efficiency: Device Technology, Parallelism, Algorithms, and Beyond
Aveiro, Portugal
Vehicles of Performance Growth
๏ Enablers of performance growth over the last 5 decades
– Complex instructions (CISC): more co-dependent work per cycle
– Simple instructions / hardware (RISC): ride technology scaling curve, faster cycles
– Parallelism (superscalar, multithread, multicore): more independent work per cycle
๏ Limits to performance vehicles
– CISC: uncommon denominators—complicated compilation
– RISC: no longer feasible to run at faster cycles—power-limited scaling
– Parallelism: available parallelism at instruction, thread, task, and data levels
vehicle |ˈvēəkəl; ˈvēˌhikəl| noun: 1. a thing used to express, embody, or fulfill something
๏ This talk
– Device technology trends, spanning the last 30 years, for >150 real designs
– Tools for characterizing parallelism
– Challenges to scaling via parallelism
– Picking the right point in the low-power versus high-performance spectrum
– Opportunities and challenges for the next “performance vehicle”: algorithms
Device Technology, Parallelism, and Beyond
If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?
— Seymour Cray
1 ox: single-thread performance. 1024 chickens: parallelism. 1 tractor: better algorithms.
Outline
๏ Context
๏ Technology Trends
๏ Algorithms and Parallelism
๏ Low-Power versus High-Performance Tradeoff
๏ Concurrency Limits and Performance Beyond Parallelism
๏ Summary
Technology, Supply Voltage, and Power Delivery Limits (1 of 2)
Data: from an analysis of ~150 HW publications in IEEE ISSCC and IEEE JSSC Journal, from 1980–2010
[Plots: transistor density (transistors per mm²) and transistors per die versus approximate process generation (Leff ≈ Ldrawn ≈ M1 half-pitch, nm), and transistor density versus year, 1980–2010; mean values marked.]
๏ Density scaling continues (transistors per unit area)
– However, power per unit area is not scaling
– Power delivery and cooling for large-transistor-count designs increasingly difficult
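The "power per unit area is not scaling" point can be made concrete with a small sketch (mine, not from the talk) of the classic CMOS dynamic-power relation:

```python
# A small sketch (not from the talk): dynamic CMOS power is P = C * Vdd^2 * f.
# Under classical (Dennard) scaling by a linear factor s < 1, capacitance C and
# Vdd scale by s, frequency by 1/s, and area by s^2, so power per unit area
# stays flat. With Vdd held constant (post-Dennard), it grows ~1/s^2 per step.

def power_density_ratio(s, vdd_scales):
    """Power-per-area change across one generation of linear shrink s (< 1)."""
    c, f, area = s, 1.0 / s, s * s
    vdd = s if vdd_scales else 1.0
    return (c * vdd ** 2 * f) / area

dennard = power_density_ratio(0.7, vdd_scales=True)        # -> 1.0 (flat)
post_dennard = power_density_ratio(0.7, vdd_scales=False)  # -> ~2.04x per step
```

The second case is exactly the regime the supply-voltage data on the next slide shows.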
Technology, Supply Voltage, and Power Delivery Limits (2 of 2)
Data: from an analysis of ~150 HW publications in IEEE ISSCC and IEEE JSSC Journal, from 1980–2010
๏ Power per unit area not scaling because supply voltages stagnated
– Due to concerns with sub-threshold leakage current
– Lower supply voltages also lead to higher supply currents at constant power
๏ Power and cooling challenges limit clock speeds, # active transistors
– Industry-wide shift to increased performance via parallelism and heterogeneity
– Parallelism can be harnessed at different granularities...
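The constant-power arithmetic behind the higher-supply-current point is a one-liner; this sketch (my illustration, not the talk's) also shows why resistive delivery losses make it painful:

```python
# A sketch of the constant-power arithmetic: at fixed power P, supply current
# I = P / Vdd, and resistive delivery loss goes as I^2 * R, so lowering Vdd at
# constant power raises both the current and (quadratically) the loss.

def supply_current(power_w, vdd_v):
    return power_w / vdd_v

def delivery_loss_w(power_w, vdd_v, r_ohm):
    i = supply_current(power_w, vdd_v)
    return i * i * r_ohm

# 100 W at 5 V draws 20 A; the same 100 W at 1 V draws 100 A, so for the same
# delivery-network resistance the I^2*R loss grows 25x.
```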
[Plots: supply voltage Vdd (V) and peak supply current (A) versus approximate process generation (Leff ≈ Ldrawn ≈ M1 half-pitch, nm); mean values marked.]
Challenge: Systems and Architectures Tailored to Workloads
๏ Mapping application parallelism to hardware concurrency
– Appropriate form and amount of hardware concurrency depends on applications
๏ Hardware properties, e.g., microarchitectures
• Power7: 8 cores × 4-way SMT × 12-wide superscalar (8-issue, but 12 FUs)
• ARM Cortex-A8: 1 core × 1 thread × 2-wide superscalar
• Power A2: 1 core × 4-way SMT × 2-wide superscalar
๏ Granularities of parallelism in software
– Available data-, task-, thread- (TLP) and instruction-level parallelism (ILP)
V. Caparrós Cabezas and P. Stanley-Marbell, “Parallelism and Data Movement Characterization for Contemporary Application Classes”, 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), June 2011
[Figure: two-stage analysis framework. Stage 1: execution-driven dataflow analysis of the input program binary and its input dataset, producing an instruction control-flow-, register-, and memory-dependence trace. Stage 2: post-execution analysis, yielding (1) task-level partitioning to minimize value communication, (2) inter-task data movement, (3) inter-dependent basic blocks with potential data-level parallelism, (4) thread-level parallelism, (5) inter-thread data movement, (6) instruction-level parallelism, and (7) per-instruction data movement.]
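The ILP part of the post-execution analysis can be sketched in a few lines (a toy simplification of the idea, not the paper's actual tooling): with unit-latency instructions and unlimited functional units, best-case ILP is the instruction count divided by the longest dependence chain in the trace.

```python
# Toy sketch of dependence-trace ILP analysis (my simplification): best-case
# ILP = instruction count / critical-path length of the dependence DAG,
# assuming unit-latency instructions and unlimited functional units.

def best_case_ilp(trace):
    """trace: list of (dest, [sources]) operand names, in program order."""
    depth = {}           # operand name -> depth of the chain producing it
    critical_path = 0
    for dest, srcs in trace:
        d = 1 + max((depth.get(s, 0) for s in srcs), default=0)
        depth[dest] = d
        critical_path = max(critical_path, d)
    return len(trace) / critical_path

# Two independent two-instruction chains: 4 instructions, critical path 2.
trace = [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r3"])]  # ILP = 2.0
```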
Application Parallelism Characterization Framework
๏ Input
– Compiled program binaries and their associated input datasets
๏ Output
– Best-case instruction-level (ILP) and thread-level parallelism (TLP), and potential data-level parallelism
– Communication-minimizing MPMD code partitioning
– Data movement characterization
V. Caparrós Cabezas and P. Stanley-Marbell, “Parallelism and Data Movement Characterization for Contemporary Application Classes”, 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), June 2011
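The "communication-minimizing partitioning" output can be illustrated with a toy scoring function (my illustration, not the framework's algorithm): given dependence edges between instructions, a candidate task assignment is scored by the number of values that must move between tasks.

```python
# Toy sketch: each dependence edge (producer, consumer) whose endpoints land
# in different tasks moves one value between tasks; a communication-minimizing
# partitioner searches for the assignment minimizing this count.

def inter_task_values(edges, task_of):
    return sum(1 for producer, consumer in edges
               if task_of[producer] != task_of[consumer])

edges = [(0, 1), (1, 2), (0, 3), (3, 4), (2, 4)]
balanced_cut = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}  # 2 values cross tasks
poor_cut = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A"}      # 4 values cross tasks
```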
Characterization of Applications Spanning Berkeley “Dwarfs”
Dwarf 1: Dense Linear Algebra (DLA): JPEG (encode, decode)[b], kmeans[c], pca[c]
Dwarf 2: Sparse Linear Algebra (SLA): basicmath[b], bitcount[b], matrix_multiply[c]
Dwarf 3: Spectral Methods (SM): lame[b], FFT[b], IFFT[b]
Dwarf 4: N-Body Methods (NB): ammp[a]
Dwarf 5: Structured Grids (SG): JPEG (encode, decode)[b]
Dwarf 6: Unstructured Grids (UG): equake[a]
Dwarf 7: MapReduce (MR): mesa[a]
Dwarf 8: Combinational Logic (CL): rijndael (encode, decode)[b], sha[b]
Dwarf 9: Graph Traversal (GT): qsort[b], Dijkstra[b]
Dwarf 10: Dynamic Programming (DP): none
Dwarf 11: Backtrack / Branch+Bound (BT): vpr[a], mcf[a], art[a], susan[b]
Dwarf 12: Graphical Models (GM): none
Dwarf 13: Finite State Machine (FSM): gzip[a], gcc[a], parser[a], stringsearch[b]
[a] SPEC 2000, [b] MiBench, [c] Phoenix MapReduce Suite. (26 applications)
[Plots: best-case ILP (log scale, 1–1000) and TLP (1–10) for the 26 applications (DLA1–4, SLA1–3, SM1–3, SG1–2, UG1, MR1, CL1–3, GT1–2, BT1–4, FSM1–4) and their arithmetic mean, colored by dwarf (Dwarfs 1–9, 11, 13).]
[Graphic: a “Core-O-Meter” gauge from 0 to 11(?), with a “Wimpy” core at the low end and a “Brawny” core at the high end.]
How Low-Power? How Parallel?
๏ How low-power and restricted-performance should one go? [Barroso 2010]
– Single-thread performance still matters for multicore [Hill and Marty 2008]
– When cores are too simple, cost becomes dominated by packaging costs
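The Hill and Marty point can be made quantitative with the symmetric-multicore model from their cited article: a chip budget of n base-core equivalents (BCEs) is spent on n/r cores of r BCEs each, with single-core performance modeled as sqrt(r).

```python
import math

# Hill & Marty's symmetric-multicore Amdahl model (from the cited 2008 IEEE
# Computer article): n BCEs spent on n/r cores of r BCEs each, single-core
# performance sqrt(r), parallel fraction f.

def symmetric_speedup(f, n, r):
    perf = math.sqrt(r)
    return 1.0 / ((1.0 - f) / perf + f * r / (perf * n))

# Even at f = 0.975 and n = 256, modestly brawny cores (r = 4) beat all-wimpy
# cores (r = 1), because the serial fraction runs on a single core.
```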
[Barroso 2010] T. Mudge and U. Hölzle. Challenges and Opportunities for Extremely Energy-Efficient Processors. IEEE Micro, 30(4), 2010.
[Hill and Marty 2008] M. D. Hill and M. R. Marty. Amdahl's Law in the Multicore Era. IEEE Computer, 41(7), 2008.
Platforms
๏ Candidate processors for scale-out systems: Atom, PowerPC, ARM
– Three hardware reference designs; all processors implemented in 45 nm CMOS
– All three systems running Linux distributions based on kernel 2.6.32
Platform / General-Purpose CPU (Cores, Clock) / Cache Hierarchy:
• Atom™ D510MO: 2 x86-64 cores, 1.66 GHz; 32 K/24 K L1 I/D (per core) and 1 M L2
• Freescale™ P2020RDB: 2 Power Architecture® e500 cores, 1.0 GHz; 32 K/32 K L1 I/D (per core), 512 K shared L2
• TI DM3730 (Beagleboard-xM): 1 ARM® Cortex™-A8 core, 1.0 GHz; 32 K/32 K L1 I/D, 256 K L2
Common properties:
• Identical 4 GB flash disk
• Lab-grade power meter, 1 measurement per second, sub-mA resolution
• Power measured for whole platform
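The measurement reduction implied by this setup is straightforward; the sketch below (an assumption matching the setup described, not the authors' published code) turns once-per-second whole-platform power samples into energy-to-solution:

```python
# Sketch: with power sampled once per second, energy-to-solution is the
# discrete integral of the samples over the run.

def energy_joules(power_samples_w, dt_s=1.0):
    return sum(power_samples_w) * dt_s

# e.g., a platform averaging 3.4 W over a ~9100 s run consumes ~31 kJ.
```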
Application-Driven Whole-System Power Analysis
๏ Serial workload
– Essentially tied between ARM (30.8 kJ) and Atom (33.1 kJ): 7.8% difference
P. Stanley-Marbell and V. Caparrós Cabezas, “Performance, Power, and Thermal Analysis of Low-Power Processors for Scale-Out Systems”, IEEE High-Performance and Power-Aware Computing (HPPAC 2011).
[Power traces (serial workload): ARM ≈9100 s, Etotal ≈ 31 kJ; Atom ≈2030 s, Etotal ≈ 33 kJ; PPC ≈7900 s, Etotal ≈ 104 kJ.]
Application-Driven Whole-System Power Analysis
๏ “Throughput” workload
– All 16 applications launched simultaneously; on Atom and PPC, utilize 2 cores
– PPC: has limited FPU; performance limited by floating-point-intensive equake benchmark
๏ Highest-average-power system is the most energy-efficient
– Atom platform uses least energy (15.4 kJ), vs. ARM (30.8 kJ) and PPC (86.8 kJ)
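The slide's conclusion comes down to energy-to-solution being average power times time-to-solution; the average powers below are my back-of-envelope values derived from the reported energies and run times, not separately measured numbers:

```python
# Energy-to-solution = average power x time-to-solution, so the
# highest-average-power platform can still use the least energy.

def energy_kj(avg_power_w, time_s):
    return avg_power_w * time_s / 1000.0

atom_kj = energy_kj(18.8, 820)   # ~15.4 kJ, despite ~5.5x the ARM's power
arm_kj = energy_kj(3.4, 9100)    # ~30.9 kJ
```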
15
Etotal ≈ 15 kJ
≈820s
Atom
Atom
Etotal ≈ 15 kJ
≈820s
Interplay Between Power and Bandwidth: Pins
Data: from an analysis of ~150 HW publications in IEEE ISSCC and IEEE JSSC Journal, from 1980–2010
[Plots: total pin count versus year, 1980–2020, with fit Total Pin Count ≈ 10^(0.0476779·year − 92.6952); supply pin count versus peak aggregate supply current, fit ≈ 58.8915·current^0.53786; total pin count versus peak aggregate supply current, fit ≈ 215.151·current^0.364102; mean values marked.]
๏ Bandwidth and power are both limiters; both require more pins
– Limited pin-count growth (doubling every ~6 years)
– Supply pin count growing more rapidly than total pin count
– Tradeoff between more supply pins for lower resistive losses, and restricted I/O bandwidth
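These bullets follow directly from the fitted trend curves shown in the plots; a short sketch using those fits makes the ~6-year doubling and the supply-pin squeeze explicit:

```python
import math

# Fits from the plots: total pins vs. year, Total = 10^(0.0476779*year -
# 92.6952); supply pins vs. peak supply current, 58.8915*I^0.53786; total pins
# vs. current, 215.151*I^0.364102.

def total_pins_by_year(year):
    return 10 ** (0.0476779 * year - 92.6952)

def supply_pins(current_a):
    return 58.8915 * current_a ** 0.53786

def total_pins(current_a):
    return 215.151 * current_a ** 0.364102

doubling_years = math.log10(2) / 0.0476779  # ~6.3 years per doubling

# Because the supply-pin exponent (0.538) exceeds the total-pin exponent
# (0.364), growing supply currents squeeze the pins left over for signals.
```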
[Plots: total pins and supply pins versus supply current, with the gap between the curves indicating pins available for signals; memory-interface pins required versus number of on-chip threads, compared against the projected 2020 total pin count, the projected 2020 supply pin count, and an ITRS 2:1 supply:signal pin count in 2020.]
Interplay Between Power and Bandwidth: Pins
๏ Projections
– From hardware trends of previous slide, and application memory-reference analyses
๏ Takeaway message
– In addition to constraints on available parallelism in applications, existing packaging techniques and their improvement trends may not support >1000 cores circa 2020
P. Stanley-Marbell, V. Caparrós Cabezas, and R. Luijten “Pinned to the Walls—Impact of Packaging on the Memory and Power Walls”, to appear, IEEE/ACM International Symposium on Low-Power Electronics and Design (ISLPED 2011).
Beyond Parallelism: Algorithms versus Compute Problems
๏ Old: “Binaries” capture algorithms
– The application specifies an algorithm for solving the problem, without being specific about what the compute problem is
– Compile → instruction sequence for a stored-program computer (essentially an efficient Turing Machine implementation) → Answer
๏ New: “Binaries” capture compute problems
– Problem specification (Sal), e.g., S1a = (x : U1[1] ≥ 1) : U1.
  (annotated with: the set's free variable, the free variable's type, the set's universe / type, and the set's predicate expression)
– Compile (via the Salc compiler) → Svm machine state representing the problem → Answer
[Figure: Svm machine state representing the problem, comprising U registers (universes, e.g., <1, 5, ..., 94>, <"en", ..., "bij">, {2.1, ..., -1.3}), S registers (sets, e.g., P1:U3, (P1 | P7):U1, true:U2), P registers (predicate expression trees), and a runtime symbol table with variable, scope, and value columns.]
P. Stanley-Marbell, “Sal/Svm—An Assembly Language and Virtual Machine for Computing with Non-Enumerated Sets”, ACM Virtual Machines and Intermediate Languages (VMIL 2010).
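A toy illustration of the "binaries capture compute problems" idea (mine, not Sal/Svm itself): the shipped artifact is a predicate over a universe, and the runtime, like Svm, is free to choose how to find the satisfying set.

```python
# Toy illustration (not Sal/Svm): the "binary" is a predicate over a universe;
# the runtime chooses the solution strategy.

def solve(universe, predicate):
    """A naive 'virtual machine': here it enumerates, but nothing in the
    problem specification forces this particular algorithm."""
    return {x for x in universe if predicate(x)}

# Rough analogue of the slide's S1a = (x : U1[1] >= 1) : U1, over a small
# finite universe of pairs (real Sal universes need not be enumerated):
U1 = [(a, b) for a in range(3) for b in range(3)]
S1a = solve(U1, lambda x: x[1] >= 1)   # pairs whose second component is >= 1
```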
Summary
๏ Context
– Continued performance increases desirable for, e.g., future SKA backend
๏ Technology trends
– Performance growth no longer through clock-speed increases; power-limited
๏ Parallelism as a performance vehicle
– Presented framework for characterizing available parallelism (ILP, TLP) in applications
๏ Parallelism, performance, and low-power tradeoffs
– Characterization of not just average power, but energy-to-solution for whole systems
๏ Potential limits to scaling via parallelism
– New approaches desirable for improved bandwidth to support parallelism
– Alternatives to single-algorithm scaling: algorithmic choice
Backup Slides