L 2 ILP, DLP AND TLP IN MODERN...
Transcript of L 2 ILP, DLP AND TLP IN MODERN...
![Page 1: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/1.jpg)
6.888 PARALLEL AND HETEROGENEOUS COMPUTER ARCHITECTURE
SPRING 2013
LECTURE 2
ILP, DLP AND TLP IN MODERN MULTICORES
DANIEL SANCHEZ AND JOEL EMER
![Page 2: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/2.jpg)
Review: ILP Challenges
Clock frequency: getting close to pipelining limits
Clocking overheads, CPI degradation
Branch prediction & memory latency limit the practical benefits of out-of-order execution
Power grows superlinearly with higher clock & more OOO logic
Design complexity grows exponentially with issue width
Limited ILP Must exploit TLP and DLP
Thead-Level Parallelism: Multithreading and multicore
Data-Level Parallelism: SIMD
2
6.888 Spring 2013 - Sanchez and Emer - L02
![Page 3: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/3.jpg)
Review: Memory Hierarchy 3
Caching: Reduce latency, energy, BW of memory accesses
Why multilevel?
Why not just on-chip memories?
How does parallelism impact latency/BW constraints?
Prefetching: Trade-off latency for bandwidth, energy, capacity (pollution)
6.888 Spring 2013 - Sanchez and Emer - L02
L3
6MB
Core
Main Memory
16GB
L1I
64KB
L1D
64KB
L2
256KB
Core
L1I
64KB
L1D
64KB
L2
256KB
Core
L1I
64KB
L1D
64KB
L2
256KB
Core
L1I
64KB
L1D
64KB
L2
256KB
4 cycles, 72GB/s
8 cycles, 48GB/s
26-31 cycles, 96GB/s
(24GB/s/core)
150 cycles, 21GB/s
![Page 4: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/4.jpg)
Flynn’s Taxonomy 4
Single instruction Multiple instruction
Single data SISD MISD (?)
Multiple data SIMD MIMD
6.888 Spring 2013 - Sanchez and Emer - L02
![Page 5: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/5.jpg)
SIMD Processing 5
Same instruction sequence applies to multiple elements
Vector processing Amortize instruction costs (fetch, decode,
…) across multiple operations
Requires regular data parallelism (no or minimal divergence)
Exploiting SIMD:
Explicit & low-level, using vector intrinsics
Explicit & high-level, convey parallel semantics (e.g., foreach)
Implicitly: Parallelizing compiler infers loop dependencies
How easy is this in C++? Java?
6.888 Spring 2013 - Sanchez and Emer - L02
![Page 6: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/6.jpg)
SIMD Implementations 6
Modern CPUs: SIMD extensions & wider regs
SSE: 128-bit operands (4x32-bit or 2x64-bit)
AVX (2011): 256-bit operands (8x32-bit or 4x64-bit)
LRB (upcoming): 512-bit operands
Explicit SIMD: Parallelization performed at compile time
GPUs: Architected for SIMD from the ground up
32 to 64 32-bit floats
Implicit SIMD: Scalar binary, multiple instances always run in
lockstep
How to handle divergence?
6.888 Spring 2013 - Sanchez and Emer - L02
![Page 7: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/7.jpg)
Multithreading: Options 7
Motivation: Hardware underutilized on stalls TLP to increase utilization
CGMT, SMT typically increase throughput with moderate cost, maintain single-thread performance
FGMT typically trades throughput and simplicity at the expense of single-thread performance
6.888 Spring 2013 - Sanchez and Emer - L02
![Page 8: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/8.jpg)
Example 1: SMT (Nehalem) 8
SMT design choices: For each component,
Replicate, partition statically, or share
Tradeoffs? Complexity, utilization, interference & fairness
Example: Intel Nehalem
4-wide superscalar, 2-way SMT
Replicated: Register file, RAS predictor, large-page ITLB
Partitioned: Load buffer, store buffer, ROB, small-page ITLB
Shared: Instruction window, execution units, predictors, caches, DTLBs
SMT policies:
Fetch policies: Utilization vs fairness
Long-latency stall tolerance: Flushing vs stalling
6.888 Spring 2013 - Sanchez and Emer - L02
[See: “Exploiting Choice: Instruction
Fetch and Issue on an Implementable
Simultaneous Multithreading
Processor”, Tullsen et al, ISCA 96]
![Page 9: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/9.jpg)
Example 2: FGMT (Niagara) 9
4 threads/core, round-robin scheduling
No branch prediction, minimal bypasses more stalls
Small L1 caches (can tolerate higher L1 miss rates)
But L2 is still large… performance with long-latency stalls?
6.888 Spring 2013 - Sanchez and Emer - L02
![Page 10: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/10.jpg)
Example 3: Extreme FGMT (Tera MTA) 10
Use FGMT to hide all instruction latencies
Worst case instruction latency is 128 cycles 128 threads
Benefits: no interlocks, no bypass, and no cache
Problem: single-thread performance
GPUs also exploit high FGMT for latency tolerance (e.g., Fermi, 48-way MT)
Throughput-oriented functional units: Longer latency, deeply pipelined
Throughput-oriented memory system: Small caches, aggressive memory scheduler
6.888 Spring 2013 - Sanchez and Emer - L02
InstT1
InstT3
A B C F G H
A B C D E E G H
D E
A B E F G H C D
A B E F G H C D
InstT2
InstT4
A B E F G H C D
A B E F G H C D
A B E F G H C D
A B E F G H C D
A B E F G H C D
InstT5
InstT7
InstT6
InstT8
InstT1
![Page 11: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/11.jpg)
Multithreading: How Many Threads? 11
With more HW threads:
Larger/multiple register files
Replicated & partitioned resources Lower utilization, lower single-thread performance
Shared resources Utilization vs interference and thrashing
Impact of MT/MC on memory hierarchy?
6.888 Spring 2013 - Sanchez and Emer - L02
[“Many-Core vs. Many-Thread Machines: Stay
Away From the Valley”, Guz et al, CAL 09]
![Page 12: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/12.jpg)
Amdahl’s Law 12
Amdahl’s Law: If a change improves a fraction f of the
workload by a factor K, the total speedup is:
Not only valid for performance!
Energy, complexity, …
I/D/TLP techniques make different tradeoffs between K
and f
SIMD vs MIMD f and K?
6.888 Spring 2013 - Sanchez and Emer - L02
)1(/
1
Time
TimeSpeedup
after
before
fKf
![Page 13: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/13.jpg)
Amdahls’ Law in the Multicore Era
[Hill & Marty, CACM 08]
Should we focus on a single approach to extract parallelism?
At what point should we trade ILP for TLP?
Assume a resource-limited multi-core
N base core equivalent (BCEs) due to area or power constraints
A 1-BCE core leads to performance of 1
A R-BCE core leads to performance of perf(R)
Assuming perf(R) = sqrt(R) in following drawings (Pollack’s rule)
How should we design the multi-core?
Select type & number of cores
Assume caches & interconnect are rather constant
Assume no application scaling (or equal scaling for seq/par portions)
13
6.888 Spring 2013 - Sanchez and Emer - L02
![Page 14: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/14.jpg)
Three Multicore Approaches
Large Cores (R BCEs/core)
Simple Cores (1 BCE/core)
Number Performance Number Performance
Symmetric CMP N/R Seq: Perf(R) Par: N/R*Perf(R)
- -
Asymmetric CMP
1 Seq: Perf(R) Par: Perf(R)
N-R Seq: - Par: N-R
Dynamic CMP 1 Seq: Perf(R) Par: -
N Seq: - Par: N
16 1-BCE cores Symmetric:
4 4-BCE cores
Asymmetric:
1 4-BCE core
& 12 1-BCE cores
Dynamic:
Adapt between
16 1-BCEs and 1 16-BCE
14
6.888 Spring 2013 - Sanchez and Emer - L02
![Page 15: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/15.jpg)
Amdahl’s Law x3
Symmetric CMP
Asymmetric CMP
Dynamic CMP
Symmetric Speedup =
1
+ 1 - F
Perf(R)
F * R
Perf(R)*N
Asymmetric Speedup =
1
+ 1 - F
Perf(R)
F
Perf(R) + N - R
Dynamic Speedup =
1
+ 1 - F
Perf(R)
F
N
15
6.888 Spring 2013 - Sanchez and Emer - L02
![Page 16: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/16.jpg)
6.888 Spring 2013 - Sanchez and Emer - L01
Symmetric Multicore Chip
N = 256 BCEs
0
50
100
150
200
250
1 2 4 8 16 32 64 128 256
Sym
me
tric
Sp
ee
du
p
R BCEs/core
F=0.999
F=0.99
F=0.975
F=0.9
F=0.5 F=0.9
R=28 (vs. 2)
Cores=9 (vs. 8)
Speedup=26.7 (vs. 6.7)
CORE ENHANCEMENTS!
F1
R=1 (vs. 1)
Cores=256 (vs. 16)
Speedup=204 (vs. 16)
MORE CORES!
F=0.99
R=3 (vs. 1)
Cores=85 (vs. 16)
Speedup=80 (vs. 13.9)
CORE ENHANCEMENTS
& MORE CORES!
Results in parentheses N= 16
Higher N Higher R (more ILP) for fixed f
16
![Page 17: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/17.jpg)
6.888 Spring 2013 - Sanchez and Emer - L01
Asymmetric Multicore Chip
N = 256 BCEs
0
50
100
150
200
250
1 2 4 8 16 32 64 128 256
Asy
mm
etri
c Sp
ee
du
p
R BCEs
F=0.999
F=0.99
F=0.975
F=0.9
F=0.5 F=0.9
R=118 (vs. 28)
Cores= 139 (vs. 9)
Speedup=65.6
(vs. 26.7)
F=0.99
R=41 (vs. 3)
Cores=216 (vs. 85)
Speedup=166 (vs. 80)
Results in parentheses N= 16
Better speedups than symmetric
Software complexity?
17
![Page 18: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/18.jpg)
6.888 Spring 2013 - Sanchez and Emer - L01
Dynamic Multicore Chip
N = 256 BCEs
Results in parenthesis refer to N=16
Dynamic offers even higher speedups than asymmetric
SW and HW complexity?
0
50
100
150
200
250
1 2 4 8 16 32 64 128 256
Dyn
amic
Sp
ee
du
p
R BCEs
F=0.999
F=0.99
F=0.975
F=0.9
F=0.5
F=0.99
R=256 (vs. 41)
Cores=256 (vs. 216)
Speedup=223 (vs. 166)
Note:
#Cores
always
N=256
18
![Page 19: L 2 ILP, DLP AND TLP IN MODERN MULTICOREScourses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf · LECTURE 2 ILP, DLP AND TLP IN MODERN MULTICORES DANIEL SANCHEZ AND JOEL](https://reader033.fdocuments.in/reader033/viewer/2022060605/605a55de9837fc53836b2ce2/html5/thumbnails/19.jpg)
Readings for Wednesday
1. Is Dark Silicon Useful?
2. Dark Silicon and the End of Multicore Scaling
3. Single-Chip Heterogeneous Computing
19
6.888 Spring 2013 - Sanchez and Emer - L02