8/2/2019 Multicore Architecture CTrinitis
1/79
Technische Universität München
Multicore Architectures
CLGrid5 Workshop
Valparaiso, Chile, September 29th, 2008
LRR-TUM, September 9th, 2008
Carsten Trinitis
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Institut für Informatik, Technische Universität München
How did it all evolve?
  Mechanical devices
  Abacus, 3000 BC (?)
  1642: addition and subtraction, Blaise Pascal
  1822: Charles Babbage
Electromechanical Machines
  Based on relays
  Konrad Zuse (1910-1995)
Zuse Z3 & Z4
  Z1 (1938), Z3 (1941): first freely programmable machines in the world
  Z3 and its successor Z4 can be seen at the Deutsches Museum!
Electronic Computers
  First generation
    No mechanical components any more
    Vacuum tubes
  Basic principle: the triode
    Current flow within a diode is controlled by a grid ("fence")
    On / Off
  1946: ENIAC (Electronic Numerical Integrator And Computer)
ENIAC (1946)
Organization
  Question: How to structure / organize computational machines? How to control and steer execution?
  Original work (1946): Burks, Goldstine, von Neumann, "Preliminary discussion of the logical design of an electronic computing instrument"
  Result: von Neumann architecture
  Still the dominant architecture today!
The IAS machine
  Developed in 1952 by John von Neumann
  First machine based on his design principle
  Institute for Advanced Study (IAS) computer
Technology Development
  Vacuum tubes replaced by transistors
    Smaller, more power efficient
    DEC PDP-1, IBM 7094
    Still large machines
  Next step: integrated circuits
    Many transistors packed on one die
    High density & reliability, low power
    IBM 360 family & first Intel chips
  Many subsequent improvements
1971: 1st Microprocessor, Intel 4004
  ~2300 transistors, 108 kHz, 10000 nm
Intel 4004, First Microprocessor
Pentium 4 (55 Million Transistors)
Intel Montecito
  1.7 billion transistors, Intel's 1st dual core, 90 nm
Dual Core 2 (Woodcrest)
  290 million transistors, 2.4-3 GHz, 65 nm
Core i7 (Nehalem)
  731 million transistors, 45 nm
AMD Shanghai
  705 million transistors, 45 nm
Intel Larrabee
  ... transistors, 45 nm
And the Future ... ?
  Evolution
    Dual core: symmetric multithreading
    Multi-core array: CMP with ~10 cores
    Many-core array: CMP with 10s-100s of low-power cores
      Scalar cores
      Capable of TFLOPS+
      Full system-on-chip
      Servers, workstations, embedded
  Large scalar cores for high single-thread performance
  Scalar plus many-core for highly threaded workloads
From Single- to Multi-Core
  Netburst: >30 pipeline stages
  No longer feasible...
  2005: Move to dual core (and fewer pipeline stages)
  2, 4, 6, 8, ... cores
  But: The free lunch is over!
  The good news: this is good for parallel programmers.
Impact
  What does multi-core mean in particular?
  Is it just an SMP system, i.e. programmable with OpenMP, Pthreads, etc.?
  Or does it differ from SMP systems?
  How do multi-core systems fit into clusters?
Just an SMP system?
  Partly, but those issues will be covered by my colleagues...
Is Multi-Core different?
  Yes, with regard to memory hierarchies and interconnect!
The Memory Wall
  Processor speed is increasing much faster than memory speed
    Microprocessors: 50-100% per year (Moore's law)
    DRAMs: 7-15% per year
  The gap is widening
  [Figure: performance over time, CPU performance vs. memory access]
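To get a feel for how quickly the gap widens, the yearly growth rates quoted above can be compounded. The following is a minimal sketch and not part of the original slides; the function name `relative_gap` and the ten-year horizon are illustrative assumptions.

```python
def relative_gap(cpu_rate, dram_rate, years):
    """Ratio of CPU to DRAM performance after `years`, both starting at 1.0.

    cpu_rate and dram_rate are yearly improvements as fractions (0.5 = 50%).
    """
    return ((1.0 + cpu_rate) / (1.0 + dram_rate)) ** years


if __name__ == "__main__":
    # Conservative pairing from the slide: slowest CPU growth, fastest DRAM growth.
    print(relative_gap(0.50, 0.15, 10))
    # Aggressive pairing: fastest CPU growth, slowest DRAM growth.
    print(relative_gap(1.00, 0.07, 10))
```

Even with the conservative pairing, the ratio exceeds a factor of ten after ten years under these assumed constant rates.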
Caches
  Main memory: problems with bandwidth & latency
    Memory bus located off-chip / on board
    Physical boundaries
    Result: memory is too far away
  Cache: memory closer to the CPU which holds a subset of the main memory
    + Lower latency, higher bandwidth, on-chip
    - Which subset should be present?
    - Can we manage this transparently?
Terminology
  Accesses to memory can be a
    Cache hit: data is in the cache
    Cache miss: data has to be retrieved from memory
    Cache misses are expensive!
  Cache size: total size of the cache
  Cache line size/length:
    Caches do not store individual bytes/words (management overhead too high)
    Unit of storage: cache lines
    Consecutive number of bytes in memory
Terminology
  Replacement policy: which cache line to evict if new space is needed?
    Optimal: evict data not used in the near future
    Make a prediction from the past
    Often used: Least Recently Used (LRU)
  How are writes treated?
    Write-back caching: writes are stored in the cache; data is written back to memory when the line is evicted
    Write-through caching: data is written directly to main memory
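The interplay of LRU replacement and write-back caching can be sketched in a few lines. This is an illustrative toy model, not taken from the slides: the class name `LRUCache`, its counters, and the access sequence are assumptions, and the cache is treated as fully associative for simplicity.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with LRU replacement and write-back."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # block number -> dirty flag, oldest first
        self.hits = 0
        self.misses = 0
        self.writebacks = 0

    def access(self, block, write=False):
        if block in self.lines:
            self.hits += 1
            self.lines.move_to_end(block)  # mark as most recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.num_lines:
                _, dirty = self.lines.popitem(last=False)  # evict the LRU line
                if dirty:
                    self.writebacks += 1  # write-back: memory updated on eviction
            self.lines[block] = False
        if write:
            self.lines[block] = True  # line now differs from main memory


if __name__ == "__main__":
    cache = LRUCache(num_lines=2)
    for block, write in [(0, True), (1, False), (0, False), (2, False), (1, False)]:
        cache.access(block, write)
    print(cache.hits, cache.misses, cache.writebacks)
```

A write-through variant would instead count one memory write per `write=True` access and never need the dirty flag.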
Cache Associativity
  Caches are a collection of cache lines (CLs)
    Equally sized & much smaller than memory
  Question of mapping between CLs and memory
    Where to look for a cache hit?
    Where to put a newly loaded cache line?
  Free mapping is very costly
    Difficult lookup function for cache accesses
  Target CLs for a particular access are restricted
    Only a certain number of CLs possible
    This number is the associativity of the cache
Cache Structures
  Where is a block stored in the cache?
  Main memory: blocks $B_j$, $j = 0, 1, \dots, n-1$
  Cache: blocks (cache lines) $Z_i$, $i = 0, 1, \dots, m-1$
  $n \gg m$, $n = 2^s$, $m = 2^r$
  Each block contains $b$ words, with $b = 2^w$
  Main memory capacity: $n \cdot b = 2^{s+w}$ words
  Cache capacity: $m \cdot b = 2^{r+w}$ words
  Mapping from $\{B_j\}$ to $\{Z_i\}$
  [Figure: main memory blocks 0-31 and cache lines 0-7]
Cache Structures
  Direct-Mapped Cache
  Direct mapping of $n/m = 2^{s-r}$ memory blocks onto one cache line
  Mapping: $B_j \to Z_i$, where $i = j \bmod m$
  [Figure: main memory blocks B0-B15 mapped onto cache lines Z0-Z3, each cache line with a tag]
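The direct mapping rule i = j mod m can be tried out directly. The sketch below uses 16 memory blocks and 4 cache lines to match the block and line labels of the figure; the function names are illustrative and not from the slides.

```python
def direct_mapped_line(block, num_lines):
    """Cache line index i for memory block j under i = j mod m."""
    return block % num_lines


def blocks_per_line(num_blocks, num_lines):
    """Group memory blocks by the cache line they compete for."""
    groups = {line: [] for line in range(num_lines)}
    for block in range(num_blocks):
        groups[direct_mapped_line(block, num_lines)].append(block)
    return groups


if __name__ == "__main__":
    # n = 16 memory blocks, m = 4 cache lines (B0-B15, Z0-Z3).
    for line, blocks in blocks_per_line(16, 4).items():
        print(f"Z{line}: {blocks}")
```

Each cache line is shared by n/m = 4 memory blocks, which is why the tag is needed to tell them apart.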
Cache Structures
  Direct-Mapped Cache
  Low hardware complexity.
  Fixed mapping from block to line yields a fixed replacement strategy.
Cache Structures
  Fully Associative Cache
  Any block in main memory can be mapped to any cache line (flexibility).
  The replacement strategy determines which line is overwritten when loading the cache (e.g. Least Recently Used).
  High hardware complexity.
Cache Structures
  Set Associative Cache
  Compromise between direct-mapped and fully associative cache.
  k-way set associative cache:
    k lines form one set.
    The m cache lines are divided into v = m/k sets with k lines each.
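A small sketch can make the set structure concrete. The slides do not state how a block is assigned to a set; the code assumes the common choice "set = j mod v" and a hypothetical layout in which the k lines of a set are numbered consecutively. Function names are illustrative.

```python
def set_index(block, num_lines, ways):
    """Set that memory block j maps to, assuming set = j mod v with v = m / k."""
    num_sets = num_lines // ways
    return block % num_sets


def candidate_lines(block, num_lines, ways):
    """Cache lines the block may occupy (assumed layout: sets are contiguous)."""
    first = set_index(block, num_lines, ways) * ways
    return list(range(first, first + ways))


if __name__ == "__main__":
    m = 8  # cache lines
    for k in (1, 2, 8):  # direct mapped, 2-way, fully associative
        print(f"k={k}: block 13 may go to lines {candidate_lines(13, m, k)}")
```

With k = 1 the scheme degenerates to the direct-mapped cache, and with k = m there is a single set, i.e. the fully associative cache.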
Programmability
  Caches have no impact (from a logical point of view)
    Designed to be transparent
  BUT: large performance impact
    Need to use caches efficiently
    E.g. try to reuse data in caches
  HPC applications need to be tailored to caches
    Adapt to cache sizes, cache line sizes, and hierarchies
    Good understanding of architecture required
    Significant performance gains possible!
Example
  Parameters (taken from a typical L1 cache)
    32 KB size
    Cache line size: 32 bytes
    Cache has 1024 cache lines
  Address format: | rest | 10 bit CL select | 5 bit CL offset |
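The address format can be illustrated by splitting a concrete address into its three fields. This is a minimal sketch for the parameters above (32-byte lines, 1024 lines); the function name and the sample address are arbitrary choices for illustration.

```python
OFFSET_BITS = 5   # 2**5 = 32 bytes per cache line
INDEX_BITS = 10   # 2**10 = 1024 cache lines


def split_address(addr):
    """Return (rest, cache line select, offset within the line) for an address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    rest = addr >> (OFFSET_BITS + INDEX_BITS)
    return rest, index, offset


if __name__ == "__main__":
    print(split_address(0x12345))
    # Two addresses exactly 32 KB apart select the same cache line.
    print(split_address(0x12345 + 32 * 1024))
```

Addresses that differ only in the "rest" field compete for the same line in a direct-mapped cache, so the rest field is what has to be stored as the tag.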
Example (cont.)
  Full associativity (m-way associativity)
    New cache line can be stored in any of the 1024 CLs
    Selection e.g. by Least Recently Used (LRU)
  Direct mapped (1-way associativity)
    New cache line can only be placed in one defined line
  2-way associativity
    Only use 9 bits for CL selection, i.e. for 512 sets
    Selection within a set can again be done e.g. using LRU
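The number of selection bits for each variant follows from the cache size, the line size, and the associativity. The helper below is an illustrative sketch (its name and return format are assumptions) and expects power-of-two sizes.

```python
def cache_geometry(cache_bytes, line_bytes, ways):
    """Return (number of lines, number of sets, selection bits) for a k-way cache."""
    num_lines = cache_bytes // line_bytes
    num_sets = num_lines // ways
    index_bits = num_sets.bit_length() - 1  # log2, valid for powers of two
    return num_lines, num_sets, index_bits


if __name__ == "__main__":
    for ways in (1, 2, 1024):  # direct mapped, 2-way, fully associative
        print(ways, cache_geometry(32 * 1024, 32, ways))
```

For the 32 KB example this reproduces the 10 selection bits of the direct-mapped case and the 9 bits (512 sets) of the 2-way case; the fully associative cache needs no selection bits at all.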
Cache Hierarchies
  Caches are layered
    Several levels of caches
    Each level works independently
    Transparency still maintained
    Currently up to 3 levels
  Higher levels
    Slower, but larger
  [Figure: CPU, L1 cache, L2 cache, L3 cache, main memory]
Instruction vs. Data Caches
  L1 caches are often split
    I-cache for instructions
    D-cache for data
  Reduces conflicts
    Significantly different access patterns
  Allows additional optimizations
    Processor layout (CPU design)
    Make use of the special access patterns
  Example: trace caches for the I-cache
    Store longer instruction sequences/traces
Cache Optimization
  Why does cache architecture have an impact on performance?
  Data should be reused as much as possible!
  Locality of reference:
    Temporal locality: recently accessed data is likely to be accessed again in the near future.
    Spatial locality: data located closely together is likely to be accessed closely together in time.
Cache Optimization
  How can this be optimized?
  Code transformations: change the order of loop iteration executions.
  Must not change numerical results!
  Maintain data dependencies!
Cache Optimization
  Loop interchange:
  [Figure: two loop orders, one with stride = 8 and one with stride = 1]
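The code from the loop interchange figure is not recoverable, so the following is only a sketch of the idea under stated assumptions: an 8 x 8 matrix stored row by row in a flat list, so that walking down a column gives stride 8 and walking along a row gives stride 1. All names are illustrative. Integer data is used so that the reordered summation gives exactly the same result.

```python
N = 8  # the matrix is N x N, stored row by row in one flat list


def sum_column_first(a):
    """Inner loop walks down a column: consecutive accesses are N elements apart."""
    total, trace = 0, []
    for j in range(N):
        for i in range(N):
            trace.append(i * N + j)
            total += a[i * N + j]
    return total, trace


def sum_row_first(a):
    """Loops interchanged: the inner loop walks along a row, stride 1."""
    total, trace = 0, []
    for i in range(N):
        for j in range(N):
            trace.append(i * N + j)
            total += a[i * N + j]
    return total, trace


def inner_stride(trace):
    """Distance between the first two accesses of the inner loop."""
    return trace[1] - trace[0]


if __name__ == "__main__":
    data = list(range(N * N))
    for f in (sum_column_first, sum_row_first):
        total, trace = f(data)
        print(f.__name__, "sum =", total, "stride =", inner_stride(trace))
```

With stride 1, consecutive accesses fall into the same cache line, which is the spatial locality the interchange is meant to exploit. Python itself will not show the speedup; the access pattern is what carries over to compiled code.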
Other Techniques
  Prefetching
    Try to preload data that will potentially be used
    Pro: data can be requested ahead of time
    Con: may waste bandwidth on loads that are never used
  Controlled by hardware
    Speculative loads
  Controlled by programmer / compiler
    Insert prefetching statements into the code
    Traditionally disturbed the pipeline!
    Can be used with multi-core processors with shared cache!
Early CMPs
  Intel Montecito
  Intel Pentium-D
  AMD Dual Core Opteron
  IBM Cell
Intel Montecito
Intel Pentium-D
Early CMPs
  IBM / Sony / Toshiba Cell Processor:
    1 Power Processor Element (PPE)
    8 Synergistic Processing Elements (SPE)
    Element Interconnect Bus (EIB), 384 GB/s
    25.6 GB/s memory bandwidth
    50-80 Watts power consumption
Cell Processor
SUN UltraSparc T1
  Eight cores, connected via crossbar (134 GB/s)
  Each core can process 4 threads
  25.6 GB/s memory bandwidth
  70 Watts power consumption => roughly 2 Watts/thread
Trends through Multi-Core
  Computers move into the chip!
  New memory hierarchies ==> caches!
  New interconnect topologies.
  Three levels of parallelism: on-chip, on-board, cluster
Contemporary Multicore Chips
  Intel Clovertown/Penryn:
    4 cores
    Split L1 cache
    Partly shared L2 cache!
    FSB
Contemporary Multicore Chips
  AMD Barcelona:
    4 cores
    Split L1/L2 cache
    Shared L3 cache!
    On-chip crossbar
Contemporary Multicore Chips
  SUN Niagara 2
    8 cores
    4 threads / core
    32 threads
    On-chip crossbar
  IBM Power 5 / Power 6
Upcoming Archs: Dunnington
Core i7 (Nehalem)
  731 million transistors, 45 nm
Nehalem: Intel's Next Generation
AMD Shanghai
  705 million transistors, 45 nm
Larrabee: Intel's Many-Core Architecture
  Plenty of x86 in-order cores plus standard 64-bit extensions
  16-wide SIMD unit per core
  Fully coherent L1 (32 KB) / L2 (256 KB) caches
  Bidirectional ring bus
  Short in-order pipeline
  4-way SMT
Larrabee vs. Core
Larrabee: Intel's Many-Core Architecture
  Shared memory programming model:
    Pthreads
    OpenMP
  Promises to be standards-conformant
  C / FORTRAN compiler
  Key advantage: x86 binary compatibility!
autopin: A Tool for Automatic Optimization of Pinning Processes in Multicore Architectures
  ... can lead to non-deterministic runtimes ...
The autopin Approach
  User-level tool
  Start multi-threaded application under autopin control
  User can specify pinnings of interest
  Pin threads to cores
  Assess performance of chosen pinning using performance counters
  Try alternative pinnings until optimal pinning is found
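The search described above can be sketched as a simple loop over candidate pinnings. This is not autopin's actual implementation: `find_best_pinning` and the `measure` callback are hypothetical, and the scores in the demo are made up rather than read from performance counters. The pinning strings are the ones that appear in the strategy slide of this deck; their exact encoding is not explained there.

```python
def find_best_pinning(pinnings, measure):
    """Try each candidate pinning and return the best one with its score.

    `measure(pinning)` is a hypothetical callback that would pin the threads,
    let the application run for a while, and return a performance figure
    derived from performance counters (higher is better).
    """
    best, best_score = None, float("-inf")
    for pinning in pinnings:
        score = measure(pinning)
        if score > best_score:
            best, best_score = pinning, score
    return best, best_score


if __name__ == "__main__":
    # Made-up scores standing in for real counter readings.
    fake_scores = {"1984": 1.0, "182B": 1.7, "58BE": 1.3}
    print(find_best_pinning(list(fake_scores), fake_scores.get))
```

Passing the measurement in as a callback keeps the search logic independent of how threads are actually pinned and how counters are read on a given system.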
Performance Counters
  Multiple event sensors
    ALU utilization
    Branch prediction
    Cache events (L1/L2/TLB)
    Bus utilization
  Two uses:
    Read: get precise count of events in code regions => counting
    Interrupt on overflow => statistical sampling
  Well-known tools:
    Oprofile
    Perfctr
    Intel VTune
    Perfmon2
autopin Strategy
numOfPinnings = 3; pinning = {"1984", "182B", "58BE"};
for (i=0; i
Barcelona
Results
  [Figure: results for the benchmarks 310.wupwise, 312.swim, 314.mgrid, 316.applu, 320.equake, 324.apsi, 328.fma3d, 330.art, 332.ammp on Barcelona, Clovertown, and Caneland]
Conclusions and Outlook 2
  Pinning is essential!
  We will see more and more GPU features in main processors!
Gracias!
Thank you!