PROCESSOR EVOLUTION
NOV. 28, 2016
Accelerated Computing For Fusion | Guillaume Colin de Verdière
Plan
• General context: the Power Wall and the Memory Wall
• Examples of technologies
• Conclusion
Scope of the talk
Technological background
[Diagram: a supercomputer (Curie) is a cluster built from nodes connected by an interconnection network to storage. A node (motherboard) holds CPUs with their RAM, a NIC and IO. A processor (CPU) contains several cores, a memory controller, an L3 cache and per-core L2 caches. A core contains an ALU, a SIMD unit, an L1 cache and hardware threads 1–4.]
The first challenge: the power wall
• Exaflops = 10^18 (1 000 000 000 000 000 000) floating-point operations per second; EXA1 will be a cluster
• Running costs will be driven by electricity: 20 MW is the maximum agreed by the HPC community
• Most of the power is used by the processor: ~50 % processors, ~30 % memories, ~10 % interconnection, ~10 % disks
Credit K.J. Roche, Jul 2011, U. of Michigan
Power wall: consequences
• Limit the frequencies and voltage: dynamic power P ≈ C · V² · F (a worked sketch follows below)
  KNL & GPU: ~1.4 GHz
• Add an increasing number of cores
  GPU: > 2000 cores; CPU: > 20 cores
• Use more powerful computing units: increasing role of SIMD (data parallelism)
• Work on processor architectures
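To make the trade-off concrete, here is a minimal sketch in C based on the dynamic-power relation above; the capacitance, voltage and frequency values are illustrative assumptions, not figures from the talk.

#include <stdio.h>

/* Dynamic power model from the slide: P = C * V^2 * F (per core).
 * Hypothetical comparison: one fast core vs. two slower, lower-voltage
 * cores offering the same aggregate clock rate. */
int main(void)
{
    const double C = 1.0e-9;                       /* switched capacitance (arbitrary) */
    double p_fast = C * 1.2 * 1.2 * 3.0e9;         /* 1 core  @ 3.0 GHz, 1.2 V */
    double p_slow = 2.0 * C * 0.9 * 0.9 * 1.5e9;   /* 2 cores @ 1.5 GHz, 0.9 V */
    printf("1 core  @ 3.0 GHz, 1.2 V: %.2f W\n", p_fast);
    printf("2 cores @ 1.5 GHz, 0.9 V: %.2f W\n", p_slow);
    return 0;
}

With these made-up numbers the two slower cores deliver the same total clock throughput for about 44 % less dynamic power, which is exactly the pressure pushing KNL and GPUs toward ~1.4 GHz and many cores.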
Processor Architecture
Intel desktop processors
µproc         Year  Freq.     Pipeline stages  Issue  O-o-O  Cores/chip  Power   Process
486           1989  25 MHz    5                1      No     1           5 W     1000 nm
Pentium       1993  66 MHz    5                2      No     1           10 W    350 nm
Pentium Pro   1997  200 MHz   10               3      Yes    1           29 W    350 nm
Pentium 4 W.  2001  2000 MHz  22               3      Yes    1           75 W    180 nm
Pentium 4 P   2004  3600 MHz  31               3      Yes    1           103 W   130 nm
Core          2006  2930 MHz  14               4      Yes    2           75 W    65 nm
Core i5 NHM   2010  3300 MHz  14               4      Yes    4           87 W    32 nm
Core i5 IVB   2012  3400 MHz  14               4      Yes    8           77 W    22 nm
Vectorization: a link to the past?
• CRAY: pipeline architecture, "vertical" parallelism, no caches
  → vectors should be as long as possible
• SIMD architecture: "horizontal" parallelism, beware of the caches
  → vector size must be compatible with the LLC*
[Diagram: vectorized C = A + B. Cache lines (line 1A … line 128A; line 1D … line 128D) move between DDR and the cache; elements a0–a7 and b0–b7 are loaded with vmovupd, added with vaddpd, and the results c0–c7 are written back with vmovupd. A code sketch follows below.]
* LLC = Last Level Cache
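A minimal C sketch of the operation in the diagram: a loop that an auto-vectorizing compiler can map onto vmovupd/vaddpd when AVX is enabled (the restrict qualifiers and the compiler flags are assumptions about how one would typically write and build it, not part of the talk).

#include <stddef.h>

/* C = A + B over n doubles; with e.g. gcc -O3 -mavx this plain loop
 * is turned into vmovupd loads, vaddpd, and vmovupd stores. */
void vec_add(const double *restrict a, const double *restrict b,
             double *restrict c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

The CRAY vs. SIMD remark above translates here into keeping n large enough to amortize the loop overhead, while sizing the traversal so its working set stays compatible with the last-level cache.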
ARM SVE (announced in Aug. 2016)
• The Neon SIMD engine is not well adapted to HPC
• How to help the compiler do better? Fujitsu will introduce SVE in its future processor
The second challenge: the memory wall
• Computing is “cheaper” than moving data
• Memory hierarchies have associated costs (see the pointer-chasing sketch below)
  - L1 cache: small size (kB), small latency (~1 cy), high bandwidth
  - L2 cache: medium size (kB), medium latency (~14 cy), high bandwidth
  - L3 cache: larger size (MB), high latency (~40 cy)
  - MCDRAM (KNL): very high latency (~140 cy), focus on bandwidth
  - DDR4: very high latency (~140 cy), focus on storage size
  - NIC
Credits: Bill Dally, NVIDIA, SC10; K.J. Roche, Jul 2011, U. of Michigan
(cy = CPU cycle; at 2 GHz, 1 cy = 0.5 ns)
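A rough way to see these latencies in practice is a pointer-chasing probe, where each load depends on the previous one, so the average time per iteration approximates the load-to-use latency of whatever level holds the working set. This is a minimal sketch; the buffer size, stride and iteration count are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t n = 1 << 20;          /* 8 MiB of size_t on 64-bit: mostly L3/DDR */
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 1;

    /* Build one long cycle with a large odd stride to defeat the prefetcher. */
    const size_t stride = 4099;        /* co-prime with n */
    size_t idx = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t nxt = (idx + stride) % n;
        next[idx] = nxt;
        idx = nxt;
    }

    const size_t iters = 100 * 1000 * 1000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < iters; ++i)
        p = next[p];                   /* serialized, dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg %.2f ns per load (last index %zu)\n", ns / iters, p);
    free(next);
    return 0;
}

Shrinking the buffer until it fits in L2 or L1, or placing it in MCDRAM on KNL, changes the measured time per load accordingly.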
Moving data around is more and more expensive
• Put the functional units closer to each other
  Introduction of stacked memory (KNL & GPUs)
  Put the GPU closer to the CPUs: SoCs (system-on-chip)
  3D technologies
• Chiplets
[Images: HMC by Micron; APU by AMD]
[Diagram: packaging options labeled Chiplet, Interposer, 3D-IC, Multi-Chip-Module, Memory device, FPGA bare die.]
Difficulties of microelectronics
• Mastering finer processes is more and more difficult: Intel moves away from the tick-tock model
  More time between two nodes
  Nodes to come: 10 – 7 – 5 nm
• Transistor technology must evolve: FinFET has reached its limit
• Costs are exploding: ~2x per technology node
  Updating a "fab" costs > 4 G€
  Mergers will continue
  Emerging importance of "fabless" companies
• Active research into new solutions
  Materials: graphene, …
  Processor architecture: e.g. ARM big.LITTLE
Evolution of Nvidia GPUs
• More and more powerful: Pascal ~12 Tflops SP, 5 Tflops DP
  12/32 GB HBM2, 1 TB/s internally
  300 W
• Exchanges between GPUs are faster: NVLink 1: 80 GB/s peak, 64 GB/s sustained
• Usage of CPU memory is getting better: through PCI-Express for the time being
  Will be coherent with NVLink 2
• Programmability (portability) is getting better: OpenMP 4.x available soon
Is a next turn coming?
• The last big turn was at the end of the 90s: move away from vector machines
  Usage of commodity processors
  Introduction of MPI
• 2016: fast rise of AI
  Special version of the Pascal GPU for Machine Learning (ML)
  NVIDIA Xavier platform for automotive
  Announcement of KNM, a KNL for ML
  Neuromorphic processor from IBM
• Influence of FPGAs?
• 20??: quantum computing? The next turn?
• Still, there are invariants for HPC: threading, vectorization, memory management
Some existing processors
# of cores   X86_64                    IBM            ARM               Other
<= 32        Intel Haswell, AMD Zen    IBM Power      Broadcom Vulcan
< 256        Intel KNL                                Phytium FT2000
>= 256                                 IBM Kilocore                     Sunway SW26010
A real processor (2015)
• Haswell-EP (here 18 cores): 22 nm, 5 560 000 000 transistors!
KNL
• Arrival of memory hierarchies: beware of array placement (see the MCDRAM allocation sketch below)
• SKUs with 68 cores max for the time being
  34 tiles: ideally 34 MPI ranks + 4 OpenMP threads
SKU = Stock Keeping Unit
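As a sketch of explicit array placement on KNL in flat mode, the memkind library's hbwmalloc interface can put a bandwidth-critical array in MCDRAM; the array size and the fallback policy here are illustrative assumptions, and the talk itself does not prescribe any particular API.

#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind high-bandwidth-memory allocator; link with -lmemkind */

int main(void)
{
    const size_t n = 1 << 24;                    /* 16 Mi doubles = 128 MiB, assumed bandwidth-critical */
    int in_hbw = (hbw_check_available() == 0);   /* 0 means MCDRAM is visible */
    double *a = in_hbw ? hbw_malloc(n * sizeof *a)   /* placed in MCDRAM */
                       : malloc(n * sizeof *a);      /* fallback to DDR  */
    if (!a) return 1;

    for (size_t i = 0; i < n; ++i)
        a[i] = (double)i;
    printf("a[42] = %.1f (in %s)\n", a[42], in_hbw ? "MCDRAM" : "DDR");

    if (in_hbw) hbw_free(a); else free(a);
    return 0;
}

In flat mode an unmodified binary can instead be bound to MCDRAM from the outside, typically with something like numactl --membind=1 ./app when the MCDRAM shows up as a separate NUMA node (the node number depends on the configuration).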
KNL
• The VPUs are the sole source of computing power; an Atom core handles the scalar side
  AVX-512 & FMA
• Low frequencies (1.4 GHz TDP – 1.2 GHz under AVX)
AMD: Zen architecture
• Introduction of multiple threads per core (2)
• AVX2
• The foundation of an HPC solution at AMD; to monitor in the future
  Hot Chips 2016: "AMD also conducted the first public demonstration of its upcoming 32-core, 64-thread "Zen"-based server processor, codenamed "Naples," in a dual processor server running the Windows® Server operating system."
Sunway SW26010
• 1.45 GHz
• 260 cores per node: 4 core groups of 64 compute + 1 management cores
  Supports SIMD
• 8 DP ops / cycle
  But DDR3 memory and no cache coherency between clusters
• Scratchpad memory of 64 KB per little core!
  Manual management of memory
• Low byte/flop ratio
IBM Power 9
• More threads and cores; vector engines as always
• Good memory bandwidth: 8 memory channels
Kilocore
• An example of a large number of cores
• 1000 cores, 32 nm
• Not usable for DP
Vulcan
• 32 cores, xx GHz
• Performance ~ Haswell
• ARMv8
• 4 threads per core
• 128-bit SIMD (Neon)
• 8 DDR4 channels: good bandwidth expected
• First chips showcased at SC16
The Chinese ARM: Phytium FT-2000/64
• ARMv8-compatible architecture; 1 panel = 8 cores
• 28 nm
  But slow memory (DDR3)
• Reuse of modern ideas from Intel & others: mesh topology (KNL & SKL)
• Performance should be good
  Xeon E5-2698 v3 – 16 cores – 135 W – 588 Gflops
    16 cores * 2.3 GHz * 4 DP * 2 * 2 (FMA)
  Phytium – 64 cores – 120 W – 512 Gflops
    64 cores * 2 GHz * 2 DP * 2
The change is coming
• The CORAL procurement (US DOE)
  Delivery: 2018
• LLNL & ORNL: IBM + NVIDIA
  Power 9 + Volta, ~150 PFlops, ~8 MW
• ANL: Intel + Cray
  KNH, ~180 PFlops, 13 MW
[Chart: power (MW, 0–20) vs. Linpack Rmax (PFlops, 0–200) for Top500 systems and the CORAL targets; points labeled Manycore, CPU + GPU, Tera 1000, Curie.]
Real efficiency: the growing gap!
Machine            #   Proc     Rpeak (PF/s)  Rmax (PF/s)  HPL % peak  HPCG (PF/s)  % Rmax
Tianhe-2           2   Xeon     54.902        33.863       61.7%       0.5800       1.7%
K computer         5   Sparc    11.280        10.510       93.2%       0.5544       5.3%
Sunway TaihuLight  1   SW26010  125.435       93.015       74.2%       0.3712       0.3%
Sequoia            4   BG/Q     20.132        17.173       85.3%       0.3304       1.9%
Titan              3   Opteron  27.112        17.590       64.9%       0.3223       1.8%
Trinity            7   Xeon     11.078        8.101        73.1%       0.1826       2.3%
Curie TN           62  Xeon     1.667         1.359        81.5%       0.0510       3.8%
T1k-1              44  Xeon     2.586         1.871        72.3%       0.0340       1.8%
Top500 June 2016
“The big picture”
• Frequencies don’t increase
• More cores per processor
• Increase in the number of threads per core
• Vector instructions are essential for performance
• Memory access is the most important problem
• Competition is awakening ($$$)
• Codes must be modernized:
What is “Modernized” Code?
• Thread Scaling: increase concurrent thread use across coherent shared memory. Tools: OpenMP, TBB, Cilk Plus
• Vector Scaling: use many wide-vector (512-bit) instructions. Tools: vector loops, vector functions, array notations
• Cache Blocking: use algorithms to reduce memory bandwidth pressure and improve the cache hit rate. Tools: blocking algorithms
• Fabric Scaling: tune the workload to an increased node count. Tools: MPI
• Data Layout: optimize data layout for unconstrained performance. Tools: AoS → SoA, directives for alignment
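As an illustration of the Cache Blocking entry above, here is a minimal sketch of a blocked matrix transpose in C; the tile size of 64 is an illustrative assumption to be tuned to the actual cache sizes.

#include <stddef.h>

#define BLOCK 64   /* tile edge; tune so a pair of tiles fits in L1/L2 */

/* Blocked transpose: B = A^T for an n x n matrix stored row-major.
 * Working on BLOCK x BLOCK tiles keeps both the rows read from A and
 * the rows written to B resident in cache, cutting memory traffic. */
void transpose_blocked(const double *A, double *B, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK && i < n; ++i)
                for (size_t j = jj; j < jj + BLOCK && j < n; ++j)
                    B[j * n + i] = A[i * n + j];
}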
[Diagram: X/Y/Z components of elements 1–16 packed across 512-bit lines 1–5, illustrating the data-layout question (AoS vs. SoA). See the sketch below.]
Credit: Intel
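To make the Data Layout entry and the figure above concrete, a minimal sketch contrasting an array of structures with a structure of arrays in C; the particle fields are illustrative, and the SoA form is what gives the unit-stride accesses that wide vector loads want.

#include <stddef.h>

/* Array of Structures (AoS): x, y, z of one particle are adjacent,
 * so touching 8 consecutive x values means a stride-3 (gather-like) access. */
struct particle { double x, y, z; };

void scale_x_aos(struct particle *p, size_t n, double s)
{
    for (size_t i = 0; i < n; ++i)
        p[i].x *= s;                 /* strided access: hard to vectorize well */
}

/* Structure of Arrays (SoA): each component is contiguous,
 * so the same loop becomes unit-stride and maps directly onto SIMD loads. */
struct particles { double *x, *y, *z; size_t n; };

void scale_x_soa(struct particles *p, double s)
{
    for (size_t i = 0; i < p->n; ++i)
        p->x[i] *= s;                /* unit-stride: vectorizes naturally */
}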
Throughput and latency of a pipeline
  Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM1 (5 ns), MEM2 (5 ns), WB (4 ns)
[Diagram: instructions I1–I7 flowing through the pipeline, one stage per cycle: the ideal case.]
Source: J.N. Amaral
Throughput and latency of a pipeline
  Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM (10 ns), WB (4 ns)
  The slowest pipeline stage limits the latency!
  L(I1) = L(I2) = L(I3) = L(I4) = 50 ns
[Diagram: instructions I1–I4 flowing through the pipeline on a 0–60 ns time axis, each stage taking one 10 ns cycle.]
Source: J.N. Amaral
(IF = Instruction Fetch, ID = Instruction Decode, EX = Instruction eXecute, MEM = memory access, WB = Write Back)
Limits of a pipeline
Hazards prevent the next instruction from executing during its planned clock cycle:
• Structural hazards: the hardware cannot support a given combination of instructions (access to the same hardware resource in the same cycle)
• Data hazards: an instruction depends on the result of a previous instruction that is still in the pipeline
• Control hazards: caused by the delay between fetching instructions and deciding to change the flow of control (branches)
Source: J.N. Amaral
Data hazards: an example
  Instruction format: Op res, src1, src2
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
[Diagram: each instruction flows through the IF ID/RF EX MEM WB stages one cycle apart (instruction order vs. time in clock cycles); the instructions following add all read r1 before its result has been written back.]
Source: J.N. Amaral