PROCESSOR EVOLUTION
NOV. 28, 2016
Accelerated Computing For Fusion | Guillaume Colin de Verdière
Plan
• General context: the Power Wall and the Memory Wall
• Examples of technologies
• Conclusion
Scope of the talk
Technological background
[Diagram: a supercomputer (Curie) is a cluster built from nodes connected by an interconnection network to storage. A node (motherboard) holds CPUs with their RAM, a NIC and IO. A processor (CPU) contains several cores, a memory controller, an L3 cache and per-core L2 caches. A core contains an ALU, a SIMD unit, an L1 cache and hardware threads 1–4.]
The first challenge: the power wall
• Exaflops = 10^18 (1 000 000 000 000 000 000) floating-point operations per second; EXA1 will be a cluster
• Running costs will be driven by electricity: 20 MW is the maximum agreed by the HPC community
• Most of the power is used by the processor: ~50 % processors, ~30 % memories, ~10 % interconnection, ~10 % disks
Credit K.J. Roche, Jul 2011, U. of Michigan
Power wall: consequences
• Limit the frequencies and voltage: dynamic power P ≈ C · V² · F (a worked sketch follows below)
  KNL & GPU: ~1.4 GHz
• Add an increasing number of cores
  GPU: > 2000 cores; CPU: > 20 cores
• Use more powerful computing units: increasing role of SIMD (data parallelism)
• Work on processor architectures
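To make the trade-off concrete, here is a minimal sketch in C based on the dynamic-power relation above; the capacitance, voltage and frequency values are illustrative assumptions, not figures from the talk.

#include <stdio.h>

/* Dynamic power model from the slide: P = C * V^2 * F (per core).
 * Hypothetical comparison: one fast core vs. two slower, lower-voltage
 * cores offering the same aggregate clock rate. */
int main(void)
{
    const double C = 1.0e-9;                       /* switched capacitance (arbitrary) */
    double p_fast = C * 1.2 * 1.2 * 3.0e9;         /* 1 core  @ 3.0 GHz, 1.2 V */
    double p_slow = 2.0 * C * 0.9 * 0.9 * 1.5e9;   /* 2 cores @ 1.5 GHz, 0.9 V */
    printf("1 core  @ 3.0 GHz, 1.2 V: %.2f W\n", p_fast);
    printf("2 cores @ 1.5 GHz, 0.9 V: %.2f W\n", p_slow);
    return 0;
}

With these made-up numbers the two slower cores deliver the same total clock throughput for about 44 % less dynamic power, which is exactly the pressure pushing KNL and GPUs toward ~1.4 GHz and many cores.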
Processor Architecture
Intel desktop processors
µproc         Year  Freq.     Pipeline stages  Issue  O-o-O  Cores/chip  Power   Process
486           1989  25 MHz    5                1      No     1           5 W     1000 nm
Pentium       1993  66 MHz    5                2      No     1           10 W    350 nm
Pentium Pro   1997  200 MHz   10               3      Yes    1           29 W    350 nm
Pentium 4 W.  2001  2000 MHz  22               3      Yes    1           75 W    180 nm
Pentium 4 P   2004  3600 MHz  31               3      Yes    1           103 W   130 nm
Core          2006  2930 MHz  14               4      Yes    2           75 W    65 nm
Core i5 NHM   2010  3300 MHz  14               4      Yes    4           87 W    32 nm
Core i5 IVB   2012  3400 MHz  14               4      Yes    8           77 W    22 nm
Vectorization: a link to the past?
• CRAY: pipeline architecture, "vertical" parallelism, no caches
  → vectors should be as long as possible
• SIMD architecture: "horizontal" parallelism, beware of the caches
  → vector size must be compatible with the LLC*
[Diagram: vectorized C = A + B. Cache lines (line 1A … line 128A; line 1D … line 128D) move between DDR and the cache; elements a0–a7 and b0–b7 are loaded with vmovupd, added with vaddpd, and the results c0–c7 are written back with vmovupd. A code sketch follows below.]
* LLC = Last Level Cache
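A minimal C sketch of the operation in the diagram: a loop that an auto-vectorizing compiler can map onto vmovupd/vaddpd when AVX is enabled (the restrict qualifiers and the compiler flags are assumptions about how one would typically write and build it, not part of the talk).

#include <stddef.h>

/* C = A + B over n doubles; with e.g. gcc -O3 -mavx this plain loop
 * is turned into vmovupd loads, vaddpd, and vmovupd stores. */
void vec_add(const double *restrict a, const double *restrict b,
             double *restrict c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

The CRAY vs. SIMD remark above translates here into keeping n large enough to amortize the loop overhead, while sizing the traversal so its working set stays compatible with the last-level cache.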
ARM SVE (announced in Aug. 2016)
• The Neon SIMD engine is not well adapted to HPC
• How to help the compiler do better? Fujitsu will introduce SVE in its future processor
The second challenge: the memory wall
• Computing is “cheaper” than moving data
• Memory hierarchies have associated costs (see the pointer-chasing sketch below)
  - L1 cache: small size (kB), small latency (~1 cy), high bandwidth
  - L2 cache: medium size (kB), medium latency (~14 cy), high bandwidth
  - L3 cache: larger size (MB), high latency (~40 cy)
  - MCDRAM (KNL): very high latency (~140 cy), focus on bandwidth
  - DDR4: very high latency (~140 cy), focus on storage size
  - NIC
Credits: Bill Dally, NVIDIA, SC10; K.J. Roche, Jul 2011, U. of Michigan
(cy = CPU cycle; at 2 GHz, 1 cy = 0.5 ns)
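A rough way to see these latencies in practice is a pointer-chasing probe, where each load depends on the previous one, so the average time per iteration approximates the load-to-use latency of whatever level holds the working set. This is a minimal sketch; the buffer size, stride and iteration count are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t n = 1 << 20;          /* 8 MiB of size_t on 64-bit: mostly L3/DDR */
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 1;

    /* Build one long cycle with a large odd stride to defeat the prefetcher. */
    const size_t stride = 4099;        /* co-prime with n */
    size_t idx = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t nxt = (idx + stride) % n;
        next[idx] = nxt;
        idx = nxt;
    }

    const size_t iters = 100 * 1000 * 1000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < iters; ++i)
        p = next[p];                   /* serialized, dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg %.2f ns per load (last index %zu)\n", ns / iters, p);
    free(next);
    return 0;
}

Shrinking the buffer until it fits in L2 or L1, or placing it in MCDRAM on KNL, changes the measured time per load accordingly.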
Moving data around is more and more expensive
• Put the functional units closer to each other
  Introduction of stacked memory (KNL & GPUs)
  Put the GPU closer to the CPUs: SoCs (system-on-chip)
  3D technologies
• Chiplets
[Images: HMC by Micron; APU by AMD]
[Diagram: packaging options labeled Chiplet, Interposer, 3D-IC, Multi-Chip-Module, Memory device, FPGA bare die.]
Difficulties of microelectronics
• Mastering finer processes is more and more difficult: Intel moves away from the tick-tock model
  More time between two nodes
  Nodes to come: 10 – 7 – 5 nm
• Transistor technology must evolve: FinFET has reached its limit
• Costs are exploding: ~2x per technology node
  Updating a "fab" costs > 4 G€
  Mergers will continue
  Emerging importance of "fabless" companies
• Active research into new solutions
  Materials: graphene, …
  Processor architecture: e.g. ARM big.LITTLE
Evolution of Nvidia GPUs
• More and more powerful: Pascal ~12 Tflops SP, 5 Tflops DP
  12/32 GB HBM2, 1 TB/s internally
  300 W
• Exchanges between GPUs are faster: NVLink 1: 80 GB/s peak, 64 GB/s sustained
• Usage of CPU memory is getting better: through PCI-Express for the time being
  Will be coherent with NVLink 2
• Programmability (portability) is getting better: OpenMP 4.x available soon
Is a next turn coming?
• The last big turn was at the end of the 90s: move away from vector machines
  Usage of commodity processors
  Introduction of MPI
• 2016: fast rise of AI
  Special version of the Pascal GPU for Machine Learning (ML)
  NVIDIA Xavier platform for automotive
  Announcement of KNM, a KNL for ML
  Neuromorphic processor from IBM
• Influence of FPGAs?
• 20??: quantum computing? The next turn?
• Still, there are invariants for HPC: threading, vectorization, memory management
Some existing processors
# of cores   X86_64                    IBM            ARM               Other
<= 32        Intel Haswell, AMD Zen    IBM Power      Broadcom Vulcan
< 256        Intel KNL                                Phytium FT2000
>= 256                                 IBM Kilocore                     Sunway SW26010
A real processor (2015)
• Haswell-EP (here 18 cores): 22 nm, 5 560 000 000 transistors!
KNL
• Arrival of memory hierarchies: beware of array placement (see the MCDRAM allocation sketch below)
• SKUs with 68 cores max for the time being
  34 tiles: ideally 34 MPI ranks + 4 OpenMP threads
SKU = Stock Keeping Unit
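As a sketch of explicit array placement on KNL in flat mode, the memkind library's hbwmalloc interface can put a bandwidth-critical array in MCDRAM; the array size and the fallback policy here are illustrative assumptions, and the talk itself does not prescribe any particular API.

#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind high-bandwidth-memory allocator; link with -lmemkind */

int main(void)
{
    const size_t n = 1 << 24;                    /* 16 Mi doubles = 128 MiB, assumed bandwidth-critical */
    int in_hbw = (hbw_check_available() == 0);   /* 0 means MCDRAM is visible */
    double *a = in_hbw ? hbw_malloc(n * sizeof *a)   /* placed in MCDRAM */
                       : malloc(n * sizeof *a);      /* fallback to DDR  */
    if (!a) return 1;

    for (size_t i = 0; i < n; ++i)
        a[i] = (double)i;
    printf("a[42] = %.1f (in %s)\n", a[42], in_hbw ? "MCDRAM" : "DDR");

    if (in_hbw) hbw_free(a); else free(a);
    return 0;
}

In flat mode an unmodified binary can instead be bound to MCDRAM from the outside, typically with something like numactl --membind=1 ./app when the MCDRAM shows up as a separate NUMA node (the node number depends on the configuration).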
KNL
• The VPUs are the sole source of computing power; an Atom core handles the scalar side
  AVX-512 & FMA
• Low frequencies (1.4 GHz TDP – 1.2 GHz under AVX)
AMD: Zen architecture
• Introduction of multiple threads per core (2)
• AVX2
• The foundation of an HPC solution at AMD; to monitor in the future
  Hot Chips 2016: "AMD also conducted the first public demonstration of its upcoming 32-core, 64-thread "Zen"-based server processor, codenamed "Naples," in a dual processor server running the Windows® Server operating system."
Sunway SW26010
• 1.45 GHz
• 260 cores per node: 4 core groups of 64 compute + 1 management cores
  Supports SIMD
• 8 DP ops / cycle
  But DDR3 memory and no cache coherency between clusters
• Scratchpad memory of 64 KB per little core!
  Manual management of memory
• Low byte/flop ratio
IBM Power 9
• More threads and cores; vector engines as always
• Good memory bandwidth: 8 memory channels
Kilocore
• An example of a large number of cores
• 1000 cores, 32 nm
• Not usable for DP
Vulcan
• 32 cores, xx GHz
• Performance ~ Haswell
• ARMv8
• 4 threads per core
• 128-bit SIMD (Neon)
• 8 DDR4 channels: good bandwidth expected
• First chips showcased at SC16
The Chinese ARM: Phytium FT-2000/64
• ARMv8-compatible architecture; 1 panel = 8 cores
• 28 nm
  But slow memory (DDR3)
• Reuse of modern ideas from Intel & others: mesh topology (KNL & SKL)
• Performance should be good
  Xeon E5-2698 v3 – 16 cores – 135 W – 588 Gflops
    16 cores * 2.3 GHz * 4 DP * 2 * 2 (FMA)
  Phytium – 64 cores – 120 W – 512 Gflops
    64 cores * 2 GHz * 2 DP * 2
The change is coming
• The CORAL procurement (US DOE)
  Delivery: 2018
• LLNL & ORNL: IBM + NVIDIA
  Power 9 + Volta, ~150 PFlops, ~8 MW
• ANL: Intel + Cray
  KNH, ~180 PFlops, 13 MW
[Chart: power (MW, 0–20) vs. Linpack Rmax (PFlops, 0–200) for Top500 systems and the CORAL targets; points labeled Manycore, CPU + GPU, Tera 1000, Curie.]
Real efficiency: the growing gap!
Machine            #   Proc     Rpeak (PF/s)  Rmax (PF/s)  HPL % peak  HPCG (PF/s)  % Rmax
Tianhe-2           2   Xeon     54.902        33.863       61.7%       0.5800       1.7%
K computer         5   Sparc    11.280        10.510       93.2%       0.5544       5.3%
Sunway TaihuLight  1   SW26010  125.435       93.015       74.2%       0.3712       0.3%
Sequoia            4   BG/Q     20.132        17.173       85.3%       0.3304       1.9%
Titan              3   Opteron  27.112        17.590       64.9%       0.3223       1.8%
Trinity            7   Xeon     11.078        8.101        73.1%       0.1826       2.3%
Curie TN           62  Xeon     1.667         1.359        81.5%       0.0510       3.8%
T1k-1              44  Xeon     2.586         1.871        72.3%       0.0340       1.8%
Top500 June 2016
“The big picture”
• Frequencies don’t increase
• More cores per processor
• Increase in the number of threads per core
• Vector instructions are essential for performance
• Memory access is the most important problem
• Competition is awakening ($$$)
• Codes must be modernized:
What is “Modernized” Code?
• Thread Scaling: increase concurrent thread use across coherent shared memory. Tools: OpenMP, TBB, Cilk Plus
• Vector Scaling: use many wide-vector (512-bit) instructions. Tools: vector loops, vector functions, array notations
• Cache Blocking: use algorithms to reduce memory bandwidth pressure and improve the cache hit rate. Tools: blocking algorithms
• Fabric Scaling: tune the workload to an increased node count. Tools: MPI
• Data Layout: optimize data layout for unconstrained performance. Tools: AoS → SoA, directives for alignment
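As an illustration of the Cache Blocking entry above, here is a minimal sketch of a blocked matrix transpose in C; the tile size of 64 is an illustrative assumption to be tuned to the actual cache sizes.

#include <stddef.h>

#define BLOCK 64   /* tile edge; tune so a pair of tiles fits in L1/L2 */

/* Blocked transpose: B = A^T for an n x n matrix stored row-major.
 * Working on BLOCK x BLOCK tiles keeps both the rows read from A and
 * the rows written to B resident in cache, cutting memory traffic. */
void transpose_blocked(const double *A, double *B, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK && i < n; ++i)
                for (size_t j = jj; j < jj + BLOCK && j < n; ++j)
                    B[j * n + i] = A[i * n + j];
}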
[Diagram: X/Y/Z components of elements 1–16 packed across 512-bit lines 1–5, illustrating the data-layout question (AoS vs. SoA). See the sketch below.]
Credit: Intel
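To make the Data Layout entry and the figure above concrete, a minimal sketch contrasting an array of structures with a structure of arrays in C; the particle fields are illustrative, and the SoA form is what gives the unit-stride accesses that wide vector loads want.

#include <stddef.h>

/* Array of Structures (AoS): x, y, z of one particle are adjacent,
 * so touching 8 consecutive x values means a stride-3 (gather-like) access. */
struct particle { double x, y, z; };

void scale_x_aos(struct particle *p, size_t n, double s)
{
    for (size_t i = 0; i < n; ++i)
        p[i].x *= s;                 /* strided access: hard to vectorize well */
}

/* Structure of Arrays (SoA): each component is contiguous,
 * so the same loop becomes unit-stride and maps directly onto SIMD loads. */
struct particles { double *x, *y, *z; size_t n; };

void scale_x_soa(struct particles *p, double s)
{
    for (size_t i = 0; i < p->n; ++i)
        p->x[i] *= s;                /* unit-stride: vectorizes naturally */
}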
Throughput and latency of a pipeline
  Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM1 (5 ns), MEM2 (5 ns), WB (4 ns)
[Diagram: instructions I1–I7 flowing through the pipeline, one stage per cycle: the ideal case.]
Source: J.N. Amaral
Throughput and latency of a pipeline
  Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM (10 ns), WB (4 ns)
  The slowest pipeline stage limits the latency!
  L(I1) = L(I2) = L(I3) = L(I4) = 50 ns
[Diagram: instructions I1–I4 flowing through the pipeline on a 0–60 ns time axis, each stage taking one 10 ns cycle.]
Source: J.N. Amaral
(IF = Instruction Fetch, ID = Instruction Decode, EX = Instruction eXecute, MEM = memory access, WB = Write Back)
Limits of a pipeline
Hazards prevent the next instruction from executing during its planned clock cycle:
• Structural hazards: the hardware cannot support a given combination of instructions (access to the same hardware resource in the same cycle)
• Data hazards: an instruction depends on the result of a previous instruction that is still in the pipeline
• Control hazards: caused by the delay between fetching instructions and deciding to change the flow of control (branches)
Source: J.N. Amaral
Data hazards: an example
  Instruction format: Op res, src1, src2
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
[Diagram: each instruction flows through the IF ID/RF EX MEM WB stages one cycle apart (instruction order vs. time in clock cycles); the instructions following add all read r1 before its result has been written back.]
Source: J.N. Amaral