Astrosim 2017
Les Houches, the 7th ;-)
GPU: the disruptive technology of the 21st century
A little comparison between CPU, MIC and GPU
Emmanuel Quémener
Emmanuel QUÉMENER, CC BY-NC-SA, September 29, 2017
A (human) generation before... A B movie in 1984
● 1984: The Last Starfighter
– 27 minutes of synthetic images
– 22.5×10⁹ operations per image
– Rendered on a Cray X-MP (130 kW)
– 66 days of compute (in fact, 1 year was needed)
● 2016: on a GTX 1080Ti (250 W)
– 3.4 seconds
– Comparison GTX 1080Ti / Cray:
● Performance: ×1 600 000!
● Energy consumption: ~×900 000 000!
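The two ratios above can be checked with a few lines of arithmetic (the figures are the ones quoted on the slide: 66 days on a 130 kW Cray X-MP versus 3.4 s on a 250 W GTX 1080Ti):

```python
# Sanity check of the slide's ratios, using the slide's own figures.
cray_seconds = 66 * 86400            # 66 days of compute on the Cray X-MP
gpu_seconds = 3.4                    # the same frames on a GTX 1080Ti
speedup = cray_seconds / gpu_seconds # ~1.6 million

cray_energy = 130e3 * cray_seconds   # joules at 130 kW
gpu_energy = 250 * gpu_seconds       # joules at 250 W
energy_ratio = cray_energy / gpu_energy  # ~900 million

print(f"speedup ~ {speedup:.2e}, energy ratio ~ {energy_ratio:.2e}")
```

Both come out at the orders of magnitude quoted on the slide.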
« The beginning of all things »
● « Nvidia Launches Tesla Personal Supercomputer »
● When: 19/11/2008. Who: Nvidia
● Where: on Tom's Hardware
● What: a Tesla C1060 PCIe card with 240 cores
● How much: 933 GFlops SP (but only 78 GFlops DP)
What position for GPUs? The other « accelerators »
● Accelerators, or the old history of coprocessors...
– 1980: the 8087 (alongside the 8086/8088) for floating-point operations
– 1989: the 80387 (alongside the 80386) and compliance with IEEE 754
– 1990: the 80486DX and the integration of the FPU inside the CPU
– 1997: K6 with 3DNow! & Pentium MMX: SIMD inside the CPU
– 1999: the SSE instructions and the birth of a long series (SSE4 & AVX)
● When chips stand apart from the CPU:
– 1998: DSPs such as the TMS320C67x as tools
– 2008: the Cell inside the PS3 and inside IBM's Roadrunner, #1 of the Top500
● A business of compilers & very specific programming models
Why is a GPU so powerful? To construct 3D scenes!
● 2 approaches:
– Raytracing: PovRay
– Shading: 3 operations
● Raytracing:
– Start from the eye and trace towards the objects of the scene
● Shading:
– Model2World: vectorial objects placed in the scene
– World2View: projection of the objects onto a view plane
– View2Projection: pixelisation of the vectorial view plane
Why is a GPU so powerful? Shading & matrix computing
● Model2World: 3 matrix products
– Rotation
– Translation
– Scaling
● World2View: 2 matrix products
– Camera position
– Direction of the viewed point
● View2Projection:
– Pixelisation
A GPU: a « huge » Matrix Multiplier
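The Model2World step above can be sketched with plain 4×4 homogeneous matrices (pure Python for readability, all values illustrative; a GPU performs the same products, massively in parallel):

```python
import math

def matmul(A, B):
    """Naive 4x4 matrix product: the core operation a GPU accelerates."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def scaling(s):
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]

def rotation_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

# Model2World = Translation . Rotation . Scaling (one possible order)
M2W = matmul(translation(1, 2, 0), matmul(rotation_z(math.pi / 2), scaling(2)))

def apply(M, p):
    """Apply a homogeneous transform to a 3D point."""
    v = [p[0], p[1], p[2], 1]
    return [sum(M[i][k] * v[k] for k in range(4)) for i in range(3)]

print(apply(M2W, (1, 0, 0)))  # (1,0,0) scaled, rotated, then translated
```

World2View and View2Projection are further matrix products of the same kind, which is why a shader pipeline reduces to matrix multiplication.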
Is the GPU really so powerful? With my best MyriALUs...
[Figure: xGEMM throughput (0 to 10000) for Tesla P100, GTX 1080Ti, R9 Fury, Xeon Phi, Ryzen7 1800 and E5-2680v4 x2, with OpenBLAS, clBLAS, Thunking and CuBLAS, each in DP and SP; grouped as Tesla for Pros, GTX for Gamers, AMD for Alt*, CPU]
BLAS match, CPU vs GPU: xGEMM where x=D(P) or S(P)
[Figure: OpenBLAS, clBLAS, Thunking and CuBLAS in DP and SP (0 to 10000) across Tesla C1060, M2090, K20m, K80, P100; Quadro K4000; GTX 560 Ti, 690, 780, Titan, 780Ti, 980, 980Ti, 1080, 1080Ti; HD7970, R9-295x #1, R9 Fury, Nano; Xeon Phi; and CPUs FX9590, Ryzen7 1800, E6300, X5550 x2, E5-2665 x2, E5-2640v3 x2, E5-2687v3 x2, E5-2640v4 x2, E5-2680v4 x2; grouped as Tesla for Pros, GTX for Gamers, AMD for Alt*, CPU]
What Nvidia never advertises... xGEMM over a large domain...
[Figure: performance versus size for Tesla P100 and GTX 1080]
Yes, the performance exists, but only under specific conditions!
How a GPU must be considered: a laboratory of parallelism
● You've got tens of thousands of Equivalent PUs!
● You have the choice between: CPU, GPU, MIC
How many elements inside? Differences between GPU & CPU
● Operations:
– Matrix multiply
– Vectorization
– Pipelining
– Shader (multi)processors
● Programming: 1993
– OpenGL, Glide, Direct3D
● Genericity: 2002
– CgToolkit, CUDA, OpenCL
Why is the GPU so powerful? The comeback of generic processing
● Inside the GPU:
– Specialized processing units
– Pipelining efficiency
● Adaptation loss:
– Changing scenes
– Details of different natures
● The solution: more generalist processing units!
Why is the GPU so powerful? All the « M » of Flynn's taxonomy included
● Vectorial: SIMD (Single Instruction, Multiple Data)
– Addition of 2 positions (x,y,z): 1 single instruction
● Parallel: MIMD (Multiple Instructions, Multiple Data)
– Several executions in parallel on (almost) the same data
● In fact, SIMT: Single Instruction, Multiple Threads
– All processing units share the threads
– Each processing unit can work independently from the others
● The threads need to be synchronized
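The models above can be contrasted in a few lines of plain Python (a conceptual sketch only; real SIMD and SIMT happen in hardware, and the data values are illustrative):

```python
# SIMD: one "instruction" applied to multiple data at once,
# e.g. adding two (x, y, z) positions element-wise.
a = [(1, 2, 3), (4, 5, 6)]
b = [(10, 20, 30), (40, 50, 60)]
simd_add = [tuple(x + y for x, y in zip(p, q)) for p, q in zip(a, b)]

# SIMT: every thread runs the SAME kernel, but on its own index,
# and each thread may branch independently of the others.
data = [(1, 1), (5, 2), (7, 3), (9, 4)]

def kernel(gid):
    x, y = data[gid]                          # gid plays get_global_id(0)
    return x + y if gid % 2 == 0 else x - y   # divergent branch per thread

simt_result = [kernel(gid) for gid in range(len(data))]

print(simd_add)
print(simt_result)
```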
Important dates
● 1992-01: OpenGL and the birth of a standard
● 1998-03: OpenGL 1.2 and its interesting functions
● 2002-12: Cg Toolkit (Nvidia) and the language extensions
– Wrappers for many languages (Python)
● 2007-06: CUDA (Nvidia), or the arrival of a real language
● 2008-06: Snow Leopard (Apple) integrates OpenCL
– The will to make the best use of one's machine?
● 2008-11: OpenCL 1.0 and its first specifications
● 2011-04: WebCL and its first version by Nokia
Who uses OpenCL? As applications...
● OS: Apple in MacOSX
● « Big » applications:
– LibreOffice
– FFmpeg
● Well-known applications:
– Graphical ones: photoscan, 
– Scientific ones: OpenMM, Gromacs, Lammps, Amber, …
● Hundreds of software packages and libraries! But checking is essential...
– https://www.khronos.org/opencl/resources/opencl-applications-using-opencl
Why the GPU is a disruptive technology: how it reshuffles the cards in Scientific Computing
● In a conference in February 2006, this… 
– ×100 in 5 years
● Between 2000 and 2015:
– GeForce2 Go to GTX 980Ti: from 286 to 2 816 000 MOperations/s: ×10000
– Classical CPUs: ×100
Just a reminder from last Friday!
● OpenCL is the most versatile one
● CUDA can ONLY be used with Nvidia GPUs
● Accelerators seem to be the most universal, but...
Parallel programming models:

          Cluster  Node CPU  Node GPU  Node Nvidia  Accelerator
MPI       Yes      Yes       No        No           Yes*
PVM       Yes      Yes       No        No           Yes*
OpenMP    No       Yes       No        No           Yes*
Pthreads  No       Yes       No        No           Yes*
OpenCL    No       Yes       Yes       Yes          Yes
CUDA      No       No        No        Yes          No
TBB       No       Yes       No        No           Yes*
Act as an integrator: do not reinvent the wheel!
● Classic libraries can be used on the GPU:
– Nvidia provides lots of implementations
– OpenCL provides several of them, but not as complete
Parallel programming libraries:

        Cluster        Node CPU      Node GPU  Node Nvidia  Accelerator
BLAS    BLACS/MKL      OpenBLAS/MKL  clBLAS    CuBLAS       OpenBLAS/MKL
LAPACK  Scalapack/MKL  Atlas/MKL     clMAGMA   MAGMA        MagmaMIC
FFT     FFTw3          FFTw3         clFFT     CuFFT        FFTw3
Why OpenCL? A non politically correct history
● In the mid-2000s, Apple needed power for MacOSX:
– Some processing could be done by Nvidia chipsets
– CUDA existed as a good successor to CgToolkit
● But Apple did not want to be jailed
– (as their users are :-) )
● Apple promoted the creation of the Khronos consortium:
– AMD, IBM and ARM came quickly
– … and Nvidia, Intel, Altera, … came along reluctantly...
What OpenCL offers: 10 implementations on x86
● 3 implementations for CPU:
– The AMD one: the original
– The Intel one: efficient for specific parallel rates
– PortableCL (POCL): Open Source
● 4 implementations for GPU:
– The AMD one: for AMD/ATI graphics boards
– The Nvidia one: for Nvidia graphics boards
– « Beignet »: an Open Source one for Intel graphics chipsets
– « Mesa »: an Open Source one for all Open Source drivers in the Linux kernel
● 1 implementation for Accelerator: the Intel one for the Xeon Phi
● 1 implementation for FPGA: the Intel one for Altera FPGAs
● On other platforms, it seems to be possible
Goal: replace the loop by a distribution. 2 types of distributions:
● Blocks/WorkItems
– « Global » domain: a large but slow amount of memory available
● Threads
– « Local » domain: a small but quick amount of memory available
– Needed for the synchronisation of processes
● Different accesses to memory
● « clinfo » gets the properties of CL devices: Max work items
– Max work items: 1024x1024x1024 for a CPU, so about 1.1 billion distributed items
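The « replace the loop by a distribution » idea can be sketched without any GPU: express the loop body as a kernel of the index, then map it over the whole index space (a toy stand-in for OpenCL's get_global_id, not actual OpenCL code):

```python
SIZE = 8
a = list(range(SIZE))
b = list(range(SIZE, 2 * SIZE))

# Sequential version: one loop walks the whole index space.
c_loop = [0] * SIZE
for n in range(SIZE):
    c_loop[n] = a[n] + b[n]

# Distributed version: the loop disappears; each work-item gets one index,
# exactly as an OpenCL kernel gets it from get_global_id(0).
def kernel(n):
    return a[n] + b[n]

c_dist = list(map(kernel, range(SIZE)))  # conceptually, SIZE parallel work-items
assert c_dist == c_loop
```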
How to do it? A little example: a « Hello World » in OpenCL
● Define two vectors in ASCII
● Duplicate them into 2 huge vectors
● Add them using an OpenCL kernel
● Print the result on screen

Addition of 2 vectors, a+b=c: for each n, c[n] = a[n] + b[n]

Output: « Hello OpenCL World! » repeated over the whole vector, followed by « The End ».
Kernel Code & Build and Call: adding vectors in OpenCL

__kernel void VectorAdd(__global int* c, __global int* a, __global int* b)
{
  // Index of the elements to add
  unsigned int n = get_global_id(0);
  // Sum the n-th element of vectors a and b and store it in c
  c[n] = a[n] + b[n];
}

OpenCLProgram = cl.Program(ctx, OpenCLSource).build()
OpenCLProgram.VectorAdd(queue, HostVector1.shape, None, GPUOutputVector, GPUVector1, GPUVector2)
How to OpenCL? Write « Hello World! » in C
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
const char* OpenCLSource[] = {
"__kernel void VectorAdd(__global int* c, __global int* a,__global int* b)",
"{",
" // Index of the elements to add \n",
" unsigned int n = get_global_id(0);",
" // Sum the n’th element of vectors a and b and store in c \n",
" c[n] = a[n] + b[n];",
"}"
};
int InitialData1[20] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[20] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};
#define SIZE 2048
int main(int argc, char **argv)
{
  int HostVector1[SIZE], HostVector2[SIZE];
  for(int c = 0; c < SIZE; c++)
  {
    HostVector1[c] = InitialData1[c%20];
    HostVector2[c] = InitialData2[c%20];
  }
  cl_platform_id cpPlatform;
  clGetPlatformIDs(1, &cpPlatform, NULL);
  cl_int ciErr1;
  cl_device_id cdDevice;
  ciErr1 = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &cdDevice, NULL);
  cl_context GPUContext = clCreateContext(0, 1, &cdDevice, NULL, NULL, &ciErr1);
  cl_command_queue cqCommandQueue = clCreateCommandQueue(GPUContext, cdDevice, 0, NULL);
  cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector1, NULL);
  cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector2, NULL);
  cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY, sizeof(int) * SIZE, NULL, NULL);
  cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource, NULL, NULL);
  clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
  cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);
  clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUOutputVector);
  clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector1);
  clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUVector2);
  size_t WorkSize[1] = {SIZE}; // one-dimensional range
  clEnqueueNDRangeKernel(cqCommandQueue, OpenCLVectorAdd, 1, NULL, WorkSize, NULL, 0, NULL, NULL);
  int HostOutputVector[SIZE];
  clEnqueueReadBuffer(cqCommandQueue, GPUOutputVector, CL_TRUE, 0, SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);
  clReleaseKernel(OpenCLVectorAdd);
  clReleaseProgram(OpenCLProgram);
  clReleaseCommandQueue(cqCommandQueue);
  clReleaseContext(GPUContext);
  clReleaseMemObject(GPUVector1);
  clReleaseMemObject(GPUVector2);
  clReleaseMemObject(GPUOutputVector);
  for (int Rows = 0; Rows < (SIZE/20); Rows++) {
    printf("\t");
    for(int c = 0; c < 20; c++) {
      printf("%c", (char)HostOutputVector[Rows * 20 + c]);
    }
  }
  printf("\n\nThe End\n\n");
  return 0;
}
How to OpenCL? Write « Hello World! » in Python
import pyopencl as cl
import numpy
import numpy.linalg as la
import sys
OpenCLSource = """
__kernel void VectorAdd(__global int* c, __global int* a,__global int* b)
{
// Index of the elements to add
unsigned int n = get_global_id(0);
// Sum the n th element of vectors a and b and store in c
c[n] = a[n] + b[n];
}
"""
InitialData1=[37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17]
InitialData2=[35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15]
SIZE=2048
HostVector1=numpy.zeros(SIZE).astype(numpy.int32)
HostVector2=numpy.zeros(SIZE).astype(numpy.int32)
for c in range(SIZE):
    HostVector1[c] = InitialData1[c%20]
    HostVector2[c] = InitialData2[c%20]
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
GPUVector1 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=HostVector1)
GPUVector2 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=HostVector2)
GPUOutputVector = cl.Buffer(ctx, mf.WRITE_ONLY, HostVector1.nbytes)
OpenCLProgram = cl.Program(ctx, OpenCLSource).build()
OpenCLProgram.VectorAdd(queue, HostVector1.shape, None, GPUOutputVector, GPUVector1, GPUVector2)
HostOutputVector = numpy.empty_like(HostVector1)
cl.enqueue_copy(queue, HostOutputVector, GPUOutputVector)
GPUVector1.release()
GPUVector2.release()
GPUOutputVector.release()
OutputString=''
for rows in range(SIZE//20):
    OutputString+='\t'
    for c in range(20):
        OutputString+=chr(HostOutputVector[rows*20+c])
print(OutputString)
sys.stdout.write("\nThe End\n\n")
How to OpenCL? « Hello World » C vs Python: weighing them
● On the previous OpenCL implementations:
– In C: 75 lines, 262 words, 2848 bytes
– In Python: 51 lines, 137 words, 1551 bytes
– Factors: 0.68, 0.52, 0.54 in lines, words and bytes
● Programming OpenCL:
– A difficult programming context
● « Opening the box is more difficult than assembling the furniture! »
● A requirement: simplify the calls through an API
– No compatibility between the AMD and Nvidia SDKs
● Everyone rewrites their own API
● One solution, or rather « The » solution: Python
Which simple code to implement in parallel? PiMC: Pi by the dartboard method
● The historical example of the Monte Carlo method
● Parallel implementation:
– Equal distribution of the iterations
● From 2 to 4 parameters:
– Number of iterations
– Parallel Rate (PR)
– (Type of variable: INT32, INT64, FP32, FP64)
– (RNG: MWC, CONG, SHR3, KISS)
● 2 simple observables:
– Estimation of Pi (just indicative, Pi not being rational :-))
– Elapsed time (individual or global)
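The dartboard method fits in a few lines of plain Python (a sequential sketch using the stdlib RNG rather than the MWC/CONG/SHR3/KISS generators of the deck; `pi_parallel` only mimics how the iterations are split across a Parallel Rate of workers):

```python
import random

def pi_dartboard(iterations, seed=0):
    """Throw darts at the unit square; the fraction inside the quarter disc ~ pi/4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(iterations):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / iterations

def pi_parallel(iterations, pr):
    """Split the same iteration count over a Parallel Rate of pr workers."""
    return sum(pi_dartboard(iterations // pr, seed=w) for w in range(pr)) / pr

print(pi_dartboard(100000))  # rough estimate of pi
```

As the slide says, the estimate is only indicative: the statistical error shrinks as 1/sqrt(iterations), so the observable of interest is the elapsed time.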
Differences between CUDA/OpenCL
// CUDA version:
__device__ ulong MainLoop(ulong iterations,uint seed_z,uint seed_w,size_t work)
{
uint jcong=seed_z+work;
ulong total=0;
for (ulong i=0;i<iterations;i++) {
float x=CONGfp ;
float y=CONGfp ;
ulong inside=((x*x+y*y) <= THEONE) ? 1:0;
total+=inside;
}
return(total);
}
__global__ void MainLoopBlocks(ulong *s,ulong iterations,uint seed_w,uint seed_z)
{
ulong total=MainLoop(iterations,seed_z,seed_w,blockIdx.x);
s[blockIdx.x]=total;
__syncthreads();
}
// OpenCL version:
ulong MainLoop(ulong iterations,uint seed_z,uint seed_w,size_t work)
{
uint jcong=seed_z+work;
ulong total=0;
for (ulong i=0;i<iterations;i++) {
float x=CONGfp ;
float y=CONGfp ;
ulong inside=((x*x+y*y) <= THEONE) ? 1:0;
total+=inside;
}
return(total);
}
__kernel void MainLoopGlobal(__global ulong *s,ulong iterations,uint seed_w,uint seed_z)
{
ulong total=MainLoop(iterations,seed_z,seed_w,get_global_id(0));
barrier(CLK_GLOBAL_MEM_FENCE);
s[get_global_id(0)]=total;
}
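For readers without a GPU at hand, the same Monte Carlo loop can be sketched in plain Python. The congruential step used here, jcong = 69069*jcong + 1234567 mod 2^32, is an assumption about what the CONGfp macro expands to (Marsaglia's CONG generator); the seed value is arbitrary:

```python
def main_loop(iterations, seed_z, work):
    """Pure-Python analogue of the MainLoop kernel for one work-item."""
    jcong = (seed_z + work) & 0xFFFFFFFF
    total = 0
    for _ in range(iterations):
        # Assumed CONG step (Marsaglia): jcong = 69069*jcong + 1234567 mod 2^32
        jcong = (69069 * jcong + 1234567) & 0xFFFFFFFF
        x = jcong * 2.328306437e-10          # map uint32 to [0, 1)
        jcong = (69069 * jcong + 1234567) & 0xFFFFFFFF
        y = jcong * 2.328306437e-10
        total += 1 if x * x + y * y <= 1.0 else 0
    return total

# Sum the partial counts of several "work-items", as the kernels do into s[]
workers, iterations = 4, 25000
inside = sum(main_loop(iterations, seed_z=110271, work=w) for w in range(workers))
pi_est = 4.0 * inside / (workers * iterations)
print(pi_est)
```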
You think Python is not efficient ?Let’s have a quick comparison...
[Figure: Pi dartboard throughput in itops, from 0 to 1.4×10¹¹, for MPI/C, MPI+OpenMP/C, MPI+PyOpenCL NoHT and MPI+PyOpenCL HT]
Pi Dart Board match (yesterday): Python vs C & CPU vs GPU vs Phi
[Figure: itops from 0 to 2.5×10¹¹ across Tesla boards (C1060, M2070, M2090, K20m, K40m, K80, P100, single and dual boards with pThreads), NVS 315, Quadro 4000/K2000/K4000, GTX boards (560Ti, 680, 690, 750, 750Ti, 780, Titan, 780Ti, 960, 970, 980, 980Ti, 1050Ti, 1080, 1080Ti), GT boards (430, 620, 640, 710, 730), AMD boards (HD5850, HD6450, HD7970, R7 240, R9 380, R9 290, R9 295X2, Nano, Fury), Kaveri A10-7850K GPU, MIC Phi 7120P, AMD FX-9590 and Ryzen7 1800, Intel CPUs (Pentium M 755 and E6300, 2x X5550, 2x 2687v3, 2x 2680v4), and a 64-node 8-core cluster in MPI+PyCL NoHT/HT, MPI/C and MPI+OpenMP/C; notes: « 2 Boards on Host! », « 2 Chips on Board! »; grouped as Tesla for Pros, GTX for Gamers, AMD/ATI, CPU]
With 1 QPU (low « regime »), the CPU crushes the GPU!
● The GPU is 20 to 50x slower!
● 1.5 orders of magnitude...
[Figures: PiOpenCL, PiOCL Base and PiOpenCL Boost throughput, on a log scale from 10⁶ to 10⁹ and a linear scale up to 2.5×10⁸, for X5550, E5-2665, E5-2620v2, E5-2637v2, E5-2687v3, Xeon Phi, R9-295X, R9-290, GTX Titan, Quadro K4000, Tesla K40c, GTX 980Ti and GTX 1080]
From 1 to 1024 PR: a small comparison between *PUs. The MIC Phi emerges, the GPU lies in ambush
Deep exploration of Parallel Rates: PR from 1024 to 16384
● Xeon Phi with 60 cores: peaks at multiples of 60 (60x27, 60x41, 60x64)
● Nvidia GTX Titan & Tesla K40: all prime numbers > 1024
● AMD HD7970 with 2048 ALUs and R9-290X with 2816 ALUs: peaks at 2048x4 and 2816x4
Deep exploration of Parallel Rates: PR from 1024 to 65536
● AMD HD7970 & R9-290: long period, 4x the number of ALUs
● Nvidia GTX Titan & Tesla K40: short period, the number of SMX units
– Period of 14 (14 SMX) and period of 15 (15 SMX)
And the latest Nvidia GTX 1080: still affected by oddities?
● Single precision, double precision: variability without real structure
Behaviour of 20 (GP)GPUs versus PR
Is the Parallel Envelope reproducible? For 2 specific boards...
● Nvidia GT620 and AMD HD6450
Variability: a distinctive criterion! What a strange property :-/ ...
● Mode Blocks versus Mode Threads
A « memory-bound » test: the Splutter code on OpenCL devices
[Figures: GPU AMD/ATI HD 7970 & R9-290X; Intel IvyBridge & Haswell; GPU Nvidia GT, GTX, Tesla; Accelerator Xeon Phi]
"Back to the Physics"A Newtonian N-body code ...● Pass on a code "fine grain"
– Code N-body very simply integrated– Newton's second law, autogravitating system
● Differential integration methods:– Euler Implicite, Euler Explicit, Heun, Runge Kutta
● Settings :– Physical: number of particles– Numeric: no integration, number of steps– Then: influence FP32, FP64
Emmanuel QUÉMENER CC BY-NC-SA 41/48September 29, 2017
"Back to the Physics"A Newtonian N-body code ...● Pass on a code "fine grain"
– Code N-body very simply integrated– Newton's second law, autogravitating system
● Differential integration methods:– Euler Implicite, Euler Explicit, Heun, Runge Kutta
● Settings :– Physical: number of particles– Numeric: no integration, number of steps– Then: influence FP32, FP64
Emmanuel QUÉMENER CC BY-NC-SA 42/48September 29, 2017
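What such a kernel computes can be sketched in pure Python: explicit Euler on a self-gravitating pair, with G = 1 and unit masses assumed for illustration (the OpenCL version distributes the pairwise force loop over work-items):

```python
import math

def accelerations(pos):
    """Newtonian pairwise gravity in 2D (G = 1, unit masses assumed)."""
    n = len(pos)
    acc = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r = math.hypot(dx, dy)
            acc[i][0] += dx / r**3
            acc[i][1] += dy / r**3
    return acc

def euler_step(pos, vel, dt):
    """One explicit Euler step; Heun or Runge-Kutta would reuse accelerations()."""
    acc = accelerations(pos)
    for i in range(len(pos)):
        pos[i][0] += dt * vel[i][0]
        pos[i][1] += dt * vel[i][1]
        vel[i][0] += dt * acc[i][0]
        vel[i][1] += dt * acc[i][1]
    return pos, vel

# Two unit masses on a circular orbit around their barycentre:
# separation 2, so each feels a = 1/4, and v = 0.5 keeps the orbit circular.
pos = [[-1.0, 0.0], [1.0, 0.0]]
vel = [[0.0, -0.5], [0.0, 0.5]]
for _ in range(1000):
    euler_step(pos, vel, dt=0.001)
print(pos)  # the separation should stay close to 2
```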
Very different performances: Nvidia, AMD, Intel, single/double
[Figure: D²/T_FP32 and D²/T_FP64 (0 to 3×10¹⁰) for Tesla C1060, M2090, K20m, K40m, K80, P100; GTX 560Ti, 680, 780Ti, 980, 980Ti, 1080, 1080Ti; AMD HD5850, HD6450, HD7970, R9-290X, R9-295X #1, Nano; A10-7850K (GPU and CPU); Xeon Phi; FX-9590; Ryzen7 1800; and Intel CPUs from the Pentium M 755 to 2x E5-2680v4; grouped as AMD/ATI for Alt*, GTX for Gamers, Tesla for Pros, and AMD/Intel CPUs]
For high-powered *PUs: from « load » to saturation...
● Results expressed in the unit Size²/Elapsed
[Figures: single and double precision versus size from 32 to 32768 (10⁵ to 10¹¹) for GTX 1080, Nano, Xeon Phi, FX-9590 and E5-2687v3 x2]
But if the GPU is so powerful... what about its original goal?
● When using the GPU for computing, beware of Xorg!
● For AMD/ATI, it is troublesome for non-root users...
[Figures: Xorg Off/On in SP and DP versus size from 32 to 32768, on a log scale from 10⁵ to 10¹¹ and a linear scale up to 2.5×10¹⁰]
Performances normalized to TDP: Nvidia, AMD, Intel, single/double
[Figure: Ratio32W and Ratio64W (0 to 1.2×10⁸) over the same set of boards and CPUs, from the Tesla C1060 to 2x E5-2680v4]
Performances normalized to price: Nvidia, AMD, Intel, single/double
[Figure: Ratio32Pr and Ratio64Pr (0 to 4.5×10⁷) over the same set of boards and CPUs, from the Tesla C1060 to 2x E5-2680v4]
Performances normalized to Cores.MHz: Nvidia, AMD, Intel, single/double
[Figure: Ratio32 and Ratio64 (0 to 80000) over the same set of boards and CPUs, from the Tesla C1060 to 2x E5-2680v4]
But what's the use of all this? Characterize & avoid regrets...
● Nowadays, the « brute » power is inside GPUs
● But one QPU (PR=1) of a GPU is 50x slower than a CPU core
● To exploit a GPU, the parallel rate MUST be over 1000
● To program a GPU, prefer:
– The developer approach: Python with PyOpenCL (or PyCUDA)
– The integrator approach: optimized external libraries
● In all cases, instrument your launches!
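« Instrument your launches » can start very simply: wrap every launch in a timer and log the elapsed time together with the parameters (a generic sketch, not tied to any particular framework; the wrapped function is whatever launch you want to measure):

```python
import time

def instrumented(label, fn, *args, **kwargs):
    """Run fn, print its wall-clock elapsed time, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.6f} s")
    return result

# Usage: wrap any compute launch; here a plain CPU-side reduction as a stand-in.
total = instrumented("sum", sum, range(1_000_000))
```

Collecting these timings across parallel rates is exactly how the parallel envelopes shown earlier in the deck are produced.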