Astrosim 2017
Les Houches, the 7th ;-)
GPU: the disruptive technology of the 21st century
A little comparison between CPU, MIC and GPU
Emmanuel Quémener
Emmanuel QUÉMENER, CC BY-NC-SA, September 29, 2017
A (human) generation before... A B movie in 1984
● 1984: The Last Starfighter
– 27 minutes of synthetic images
– 22.5×10⁹ operations per image
– Rendered on a Cray X-MP (130 kW)
– 66 days of compute (in fact, 1 year was needed)
● 2016: on a GTX 1080Ti (250 W)
– 3.4 seconds
– Comparison GTX 1080Ti / Cray:
● Performance: ×1 600 000!
● Energy consumption: ~×900 000 000!
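The two ratios above can be checked with a few lines of arithmetic (the figures are the ones quoted on the slide: 66 days on a 130 kW Cray X-MP versus 3.4 s on a 250 W GTX 1080Ti):

```python
# Sanity check of the slide's ratios, using the slide's own figures.
cray_seconds = 66 * 86400            # 66 days of compute on the Cray X-MP
gpu_seconds = 3.4                    # the same frames on a GTX 1080Ti
speedup = cray_seconds / gpu_seconds # ~1.6 million

cray_energy = 130e3 * cray_seconds   # joules at 130 kW
gpu_energy = 250 * gpu_seconds       # joules at 250 W
energy_ratio = cray_energy / gpu_energy  # ~900 million

print(f"speedup ~ {speedup:.2e}, energy ratio ~ {energy_ratio:.2e}")
```

Both come out at the orders of magnitude quoted on the slide.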
« The beginning of all things »
● « Nvidia Launches Tesla Personal Supercomputer »
● When: 19/11/2008. Who: Nvidia
● Where: on Tom's Hardware
● What: a Tesla C1060 PCIe card with 240 cores
● How much: 933 GFlops SP (but only 78 GFlops DP)
What position for GPUs? The other « accelerators »
● Accelerators, or the old history of coprocessors...
– 1980: the 8087 (alongside the 8086/8088) for floating-point operations
– 1989: the 80387 (alongside the 80386) and compliance with IEEE 754
– 1990: the 80486DX and the integration of the FPU inside the CPU
– 1997: K6 with 3DNow! & Pentium MMX: SIMD inside the CPU
– 1999: the SSE instructions and the birth of a long series (SSE4 & AVX)
● When chips stand apart from the CPU:
– 1998: DSPs such as the TMS320C67x as tools
– 2008: the Cell inside the PS3 and inside IBM's Roadrunner, #1 of the Top500
● A business of compilers & very specific programming models
Why is a GPU so powerful? To construct 3D scenes!
● 2 approaches:
– Raytracing: PovRay
– Shading: 3 operations
● Raytracing:
– Start from the eye and trace towards the objects of the scene
● Shading:
– Model2World: vectorial objects placed in the scene
– World2View: projection of the objects onto a view plane
– View2Projection: pixelisation of the vectorial view plane
Why is a GPU so powerful? Shading & matrix computing
● Model2World: 3 matrix products
– Rotation
– Translation
– Scaling
● World2View: 2 matrix products
– Camera position
– Direction of the viewed point
● View2Projection:
– Pixelisation
A GPU: a « huge » Matrix Multiplier
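The Model2World step above can be sketched with plain 4×4 homogeneous matrices (pure Python for readability, all values illustrative; a GPU performs the same products, massively in parallel):

```python
import math

def matmul(A, B):
    """Naive 4x4 matrix product: the core operation a GPU accelerates."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def scaling(s):
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]

def rotation_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

# Model2World = Translation . Rotation . Scaling (one possible order)
M2W = matmul(translation(1, 2, 0), matmul(rotation_z(math.pi / 2), scaling(2)))

def apply(M, p):
    """Apply a homogeneous transform to a 3D point."""
    v = [p[0], p[1], p[2], 1]
    return [sum(M[i][k] * v[k] for k in range(4)) for i in range(3)]

print(apply(M2W, (1, 0, 0)))  # (1,0,0) scaled, rotated, then translated
```

World2View and View2Projection are further matrix products of the same kind, which is why a shader pipeline reduces to matrix multiplication.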
Is the GPU really so powerful? With my best MyriALUs...
[Figure: xGEMM throughput (0 to 10000) for Tesla P100, GTX 1080Ti, R9 Fury, Xeon Phi, Ryzen7 1800 and E5-2680v4 x2, with OpenBLAS, clBLAS, Thunking and CuBLAS, each in DP and SP; grouped as Tesla for Pros, GTX for Gamers, AMD for Alt*, CPU]
BLAS match, CPU vs GPU: xGEMM where x=D(P) or S(P)
[Figure: OpenBLAS, clBLAS, Thunking and CuBLAS in DP and SP (0 to 10000) across Tesla C1060, M2090, K20m, K80, P100; Quadro K4000; GTX 560 Ti, 690, 780, Titan, 780Ti, 980, 980Ti, 1080, 1080Ti; HD7970, R9-295x #1, R9 Fury, Nano; Xeon Phi; and CPUs FX9590, Ryzen7 1800, E6300, X5550 x2, E5-2665 x2, E5-2640v3 x2, E5-2687v3 x2, E5-2640v4 x2, E5-2680v4 x2; grouped as Tesla for Pros, GTX for Gamers, AMD for Alt*, CPU]
What Nvidia never advertises... xGEMM over a large domain...
[Figure: performance versus size for Tesla P100 and GTX 1080]
Yes, the performance exists, but only under specific conditions!
How a GPU must be considered: a laboratory of parallelism
● You've got tens of thousands of Equivalent PUs!
● You have the choice between: CPU, GPU, MIC
How many elements inside? Differences between GPU & CPU
● Operations:
– Matrix multiply
– Vectorization
– Pipelining
– Shader (multi)processors
● Programming: 1993
– OpenGL, Glide, Direct3D
● Genericity: 2002
– CgToolkit, CUDA, OpenCL
Why is the GPU so powerful? The comeback of generic processing
● Inside the GPU:
– Specialized processing units
– Pipelining efficiency
● Adaptation loss:
– Changing scenes
– Details of different natures
● The solution: more generalist processing units!
Why is the GPU so powerful? All the « M » of Flynn's taxonomy included
● Vectorial: SIMD (Single Instruction, Multiple Data)
– Addition of 2 positions (x,y,z): 1 single instruction
● Parallel: MIMD (Multiple Instructions, Multiple Data)
– Several executions in parallel on (almost) the same data
● In fact, SIMT: Single Instruction, Multiple Threads
– All processing units share the threads
– Each processing unit can work independently from the others
● The threads need to be synchronized
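The models above can be contrasted in a few lines of plain Python (a conceptual sketch only; real SIMD and SIMT happen in hardware, and the data values are illustrative):

```python
# SIMD: one "instruction" applied to multiple data at once,
# e.g. adding two (x, y, z) positions element-wise.
a = [(1, 2, 3), (4, 5, 6)]
b = [(10, 20, 30), (40, 50, 60)]
simd_add = [tuple(x + y for x, y in zip(p, q)) for p, q in zip(a, b)]

# SIMT: every thread runs the SAME kernel, but on its own index,
# and each thread may branch independently of the others.
data = [(1, 1), (5, 2), (7, 3), (9, 4)]

def kernel(gid):
    x, y = data[gid]                          # gid plays get_global_id(0)
    return x + y if gid % 2 == 0 else x - y   # divergent branch per thread

simt_result = [kernel(gid) for gid in range(len(data))]

print(simd_add)
print(simt_result)
```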
Important dates
● 1992-01: OpenGL and the birth of a standard
● 1998-03: OpenGL 1.2 and its interesting functions
● 2002-12: Cg Toolkit (Nvidia) and the language extensions
– Wrappers for many languages (Python)
● 2007-06: CUDA (Nvidia), or the arrival of a real language
● 2008-06: Snow Leopard (Apple) integrates OpenCL
– The will to make the best use of one's machine?
● 2008-11: OpenCL 1.0 and its first specifications
● 2011-04: WebCL and its first version by Nokia
Who uses OpenCL? As applications...
● OS: Apple in MacOSX
● « Big » applications:
– LibreOffice
– FFmpeg
● Well-known applications:
– Graphical ones: photoscan, 
– Scientific ones: OpenMM, Gromacs, Lammps, Amber, …
● Hundreds of software packages and libraries! But checking is essential...
– https://www.khronos.org/opencl/resources/opencl-applications-using-opencl
Why the GPU is a disruptive technology: how it reshuffles the cards in Scientific Computing
● In a conference in February 2006, this… 
– ×100 in 5 years
● Between 2000 and 2015:
– GeForce2 Go to GTX 980Ti: from 286 to 2 816 000 MOperations/s: ×10000
– Classical CPUs: ×100
Just a reminder from last Friday!
● OpenCL is the most versatile one
● CUDA can ONLY be used with Nvidia GPUs
● Accelerators seem to be the most universal, but...
Parallel programming models:

          Cluster  Node CPU  Node GPU  Node Nvidia  Accelerator
MPI       Yes      Yes       No        No           Yes*
PVM       Yes      Yes       No        No           Yes*
OpenMP    No       Yes       No        No           Yes*
Pthreads  No       Yes       No        No           Yes*
OpenCL    No       Yes       Yes       Yes          Yes
CUDA      No       No        No        Yes          No
TBB       No       Yes       No        No           Yes*
Act as an integrator: do not reinvent the wheel!
● Classic libraries can be used on the GPU:
– Nvidia provides lots of implementations
– OpenCL provides several of them, but not as complete
Parallel programming libraries:

        Cluster        Node CPU      Node GPU  Node Nvidia  Accelerator
BLAS    BLACS/MKL      OpenBLAS/MKL  clBLAS    CuBLAS       OpenBLAS/MKL
LAPACK  Scalapack/MKL  Atlas/MKL     clMAGMA   MAGMA        MagmaMIC
FFT     FFTw3          FFTw3         clFFT     CuFFT        FFTw3
Why OpenCL? A non politically correct history
● In the mid-2000s, Apple needed power for MacOSX:
– Some processing could be done by Nvidia chipsets
– CUDA existed as a good successor to CgToolkit
● But Apple did not want to be jailed
– (as their users are :-) )
● Apple promoted the creation of the Khronos consortium:
– AMD, IBM and ARM came quickly
– … and Nvidia, Intel, Altera, … came along reluctantly...
What OpenCL offers: 10 implementations on x86
● 3 implementations for CPU:
– The AMD one: the original
– The Intel one: efficient for specific parallel rates
– PortableCL (POCL): Open Source
● 4 implementations for GPU:
– The AMD one: for AMD/ATI graphics boards
– The Nvidia one: for Nvidia graphics boards
– « Beignet »: an Open Source one for Intel graphics chipsets
– « Mesa »: an Open Source one for all Open Source drivers in the Linux kernel
● 1 implementation for Accelerator: the Intel one for the Xeon Phi
● 1 implementation for FPGA: the Intel one for Altera FPGAs
● On other platforms, it seems to be possible
Goal: replace the loop by a distribution. 2 types of distributions:
● Blocks/WorkItems
– « Global » domain: a large but slow amount of memory available
● Threads
– « Local » domain: a small but quick amount of memory available
– Needed for the synchronisation of processes
● Different accesses to memory
● « clinfo » gets the properties of CL devices: Max work items
– Max work items: 1024x1024x1024 for a CPU, so about 1.1 billion distributed items
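The « replace the loop by a distribution » idea can be sketched without any GPU: express the loop body as a kernel of the index, then map it over the whole index space (a toy stand-in for OpenCL's get_global_id, not actual OpenCL code):

```python
SIZE = 8
a = list(range(SIZE))
b = list(range(SIZE, 2 * SIZE))

# Sequential version: one loop walks the whole index space.
c_loop = [0] * SIZE
for n in range(SIZE):
    c_loop[n] = a[n] + b[n]

# Distributed version: the loop disappears; each work-item gets one index,
# exactly as an OpenCL kernel gets it from get_global_id(0).
def kernel(n):
    return a[n] + b[n]

c_dist = list(map(kernel, range(SIZE)))  # conceptually, SIZE parallel work-items
assert c_dist == c_loop
```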
How to do it? A little example: a « Hello World » in OpenCL
● Define two vectors in ASCII
● Duplicate them into 2 huge vectors
● Add them using an OpenCL kernel
● Print the result on screen

Addition of 2 vectors, a+b=c: for each n, c[n] = a[n] + b[n]

Output: « Hello OpenCL World! » repeated over the whole vector, followed by « The End ».
Kernel Code & Build and Call: adding vectors in OpenCL

__kernel void VectorAdd(__global int* c, __global int* a, __global int* b)
{
  // Index of the elements to add
  unsigned int n = get_global_id(0);
  // Sum the n-th element of vectors a and b and store it in c
  c[n] = a[n] + b[n];
}

OpenCLProgram = cl.Program(ctx, OpenCLSource).build()
OpenCLProgram.VectorAdd(queue, HostVector1.shape, None, GPUOutputVector, GPUVector1, GPUVector2)
How to OpenCL? Write « Hello World! » in C
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
const char* OpenCLSource[] = {
"__kernel void VectorAdd(__global int* c, __global int* a,__global int* b)",
"{",
" // Index of the elements to add \n",
" unsigned int n = get_global_id(0);",
" // Sum the n’th element of vectors a and b and store in c \n",
" c[n] = a[n] + b[n];",
"}"
};
int InitialData1[20] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[20] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};
#define SIZE 2048
int main(int argc, char **argv)
{
  int HostVector1[SIZE], HostVector2[SIZE];
  for(int c = 0; c < SIZE; c++)
  {
    HostVector1[c] = InitialData1[c%20];
    HostVector2[c] = InitialData2[c%20];
  }
  cl_platform_id cpPlatform;
  clGetPlatformIDs(1, &cpPlatform, NULL);
  cl_int ciErr1;
  cl_device_id cdDevice;
  ciErr1 = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &cdDevice, NULL);
  cl_context GPUContext = clCreateContext(0, 1, &cdDevice, NULL, NULL, &ciErr1);
  cl_command_queue cqCommandQueue = clCreateCommandQueue(GPUContext, cdDevice, 0, NULL);
  cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector1, NULL);
  cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector2, NULL);
  cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY, sizeof(int) * SIZE, NULL, NULL);
  cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource, NULL, NULL);
  clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
  cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);
  clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUOutputVector);
  clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector1);
  clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUVector2);
  size_t WorkSize[1] = {SIZE}; // one-dimensional range
  clEnqueueNDRangeKernel(cqCommandQueue, OpenCLVectorAdd, 1, NULL, WorkSize, NULL, 0, NULL, NULL);
  int HostOutputVector[SIZE];
  clEnqueueReadBuffer(cqCommandQueue, GPUOutputVector, CL_TRUE, 0, SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);
  clReleaseKernel(OpenCLVectorAdd);
  clReleaseProgram(OpenCLProgram);
  clReleaseCommandQueue(cqCommandQueue);
  clReleaseContext(GPUContext);
  clReleaseMemObject(GPUVector1);
  clReleaseMemObject(GPUVector2);
  clReleaseMemObject(GPUOutputVector);
  for (int Rows = 0; Rows < (SIZE/20); Rows++) {
    printf("\t");
    for(int c = 0; c < 20; c++) {
      printf("%c", (char)HostOutputVector[Rows * 20 + c]);
    }
  }
  printf("\n\nThe End\n\n");
  return 0;
}
How to OpenCL? Write « Hello World! » in Python
import pyopencl as cl
import numpy
import numpy.linalg as la
import sys
OpenCLSource = """
__kernel void VectorAdd(__global int* c, __global int* a,__global int* b)
{
// Index of the elements to add
unsigned int n = get_global_id(0);
// Sum the n th element of vectors a and b and store in c
c[n] = a[n] + b[n];
}
"""
InitialData1=[37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17]
InitialData2=[35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15]
SIZE=2048
HostVector1=numpy.zeros(SIZE).astype(numpy.int32)
HostVector2=numpy.zeros(SIZE).astype(numpy.int32)
for c in range(SIZE):
    HostVector1[c] = InitialData1[c%20]
    HostVector2[c] = InitialData2[c%20]
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
GPUVector1 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=HostVector1)
GPUVector2 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=HostVector2)
GPUOutputVector = cl.Buffer(ctx, mf.WRITE_ONLY, HostVector1.nbytes)
OpenCLProgram = cl.Program(ctx, OpenCLSource).build()
OpenCLProgram.VectorAdd(queue, HostVector1.shape, None, GPUOutputVector, GPUVector1, GPUVector2)
HostOutputVector = numpy.empty_like(HostVector1)
cl.enqueue_copy(queue, HostOutputVector, GPUOutputVector)
GPUVector1.release()
GPUVector2.release()
GPUOutputVector.release()
OutputString=''
for rows in range(SIZE//20):
    OutputString+='\t'
    for c in range(20):
        OutputString+=chr(HostOutputVector[rows*20+c])
print(OutputString)
sys.stdout.write("\nThe End\n\n")
How to OpenCL? « Hello World » C vs Python: weighing them
● On the previous OpenCL implementations:
– In C: 75 lines, 262 words, 2848 bytes
– In Python: 51 lines, 137 words, 1551 bytes
– Factors: 0.68, 0.52, 0.54 in lines, words and bytes
● Programming OpenCL:
– A difficult programming context
● « Opening the box is more difficult than assembling the furniture! »
● A requirement: simplify the calls through an API
– No compatibility between the AMD and Nvidia SDKs
● Everyone rewrites their own API
● One solution, or rather « The » solution: Python
Which simple code to implement in parallel? PiMC: Pi by the dartboard method
● The historical example of the Monte Carlo method
● Parallel implementation:
– Equal distribution of the iterations
● From 2 to 4 parameters:
– Number of iterations
– Parallel Rate (PR)
– (Type of variable: INT32, INT64, FP32, FP64)
– (RNG: MWC, CONG, SHR3, KISS)
● 2 simple observables:
– Estimation of Pi (just indicative, Pi not being rational :-))
– Elapsed time (individual or global)
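The dartboard method fits in a few lines of plain Python (a sequential sketch using the stdlib RNG rather than the MWC/CONG/SHR3/KISS generators of the deck; `pi_parallel` only mimics how the iterations are split across a Parallel Rate of workers):

```python
import random

def pi_dartboard(iterations, seed=0):
    """Throw darts at the unit square; the fraction inside the quarter disc ~ pi/4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(iterations):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / iterations

def pi_parallel(iterations, pr):
    """Split the same iteration count over a Parallel Rate of pr workers."""
    return sum(pi_dartboard(iterations // pr, seed=w) for w in range(pr)) / pr

print(pi_dartboard(100000))  # rough estimate of pi
```

As the slide says, the estimate is only indicative: the statistical error shrinks as 1/sqrt(iterations), so the observable of interest is the elapsed time.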
Differences between CUDA/OpenCL
// CUDA version:
__device__ ulong MainLoop(ulong iterations,uint seed_z,uint seed_w,size_t work)
{
uint jcong=seed_z+work;
ulong total=0;
for (ulong i=0;i<iterations;i++) {
float x=CONGfp ;
float y=CONGfp ;
ulong inside=((x*x+y*y) <= THEONE) ? 1:0;
total+=inside;
}
return(total);
}
__global__ void MainLoopBlocks(ulong *s,ulong iterations,uint seed_w,uint seed_z)
{
ulong total=MainLoop(iterations,seed_z,seed_w,blockIdx.x);
s[blockIdx.x]=total;
__syncthreads();
}
// OpenCL version:
ulong MainLoop(ulong iterations,uint seed_z,uint seed_w,size_t work)
{
uint jcong=seed_z+work;
ulong total=0;
for (ulong i=0;i<iterations;i++) {
float x=CONGfp ;
float y=CONGfp ;
ulong inside=((x*x+y*y) <= THEONE) ? 1:0;
total+=inside;
}
return(total);
}
__kernel void MainLoopGlobal(__global ulong *s,ulong iterations,uint seed_w,uint seed_z)
{
ulong total=MainLoop(iterations,seed_z,seed_w,get_global_id(0));
barrier(CLK_GLOBAL_MEM_FENCE);
s[get_global_id(0)]=total;
}
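For readers without a GPU at hand, the same Monte Carlo loop can be sketched in plain Python. The congruential step used here, jcong = 69069*jcong + 1234567 mod 2^32, is an assumption about what the CONGfp macro expands to (Marsaglia's CONG generator); the seed value is arbitrary:

```python
def main_loop(iterations, seed_z, work):
    """Pure-Python analogue of the MainLoop kernel for one work-item."""
    jcong = (seed_z + work) & 0xFFFFFFFF
    total = 0
    for _ in range(iterations):
        # Assumed CONG step (Marsaglia): jcong = 69069*jcong + 1234567 mod 2^32
        jcong = (69069 * jcong + 1234567) & 0xFFFFFFFF
        x = jcong * 2.328306437e-10          # map uint32 to [0, 1)
        jcong = (69069 * jcong + 1234567) & 0xFFFFFFFF
        y = jcong * 2.328306437e-10
        total += 1 if x * x + y * y <= 1.0 else 0
    return total

# Sum the partial counts of several "work-items", as the kernels do into s[]
workers, iterations = 4, 25000
inside = sum(main_loop(iterations, seed_z=110271, work=w) for w in range(workers))
pi_est = 4.0 * inside / (workers * iterations)
print(pi_est)
```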
You think Python is not efficient ?Let’s have a quick comparison...
[Figure: Pi dartboard throughput in itops, from 0 to 1.4×10¹¹, for MPI/C, MPI+OpenMP/C, MPI+PyOpenCL NoHT and MPI+PyOpenCL HT]
Pi Dart Board match (yesterday): Python vs C & CPU vs GPU vs Phi
[Figure: itops from 0 to 2.5×10¹¹ across Tesla boards (C1060, M2070, M2090, K20m, K40m, K80, P100, single and dual boards with pThreads), NVS 315, Quadro 4000/K2000/K4000, GTX boards (560Ti, 680, 690, 750, 750Ti, 780, Titan, 780Ti, 960, 970, 980, 980Ti, 1050Ti, 1080, 1080Ti), GT boards (430, 620, 640, 710, 730), AMD boards (HD5850, HD6450, HD7970, R7 240, R9 380, R9 290, R9 295X2, Nano, Fury), Kaveri A10-7850K GPU, MIC Phi 7120P, AMD FX-9590 and Ryzen7 1800, Intel CPUs (Pentium M 755 and E6300, 2x X5550, 2x 2687v3, 2x 2680v4), and a 64-node 8-core cluster in MPI+PyCL NoHT/HT, MPI/C and MPI+OpenMP/C; notes: « 2 Boards on Host! », « 2 Chips on Board! »; grouped as Tesla for Pros, GTX for Gamers, AMD/ATI, CPU]
With 1 QPU (low « regime »), the CPU crushes the GPU!
● The GPU is 20 to 50x slower!
● 1.5 orders of magnitude...
[Figures: PiOpenCL, PiOCL Base and PiOpenCL Boost throughput, on a log scale from 10⁶ to 10⁹ and a linear scale up to 2.5×10⁸, for X5550, E5-2665, E5-2620v2, E5-2637v2, E5-2687v3, Xeon Phi, R9-295X, R9-290, GTX Titan, Quadro K4000, Tesla K40c, GTX 980Ti and GTX 1080]
From 1 to 1024 PR: a small comparison between *PUs. The MIC Phi emerges, the GPU lies in ambush
Deep exploration of Parallel Rates: PR from 1024 to 16384
● Xeon Phi with 60 cores: peaks at multiples of 60 (60x27, 60x41, 60x64)
● Nvidia GTX Titan & Tesla K40: all prime numbers > 1024
● AMD HD7970 with 2048 ALUs and R9-290X with 2816 ALUs: peaks at 2048x4 and 2816x4
Deep exploration of Parallel Rates: PR from 1024 to 65536
● AMD HD7970 & R9-290: long period, 4x the number of ALUs
● Nvidia GTX Titan & Tesla K40: short period, the number of SMX units
– Period of 14 (14 SMX) and period of 15 (15 SMX)
And the latest Nvidia GTX 1080: still affected by oddities?
● Single precision, double precision: variability without real structure
Behaviour of 20 (GP)GPUs versus PR
Is the Parallel Envelope reproducible? For 2 specific boards...
● Nvidia GT620 and AMD HD6450
Variability: a distinctive criterion! What a strange property :-/ ...
● Mode Blocks versus Mode Threads
A « memory-bound » test: the Splutter code on OpenCL devices
[Figures: GPU AMD/ATI HD 7970 & R9-290X; Intel IvyBridge & Haswell; GPU Nvidia GT, GTX, Tesla; Accelerator Xeon Phi]
"Back to the Physics"A Newtonian N-body code ...● Pass on a code "fine grain"
– Code N-body very simply integrated– Newton's second law, autogravitating system
● Differential integration methods:– Euler Implicite, Euler Explicit, Heun, Runge Kutta
● Settings :– Physical: number of particles– Numeric: no integration, number of steps– Then: influence FP32, FP64
Emmanuel QUÉMENER CC BY-NC-SA 41/48September 29, 2017
"Back to the Physics"A Newtonian N-body code ...● Pass on a code "fine grain"
– Code N-body very simply integrated– Newton's second law, autogravitating system
● Differential integration methods:– Euler Implicite, Euler Explicit, Heun, Runge Kutta
● Settings :– Physical: number of particles– Numeric: no integration, number of steps– Then: influence FP32, FP64
Emmanuel QUÉMENER CC BY-NC-SA 42/48September 29, 2017
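What such a kernel computes can be sketched in pure Python: explicit Euler on a self-gravitating pair, with G = 1 and unit masses assumed for illustration (the OpenCL version distributes the pairwise force loop over work-items):

```python
import math

def accelerations(pos):
    """Newtonian pairwise gravity in 2D (G = 1, unit masses assumed)."""
    n = len(pos)
    acc = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r = math.hypot(dx, dy)
            acc[i][0] += dx / r**3
            acc[i][1] += dy / r**3
    return acc

def euler_step(pos, vel, dt):
    """One explicit Euler step; Heun or Runge-Kutta would reuse accelerations()."""
    acc = accelerations(pos)
    for i in range(len(pos)):
        pos[i][0] += dt * vel[i][0]
        pos[i][1] += dt * vel[i][1]
        vel[i][0] += dt * acc[i][0]
        vel[i][1] += dt * acc[i][1]
    return pos, vel

# Two unit masses on a circular orbit around their barycentre:
# separation 2, so each feels a = 1/4, and v = 0.5 keeps the orbit circular.
pos = [[-1.0, 0.0], [1.0, 0.0]]
vel = [[0.0, -0.5], [0.0, 0.5]]
for _ in range(1000):
    euler_step(pos, vel, dt=0.001)
print(pos)  # the separation should stay close to 2
```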
Very different performances: Nvidia, AMD, Intel, single/double
[Figure: D²/T_FP32 and D²/T_FP64 (0 to 3×10¹⁰) for Tesla C1060, M2090, K20m, K40m, K80, P100; GTX 560Ti, 680, 780Ti, 980, 980Ti, 1080, 1080Ti; AMD HD5850, HD6450, HD7970, R9-290X, R9-295X #1, Nano; A10-7850K (GPU and CPU); Xeon Phi; FX-9590; Ryzen7 1800; and Intel CPUs from the Pentium M 755 to 2x E5-2680v4; grouped as AMD/ATI for Alt*, GTX for Gamers, Tesla for Pros, and AMD/Intel CPUs]
For high-powered *PUs: from « load » to saturation...
● Results expressed in the unit Size²/Elapsed
[Figures: single and double precision versus size from 32 to 32768 (10⁵ to 10¹¹) for GTX 1080, Nano, Xeon Phi, FX-9590 and E5-2687v3 x2]
But if the GPU is so powerful... what about its original goal?
● When using the GPU for computing, beware of Xorg!
● For AMD/ATI, it is troublesome for non-root users...
[Figures: Xorg Off/On in SP and DP versus size from 32 to 32768, on a log scale from 10⁵ to 10¹¹ and a linear scale up to 2.5×10¹⁰]
Performances normalized to TDP: Nvidia, AMD, Intel, single/double
[Figure: Ratio32W and Ratio64W (0 to 1.2×10⁸) over the same set of boards and CPUs, from the Tesla C1060 to 2x E5-2680v4]
Performances normalized to price: Nvidia, AMD, Intel, single/double
[Figure: Ratio32Pr and Ratio64Pr (0 to 4.5×10⁷) over the same set of boards and CPUs, from the Tesla C1060 to 2x E5-2680v4]
Performances normalized to Cores.MHz: Nvidia, AMD, Intel, single/double
[Figure: Ratio32 and Ratio64 (0 to 80000) over the same set of boards and CPUs, from the Tesla C1060 to 2x E5-2680v4]
But what's the use of all this? Characterize & avoid regrets...
● Nowadays, the « brute » power is inside GPUs
● But one QPU (PR=1) of a GPU is 50x slower than a CPU core
● To exploit a GPU, the parallel rate MUST be over 1000
● To program a GPU, prefer:
– The developer approach: Python with PyOpenCL (or PyCUDA)
– The integrator approach: optimized external libraries
● In all cases, instrument your launches!
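« Instrument your launches » can start very simply: wrap every launch in a timer and log the elapsed time together with the parameters (a generic sketch, not tied to any particular framework; the wrapped function is whatever launch you want to measure):

```python
import time

def instrumented(label, fn, *args, **kwargs):
    """Run fn, print its wall-clock elapsed time, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.6f} s")
    return result

# Usage: wrap any compute launch; here a plain CPU-side reduction as a stand-in.
total = instrumented("sum", sum, range(1_000_000))
```

Collecting these timings across parallel rates is exactly how the parallel envelopes shown earlier in the deck are produced.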