
Clash of the Titans .. a personal view

Xeon Phi vs. K20

Evghenii Gaburov
June 20, 2013

> 1 TFLOP/s on a desktop

K20X vs. Xeon Phi

K20X (image: GK110 whitepaper)

15 SMX @ 0.73 GHz, 1.4 TFLOP/s fp64, 240 GB/s
per SMX: 192 fp32 cores, 64 fp64 cores, 32 SFUs, 32 LD/ST units
64KB L1$ + shared memory per SMX, 1.5MB L2$
32-wide SIMD (warp), 255 registers/thread
up to 2048 threads per SMX, in-order execution, hardware thread scheduling

Xeon Phi (KNC) (image: Intel Xeon Phi programming overview)

61 Pentium-class cores @ 1.1 GHz, 1.1 TFLOP/s fp64, 352 GB/s
per core: 512-bit SIMD registers, 16-wide fp32 / 8-wide fp64 SIMD, 32 SIMD registers/thread
32KB L1$ and 512KB L2$ per core, 30.5MB L2$ total
4 threads per core, in-order execution, software thread scheduling
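As a sanity check (my addition, not part of the original deck), the quoted peak fp64 numbers follow from units x SIMD width x clock, assuming each fp64 lane retires one FMA (2 flops) per cycle; this lands at ~1.40 and ~1.07 TFLOP/s, matching the deck's 1.4 and 1.1 figures:

#include <stdio.h>

int main(void)
{
    /* K20X: 15 SMX x 64 fp64 units, 0.73 GHz, FMA = 2 flops/cycle */
    double k20x    = 15.0 * 64.0 * 0.73e9 * 2.0;
    /* Xeon Phi (KNC): 61 cores x 8 fp64 SIMD lanes, 1.1 GHz, FMA = 2 flops/cycle */
    double xeonphi = 61.0 * 8.0  * 1.1e9  * 2.0;

    printf("K20X peak fp64:     %.2f TFLOP/s\n", k20x    * 1e-12);  /* ~1.40 */
    printf("Xeon Phi peak fp64: %.2f TFLOP/s\n", xeonphi * 1e-12);  /* ~1.07 */
    return 0;
}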

effective # compute units:

K20: 15 SMX x 64 fp64 CUDA cores = 960
Xeon Phi: 61 cores x 2 threads x 8-wide fp64 SIMD = 976
Xeon E5: 8 cores x 1 thread x 4-wide fp64 SIMD = 32

Xeon Phi is much more parallel than Xeon E5!

above all: the *algorithm* MUST scale!

[figure: performance vs. number of threads; image: Intel Xeon Phi programming overview]

~3x for the same number of threads
~3x to get the same performance
if app doesn't scale ... or worse

.. don't forget Amdahl's law (P = parallel fraction, N = # compute units):

P=0.99, N=1:    S = 1    (efficiency 100%)
P=0.99, N=32:   S = 24   (efficiency 75%)
P=0.99, N=960:  S = 91   (efficiency 9.4%)
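For reference, a minimal C sketch (my addition, not from the deck) that reproduces these numbers, up to rounding, from Amdahl's law S(N) = 1 / ((1 - P) + P/N), with efficiency S(N)/N:

#include <stdio.h>

/* Amdahl's law: speedup on N compute units when a fraction P of the work is parallel. */
static double amdahl_speedup(double P, int N)
{
    return 1.0 / ((1.0 - P) + P / N);
}

int main(void)
{
    const double P   = 0.99;
    const int    N[] = { 1, 32, 960 };

    for (int k = 0; k < 3; k++) {
        double S = amdahl_speedup(P, N[k]);
        /* efficiency = speedup per compute unit */
        printf("P=%.2f, N=%4d: S=%5.1f, efficiency=%5.1f%%\n",
               P, N[k], S, 100.0 * S / N[k]);
    }
    return 0;
}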

Xeon Phi vs. K20:

Xeon Phi: immature compiler; Intel only, not cheap ($$$); native/offload
          programming models: MPI, OpenMP, POSIX threads, Cilk++, OpenCL, etc.
          MPI on the device: MPI, MPI+OpenMP, MPI+OpenCL, MPI+...
          software thread scheduling: thread affinity is important

K20:      mature compiler; many vendors (CUDA, LLVM); offload only
          programming models: CUDA C/Fortran, OpenCL, OpenACC, R, Python, Matlab, ...
          MPI on the device: not possible
          hardware thread scheduling: no worries about threads

a doubly nested loop over an M x N domain:

#pragma omp parallel for
for (int j = 0; j < M; j++) {
    /* .. some code .. */
    for (int i = 0; i < N; i++) {
        /* some code */
    }
}

say M=64, N=1024 ..

Xeon E5:  OMP_NUM_THREADS = 8
Xeon Phi: OMP_NUM_THREADS = 240
K20X: use CUDA, it works!

parallelizing only the outer loop gives at most M=64 parallel chunks: enough for 8 Xeon E5 threads, but not for 240 Xeon Phi threads.
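A small illustrative program (my addition, not from the deck) makes this visible: it counts how many OpenMP threads actually receive outer-loop iterations; with OMP_NUM_THREADS=240 on Xeon Phi the answer can be at most 64.

#include <stdio.h>
#include <omp.h>

#define M 64
#define N 1024
#define MAX_THREADS 1024

int main(void)
{
    int busy[MAX_THREADS] = { 0 };       /* which threads actually got outer iterations */

    #pragma omp parallel for
    for (int j = 0; j < M; j++) {
        busy[omp_get_thread_num()] = 1;  /* mark this thread as used */
        for (int i = 0; i < N; i++) {
            /* some code */
        }
    }

    int used = 0;
    for (int t = 0; t < MAX_THREADS; t++)
        used += busy[t];
    printf("threads that received work: %d of %d\n", used, omp_get_max_threads());
    return 0;
}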

M=64, N=1024: max # parallel units = 64 x 1024 = 64K, much larger than # FPUs

minimize surface-to-volume ratio

block the M x N domain into an nbx x nby grid of thread blocks (CUDA):

{ /* thread-block code */
    bid = blockIdx.x;
    nb  = gridDim.x;
    bx  = bid % nbx;
    by  = bid / nbx;
    nby = nb / nbx;
    /* compute ib & ie for bx */
    /* compute jb & je for by */
    for (int j = jb; j < je; j++) {
        /* .. some thread code .. */
        for (int i = ib; i < ie; i += blockDim.x) {
            /* some thread code */
        }
    }
}

minimize surface-to-volume ratio

the same decomposition with OpenMP, one "thread block" per OpenMP thread:

#pragma omp parallel
{ /* thread-block code */
    bid = omp_get_thread_num();
    nb  = omp_get_num_threads();
    bx  = bid % nbx;
    by  = bid / nbx;
    nby = nb / nbx;
    /* compute ib & ie for bx */
    /* compute jb & je for by */
    for (int j = jb; j < je; j++) {
        /* .. some thread code .. */
        for (int i = ib; i < ie; i++) {
            /* some thread code */
        }
    }
}

minimize surface-to-volume ratio
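A minimal compilable version of this OpenMP decomposition (my sketch, not from the deck): the choice of nbx, the even-split range computation, and the array update are illustrative assumptions; the deck leaves them as "compute ib & ie".

#include <stdio.h>
#include <omp.h>

#define M 64
#define N 1024

static float a[M][N];

int main(void)
{
    const int nbx = 8;   /* assumed: # blocks along i; pick so that # threads % nbx == 0 */

    #pragma omp parallel
    {   /* "thread-block" code: one block per OpenMP thread */
        const int bid = omp_get_thread_num();
        const int nb  = omp_get_num_threads();
        const int nby = nb / nbx;               /* # blocks along j */
        const int bx  = bid % nbx;
        const int by  = bid / nbx;

        if (nby > 0 && by < nby) {              /* leftover threads (if any) stay idle */
            /* even split: block (bx,by) owns columns [ib,ie) and rows [jb,je) */
            const int ib = (N *  bx     ) / nbx;
            const int ie = (N * (bx + 1)) / nbx;
            const int jb = (M *  by     ) / nby;
            const int je = (M * (by + 1)) / nby;

            for (int j = jb; j < je; j++)
                for (int i = ib; i < ie; i++)
                    a[j][i] = (float)(i + j);   /* stands in for "some thread code" */
        }
    }

    printf("a[%d][%d] = %g\n", M - 1, N - 1, (double)a[M - 1][N - 1]);
    return 0;
}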

CUDA programming model maps well to Xeon Phi:

omp_get_thread_num()   <->  blockIdx
omp_get_num_threads()  <->  gridDim
omp_get_ .. what? ..   <->  threadIdx, blockDim

#pragma omp simd ... not that simple

This is where I find the biggest limitation ... the TOOLS!

they don't exist, but would be very important:

get_simd_lane_index()  <->  threadIdx
get_simd_size()        <->  blockDim

auto-vectorization usually works well for simple cases:

#pragma omp parallel
{ /* "thread-block" */
    for (int j = jb; j < je; j++) {
        /* .. some code .. */
        for (int i = ib; i < ie; i++) {
            /* simple code */
        }
    }
}

for complex inner-loop code, use #pragma omp simd:

#pragma omp parallel
{ /* "thread-block" */
    for (int j = jb; j < je; j++) {
        /* .. some code .. */
#pragma omp simd
        for (int i = ib; i < ie; i++) {
            /* complex code */
        }
    }
}

.. but we're still at the mercy of the compiler... please, compiler, have mercy!

do not fight the compiler

manual vectorization, as in CUDA, via the (nonexistent) get_simd_size() / get_simd_lane_index():

#pragma omp parallel
{ /* "thread-block" */
    for (int j = jb; j < je; j++) {
        /* .. some code .. */
#pragma simd
        {
            simdsize = get_simd_size();
            simdlane = get_simd_lane_index();
            for (int i = ib; i < ie; i += simdsize) {
                /* complex code executed by each lane */
            }
        }
    }
}
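One way to approximate this today (my workaround, not from the deck) is to strip-mine the inner loop by an assumed SIMD width and let #pragma omp simd vectorize an explicit lane loop, so the lane index plays the role of threadIdx; SIMDW and tile_kernel are illustrative names.

#define SIMDW 8   /* assumed fp64 SIMD width (KNC: 512-bit registers / 64-bit doubles) */

/* hypothetical per-block kernel over the index range [ib, ie) */
void tile_kernel(double *x, int ib, int ie)
{
    for (int i = ib; i < ie; i += SIMDW) {
        #pragma omp simd
        for (int lane = 0; lane < SIMDW; lane++) {   /* lane ~ get_simd_lane_index() */
            int idx = i + lane;
            if (idx < ie)                            /* guard the ragged last chunk */
                x[idx] = x[idx] * 2.0 + 1.0;         /* stands in for "complex code" */
        }
    }
}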

http://ispc.github.com (ispc, the Intel SPMD Program Compiler)

"The reality is that there is no such thing as a 'magic' compiler that will automatically parallelize your code."

MySerialCode.cpp -> ./parallel_a.out

Both K20 & Xeon Phi can deliver excellent performance, if the algorithm scales and is vectorized .. either auto- or manually.

It is not easy to get excellent performance: it is easier to get performance for bandwidth-bound applications on Xeon Phi than on K20, but you have to worry about thread scheduling & vectorization on Xeon Phi.

Xeon Phi can run any code natively (big plus): this can solve the PCIe bottleneck for legacy apps without extensive modification, and it minimizes start-up effort .. no memory management, code rewrites, etc.

The CUDA programming model maps well to Xeon Phi; however, the tools to take advantage of this are lacking... (the Intel SPMD compiler may help)

http://arxiv.org/abs/1302.1078

http://icl.cs.utk.edu/magma/

http://clbenchmark.com