Clash of the Titans: Xeon Phi vs K20
.. a personal view ..
Evghenii Gaburov
> 1 TFLOP/s
on a desktop
K20X vs Xeon Phi
K20X (image: GK110 whitepaper)
- 15 SMX @ 0.73 GHz; per SMX: 192 fp32 cores, 64 fp64 cores, 32 SFUs, 32 LD/ST units
- 64 KB L1$ + shared memory per SMX, 1.5 MB L2$
- 240 GB/s memory bandwidth
- 32 SIMD width (warp), up to 2048 threads per SMX, in-order execution
- up to 255 registers/thread, hardware thread scheduling
- 1.4 TFLOP/s fp64
Xeon Phi (KNC) (image: Intel Xeon Phi programming overview)
- 61 Pentium-derived cores @ 1.1 GHz, in-order execution, 4 threads/core
- 512-bit SIMD registers: 16 fp32 / 8 fp64 lanes, 32 SIMD registers/thread
- 32 KB L1$ and 512 KB L2$ per core (30.5 MB L2$ total)
- 352 GB/s memory bandwidth
- software thread scheduling
- 1.1 TFLOP/s fp64
effective # compute units:
K20: 15 SMX x 64 fp64 CUDA cores = 960
Xeon Phi: 61 cores x 2 threads x 8 doubles = 976
Xeon E5: 8 cores x 1 thread x 4 doubles = 32
Xeon Phi is much more parallel than Xeon E5!
above all: the *algorithm* MUST scale!
[image: Intel Xeon Phi programming overview, Xeon Phi vs Xeon E5 scaling plot]
~3x: the gap for the same number of threads
~3x: the extra threads needed to get the same performance
if the app doesn't scale, the gap stays ~3x ... or worse
.. don't forget Amdahl's law
P = 0.99, N = 1:   S_1 = 1    (ε = 100%)
P = 0.99, N = 32:  S_32 = 24  (ε = 75%)
P = 0.99, N = 960: S_960 = 91 (ε = 9.4%)
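These figures follow directly from Amdahl's formula, with parallel fraction P, N compute units, and parallel efficiency ε = S/N:

\[
  S(N) = \frac{1}{(1-P) + P/N}, \qquad \varepsilon(N) = \frac{S(N)}{N}
\]
\[
  S(32) = \frac{1}{0.01 + 0.99/32} \approx 24 \;(\varepsilon \approx 75\%), \qquad
  S(960) = \frac{1}{0.01 + 0.99/960} \approx 91 \;(\varepsilon \approx 9.4\%)
\]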
Xeon Phi                                          | K20
--------------------------------------------------+------------------------------------------------
immature compiler                                 | mature compiler
Intel only, not cheap ($$$)                       | many vendors (CUDA LLVM)
native & offload execution                        | offload only
MPI, OpenMP, POSIX threads, Cilk++, OpenCL, etc.. | CUDA C/Fortran, OpenCL, OpenACC, R, Python, Matlab ...
native MPI                                        | not possible
MPI+OpenMP, MPI+OpenCL, MPI+...                   | not possible
software scheduling: thread affinity is important | hardware scheduling: no worries about threads
the canonical doubly nested loop (M outer, N inner iterations):

#pragma omp parallel for
for (int j = 0; j < M; j++)
{
    /* .. some code .. */
    for (int i = 0; i < N; i++)
    {
        /* some code */
    }
}

say M = 64, N = 1024 ..
Xeon E5:  OMP_NUM_THREADS = 8
Xeon Phi: OMP_NUM_THREADS = 240
K20X: use CUDA, it works!
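For "use CUDA, it works!", a minimal sketch (kernel name and loop body are illustrative, not from the talk): map the outer j loop onto blocks and let the threads of each block stride over i.

__global__ void nested_loop(float *a, int M, int N)
{
    const int j = blockIdx.x;              /* one block per outer iteration */
    if (j >= M) return;
    /* .. some per-j code .. */
    for (int i = threadIdx.x; i < N; i += blockDim.x)
    {
        a[j * N + i] *= 2.0f;              /* placeholder body */
    }
}
/* launch: nested_loop<<<M, 128>>>(d_a, M, N); */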
M = 64, N = 1024: max # parallel units = 64 x 1024 = 64K, much larger than the # of FPUs
→ minimize surface-to-volume ratio
a 2D decomposition of the M x N iteration space into nbx x nby tiles, one thread block per tile (CUDA):

{ /* thread-block code */
    const int bid = blockIdx.x;
    const int nb  = gridDim.x;
    const int bx  = bid % nbx;
    const int by  = bid / nbx;
    const int nby = nb / nbx;
    const int ib = (bx * N) / nbx, ie = ((bx + 1) * N) / nbx;  /* compute i-range for bx */
    const int jb = (by * M) / nby, je = ((by + 1) * M) / nby;  /* compute j-range for by */
    for (int j = jb; j < je; j++)
    {
        /* .. some thread code .. */
        for (int i = ib; i < ie; i += blockDim.x)
        {
            /* some thread code, each thread handling element i + threadIdx.x */
        }
    }
}
minimize surface-to-volume ratio
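As a self-contained kernel, the sketch above could look as follows (kernel name, guard, and placeholder body are my assumptions, not the author's exact code):

/* blocked CUDA kernel: grid of nbx x nby one-dimensional thread blocks */
__global__ void blocked_scale(float *a, int M, int N, int nbx)
{
    const int nby = gridDim.x / nbx;
    const int bx  = blockIdx.x % nbx;
    const int by  = blockIdx.x / nbx;
    if (by >= nby) return;                 /* guard leftover blocks */
    const int ib = (bx * N) / nbx, ie = ((bx + 1) * N) / nbx;
    const int jb = (by * M) / nby, je = ((by + 1) * M) / nby;
    for (int j = jb; j < je; j++)
        for (int i = ib + threadIdx.x; i < ie; i += blockDim.x)
            a[j * N + i] *= 2.0f;          /* placeholder body */
}
/* launch: blocked_scale<<<nbx * nby, 128>>>(d_a, M, N, nbx); */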
the same decomposition with OpenMP, threads playing the role of thread blocks:

#pragma omp parallel
{ /* thread-block code */
    const int bid = omp_get_thread_num();
    const int nb  = omp_get_num_threads();
    const int bx  = bid % nbx;
    const int by  = bid / nbx;
    const int nby = nb / nbx;
    const int ib = (bx * N) / nbx, ie = ((bx + 1) * N) / nbx;  /* compute i-range for bx */
    const int jb = (by * M) / nby, je = ((by + 1) * M) / nby;  /* compute j-range for by */
    for (int j = jb; j < je; j++)
    {
        /* .. some thread code .. */
        for (int i = ib; i < ie; i++)
        {
            /* some thread code */
        }
    }
}
minimize surface-to-volume ratio
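A compilable OpenMP counterpart of the same decomposition (function name and array-scaling body are illustrative; assumes the thread count is at least nbx):

#include <omp.h>

void blocked_scale_omp(float *a, int M, int N, int nbx)
{
    #pragma omp parallel
    {
        const int nb  = omp_get_num_threads();
        const int nby = nb / nbx;
        const int bx  = omp_get_thread_num() % nbx;
        const int by  = omp_get_thread_num() / nbx;
        if (by < nby)   /* ignore leftover threads when nb % nbx != 0 */
        {
            const int ib = (bx * N) / nbx, ie = ((bx + 1) * N) / nbx;
            const int jb = (by * M) / nby, je = ((by + 1) * M) / nby;
            for (int j = jb; j < je; j++)
                for (int i = ib; i < ie; i++)
                    a[j * N + i] *= 2.0f;   /* placeholder body */
        }
    }
}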
This is where I find the biggest limitation ... the TOOLS!
the CUDA programming model maps well to Xeon Phi:
omp_get_thread_num()  ↔ blockIdx
omp_get_num_threads() ↔ gridDim
omp_get_ .. what? ..  ↔ threadIdx, blockDim
#pragma omp simd ... not that simple
get_simd_lane_index() ↔ threadIdx
get_simd_size()       ↔ blockDim
.. it doesn't exist, but very important!

#pragma omp parallel
{ /* "thread-block" */
    for (int j = jb; j < je; j++)
    {
        /* .. some code .. */
        for (int i = ib; i < ie; i++)
        {
            /* simple code */
        }
    }
}
auto-vectorization usually works well for simple cases
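As an illustration (my example, not the talk's), "simple" means something like a dependence-free streaming loop:

/* no branches, no loop-carried dependencies: compilers typically
   auto-vectorize this without any pragma */
void saxpy_range(float *a, const float *b, const float *c, float s, int ib, int ie)
{
    for (int i = ib; i < ie; i++)
        a[i] = b[i] + s * c[i];
}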
for complex code, auto-vectorization is not enough .. use #pragma omp simd:

#pragma omp parallel
{ /* "thread-block" */
    for (int j = jb; j < je; j++)
    {
        /* .. some code .. */
        #pragma omp simd
        for (int i = ib; i < ie; i++)
        {
            /* complex code */
        }
    }
}
.. but we're still at the mercy of the compiler (please, compiler, have mercy!)
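An illustrative case (my example) where the pragma earns its keep: a branchy body with a libm call can defeat auto-vectorization, and #pragma omp simd asserts the loop is safe to vectorize anyway:

#include <math.h>

void sqrt_clamp(float *a, const float *b, int ib, int ie)
{
    #pragma omp simd
    for (int i = ib; i < ie; i++)
        a[i] = (b[i] > 0.0f) ? sqrtf(b[i]) : 0.0f;   /* branch + libm call */
}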
do not fight the compiler
.. or vectorize manually, as in CUDA, with the (nonexistent, but very important!) get_simd_size() / get_simd_lane_index():

#pragma omp parallel
{ /* "thread-block" */
    for (int j = jb; j < je; j++)
    {
        /* .. some code .. */
        #pragma simd
        {
            const int simdsize = get_simd_size();        /* would map to blockDim  */
            const int simdlane = get_simd_lane_index();  /* would map to threadIdx */
            for (int i = ib; i < ie; i += simdsize)
            {
                /* complex code executed by each lane, on element i + simdlane */
            }
        }
    }
}
“The reality is that there is no such thing as a “magic” compiler that will automatically parallelize your code.”
MySerialCode.cpp → ./parallel_a.out
Both K20 & Xeon Phi can deliver excellent performance
- if the algorithm scales and is vectorized, either auto- or manually
It is not easy to get excellent performance
- easier to get performance for bandwidth-bound applications on Xeon Phi than on K20
- on Xeon Phi you have to worry about thread scheduling & vectorization
Xeon Phi can run any code natively (big plus)
- this can solve the PCIe bottleneck for legacy apps w/o extensive modification
- minimizes start-up effort: no memory management, code rewrites, etc.
The CUDA programming model maps well to Xeon Phi
- however, there is a lack of tools to take advantage of this ... (the Intel SPMD compiler may help)
http://arxiv.org/abs/1302.1078
http://icl.cs.utk.edu/magma/