
Clash of the Titans .. a personal view

Xeon Phi vs. K20

Evghenii Gaburov
June 20, 2013

> 1 TFLOP/s on a desktop

K20X vs. Xeon Phi

K20X (image: GK110 whitepaper)

15 SMX @ 0.73 GHz, 1.4 TFLOP/s fp64, 240 GB/s
per SMX: 192 fp32 cores, 64 fp64 cores, 32 SFUs, 32 LD/ST units
64KB L1$ + shared memory per SMX, 1.5MB L2$
32-wide SIMD (warp), 255 registers/thread
up to 2048 threads per SMX, in-order execution, hardware thread scheduling

Xeon Phi (KNC) (image: Intel Xeon Phi programming overview)

61 Pentium-class cores @ 1.1 GHz, 1.1 TFLOP/s fp64, 352 GB/s
per core: 512-bit SIMD registers, 16-wide fp32 / 8-wide fp64 SIMD, 32 SIMD registers/thread
32KB L1$ and 512KB L2$ per core, 30.5MB L2$ total
4 threads per core, in-order execution, software thread scheduling
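As a sanity check (my addition, not part of the original deck), the quoted peak fp64 numbers follow from units x SIMD width x clock, assuming each fp64 lane retires one FMA (2 flops) per cycle; this lands at ~1.40 and ~1.07 TFLOP/s, matching the deck's 1.4 and 1.1 figures:

#include <stdio.h>

int main(void)
{
    /* K20X: 15 SMX x 64 fp64 units, 0.73 GHz, FMA = 2 flops/cycle */
    double k20x    = 15.0 * 64.0 * 0.73e9 * 2.0;
    /* Xeon Phi (KNC): 61 cores x 8 fp64 SIMD lanes, 1.1 GHz, FMA = 2 flops/cycle */
    double xeonphi = 61.0 * 8.0  * 1.1e9  * 2.0;

    printf("K20X peak fp64:     %.2f TFLOP/s\n", k20x    * 1e-12);  /* ~1.40 */
    printf("Xeon Phi peak fp64: %.2f TFLOP/s\n", xeonphi * 1e-12);  /* ~1.07 */
    return 0;
}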

effective # compute units:

K20: 15 SMX x 64 fp64 CUDA cores = 960
Xeon Phi: 61 cores x 2 threads x 8-wide fp64 SIMD = 976
Xeon E5: 8 cores x 1 thread x 4-wide fp64 SIMD = 32

Xeon Phi is much more parallel than Xeon E5!

above all: the *algorithm* MUST scale!

[figure: performance vs. number of threads; image: Intel Xeon Phi programming overview]

~3x for the same number of threads
~3x to get the same performance
if app doesn't scale ... or worse

.. don't forget Amdahl's law (P = parallel fraction, N = # compute units):

P=0.99, N=1:    S = 1    (efficiency 100%)
P=0.99, N=32:   S = 24   (efficiency 75%)
P=0.99, N=960:  S = 91   (efficiency 9.4%)
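For reference, a minimal C sketch (my addition, not from the deck) that reproduces these numbers, up to rounding, from Amdahl's law S(N) = 1 / ((1 - P) + P/N), with efficiency S(N)/N:

#include <stdio.h>

/* Amdahl's law: speedup on N compute units when a fraction P of the work is parallel. */
static double amdahl_speedup(double P, int N)
{
    return 1.0 / ((1.0 - P) + P / N);
}

int main(void)
{
    const double P   = 0.99;
    const int    N[] = { 1, 32, 960 };

    for (int k = 0; k < 3; k++) {
        double S = amdahl_speedup(P, N[k]);
        /* efficiency = speedup per compute unit */
        printf("P=%.2f, N=%4d: S=%5.1f, efficiency=%5.1f%%\n",
               P, N[k], S, 100.0 * S / N[k]);
    }
    return 0;
}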

Xeon Phi vs. K20:

Xeon Phi: immature compiler; Intel only, not cheap ($$$); native/offload
          programming models: MPI, OpenMP, POSIX threads, Cilk++, OpenCL, etc.
          MPI on the device: MPI, MPI+OpenMP, MPI+OpenCL, MPI+...
          software thread scheduling: thread affinity is important

K20:      mature compiler; many vendors (CUDA, LLVM); offload only
          programming models: CUDA C/Fortran, OpenCL, OpenACC, R, Python, Matlab, ...
          MPI on the device: not possible
          hardware thread scheduling: no worries about threads

a doubly nested loop over an M x N domain:

#pragma omp parallel for
for (int j = 0; j < M; j++) {
    /* .. some code .. */
    for (int i = 0; i < N; i++) {
        /* some code */
    }
}

say M=64, N=1024 ..

Xeon E5:  OMP_NUM_THREADS = 8
Xeon Phi: OMP_NUM_THREADS = 240
K20X: use CUDA, it works!

parallelizing only the outer loop gives at most M=64 parallel chunks: enough for 8 Xeon E5 threads, but not for 240 Xeon Phi threads.
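A small illustrative program (my addition, not from the deck) makes this visible: it counts how many OpenMP threads actually receive outer-loop iterations; with OMP_NUM_THREADS=240 on Xeon Phi the answer can be at most 64.

#include <stdio.h>
#include <omp.h>

#define M 64
#define N 1024
#define MAX_THREADS 1024

int main(void)
{
    int busy[MAX_THREADS] = { 0 };       /* which threads actually got outer iterations */

    #pragma omp parallel for
    for (int j = 0; j < M; j++) {
        busy[omp_get_thread_num()] = 1;  /* mark this thread as used */
        for (int i = 0; i < N; i++) {
            /* some code */
        }
    }

    int used = 0;
    for (int t = 0; t < MAX_THREADS; t++)
        used += busy[t];
    printf("threads that received work: %d of %d\n", used, omp_get_max_threads());
    return 0;
}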

M=64, N=1024: max # parallel units = 64 x 1024 = 64K, much larger than # FPUs

minimize surface-to-volume ratio

block the M x N domain into an nbx x nby grid of thread blocks (CUDA):

{ /* thread-block code */
    bid = blockIdx.x;
    nb  = gridDim.x;
    bx  = bid % nbx;
    by  = bid / nbx;
    nby = nb / nbx;
    /* compute ib & ie for bx */
    /* compute jb & je for by */
    for (int j = jb; j < je; j++) {
        /* .. some thread code .. */
        for (int i = ib; i < ie; i += blockDim.x) {
            /* some thread code */
        }
    }
}

minimize surface-to-volume ratio

the same decomposition with OpenMP, one "thread block" per OpenMP thread:

#pragma omp parallel
{ /* thread-block code */
    bid = omp_get_thread_num();
    nb  = omp_get_num_threads();
    bx  = bid % nbx;
    by  = bid / nbx;
    nby = nb / nbx;
    /* compute ib & ie for bx */
    /* compute jb & je for by */
    for (int j = jb; j < je; j++) {
        /* .. some thread code .. */
        for (int i = ib; i < ie; i++) {
            /* some thread code */
        }
    }
}

minimize surface-to-volume ratio
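A minimal compilable version of this OpenMP decomposition (my sketch, not from the deck): the choice of nbx, the even-split range computation, and the array update are illustrative assumptions; the deck leaves them as "compute ib & ie".

#include <stdio.h>
#include <omp.h>

#define M 64
#define N 1024

static float a[M][N];

int main(void)
{
    const int nbx = 8;   /* assumed: # blocks along i; pick so that # threads % nbx == 0 */

    #pragma omp parallel
    {   /* "thread-block" code: one block per OpenMP thread */
        const int bid = omp_get_thread_num();
        const int nb  = omp_get_num_threads();
        const int nby = nb / nbx;               /* # blocks along j */
        const int bx  = bid % nbx;
        const int by  = bid / nbx;

        if (nby > 0 && by < nby) {              /* leftover threads (if any) stay idle */
            /* even split: block (bx,by) owns columns [ib,ie) and rows [jb,je) */
            const int ib = (N *  bx     ) / nbx;
            const int ie = (N * (bx + 1)) / nbx;
            const int jb = (M *  by     ) / nby;
            const int je = (M * (by + 1)) / nby;

            for (int j = jb; j < je; j++)
                for (int i = ib; i < ie; i++)
                    a[j][i] = (float)(i + j);   /* stands in for "some thread code" */
        }
    }

    printf("a[%d][%d] = %g\n", M - 1, N - 1, (double)a[M - 1][N - 1]);
    return 0;
}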

CUDA programming model maps well to Xeon Phi:

omp_get_thread_num()   <->  blockIdx
omp_get_num_threads()  <->  gridDim
omp_get_ .. what? ..   <->  threadIdx, blockDim

#pragma omp simd ... not that simple

This is where I find the biggest limitation ... the TOOLS!

they don't exist, but would be very important:

get_simd_lane_index()  <->  threadIdx
get_simd_size()        <->  blockDim

auto-vectorization usually works well for simple cases:

#pragma omp parallel
{ /* "thread-block" */
    for (int j = jb; j < je; j++) {
        /* .. some code .. */
        for (int i = ib; i < ie; i++) {
            /* simple code */
        }
    }
}

for complex inner-loop code, use #pragma omp simd:

#pragma omp parallel
{ /* "thread-block" */
    for (int j = jb; j < je; j++) {
        /* .. some code .. */
#pragma omp simd
        for (int i = ib; i < ie; i++) {
            /* complex code */
        }
    }
}

.. but we're still at the mercy of the compiler... please, compiler, have mercy!

do not fight the compiler

manual vectorization, as in CUDA, via the (nonexistent) get_simd_size() / get_simd_lane_index():

#pragma omp parallel
{ /* "thread-block" */
    for (int j = jb; j < je; j++) {
        /* .. some code .. */
#pragma simd
        {
            simdsize = get_simd_size();
            simdlane = get_simd_lane_index();
            for (int i = ib; i < ie; i += simdsize) {
                /* complex code executed by each lane */
            }
        }
    }
}
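One way to approximate this today (my workaround, not from the deck) is to strip-mine the inner loop by an assumed SIMD width and let #pragma omp simd vectorize an explicit lane loop, so the lane index plays the role of threadIdx; SIMDW and tile_kernel are illustrative names.

#define SIMDW 8   /* assumed fp64 SIMD width (KNC: 512-bit registers / 64-bit doubles) */

/* hypothetical per-block kernel over the index range [ib, ie) */
void tile_kernel(double *x, int ib, int ie)
{
    for (int i = ib; i < ie; i += SIMDW) {
        #pragma omp simd
        for (int lane = 0; lane < SIMDW; lane++) {   /* lane ~ get_simd_lane_index() */
            int idx = i + lane;
            if (idx < ie)                            /* guard the ragged last chunk */
                x[idx] = x[idx] * 2.0 + 1.0;         /* stands in for "complex code" */
        }
    }
}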

http://ispc.github.com (ispc, the Intel SPMD Program Compiler)

"The reality is that there is no such thing as a 'magic' compiler that will automatically parallelize your code."

MySerialCode.cpp -> ./parallel_a.out

Both K20 & Xeon Phi can deliver excellent performance, if the algorithm scales and is vectorized .. either auto- or manually.

It is not easy to get excellent performance: it is easier to get performance for bandwidth-bound applications on Xeon Phi than on K20, but you have to worry about thread scheduling & vectorization on Xeon Phi.

Xeon Phi can run any code natively (big plus): this can solve the PCIe bottleneck for legacy apps without extensive modification, and it minimizes start-up effort .. no memory management, code rewrites, etc.

The CUDA programming model maps well to Xeon Phi; however, the tools to take advantage of this are lacking... (the Intel SPMD compiler may help)

http://arxiv.org/abs/1302.1078

http://icl.cs.utk.edu/magma/

http://clbenchmark.com