MANY-CORE COMPUTING
Ana Lucia Varbanescu, UvA
Original slides: Rob van Nieuwpoort, eScience Center
6-Oct-2014

Transcript of: Vrije Universiteit Amsterdam › ~bal › college14 › class2-2k14.pdf

Page 1: MANY-CORE COMPUTING
Ana Lucia Varbanescu, UvA
Original slides: Rob van Nieuwpoort, eScience Center
6-Oct-2014

Page 2: Schedule

1. Introduction and programming basics (2-10-2014)
2. Performance analysis (6-10-2014)
3. Advanced CUDA programming (9-10-2014)
4. Case study: the LOFAR telescope with many-cores, by Rob van Nieuwpoort (??)

Page 3: GPUs @ AMD

Radeon R9; top of the line: R9 295X2. For comparison:
- AMD R9 290X: 5.6 TFLOPS, 4 GB memory, 320 GB/s bandwidth
- NVIDIA GTX 980 (Maxwell): 5.0 TFLOPS, 4 GB memory, lower bandwidth: 224 GB/s
- NVIDIA GTX Titan Black (Kepler): 5.3 TFLOPS, 6 GB memory, higher bandwidth: 336 GB/s
- NVIDIA GTX Titan Z vs. R9 295X2: fairly similar numbers, higher DP performance

Page 4: Today

- Revisit VectorAdd: for GPUs, for many-core CPUs
- Hardware revisited
- Performance analysis: hardware performance, application performance

Page 5: VectorAdd revisited

Page 6: Vector add: sequential

// sequential vector addition: c = a + b
void vector_add(int size, float* a, float* b, float* c) {
    for (int i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}

Page 7: Vector add: GPU code (skeleton)

// Device code: compute vector sum c = a + b
// each thread performs one pair-wise addition
__global__ void vector_add(int N, float* A, float* B, float* C) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

// Host code (should be in the same file)
int main() {
    // initialization code here ...
    int N = 5120;
    // launch N/256 blocks of 256 threads each
    vector_add<<< N/256, 256 >>>(N, deviceA, deviceB, deviceC);
    // cleanup code here ...
}
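The slides elide the host-side setup; below is a minimal sketch of what typically goes in its place, using the standard CUDA runtime API (buffer names match the skeleton; input initialization and error handling are omitted):

// A possible completion of the elided host code; not from the slides.
#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    int N = 5120;
    size_t bytes = N * sizeof(float);

    // host buffers (filling in the input values is omitted)
    float *hostA = (float*)malloc(bytes);
    float *hostB = (float*)malloc(bytes);
    float *hostC = (float*)malloc(bytes);

    // device buffers
    float *deviceA, *deviceB, *deviceC;
    cudaMalloc((void**)&deviceA, bytes);
    cudaMalloc((void**)&deviceB, bytes);
    cudaMalloc((void**)&deviceC, bytes);

    // copy the inputs to the device
    cudaMemcpy(deviceA, hostA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, bytes, cudaMemcpyHostToDevice);

    // launch N/256 blocks of 256 threads each (N is a multiple of 256 here)
    vector_add<<< N/256, 256 >>>(N, deviceA, deviceB, deviceC);

    // copy the result back and clean up
    cudaMemcpy(hostC, deviceC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(deviceA); cudaFree(deviceB); cudaFree(deviceC);
    free(hostA); free(hostB); free(hostC);
    return 0;
}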

Page 8: Multi-core CPU programming

Two levels of parallelism:
- Coarse-grain: threads / processes
- Fine-grain: SIMD operations

Instantiate the threads with:
- Pthreads, Java threads, OpenMP, MPI

Vectorize:
- Rely on the compiler, or
- Vectorize manually: vector types, intrinsics

Page 9: OpenMP

Add directives to the sequential code to mark parallel sections.

#define SIZE 5120   // array length; the value is an assumption, not on the slide

// function to add two vectors
void vector_add_Phi(int n, int* a, int* b, int* c) {
    int i = 0;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// main program
int main() {
    int in1[SIZE], in2[SIZE], res[SIZE];
    {
        vector_add_Phi(SIZE, in1, in2, res);
    }
}
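With gcc, OpenMP is enabled with the -fopenmp compiler flag; without it the pragma is ignored and the loop simply runs sequentially.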

Page 10: OpenMP (for Xeon Phi, too)

Add directives to the sequential code to mark parallel sections, plus offload annotations for the Xeon Phi.

// Phi function to add two vectors
__attribute__((target(mic)))                           // for Xeon Phi
void vector_add_Phi(int n, int* a, int* b, int* c) {
    int i = 0;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// main program
int main() {
    int in1[SIZE], in2[SIZE], res[SIZE];
    #pragma offload target(mic) in(in1,in2) inout(res) // for Xeon Phi
    {
        vector_add_Phi(SIZE, in1, in2, res);
    }
}

Page 11: Cilk (for Xeon Phi, too)

Add keywords to parallelize sequential code by divide-and-conquer.

cilk void VectorAdd(float *a, float *b, float *c, int n) {
    if (n < GrainSize) {
        int i;
        for (i = 0; i < n; ++i)
            a[i] = b[i] + c[i];
    } else {
        spawn VectorAdd(a, b, c, n/2);
        spawn VectorAdd(a + n/2, b + n/2, c + n/2, n/2);
        sync;   // wait for both spawned halves
    }
}
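The code above uses the original MIT Cilk spelling. For reference, a sketch of the same pattern in Intel Cilk Plus, the variant shipped with Intel's compilers (including for Xeon Phi); GRAIN_SIZE and the cutoff value are our placeholders, not from the slides:

#include <cilk/cilk.h>

#define GRAIN_SIZE 1024   // serial cutoff; the value is our choice

// Divide-and-conquer vector add in Cilk Plus syntax (sketch)
void vector_add_cilk(float *a, float *b, float *c, int n) {
    if (n < GRAIN_SIZE) {
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + c[i];
    } else {
        cilk_spawn vector_add_cilk(a, b, c, n / 2);
        // run the second half in the current strand; n - n/2 covers odd sizes
        vector_add_cilk(a + n / 2, b + n / 2, c + n / 2, n - n / 2);
        cilk_sync;   // wait for the spawned half
    }
}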

Page 12: Vectorization on x86 architectures

Since  Name                                            Bits     SP vector size  DP vector size
1996   MultiMedia eXtensions (MMX)                     64 bit   integer only    integer only
1999   Streaming SIMD Extensions (SSE)                 128 bit  4 floats        2 doubles
2011   Advanced Vector Extensions (AVX)                256 bit  8 floats        4 doubles
2012   Intel Xeon Phi accelerator (was Larrabee, MIC)  512 bit  16 floats       8 doubles

Page 13: Vectorizing with SSE

- Assembly instructions, executed on vector registers
- In C or C++: intrinsics
  - Declare vector variables
  - Name the instruction
  - Work on variables, not registers

Page 14: Vectorizing with SSE: examples

#include <xmmintrin.h>   // SSE intrinsics

float data[1024];
// init: data[0] = 0.0, data[1] = 1.0, data[2] = 2.0, etc.
init(data);

// Set all elements in my vector to zero.
__m128 myVector0 = _mm_setzero_ps();        // {0.0, 0.0, 0.0, 0.0}

// Load the first 4 elements of the array into my vector.
__m128 myVector1 = _mm_load_ps(data);       // {0.0, 1.0, 2.0, 3.0}

// Load the second 4 elements of the array into my vector.
__m128 myVector2 = _mm_load_ps(data + 4);   // {4.0, 5.0, 6.0, 7.0}

(Note: _mm_load_ps requires a 16-byte-aligned address.)

Page 15: Vectorizing with SSE: examples

// Add vectors 1 and 2; a single instruction performs 4 FLOPs.
__m128 myVector3 = _mm_add_ps(myVector1, myVector2);
// {0.0, 1.0, 2.0, 3.0} + {4.0, 5.0, 6.0, 7.0} = {4.0, 6.0, 8.0, 10.0}

// Multiply vectors 1 and 2; a single instruction performs 4 FLOPs.
__m128 myVector4 = _mm_mul_ps(myVector1, myVector2);
// {0.0, 1.0, 2.0, 3.0} * {4.0, 5.0, 6.0, 7.0} = {0.0, 5.0, 12.0, 21.0}

// _mm_shuffle_ps(v1, v2, _MM_SHUFFLE(w,x,y,z)) builds a vector from
// elements z and y of v1 and elements x and w of v2.
__m128 myVector5 = _mm_shuffle_ps(myVector1, myVector2,
                                  _MM_SHUFFLE(2, 3, 0, 1));
// {v1[1], v1[0], v2[3], v2[2]} = {1.0, 0.0, 7.0, 6.0}

Page 16: Vector add with SSE: unroll loop

// assumes size is a multiple of 4
void vectorAdd(int size, float* a, float* b, float* c) {
    for (int i = 0; i < size; i += 4) {
        c[i+0] = a[i+0] + b[i+0];
        c[i+1] = a[i+1] + b[i+1];
        c[i+2] = a[i+2] + b[i+2];
        c[i+3] = a[i+3] + b[i+3];
    }
}

Page 17: Vector add with SSE: vectorize loop

// assumes size is a multiple of 4 and 16-byte-aligned arrays
void vectorAdd(int size, float* a, float* b, float* c) {
    for (int i = 0; i < size; i += 4) {
        __m128 vecA = _mm_load_ps(a + i);     // load 4 elements from a
        __m128 vecB = _mm_load_ps(b + i);     // load 4 elements from b
        __m128 vecC = _mm_add_ps(vecA, vecB); // add 4 elements
        _mm_store_ps(c + i, vecC);            // store 4 elements
    }
}
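The loop above assumes an aligned, multiple-of-4 size. A common way to relax both assumptions (our sketch, not from the slides) is to use unaligned loads plus a scalar tail:

void vectorAddAny(int size, float* a, float* b, float* c) {
    int i = 0;
    // vector part: 4 elements per iteration, unaligned loads/stores
    for (; i + 4 <= size; i += 4) {
        __m128 vecA = _mm_loadu_ps(a + i);
        __m128 vecB = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(vecA, vecB));
    }
    // scalar tail for the remaining 0-3 elements
    for (; i < size; i++)
        c[i] = a[i] + b[i];
}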

Page 18: Optional assignment

Implement a vectorized version of:
- Element-wise array multiplication, with complex numbers
- Element-wise array division, with complex numbers

Compile with gcc and measure the performance with and without vectorization.

Send (pseudo-)code (and performance numbers, if you have them) by email to [email protected]

Page 19: Hardware revisited
- CPUs
- NVIDIA GPUs

Page 20: Generic multi-core CPU

[Block diagram] A generic multi-core CPU has:
- Hardware threads
- SIMD units (vector lanes)
- Dedicated L1 and L2 caches
- A shared L3/L4 cache
- Main memory and I/O

The compute side determines peak performance; the memory side determines bandwidth.

Page 21: Generic GPU

[Block diagram] A generic GPU has:
- Single or SIMD execution units
- A hardware scheduler
- Local memory / cache
- Units for executing functions with high precision

Again, the execution units determine peak performance and the memory system determines bandwidth.

Page 22: NVIDIA GPUs

- Kepler: larger SM (SMX), more registers, better scheduler, dynamic parallelism, multi-GPU
- Maxwell: modular SM (SMM), dedicated registers, dedicated schedulers, more L2 cache

Page 23: Platform architecture (Fermi)

[Architecture diagram]

Page 24: Memory architecture (from Fermi)

- Configurable L1 cache per SM: either 16 KB L1 cache / 48 KB shared memory, or 48 KB L1 cache / 16 KB shared memory
- Shared L2 cache

[Diagram: each SM has registers and its L1 cache / shared memory; all SMs share the L2 cache and device memory; host memory is reached over the PCI-e bus.]

Page 25: Fermi

[Die diagram: host interface and GigaThread engine; four GPCs, each with a raster engine and four SMs (one Polymorph engine per SM); shared L2 cache; six memory controllers.]

- Consumer: GTX 480, GTX 580
- HPC: Tesla C2050: more memory, ECC; 1.0 TFLOPS SP, 515 GFLOPS DP
- 16 streaming multiprocessors (SMs); enabled: GTX 580: 16, GTX 480: 15, C2050: 14
- 768 KB L2 cache

Page 26: Fermi: SM

[Same die diagram, with one SM highlighted.]

- 32 cores per SM
- 64 KB configurable L1 cache / shared memory
- 32,768 32-bit registers

Page 27: Fermi: CUDA core*

- Decoupled floating-point and integer data paths
- Double-precision fused multiply-add (FMA)
- Integer operations optimized for extended precision
- DP throughput is 50% of SP throughput: DP: 256 FMA ops/clock; SP: 512 FMA ops/clock

*http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf

Page 28: Kepler: the new SMX

- Consumer: GTX 680, GTX 780, GTX Titan
- HPC: Tesla K10..K40
- SMX features:
  - 192 CUDA cores (32 in Fermi)
  - 32 Special Function Units (SFU) (4 in Fermi)
  - 32 load/store units (LD/ST) (16 in Fermi)
  - 3x perf/Watt improvement

Page 29: A comparison

[Comparison figure]

Page 30: Maxwell: the newest SMM

- Consumer: GTX 970, GTX 980, …
- HPC: ?
- SMM features:
  - 4 subblocks of 32 cores
  - Dedicated L1/local memory per 64 cores
  - Dispatch/decode/registers per 32 cores
  - L2 cache: 2 MB (~3x vs. Kepler)
  - 40 texture units
  - Lower power consumption

Page 31: Hardware performance

Page 32: Hardware performance metrics

- Clock frequency [GHz]: the absolute hardware speed (memories, CPUs, interconnects)
- Operational speed [GFLOPS]: instructions per cycle combined with frequency
- Memory bandwidth [GB/s]: differs a lot between the different memories on a chip
- Power [Watt]
- Derived metrics: FLOPs/Byte, FLOPs/Watt

Page 33: Theoretical peak performance

Peak = chips * cores * threads/core * vector_lanes * FLOPs/cycle * clock frequency

Examples from DAS-4:
- Intel Core i7 CPU: 2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPS
- NVIDIA GTX 580 GPU: 1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPS
- ATI HD 6970: 1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPS
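The formula is easy to sanity-check in code; a minimal sketch, with the parameter values taken from the examples above (threads/core is folded into the lane count where it adds no FLOPs):

#include <stdio.h>

// Peak = chips * cores * lanes * FLOPs/cycle * clock frequency (GHz)
static double peak_gflops(int chips, int cores, int lanes,
                          int flops_per_cycle, double clock_ghz) {
    return chips * cores * lanes * flops_per_cycle * clock_ghz;
}

int main(void) {
    printf("Core i7: %.0f GFLOPS\n", peak_gflops(2, 4, 4, 2, 2.4));        // ~154
    printf("GTX 580: %.0f GFLOPS\n", peak_gflops(1, 16, 32, 2, 1.544));    // ~1581
    printf("HD 6970: %.0f GFLOPS\n", peak_gflops(1, 24 * 16, 4, 2, 0.880)); // ~2703
    return 0;
}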

Page 34: DRAM memory bandwidth

Throughput = memory bus frequency * bits per cycle * bus width

- Memory clock != CPU clock!
- The product is in bits; divide by 8 for GB/s

Examples:
- Intel Core i7, DDR3: 1.333 * 2 * 64 / 8 = 21 GB/s
- NVIDIA GTX 580, GDDR5: 1.002 * 4 * 384 / 8 = 192 GB/s
- ATI HD 6970, GDDR5: 1.375 * 4 * 256 / 8 = 176 GB/s
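The per-cycle factors come from the memory technology: DDR3 transfers data twice per bus clock, while GDDR5 effectively transfers four times per memory clock, which is where the 2 and the 4 above come from.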

Page 35: Memory bandwidths

- On-chip memory can be orders of magnitude faster: registers, shared memory, caches, …
  - E.g., the AMD HD 7970 L1 cache achieves 2 TB/s
- Off-chip memory depends on the interconnect:
  - Intel: QPI (QuickPath Interconnect), 25.6 GB/s
  - AMD: HT3 (HyperTransport 3), 19.2 GB/s
  - Accelerators: PCI-e 2.0, 8 GB/s

Page 36: Power

- Chip manufacturers specify the Thermal Design Power (TDP)
- We can measure the dissipated power of the whole system; it is typically (much) lower than the TDP
- Power efficiency: FLOPs/Watt
- Examples (theoretical peak / TDP):
  - Intel Core i7: 154 / 160 = 1.0 GFLOPs/W
  - NVIDIA GTX 580: 1581 / 244 = 6.5 GFLOPs/W
  - ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W

Page 37: Summary

Platform             Cores  Threads/ALUs  GFLOPS  Bandwidth (GB/s)
Sun Niagara 2           8        64         11.2      76
IBM BG/P                4         8         13.6      13.6
IBM Power 7             8        32        265        68
Intel Core i7           4        16         85        25.6
AMD Barcelona           4         8         37        21.4
AMD Istanbul            6         6         62.4      25.6
AMD Magny-Cours        12        12        125        25.6
Cell/B.E.               8         8        205        25.6
NVIDIA GTX 580         16       512       1581       192
NVIDIA GTX 680          8      1536       3090       192
AMD HD 6970           384      1536       2703       176
AMD HD 7970            32      2048       3789       264
Intel Xeon Phi 7120    61       240       2417       352

Page 38: Absolute hardware performance

Only achieved under optimal conditions:
- Processing units 100% used
- All parallelism 100% exploited
- All data transfers at maximum bandwidth

In real life, no application behaves like this. Can we reason about "real" performance?

Page 39: Optional assignment

- Compute and fill in the numbers in the table for the CPU and GPU in your machine; compute the FLOPs/Byte ratio (GFLOPS divided by bandwidth) as well.
- Compute the numbers and fill in the table for your dream GPU.

Please send me your answers (just the added lines) by Thursday @ 11:00 at [email protected]

Page 40: Performance analysis
- Amdahl's Law
- Operational Intensity and the Roofline model

Page 41: Software performance metrics (the 3 P's)

Performance:
- Execution time
- Speed-up
- Computational throughput (GFLOP/s); computational efficiency (i.e., utilization)
- Bandwidth (GB/s); memory efficiency (i.e., utilization)

Productivity and Portability:
- Programmability
- Production costs
- Maintenance costs

Page 42: Reason early about performance

Amdahl's law, with
- s = the fraction of sequential code
- p = the number of processors:

  S(p) = 1 / (s + (1 - s) / p)

The parallel part is assumed perfectly parallel! How fast can it really be? Compute the achievable performance.
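For example, with s = 0.25 the speed-up is bounded by 1/s = 4, no matter how many processors are added.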

Page 43: Amdahl's Law in pictures

[Figure]

Page 44: RGB to gray

for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        Pixel pixel = RGB[y][x];
        gray[y][x] = 0.30 * pixel.R + 0.59 * pixel.G + 0.11 * pixel.B;
    }
}
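The declarations are not on the slide; a minimal sketch that makes the loop compile (the only confirmed detail is that Pixel is a 3-byte structure, per the operational-intensity slide later on):

// Declarations assumed by the loop above (sketch; types are assumptions)
typedef struct { unsigned char R, G, B; } Pixel;  // 3 bytes per pixel
Pixel RGB[1080][1920];                            // input image; size is an example
unsigned char gray[1080][1920];                   // 1-byte grayscale output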

Page 45: Performance evaluation

- Measure the execution time: Tpar (absolute performance)
- Calculate the speed-up: S = Tseq / Tpar (relative performance)
  - Does not take the application into account!
- Execution time and speed-up can be used to compare implementations of the same algorithm.

Page 46: Performance measurement setup

- Image sizes: select at least 7 different images; order them by increasing size
- Run the code 10 times per image; assume outliers are eliminated
- Ts = average of the 10 sequential runs
- Choose different p's:
  - Tp = average of the 10 parallel runs
  - Tp_par = execution time of the parallel part
  - Tp_seq = execution time of the sequential part (should be the same for every p)
- Report execution times & speed-ups: for the full application and for the parallel section only

Page 47: An example: execution time

[Bar chart: execution time (0-35) for Image 1 through Image 7, with one bar per configuration: Ts, T2, T4, T8, T16.]

Page 48: Same example: speed-up

[Line chart: speed-up (0-8) vs. p = 2, 4, 8, 16 for Image 1 through Image 7.]

This is strong scaling. How would you build a weak-scaling experiment?
- Strong scaling: keep the total workload constant and increase the number of cores/nodes.
- Weak scaling: keep the work per compute node the same and increase the number of compute nodes.

Page 49: Derived metrics

- Throughput: GFLOP/s = #FLOPs / Tpar
  - Takes the application into account!
  - Compute utilization: Ec = achieved GFLOP/s / peak * 100%
- Bandwidth: BW = #bytes read and written / Tpar
  - Takes the application into account!
  - Bandwidth utilization: Ebw = BW / peak * 100%

Achieved bandwidth and throughput can be used to compare *different* algorithms. Utilization can be used to compare *different* (application, platform) combinations.
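For example, a kernel that performs 10 GFLOP in 0.05 s achieves 200 GFLOP/s; on a GTX 580 (1581 GFLOP/s peak) that is a compute utilization of about 13%.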

Page 50: Performance analysis

Compare real-life performance against the theoretical limits to:
- understand the bottlenecks
- perform the right optimizations
- … and decide when to stop fiddling with the code!

Computing realistic limits is the most difficult challenge in parallel performance analysis:
- using the theoretical peak limits gives low accuracy
- instead, combine the application characteristics with the platform characteristics

Page 51: Arithmetic / operational intensity

- The number of operations per byte of accessed memory
  - Is the application compute-intensive or data-intensive?
- It is an application characteristic!
- Ignore "overheads": loop counters, array index calculations, branches

Page 52: RGB to gray

for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        Pixel pixel = RGB[y][x]; // 3-byte structure
        gray[y][x] = 0.30 * pixel.R + 0.59 * pixel.G + 0.11 * pixel.B;
    }
}

Per pixel: 2 ADDs + 3 MULs = 5 operations; 1 read of a 3-byte pixel + 1 write of a 1-byte result = 4 bytes of memory accessed.

OI = 5 / 4 = 1.25

Page 53: Many-core platforms

Platform             Cores  Threads/ALUs  GFLOPS  Bandwidth (GB/s)  FLOPs/Byte
Sun Niagara 2           8        64         11.2      76               0.1
IBM BG/P                4         8         13.6      13.6             1.0
IBM Power 7             8        32        265        68               3.9
Intel Core i7           4        16         85        25.6             3.3
AMD Barcelona           4         8         37        21.4             1.7
AMD Istanbul            6         6         62.4      25.6             2.4
AMD Magny-Cours        12        12        125        25.6             4.9
Cell/B.E.               8         8        205        25.6             8.0
NVIDIA GTX 580         16       512       1581       192               8.2
NVIDIA GTX 680          8      1536       3090       192              16.1
AMD HD 6970           384      1536       2703       176              15.4
AMD HD 7970            32      2048       3789       264              14.4
Intel Xeon Phi 7120    61       240       2417       352               6.9

Page 54: Compute or memory intensive? RGB to gray

[Bar chart: the platforms' FLOPs/Byte ratios (0-17), from Sun Niagara 2 through Intel Xeon Phi 7120 and 3120, compared against the OI of RGB-to-gray.]

"A multi-/many-core processor is a device built to turn a compute-intensive application into a memory-intensive one" (Kathy Yelick, UC Berkeley)

Page 55: Applications' OI

Operational intensity per application class:
- O(1): SpMV, BLAS 1 & 2; stencils (PDEs); lattice methods
- O(log N): FFTs
- O(N): dense linear algebra (BLAS 3); particle methods

Page 56: Attainable performance

Attainable GFLOP/s = min(peak floating-point performance, peak memory bandwidth * operational intensity)

- The peak is reached iff OI_app ≥ Peak_FLOPs / Peak_BW
- Compute-intensive iff OI_app ≥ (FLOPs/Byte)_platform
- Memory-intensive iff OI_app < (FLOPs/Byte)_platform
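The bound is easy to encode; a minimal sketch (the helper name is ours; platform numbers come from the summary table):

#include <stdio.h>

// Roofline bound: attainable GFLOP/s = min(peak compute, peak BW * OI)
static double attainable(double peak_gflops, double peak_bw_gbs, double oi) {
    double memory_bound = peak_bw_gbs * oi;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    // RGB-to-gray (OI = 1.25) on a GTX 680: 3090 GFLOP/s peak, 192 GB/s
    printf("GTX 680:  %.0f GFLOP/s\n", attainable(3090.0, 192.0, 1.25)); // 240
    // ... and on a Xeon Phi 7120: 2417 GFLOP/s peak, 352 GB/s
    printf("Xeon Phi: %.0f GFLOP/s\n", attainable(2417.0, 352.0, 1.25)); // 440
    return 0;
}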

Page 57: Attainable performance

Attainable GFLOP/s = min(peak floating-point performance, peak memory bandwidth * operational intensity)

Example: RGB-to-gray, OI = 1.25
- NVIDIA GTX 680: P = min(3090, 1.25 * 192) = 240 GFLOP/s, only 7.8% of the peak
- Intel Xeon Phi: P = min(2417, 1.25 * 352) = 440 GFLOP/s, only 18.2% of the peak

In both cases the bandwidth term wins: the application is memory-intensive on these platforms.

Page 58: The Roofline model

[Roofline plot for an AMD Opteron X2 (two cores): 17.6 GFLOP/s peak, 15 GB/s, ridge point at 1.17 ops/byte.]

Page 59: Roofline: comparing architectures

[Rooflines for the AMD Opteron X2 (17.6 GFLOP/s, 15 GB/s, ops/byte = 1.17) and the AMD Opteron X4 (73.6 GFLOP/s, 15 GB/s, ops/byte = 4.9).]

Page 60: Roofline: computational ceilings

[Roofline for the AMD Opteron X2 (two cores): 17.6 GFLOP/s, 15 GB/s, ops/byte = 1.17, with computational ceilings added.]

Page 61: Roofline: bandwidth ceilings

[Roofline for the AMD Opteron X2 (two cores): 17.6 GFLOP/s, 15 GB/s, ops/byte = 1.17, with bandwidth ceilings added.]

Page 62: Roofline: optimization regions

[Roofline figure with optimization regions marked.]

Page 63: Use the Roofline model

Determine what to do first to gain performance:
- Increase the memory streaming rate
- Apply in-core optimizations
- Increase the arithmetic intensity

Reader:
Samuel Williams, Andrew Waterman, David Patterson, "Roofline: an insightful visual performance model for multicore architectures", Communications of the ACM, 2009.

Page 64: Questions? Comments?

For questions, comments, suggestions, …: [email protected]