CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE...

74
CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira [email protected]

Transcript of CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE...

Page 1: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS

1

André Pereira [email protected]

Page 2: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

AgendaThe Case Study

Identifying InefficienciesCommon code pitfallsUsing compiler flags

Vectorisation

Shared Memory Parallelisation

Intel Xeon Phi Coprocessor

2

Page 3: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

The Case Study

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++) for (int j = 0; j < size; j++) {

c[i][j] = 0; for (int k = 0; k < size; k++) c[i][j] += a[i][k] * b[k][j];

}}

3

Page 4: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Common Code Inefficiencies

4

Page 5: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Common Code Inefficiencies

Avoidable memory accesses

4

Page 6: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Common Code Inefficiencies

Avoidable memory accesses

False alias

4

Page 7: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Common Code Inefficiencies

Avoidable memory accesses

False alias

Blocking/Tiling

4

Page 8: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Common Code Inefficiencies

Avoidable memory accesses

False alias

Blocking/Tiling

Memory alignment - to see later

4

Page 9: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Common Code Inefficiencies

Avoidable memory accesses

False alias

Blocking/Tiling

Memory alignment - to see later

Row/Column major - work assignment…

4

Page 10: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Avoidable Memory Accesses

5

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] = c[i][j] + a[i][k] * b[k][j];

}}

Page 11: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Avoidable Memory AccessesIssue:

5

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] = c[i][j] + a[i][k] * b[k][j];

}}

Page 12: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Avoidable Memory AccessesIssue:

The same element in a data structure (matrix c) is being accessed twice from memory per cycle iteration

5

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] = c[i][j] + a[i][k] * b[k][j];

}}

Page 13: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Avoidable Memory AccessesIssue:

The same element in a data structure (matrix c) is being accessed twice from memory per cycle iteration

Solution:

5

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] = c[i][j] + a[i][k] * b[k][j];

}}

Page 14: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Avoidable Memory AccessesIssue:

The same element in a data structure (matrix c) is being accessed twice from memory per cycle iteration

Solution:Use a temporary variable to store intermediate results

5

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] = c[i][j] + a[i][k] * b[k][j];

}}

Page 15: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

False Alias

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] += a[i][k] * b[k][j];

}}

6

Page 16: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

False AliasIssue:

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] += a[i][k] * b[k][j];

}}

6

Page 17: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

False AliasIssue:

a, b, or c may be pointing to overlapped memory blocks

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] += a[i][k] * b[k][j];

}}

6

Page 18: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

False AliasIssue:

a, b, or c may be pointing to overlapped memory blocksCompiler is very cautions using optimisations

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] += a[i][k] * b[k][j];

}}

6

Page 19: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

False AliasIssue:

a, b, or c may be pointing to overlapped memory blocksCompiler is very cautions using optimisations

Solution:

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] += a[i][k] * b[k][j];

}}

6

Page 20: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

False AliasIssue:

a, b, or c may be pointing to overlapped memory blocksCompiler is very cautions using optimisations

Solution:General C++ case prefer references over pointers!

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] += a[i][k] * b[k][j];

}}

6

Page 21: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

False AliasIssue:

a, b, or c may be pointing to overlapped memory blocksCompiler is very cautions using optimisations

Solution:General C++ case prefer references over pointers!Give hints to the compiler (pragmas or restrict)

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] += a[i][k] * b[k][j];

}}

6

Page 22: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Memory Alignment

7

Page 23: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Memory Alignment

Issue:

7

Page 24: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Memory Alignment

Issue:Memory is not contiguously aligned

7

Page 25: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Memory Alignment

Issue:Memory is not contiguously alignedAlignment is not a multiple of 16 bytes

7

Page 26: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Memory Alignment

Issue:Memory is not contiguously alignedAlignment is not a multiple of 16 bytes

Solution:

7

Page 27: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Memory Alignment

Issue:Memory is not contiguously alignedAlignment is not a multiple of 16 bytes

Solution:Guarantee the proper alignment of data structures manually

7

Page 28: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Blocking/Tiling

8

Page 29: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Blocking/Tiling

8

Page 30: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Blocking/Tiling for I = 1 to N for J = 1 to N {read block C(I,J) into fast memory} for K = 1 to N {read block A(I,K) into fast memory} {read block B(K,J) into fast memory} do a matrix multiply on blocks to compute block C(I,J) {write block C(I,J) back to slow memory}

8

Page 31: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Compiler Flags

9

Page 32: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Compiler Flags

Compilers provide flags to enable sets of optimisations

9

Page 33: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Compiler Flags

Compilers provide flags to enable sets of optimisations

Both GNU and Intel compilers use the same syntax

9

Page 34: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Compiler Flags

Compilers provide flags to enable sets of optimisations

Both GNU and Intel compilers use the same syntax

“Free lunch” speedups!

9

Page 35: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Compiler Flags

10

Page 36: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Compiler Flags-O1

Enable a small set of optimisationsReduces both code size and execution timeTrade-off between performance and compilation time

10

Page 37: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Compiler Flags-O1

Enable a small set of optimisationsReduces both code size and execution timeTrade-off between performance and compilation time

-O2Performs almost all supported optimisations Generates faster code without compromising the space

10

Page 38: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Compiler Flags-O1

Enable a small set of optimisationsReduces both code size and execution timeTrade-off between performance and compilation time

-O2Performs almost all supported optimisations Generates faster code without compromising the space

-O3Almost all available optimisations without considering any trade-off

10

Page 39: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Vectorisation

How is it achieved? By hand?void matrixAdd (void) { __m256 ymm1, ymm2;

for (unsigned i = 0; i < SIZE; ++i) { for (unsigned j = 0; j < VEC_SIZE; ++j) { ymm1 = _mm256_load_ps(&m1[i][j * 8]); ymm2 = _mm256_load_ps(&m2[i][j * 8]);

ymm1 = _mm256_add_ps(ymm1, ymm2); _mm256_store_ps(&result[i][j * 8], ymm1); } }}

11

Page 40: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Vectorisation

12

Page 41: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

VectorisationLoop vectorisation is enabled by default with -O3 for the GNU compiler and -O2 for Intel compiler

12

Page 42: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

VectorisationLoop vectorisation is enabled by default with -O3 for the GNU compiler and -O2 for Intel compiler

…but is the code really vectorised?

12

Page 43: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

VectorisationLoop vectorisation is enabled by default with -O3 for the GNU compiler and -O2 for Intel compiler

…but is the code really vectorised?

12

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] += a[i][k] * b[k][j];

}}

Page 44: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

VectorisationLoop vectorisation is enabled by default with -O3 for the GNU compiler and -O2 for Intel compiler

…but is the code really vectorised?

12

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] += a[i][k] * b[k][j];

}}

Page 45: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

VectorisationLoop vectorisation is enabled by default with -O3 for the GNU compiler and -O2 for Intel compiler

…but is the code really vectorised?

12

void matMult (float **a, float **b, float **c, int size) {for (int i = 0; i < size; i++)

for (int j = 0; j < size; j++) {c[i][j] = 0;

for (int k = 0; k < size; k++) c[i][j] += a[i][k] * b[k][j];

}}

Page 46: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Vectorisation

Vectorisation report ICC: -vec-reportX, where X1 - Loops successfully vectorised2 - Loops not vectorised (and the justification)3 - Adds dependency information4 - Non-vectorised loops report5 - Non-vectorised loops report with dependency information

13

Page 47: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Help the compiler vectorise

14

Page 48: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Help the compiler vectorise

Give information about loop dependencies#pragma vector always#pragma ivdep

14

Page 49: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Help the compiler vectorise

Give information about loop dependencies#pragma vector always#pragma ivdep

Avoid nested loopsor use #pragma omp simd collapse(X) - OpenMP 4.0

14

Page 50: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Help the compiler vectorise

Give information about loop dependencies#pragma vector always#pragma ivdep

Avoid nested loopsor use #pragma omp simd collapse(X) - OpenMP 4.0

Data alignment and layout

14

Page 51: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Data LayoutEnsure that vector operations are applied to packed data; Ex. 512-bit Xeon Phi registers

15

Page 52: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Data Alignment

16

Page 53: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Data Alignment

Ensure that the memory allocation of static data structures is properly aligned

16

Page 54: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Data Alignment

Ensure that the memory allocation of static data structures is properly aligned

Depends on the size of the vector registers

__attribute__((align(DIM_SIMD))) float a [SIZE];

16

Page 55: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

EXERCISES - PART I

17

Page 56: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Exercise 1

Perform basic code optimisations

Parallelise code execution for shared memory systems

https://bitbucket.org/ampereira/matrix-optimization/downloads

18

Page 57: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Shared Memory Parallelism

Several libraries to produce parallel code for shared memory environments

OpenMP

TBB

CILK

19

Page 58: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

OpenMP

20

Page 59: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

OpenMP

Easy to use, pragma-based

20

Page 60: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

OpenMP

Easy to use, pragma-based

Implemented by default in both GNU and Intel compilers

-fopenmp - gcc-openmp - icc

20

Page 61: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

OpenMP

Easy to use, pragma-based

Implemented by default in both GNU and Intel compilers

-fopenmp - gcc-openmp - icc

Add the <omp.h> header to the code and use the pragmas

20

Page 62: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

(Very) Basic Pragmas

21

Page 63: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

(Very) Basic Pragmas

#pragma omp parallel options

#pragma omp for/task

21

Page 64: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

(Very) Basic Pragmas

#pragma omp parallel options

#pragma omp for/task

Options:num_threadsschedule(X)private(X)

21

Page 65: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Go In-depth

22

Page 66: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Go In-depthSeveral work scheduling heuristics available

22

Page 67: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Go In-depthSeveral work scheduling heuristics available

Definition of the task grain

22

Page 68: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Go In-depthSeveral work scheduling heuristics available

Definition of the task grain

Data scoping

22

Page 69: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Go In-depthSeveral work scheduling heuristics available

Definition of the task grain

Data scoping

Reduction, synchronisation, nowait clauses…

22

Page 70: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Go In-depthSeveral work scheduling heuristics available

Definition of the task grain

Data scoping

Reduction, synchronisation, nowait clauses…

Parallel tasks for irregular problems

22

Page 71: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Go In-depthSeveral work scheduling heuristics available

Definition of the task grain

Data scoping

Reduction, synchronisation, nowait clauses…

Parallel tasks for irregular problems

Nested parallelism

22

Page 72: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

EXERCISES - PART II

23

Page 73: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

Using Intel Composer on SeARCH

Always have 2 terminal windows openThe computing node (interactive qsub)The frontend, where the code is compiled

Run module load intel/2013.1.117 on both windowsIn the computing node to set the dynamic library pathsIn the frontend to use the compiler, headers, etc

If you get “Catastrophic error: could not set locale "" to allow processing of multibyte characters” do export LC_ALL=C

24

Page 74: CODE OPTIMISATION - Universidade do Minhogec.di.uminho.pt/MInf/cpd/AA/code_optim.pdf · CODE OPTIMISATION ON INTEL XEON (CO)PROCESSORS 1 André Pereira ampereira@di.uminho.pt. André

André Pereira, UMinho, 2015/2016

ReferencesAn Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors, James Reinders, Intel

Intel® Xeon Phi™ Coprocessor High Performance Programming, Jim Jeffers, James Reinders, Elsevier Waltham (Mass.), 2013

Intel® 64 and IA-32 Architectures Software Developer’s Manual

25