MKL for MIC training - 03192012 - The Linux Foundation

Intel® Math Kernel Library Perspectives and Latest Advances

Noah Clemons Lead Technical Consulting Engineer Developer Products Division, Intel

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

After Compiler and Threading Libraries, what’s next? Intel® Performance Libraries

Where your workload, Intel® Integrated Performance Primitives (Intel® IPP), and Intel® Math Kernel Library (Intel® MKL) intersect

• Library will generate hand-tuned Intel AVX code

• Intel Performance Libraries work with any compiler

• Functions for common algorithms

• Thread-safe functions

• No blocking calls

• Multiple OS support

• Minimal OS overhead

• User-replaceable memory allocation mechanism

[Figure: layered stack, top to bottom: MKL & IPP, Compiler, SIMD, CPU]

Harness architecture features with performance libraries tailored for a wide variety of computational domains


Agenda

• What is Intel® Math Kernel Library (Intel® MKL)?

• Intel® MKL supports Intel® Xeon Phi™ Coprocessor

• Intel® MKL usage models on Intel® Xeon Phi™

– Automatic Offload

– Compiler Assisted Offload

– Native Execution

• Performance roadmap

• Conditional Numerical Reproducibility

• Where to find more information?



Linear Algebra

•BLAS

•LAPACK

•Sparse solvers

•ScaLAPACK

Fast Fourier Transforms

•Multidimensional (up to 7D)

•FFTW interfaces

•Cluster FFT

Vector Math

•Trigonometric

•Hyperbolic

•Exponential, Logarithmic

•Power / Root

•Rounding

Vector Random Number Generators

•Congruential

•Recursive

•Wichmann-Hill

•Mersenne Twister

•Sobol

•Niederreiter

•Non-deterministic

Summary Statistics

•Kurtosis

•Variation coefficient

•Quantiles, order statistics

•Min/max

•Variance-covariance

•…

Data Fitting

•Splines

•Interpolation

•Cell search

Intel® MKL is industry’s leading math library*

[Figure: Intel® MKL across platforms. Labels: Source; Multicore CPU; Many-core Intel® MIC co-processor; Multicore Cluster; Clusters with Multicore and Many-core]

*2012 Evans Data N. American developer survey


Intel® MKL Support for Intel® Xeon Phi™ Coprocessor

• Intel® MKL 11.0 supports the Intel® Xeon Phi™ coprocessor.

• Heterogeneous computing

– Takes advantage of both multicore host and many-core coprocessors

• Optimized for wider (512-bit) SIMD instructions

• Flexible usage models:

– Automatic Offload: Offers transparent heterogeneous computing

– Compiler Assisted Offload: Allows fine offloading control

– Native execution: Use the coprocessors as independent nodes

Using Intel® MKL on Intel® Xeon Phi™

• Performance scales from multicore to many-core

• Familiarity of architecture and programming models

• Code re-use, faster time-to-market


Usage Models on Intel® Xeon Phi™ Coprocessor

• Automatic Offload

– No code changes required

– Automatically uses both host and target

– Transparent data transfer and execution management

• Compiler Assisted Offload

– Explicit control of data transfer and remote execution using compiler offload pragmas/directives

– Can be used together with Automatic Offload

• Native Execution

– Uses the coprocessors as independent nodes

– Input data is copied to targets in advance


Execution Models

[Figure: spectrum of execution models, from multicore centric (Intel® Xeon®) to many-core centric (Intel® Xeon Phi™)]

• Multicore Hosted: general purpose serial and parallel computing

• Offload: codes with highly parallel phases

• Symmetric: codes with balanced needs

• Many-Core Hosted: highly parallel codes


Automatic Offload (AO)

• Offloading is automatic and transparent

• By default, Intel MKL decides:

– When to offload

– Work division between host and targets

• Users enjoy host and target parallelism automatically

• Users can still control work division to fine-tune performance


How to Use Automatic Offload

• Using Automatic Offload is easy. Either call a function:

mkl_mic_enable()

or set an environment variable:

MKL_MIC_ENABLE=1

• What if there is no coprocessor in the system?

– Runs on the host as usual without any penalty


Automatic Offload Enabled Functions

• A select set of MKL functions is AO enabled.

– Only functions with sufficient computation to offset data transfer overhead are subject to AO

• In 11.0, AO enabled functions include:

– Level-3 BLAS: ?GEMM, ?TRSM, ?TRMM

– LAPACK 3 amigos: LU, QR, Cholesky


Work Division Control in Automatic Offload

Example (function call): mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5)

Notes: Offload 50% of computation only to the 1st card.

Example (environment variable): MKL_MIC_0_WORKDIVISION=0.5

Notes: Offload 50% of computation only to the 1st card.


Tips for Using Automatic Offload

• AO works only when matrix sizes are large

– ?GEMM: Offloading only when M, N > 2048

– ?TRSM/TRMM: Offloading only when M, N > 3072

– Square matrices may give better performance

• Work division settings are just hints to MKL runtime

• How to disable AO after it is enabled?

– mkl_mic_disable( ), or

– mkl_mic_set_workdivision(MKL_TARGET_HOST, 0, 1.0), or

– MKL_HOST_WORKDIVISION=100



Tips for Using Automatic Offload

• Threading control tips:

• Avoid using the OS core of the target, which handles data transfer and housekeeping tasks. Example (on a 60-core card):

MIC_KMP_AFFINITY=explicit,granularity=fine,proclist=[1-236:1]

• Also, prevent thread migration on the host:

KMP_AFFINITY=granularity=fine,compact,1,0

• Note: Different environment variables on host and coprocessor:

Host: OMP_NUM_THREADS, coprocessor: MIC_OMP_NUM_THREADS

Host: KMP_AFFINITY, coprocessor: MIC_KMP_AFFINITY



Compiler Assisted Offload (CAO)

• Offloading is explicitly controlled by compiler pragmas or directives.

• All MKL functions can be offloaded in CAO.

– In comparison, only a subset of MKL is subject to AO.

• Can leverage the full potential of compiler’s offloading facility.

• More flexibility in data transfer and remote execution management.

– A big advantage is data persistence: Reusing transferred data for multiple operations



How to Use Compiler Assisted Offload

• The same way you would offload any function call to the coprocessor.

• An example in C:


#pragma offload target(mic) \

in(transa, transb, N, alpha, beta) \

in(A:length(matrix_elements)) \

in(B:length(matrix_elements)) \

in(C:length(matrix_elements)) \

out(C:length(matrix_elements) alloc_if(0))

{

sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,

&beta, C, &N);

}


How to Use Compiler Assisted Offload

• An example in Fortran:


!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM

!DEC$ OMP OFFLOAD TARGET( MIC ) &

!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &

!DEC$ IN( A: LENGTH( NCOLA * LDA )), &

!DEC$ IN( B: LENGTH( NCOLB * LDB )), &

!DEC$ INOUT( C: LENGTH( N * LDC ))

!$OMP PARALLEL SECTIONS

!$OMP SECTION

CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, &

A, LDA, B, LDB, BETA, C, LDC )

!$OMP END PARALLEL SECTIONS


Data Persistence with CAO

__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to coprocessor and do not de-allocate matrices A and B
#pragma offload target(mic) \
in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
in(A:length(NCOLA * LDA) free_if(0)) \
in(B:length(NCOLB * LDB) free_if(0)) \
inout(C:length(N * LDC))
{
sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to coprocessor and reuse matrices A and B
#pragma offload target(mic) \
in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(0)) \
nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(0)) \
inout(C1:length(N * LDC1))
{
sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload target(mic) \
nocopy(A:length(NCOLA * LDA) free_if(1)) \
nocopy(B:length(NCOLB * LDB) free_if(1))
{ }


Tips for Using Compiler Assisted Offload

• Use data persistence to avoid unnecessary data copying and memory alloc/de-alloc

• Thread affinity: avoid using the OS core. Example for a 60-core coprocessor:

MIC_KMP_AFFINITY=explicit,granularity=fine,proclist=[1-236:1]

• Use huge (2MB) pages for memory allocation in user code:

MIC_USE_2MB_BUFFERS=64K

• The value of MIC_USE_2MB_BUFFERS is a threshold. E.g., allocations of 64K bytes or larger will use huge pages.



Using AO and CAO in the Same Program

• Users can use AO for some MKL calls and use CAO for others in the same program

– Only supported by Intel compilers

– Work division must be set explicitly for AO

– Otherwise, all MKL AO calls are executed on the host



Native Execution

• Programs can be built to run only on the coprocessor by using the -mmic build option.

• Tips for using MKL in native execution:

• Use all threads to get best performance (for 60-core coprocessor)

MIC_OMP_NUM_THREADS=240

• Thread affinity setting

KMP_AFFINITY=explicit,proclist=[1-240:1,0,241,242,243],granularity=fine

• Use huge pages for memory allocation (more details on the “Intel Developer Zone”):

– The mmap( ) system call

– The open-source libhugetlbfs.so library: http://libhugetlbfs.sourceforge.net/




Suggestions on Choosing Usage Models

• Choose native execution if

– Highly parallel code.

– Want to use coprocessors as independent compute nodes.

• Choose AO when

– A sufficient Byte/FLOP ratio makes offload beneficial.

– Level-3 BLAS functions: ?GEMM, ?TRMM, ?TRSM.

– LU and QR factorization (in upcoming release updates).

• Choose CAO when either

– There is enough computation to offset data transfer overhead.

– Transferred data can be reused by multiple operations

• You can always run on the host if offloading does not achieve better performance



Linking Examples

• AO: The same way of building code on Xeon!

icc -O3 -mkl sgemm.c -o sgemm.exe

• Native: Using -mmic

icc -O3 -mmic -mkl sgemm.c -o sgemm.exe

• CAO: Using -offload-option

icc -O3 -openmp -mkl \
-offload-option,mic,ld,"-L$MKLROOT/lib/mic -Wl,\
--start-group -lmkl_intel_lp64 -lmkl_intel_thread \
-lmkl_core -Wl,--end-group" sgemm.c -o sgemm.exe


Intel® MKL Link Line Advisor

• A web tool to help users to choose correct link line options.

• http://software.intel.com/sites/products/mkl/

• Also available offline in the MKL product package.



Performance Roadmap

• The following components are well optimized for Intel Xeon Phi coprocessors:

• BLAS Level 3, and much of Level 1 & 2

• Sparse BLAS: ?CSRMV, ?CSRMM

• LU, Cholesky, and QR factorization

• FFTs: 1D/2D/3D, SP and DP, r2c, c2c

• VML (real floating point functions)

• Random number generators:

– MT19937, MT2203, MRG32k3a

– Discrete Uniform and Geometric



Where to Find More Code Examples

$MKLROOT/examples/mic_samples

– ao_sgemm: AO example

– dexp: VML example (vdExp)

– dgaussian: double precision Gaussian RNG

– fft: complex-to-complex 1D FFT

– sexp: VML example (vsExp)

– sgaussian: single precision Gaussian RNG

– sgemm: SGEMM example

– sgemm_f: SGEMM example (Fortran 90)

– sgemm_reuse: SGEMM with data persistence

– sgeqrf: QR factorization

– sgetrf: LU factorization

– spotrf: Cholesky

– solverc: PARDISO examples


Where to Find MKL Documentation?

• How to use MKL on Intel® Xeon Phi™ coprocessors:

• “Using Intel Math Kernel Library on Intel® Xeon Phi™

Coprocessors” section in the User’s Guide.

• “Support Functions for Intel Many Integrated Core

Architecture” section in the Reference Manual.

• Tips, app notes, etc. can also be found on the “Intel

Developer Zone”:

• http://www.intel.com/software/mic

• http://www.intel.com/software/mic-developer





Run-to-run Numerical Reproducibility with the Intel® Math Kernel Library and Intel® Composer XE 2013


Agenda

• Why do floating point results vary?

• Reproducibility in the Intel compilers

• New reproducibility features in Intel® MKL



Ever seen something like this?


C:\Users\me>test.exe

4.012345678901111


4.012345678902222


4.012345678902222


4.012345678901111


4.012345678902222


…or this on different processors?

Intel® Xeon® Processor E5540:

4.012345678901111

4.012345678901111

4.012345678901111

4.012345678901111

Intel® Xeon® Processor E3-1275:

4.012345678902222

4.012345678902222

4.012345678902222

4.012345678902222


Why do results vary?

Root cause for variations in results in Intel MKL

• Floating-point numbers and rounding

• Double precision example where $(a+b)+c \neq a+(b+c)$:

$2^{-63} + 1 + (-1) = 2^{-63}$ (mathematical result)

$(2^{-63} + 1) + (-1) = 0$ (correct IEEE result)

$2^{-63} + (1 + (-1)) = 2^{-63}$ (correct IEEE result)



Why might the order of operations change in a computer program?

Many optimizations require a change in order of operations.

• Instruction sets: memory alignment affects grouping of data in registers

• Multiple cores / multiple processors: most functions are threaded to use as many cores as will give good scalability

• Non-deterministic task scheduling: some algorithms use asynchronous task scheduling for optimal performance

• Code path: optimized to use all the processor features available on the system where the program is run


Why are reproducible results important for Intel MKL users?

• Technical / legacy: Software correctness is determined by comparison to previous ‘gold’ results.

• Debugging / porting: When developing and debugging, a higher degree of run-to-run stability is required to find potential problems.

• Legal: Accreditation or approval of software might require exact reproduction of previously defined results.

• Customer perception: Developers may understand the technical issues with reproducibility but still require reproducible results since end users or customers will be disconcerted by the inconsistencies.

Source: Email correspondence with Kai Diethelm of GNS. See his whitepaper: http://www.computer.org/cms/Computer.org/ComputingNow/homepage/2012/0312/W_CS_TheLimitsofReproducibilityinNumericalSimulation.pdf


What are the ingredients for reproducibility

Source code

Tools

• Compilers

• Libraries



Floating Point Semantics

The -fp-model (/fp:) compiler switch lets you choose the floating point semantics at a coarse granularity. It lets you specify the compiler rules for:

– Value safety (our main focus)

– FP expression evaluation

– FPU environment access

– Precise FP exceptions

– FP contractions (fused multiply-add)


The -fp-model & /fp: switches

• Recommendation for reproducibility: -fp-model precise

– for reproducible results and for ANSI/IEEE standards compliance, C++ & Fortran

Value: Description

fast[=1] (default): allows value-unsafe optimizations

fast=2: allows additional optimizations

precise: value-safe optimizations only

except: enables floating point exception semantics

strict: precise + except + disable fma + don’t assume default floating-point environment


Reassociation -fp-model precise

• disables reassociation

• enforces C std conformance (left-to-right)

• may carry a significant performance penalty

Parentheses are respected only in value-safe mode!

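#include <iostream>
#define N 100

int main() {
    float a[N], b[N];
    float c = -1., tiny = 1.e-20F;
    for (int i=0; i<N; i++) a[i]=1.0;
    for (int i=0; i<N; i++) {
        // Left to right, (a[i] + c) + tiny gives 1e-20, so b[i] is about 1e20.
        // If the compiler reassociates to a[i] + (c + tiny), tiny is lost when
        // c + tiny is rounded, a[i] becomes 0, and b[i] becomes inf.
        a[i] = a[i] + c + tiny;
        b[i] = 1/a[i];
    }
    std::cout << "a = " << a[0]
              << " b = " << b[0]
              << "\n";
}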


Run-to-Run Variations (single-threaded)

Data alignment may vary from run to run, due to changes in the external environment

• E.g. malloc of a string to contain date, time, user name or directory: the size of the allocation affects the alignment of subsequent mallocs

• Compiler may “peel” scalar iterations off the start of the loop until subsequent memory accesses are aligned, so that the main loop kernel can be vectorized efficiently

• For reduction loops, this changes the composition of the partial sums, hence changes rounding and the final result

• Occurs for both gcc and icc, when compiling for Intel® AVX

To avoid, align data:

_mm_malloc(size, 32) (icc only)

mkl_malloc(size, 32) (Intel MKL)

• or compile with -fp-model precise (icc) or without -ffast-math (gcc) (larger performance impact)


Typical Performance Impact

• SPECCPU2006fp benchmark suite compiled with -O2 or -O3

• Geomean performance reduction due to -fp-model precise and -fp-model source: 12% - 15%

– Intel Compiler XE 2011 (12.0)

– Measured on Intel Xeon® 5650 system with dual, 6-core processors at 2.67 GHz, 24GB memory, 12MB cache, SLES* 10 x64 SP2

• Performance impact can vary between applications

Use -fp-model precise to improve floating point reproducibility while limiting performance impact


How did Intel MKL handle reproducibility historically?

Through MKL 10.3 (Nov. 2011), the recommendation was to:

• Align your input/output arrays using the Intel MKL memory manager

• Call sequential Intel MKL

• This meant the user needed to handle threading themselves



Balancing Reproducibility and Performance: Conditional Numerical Reproducibility (CNR)

New!

Goal: Achieve best performance possible for cases that require reproducibility

Memory alignment

• Align memory: try Intel MKL memory allocation functions

• 64-byte alignment for processors in the next few years

Number of threads

• Set the number of threads to a constant number

• Use sequential libraries

Deterministic task scheduling

• Ensures that FP operations occur in a fixed order, so that results are reproducible

Code path control

• Maintains consistent code paths across processors

• Will often mean lower performance on the latest processors


Controls for CNR features



Why “Conditional”?

• In Intel MKL 11.0 reproducibility is currently available under certain conditions:

– Within a single operating system / architecture

– Reproducibility only applies within the blue boxes, not between them…

– Within a particular version of Intel MKL

– Results in version 11.0 update 1 may differ slightly from results in version 11.0

– Reproducibility controls in Intel MKL only affect Intel MKL functions; other tools, and other parts of your code can still cause variation

[Figure: one blue box per operating system / architecture combination: Linux* (IA32, Intel® 64), Windows* (IA32, Intel® 64), Mac OS X (IA32, Intel® 64)]

Send us your feedback!


CNR Impact on Performance of Intel® Optimized LINPACK Benchmark



What’s next?


https://softwareproductsurvey.intel.com/survey/150072/1afd/




Further resources on reproducibility

Reference manuals, User Guides, Getting Started Guides…

• Intel® MKL Documentation: http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation

• Intel® Fortran Composer XE 2013 Documentation: http://software.intel.com/en-us/articles/intel-fortran-composer-xe-documentation/

• Intel® C++ Composer XE 2013 Documentation: http://software.intel.com/en-us/articles/intel-c-composer-xe-documentation/

Knowledgebase:

• CNR in Intel MKL 11.0: http://software.intel.com/en-us/articles/conditional-numerical-reproducibility-cnr-in-intel-mkl-110

• Consistency of Floating-Point Results: http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler

Support

• Intel MKL user forum: http://software.intel.com/en-us/forums/intel-math-kernel-library/

• Intel compiler forums:

– IVF: http://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows

– Fortran: http://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x

– C++: http://software.intel.com/en-us/forums/intel-c-compiler

• Intel Premier support: http://premier.intel.com/

Feedback

• Survey: https://softwareproductsurvey.intel.com/survey/150072/1afd/



Summary

When writing programs for reproducible results:

• Write your source code with reproducibility in mind

• Use “-fp-model precise” with Intel compilers

• Use the new Conditional Numerical Reproducibility (CNR) features in Intel MKL

Evaluate CNR in the following:

• Intel® Math Kernel Library 11.0: http://software.intel.com/en-us/intel-mkl

• Intel® Composer XE 2013: http://software.intel.com/en-us/intel-composer-xe/

• Intel® Parallel Studio XE 2013: http://software.intel.com/en-us/intel-parallel-studio-xe/

• Intel® Cluster Studio XE 2013: http://software.intel.com/en-us/intel-cluster-studio-xe/

Provide feedback:




INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice

