Getting Reproducible Results with Intel® MKL 11.0

Getting Reproducible Results with Intel® MKL 11.0 Todd Rosenquist Technical Consulting Engineer Intel® Math Kernel Library


Transcript of Getting Reproducible Results with Intel® MKL 11.0

Page 1: Getting Reproducible Results  with  Intel® MKL  11.0

Getting Reproducible Results with Intel® MKL 11.0

Todd Rosenquist, Technical Consulting Engineer, Intel® Math Kernel Library

Page 2: Getting Reproducible Results  with  Intel® MKL  11.0

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

The agenda

Reproducible results in Intel MKL
• The symptom
• The problem
• The reality
• The requirements
• A conditional solution
• A beginner’s guide
• Performance
• Further resources

Try the feature in the recently released Intel® MKL 11.0


Page 3: Getting Reproducible Results  with  Intel® MKL  11.0

Ever seen something like this?

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678902222

C:\Users\me>test.exe
4.012345678902222

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678902222

Page 4: Getting Reproducible Results  with  Intel® MKL  11.0

…or this?

Intel® Xeon® Processor E5540

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

Intel® Xeon® Processor E3-1275

C:\Users\me>test.exe
4.012345678902222

C:\Users\me>test.exe
4.012345678902222

C:\Users\me>test.exe
4.012345678902222

C:\Users\me>test.exe
4.012345678902222

C:\Users\me>test.exe
4.012345678902222

Page 5: Getting Reproducible Results  with  Intel® MKL  11.0

Why do results vary?

Root cause for variations in results
• With floating-point numbers, the order of computation matters!
• Double precision example where (a+b)+c ≠ a+(b+c):

2^-63 + 1 + -1 = 2^-63 (infinitely precise result)

(2^-63 + 1) + -1 = 0 (correct IEEE double precision result)

2^-63 + (1 + -1) = 2^-63 (correct IEEE double precision result)

Order matters when doing floating point arithmetic.

Page 6: Getting Reproducible Results  with  Intel® MKL  11.0

Why does the order of operations change in Intel MKL?

Many optimizations require a change in order of operations.

Optimizations
• Instruction sets: code path optimized to use all the processor features available on the system where the program is run
• Memory alignment: affects grouping of data in registers
• Multiple cores / multiple processors: most functions are threaded to use as many cores as will give good scalability
• Non-deterministic task scheduling: some algorithms use asynchronous task scheduling for optimal performance

Page 7: Getting Reproducible Results  with  Intel® MKL  11.0

Why are reproducible results important for Intel MKL users?

Technical/legacy
Software correctness is determined by comparison to previous ‘gold’ results.

Debugging
When developing and debugging, a higher degree of run-to-run stability is required to find potential problems.

Legal
Accreditation or approval of software might require exact reproduction of previously defined results.

Customer perception
Developers may understand the technical issues with reproducibility but still require reproducible results since end users or customers will be disconcerted by the inconsistencies.

Source: Email correspondence with Kai Diethelm of GNS. See his whitepaper: http://www.computer.org/cms/Computer.org/ComputingNow/homepage/2012/0312/W_CS_TheLimitsofReproducibilityinNumericalSimulation.pdf

Page 8: Getting Reproducible Results  with  Intel® MKL  11.0

Balancing Reproducibility and Performance: Conditional Numerical Reproducibility (CNR)

Memory alignment
• Align memory: try Intel MKL memory allocation functions
• 64-byte alignment for processors in the next few years

Number of threads
• Set the number of threads to a constant number
• Use sequential libraries

Deterministic task scheduling (New!)
• Ensures that floating-point operations occur in the same order from run to run, so results are reproducible

Code path control (New!)
• Maintains consistent code paths across processors
• Will often mean lower performance on the latest processors

Goal: Achieve best performance possible for cases that require reproducibility

Page 9: Getting Reproducible Results  with  Intel® MKL  11.0

Why “Conditional”?

• In Intel MKL 11.0 reproducibility is currently available under certain conditions:
– Within a single operating system / architecture
   – Reproducibility only applies within the blue boxes in the diagram (listed below), not between them
       Linux*: IA32, Intel® 64
       Windows*: IA32, Intel® 64
       Mac OS X: IA32, Intel® 64
– Reproducibility on all supported servers and workstations
   – No support yet for Intel® Xeon Phi™ coprocessors
– Within a particular version of Intel MKL
   – Results in version 11.0 update 1 may differ from results in version 11.0
– Reproducibility controls in Intel MKL only affect Intel MKL functions

Page 10: Getting Reproducible Results  with  Intel® MKL  11.0

Conditions for reproducibility

Aligned input and output arrays in function calls
• 16-byte alignment for the family of SSE instruction sets
• 32-byte alignment for AVX
• 64-byte alignment for future processors <- choose this to be safe

Set the same number of computational threads for the library in each run

Use the same Intel MKL parameters from run-to-run
• Example: You cannot call a function in 3 blocks in one run and 4 blocks in the next

Use the new functions & controls to ensure deterministic task scheduling and to control code paths
• CNR controls must be set or called before any computational math functions in Intel MKL


Page 11: Getting Reproducible Results  with  Intel® MKL  11.0

Example - COMPATIBLE

For reproducible results on Intel and Intel-compatible CPUs supporting SSE2 instructions or later
• function call
    mkl_cbwr_set(MKL_CBWR_COMPATIBLE)
• or environment variable
    set MKL_CBWR=COMPATIBLE

Note: MKL_CBWR_COMPATIBLE is provided because Intel and Intel-compatible CPUs have approximation instructions (e.g., rcpps/rsqrtps) that may return different results. This option ensures that Intel MKL uses an SSE2-only code path that does not contain any of these instructions.

Page 12: Getting Reproducible Results  with  Intel® MKL  11.0

Example – SSE2

For the same results on every Intel processor that supports SSE2 instructions or later
• function call
    mkl_cbwr_set(MKL_CBWR_SSE2)
• or environment variable
    set MKL_CBWR=SSE2

Note: on non-Intel processors the results may differ since only the MKL_CBWR_COMPATIBLE path is supported

Page 13: Getting Reproducible Results  with  Intel® MKL  11.0

Example – SSE4.2

For the same results on every Intel processor that supports SSE4.2 instructions or later
• function call
    mkl_cbwr_set(MKL_CBWR_SSE4_2)
• or environment variable
    set MKL_CBWR=SSE4_2

Note: on non-Intel processors the results may differ since only the MKL_CBWR_COMPATIBLE path is supported

Page 14: Getting Reproducible Results  with  Intel® MKL  11.0

Example – deterministic task scheduling

For consistent results on all supported processors without fixing the code branch
• function call
    mkl_cbwr_set(MKL_CBWR_AUTO)
• or environment variable
    set MKL_CBWR=AUTO

• Note
– This will ensure deterministic task scheduling
– It will not give you reproducibility from processor to processor

Page 15: Getting Reproducible Results  with  Intel® MKL  11.0

Example – Find out the best performing option from a pool of processors

For the best option given a pool of computing resources in a grid setting, you may launch a simple program as follows:

#include <stdio.h>
#include <mkl.h>

int main(void) {
    int my_cbwr_branch;

    /* Find the available MKL_CBWR_BRANCH */
    my_cbwr_branch = mkl_cbwr_get_auto_branch();

    /* mkl_cbwr_set returns MKL_CBWR_SUCCESS when the branch was set */
    if (mkl_cbwr_set(my_cbwr_branch) != MKL_CBWR_SUCCESS) {
        printf("Error in setting branch. Aborting...\n");
        return 1; /* failure: 1 is not one of the branch codes listed below */
    }

    /* Report the branch to the caller through the exit code */
    return my_cbwr_branch;
}

Examine all results and use mkl_cbwr_set(<minimum_result>)

The full list of options:
COMPATIBLE  3
SSE2        4
SSE3        5
SSSE3       6
SSE4_1      7
SSE4_2      8
AVX         9
AVX2        10

Page 16: Getting Reproducible Results  with  Intel® MKL  11.0


Page 17: Getting Reproducible Results  with  Intel® MKL  11.0

Change this sort of inconsistency…

Before:

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678902222

C:\Users\me>test.exe
4.012345678902222

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678902222

After:

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

• Align memory
• Constant # of threads
• Turn on CNR with either
    mkl_cbwr_set(MKL_CBWR_AUTO)
  or
    set MKL_CBWR=AUTO

Page 18: Getting Reproducible Results  with  Intel® MKL  11.0

Change this inconsistency in results…

Intel® Xeon® Processor E5540

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

Intel® Xeon® Processor E3-1275

C:\Users\me>test.exe
4.012345678902222

C:\Users\me>test.exe
4.012345678902222

C:\Users\me>test.exe
4.012345678902222

C:\Users\me>test.exe
4.012345678902222

C:\Users\me>test.exe
4.012345678902222

Page 19: Getting Reproducible Results  with  Intel® MKL  11.0

…to get reproducible results?

Intel® Xeon® Processor E5540 (Supporting SSE4.2 instructions)

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

Intel® Xeon® Processor E3-1275 (Supporting AVX instructions)

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

C:\Users\me>test.exe
4.012345678901111

• Align memory
• Constant # of threads
• Turn on CNR with either…
    mkl_cbwr_set(MKL_CBWR_SSE4_2)
  or
    set MKL_CBWR=SSE4_2

Page 20: Getting Reproducible Results  with  Intel® MKL  11.0


What’s next?


https://softwareproductsurvey.intel.com/survey/150072/1afd/

Page 21: Getting Reproducible Results  with  Intel® MKL  11.0

Further resources on conditional numerical reproducibility

Intel MKL Documentation (online and in the product)
• Intel MKL User’s Guide
• Reference Manual

Knowledgebase articles on CNR

Support
• Intel MKL user forum
• Intel Premier support

Feedback
• Survey: https://softwareproductsurvey.intel.com/survey/150072/1afd/

Page 22: Getting Reproducible Results  with  Intel® MKL  11.0

New optimizations and features

Support for the Intel® Xeon Phi™ coprocessor based on the Intel® Many Integrated Core Architecture (Intel® MIC Architecture) on Linux* only

Optimizations using the new Intel® Advanced Vector Extensions 2 (AVX2) including the new FMA3 instructions

FFTs: Completed support for real-to-complex transforms with sizes given by 64-bit integers

Local threading control function
• mkl_set_num_threads_local()

Page 23: Getting Reproducible Results  with  Intel® MKL  11.0

Sept 18th, 2012 9:00AM: Interesting ties between tools and new hardware features: How Intel Tools support the many new features in processors and coprocessors

Oct 2nd, 2012 9:00AM: Pointer Checker: Catch Out-of-Bounds Memory Accesses Easily!

Oct 16th, 2012 9:00AM: How Intel® Parallel Studio XE is used to improve the HMMER application

Oct 30th, 2012 9:00AM: Using the Intel® Math Kernel Library 11.0 and Compiler to Obtain Run-to-Run Reproducible Results

Oct 9th, 2012 9:00AM: Achieving better parallel performance of Fortran programs with Intel® VTune™ Amplifier XE profiling

Oct 23rd, 2012 9:00AM: Three common Fortran mistakes you can avoid by using Intel® Inspector XE

Nov 6th, 2012 9:00AM: Avoid common parallelization mistakes with the help of Intel® Advisor XE

Dec 4th, 2012 9:00AM: Fortran 2008 Standard Parallel Programming Features in Intel® Fortran Composer XE*

http://software.intel.com/en-us/fall-webinar-series-psxe-and-fsxe

Page 24: Getting Reproducible Results  with  Intel® MKL  11.0

Summary

Conditional Numerical Reproducibility (CNR) provides:
• reproducible results from run-to-run
• reproducible results from processor-to-processor
• the ability to balance reproducibility requirements with great performance

Evaluate CNR in the following:
• Intel® Math Kernel Library 11.0
• Intel® Composer XE 2013
• Intel® Parallel Studio XE 2013
• Intel® Cluster Studio XE 2013

Provide feedback: https://softwareproductsurvey.intel.com/survey/150072/1afd/

Page 25: Getting Reproducible Results  with  Intel® MKL  11.0


Page 26: Getting Reproducible Results  with  Intel® MKL  11.0

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Page 27: Getting Reproducible Results  with  Intel® MKL  11.0