Getting Reproducible Results with Intel® MKL

sc13.supercomputing.org/sites/default/files/...

Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Why do results vary?

Root cause for variations in results

• Floating-point numbers: order of computation matters!

• Single precision example where (a+b)+c ≠ a+(b+c)

2^-63 + 1 + -1 = 2^-63 (infinitely precise result)

(2^-63 + 1) + -1 ≈ 0 (correct IEEE single precision result)

2^-63 + (1 + -1) ≈ 2^-63 (correct IEEE single precision result)

Order matters when doing floating point arithmetic.
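The effect is easy to reproduce. The sketch below uses Python floats, which are IEEE double precision rather than the single precision on the slide; the outcome is the same because 2^-63 is still far smaller than the spacing between 1.0 and its neighbouring values. The variable names are illustrative only.

```python
# Double-precision floats behave the same way here, because 2**-63 is far
# below the spacing between 1.0 and its neighbouring floats.
tiny = 2.0 ** -63

left_first = (tiny + 1.0) + -1.0   # tiny is absorbed into 1.0, then cancelled
right_first = tiny + (1.0 + -1.0)  # 1.0 and -1.0 cancel exactly first

print(left_first)            # 0.0
print(right_first == tiny)   # True
```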

Why are reproducible results important for users?

Technical/legacy: Software correctness is determined by comparison to previous ‘gold’ results.

Debugging: When developing and debugging, a higher degree of run-to-run stability is required to find potential problems.

Legal: Accreditation or approval of software might require exact reproduction of previously defined results.

Customer perception: Developers may understand the technical issues with reproducibility but still require reproducible results since end users or customers will be disconcerted by the inconsistencies.

Source: Email correspondence with Kai Diethelm of GNS. See his whitepaper: http://www.computer.org/cms/Computer.org/ComputingNow/homepage/2012/0312/W_CS_TheLimitsofReproducibilityinNumericalSimulation.pdf

Why does the order of operations change?

Many optimizations require a change in the order of operations.

Optimizations

• Instruction sets: the code path affects the grouping of data in registers

• Multiple cores / multiple processors: most functions are threaded to use as many cores as will give good scalability

• Non-deterministic task scheduling: some algorithms use asynchronous task scheduling for optimal performance
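To see how splitting work across cores changes the order of operations, consider a minimal sketch in which the same data is summed as one block and as two contiguous blocks. The helper names (`ordered_sum`, `chunked_sum`) and the input values are made up for illustration; this is not how Intel MKL partitions work.

```python
def ordered_sum(values):
    """Add values strictly left to right (no compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Sum contiguous chunks separately, then combine the partial sums in order."""
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [ordered_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return ordered_sum(partials)


data = [1e16, 1.0, -1e16, 1.0]
print(chunked_sum(data, 1))  # 1.0
print(chunked_sum(data, 2))  # 0.0
```

Both answers are correctly rounded results of a valid summation order; they differ only because the grouping of the additions changed with the number of chunks.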

How to get numerical reproducibility?

Conditional Numerical Reproducibility with Intel® MKL

• Data alignment is no longer a requirement for getting numerical reproducibility.

• But aligning input data is still a good idea for getting better performance.

MKL 11.0 (September 2012 release)

MKL 11.1 (September 2013 release)

CNR with Threading (Unsupported)

[Figure: row decomposition for the example below]

1 thread: main block 992, border 7

Current 3 thread decomposition: 333 / 333 / 333

CNR-safe static 3 thread decomposition: 328 / 328 / 343 (border 7)

• Consider m=n=k=999, M_kernel_unroll=8

• For CNR, pad the matrix chunks to be multiples of M_kernel_unroll such that the last thread executes the border code-paths
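The numbers in the figure can be reproduced with a short sketch. The functions below are hypothetical helpers written for this example, assuming each thread except the last receives the largest multiple of M_kernel_unroll that fits in an even share; they are not Intel MKL's actual partitioning code.

```python
def naive_partition(m, n_threads):
    """Split m rows as evenly as possible, ignoring the kernel unroll factor."""
    base, extra = divmod(m, n_threads)
    return [base + (1 if i < extra else 0) for i in range(n_threads)]


def cnr_safe_partition(m, n_threads, unroll):
    """Give every thread but the last a multiple of `unroll` rows.

    Only the last thread receives a remainder, so only it runs the
    border code path, as in the single-threaded case.
    """
    base = (m // n_threads) // unroll * unroll
    sizes = [base] * (n_threads - 1)
    sizes.append(m - base * (n_threads - 1))
    return sizes


print(naive_partition(999, 3))                          # [333, 333, 333]
print(cnr_safe_partition(999, 3, 8))                    # [328, 328, 343]
print([s % 8 for s in cnr_safe_partition(999, 3, 8)])   # [0, 0, 7]
```

With the even split, every thread has 333 mod 8 = 5 leftover rows and hits the border code. With the padded split, the only leftover is the same 7 rows that the single-threaded run has.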

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

CNR Code Path Controls

These are run time controls.
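One run-time control documented for Intel MKL is the MKL_CBWR environment variable (the C interface offers mkl_cbwr_set() for the same purpose). As a minimal sketch, a launcher script could set it before starting an MKL-based program. The helper name `cnr_environment`, the choice of the COMPATIBLE branch, and the program name in the comment are assumptions for illustration; consult the Intel MKL documentation for the values your version supports.

```python
import os


def cnr_environment(branch="COMPATIBLE", base_env=None):
    """Return a copy of the environment with MKL_CBWR set to `branch`."""
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_CBWR"] = branch
    return env


env = cnr_environment("COMPATIBLE", base_env={"PATH": "/usr/bin"})
print(env["MKL_CBWR"])  # COMPATIBLE

# A launcher would then pass `env` to the child process, for example:
#   subprocess.run(["./my_mkl_app"], env=env)   # hypothetical program name
```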

Balancing reproducibility and performance

Why “Conditional”?

• Reproducibility is currently available under certain conditions:

– Within a single operating system / architecture

    – Reproducibility only applies within the blue boxes, not between them…

– Reproducibility only with the same number of threads

– Reproducibility on supported servers and workstations

    – No support yet for Intel® Xeon Phi™ coprocessors

– Within a particular version of Intel® MKL

    – Results in version 11.1 update 1 may differ from results in version 11.1

– Reproducibility controls in Intel MKL only affect Intel MKL functions

[Figure: blue boxes grouped by operating system]

Linux*: IA32 | Intel® 64

Windows*: IA32 | Intel® 64

Mac OS X: IA32 | Intel® 64

What about floating point computations outside Intel® MKL?

Intel compilers provide switches to improve floating point results reproducibility:

• Run-to-run

– Compiler options “-fp-model precise -fp-model source”

• Processor-to-processor

– Compiler option “-fimf-arch-consistency=true”

These are compile time controls. You have to build your code using these switches.

How to do deterministic reductions?

For deterministic parallel reduction in OpenMP:

– KMP_DETERMINISTIC_REDUCTION=1

– Only available in the Intel implementation of OpenMP

For deterministic parallel reduction in Intel® TBB:

– Function “parallel_deterministic_reduce()”
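The idea behind a deterministic reduction can be sketched without either library: fix the way the data is split so that it does not depend on the number of workers, and combine the partial results in a fixed order regardless of which worker finishes first. The function names and the chunk size below are illustrative assumptions; this is not the Intel OpenMP or Intel TBB implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def ordered_sum(values):
    """Add values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def deterministic_sum(values, chunk_size=4, workers=2):
    """Parallel sum whose result does not depend on the worker count.

    The chunk boundaries depend only on chunk_size, and pool.map returns
    partial sums in chunk order, so the additions are always grouped
    the same way.
    """
    chunks = [values[i:i + chunk_size]
              for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(ordered_sum, chunks))
    return ordered_sum(partials)


data = [1e16, 1.0, -1e16, 1.0] * 3
print(deterministic_sum(data, workers=1) == deterministic_sum(data, workers=4))  # True
```

The result still depends on the chosen chunk size, which mirrors the condition stated earlier: reproducibility holds only while the decomposition of the work stays the same.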

Negligible impacts on performance

• Intel MKL’s LAPACK had two operations that were not “CNR-friendly”

    – Parallel reductions that were nondeterministic

    – Variable block sizes

• Removing these did not seem to have a significant impact on performance

Further resources

• Knowledgebase articles

– Intel MKL conditional numerical reproducibility

– Consistency of FP results using the Intel compilers

• Support

– Intel MKL user forum

– Intel Premier support