Getting Reproducible Results with Intel® MKL
Source: sc13.supercomputing.org/sites/default/files/...
Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Why do results vary?
Root cause for variations in results
• Floating-point numbers: order of computation matters!
• Single-precision example where (a+b)+c ≠ a+(b+c):
2^-63 + 1 + -1 = 2^-63 (infinitely precise result)
(2^-63 + 1) + -1 = 0 (correct IEEE single precision result)
2^-63 + (1 + -1) = 2^-63 (correct IEEE single precision result)
Order matters when doing floating point arithmetic.
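The single-precision example above can be checked with a short sketch. Python floats are double precision, so the helper names `f32` and `add32` (chosen here for illustration) round every intermediate result to IEEE single precision through the standard `struct` module. This emulation is an assumption of the sketch, not something taken from the slides.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add32(x, y):
    """Single-precision addition: add, then round the result to float32."""
    return f32(f32(x) + f32(y))


a, b, c = 2.0 ** -63, 1.0, -1.0

left = add32(add32(a, b), c)   # (a + b) + c: a is absorbed when added to 1
right = add32(a, add32(b, c))  # a + (b + c): b and c cancel first, a survives

print(left == 0.0)            # True
print(right == 2.0 ** -63)    # True
```

Both results are correctly rounded, yet they differ, which is exactly the situation the slide describes.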
Why are reproducible results important for users?
• Technical/legacy: Software correctness is determined by comparison to previous ‘gold’ results.
• Debugging: When developing and debugging, a higher degree of run-to-run stability is required to find potential problems.
• Legal: Accreditation or approval of software might require exact reproduction of previously defined results.
• Customer perception: Developers may understand the technical issues with reproducibility but still require reproducible results, since end users or customers will be disconcerted by the inconsistencies.
Source: Email correspondence with Kai Diethelm of GNS. See his whitepaper: http://www.computer.org/cms/Computer.org/ComputingNow/homepage/2012/0312/W_CS_TheLimitsofReproducibilityinNumericalSimulation.pdf
Why does the order of operations change?
Many optimizations require a change in order of operations.
Optimizations:
• Instruction sets: the code path affects the grouping of data in registers.
• Multiple cores / multiple processors: most functions are threaded to use as many cores as will give good scalability.
• Non-deterministic task scheduling: some algorithms use asynchronous task scheduling for optimal performance.
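A minimal sketch of how threading alone can change a result: the same four numbers are summed once as a single chunk and once as two chunks whose partial sums are then combined, which is roughly what happens when work is divided among threads. The function names and the data are illustrative assumptions, and the chunks run one after another here because only the grouping matters for the arithmetic. An explicit loop is used instead of the built-in `sum`, since recent Python versions apply compensated summation to floats.

```python
def sequential_sum(values):
    """Add values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Sum each contiguous chunk, then add the partial sums left to right."""
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [sequential_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return sequential_sum(partials)


data = [1e17, 1.0, -1e17, 1.0]

one_chunk = chunked_sum(data, 1)   # ((1e17 + 1) - 1e17) + 1
two_chunks = chunked_sum(data, 2)  # (1e17 + 1) + (-1e17 + 1)

print(one_chunk, two_chunks)       # 1.0 0.0
```

Neither grouping returns the exact sum of 2, and the two groupings disagree with each other, so changing the number of chunks changes the answer.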
How to get numerical reproducibility?
Conditional Numerical Reproducibility with Intel® MKL
• Data alignment is no longer a requirement for getting numerical reproducibility.
• But aligning input data is still a good idea for getting better performance.
[Figure: MKL 11.0 (September 2012 release) vs. MKL 11.1 (September 2013 release)]
CNR with Threading (Unsupported)
[Figure: row decompositions for m = 999]
1 thread: main block 992, border 7
Current 3-thread decomposition: 333, 333, 333
CNR-safe static 3-thread decomposition: 328, 328, 343 (the last chunk includes the border of 7)
• Consider m=n=k=999, M_kernel_unroll=8
• For CNR, pad the matrix chunks to be multiples of M_kernel_unroll such that the last thread executes the border code-paths
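The padding rule can be sketched as follows. This is a hypothetical partitioning that reproduces the numbers on the slide for m = 999 and M_kernel_unroll = 8; the function names are invented for illustration and the code is not Intel MKL's actual decomposition logic.

```python
def even_chunks(m, n_threads):
    """Split m rows as evenly as possible, ignoring the kernel unroll factor."""
    base, extra = divmod(m, n_threads)
    return [base + (1 if t < extra else 0) for t in range(n_threads)]


def cnr_safe_chunks(m, n_threads, unroll):
    """Give every thread except the last a multiple of `unroll` rows.

    The last thread takes all remaining rows, including the border
    (m % unroll rows) that does not fill a whole kernel block.
    """
    full_blocks = m // unroll
    per_thread = (full_blocks // n_threads) * unroll
    chunks = [per_thread] * (n_threads - 1)
    chunks.append(m - per_thread * (n_threads - 1))
    return chunks


print(even_chunks(999, 3))         # [333, 333, 333]
print(cnr_safe_chunks(999, 3, 8))  # [328, 328, 343]
```

With the even split, each thread owns 333 rows, which is not a multiple of 8, so every thread runs border code on its own rows. With the padded split, only the last thread touches the 7 border rows, as in the single-threaded case.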
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
CNR Code Path Controls
These are run time controls.
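As a sketch of what a run-time control can look like in practice, the snippet below prepares an environment for a child process with the `MKL_CBWR` environment variable set. The variable and the branch names are taken from the Intel MKL documentation as recalled here, and the list is not exhaustive; check them against the documentation for your MKL version. The helper names `cnr_env` and `run_with_cnr` are hypothetical.

```python
import os
import subprocess

# Branch names as recalled from the Intel MKL documentation (not exhaustive).
KNOWN_BRANCHES = {"AUTO", "COMPATIBLE", "SSE2", "SSE3", "SSSE3",
                  "SSE4_1", "SSE4_2", "AVX"}


def cnr_env(branch, base_env=None):
    """Return a copy of the environment with MKL_CBWR set to `branch`."""
    if branch not in KNOWN_BRANCHES:
        raise ValueError(f"unrecognized CNR branch: {branch!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_CBWR"] = branch
    return env


def run_with_cnr(command, branch):
    """Run an MKL-linked program (given as an argument list) with a fixed code path."""
    return subprocess.run(command, env=cnr_env(branch), check=True)


print(cnr_env("AVX", {})["MKL_CBWR"])  # AVX
```

Because the variable is set in the child's environment before the program starts, no rebuild is needed. For example, `run_with_cnr(["./my_solver"], "SSE4_2")` would launch a hypothetical `./my_solver` binary with that branch requested.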
Balancing reproducibility and performance
Why “Conditional”?
• Reproducibility is currently available under certain conditions:
– Within a single operating system / architecture
  – Reproducibility only applies within the blue boxes, not between them…
  [Figure: boxes for Linux* (IA32, Intel® 64), Windows* (IA32, Intel® 64), and Mac OS X (IA32, Intel® 64)]
– Reproducibility only with the same number of threads
– Reproducibility on supported servers and workstations
  – No support yet for Intel® Xeon Phi™ coprocessors
– Within a particular version of Intel® MKL
  – Results in version 11.1 update 1 may differ from results in version 11.1
– Reproducibility controls in Intel MKL only affect Intel MKL functions
What about floating point computations outside Intel® MKL?
Intel compilers provide switches to improve floating point results reproducibility:
• Run-to-run
– Compiler options “-fp-model precise -fp-model source”
• Processor-to-processor
– Compiler option “-fimf-arch-consistency=true”
These are compile time controls. You have to build your code using these switches.
How to do deterministic reductions?
For deterministic parallel reduction in OpenMP:
– KMP_DETERMINISTIC_REDUCTION=1
– Only available in the Intel implementation of OpenMP
For deterministic parallel reduction in Intel® TBB:
– Function “parallel_deterministic_reduce()”
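The idea behind a deterministic reduction can be sketched in a few lines: fix the chunk boundaries and the order in which partial results are combined, so that neither depends on how many workers happen to run. This is an illustration of the principle only, with invented function names. It is not how the Intel OpenMP runtime or Intel® TBB implement it.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential_sum(values):
    """Add values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def deterministic_reduce(values, grain, workers):
    """Sum fixed-size chunks in parallel, then combine them in index order.

    Chunk boundaries depend only on `grain`, never on `workers`, and
    pool.map returns the partial sums in submission order.
    """
    chunks = [values[i:i + grain] for i in range(0, len(values), grain)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sequential_sum, chunks))
    return sequential_sum(partials)


data = [1e17, 1.0, -1e17, 1.0]
results = {w: deterministic_reduce(data, grain=2, workers=w) for w in (1, 2, 4)}
print(results)  # {1: 0.0, 2: 0.0, 4: 0.0}
```

The result is not the exact sum, but it is the same for every worker count, which is the property a deterministic reduction is meant to provide.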
Negligible impacts on performance
• Intel MKL’s LAPACK had two kinds of operations that were not “CNR-friendly”:
– Parallel reductions that were nondeterministic
– Variable block sizes
• Removing these did not seem to have a significant impact on performance.
Further resources
• Knowledgebase articles
– Intel MKL conditional numerical reproducibility
– Consistency of FP results using the Intel compilers
• Support
– Intel MKL user forum
– Intel Premier support