Sergey Maidanov
Software Engineering Manager for Intel® Distribution for Python*
Copyright © 2016, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
Optimization Notice
Introduction
Python is among the most popular programming languages, especially for prototyping, but it sees very limited use in production.
Many computational problems require HPC/Big Data production environments. One option is to hire a team of Java/C++ programmers; the alternative is to ease access for Python researchers and/or have a team of Python programmers deploy optimized Python in production.
Python is the #1 programming language in hiring demand, followed by Java and C++, and the demand is growing.
What is required to make Python performance closer to native code?
[Chart: Black-Scholes formula throughput, MOptions/sec (log scale from 0.2 to 15625), for Python, C, and C with parallelism. Static compilation gives roughly a 55x gain over Python; vectorization, threading, and data-locality optimizations give roughly 350x.]
Configuration info: - Versions: Intel® Distribution for Python 2.7.10 Technical Preview 1 (Aug 03, 2015), icc 15.0; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz (2 sockets, 16 cores each, HT=OFF), 64 GB of RAM, 8 DIMMS of 8GB@2133MHz; Operating System: Ubuntu 14.04 LTS.
Bringing parallelism (vectorization, threading, multi-node) to Python is
essential to make it useful in production
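The benchmark above gets its vectorization gains from array-level operations. A minimal sketch of a batched Black-Scholes call-price kernel in NumPy (this is illustrative code, not the benchmark code itself; function and parameter names are assumptions):

```python
import numpy as np
from scipy.special import erf

def black_scholes_call(S, K, T, r, sigma):
    """Price European calls for whole arrays of options at once.

    Every step is an array operation, so an MKL-backed NumPy/SciPy
    can vectorize and thread across all options in the batch.
    """
    d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    ncdf = lambda t: 0.5 * (1.0 + erf(t / np.sqrt(2.0)))  # standard normal CDF
    return S * ncdf(d1) - K * np.exp(-r * T) * ncdf(d2)

# One batched call prices a million options without a Python-level loop
n = 1000000
S = np.random.uniform(10.0, 50.0, n)   # spot prices
K = np.random.uniform(10.0, 50.0, n)   # strikes
prices = black_scholes_call(S, K, 1.0, 0.05, 0.2)
```

The per-element formulation (a Python loop over options) is what the 55x/350x gap in the chart measures against.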
Chapter 19. Performance Optimization of Black Scholes Pricing
Performance-productivity technological choices
• Numerical packages acceleration with Intel® performance libraries (MKL, DAAL, IPP)
• Better parallelism and composable multi-threading (TBB, MPI)
• Language extensions for vectorization and multi-threading (Cython, Numba, Pyston)
• Integration with Big Data and Machine Learning platforms/frameworks (Spark, Hadoop, Theano, etc.)
• Profiling Python and mixed-language codes (VTune)
Target domains: Energy, Financial Analytics, Science & Research, Engineering Design, Signal Processing, Digital Content Creation
Our approach
1. Enable hooks to Intel® MKL, Intel® DAAL, and Intel® IPP functions in the most popular numerical/data processing packages
• NumPy, SciPy, Scikit-Learn, PyTables, Scikit-Image, …
• Available through Intel® Distribution for Python* and as Conda packages
2. Most optimizations are eventually upstreamed to their home open source projects
3. Provide Python interfaces for DAAL (a.k.a. PyDAAL)
More cores. More threads. Wider vectors.
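Because the hooks in step 1 live inside the packages themselves, ordinary NumPy code picks up the accelerated BLAS/LAPACK automatically; nothing in user code changes. A quick sketch for checking which BLAS a build is linked against, with a call that exercises it (sizes are illustrative):

```python
import numpy as np

# Inspect which BLAS/LAPACK this NumPy build is linked against.
# MKL-backed builds (e.g. Intel Distribution for Python, conda "mkl"
# packages) report mkl in the build info; stock builds typically
# report openblas or a reference BLAS instead.
np.show_config()

# A large matrix product exercises the underlying BLAS (dgemm).
# On an MKL-backed build this single call runs multi-threaded.
rng = np.random.RandomState(0)
a = rng.rand(1000, 1000)
b = rng.rand(1000, 1000)
c = a.dot(b)
```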
Optimized mathematical building blocks: Intel® Math Kernel Library (Intel MKL)
Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse Solvers
• Iterative
• PARDISO* SMP & Cluster
Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root
Vector RNGs
• Multiple BRNG
• Support methods for independent streams creation
• Support all key probability distributions
Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance
And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
Functional domains listed above accelerate the respective NumPy, SciPy, etc. domains (per-domain speedup callouts on the slide: up to 100x, up to 10x, up to 10x, and up to 60x faster).
Optimized FFT showcase: Intel® Math Kernel Library (Intel MKL)
• The original SciPy FFT implementation is about 2x faster than the original NumPy FFT
• Intel engineers bridged the NumPy and SciPy implementations via a common layer and embedded MKL FFT calls, which measurably accelerates both NumPy and SciPy
• NumPy and SciPy are computationally compatible
• FFT descriptor caching is applied for enhanced performance in repetitive and multidimensional FFT calculations
[Chart: FFT speedup vs. Ubuntu* default Python* — SciPy FFT: PSF 1.0, Intel (1 thread) 1.0, Intel (32 threads) 4.8; NumPy FFT: PSF 1, Intel (1 thread) 2.72, Intel (32 threads) 10.17]
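The descriptor-caching point above matters for repeated same-shape transforms; from user code this is just ordinary numpy.fft calls, with the plan reuse happening underneath. A sketch with made-up signal sizes:

```python
import numpy as np

# Repeated same-shape transforms are the case the MKL descriptor cache
# targets: the FFT plan is built once and reused across calls. Nothing
# about the user-facing API changes.
x = np.random.rand(64, 1024)          # 64 signals of length 1024

spectra = np.fft.fft(x, axis=-1)      # batched 1-D FFT over the last axis
roundtrip = np.fft.ifft(spectra, axis=-1).real

# Forward followed by inverse FFT recovers the input to float accuracy
```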
Available starting Intel® Distribution for Python* 2017 Beta
Optimized RNG showcase: Intel® Math Kernel Library (Intel MKL)
• Implemented numpy.random in vector fashion to enable vector MKL RNG and VML calls
• Enabled multiple BRNG
• Enabled multiple distribution transformation methods
Initial data. Final data to be available in the update for Intel® Distribution for Python* 2017 Beta
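From user code, the vectorized path is simply batched numpy.random calls: one call per distribution draws a whole vector, which is the pattern that maps onto MKL's vector RNGs. A sketch (sizes are illustrative):

```python
import numpy as np

# One vectorized call per distribution; drawing variates one at a time
# in a Python loop cannot be accelerated this way.
n = 10 ** 6
u = np.random.uniform(0.0, 1.0, size=n)    # one call, one vector
z = np.random.standard_normal(size=n)
g = np.random.gamma(0.78, 2.0, size=n)     # one of the benchmarked cases
```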
[Chart: numpy.random speedup due to MKL RNG, up to roughly 60x, across distributions: random_sample(), uniform(0, 1), standard_normal(), normal(1, 5), standard_exponential(), exponential(2), standard_gamma(0.78), gamma(0.78, 2), standard_t(1.78), standard_cauchy(), chisquare(7.78), beta(1.2, 4.3), lognormal(-2, 1), f(2.5, 3.4), noncentral_f(2.5, 3.5, 12), wald(0.7, 2.7), gumbel(0.7, 2.5), weibull(0.7), logistic(0, 2), laplace(1, 2), noncentral_chisquare(3, 2.3), triangular(0.7, 0.9, 1.2), rayleigh(2.5), binomial(123, 0.4), geometric(0.7), negative_binomial(45, 0.34), poisson(0.5), poisson(7), poisson(53)]
Intel MPI in Python
Oversubscription with Nested Parallelism
Software components are built from smaller components
If each component specifies its own threads, there can be too many.
Intel TBB dynamically balances thread loads and effectively manages oversubscription
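The oversubscription pattern can be reproduced with plain multiprocessing tools: an outer thread pool whose tasks each call into a (potentially multi-threaded) BLAS. A sketch with illustrative sizes and pool widths; on a non-MKL build the inner level is single-threaded and the effect does not appear:

```python
from multiprocessing.pool import ThreadPool
import numpy as np

# Nested-parallelism sketch: outer threads each invoke the BLAS behind
# NumPy. Without a coordinating runtime such as TBB, the total thread
# count is outer threads times BLAS threads, which can exceed the core
# count: the oversubscription described above.
def task(seed):
    rs = np.random.RandomState(seed)
    m = rs.rand(200, 200)
    return float(np.linalg.norm(m.dot(m)))  # BLAS call inside each task

pool = ThreadPool(8)                   # outer level of parallelism
results = pool.map(task, range(16))    # 16 independent tasks
pool.close()
pool.join()
```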
Intel TBB in Python
TBB threading is already exposed via Intel MKL and Intel DAAL.
The 2017 Beta introduces an Intel TBB module for effective higher-level parallelism:
• Implements the multiprocessing.Pool interface: from TBB import Pool
• Uses monkey-patching to enable Dask/Joblib/etc.: python -m TBB my_app.py, or with TBB.Monkey():
• Switches MKL-accelerated packages to the TBB-based threading layer when applied
• Bug fix: NumPy did not release the GIL for MKL LAPACK calls
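Because the module implements the multiprocessing.Pool interface, it is a drop-in replacement. A hedged sketch (the TBB module name is as quoted on the slide; this falls back to multiprocessing's ThreadPool, which exposes the same interface, where the module is not installed):

```python
# Drop-in Pool usage as described on the slide.
try:
    from TBB import Pool               # Intel Distribution for Python
except ImportError:
    from multiprocessing.pool import ThreadPool as Pool

def work(x):
    return x * x

pool = Pool(4)
squares = pool.map(work, range(8))
pool.close()
pool.join()
```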
Case study: Collaborative filtering in Python
Recommendations of useful purchases
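As context for the benchmark that follows, a toy item-based collaborative-filtering kernel (made-up ratings; not the case-study code): the hot spots are matrix products, which is exactly where MKL's innermost parallelism applies.

```python
import numpy as np

# Toy item-based collaborative filtering on a users x items matrix.
ratings = np.array([[5., 3., 0., 1.],
                    [4., 0., 0., 1.],
                    [1., 1., 0., 5.],
                    [0., 0., 5., 4.]])

# Cosine similarity between item columns (one matrix product plus
# normalization); this is a BLAS dgemm under the hood.
norms = np.linalg.norm(ratings, axis=0)
sim = ratings.T.dot(ratings) / np.outer(norms, norms)

# Score items for every user by similarity-weighted ratings (a second
# matrix product), then pick the top-scoring item per user.
scores = ratings.dot(sim)
recommended = scores.argmax(axis=1)
```

In the benchmark, the outer level of parallelism distributes users across a thread pool while each matrix product threads internally, which is where the nested-parallelism results below come from.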
Configuration Info: - Versions: Intel(R) Distribution for Python 2.7.11 2017, Beta (Mar 04, 2016), MKL version 11.3.2 for Intel Distribution for Python 2017, Beta, Fedora* built Python*: Python 2.7.10 (default, Sep 8 2015), NumPy 1.9.2, SciPy 0.14.1, multiprocessing 0.70a1 built with gcc 5.1.1; Hardware: 96 CPUs (HT ON), 4 sockets (12 cores/socket), 1 NUMA node, Intel(R) Xeon(R) E5-4657L [email protected], RAM 64GB, Operating System: Fedora release 23 (Twenty Three)
[Chart: Collaborative filtering, generation of user recommendations, measured in user requests per second. Configurations: Fedora Python (1x baseline); Intel Python (innermost parallelism only, via Intel MKL); Fedora Python + ThreadPool (outermost parallelism only); Intel Python + ThreadPool (oversubscription with nested parallelism); Intel Python + ThreadPool + TBB (nested parallelism with Intel TBB, 27x). Intermediate speedups shown: 11x, 15x, 17x. Positive effect of TBB-based nested parallelism in Python.]
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804