Sergey Maidanov
Software Engineering Manager for Intel® Distribution for Python*
Copyright © 2016, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
Optimization Notice
Introduction
Python is among the most popular programming languages, especially for prototyping, but it sees very limited use in production.
Many computational problems require HPC/Big Data production environments. One option is to hire a team of Java/C++ programmers; the alternative is to ease access for Python researchers and/or have a team of Python programmers deploy optimized Python in production.
Python is the #1 programming language in hiring demand, followed by Java and C++, and the demand is growing.
What is required to make Python performance closer to native code?
[Chart: Black-Scholes formula throughput, MOptions/sec (log scale from 0.2 to 15625), for Python, C, and C with parallelism. Static compilation gives roughly a 55x gain over Python; vectorization, threading, and data-locality optimizations give roughly 350x.]
Configuration info: - Versions: Intel® Distribution for Python 2.7.10 Technical Preview 1 (Aug 03, 2015), icc 15.0; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz (2 sockets, 16 cores each, HT=OFF), 64 GB of RAM, 8 DIMMS of 8GB@2133MHz; Operating System: Ubuntu 14.04 LTS.
Bringing parallelism (vectorization, threading, multi-node) to Python is
essential to make it useful in production
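The benchmark above gets its vectorization gains from array-level operations. A minimal sketch of a batched Black-Scholes call-price kernel in NumPy (this is illustrative code, not the benchmark code itself; function and parameter names are assumptions):

```python
import numpy as np
from scipy.special import erf

def black_scholes_call(S, K, T, r, sigma):
    """Price European calls for whole arrays of options at once.

    Every step is an array operation, so an MKL-backed NumPy/SciPy
    can vectorize and thread across all options in the batch.
    """
    d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    ncdf = lambda t: 0.5 * (1.0 + erf(t / np.sqrt(2.0)))  # standard normal CDF
    return S * ncdf(d1) - K * np.exp(-r * T) * ncdf(d2)

# One batched call prices a million options without a Python-level loop
n = 1000000
S = np.random.uniform(10.0, 50.0, n)   # spot prices
K = np.random.uniform(10.0, 50.0, n)   # strikes
prices = black_scholes_call(S, K, 1.0, 0.05, 0.2)
```

The per-element formulation (a Python loop over options) is what the 55x/350x gap in the chart measures against.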
Chapter 19. Performance Optimization of Black Scholes Pricing
Performance-productivity technological choices
• Numerical packages acceleration with Intel® performance libraries (MKL, DAAL, IPP)
• Better parallelism and composable multi-threading (TBB, MPI)
• Language extensions for vectorization and multi-threading (Cython, Numba, Pyston)
• Integration with Big Data and Machine Learning platforms/frameworks (Spark, Hadoop, Theano, etc.)
• Profiling Python and mixed-language codes (VTune)
Target domains: Energy, Financial Analytics, Science & Research, Engineering Design, Signal Processing, Digital Content Creation
Our approach
1. Enable hooks to Intel® MKL, Intel® DAAL, and Intel® IPP functions in the most popular numerical/data processing packages
• NumPy, SciPy, Scikit-Learn, PyTables, Scikit-Image, …
• Available through Intel® Distribution for Python* and as Conda packages
2. Most optimizations are eventually upstreamed to their home open source projects
3. Provide Python interfaces for DAAL (a.k.a. PyDAAL)
More cores. More threads. Wider vectors.
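Because the hooks in step 1 live inside the packages themselves, ordinary NumPy code picks up the accelerated BLAS/LAPACK automatically; nothing in user code changes. A quick sketch for checking which BLAS a build is linked against, with a call that exercises it (sizes are illustrative):

```python
import numpy as np

# Inspect which BLAS/LAPACK this NumPy build is linked against.
# MKL-backed builds (e.g. Intel Distribution for Python, conda "mkl"
# packages) report mkl in the build info; stock builds typically
# report openblas or a reference BLAS instead.
np.show_config()

# A large matrix product exercises the underlying BLAS (dgemm).
# On an MKL-backed build this single call runs multi-threaded.
rng = np.random.RandomState(0)
a = rng.rand(1000, 1000)
b = rng.rand(1000, 1000)
c = a.dot(b)
```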
Optimized mathematical building blocks: Intel® Math Kernel Library (Intel MKL)
Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse Solvers
• Iterative
• PARDISO* SMP & Cluster
Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root
Vector RNGs
• Multiple BRNG
• Support methods for independent streams creation
• Support all key probability distributions
Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance
And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
Functional domains listed above accelerate the respective NumPy, SciPy, etc. domains (per-domain speedup callouts on the slide: up to 100x, up to 10x, up to 10x, and up to 60x faster).
Optimized FFT showcase: Intel® Math Kernel Library (Intel MKL)
• The original SciPy FFT implementation is about 2x faster than the original NumPy FFT
• Intel engineers bridged the NumPy and SciPy implementations via a common layer and embedded MKL FFT calls, which measurably accelerates both NumPy and SciPy
• NumPy and SciPy are computationally compatible
• FFT descriptor caching is applied for enhanced performance in repetitive and multidimensional FFT calculations
[Chart: FFT speedup vs. Ubuntu* default Python* — SciPy FFT: PSF 1.0, Intel (1 thread) 1.0, Intel (32 threads) 4.8; NumPy FFT: PSF 1, Intel (1 thread) 2.72, Intel (32 threads) 10.17]
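The descriptor-caching point above matters for repeated same-shape transforms; from user code this is just ordinary numpy.fft calls, with the plan reuse happening underneath. A sketch with made-up signal sizes:

```python
import numpy as np

# Repeated same-shape transforms are the case the MKL descriptor cache
# targets: the FFT plan is built once and reused across calls. Nothing
# about the user-facing API changes.
x = np.random.rand(64, 1024)          # 64 signals of length 1024

spectra = np.fft.fft(x, axis=-1)      # batched 1-D FFT over the last axis
roundtrip = np.fft.ifft(spectra, axis=-1).real

# Forward followed by inverse FFT recovers the input to float accuracy
```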
Available starting Intel® Distribution for Python* 2017 Beta
Optimized RNG showcase: Intel® Math Kernel Library (Intel MKL)
• Implemented numpy.random in vector fashion to enable vector MKL RNG and VML calls
• Enabled multiple BRNG
• Enabled multiple distribution transformation methods
Initial data. Final data to be available in the update for Intel® Distribution for Python* 2017 Beta
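From user code, the vectorized path is simply batched numpy.random calls: one call per distribution draws a whole vector, which is the pattern that maps onto MKL's vector RNGs. A sketch (sizes are illustrative):

```python
import numpy as np

# One vectorized call per distribution; drawing variates one at a time
# in a Python loop cannot be accelerated this way.
n = 10 ** 6
u = np.random.uniform(0.0, 1.0, size=n)    # one call, one vector
z = np.random.standard_normal(size=n)
g = np.random.gamma(0.78, 2.0, size=n)     # one of the benchmarked cases
```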
[Chart: numpy.random speedup due to MKL RNG, up to roughly 60x, across distributions: random_sample(), uniform(0, 1), standard_normal(), normal(1, 5), standard_exponential(), exponential(2), standard_gamma(0.78), gamma(0.78, 2), standard_t(1.78), standard_cauchy(), chisquare(7.78), beta(1.2, 4.3), lognormal(-2, 1), f(2.5, 3.4), noncentral_f(2.5, 3.5, 12), wald(0.7, 2.7), gumbel(0.7, 2.5), weibull(0.7), logistic(0, 2), laplace(1, 2), noncentral_chisquare(3, 2.3), triangular(0.7, 0.9, 1.2), rayleigh(2.5), binomial(123, 0.4), geometric(0.7), negative_binomial(45, 0.34), poisson(0.5), poisson(7), poisson(53)]
Intel MPI in Python
Oversubscription with Nested Parallelism
Software components are built from smaller components
If each component specifies its own threads, there can be too many.
Intel TBB dynamically balances thread loads and effectively manages oversubscription
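The oversubscription pattern can be reproduced with plain multiprocessing tools: an outer thread pool whose tasks each call into a (potentially multi-threaded) BLAS. A sketch with illustrative sizes and pool widths; on a non-MKL build the inner level is single-threaded and the effect does not appear:

```python
from multiprocessing.pool import ThreadPool
import numpy as np

# Nested-parallelism sketch: outer threads each invoke the BLAS behind
# NumPy. Without a coordinating runtime such as TBB, the total thread
# count is outer threads times BLAS threads, which can exceed the core
# count: the oversubscription described above.
def task(seed):
    rs = np.random.RandomState(seed)
    m = rs.rand(200, 200)
    return float(np.linalg.norm(m.dot(m)))  # BLAS call inside each task

pool = ThreadPool(8)                   # outer level of parallelism
results = pool.map(task, range(16))    # 16 independent tasks
pool.close()
pool.join()
```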
Intel TBB in Python
TBB threading is already exposed via Intel MKL and Intel DAAL.
The 2017 Beta introduces an Intel TBB module for effective higher-level parallelism:
• Implements the multiprocessing.Pool interface: from TBB import Pool
• Uses monkey-patching to enable Dask/Joblib/etc.: python -m TBB my_app.py, or with TBB.Monkey():
• Switches MKL-accelerated packages to the TBB-based threading layer when applied
• Bug fix: NumPy did not release the GIL for MKL LAPACK calls
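Because the module implements the multiprocessing.Pool interface, it is a drop-in replacement. A hedged sketch (the TBB module name is as quoted on the slide; this falls back to multiprocessing's ThreadPool, which exposes the same interface, where the module is not installed):

```python
# Drop-in Pool usage as described on the slide.
try:
    from TBB import Pool               # Intel Distribution for Python
except ImportError:
    from multiprocessing.pool import ThreadPool as Pool

def work(x):
    return x * x

pool = Pool(4)
squares = pool.map(work, range(8))
pool.close()
pool.join()
```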
Case study: Collaborative filtering in Python
Recommendations of useful purchases
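As context for the benchmark that follows, a toy item-based collaborative-filtering kernel (made-up ratings; not the case-study code): the hot spots are matrix products, which is exactly where MKL's innermost parallelism applies.

```python
import numpy as np

# Toy item-based collaborative filtering on a users x items matrix.
ratings = np.array([[5., 3., 0., 1.],
                    [4., 0., 0., 1.],
                    [1., 1., 0., 5.],
                    [0., 0., 5., 4.]])

# Cosine similarity between item columns (one matrix product plus
# normalization); this is a BLAS dgemm under the hood.
norms = np.linalg.norm(ratings, axis=0)
sim = ratings.T.dot(ratings) / np.outer(norms, norms)

# Score items for every user by similarity-weighted ratings (a second
# matrix product), then pick the top-scoring item per user.
scores = ratings.dot(sim)
recommended = scores.argmax(axis=1)
```

In the benchmark, the outer level of parallelism distributes users across a thread pool while each matrix product threads internally, which is where the nested-parallelism results below come from.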
Configuration Info: - Versions: Intel(R) Distribution for Python 2.7.11 2017, Beta (Mar 04, 2016), MKL version 11.3.2 for Intel Distribution for Python 2017, Beta, Fedora* built Python*: Python 2.7.10 (default, Sep 8 2015), NumPy 1.9.2, SciPy 0.14.1, multiprocessing 0.70a1 built with gcc 5.1.1; Hardware: 96 CPUs (HT ON), 4 sockets (12 cores/socket), 1 NUMA node, Intel(R) Xeon(R) E5-4657L [email protected], RAM 64GB, Operating System: Fedora release 23 (Twenty Three)
[Chart: Collaborative filtering, generation of user recommendations, measured in user requests per second. Configurations: Fedora Python (1x baseline); Intel Python (innermost parallelism only, via Intel MKL); Fedora Python + ThreadPool (outermost parallelism only); Intel Python + ThreadPool (oversubscription with nested parallelism); Intel Python + ThreadPool + TBB (nested parallelism with Intel TBB, 27x). Intermediate speedups shown: 11x, 15x, 17x. Positive effect of TBB-based nested parallelism in Python.]
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804