Parallel Programming in Python: Speeding up your analysis

23
Domino Data Lab November 10, 2015 Faster data science — without a cluster Parallel programming in Python Manojit Nandi [email protected] @mnandi92

Transcript of Parallel Programming in Python: Speeding up your analysis

Page 1: Parallel Programming in Python: Speeding up your analysis

Domino Data Lab November 10, 2015

Faster data science — without a clusterParallel programming in Python

Manojit [email protected]

@mnandi92

Page 2: Parallel Programming in Python: Speeding up your analysis

Who am I?

Domino Data Lab November 10, 2015

• Data Scientist at STEALTHBits Technologies

• Data Science Evangelist at Domino Data Lab

• BS in Decision Science

Page 3: Parallel Programming in Python: Speeding up your analysis

Domino Data Lab November 10, 2015

Agenda and Goals

• Motivation• Conceptual intro to parallelism, general principles and pitfalls• Machine learning applications• Demos

Goal: Leave you with principles, and practical concrete tools, that will help you run your code much faster

Page 4: Parallel Programming in Python: Speeding up your analysis

Motivation

Domino Data Lab November 10, 2015

• Lots of “medium data” problems• Can fit in memory on one machine

• Lots of naturally parallel problems

• Easy to access large machines• Clusters are hard• Not everything fits map-reduce

CPUs with multiple cores have become the standard in the recent development of modern computer architectures and we can not only find them in supercomputer facilities but also in our desktop machines at home, and our laptops; even Apple's iPhone 5S got a 1.3 GHz Dual-core processor in 2013.

- Sebastian Rascka

Page 5: Parallel Programming in Python: Speeding up your analysis

Parallel programing 101

Domino Data Lab November 10, 2015

• Think about independent tasks (hint: “for” loops are a good place to start!)• Should be CPU-bound tasks

• Warning and pitfalls• Not a substitute for good code• Overhead • Shared resource contention• Thrashing

Source: Blaise Barney, Lawrence Livermore National Laboratory

Page 6: Parallel Programming in Python: Speeding up your analysis

Can parallelize at different “levels”

Domino Data Lab November 10, 2015

Will focus on algorithms, with some brief comments on Experiments

Run against underlying libraries that parallelize low-level operations, e.g., openBLAS, ATLAS

Write your code (or use a package) to parallelize functions or steps within your analysis

Run different analyses at once

Math ops

Algorithms

Experiments

Page 7: Parallel Programming in Python: Speeding up your analysis

Parallelize tasks to match your resources

Domino Data Lab November 10, 2015

Computing something (CPU)

Reading from disk/database

Writing to disk/database

Network IO (e.g., web scraping)

Saturating a resource will create a bottleneck

Page 8: Parallel Programming in Python: Speeding up your analysis

Don't oversaturate your resources

Domino Data Lab November 10, 2015

itemIDs = [1, 2, … , n]

parallel-for-each(i = itemIDs){

item = fetchData(i)

result = computeSomething(item)

saveResult(result)

}

Page 9: Parallel Programming in Python: Speeding up your analysis

Parallelize tasks to match your resources

Domino Data Lab November 10, 2015

items = fetchData([1, 2, … , n])

results = parallel-for-each(i = items){

computeSomething(item)

}

saveResult(results)

Page 10: Parallel Programming in Python: Speeding up your analysis

Avoid modifying global state

Domino Data Lab November 10, 2015

itemIDs = [0, 0, 0, 0]

parallel-for-each(i = 1:4) {

itemIDs[i] = i

}

A = [0,0,0,0]Array initialized in process 1

[0,0,0,0] [0,0,0,0][0,0,0,0][0,0,0,0]Array copied to each sub-process

[0,0,0,0] [0,0,0,3][0,0,2,0][0,1,0,0]The copy is modified

[0,0,0,0]When all parallel tasks finish, array in original process remained unchanged

Page 11: Parallel Programming in Python: Speeding up your analysis

Demo

Domino Data Lab November 10, 2015

Page 12: Parallel Programming in Python: Speeding up your analysis

Many ML tasks are parallelized

Domino Data Lab November 10, 2015

• Cross-Validation• Grid Search Selection• Random Forest• Kernel Density Estimation

• K-Means Clustering• Probabilistic Graphical Models• Online Learning• Neural Networks (Backpropagation)

Harder to parallelize

Intuitive to parallelize

Page 13: Parallel Programming in Python: Speeding up your analysis

Cross validation

Domino Data Lab November 10, 2015

Page 14: Parallel Programming in Python: Speeding up your analysis

Grid search

Domino Data Lab November 10, 2015

1 10 100 1000

Linear

RBF

C

Ker

nel

Page 15: Parallel Programming in Python: Speeding up your analysis

Random forest

Domino Data Lab November 10, 2015

Page 16: Parallel Programming in Python: Speeding up your analysis

Parallel programing in Python

Domino Data Lab November 10, 2015

• Joblib pythonhosted.org/joblib/parallel.html

• scikit-learn (n_jobs) scikit-learn.org

• GridSearchCV• RandomForest• KMeans• cross_val_score

• IPython Notebook clusterswww.astro.washington.edu/users/vanderplas/Astr599/notebooks/21_IPythonParallel

Page 17: Parallel Programming in Python: Speeding up your analysis

Demo

Domino Data Lab November 10, 2015

Page 18: Parallel Programming in Python: Speeding up your analysis

Parallel Programming using the GPU

Domino Data Lab November 10, 2015

• GPUs are essential to deep learning because they can yield 10x speed-up when training the neural networks.

• Use PyCUDA library to write Python code that executes using the GPU.

Page 19: Parallel Programming in Python: Speeding up your analysis

Demo

Domino Data Lab November 10, 2015

Page 20: Parallel Programming in Python: Speeding up your analysis

Can compose layers of parallelism

Domino Data Lab November 10, 2015

c1 c2 cn… c1 c2 cn…c1 c2 cn…

Machines (experiments)

Cores

RF NN GridSearched SVC

Page 21: Parallel Programming in Python: Speeding up your analysis

Demo

Domino Data Lab November 10, 2015

Page 22: Parallel Programming in Python: Speeding up your analysis

FYI: Parallel programing in R

Domino Data Lab November 10, 2015

• General purpose• parallel• foreach

cran.r-project.org/web/packages/foreach

• More specialized• randomForest

cran.r-project.org/web/packages/randomForest

• caret topepo.github.io/caret

• plyr cran.r-project.org/web/packages/plyr

Page 23: Parallel Programming in Python: Speeding up your analysis

Domino Data Lab November 10, 2015

dominodatalab.comblog.dominodatalab.com

@dominodatalab

Check us out!