Domino Data Lab November 10, 2015
Faster data science — without a cluster
Parallel programming in Python
Manojit Nandi
mnandi92@gmail.com
@mnandi92
Who am I?
• Data Scientist at STEALTHBits Technologies
• Data Science Evangelist at Domino Data Lab
• BS in Decision Science
Agenda and Goals
• Motivation
• Conceptual intro to parallelism, general principles and pitfalls
• Machine learning applications
• Demos
Goal: Leave you with principles, and practical concrete tools, that will
help you run your code much faster
Motivation
• Lots of “medium data” problems
• Can fit in memory on one machine
• Lots of naturally parallel problems

• Easy to access large machines
• Clusters are hard
• Not everything fits map-reduce
“CPUs with multiple cores have become the standard in the recent
development of modern computer architectures and we can not only find
them in supercomputer facilities but also in our desktop machines at
home, and our laptops; even Apple's iPhone 5S got a 1.3 GHz dual-core
processor in 2013.”
- Sebastian Raschka
Parallel programming 101
Domino Data Lab November 10, 2015
• Think about independent tasks (hint: “for” loops are a good place to start!)
• Should be CPU-bound tasks
• Warnings and pitfalls
• Not a substitute for good code
• Overhead
• Shared resource contention
• Thrashing
Source: Blaise Barney, Lawrence Livermore National Laboratory
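A minimal sketch of the idea above, not from the slides: take a `for` loop whose iterations are independent and CPU-bound, and replace it with a parallel map using Python's standard-library `multiprocessing` module. The function name and inputs are illustrative.

```python
from multiprocessing import Pool

def cpu_bound_task(n):
    # Each call is independent: it reads only its own argument and
    # shares no state, so the iterations can run in any order.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [10_000, 20_000, 30_000, 40_000]
    # Serial version: results = [cpu_bound_task(n) for n in inputs]
    with Pool() as pool:  # defaults to one worker per CPU core
        results = pool.map(cpu_bound_task, inputs)
    print(results)
```

For tiny tasks, the process startup and pickling overhead mentioned above can easily outweigh the speed-up.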
Can parallelize at different “levels”
Will focus on algorithms, with some brief comments on experiments.
• Math ops: run against underlying libraries that parallelize low-level operations, e.g., OpenBLAS, ATLAS
• Algorithms: write your code (or use a package) to parallelize functions or steps within your analysis
• Experiments: run different analyses at once
Parallelize tasks to match your resources
• Computing something (CPU)
• Reading from disk/database
• Writing to disk/database
• Network IO (e.g., web scraping)
Saturating a resource will create a bottleneck.
Don't oversaturate your resources
itemIDs = [1, 2, … , n]
parallel-for-each(i = itemIDs) {
    item = fetchData(i)              // disk/network read inside the parallel loop
    result = computeSomething(item)
    saveResult(result)               // parallel writes contend for the same disk/database
}
Parallelize tasks to match your resources
items = fetchData([1, 2, … , n])     // one batched read, done serially
results = parallel-for-each(item = items) {
    computeSomething(item)           // only the CPU-bound step runs in parallel
}
saveResult(results)                  // one batched write, done serially
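The same pattern in Python with `multiprocessing.Pool`; `fetch_data` and `compute_something` are hypothetical stand-ins for the pseudocode's functions.

```python
from multiprocessing import Pool

def fetch_data(item_ids):
    # Stand-in for one batched read from disk/database, done serially
    return [i * 10 for i in item_ids]

def compute_something(item):
    # The CPU-bound step: the only part that runs in parallel
    return item * item

if __name__ == "__main__":
    items = fetch_data(range(1, 5))   # saturate IO once, up front
    with Pool() as pool:
        results = pool.map(compute_something, items)
    print(results)                    # one batched write would go here
    # [100, 400, 900, 1600]
```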
Avoid modifying global state
itemIDs = [0, 0, 0, 0]
parallel-for-each(i = 0:3) {
    itemIDs[i] = i                   // modifies a per-process copy, not the original
}
[0,0,0,0]                                Array initialized in the original process
[0,0,0,0] [0,0,0,0] [0,0,0,0] [0,0,0,0]  Array copied to each sub-process
[0,0,0,0] [0,1,0,0] [0,0,2,0] [0,0,0,3]  Each copy is modified independently
[0,0,0,0]                                When all parallel tasks finish, the array in the original process remains unchanged
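A minimal Python demonstration of this pitfall, assuming process-based parallelism (`multiprocessing`); each worker mutates its own copy, and the parent's list is untouched.

```python
from multiprocessing import Pool

item_ids = [0, 0, 0, 0]

def set_item(i):
    # Mutates the copy of item_ids living inside the worker process
    item_ids[i] = i
    return item_ids[i]

if __name__ == "__main__":
    with Pool(4) as pool:
        pool.map(set_item, range(4))
    print(item_ids)  # still [0, 0, 0, 0] in the original process
```

If you need results back, return them from the worker function (as `pool.map` does) instead of writing to shared globals.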
Demo
Many ML tasks are parallelized
Listed roughly from most intuitive to hardest to parallelize:
• Cross-Validation
• Grid Search Selection
• Random Forest
• Kernel Density Estimation
• K-Means Clustering
• Probabilistic Graphical Models
• Online Learning
• Neural Networks (Backpropagation)
Cross validation
Grid search
Example parameter grid:
C: 1, 10, 100, 1000
Kernel: Linear, RBF
Each of the 8 combinations can be trained and scored independently.
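This kind of grid maps directly onto scikit-learn's `GridSearchCV` with `n_jobs`; the iris dataset here is just a placeholder, not from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 4 values of C x 2 kernels = 8 candidates; each (candidate, CV fold)
# pair is an independent fit, so n_jobs=-1 spreads them over all cores
param_grid = {"C": [1, 10, 100, 1000], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```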
Random forest
Parallel programming in Python
• Joblib: pythonhosted.org/joblib/parallel.html
• scikit-learn (n_jobs): scikit-learn.org
  • GridSearchCV
  • RandomForest
  • KMeans
  • cross_val_score
• IPython Notebook clusters: www.astro.washington.edu/users/vanderplas/Astr599/notebooks/21_IPythonParallel
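A minimal joblib sketch of the `Parallel`/`delayed` idiom the first link documents; `sqrt` over squares is just a toy workload.

```python
from math import sqrt
from joblib import Parallel, delayed

# Dispatch independent calls to 2 worker processes; results come
# back in input order, just like the serial list comprehension
# [sqrt(i ** 2) for i in range(10)]
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
print(results)  # [0.0, 1.0, 2.0, ..., 9.0]
```

scikit-learn uses joblib internally, which is why `n_jobs` shows up across its estimators.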
Demo
Parallel Programming using the GPU
• GPUs are essential to deep learning because they can yield a 10x
speed-up when training neural networks.
• Use the PyCUDA library to write Python code that executes on the
GPU.
Demo
Can compose layers of parallelism
[Diagram: three machines, one per experiment (RF, NN, grid-searched SVC), each spreading its work across cores c1 … cn]
Demo
FYI: Parallel programming in R
• General purpose
  • parallel
  • foreach: cran.r-project.org/web/packages/foreach
• More specialized
  • randomForest: cran.r-project.org/web/packages/randomForest
  • caret: topepo.github.io/caret
  • plyr: cran.r-project.org/web/packages/plyr
dominodatalab.com
blog.dominodatalab.com
@dominodatalab
Check us out!
