This document discusses scaling Python performance in production environments. It introduces the Intel Distribution for Python, which provides optimized versions of NumPy, SciPy, and Scikit-Learn using Intel MKL to accelerate linear algebra and machine learning algorithms. It also supports parallelism through MPI, TBB for multithreading, and integration with big data frameworks. Profiling tools like Intel VTune Amplifier help optimize mixed-language Python applications for Intel architectures. The goal is to make Python usable for high performance computing and big data workloads while maintaining its ease of use.
Python* Scalability in Production Environments
2. Stanley Seibert
Director of Community Innovation
Continuum Analytics
Python Scalability Story
In Production Environments
Sergey Maidanov
Software Engineering Manager for
Intel® Distribution for Python*
4. What Problems We Solve: Scalable Performance
Make Python usable beyond prototyping environment by
scaling out to HPC and Big Data environments
5. What Problems We Solve: Out-Of-The-Box Usability
“Any articles I found on your site that related to actually using the MKL for compiling something were overly technical. I couldn't figure out what the heck some of the things were doing or talking about.” – Intel® Parallel Studio 2015 Beta Survey Response
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/280832
https://software.intel.com/en-us/articles/building-numpyscipy-with-intel-mkl-and-intel-fortran-on-windows
https://software.intel.com/en-us/articles/numpyscipy-with-intel-mkl
6. Intel® Distribution for Python* 2017
Advancing Python performance closer to native speeds
Easy, out-of-the-box access to high performance Python
• Prebuilt, optimized for numerical computing, data analytics, HPC
• Drop-in replacement for existing Python. No code changes required
Performance with multiple optimization techniques
• Accelerated NumPy/SciPy/Scikit-Learn with Intel® MKL
• Data analytics with pyDAAL, enhanced thread scheduling with TBB, Jupyter* Notebook interface, Numba, Cython
• Scale easily with optimized MPI4Py and Jupyter notebooks
Faster access to latest optimizations for Intel architecture
• Distribution and individual optimized packages available through conda and Anaconda Cloud: anaconda.org/intel
• Optimizations upstreamed back to main Python trunk
7. Why Yet Another Python Distribution?
Intel® Xeon® Processors: mature AVX2 instructions based product
Intel® Xeon Phi™ Product Family: new AVX512 instructions based product
Configuration Info: apt/atlas: installed with apt-get, Ubuntu 16.10, python 3.5.2, numpy 1.11.0, scipy 0.17.0; pip/openblas: installed with pip, Ubuntu 16.10, python 3.5.2, numpy 1.11.1, scipy 0.18.0; Intel Python: Intel Distribution for Python 2017. Hardware: Xeon: Intel Xeon CPU E5-2698 v3 @ 2.30 GHz (2 sockets, 16 cores each, HT=off), 64 GB of RAM, 8 DIMMS of 8GB@2133MHz; Xeon Phi: Intel® Xeon Phi™ CPU 7210 1.30 GHz, 96 GB of RAM, 6 DIMMS of 16GB@1200MHz
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. * Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804.
8. Scaling To HPC/Big Data Production Environment
• Hardware and software efficiency crucial in production (Perf/Watt, etc.)
• Efficiency = Parallelism
• Instruction Level Parallelism with effective memory access patterns
• SIMD
• Multi-threading
• Multi-node
Roofline Performance Model* (figure: Gflop/s, capped at peak Gflop/s, plotted against arithmetic intensity from low to high; kernels marked along the intensity axis: SpMV, BLAS1, Stencils, FFT, BLAS3, Particle Methods)
* Roofline Performance Model https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
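The roofline bound can be written as a one-line formula: attainable throughput is the smaller of the compute peak and memory bandwidth multiplied by arithmetic intensity (flops per byte moved). The sketch below is only an illustration of that formula; the machine numbers (1000 Gflop/s peak, 100 GB/s bandwidth) and the function name are hypothetical and do not describe any processor from the slides.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, arithmetic_intensity):
    """Roofline bound: min(compute peak, memory bandwidth * flops per byte)."""
    return min(peak_gflops, bandwidth_gb_s * arithmetic_intensity)


# Hypothetical machine, chosen only to make the arithmetic easy to follow.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_S = 100.0

for label, intensity in [("low intensity", 0.125),
                         ("ridge point", 10.0),
                         ("high intensity", 32.0)]:
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)
    print(label, intensity, bound)
```

Kernels to the left of the ridge point are limited by memory bandwidth, so better memory access patterns help them most; kernels to the right are limited by compute, where SIMD and threading pay off.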
9. Efficiency = Parallelism in Python
• CPython as interpreter inhibits parallelism but…
• … Overall Python tools evolved far toward unlocking parallelism
▪ Native extensions: numpy*, scipy*, scikit-learn* accelerated with Intel® MKL, Intel® DAAL, Intel® IPP
▪ Composable multi-threading with Intel® TBB and Dask*
▪ Multi-node parallelism with mpi4py* accelerated with Intel® MPI
▪ Language extensions for vectorization & multi-threading (Cython*, Numba*)
▪ Integration with Big Data platforms and Machine Learning frameworks (pySpark*, Theano*, TensorFlow*, etc.)
▪ Mixed language profiling with Intel® VTune™ Amplifier
10. Numpy* & Scipy* optimizations with Intel® MKL
Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse Solvers: Iterative; PARDISO* SMP & Cluster
Fast Fourier Transforms
• 1D and multidimensional FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root
Vector RNGs
• Multiple BRNG
• Support methods for independent streams creation
• Support all key probability distributions
Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance
And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
Highlighted functional domains accelerate the corresponding NumPy, SciPy, etc. domains.
Speedup callouts on the slide's charts: up to 100x, up to 10x, up to 10x, and up to 60x faster.
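As a rough illustration of the "no code changes required" claim, the sketch below uses only ordinary NumPy calls from three of the domains listed above (BLAS, LAPACK, FFT). Whether these calls actually reach Intel MKL depends on how the installed NumPy was built; the array sizes and the fixed seed are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the run is repeatable
a = rng.rand(200, 200)
b = rng.rand(200, 200)

c = a @ b                        # dense matrix multiply (BLAS level 3)
q, r = np.linalg.qr(a)           # QR factorization (LAPACK)
spectrum = np.fft.fft(a[0])      # 1D FFT of the first row

# np.show_config() reports which BLAS/LAPACK libraries this NumPy build uses.
print(c.shape, q.shape, r.shape, spectrum.shape)
```

The same script runs unchanged on a stock NumPy and on an MKL-backed one; only the library underneath the calls differs.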
11. Scikit-Learn* optimizations with Intel® MKL
Speedups of Scikit-Learn Benchmarks (figure: bar chart, vertical axis 0x to 9x)
Intel® Distribution for Python* 2017 Update 1 vs. system Python & NumPy/Scikit-Learn
Benchmarks shown: Approximate neighbors; Fast K-means; GLM; GLM net; LASSO; Lasso path; Least angle regression, OpenMP; Non-negative matrix factorization; Regression by SGD; Sampling without replacement; SVD
System info: 32x Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz, disabled HT, 64GB RAM; Intel® Distribution for Python* 2017 Gold; Intel® MKL 2017.0.0; Ubuntu 14.04.4 LTS; Numpy 1.11.1; scikit-learn 0.17.1. See Optimization Notice.
The Scikit-Learn benchmark speedups show the effect of Intel MKL optimizations for NumPy* and SciPy*.
Effect of DAAL optimizations for Scikit-Learn*: Potential Speedup of Scikit-learn* due to PyDAAL (figure: PCA, 1M Samples, 200 Features)
▪ System Sklearn: 1x
▪ Intel SKlearn: 1.11x
▪ Intel PyDAAL: 54.13x
Intel® Distribution for Python* ships Intel® Data Analytics Acceleration Library with Python interfaces, a.k.a. pyDAAL
12. Distributed parallelism
Intel® MPI library accelerates Intel® Distribution
for Python* (Mpi4py*, Ipyparallel*)
Intel Distribution for Python* also supports
▪ PySpark* - Python interfaces for Spark*, a fast and general
engine for large-scale data processing.
▪ Dask* - a flexible parallel computing library for analytic
computing.
Mpi4py* performance vs. native Intel® MPI (figure: PyDAAL Implicit ALS with Mpi4Py*)
▪ 2 nodes: 1.7x
▪ 4 nodes: 2.2x
▪ 8 nodes: 3.0x
▪ 16 nodes: 5.3x
Configuration Info: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz, 2x18 cores, HT is ON, RAM 128GB; Versions: Oracle Linux Server 6.6, Intel®
DAAL 2017 Gold, Intel® MPI 5.1.3; Interconnect: 1 GB Ethernet
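To make the multi-node idea concrete, here is a minimal mpi4py sketch in which every rank computes a partial sum and `allreduce` combines the results. It is not the PyDAAL Implicit ALS benchmark from the slide. The function name and the problem are made up for illustration, and the `ImportError` fallback is an assumption added so the sketch also runs as a single process on a machine without MPI.

```python
import numpy as np

try:
    from mpi4py import MPI           # needs an MPI library at run time
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:                  # assumed fallback: behave like one rank
    MPI, comm, rank, size = None, None, 0, 1


def distributed_sum_of_squares(n):
    """Each rank sums the squares of its strided share of 0..n-1; ranks then combine."""
    local = float(np.sum(np.arange(rank, n, size, dtype=np.float64) ** 2))
    if comm is None:
        return local
    return comm.allreduce(local, op=MPI.SUM)


total = distributed_sum_of_squares(1000)
if rank == 0:
    print(total)
```

Saved under a hypothetical name such as `sum_squares.py`, this would typically be launched with something like `mpirun -n 4 python sum_squares.py`; the exact launcher depends on the MPI installation.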
13. Composable multi-threading with Intel® TBB
• Amdahl’s law suggests extracting parallelism at all levels
• Software components are built from smaller ones
• If each component is threaded there can be too much!
• Intel TBB dynamically balances thread loads and effectively manages oversubscription
Software stack (figure):
▪ Application
▪ Python packages: pyDAAL, NumPy, SciPy, TBB, Joblib, Dask
▪ Native libs: MKL, TBB, DAAL
Component hierarchy (figure): an Application is built from Components 1 to N; each component is built from subcomponents (1 to K, 1 to M), which are in turn built from further subcomponents.
> python -m tbb myapp.py
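The nesting problem described above can be reproduced with a few lines: an application-level thread pool whose tasks each call a NumPy routine that may itself be multi-threaded inside the native library. This is only a sketch of the pattern, not the benchmark behind the slides; the pool size, matrix size, and function name are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def qr_r_factor(seed):
    """Inner level: the QR call may be multi-threaded inside BLAS/LAPACK."""
    a = np.random.RandomState(seed).rand(300, 300)
    return np.linalg.qr(a)[1]


# Outer level: application threads, each calling into the threaded library.
with ThreadPoolExecutor(max_workers=4) as pool:
    r_factors = list(pool.map(qr_r_factor, range(8)))

print(len(r_factors), r_factors[0].shape)
```

If each of the four outer threads spawns a full team of inner threads, the machine is oversubscribed. Saved as `myapp.py`, the script could be launched as `python -m tbb myapp.py`, assuming the TBB Python module is installed, so that both levels share one scheduler.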
14. Composable Parallelism: QR Performance
Speedup relative to Default Numpy* (figure):
Configuration | Numpy | Dask | Note
Default MKL (Intel® MKL, OpenMP* threading) | 1.00x | 0.61x | Over-subscription
Serial MKL (Intel® MKL, Serial) | 0.22x | 0.89x | App-level parallelism only
Intel® TBB (Intel® MKL, Intel® TBB threading) | 0.47x | 1.46x | TBB-composable nested parallelism
System info: 32x Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, disabled HT, 64GB RAM; Intel(R) MKL 2017.0 Beta Update 1 Intel(R) 64 architecture, Intel(R) AVX2; Intel(R) TBB 4.4.4; Ubuntu 14.04.4 LTS; Dask 0.10.0; Numpy 1.11.0.
15. Profiling Python* code with Intel® VTune™ Amplifier
Feature | cProfile | Line_profiler | Intel® VTune™ Amplifier
Profiling technology | Event | Instrumentation | Sampling, hardware events
Analysis granularity | Function-level | Line-level | Line-level, call stack, time windows, hardware events
Intrusiveness | Medium (1.3-5x) | High (4-10x) | Low (1.05-1.3x)
Mixed language programs | Python | Python | Python, Cython, C++, Fortran
Right tool for high performance application profiling at all levels
• Function-level and line-level hotspot analysis, down to disassembly
• Call stack analysis
• Low overhead
• Mixed-language, multi-threaded application analysis
• Advanced hardware event analysis for native codes (Cython, C++, Fortran) for cache misses, branch misprediction, etc.
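For comparison with the table's first column, function-level profiling with the standard-library cProfile looks roughly like the sketch below. The workload functions are made up for illustration. VTune itself is a separate tool and is not shown here.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately slow pure-Python loop, used as the hotspot."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    return [slow_sum(20000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report attributes time to whole functions only, which is the "function-level" granularity in the table; it cannot say which line inside `slow_sum` is hot.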
16. Stanley Seibert
Director of Community Innovation
Continuum Analytics
November 2016
Scaling Python with JIT Compilation
17. Creating a Compiler For Python
Many valid approaches, but we think these are the most important for data science:
▪ Cannot replace the standard interpreter
– Must be able to continue to use pandas, SciPy, scikit-learn, etc
▪ Minimize boilerplate
– Traditional compiled Python extensions require a lot of infrastructure. Try to stay simple and get out of the way.
▪ Be flexible about execution model
– Not all hardware is a general purpose CPU
▪ Integrate well with Python’s adaptable ecosystem
– Must be able to continue to use pandas, SciPy, scikit-learn, etc
18. Numba: A JIT Compiler for Python Functions
▪ An open-source, function-at-a-time compiler library for Python
▪ Compiler toolbox for different targets and execution models:
– single-threaded CPU, multi-threaded CPU, GPU
– regular functions, “universal functions” (array functions), GPU kernels
▪ Speedup: 2x (compared to basic NumPy code) to 200x (compared to pure
Python)
▪ Combine ease of writing Python with speeds approaching FORTRAN
▪ Empowers data scientists who make tools for themselves and other data
scientists
19. How does Numba work?
Compilation pipeline (figure): Python function (bytecode) and function arguments → Bytecode analysis → Numba IR → Rewrite IR → Type inference → Lowering → LLVM IR → LLVM/NVVM JIT → Machine code → Cache → Execute!
@jit
def do_math(a, b):
    …
>>> do_math(x, y)
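A minimal sketch of the function-at-a-time model follows. The body of `do_math` is elided on the slide, so the dot-product loop here is purely a stand-in. `njit` is Numba's shorthand for `jit(nopython=True)`. The `ImportError` fallback, which simply runs the function uncompiled, is an assumption added so the sketch works without Numba installed.

```python
import numpy as np

try:
    from numba import njit            # shorthand for jit(nopython=True)
except ImportError:                   # assumed fallback: no compilation
    def njit(func):
        return func


@njit
def do_math(a, b):
    """Stand-in body: sum of a[i] * b[i] as an explicit loop."""
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total


x = np.arange(5, dtype=np.float64)
y = np.ones(5)

# With Numba, the first call for a given set of argument types compiles the
# function; later calls with the same types reuse the machine code.
print(do_math(x, y))
```

Calling `do_math` again with arrays of a different dtype would trigger a second, separately specialized compilation.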
20. Supported Platforms and Hardware
▪ OS: Windows (7 and later); OS X (10.9 and later); Linux (RHEL 5 and later)
▪ HW: 32 and 64-bit x86 CPUs; CUDA & HSA Capable GPUs; experimental support for ARM, Xeon Phi, AMD Fiji GPUs
▪ SW: Python 2 and 3; NumPy 1.7 through 1.11
22. Basic Example
Annotations on the code example (figure):
▪ Numba decorator (nopython=True not required)
▪ Array Allocation
▪ Looping over ndarray x as an iterator
▪ Using numpy math functions
▪ Returning a slice of the array
▪ 2.7x speedup!
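The slide's code did not survive extraction. The function below is a hypothetical example written to match the annotations (decorator, array allocation, iteration over an ndarray, a NumPy math function, returning a slice); it should not be read as the original code, and it makes no claim about the quoted speedup. The fallback decorator is an assumption so the sketch also runs without Numba.

```python
import numpy as np

try:
    from numba import jit
except ImportError:                   # assumed fallback: run uncompiled
    def jit(func):
        return func


@jit                                  # Numba decorator
def nan_compact(x):
    """Return the non-NaN elements of a 1D array, in their original order."""
    out = np.empty_like(x)            # array allocation
    out_index = 0
    for element in x:                 # looping over ndarray x as an iterator
        if not np.isnan(element):     # using a NumPy math function
            out[out_index] = element
            out_index += 1
    return out[:out_index]            # returning a slice of the array


data = np.array([1.0, np.nan, 2.0, np.nan, 3.0])
print(nan_compact(data))
```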
24. Universal Functions (Ufuncs)
Ufuncs are a core concept in NumPy for array-oriented computing.
▪ A function with scalar inputs is broadcast across the elements of the input
arrays:
– np.add([1,2,3], 3) == [4, 5, 6]
– np.add([1,2,3], [10, 20, 30]) == [11, 22, 33]
▪ Parallelism is present, by construction. Numba will generate loops and can
automatically multi-thread if requested.
▪ Before Numba, creating fast ufuncs required writing C. No longer!
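As a sketch of writing a ufunc without C, the example below uses Numba's `vectorize` decorator on a scalar function and then relies on NumPy broadcasting. The function `hypot2` is made up for illustration. The fallback to `np.vectorize` is an assumption so the sketch runs without Numba; it gives the same results but none of the speed.

```python
import math

import numpy as np

try:
    from numba import vectorize
except ImportError:                    # assumed fallback: pure-Python vectorizer
    def vectorize(*args, **kwargs):
        return np.vectorize


# The body is written for scalars; the decorator generates the element loop.
# With Numba, passing target="parallel" as well requests a multi-threaded loop.
@vectorize(["float64(float64, float64)"])
def hypot2(x, y):
    return math.sqrt(x * x + y * y)


legs_a = np.array([3.0, 5.0])
legs_b = np.array([4.0, 12.0])
print(hypot2(legs_a, legs_b))          # element-wise over two arrays
print(hypot2(legs_a, 4.0))             # scalar broadcast across the array
```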
27. Distributed Computing
Example: Dask (figure): a Dask Client (Haswell) sends the @jit-compiled function f(x) through the Dask Scheduler to three Dask Workers (Skylake, Skylake, Knights Landing), each of which runs f(x).
@jit
def f(x):
    …
- Serialize with pickle module
- Works with Dask and Spark (and others)
- Automatic recompilation for each target
28. Other Numba Features
▪ Detects CPU model during code generation and instructs LLVM to optimize for that
architecture.
▪ Automatic dispatch to multiple type-specialized implementations of the same
function
▪ Uses LLVM autovectorization optimization passes for SIMD code generation
▪ Supports calls directly to C with CFFI and ctypes
▪ Optional caching of compiled functions to disk
▪ Ahead of time compilation to shared libraries
▪ Extension API allowing 3rd parties to extend the compiler with new data types and
functions.
29. Conclusion
▪ Numba - Create new high performance functions on-the-fly with pure Python
▪ Understands NumPy arrays and many NumPy operations
▪ Supplies several compilation modes and options for multi-threading
▪ Use with your favorite distributed computing framework
▪ For more information: http://numba.pydata.org
▪ Comes with Anaconda: https://www.continuum.io/downloads
30. Call To Action
• Start with either Intel’s or Continuum’s distribution
• Both have Intel performance goodness baked in!
• You cannot go wrong either way!
• Give Numba* a try and see performance increase
• Try Python* performance profiling with Intel® VTune™ Amplifier!
• Intel Distribution for Python is free!
https://software.intel.com/en-us/intel-distribution-for-python
– Commercial support included for Intel® Parallel Studio XE customers!
– Easy to install with Anaconda* https://anaconda.org/intel/
Intel is working with community leaders like Continuum Analytics to bring the
BEST performance on IA to Python developers
31. Thank you for your time
Stan Seibert
stan.seibert@continuum.io
www.intel.com/hpcdevcon
Sergey Maidanov
sergey.Maidanov@intel.com
www.intel.com/hpcdevcon
Intel® Distribution for Python*
Powered by Anaconda*