Scikit-learn for easy machine learning: the vision, the tool, and the project
- 1. Scikit-learn for easy machine learning:
the vision, the tool, and the project
Gaël Varoquaux
scikit
machine learning in Python
- 4. 1 Scikit-learn: the vision
An enabler
Machine learning
for everybody and
for everything
Machine learning
without learning the
machinery
G Varoquaux 2
- 5. Machine learning in a nutshell
Machine learning is about making predictions from data
- 6. 1 Machine learning: a historical perspective
Artificial Intelligence The 80s
Building decision rules
Eatable?
Mobile?
Tall?
- 10. 1 Machine learning: a historical perspective
Artificial Intelligence The 80s
Building decision rules
Machine learning The 90s
Learn these from observations
Statistical learning 2000s
Model the noise in the observations
Big data today
Many observations,
simple rules
“Big data isn’t actually interesting without machine
learning”
Steve Jurvetson, VC, Silicon Valley
- 11. 1 Machine learning in a nutshell: an example
Face recognition
Andrew Bill Charles Dave
- 14. 1 Machine learning in a nutshell
A simple method:
1 Store all the known (noisy) images and the names
that go with them.
2 For a new (noisy) image, find the stored image that is
most similar.
“Nearest neighbor” method
How many errors on already-known images?
... 0: no errors
Test data ≠ train data
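The nearest-neighbor method above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the slide: the data is synthetic (four hypothetical "people" with noisy 16-pixel "images"), and the function name `nearest_neighbor_predict` is made up for the example. It shows why counting errors on already-known images is uninformative: each stored image is its own nearest neighbor.

```python
import numpy as np


def nearest_neighbor_predict(X_train, y_train, X_new):
    """Return, for each row of X_new, the label of the closest stored row."""
    # Squared Euclidean distance between every new sample and every stored one
    diffs = X_new[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    distances = (diffs ** 2).sum(axis=2)
    return y_train[distances.argmin(axis=1)]


rng = np.random.default_rng(0)

# Synthetic stand-in for face images: 4 identities, 10 noisy samples each
prototypes = rng.normal(size=(4, 16))
y_train = np.repeat(np.arange(4), 10)
X_train = prototypes[y_train] + rng.normal(size=(40, 16))

# Fresh noisy samples of the same identities, never seen during "training"
y_test = np.repeat(np.arange(4), 10)
X_test = prototypes[y_test] + rng.normal(size=(40, 16))

train_errors = int((nearest_neighbor_predict(X_train, y_train, X_train) != y_train).sum())
test_errors = int((nearest_neighbor_predict(X_train, y_train, X_test) != y_test).sum())

print("errors on known images:", train_errors)  # 0: each image matches itself
print("errors on new images:", test_errors)
```

Only the error count on new images says anything about how well the method generalizes.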
- 16. 1 Machine learning in a nutshell: regression
A single descriptor:
one dimension
[Figure: two plots of y against x]
Which model to prefer?
- 17. 1 Machine learning in a nutshell: regression
A single descriptor:
one dimension
[Figure: two plots of y against x]
Problem of “over-fitting”
Minimizing the error is not always the best strategy
(it can mean learning the noise)
Test data ≠ train data
- 19. 1 Machine learning in a nutshell: regression
A single descriptor:
one dimension
[Figure: two plots of y against x]
Prefer simple models
= concept of “regularization”
Balance the number of parameters to learn
with the amount of data
Bias-variance tradeoff
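A small experiment can make the over-fitting argument concrete. The sketch below is an illustration under assumptions, not the data from the slides: it draws noisy points from a straight line and fits polynomials of degree 1 and 9 with NumPy. The more flexible model always reaches a training error at least as low, which is why training error alone cannot be used to choose between models.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def noisy_line(n_samples):
    """Synthetic data: a straight line plus Gaussian noise (an assumption)."""
    x = rng.uniform(-1, 1, size=n_samples)
    y = 2 * x + rng.normal(scale=0.5, size=n_samples)
    return x, y


def mean_squared_error(model, x, y):
    return float(np.mean((model(x) - y) ** 2))


x_train, y_train = noisy_line(15)
x_test, y_test = noisy_line(200)

train_mse, test_mse = {}, {}
for degree in (1, 9):
    # Least-squares polynomial fit; a higher degree means more parameters
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = mean_squared_error(model, x_train, y_train)
    test_mse[degree] = mean_squared_error(model, x_test, y_test)
    print(f"degree {degree}: train MSE {train_mse[degree]:.3f}, "
          f"test MSE {test_mse[degree]:.3f}")
```

Comparing the two columns of the printout is the point: the error on held-out data, not the training error, tells which model to prefer.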
- 21. 1 Machine learning in a nutshell: regression
A single descriptor:
one dimension
[Figure: y against x]
Two descriptors:
2 dimensions
[Figure: y against X_1 and X_2]
More parameters
⇒ need more data
“curse of dimensionality”
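One way to see the curse of dimensionality numerically is to look at distances between random points. The sketch below is an illustration chosen for this purpose, not something taken from the slides: with a fixed number of samples, it compares the distance to the nearest neighbor with the average distance to all other points. As the number of descriptors grows, the nearest point is hardly closer than a typical point, so the same amount of data covers the space much less well.

```python
import numpy as np

rng = np.random.default_rng(0)


def nearest_to_mean_distance_ratio(n_samples, n_features):
    """Average ratio of nearest-neighbor distance to mean distance."""
    X = rng.uniform(size=(n_samples, n_features))
    # Pairwise squared distances computed from the Gram matrix
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2 * X @ X.T
    dists = np.sqrt(np.clip(sq_dists, 0, None))
    np.fill_diagonal(dists, np.nan)  # ignore the distance of a point to itself
    nearest = np.nanmin(dists, axis=1)
    mean = np.nanmean(dists, axis=1)
    return float(np.mean(nearest / mean))


ratios = {d: nearest_to_mean_distance_ratio(200, d) for d in (1, 2, 10, 100)}
for n_features, ratio in ratios.items():
    print(f"{n_features:>3} features: nearest / mean distance = {ratio:.2f}")
```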
- 23. 1 Machine learning in a nutshell: classification
Example:
recognizing hand-written digits
Represent with 2 numerical features
[Figure: samples plotted against the two features X1 and X2]
- 25. 1 Machine learning in a nutshell: classification
[Figure: samples plotted against the two features X1 and X2]
It’s about finding
separating lines
- 30. 1 Machine learning in a nutshell: models
[Figure panels: 1 staircase; 2 staircases combined; 3 staircases combined; 300 staircases combined]
Fit with a staircase of 10 constant values
Fit a new staircase on the errors
Keep going
Boosted regression trees
Complexity trade-offs
Computational + statistical
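The staircase recipe can be written out by hand. The sketch below is a simplified illustration under assumptions: the data is synthetic, and each "staircase" is taken to be a regression tree limited to 10 leaves (`max_leaf_nodes=10`), fitted on the errors left by the previous staircases. scikit-learn's own `GradientBoostingRegressor` follows the same idea with refinements such as a learning rate; this loop is not its implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic 1D regression problem (an assumption, not the slide's data)
x = np.sort(rng.uniform(0, 10, size=300))
X = x[:, np.newaxis]
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

prediction = np.zeros_like(y)
train_mse = []
for _ in range(3):
    # A tree with 10 leaves predicts 10 constant values: one staircase
    staircase = DecisionTreeRegressor(max_leaf_nodes=10)
    # Fit the new staircase on the current errors (residuals)
    staircase.fit(X, y - prediction)
    prediction += staircase.predict(X)
    train_mse.append(float(np.mean((y - prediction) ** 2)))

print("training MSE after 1, 2, 3 staircases:", train_mse)
```

The training error can only go down as staircases are added, which is the computational and statistical trade-off at stake: more staircases cost more time and eventually fit the noise.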
- 32. 1 Machine learning in a nutshell: unsupervised
Stock market structure
Unlabeled data
more common than labeled data
- 33. Machine learning
Mathematics and algorithms for fitting predictive models
Regression
Classification
Unsupervised...
Notions of overfitting, test error,
regularization, model complexity
- 34. Machine learning is everywhere
Image recognition
Marketing (click-through rate)
Movie / music recommendation
Medical data
Logistics chains (e.g. supermarkets)
Language translation
Detecting industrial failures
- 36. Real statisticians use R
And real astronomers use IRAF
Real economists use Gauss
Real coders use C / assembler
Real experiments are controlled in LabVIEW
Real Bayesians use BUGS / Stan
Real text processing is done in Perl
Real deep learning is best done with Torch (Lua)
And medical doctors only trust SPSS
- 37. 1 My stack
Python, what else?
General purpose
Interactive language
Easy to read / write
- 38. 1 My stack
The scientific Python stack
numpy arrays
Mostly a float**
No annotation / structure
Universal across applications
Easily shared with C / fortran
[Figure: a 2D array of digits]
- 39. 1 My stack
The scientific Python stack
numpy arrays
Connecting to
scipy
scikit-image
pandas
...
It’s about plugging things
together
Being Pythonic and
SciPythonic
- 41. 1 scikit-learn vision
Machine learning for all
No specific application domain
No machine-learning expertise required
High-quality Pythonic software library
Interfaces designed for users
Community-driven development
BSD licensed, very diverse contributors
http://scikit-learn.org
- 42. 1 Between research and applications
Machine learning research
Conceptual complexity is not an issue
New and bleeding edge is better
Simple problems are old science
In the field
Tried and tested (aka boring) is good
Little sophistication from the user
API is more important than maths
Solving simple problems matters
Solving them really well matters a lot
- 43. 2 Scikit-learn: the tool
A Python library for machine learning
© Theodore W. Gray
- 44. 2 A Python library
A library, not a program
More expressive and flexible
Easy to include in an ecosystem
As easy as py
    from sklearn import svm
    classifier = svm.SVC()
    classifier.fit(X_train, Y_train)
    Y_test = classifier.predict(X_test)
- 45. 2 API: specifying a model
A central concept: the estimator
Instantiated without data
But specifying the parameters
    from sklearn.neighbors import KNeighborsClassifier
    estimator = KNeighborsClassifier(n_neighbors=2)
- 46. 2 API: training a model
Training from data
    estimator.fit(X_train, Y_train)
with:
X a numpy array with shape
n_samples × n_features
y a numpy 1D array, of ints or floats, with shape
n_samples
- 47. 2 API: using a model
Prediction: classification, regression
    Y_test = estimator.predict(X_test)
Transforming: dimension reduction, filter
    X_new = estimator.transform(X_test)
Test score, density estimation
    test_score = estimator.score(X_test)
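Put together, the estimator API reads as follows. This is a minimal sketch on the iris dataset bundled with scikit-learn; the choice of `KNeighborsClassifier` and `PCA` is only illustrative. Note one detail that the one-line summary leaves out: for supervised estimators, `score` also takes the true labels, whereas the single-argument form applies to models such as density estimators.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Specifying a model: instantiated without data, only with parameters
classifier = KNeighborsClassifier(n_neighbors=2)

# Training a model: X has shape (n_samples, n_features), y has shape (n_samples,)
classifier.fit(X_train, y_train)

# Prediction on samples the model has not seen
y_pred = classifier.predict(X_test)

# Test score: a supervised estimator compares predictions to the true labels
test_score = classifier.score(X_test, y_test)

# Transforming: here, dimension reduction from 4 features to 2 components
reducer = PCA(n_components=2)
reducer.fit(X_train)
X_new = reducer.transform(X_test)

print(y_pred.shape, X_new.shape, round(test_score, 2))
```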
- 49. 2 Vectorizing
From raw data to a sample matrix X
For text data: counting word occurrences
- Input data: list of documents (strings)
- Output data: numerical matrix
    from sklearn.feature_extraction.text import HashingVectorizer
    hasher = HashingVectorizer()
    X = hasher.fit_transform(documents)
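A runnable version of the vectorizing step, as a small sketch: the three documents are made up for the example, and `n_features` is reduced from its default only to keep the output matrix small.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical input: a list of documents, each one a string
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "machine learning in Python",
]

# Each word is hashed to one of n_features columns; no vocabulary is stored
hasher = HashingVectorizer(n_features=2 ** 10)
X = hasher.fit_transform(documents)

# Output: a sparse numerical matrix with one row per document
print(X.shape)  # (3, 1024)
```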
- 50. 2 Scikit-learn: very rich feature set
Supervised learning
Decision trees (Random-Forest, Boosted Tree)
Linear models
SVM
Unsupervised Learning
Clustering
Dictionary learning
Outlier detection
Model selection
Built-in cross-validation
Parameter optimization
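The model-selection tools can be illustrated briefly. The sketch below assumes a recent scikit-learn, where these helpers live in `sklearn.model_selection`; the estimator and the grid of values for `C` are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Built-in cross-validation: one score per held-out fold
scores = cross_val_score(SVC(), X, y, cv=5)
print("cross-validation scores:", scores)

# Parameter optimization: try each value of C, keep the best by cross-validation
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)
best_C = search.best_params_["C"]
print("best C:", best_C)
```

After fitting, the `GridSearchCV` object behaves like an estimator itself, so it can be used with `predict` and `score` as usual.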
- 52. 2 Computational performance
(time in seconds)
              scikit-learn  mlpy    pybrain  pymvpa  mdp     shogun
SVM           5.2           9.47    17.5     11.52   40.48   5.63
LARS          1.17          105.3   -        37.35   -       -
Elastic Net   0.52          73.7    -        1.44    -       -
kNN           0.57          1.41    -        0.56    0.58    1.36
PCA           0.18          -       -        8.93    0.47    0.33
k-Means       1.34          0.79    ∞        -       35.75   0.68
Algorithmic optimizations
Minimizing data copies
Random Forest fit time (s)
Implementation      Library (language)              Fit time (s)
Scikit-Learn-RF     Scikit-Learn (Python, Cython)   203.01
Scikit-Learn-ETs    Scikit-Learn (Python, Cython)   211.53
OpenCV-RF           OpenCV (C++)                    4464.65
OpenCV-ETs          OpenCV (C++)                    3342.83
OK3-RF              OK3 (C)                         1518.14
OK3-ETs             OK3 (C)                         1711.94
Weka-RF             Weka (Java)                     1027.91
R-RF                randomForest (R, Fortran)       13427.06
Orange-RF           Orange (Python)                 10941.72
Figure: Gilles Louppe
- 54. What if the data does not fit in memory?
“Big data”:
Petabytes...
Distributed storage
Computing cluster
Mere mortals:
Gigabytes...
Python programming
Off-the-shelf computers
See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns
- 56. 2 On-line algorithms
    estimator.partial_fit(X_train, Y_train)
Linear models
sklearn.linear_model.SGDRegressor
sklearn.linear_model.SGDClassifier
Clustering
sklearn.cluster.MiniBatchKMeans
sklearn.cluster.Birch (new in 0.16)
PCA (new in 0.16)
sklearn.decomposition.IncrementalPCA
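An on-line training loop might look like the sketch below. The data stream is simulated: `data_chunks` is a hypothetical generator standing in for chunks read from disk or a database. For classifiers, `partial_fit` needs the full list of classes on the first call, since a single chunk may not contain all of them.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])


def data_chunks(n_chunks, chunk_size=200, n_features=5):
    """Hypothetical data stream; in practice each chunk would be loaded from storage."""
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, n_features))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


estimator = SGDClassifier(random_state=0)
for X_chunk, y_chunk in data_chunks(20):
    # Only one chunk is in memory at a time; the model is updated incrementally
    estimator.partial_fit(X_chunk, y_chunk, classes=classes)

# Evaluate on a fresh chunk that was not used for training
X_test, y_test = next(data_chunks(1))
test_score = estimator.score(X_test, y_test)
print("accuracy on a new chunk:", test_score)
```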
- 57. 2 On-the-fly data reduction
Many features
⇒ Reduce the data as it is loaded
    X_small = estimator.transform(X_big, y)
- 58. 2 On-the-fly data reduction
Random projections (will average features)
sklearn.random_projection
random linear combinations of the features
Fast clustering of features
sklearn.cluster.FeatureAgglomeration
on images: super-pixel strategy
Hashing when observations have varying size
(e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
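The first two reduction strategies can be tried on a synthetic matrix. In the sketch below, the data sizes and the target of 100 reduced features are arbitrary assumptions. Both transformers turn a wide matrix into a narrower one with the same number of rows.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.random_projection import SparseRandomProjection

rng = np.random.default_rng(0)
# Synthetic data with many features: 50 samples, 1000 features
X_big = rng.normal(size=(50, 1000))

# Random projections: random linear combinations of the features
projector = SparseRandomProjection(n_components=100, random_state=0)
X_projected = projector.fit_transform(X_big)

# Fast clustering of features: similar features are grouped and averaged
agglomerator = FeatureAgglomeration(n_clusters=100)
X_clustered = agglomerator.fit_transform(X_big)

print(X_projected.shape, X_clustered.shape)  # (50, 100) (50, 100)
```

The random projection does not look at the data to build its projection matrix, which makes it cheap to apply chunk by chunk; feature agglomeration learns its clusters from the data it is fitted on.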
- 63. 3 Having an impact
1% of Debian installs
1200 job offers on Stack Overflow
- 65. 3 Community-based development in scikit-learn
Huge feature set:
benefits of a large team
Project growth:
More than 200 contributors
∼ 12 core contributors
1 full-time INRIA programmer
from the start
Estimated cost of development: $6 million
COCOMO model,
http://www.ohloh.net/p/scikit-learn
- 66. 3 Many eyes make code fast
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
- 67. 3 6 steps to a community-driven project
1 Focus on quality
2 Build great docs and examples
3 Use GitHub
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors,
give them credit, decision power
http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
- 68. 3 Quality assurance
Code review: pull requests
Can include newcomers
We read each other’s code
Everything is discussed:
- Should the algorithm go in?
- Are there good defaults?
- Are names meaningful?
- Are the numerics stable?
- Could it be faster?
- 69. 3 Quality assurance
Unit testing
Everything is tested
Great for numerics
Common tests enforce, on all estimators:
- consistency with the API
- basic invariances
- good handling of various inputs
If it ain’t tested
it’s broken
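The idea of one test applied to every estimator can be sketched as below. This is a simplified illustration, not scikit-learn's actual test suite (its common checks live in `sklearn.utils.estimator_checks`); the helper `check_basic_api` and the short list of estimators are made up for the example.

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

estimators = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
]


def check_basic_api(estimator, X, y):
    """Hypothetical common check: the same contract must hold for every estimator."""
    est = clone(estimator)  # fresh, unfitted copy with the same parameters
    # Consistency with the API: fit returns the estimator itself
    assert est.fit(X, y) is est
    predictions = est.predict(X)
    # One prediction per sample, and only labels seen during training
    assert predictions.shape == (X.shape[0],)
    assert set(predictions) <= set(y)


checked = []
for estimator in estimators:
    check_basic_api(estimator, X, y)
    checked.append(type(estimator).__name__)

print("passed:", checked)
```

Writing the check once and looping over estimators means that a new estimator is held to the same contract as the existing ones from the day it is added.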
- 72. 3 The tragedy of the commons
Individuals, acting independently and rationally according
to each one’s self-interest, behave contrary to the
whole group’s long-term best interests by depleting
some common resource.
Wikipedia
Make it work, make it right, make it boring
Core projects (boring) taken for granted
⇒ Hard to fund, less excitement
They need citation, in papers & on corporate web pages
+ It’s so hard to scale
User support
Growing codebase