Managing Data Science by David Martínez Rego
Managing
Data Science
Dr. David Martinez Rego
Big Data Spain 2016
Lead Data Science
• Leading Data Science is probably one of the most exciting/fun positions that someone can have nowadays
• Create computational algorithms that can make decisions and learn from their errors in any problem that can be formulated in numbers
• There are many paths to the treasure in your data; the role of a Lead Data Scientist is to find the shortest and safest one.
DS plan meeting!
Top weird moments
• I prefer not to give you any insight into the problem. Why do you want to know what the columns are? I prefer that you treat the problem as just data.
• Labels do exist. We do not have permission to access them. You inspect the results and see if they make sense.
• We can give you a screenshot of the dashboard, and you can tell the algorithm to predict whether it will break.
• Your algorithm is wrong! I have been managing this for 10 years, it cannot be like that.
Other meetings!
Common language
• Because of its short public life, Machine Learning lacks a general understanding of its fundamental limitations/principles.
• The focus on practicality makes the literature and media oblivious to these fundamentals.
• Only when we agree on a common language can all parties in a room start to understand each other.
Learning theory
• A set of fundamental results that lie behind many of the common practices and algorithms we use nowadays.
• It has been heavily researched since the 80s and offers a set of mathematical guarantees/limitations for the practice of ML.
• Useful both for ML practitioners and managers as a rule of thumb to understand and manage DS.
Domain
Dataset
Loss function
Hypothesis space
Training algorithm
Evaluation
The problem
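The components listed above make up the formal learning problem. As a minimal sketch, assuming a toy 1-D task with threshold classifiers (all data and names here are illustrative, not from the talk):

```python
# Domain: real-valued inputs x with binary labels y in {0, 1}.
train = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]   # Dataset (training split)
test  = [(0.2, 0), (0.7, 1)]                        # Dataset (evaluation split)

def zero_one_loss(h, data):
    """Loss function: fraction of misclassified points."""
    return sum(h(x) != y for x, y in data) / len(data)

# Hypothesis space: threshold classifiers h_t(x) = 1 if x >= t else 0.
def make_hypothesis(t):
    return lambda x: int(x >= t)

def train_erm(data):
    """Training algorithm: empirical risk minimisation over candidate thresholds."""
    candidates = [x for x, _ in data]
    return min((make_hypothesis(t) for t in candidates),
               key=lambda h: zero_one_loss(h, data))

h = train_erm(train)
# Evaluation: empirical error on the training and held-out splits.
print(zero_one_loss(h, train), zero_one_loss(h, test))  # → 0.0 0.0
```

Every choice in the pipeline (class of hypotheses, loss, optimiser, evaluation split) is one of the components named on the slide.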
No Free Lunch
How can we prevent such failures?
By using our prior knowledge about a specific learning task to avoid the distributions that will cause us to fail when learning that task. Such prior knowledge can be expressed by restricting our hypothesis class.
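The effect of restricting the hypothesis class can be sketched on a toy example, contrasting an effectively unrestricted hypothesis ("memorise the training set") with a restricted class that encodes the prior that labels are monotone in x (all data and names here are illustrative):

```python
train = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
test  = [(0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1)]

def loss(h, data):
    """0-1 loss: fraction of misclassified points."""
    return sum(h(x) != y for x, y in data) / len(data)

# Unrestricted: perfect on the training points, arbitrary everywhere else.
lookup = dict(train)
memoriser = lambda x: lookup.get(x, 0)

# Restricted class: threshold classifiers h_t(x) = [x >= t].
def fit_threshold(data):
    xs = [x for x, _ in data]
    return min((lambda x, t=t: int(x >= t) for t in xs),
               key=lambda h: loss(h, data))

h = fit_threshold(train)
print(loss(memoriser, train), loss(memoriser, test))  # → 0.0 0.5
print(loss(h, train), loss(h, test))                  # → 0.0 0.0
```

Both learners achieve zero training error, but only the restricted class generalises; the memoriser is at chance level on unseen points, which is the failure mode the no-free-lunch theorem predicts for an unrestricted learner.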
No Free Lunch takeaways
• The no-free-lunch theorem is a mathematical certificate: no single learning technique outperforms all others across every problem
• For managers & HR
• foresee an investment in a variety of specialists if you plan to tackle an increasing number of data challenges
• escape from promises of one killer technique that acts as a hammer for all problems
• For Data Science teams
• foresee an increasing number of specific techniques which you have to keep up to date (team effort)
Generalisation bounds
• How can we be sure that a model will not fail in production?
• How can we correct things when they do not go well?
• How can I know if I am being wasteful?
Generalisation bounds
• An ML practitioner is going to train a model of complexity d (VC-dimension) on m samples, and she is going to observe a training error Ls.
• With probability 1-𝜹, the expected performance of this model in production is bounded by Ls plus a complexity term of order √((d + ln(1/𝜹)) / 2m) (exact constants depend on the theorem).
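A hedged sketch of this VC-style gap term (the constant factors vary across textbooks; this is an illustrative form, not the exact theorem):

```python
import math

def generalisation_gap(d, m, delta):
    """Simplified VC-style bound on the gap between production error
    and training error Ls, holding with probability 1 - delta.
    Illustrative constants only."""
    return math.sqrt((d + math.log(1 / delta)) / (2 * m))

# Same model complexity, growing sample size: the guaranteed gap shrinks
# at the familiar 1/sqrt(m) rate.
for m in (1_000, 10_000, 100_000):
    print(m, round(generalisation_gap(d=100, m=m, delta=0.05), 4))
```

Even as a rough rule of thumb, this makes the two management levers explicit: lower d (simpler models) or higher m (more data) both tighten the guarantee.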
Manage DS
• How can we correct when things do not go well?
• Get a larger sample
• Change the hypothesis class by:
• Enlarging it
• Reducing it
• Completely changing it
• Changing the parameters you consider
• Change the feature representation of the data
• Change the optimisation algorithm used to apply your learning
rule
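One of the levers above, changing the hypothesis class, can be sketched on a toy example where the original class is simply too poor to fit the data (all data and names here are illustrative):

```python
# Data where y = 1 only on a middle interval: no single threshold fits it.
train = [(0.1, 0), (0.3, 1), (0.5, 1), (0.7, 1), (0.9, 0)]

def loss(h, data):
    """0-1 loss: fraction of misclassified points."""
    return sum(h(x) != y for x, y in data) / len(data)

xs = [x for x, _ in train]

# Original class: thresholds h(x) = [x >= t].
thresholds = [lambda x, t=t: int(x >= t) for t in xs]
best_thr = min(thresholds, key=lambda h: loss(h, train))

# Enlarged class: intervals h(x) = [a <= x <= b].
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in xs for b in xs if a <= b]
best_int = min(intervals, key=lambda h: loss(h, train))

print(loss(best_thr, train), loss(best_int, train))  # → 0.2 0.0
```

The same diagnosis-and-fix loop applies to the other levers: more data, different features, or a different optimiser.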
Big Data
• Big Data has had a significant impact on the number of samples m, and also on the complexity d (VC-dimension).
• When tackling Variety by making use of unstructured data, we increase the complexity d, so it should be planned that the sample size m is adequate.
• Review the modelling that we are doing to know whether we need a big database.
• Is it the case that you do not need to maintain all that data?
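Planning whether m is adequate for a given complexity d can be sketched by inverting a simplified VC-style gap bound (illustrative form only; real constants depend on the exact theorem):

```python
import math

def samples_needed(d, epsilon, delta):
    """Samples m so that a simplified VC-style gap term
    sqrt((d + ln(1/delta)) / 2m) falls below epsilon.
    Illustrative constants only."""
    return math.ceil((d + math.log(1 / delta)) / (2 * epsilon ** 2))

# Moving to richer, unstructured features raises d,
# and the data plan must grow with it.
print(samples_needed(d=50, epsilon=0.05, delta=0.05))
print(samples_needed(d=5_000, epsilon=0.05, delta=0.05))
```

The required m grows roughly linearly with d, which is why adopting higher-variety data without a matching data-collection plan leads to models that look fine in training but cannot be trusted in production.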
Half pie syndrome
• Symptoms
• You are spending a lot of
money on gathering
data to fuel growth in
your business
• Your systems look like this pie: succulent, but it seems that your business has lost its appetite.
Enough data?
Andrew Gelman (2005):
“Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is ‘large enough’, you can start subdividing the data to learn more. N is never enough because if it were ‘enough’ you'd already be on to the next problem for which you need more data.”
Big data bounds
Trade-off: algorithm design vs. data engineering
Conclusions
• In order to build a better understanding between data science teams and other stakeholders, we need to make an effort to build a robust common language!
• Learning theory, originally devised as the fundamental theoretical pillar of ML, can help to build that understanding.
• These proven basic laws can give you a structured way to manage Data Science.
References
• Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
• Léon Bottou and Olivier Bousquet. The Tradeoffs of Large Scale Learning. NIPS 2008.
• Shai Shalev-Shwartz and Nathan Srebro. SVM Optimization: Inverse Dependence on Dataset Size. ICML 2008.
• Andrew Gelman. N is never large enough. http://andrewgelman.com/2005/07/31/n_is_never_larg/