Managing Data Science by David Martínez Rego
Managing
Data Science
Dr. David Martinez Rego
Big Data Spain 2016
Lead Data Science
• Leading Data Science is probably one of the most exciting/fun positions that someone can have nowadays
• Create computational algorithms that can make decisions and learn from their errors in any problem that can be formulated in numbers
• There are many paths to the treasure in your data; the role of a Lead Data Scientist is to find the shortest and safest one.
DS plan meeting!
Top weird moments
• I prefer not to give you any insight into the problem. Why do you want to know what the columns are? I prefer that you treat the problem as just data.
• Labels do exist. We do not have permission to access them. You inspect the results and see if they make sense.
• We can give you a screenshot of the dashboard, and you can tell the algorithm to predict whether it will break.
• Your algorithm is wrong! I have been managing this for 10 years, it cannot be like that.
Other meetings!
Common language
• Because of its short public life, Machine Learning lacks a general understanding of its fundamental limitations/principles.
• The focus on practicality makes the literature and media oblivious to these fundamentals.
• Only when we agree on a common language can all parties in a room start to understand each other.
Learning theory
• A set of fundamental results that lie behind many of the common practices and algorithms we use nowadays.
• It has been heavily researched since the 80s and offers a set of mathematical guarantees/limitations for the practice of ML.
• Useful both for ML practitioners and managers as a rule of thumb to understand and manage DS.
Domain
Dataset
Loss function
Hypothesis space
Training algorithm
Evaluation
The problem
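The components listed above make up the formal learning problem. As a minimal sketch, assuming a toy 1-D task with threshold classifiers (all data and names here are illustrative, not from the talk):

```python
# Domain: real-valued inputs x with binary labels y in {0, 1}.
train = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]   # Dataset (training split)
test  = [(0.2, 0), (0.7, 1)]                        # Dataset (evaluation split)

def zero_one_loss(h, data):
    """Loss function: fraction of misclassified points."""
    return sum(h(x) != y for x, y in data) / len(data)

# Hypothesis space: threshold classifiers h_t(x) = 1 if x >= t else 0.
def make_hypothesis(t):
    return lambda x: int(x >= t)

def train_erm(data):
    """Training algorithm: empirical risk minimisation over candidate thresholds."""
    candidates = [x for x, _ in data]
    return min((make_hypothesis(t) for t in candidates),
               key=lambda h: zero_one_loss(h, data))

h = train_erm(train)
# Evaluation: empirical error on the training and held-out splits.
print(zero_one_loss(h, train), zero_one_loss(h, test))  # → 0.0 0.0
```

Every choice in the pipeline (class of hypotheses, loss, optimiser, evaluation split) is one of the components named on the slide.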
No Free Lunch
How can we prevent such failures?
By using our prior knowledge about a specific learning task to avoid the distributions that will cause us to fail when learning that task. Such prior knowledge can be expressed by restricting our hypothesis class.
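The effect of restricting the hypothesis class can be sketched on a toy example, contrasting an effectively unrestricted hypothesis ("memorise the training set") with a restricted class that encodes the prior that labels are monotone in x (all data and names here are illustrative):

```python
train = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
test  = [(0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1)]

def loss(h, data):
    """0-1 loss: fraction of misclassified points."""
    return sum(h(x) != y for x, y in data) / len(data)

# Unrestricted: perfect on the training points, arbitrary everywhere else.
lookup = dict(train)
memoriser = lambda x: lookup.get(x, 0)

# Restricted class: threshold classifiers h_t(x) = [x >= t].
def fit_threshold(data):
    xs = [x for x, _ in data]
    return min((lambda x, t=t: int(x >= t) for t in xs),
               key=lambda h: loss(h, data))

h = fit_threshold(train)
print(loss(memoriser, train), loss(memoriser, test))  # → 0.0 0.5
print(loss(h, train), loss(h, test))                  # → 0.0 0.0
```

Both learners achieve zero training error, but only the restricted class generalises; the memoriser is at chance level on unseen points, which is the failure mode the no-free-lunch theorem predicts for an unrestricted learner.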
No Free Lunch takeaways
• The no-free-lunch theorem is a mathematical certificate: no single learning technique outperforms all others across every problem
• For managers & HR
• foresee an investment in a variety of specialists if you plan to tackle an increasing number of data challenges
• escape from promises of one killer technique that acts as a hammer for all problems
• For Data Science teams
• foresee an increasing number of specific techniques which you have to keep up to date (team effort)
Generalisation bounds
• How can we be sure that a model will not fail in production?
• How can we correct things when they do not go well?
• How can I know if I am being wasteful?
Generalisation bounds
• An ML practitioner is going to train a model of complexity d (VC-dimension) on m samples, and she is going to observe a training error Ls.
• With probability 1-𝜹, the expected performance of this model in production is bounded by Ls plus a complexity term of order √((d + ln(1/𝜹)) / 2m) (exact constants depend on the theorem).
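A hedged sketch of this VC-style gap term (the constant factors vary across textbooks; this is an illustrative form, not the exact theorem):

```python
import math

def generalisation_gap(d, m, delta):
    """Simplified VC-style bound on the gap between production error
    and training error Ls, holding with probability 1 - delta.
    Illustrative constants only."""
    return math.sqrt((d + math.log(1 / delta)) / (2 * m))

# Same model complexity, growing sample size: the guaranteed gap shrinks
# at the familiar 1/sqrt(m) rate.
for m in (1_000, 10_000, 100_000):
    print(m, round(generalisation_gap(d=100, m=m, delta=0.05), 4))
```

Even as a rough rule of thumb, this makes the two management levers explicit: lower d (simpler models) or higher m (more data) both tighten the guarantee.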
Manage DS
• How can we correct when things do not go well?
• Get a larger sample
• Change the hypothesis class by:
• Enlarging it
• Reducing it
• Completely changing it
• Changing the parameters you consider
• Change the feature representation of the data
• Change the optimisation algorithm used to apply your learning
rule
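One of the levers above, changing the hypothesis class, can be sketched on a toy example where the original class is simply too poor to fit the data (all data and names here are illustrative):

```python
# Data where y = 1 only on a middle interval: no single threshold fits it.
train = [(0.1, 0), (0.3, 1), (0.5, 1), (0.7, 1), (0.9, 0)]

def loss(h, data):
    """0-1 loss: fraction of misclassified points."""
    return sum(h(x) != y for x, y in data) / len(data)

xs = [x for x, _ in train]

# Original class: thresholds h(x) = [x >= t].
thresholds = [lambda x, t=t: int(x >= t) for t in xs]
best_thr = min(thresholds, key=lambda h: loss(h, train))

# Enlarged class: intervals h(x) = [a <= x <= b].
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in xs for b in xs if a <= b]
best_int = min(intervals, key=lambda h: loss(h, train))

print(loss(best_thr, train), loss(best_int, train))  # → 0.2 0.0
```

The same diagnosis-and-fix loop applies to the other levers: more data, different features, or a different optimiser.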
Big Data
• Big Data has had a significant impact on the number of samples m, and also on the complexity d (VC-dimension).
• When tackling Variety by making use of unstructured data, we increase the complexity d, so it should be planned that the sample size m is adequate.
• Review the modelling that we are doing to know whether we need a big database.
• Is it the case that you do not need to maintain all that data?
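Planning whether m is adequate for a given complexity d can be sketched by inverting a simplified VC-style gap bound (illustrative form only; real constants depend on the exact theorem):

```python
import math

def samples_needed(d, epsilon, delta):
    """Samples m so that a simplified VC-style gap term
    sqrt((d + ln(1/delta)) / 2m) falls below epsilon.
    Illustrative constants only."""
    return math.ceil((d + math.log(1 / delta)) / (2 * epsilon ** 2))

# Moving to richer, unstructured features raises d,
# and the data plan must grow with it.
print(samples_needed(d=50, epsilon=0.05, delta=0.05))
print(samples_needed(d=5_000, epsilon=0.05, delta=0.05))
```

The required m grows roughly linearly with d, which is why adopting higher-variety data without a matching data-collection plan leads to models that look fine in training but cannot be trusted in production.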
Half pie syndrome
• Symptoms
• You are spending a lot of
money on gathering
data to fuel growth in
your business
• Your systems look like this pie: succulent, but it seems that your business has lost its appetite.
Enough data?
Andrew Gelman (2005):
“Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is ‘large enough’, you can start subdividing the data to learn more. N is never enough because if it were ‘enough’ you'd already be on to the next problem for which you need more data.”
Big data bounds
Trade-off: algorithm design vs. data engineering
Conclusions
• In order to build a better understanding between data science teams and other stakeholders, we need to make an effort to build a robust common language!
• Learning theory, originally devised as the fundamental theoretical pillar of ML, can help to build that understanding.
• These proven basic laws can give you a structured way to manage Data Science.
References
• Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
• Léon Bottou and Olivier Bousquet. The Tradeoffs of Large Scale Learning. NIPS 2008.
• Shai Shalev-Shwartz and Nathan Srebro. SVM Optimization: Inverse Dependence on Dataset Size. ICML 2008.
• Andrew Gelman. N is never large enough. http://andrewgelman.com/2005/07/31/n_is_never_larg/