Future of Data Science
as a profession
Jose Quesada, Director, Data Science Retreat
@datascienceret
http://datascienceretreat.com/
The promise
The machine learning promise
People should be able to predict:
• Which employee will leave in the next 6 months
• Which electric generator is likely to die in the next 2 weeks
• Which sales lead has the highest potential to close in the next 3
months
• What each new website visitor is likely to buy based on past visitors
http://www.slideshare.net/bigml/the-past-present-and-future-of-machine-learning-apis
Jao. The Past, Present, and Future of Machine Learning APIs
http://www.enlitic.com/healthcare.html
Smile detection
Example Graduate portfolio project from DSR
03. Smile detection on video streams. Works
reliably with multiple people on cam.
Applications: evaluating funny YouTube videos
Data analysis has become super easy.
But has it?
• Great libraries exist with every algorithm under the sun
The machine learning promise
(Anyone who can turn on a computer) should be able
to predict:
• Which employee will leave in the next 6 months
• Which electric generator is likely to die in the next 2 weeks
• Which sales lead has the highest potential to close in the next 3 months
• What each new website visitor is likely to buy based on past visitors
Paco Nathan: Data Science in future tense
Why data analysis is still
hard, after all the libraries
and APIs
Andreas Mueller’s map
Trent McConaghy’s riff on Andy
http://trent.st/ffx/
Two machine learners, two maps
Andreas Mueller, PhD
Andy is an Assistant Research Scientist
at the NYU Center for Data Science,
building a group to work on open
source software for data science.
Previously he was a Machine Learning
Scientist at Amazon, working on
computer vision and forecasting
problems. He is one of the core
developers of the scikit-learn machine
learning library, and has maintained
it for several years.
He authored the now-famous model
picker image from scikit-learn.
Trent McConaghy, PhD
Trent is co-founder & CTO of ascribe,
which uses modern crypto, ML, and
big data to tackle challenges in digital
property ownership. His two startups
applied ML in the enterprise
semiconductor space: ADA was acquired in
2004 and Solido is going strong. His
interests include large scale
regression, automating creativity,
anything labeled "impossible", and
thousand-fold improvements. He was
raised on a pig farm in Canada.
Why data analysis is still hard, after
all the libraries and APIs
• It’s too easy to lie to yourself about it working
• It’s very hard to tell whether it could work if it doesn’t
• There is no free lunch
http://blog.mikiobraun.de/2014/02/data-analysis-hard-parts.html
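A minimal sketch of the first bullet, using only the Python standard library. The dataset, the Memorizer class, and all names here are hypothetical and not from the talk. The labels are coin flips, so nothing can be learned, yet a model that memorizes its training data still looks perfect when it is scored on that same data. Only the held-out score reveals that it does not work.

```python
import random


def make_noise_dataset(n, seed):
    """Return n (feature, label) pairs whose labels are pure coin flips."""
    rng = random.Random(seed)
    # Features are unique integers, so they carry no signal about the label.
    return [(i, rng.randint(0, 1)) for i in range(n)]


class Memorizer:
    """A hypothetical 'model' that stores every training example verbatim."""

    def fit(self, rows):
        self.table = dict(rows)
        return self

    def predict(self, x):
        # Inputs never seen in training fall back to a constant guess of 0.
        return self.table.get(x, 0)


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model.predict(x) == y for x, y in rows) / len(rows)


data = make_noise_dataset(2000, seed=0)
train, held_out = data[:1000], data[1000:]

model = Memorizer().fit(train)
train_acc = accuracy(model, train)        # scored on data the model has seen
held_out_acc = accuracy(model, held_out)  # scored on data it has never seen

print(f"train accuracy:    {train_acc:.2f}")
print(f"held-out accuracy: {held_out_acc:.2f}")
```

The training accuracy is perfect by construction, while the held-out accuracy should land near chance. Evaluating only on data the model has already seen is one common way to lie to yourself.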
No free lunch theorem
• There is no universally optimal learning algorithm as
shown by the No Free Lunch Theorem: There is no
algorithm which is better than all the rest for all kinds
of data.
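A toy illustration of the idea, not a proof of the theorem. The setup is an assumption made for this sketch: inputs are all 3-bit vectors, five are used for training and three are held back, and every possible labeling of the inputs is treated as an equally likely target. Two quite different learners (both hypothetical helpers written for this example) are averaged over all targets.

```python
from itertools import product

INPUTS = list(product([0, 1], repeat=3))   # all 8 possible 3-bit inputs
TRAIN_X, TEST_X = INPUTS[:5], INPUTS[5:]   # 5 seen points, 3 unseen points


def majority_learner(train):
    """Predict the most common training label everywhere (ties go to 1)."""
    ones = sum(y for _, y in train)
    guess = 1 if 2 * ones >= len(train) else 0

    def predict(x):
        return guess

    return predict


def nearest_neighbor_learner(train):
    """Predict the label of the closest training point (Hamming distance)."""

    def predict(x):
        def distance(pair):
            return sum(a != b for a, b in zip(pair[0], x))

        return min(train, key=distance)[1]

    return predict


def average_off_training_accuracy(learner):
    """Average accuracy on unseen inputs over every possible target function."""
    correct = total = 0
    for labels in product([0, 1], repeat=len(INPUTS)):
        target = dict(zip(INPUTS, labels))
        model = learner([(x, target[x]) for x in TRAIN_X])
        correct += sum(model(x) == target[x] for x in TEST_X)
        total += len(TEST_X)
    return correct / total


print(average_off_training_accuracy(majority_learner))          # 0.5
print(average_off_training_accuracy(nearest_neighbor_learner))  # 0.5
```

Both learners score exactly 0.5 on unseen inputs once every target function is counted equally. An algorithm only wins when the data has structure that matches its assumptions, which is why the choice of algorithm still takes judgment.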
“Toolified”
• As more and more ML techniques become "toolified", the
problem is that the business doesn't understand that the
hard work is still ahead of them.
• Home Depot sells hammers and lumber, and while some
people have the skill and dedication to build their own
house, most folks are smart enough to hire someone who
knows what they're doing so the thing doesn't fall in and kill
their family.
• Blind faith in the power of tools is not helpful
80% data wrangling, 20% building & testing
models
Is model building automatable?
How about the data wrangling part? It’s actually the larger chunk
Automating the data
scientist
Machine learning APIs
Machine learning for data wrangling
• Zoubin Ghahramani, Automatic statistician
• It's easy to shoot yourself in the foot with automated
tools — and convince yourself that the results are
meaningful when they're not
Alternative:
interfaces that draw
the most useful
information out of
people
Aka ‘The Luis von Ahn trick’.
Human computation: combine
human brainpower with computers
to solve problems that neither could
solve alone.
ReCAPTCHA: Computer-generated
tests that humans are routinely able
to pass but that computers have not
yet mastered.
Actionable advice for
individuals
Goal
• Become a full-stack problem solver
• AKA the unicorn data scientist
How to get there
• Focus on delivering business value
How to get there
Only after the business side is covered: focus on the tech
stack.
• Machine learning
• Big data/ engineering
• When to use ML at scale, when to sample and run on a single
machine
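One way to make the "sample and run on a single machine" option concrete is reservoir sampling, which draws a fixed-size uniform sample from a stream too large to hold in memory. This is a sketch of the standard technique (often called Algorithm R), not something the talk prescribes; the function name and parameters are chosen for illustration.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    The stream is read once and only k items are kept in memory at a time.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1) by picking a
            # random slot in 0..i and replacing only if it falls inside
            # the reservoir.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Usage: sample 1,000 rows from a "stream" of a million record ids.
rows = reservoir_sample(range(1_000_000), k=1000, seed=42)
print(len(rows))
```

If a model trained on such a sample is good enough for the business problem, the distributed infrastructure may not be needed at all.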
Constant learning
• The field changes faster than any other in technology
• If you are not willing to allocate ‘time outside work’ to
learn new things you will stagnate fast
Not being the equivalent of a code
monkey
• MOOCs have lowered the barrier to entry to machine
learning.
• Nowadays, you cannot be ‘the guy who knows how to
run (insert off-the-shelf-algo-here)’. In dataland, that’s
the equivalent of being a code monkey. MOOCs and
superb libraries (scikit-learn, R’s ecosystem) made sure
there are plenty of people who can throw, say, a random
forest at a problem. In the modern world, this is not
adding that much value.
Picking problems to add the most
value
• Sometimes beating what the company is already doing
(often, nothing) offers a lot of value. Detecting fraud
poorly is better than not detecting fraud
Data Science will continue to be
democratized
• There’s no shortage of data
scientists.
• Early 1900s: people thought the
number of cars on the road would
be limited by the supply of
trained chauffeurs.
Machine learning can very quickly get
you, say, 80% of the way to solving just
about any (real world) problem
You want to apply ML to contexts that are fault tolerant:
• Online ad targeting
• Ranking search results
• Recommendations
• Spam filtering
ML quickly hits a point of
diminishing returns
“The gain is not worth the pain.”
Actionable advice for
companies
Talent: invest in it
• The hunt for the 10x programmer continues (although
few companies succeed)
• In data science, the equivalent is the unicorn data
scientist
• Unicorn data scientist should generate more business
value than a 10x programmer
• Market agrees: supersalaries of >200k are common for
unicorn data scientists
Talent: beware of the fake data
scientist
• Each LinkedIn job ad for a data scientist gets ~150
applications
• Often people who just rebranded themselves but have no
real experience
• Very common in guys bailing out of academia
• HR managers cannot tell the difference
• It’s a common mistake to hire one, and never be able to
produce business value
Talent: easier to find than you may
think
• Online courses have raised the bar
• Intensive bootcamps do work, as long as people have
built something at the end
• You will still get 150 fake data scientists for each decent
one
A future where ML has
been popular for years.
What does it look like?
Next 3 years
• ML APIs will enable people with less and less skill to run
quite sophisticated analyses
• Startups doing ML as a service will grow up, then
contract. ML will stop being a key competitive
advantage in most (not all) domains
• Blind faith in the power of tools will lead to wrong
decisions, which will lead to a backlash
Next 10 years
• Prediction: C-level people will be data scientists in the
future
• Product managers become data scientists, or get
replaced by one
DS is a chaotic field and
people don’t really know
what they want (much less
what they need)
Interested in Data Science Retreat?
Apply to either of our two tracks
http://datascienceretreat.com/
Thank You!
Jose Quesada, PhD
Director, Data Science Retreat
@datascienceret
me@josequesada.com
References
• Paco Nathan. Data Science in Future Tense
• Chris Dixon. Machine Learning Is Really Good at Partially
Solving Just About Any Problem
• Jao. The Past, Present, and Future of Machine Learning
APIs


Editor's Notes

  1. It was almost a joke. Too much email asking the ‘when to do what’ question.
  2. If you thought scikit-learn was convenient
  3. What is business value? If you have been in academia or away from a customer-facing role most of your career, you probably don’t have good intuitions about this. A sure-fire way to learn is to start a business. Or take a customer-facing role. Even so, it may take years to know your market.
  4. The discussion about the shortage of data scientists reminds me that in the early 1900s people thought that the number of cars on the road would be limited by the supply of trained chauffeurs. Then Henry Ford and others built cars that owners could drive themselves. New tools are going to be available that business owners can use themselves without needing data scientists.
  5. You need to apply ML to contexts that are fault tolerant: online ad targeting, ranking search results, recommendations, spam filtering.