Demystifying Data Science
Venkatesh
Data Science Expert and Machine Learning Researcher
What is Data Science?
Data science is an interdisciplinary field that uses
scientific methods, processes, algorithms and systems
to extract knowledge and insights from data in
various forms, both structured and
unstructured,[1][2]
similar to data mining.
Well, tell me in layman's terms
Data
Domain Expertise
Algorithms
Insights
Data products
Automation /
Optimization
Business value
Intelligent Systems - A simple definition
Systems that perform actions that,
if performed by humans, would be
considered intelligent
Tasks of Intelligence
Sensing
Language Understanding
Planning
Problem Solving
Knowledge
Decision Making
Learning
Inference
Language Generation
Robotics
Control
Companies have AI issues
Engineering wants to get its hands on Machine Learning
The C-Suite needs an “AI” strategy
Marketing wants to include “AI” in product descriptions
Product is afraid of falling behind
Everyone is pitching you technology
Wait... What happened to data warehousing?
First, what is data warehousing?
● Integrated: Constructed by combining data from heterogeneous sources such as relational databases, flat files, etc.
● Time-Variant: Provides information with respect to a particular time period.
● Non-volatile: Data once entered into the warehouse should not change.
However, it does not provide:
1. Automatic discovery of patterns
2. Prediction of likely outcomes
3. Creation of actionable information
Courtesy: https://www.educba.com/data-warehousing-vs-data-mining/
What about Business intelligence? Reporting?
● Summarizes factual/historical data
● Delivers reports, KPIs and trends in a visually pleasing manner
● Allows the organisation to see the big picture
● Helps the organisation make better decisions in support of its mission
● BI systems are designed to look backwards, based on real data from real events.
“What happened and what needs to change?”
● Data Science looks forward, interpreting the information to predict what might happen in the future.
“Why did it happen and how can we change it?”
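The backward-looking versus forward-looking contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the monthly sales figures are made up, the function names (`bi_summary`, `fit_linear_trend`, `forecast_next`) are hypothetical, and a straight-line trend stands in for whatever predictive model a real team would choose.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative data, not from the talk).
monthly_sales = [100.0, 110.0, 120.0, 130.0]


def bi_summary(values):
    """Backward-looking report: describe what has already happened."""
    return {
        "total": sum(values),
        "average": mean(values),
        "last_change": values[-1] - values[-2],
    }


def fit_linear_trend(values):
    """Least-squares line through the points (0, v0), (1, v1), ...

    Returns (slope, intercept).
    """
    xs = range(len(values))
    x_mean = mean(xs)
    y_mean = mean(values)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    slope = numerator / denominator
    return slope, y_mean - slope * x_mean


def forecast_next(values):
    """Forward-looking estimate: extend the fitted trend by one period."""
    slope, intercept = fit_linear_trend(values)
    return slope * len(values) + intercept


report = bi_summary(monthly_sales)         # what happened
prediction = forecast_next(monthly_sales)  # what might happen next
```

The report only restates the past four months, while the forecast extrapolates a fifth. A real forecast would also need validation against held-out data before anyone acts on it.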
STATISTICAL MACHINE LEARNING
DEEP LEARNING
EVIDENCE-BASED REASONING
RECOMMENDATION SYSTEMS
NATURAL LANGUAGE GENERATION
CHAT/CONVERSATIONAL INTERFACES
ROBOTIC PROCESS AUTOMATION
TEXT ANALYSIS
What makes a Data Science Team?
Research
Courtesy: https://www.business-science.io/business/2018/09/18/data-science-team.html
Who are the members?
Data Engineers
Data Scientists
Full Stack Developers
Product Managers
Research
Data Engineer: is it only ETL?
● Industry has shifted from drag-and-drop ETL tools towards a more programmatic approach
● The nature of the data to be processed keeps changing (processing files/batches --> real-time stream data)
Expected Skill Sets:
● Should not be tied to a single set of tools for building data pipelines
● Has to be a good software engineer
● Comfortable working with open source platforms
● Adaptable to constantly evolving open source tools
● Employs a variety of tools and languages to marry systems together
Courtesy: http://podcast.freecodecamp.org/ep-37-the-rise-of-the-data-engineer
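The shift from files/batches to real-time streams can be illustrated with a small Python sketch. It is a toy under stated assumptions: the event values are invented, and the function names `batch_average` and `streaming_average` are hypothetical. A production pipeline would normally rely on a dedicated streaming platform and not a plain generator.

```python
from typing import Iterable, Iterator, Sequence


def batch_average(readings: Sequence[float]) -> float:
    """Batch style: the whole file is available before processing starts."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Stream style: emit an updated average as each event arrives.

    Only a running count and total are kept, so memory use stays constant
    no matter how long the stream runs.
    """
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


# Hypothetical sensor events (illustrative values only).
events = [4.0, 8.0, 6.0]

batch_result = batch_average(events)             # one answer at the end
running_results = list(streaming_average(events))  # one answer per event
```

Both paths reach the same final figure. The difference is that the streaming version has a usable answer after every event, which is the property real-time pipelines are built around.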
Why does a DS team need a Full Stack Developer?
● Development of pilots and MVP applications
○ Productize the data science work so it can serve an internal stakeholder
○ Interactive display of results/stats/insights
● Responsible for bringing a software engineering culture into the data science process
○ Build Infrastructure as Code: automation of the data science team's infrastructure and testing
○ Continuous integration and version control
○ Development of APIs to help integrate data products and sources into applications
○ Building tools for internal use, such as tools for data collection and data labelling
Courtesy: https://towardsdatascience.com/what-is-the-role-of-an-ai-software-engineer-in-a-data-science-team-eec987203ceb
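As a rough illustration of "development of APIs to help integrate data products", the sketch below wraps a stand-in model in a JSON-over-HTTP handler using only the Python standard library. Everything specific here is an assumption made for the example: the `score` rule, the feature names `visits` and `purchases`, and the handler name `ScoreHandler` are hypothetical, and a real team would more likely use a web framework and a trained model.

```python
import json
from http.server import BaseHTTPRequestHandler


def score(features: dict) -> float:
    """Stand-in for a trained model: a hypothetical hand-written rule."""
    return 0.5 * features.get("visits", 0) + 2.0 * features.get("purchases", 0)


def handle_request_body(body: bytes) -> tuple:
    """Turn a JSON request body into (HTTP status, JSON response bytes)."""
    try:
        features = json.loads(body)
        if not isinstance(features, dict):
            raise ValueError("expected a JSON object")
        status, result = 200, {"score": score(features)}
    except (ValueError, TypeError) as exc:
        # Malformed JSON or non-numeric feature values end up here.
        status, result = 400, {"error": str(exc)}
    return status, json.dumps(result).encode("utf-8")


class ScoreHandler(BaseHTTPRequestHandler):
    """Expose the scoring function to other applications over HTTP POST."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_request_body(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, the handler can be served with `http.server.HTTPServer(("localhost", 8000), ScoreHandler).serve_forever()`. Keeping the request logic in a plain function separate from the handler class makes it testable without opening a socket, which fits the software engineering culture the slide describes.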
Data Scientists come in many types
Type A
● Strong understanding of domain knowledge
● Uses ready-made tools instead of developing algorithms
● Has little or no hands-on experience in building software applications
● Insight oriented
● Focuses on better understanding of the business
Type B
● Has basic theoretical knowledge of data science
● Has good hands-on experience in building software applications
● Capable of building an end-to-end prototype or MVP
Type C
● Deep understanding of data science algorithms
● Has great hands-on product development skills
Domain Experts
● Experts both by education and experience in that domain
● Aware of what data is available and able to judge how good it is
● Major contribution to feature engineering and modeling
● Use and apply the deliverables of a data science project in the real world
● Communicate with the intended users of the project’s outcome
● Define the framework for a data science project, because they know
○ What the current challenges are
○ How they must be answered to be practically useful
● Can learn enough data science to make a reasonable model using standardized tools
Courtesy: https://www.linkedin.com/pulse/role-domain-knowledge-data-science-patrick-bangert/
Cutting edge Research
● Seek to understand and develop systems by working on the longer-term academic problems surrounding AI
● Actively engage with the research community through
○ publications
○ participation in technical conferences and workshops
● Have the skills to craft customized data science and machine learning algorithms
● Their focus will be to do research, not solve a business problem
● Data science researchers should not be an early hire
Building a team for Startup
Courtesy: https://thinkgrowth.org/the-startup-founders-guide-to-analytics-1d2176f20ac1
Building a team for an Enterprise
Courtesy: https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/breaking-away-the-secrets-to-scaling-analytics
Who should lead the DS Team
Courtesy: https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
Spotify Case Study
‘Center of Excellence’ Model
Keys to Excellence
Courtesy: https://www.slideshare.net/productschool/the-why-how-of-enterprise-analytics-w-spotify-data-scientist-79046775
Airbnb
● Interview process focused on practical data skills
● ‘Data challenges’ - Airbnb data + real question
● Lightning talks
● Support community
● Multi-stage screening:
○ Recruiter screen
○ Take home data challenge
○ Onsite challenge
○ Trained graders
○ Two graders for each test to ensure consistency
○ 1:1s with hiring manager, business partner, CV
Courtesy: https://www.slideshare.net/Work-Bench/scaling-data-science-at-airbnb
Evolution of Airbnb’s DS Team
Work Structure
● 2012: Centralised (work closely within team)
● 2013: Started working with other team members
● 2014: Embedded with other teams
● 2015: Embedded with other teams
Team size
● 2012: 7
● 2013: 14
● 2014: 28
● 2015: 55
Specialisation
● 2012: All generalists (Data Engineers + Data Scientists + Data Analysts)
● 2013: Hired first Data Engineer
● 2014: Separate team for Data Engineering
● 2015: Data Science Infrastructure team, specialised roles for NLP, CV
Hiring (steps listed in order, 2012 to 2015)
● Take-home data challenge followed by 1:1 interviews with the whole team and founders
● Onsite data challenge
● Created rubrics and grading criteria
● Started hiring interns
● Started focusing on diversity and specialised roles for NLP, CV
Courtesy: https://vimeo.com/148942395
Facebook
● On-boards infra data scientists and engineers through the Bootcamp program
● Provides broad exposure to engineering systems in a supportive learning environment.
● Encourages engineering teams to identify mentors to guide new data scientists as they ramp up in their first projects.
● New data scientists receive mentoring on the ways to communicate the results of their complex analyses.
● Data Scientists are presented with the following options:
○ develop deep domain expertise in an area and spend several years embedded with a team
○ move across partner teams every 12 to 18 months in order to develop a broad understanding
● Provides opportunities to learn and master state-of-the-art skills:
○ Internal training sessions and chalk talks
○ Invite external speakers to cover important developments in the field
○ Closely connected to the academic community
○ Attend and present at major conferences such as INFORMS, KDD, and NIPS
Courtesy: https://code.fb.com/core-data/building-data-science-teams-to-have-an-impact-at-scale
Apple’s Acqui-hiring Strategy
● Apple acqui-hires startups to make its technology smarter and faster
● It buys a whole company to get the team and/or technology
● Hoping to compete with Google’s search service, Apple bought Siri in 2010
● Pandora, Spotify, and Google Music started to predict songs a user will like
● Apple saw this, which likely prompted the company to purchase Beats Music (a streaming music service with a similar algorithm)
● Recently Apple has hired at least 18 people from an enterprise consulting startup called Silicon Valley Data Science, including at least two co-founders, one of whom was its CEO
Where should the focus be
Don’t focus on the technology
Focus on the functionality
The functionality is driven by business needs
The functionality is supported by algorithms & data
The algorithms are instrumental to business
Courtesy: Kristian Hammond, Northwestern University
What you need to ask when considering AI
Data: Do you have the data that support it?
Task: Is your task genuinely data driven?
Scale: Do you need the scale automation provides?
THANK YOU FOR YOUR ATTENTION
DO YOU HAVE ANY QUESTIONS?


Building successful data science teams
