Democratizing Data Science in
the Enterprise
Better Title: The NO BS Guide to
Getting Insights from your
Business Data
About Me
• Hackerpreneur
• Founder of Tellago
• Founder of KidoZen
• Board member
• Advisor: Microsoft, Oracle
• Angel Investor
• Speaker, Author
http://jrodthoughts.com
https://twitter.com/jrdothoughts
Agenda
• A brief history of data science
• Democratizing data science in the enterprise
• Building a great data science infrastructure
• Solving the last mile usability challenge
Key Takeaways
• How to build data science solutions in the real
world without breaking the bank?
• What technologies can help?
• Myths and realities of data science solutions
Data Science….Still Magic?
It’s not a trick, it’s an illusion.
Any sufficiently
advanced technology is
indistinguishable from
magic.
— Arthur C. Clarke
Democratizing Data Science in the Enterprise
1. Create technology that people who are not experts can use easily and whose output they can trust
2. Make it “sufficiently advanced”
= “data science” — D. Conway, 2010
Basic Research — Maybe someday, someone can use this.
Applied Research — I might be able to use this.
Working Prototype — I can use this (sometimes).
Quality Code — Software engineers can use this.
Tool or Service — People can use this.
The Wizard….The Data Scientist
Fred Benenson (@fredbenenson), 12:33 PM - 21 Aug 2013:
“IMHO the majority of data work boils down to 3 things:
1. Counting stuff
2. Figuring out the denominator
3. The reproducibility of 1 & 2”
They’re hot these days…
“data science”
jobs, jobs, jobs
Where do they come from?
“data science”
ancient history: 2001
“The Future of Data Analysis,” John W. Tukey, 1962
John Tukey introduces “exploratory data analysis” (Tukey 1965, via John Chambers)
TUKEY BEGAT S, WHICH BEGAT R (Tukey 1972)
Jerome H. Friedman: TUKEY BEGAT ESL
TUKEY BEGAT VDQI
TUKEY BEGAT EDA (Tukey 1977)
fast forward -> 2001
Data Science in the Enterprise
Seems like magic…
But it boils down to two factors….
Data Science Success Factors in the Enterprise
• Building a great data science infrastructure
• Solving the last mile problem
Tricks to build a great data science
infrastructure
Trick#1: Centralized Data Aggregation…
Goals & Challenges
I would like to…
• Correlate data from disparate data sources
• Enable a centralized data store for your enterprise
• Incorporate new information sources in an agile way
But…
• Traditional multi-dimensional data warehouses are difficult to modify
• They are designed around a specific set of questions (schema-first)
• It is challenging to incorporate semi-structured and unstructured data
Centralized Data Aggregation: Best Practices
• Implement an enterprise data lake
• Rely on big data DW platforms such as Apache Hive
• Use a federated architecture efficiently partitioned for different
business units
• Establish SQL as the common query language
• Leverage in-memory computing to optimize query performance
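To make the SQL-on-the-lake practice concrete, here is a minimal sketch of issuing standard SQL against a Hive-backed data lake from Python. It assumes the PyHive client; the host, database, and table names are placeholders, not from this deck:

# Hedged sketch: query a Hive data lake with plain SQL via PyHive.
from pyhive import hive

conn = hive.connect(host="datalake.example.com", port=10000, database="sales")
cursor = conn.cursor()
# One SQL dialect for every consumer of the lake.
cursor.execute("""
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
""")
for region, total in cursor.fetchall():
    print(region, total)
cursor.close()
conn.close()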
Centralized Data Aggregation: Technologies &
Vendors
Trick#2: Data Discovery…
Goals & Challenges
I would like to…
• Organically discover data sources relevant to my job
• Help others discover data more efficiently
• Collaborate with colleagues about specific data sources
But…
• Business users typically don’t have access to the data lake
• There is no corporate data repository
• There is no search and metadata repository
Data Discovery: Best Practices
• Implement a corporate data catalog
• The data catalog should be the user interface to interact with the
corporate data lake
• Copy ideas from data catalogs on the internet
• Provide a rich metadata experience in your data catalog
• Extend your data lake with search capabilities
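A toy sketch of the catalog idea: datasets registered with rich metadata and searchable by keyword. All names and fields here are hypothetical:

from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    owner: str
    description: str
    tags: list = field(default_factory=list)

catalog = [
    CatalogEntry("orders", "sales-team", "Raw order events from the web store", ["sales", "daily"]),
    CatalogEntry("customers", "crm-team", "Master customer records", ["crm"]),
]

def search(term):
    # Match against name, description, and tags so users can
    # organically discover sources relevant to their job.
    term = term.lower()
    return [e for e in catalog
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in t for t in e.tags)]

print([e.name for e in search("sales")])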
Data Discovery: Technologies & Vendors
Trick#3: Establish a Common Query
Language…
Goals & Challenges
I would like to…
• Query data from different business systems in a consistent way
• Correlate information from different line-of-business systems
• Reuse queries as new sources of information
But…
• Different business systems use different protocols to query data
• I need to learn a new query language to interact with my big data infrastructure
• Queries over large data sources can be SLOW
Query Language: Best Practices
• Standardize on SQL as the language to query business data
• Implement a SQL interface for your data lake
• Correlate data sources using simple SQL joins
• Materialize query results in your data lake for future reuse
• Invest in in-memory technologies to optimize performance
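A hedged illustration of the join-and-materialize practice, using Python’s built-in SQLite purely as a stand-in for the data lake’s SQL interface:

import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the lake's SQL endpoint
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0), (1, 25.0);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'AMER');
    -- Correlate sources with a simple join, then materialize the
    -- result so future queries can reuse it.
    CREATE TABLE revenue_by_region AS
    SELECT c.region, SUM(o.amount) AS revenue
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region;
""")
print(conn.execute("SELECT * FROM revenue_by_region").fetchall())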
Query Language: Technologies & Vendors
Trick#4: Focus on Data Quality…
Goals & Challenges
I would like to…
• Trust corporate data for my applications
• Actively merge new and historical data
• Integrate new data back into line-of-business systems
But…
• Data in line-of-business systems is poorly curated
• Some data records need to be validated or cleansed
• Some data records need to be enriched with additional data points
Data Quality: Best Practices
• Implement a data quality process
• Leverage your data catalog as the main user interface to control data
quality
• Trust the wisdom of the crowds to manage data quality
• Provide a great user experience for data quality
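A minimal sketch of a validate/cleanse/enrich pass over raw records; the field names and rules are illustrative, not prescriptive:

RAW = [
    {"email": " ALICE@EXAMPLE.COM ", "country": "us"},
    {"email": "not-an-email", "country": "de"},
]

def is_valid(rec):
    # Validation: drop records that cannot be trusted.
    return "@" in rec["email"].strip()

def cleanse(rec):
    rec["email"] = rec["email"].strip().lower()
    return rec

def enrich(rec):
    # Enrichment: add a data point derived from existing fields.
    rec["region"] = {"us": "AMER", "de": "EMEA"}.get(rec["country"], "UNKNOWN")
    return rec

curated = [enrich(cleanse(r)) for r in RAW if is_valid(r)]
print(curated)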
Data Quality: Technologies & Vendors
Trick#5: Understand your data….
Goals & Challenges
I would like to…
• Execute efficient queries against my corporate data
• Discover patterns and trends about business data sources
• Rapidly adapt to new data sources added to our business processes
But…
• There is no simple way to understand corporate data sources
• We rely on users to determine which queries to execute
• New data patterns and trends often go undetected
Understanding your Data : Best Practices
• Leverage machine learning algorithms to understand business data
sources
• Leverage clustering algorithms to detect interesting patterns from
your business data
• Leverage classification algorithms to place data records in well-
defined groups
• Leverage statistical distribution algorithms to reveal interesting
information about your data
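For example, a few lines of scikit-learn (one representative library, not the only option) can cluster records into segments nobody thought to query for; the data here is invented:

import numpy as np
from sklearn.cluster import KMeans

# Toy business data: (order_value, items_per_order) per customer.
X = np.array([[20, 1], [22, 2], [250, 8], [240, 7], [25, 1], [260, 9]])

# Clustering surfaces groups without a user deciding the query upfront.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # which cluster each record falls into
print(model.cluster_centers_)  # a profile of each discovered segment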
Understanding your Data : Technologies &
Vendors
Trick#6: Predict…
Goals & Challenges
I would like to…
• Efficiently predict well-known variables in my business data
• Adapt results to future predictions
• Take actions based on the predicted outcomes
But…
• Our analytics are based on after-the-fact reports
• Traditional predictive analytics technologies don’t work well with semi-structured and unstructured data
• Traditional predictive analytics require complex infrastructure
Predict : Best Practices
• Implement a modern predictive analytics platform
• Leverage the data lake as the main source of information for predictive analytics algorithms
• Leverage classification and clustering algorithms as the main
mechanisms to train predictions
• Expose predictions to other applications for future reuse
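A hedged sketch of the train-then-expose loop, again using scikit-learn as a stand-in; the features, labels, and churn framing are invented for illustration:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Historical records from the data lake: features plus a known outcome
# (e.g., whether the customer churned). Entirely illustrative data.
X = np.array([[1, 200], [5, 20], [2, 180], [6, 15], [1, 220], [7, 10]])
y = np.array([0, 1, 0, 1, 0, 1])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def predict_churn(features):
    # Expose the prediction so other applications can reuse it.
    return float(model.predict_proba([features])[0][1])

print(predict_churn([4, 30]))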
Predict : Technologies & Vendors
Trick#7: Take Actions…
Goals & Challenges
I would like to…
• Not have to read a report to take actions on my business data
• Model automatic actions based on well-defined data rules
• Evaluate the effectiveness of the rules and adapt
But…
• Data results are mostly communicated via reports and dashboards
• There is no interface to design rules against business data
• Actions are implemented based on human interpretation of data
Take Actions : Best Practices
• Implement a rules engine on top of your data infrastructure
• Provide an interface to design rules against business data
• Trigger actions automatically from well-defined data rules instead of human interpretation of reports
• Evaluate the effectiveness of each rule and adapt it over time
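A toy sketch of the rules idea: declarative conditions paired with automatic actions, plus a hit counter so rule effectiveness can be evaluated. Everything here is illustrative:

# Rules as (name, condition, action) over incoming records.
RULES = [
    ("high_value_order", lambda r: r["amount"] > 1000,
     lambda r: print(f"notify account manager: {r}")),
    ("negative_stock", lambda r: r["stock"] < 0,
     lambda r: print(f"open incident: {r}")),
]
hits = {name: 0 for name, _, _ in RULES}

def evaluate(record):
    for name, condition, action in RULES:
        if condition(record):
            hits[name] += 1
            action(record)  # act automatically; no report to read

evaluate({"amount": 1500, "stock": 10})
evaluate({"amount": 100, "stock": -2})
print(hits)  # feed hit rates back into rule tuning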
Take Actions: Technologies & Vendors
Trick#8: Embrace developers…
Goals & Challenges
I would like to…
• Leverage data analyses in new applications
• Help developers embrace the corporate data infrastructure
• Expose data analyses to new mediums such as mobile or IoT
But…
• Data results are mostly communicated via reports and dashboards
• Data analysis efforts are typically led by non-developers
• There is no easy way to organically discover and reuse corporate data sources
Leverage Developers: Best Practices
• Expose data sources and analyses via APIs
• Leverage industry standards to integrate with third-party tools
• Provide data access samples and SDKs for different environments such as mobile and IoT clients
• Incorporate developers’ feedback into your data sources
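A minimal sketch of the API practice, assuming Flask as the web framework; the route and payload are hypothetical:

from flask import Flask, jsonify

app = Flask(__name__)

# Illustrative analysis result; in practice this would come from the lake.
REVENUE_BY_REGION = {"EMEA": 125.0, "AMER": 50.0}

@app.route("/api/v1/revenue-by-region")
def revenue_by_region():
    # JSON over HTTP: consumable from web, mobile, and IoT clients alike.
    return jsonify(REVENUE_BY_REGION)

if __name__ == "__main__":
    app.run(port=8080)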
Leverage Developers: Technologies & Vendors
Trick#9: Real-time data is different…
Goals & Challenges
I would like to…
• Process large volumes of real-time data
• Aggregate real-time and historical data
• Detect and filter conditions in my real-time data before it goes into corporate systems
But…
• There is no infrastructure to query real-time data
• We process real-time and historical data using the same models
• Large data volumes affect performance
Real Time Data Processing: Best Practices
• Implement a stream analytics platform
• Model queries over real-time data streams
• Add the results of the aggregated queries into the data lake
• Replay data streams to simulate real-time conditions
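A hedged sketch of those practices in plain Python: a replayed stream, a filter, and a tumbling-window aggregation whose results (not the raw events) would land in the data lake:

import time
from collections import defaultdict

def replayed_stream():
    # Replaying recorded events simulates real-time conditions for testing.
    for amount in [10, 20, 5, 40, 15]:
        yield {"ts": time.time(), "amount": amount}
        time.sleep(0.1)

WINDOW_SECONDS = 0.25
windows = defaultdict(float)

for event in replayed_stream():
    # Detect and filter conditions before data reaches corporate systems.
    if event["amount"] < 0:
        continue
    bucket = int(event["ts"] / WINDOW_SECONDS)
    windows[bucket] += event["amount"]

print(dict(windows))  # aggregated windows, ready for the data lake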
Real Time Data Processing: Technologies &
Vendors
Solving the last mile problem
Trick#1: Killer user experience…
Create a Killer User Experience
• Design matters
• Invest in an easy way for users to interact with corporate data sources
• Leverage modern UX principles that work across channels (mobile, web)
• Make data discoverable
• Leverage metadata
• Facilitate collaboration
Trick#2: Test test test…
Test Test Test
• Incorporate test models into your data sources
• Simulate real world conditions at the data level
• Assume everything will fail
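A small sketch of what tests at the data level can look like; the checks and thresholds are illustrative:

def test_no_nulls(records, field):
    assert all(r.get(field) is not None for r in records), f"null {field}"

def test_row_count(records, minimum):
    # Assume everything will fail: an empty extract should break the build.
    assert len(records) >= minimum, f"expected >= {minimum} rows"

sample = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]
test_no_nulls(sample, "email")
test_row_count(sample, 1)
print("data checks passed")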
Trick#3: Integrate with existing tools…
Integrate with Third Party Tools
• Integrate your data lake with mainstream tools like Tableau or Excel
• Use industry standards so that your data sources can be consumed by those tools
Trick#4: Collaborate…
Collaborate
• Integrate data sources with modern messaging and collaboration tools such as Slack or Yammer
• Distribute updates via email, push notifications, and SMS
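For example, a data update can be pushed to a Slack channel via an incoming webhook (the URL below is a placeholder; Slack issues one per channel):

import requests

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def post_update(message):
    # Push data updates into the conversation instead of a report.
    requests.post(WEBHOOK_URL, json={"text": message}, timeout=5)

post_update("Daily revenue by region is ready: EMEA 125.0, AMER 50.0")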
Other things to consider
• On-premise, cloud or hybrid?
• Apply agile development practices to your data science
infrastructure
• Infrastructure is cool but usability is more important
Summary
• Data science is not magic; it’s an illusion
• Implementing data science in the enterprise is about solving two
problems
• Building a great data infrastructure
• Solving the last mile usability challenge
• Today this can be done with commodity technology
• Data scientists are just “people” ;)
THANKS
Jesus Rodriguez
https://twitter.com/jrdothoughts
http://jrodthoughts.com/