Democratizing Data Science in the Enterprise
- 3. About Me
• Hackerpreneur
• Founder of Tellago
• Founder of KidoZen
• Board member
• Advisor: Microsoft, Oracle
• Angel Investor
• Speaker, Author
http://jrodthoughts.com
https://twitter.com/jrdothoughts
- 4. Agenda
• A brief history of data science
• Democratizing data science in the enterprise
• Building a great data science infrastructure
• Solving the last mile usability challenge
- 5. Key Takeaways
• How to build data science solutions in the real world without breaking the bank
• What technologies can help?
• Myths and realities of data science solutions
- 16. Fred Benenson (@fredbenenson), 12:33 PM - 21 Aug 2013:
“IMHO the majority of data work boils down to 3 things:
1. Counting stuff
2. Figuring out the denominator
3. The reproducibility of 1 & 2”
- 41. Data Science Success Factors in the Enterprise
• Building a great data science infrastructure
• Solving the last mile problem
- 44. Goals & Challenges
I would like to…
• Correlate data from disparate data sources
• Enable a centralized data store for your enterprise
• Incorporate new information sources in an agile way
But…
• Traditional multi-dimensional data warehouses are difficult to modify
• They are designed around a specific set of questions (schema-first)
• It is challenging to incorporate semi-structured and unstructured data
- 45. Centralized Data Aggregation: Best Practices
• Implement an enterprise data lake
• Rely on big data DW platforms such as Apache Hive
• Use a federated architecture efficiently partitioned for different
business units
• Establish SQL as the common query language
• Leverage in-memory computing to optimize query performance
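The practices above can be sketched in a few lines. Here an in-memory SQLite database stands in for a big data SQL engine such as Apache Hive; the table, columns, and figures are illustrative, not from the talk:

```python
# Minimal sketch: a SQL facade over a centralized data store, using an
# in-memory SQLite database as a stand-in for an engine such as Apache
# Hive or an in-memory query layer. Table and values are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 120.0), (2, "acme", 80.0), (3, "globex", 45.5)],
)

# Every business unit queries the shared store with plain SQL.
total_by_customer = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(total_by_customer)  # [('acme', 200.0), ('globex', 45.5)]
```

The point is the common query language, not the engine: the same `SELECT` works whether the backing store is SQLite, Hive, or an in-memory platform.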
- 48. Goals & Challenges
I would like to…
• Organically discover data sources relevant to my job
• Help others discover data more efficiently
• Collaborate with colleagues about specific data sources
But…
• Business users typically don’t have access to the data lake
• There is no corporate data repository
• There is no search and metadata repository
- 49. Data Discovery: Best Practices
• Implement a corporate data catalog
• The data catalog should be the user interface to interact with the
corporate data lake
• Copy ideas from data catalogs on the internet
• Provide rich metadata experience in your data catalog
• Extend your data lake with search capabilities
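A corporate data catalog is, at its core, rich metadata plus search. A minimal sketch, with entirely illustrative datasets, owners, and fields:

```python
# Minimal sketch of a corporate data catalog: each entry carries rich
# metadata, and a keyword search acts as the discovery interface to
# the data lake. All entries and field names are illustrative.
catalog = [
    {"name": "sales_orders", "owner": "finance", "format": "parquet",
     "tags": ["sales", "orders", "revenue"],
     "description": "Daily order extracts from the ERP system"},
    {"name": "web_clickstream", "owner": "marketing", "format": "json",
     "tags": ["web", "events"],
     "description": "Raw clickstream events from the public site"},
]

def search(query):
    """Return catalog entries whose name, tags, or description match."""
    q = query.lower()
    return [e for e in catalog
            if q in e["name"]
            or any(q in t for t in e["tags"])
            or q in e["description"].lower()]

print([e["name"] for e in search("orders")])  # ['sales_orders']
```

Real catalogs back this with a search index and link each entry to its location in the lake, but the metadata-plus-search shape is the same.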
- 52. Goals & Challenges
I would like to…
• Query data from different business systems in a consistent way
• Correlate information from different line-of-business systems
• Reuse queries as new sources of information
But…
• Different business systems use different protocols to query data
• I need to learn a new query language to interact with my big data infrastructure
• Queries over large data sources can be SLOW
- 53. Query Language: Best Practices
• Standardize on SQL as the query language for business data
• Implement a SQL interface for your data lake
• Correlate data sources using simple SQL joins
• Materialize query results in your data lake for future reuse
• Invest in in-memory technologies to optimize performance
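The join-and-materialize pattern above can be shown concretely. SQLite again stands in for the lake's SQL engine; the tables and amounts are illustrative:

```python
# Minimal sketch: correlate two sources with a plain SQL join, then
# materialize the result as a new table for future reuse. SQLite is a
# stand-in for the data lake's SQL engine; all data is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
conn.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "emea"), (2, "apac")])
conn.executemany("INSERT INTO invoices VALUES (?, ?)",
                 [(1, 100.0), (1, 50.0), (2, 75.0)])

# A simple join correlates the sources; CREATE TABLE AS materializes
# the result so later queries can reuse it without re-joining.
conn.execute("""
    CREATE TABLE revenue_by_region AS
    SELECT c.region, SUM(i.amount) AS revenue
    FROM customers c JOIN invoices i ON i.customer_id = c.id
    GROUP BY c.region
""")
rows = conn.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
).fetchall()
print(rows)  # [('apac', 75.0), ('emea', 150.0)]
```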
- 56. Goals & Challenges
I would like to…
• Trust corporate data for my applications
• Actively merge new and historical data
• Integrate new data back into line-of-business systems
But…
• Data in line-of-business systems is poorly curated
• Some data records need to be validated or cleansed
• Some data records need to be enriched with additional data points
- 57. Data Quality: Best Practices
• Implement a data quality process
• Leverage your data catalog as the main user interface to control data
quality
• Trust the wisdom of the crowds to manage data quality
• Provide a great user experience for data quality
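A data quality process boils down to three operations named on the previous slide: validate, cleanse, enrich. A minimal sketch with illustrative rules and field names:

```python
# Minimal sketch of a data quality step: validate, cleanse, and enrich
# records before applications are asked to trust them. The rules and
# field names are illustrative.
def clean(record, region_lookup):
    record = dict(record)
    # Validate: reject records missing a customer id.
    if not record.get("customer_id"):
        return None
    # Cleanse: normalize free-text country codes.
    record["country"] = record.get("country", "").strip().upper() or "UNKNOWN"
    # Enrich: add a region derived from the country.
    record["region"] = region_lookup.get(record["country"], "other")
    return record

regions = {"DE": "emea", "JP": "apac"}
raw = [{"customer_id": 7, "country": " de "},
       {"customer_id": None, "country": "JP"}]
curated = [r for r in (clean(x, regions) for x in raw) if r]
print(curated)  # [{'customer_id': 7, 'country': 'DE', 'region': 'emea'}]
```

In practice the rejected records would be routed to the data catalog UI for human review rather than silently dropped.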
- 60. Goals & Challenges
I would like to…
• Execute efficient queries against my corporate data
• Discover patterns and trends about business data sources
• Rapidly adapt to new data sources added to our business processes
But…
• There is no simple way to understand corporate data sources
• We rely on users to determine which queries to execute
• New data patterns and trends often go undetected
- 61. Understanding your Data: Best Practices
• Leverage machine learning algorithms to understand business data
sources
• Leverage clustering algorithms to detect interesting patterns from
your business data
• Leverage classification algorithms to place data records in well-
defined groups
• Leverage statistical distribution algorithms to reveal interesting
information about your data
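To make the clustering idea concrete, here is a tiny one-dimensional k-means that separates order amounts into "small" and "large" groups. Production systems would use an ML library; this pure-Python version and its data are illustrative:

```python
# Minimal sketch of clustering: one-dimensional k-means grouping order
# amounts into two clusters. The data and starting centers are illustrative.
def kmeans_1d(values, centers, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            i = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[i].append(v)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

amounts = [5, 7, 6, 95, 102, 99]
centers, clusters = kmeans_1d(amounts, centers=[0.0, 50.0])
print([round(c, 1) for c in centers])  # [6.0, 98.7]
```

The two centers that emerge are exactly the kind of "interesting pattern" the slide refers to: the data itself reveals a small-order and a large-order population without anyone writing that query.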
- 64. Goals & Challenges
I would like to…
• Efficiently predict well-known variables in my business data
• Adapt results to future predictions
• Take actions based on the predicted outcomes
But…
• Our analytics are based on after-the-fact reports
• Traditional predictive analytics technologies don’t work well with semi-structured and unstructured data
• Traditional predictive analytics require complex infrastructure
- 65. Predict: Best Practices
• Implement a modern predictive analytics platform
• Leverage the data lake as the main source of information to predictive
analytics algorithms
• Leverage classification and clustering algorithms as the main
mechanisms to train predictions
• Expose predictions to other applications for future reuse
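Training predictions from classification, as above, can be illustrated with a nearest-centroid classifier. The churn scenario, features, and data are assumptions for the sketch, not examples from the talk:

```python
# Minimal sketch of classification-based prediction: a nearest-centroid
# classifier labeling customers "churn" or "stay" from two features.
# Real systems would use an ML library; the data is illustrative.
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(labeled):
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append(features)
    # The "model" is one centroid per label.
    return {label: centroid(pts) for label, pts in by_label.items()}

def predict(model, features):
    # Predict the label whose centroid is closest (squared distance).
    return min(model, key=lambda lbl: sum((a - b) ** 2
                                          for a, b in zip(features, model[lbl])))

training = [((1, 1), "stay"), ((2, 1), "stay"),
            ((8, 9), "churn"), ((9, 8), "churn")]
model = train(training)
print(predict(model, (8, 8)))  # churn
```

Exposing `predict` behind an API is what the last bullet means by making predictions reusable by other applications.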
- 68. Goals & Challenges
I would like to…
• Take actions on my business data without having to read a report
• Model automatic actions based on well-defined data rules
• Evaluate the effectiveness of the rules and adapt
But…
• Data results are mostly communicated via reports and dashboards
• There is no interface to design rules against business data
• Actions are implemented based on human interpretation of data
- 69. Take Actions: Best Practices
• Implement a rules engine on top of your data infrastructure
• Model automatic actions based on well-defined data rules
• Trigger actions directly from data conditions rather than human interpretation of reports
• Measure the effectiveness of each rule and adapt
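Slide 68's goal of modeling automatic actions on well-defined data rules can be sketched as a tiny rules engine. The rules, conditions, and actions below are illustrative:

```python
# Minimal sketch of rule-driven actions: well-defined data rules fire
# actions automatically, instead of a human reading a report and
# deciding. Rule names, conditions, and actions are illustrative.
triggered = []  # stand-in for real side effects (tickets, workflows, alerts)

rules = [
    {"name": "large-order-review",
     "condition": lambda r: r["amount"] > 1000,
     "action": lambda r: triggered.append(("review", r["order_id"]))},
    {"name": "vip-fast-track",
     "condition": lambda r: r["customer_tier"] == "vip",
     "action": lambda r: triggered.append(("fast_track", r["order_id"]))},
]

def evaluate(record):
    """Run every matching rule's action; return the names of rules that fired."""
    fired = []
    for rule in rules:
        if rule["condition"](record):
            rule["action"](record)
            fired.append(rule["name"])
    return fired

print(evaluate({"order_id": 42, "amount": 1500, "customer_tier": "vip"}))
# ['large-order-review', 'vip-fast-track']
```

Logging which rules fire (the `fired` list) is what makes the "evaluate effectiveness and adapt" goal measurable.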
- 72. Goals & Challenges
I would like to…
• Leverage data analyses in new applications
• Help developers embrace the corporate data infrastructure
• Expose data analyses to new channels such as mobile or IoT
But…
• Data results are mostly communicated via reports and dashboards
• Data analysis efforts are typically led by non-developers
• There is no easy way to organically discover and reuse corporate data sources
- 73. Leverage Developers: Best Practices
• Expose data sources and analyses via APIs
• Leverage industry standards to integrate with third-party tools
• Provide data access samples and SDKs for different environments such as mobile and IoT clients
• Incorporate developers’ feedback into your data sources
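Exposing a data source via an API can be as small as an HTTP endpoint returning JSON. A standard-library sketch; the dataset name, path, and payload are assumptions for illustration:

```python
# Minimal sketch of exposing a data source as an API: a tiny HTTP
# endpoint (standard library only) serving a dataset as JSON so web,
# mobile, or IoT clients can consume it. Dataset and payload are
# illustrative.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

DATA = {"sales_orders": [{"order_id": 1, "amount": 120.0}]}

class DataAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        name = self.path.strip("/")
        if name in DATA:
            body = json.dumps(DATA[name]).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), DataAPI)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/sales_orders"
payload = json.loads(urllib.request.urlopen(url).read())
print(payload)  # [{'order_id': 1, 'amount': 120.0}]
server.shutdown()
```

A production version would add authentication and pagination, but the shape — dataset in, JSON over HTTP out — is the whole idea.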
- 76. Goals & Challenges
I would like to…
• Process large volumes of real time data
• Aggregate real time and historical data
• Detect and filter conditions in my real time data before it goes into corporate systems
But…
• There is no infrastructure to query real time data
• We process real time and historical data using the same models
• Large data volumes affect performance
- 77. Real Time Data Processing: Best Practices
• Implement a stream analytics platform
• Model queries over real time data streams
• Add the results of the aggregated queries into the data lake
• Replay data streams to simulate real time conditions
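A query modeled over a stream is typically a windowed aggregate. A minimal sketch, with illustrative sensor readings and window size, whose results would be the rows appended to the data lake:

```python
# Minimal sketch of stream analytics: a sliding-window aggregate over a
# real time event stream. The readings and window size are illustrative.
from collections import deque

def windowed_average(events, window_size):
    """Yield the running average of the last `window_size` readings."""
    window = deque(maxlen=window_size)
    for value in events:
        window.append(value)
        yield sum(window) / len(window)

sensor_stream = [10, 12, 11, 50, 13]   # e.g. temperature readings
aggregates = list(windowed_average(sensor_stream, window_size=3))
print([round(a, 2) for a in aggregates])  # [10.0, 11.0, 11.0, 24.33, 24.67]
```

Replaying a recorded stream through the same generator, as the last bullet suggests, lets you test the query offline under simulated real time conditions.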
- 81. Create a Killer User Experience
• Design matters
• Invest in an easy way for users to interact with corporate data sources
• Leverage modern UX principles that work across channels (mobile, web)
• Make data discoverable
• Leverage metadata
• Facilitate collaboration
- 83. Test Test Test
• Incorporate test models into your data sources
• Simulate real world conditions at the data level
• Assume everything will fail
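"Assume everything will fail" can be exercised directly in tests by simulating failures at the data level. A sketch with an illustrative source and fallback policy:

```python
# Minimal sketch of failure-aware data access: a reader wrapped with a
# fallback, exercised against a source simulated to fail. Source names
# and the fallback policy are illustrative.
def flaky_source():
    raise ConnectionError("simulated outage of the upstream system")

def read_with_fallback(source, fallback):
    try:
        return source()
    except ConnectionError:
        # Real-world condition simulated at the data level: the consumer
        # degrades to the last known-good snapshot instead of crashing.
        return fallback

snapshot = [{"order_id": 1, "amount": 120.0}]
result = read_with_fallback(flaky_source, fallback=snapshot)
print(result)  # [{'order_id': 1, 'amount': 120.0}]
```

Swapping `flaky_source` in for the real reader is the test model: the assertion is that consumers keep working when a source goes down.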
- 85. Integrate with Third Party Tools
• Integrate your data lake with mainstream tools like Tableau or Excel
• Use industry standards so that data sources can be incorporated
- 87. Collaborate
• Integrate data sources with modern messaging and collaboration tools such as Slack and Yammer
• Distribute updates via email, push notifications, and SMS
- 88. Other things to consider
• On-premise, cloud or hybrid?
• Apply agile development practices to your data science
infrastructure
• Infrastructure is cool but usability is more important
- 89. Summary
• Data science is not magic; the “magic” is an illusion
• Implementing data science in the enterprise is about solving two
problems
• Building a great data infrastructure
• Solving the last mile usability challenge
• Today this can be done with commodity technology
• Data scientists are just “people” ;)