Democratizing Data Science in the Enterprise
- 3. About Me
• Hackerpreneur
• Founder of Tellago
• Founder of KidoZen
• Board member
• Advisor: Microsoft, Oracle
• Angel Investor
• Speaker, Author
http://jrodthoughts.com
https://twitter.com/jrdothoughts
- 4. Agenda
• A brief history of data science
• Democratizing data science in the enterprise
• Building a great data science infrastructure
• Solving the last mile usability challenge
- 5. Key Takeaways
• How to build data science solutions in the real world without breaking the bank
• What technologies can help?
• Myths and realities of data science solutions
- 16. Fred Benenson (@fredbenenson), 12:33 PM - 21 Aug 2013:
“IMHO the majority of data work boils down to 3 things:
1. Counting stuff
2. Figuring out the denominator
3. The reproducibility of 1 & 2”
- 41. Data Science Success Factors in the Enterprise
• Building a great data science infrastructure
• Solving the last mile problem
- 44. Goals & Challenges
I would like to…
• Correlate data from disparate data sources
• Enable a centralized data store for your enterprise
• Incorporate new information sources in an agile way
But…
• Traditional multi-dimensional data warehouses are difficult to modify
• They are designed around a specific set of questions (schema-first)
• It is challenging to incorporate semi-structured and unstructured data
- 45. Centralized Data Aggregation: Best Practices
• Implement an enterprise data lake
• Rely on big data DW platforms such as Apache Hive
• Use a federated architecture efficiently partitioned for different
business units
• Establish SQL as the common query language
• Leverage in-memory computing to optimize query performance
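The practices above can be sketched in a few lines. Here an in-memory SQLite database stands in for a big data SQL engine such as Apache Hive; the table, columns, and figures are illustrative, not from the talk:

```python
# Minimal sketch: a SQL facade over a centralized data store, using an
# in-memory SQLite database as a stand-in for an engine such as Apache
# Hive or an in-memory query layer. Table and values are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 120.0), (2, "acme", 80.0), (3, "globex", 45.5)],
)

# Every business unit queries the shared store with plain SQL.
total_by_customer = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(total_by_customer)  # [('acme', 200.0), ('globex', 45.5)]
```

The point is the common query language, not the engine: the same `SELECT` works whether the backing store is SQLite, Hive, or an in-memory platform.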
- 48. Goals & Challenges
I would like to…
• Organically discover data sources relevant to my job
• Help others discover data more efficiently
• Collaborate with colleagues about specific data sources
But…
• Business users typically don’t have access to the data lake
• There is no corporate data repository
• There is no search and metadata repository
- 49. Data Discovery: Best Practices
• Implement a corporate data catalog
• The data catalog should be the user interface to interact with the
corporate data lake
• Copy ideas from data catalogs on the internet
• Provide rich metadata experience in your data catalog
• Extend your data lake with search capabilities
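A corporate data catalog is, at its core, rich metadata plus search. A minimal sketch, with entirely illustrative datasets, owners, and fields:

```python
# Minimal sketch of a corporate data catalog: each entry carries rich
# metadata, and a keyword search acts as the discovery interface to
# the data lake. All entries and field names are illustrative.
catalog = [
    {"name": "sales_orders", "owner": "finance", "format": "parquet",
     "tags": ["sales", "orders", "revenue"],
     "description": "Daily order extracts from the ERP system"},
    {"name": "web_clickstream", "owner": "marketing", "format": "json",
     "tags": ["web", "events"],
     "description": "Raw clickstream events from the public site"},
]

def search(query):
    """Return catalog entries whose name, tags, or description match."""
    q = query.lower()
    return [e for e in catalog
            if q in e["name"]
            or any(q in t for t in e["tags"])
            or q in e["description"].lower()]

print([e["name"] for e in search("orders")])  # ['sales_orders']
```

Real catalogs back this with a search index and link each entry to its location in the lake, but the metadata-plus-search shape is the same.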
- 52. Goals & Challenges
I would like to…
• Query data from different business systems in a consistent way
• Correlate information from different line-of-business systems
• Reuse queries as new sources of information
But…
• Different business systems use different protocols to query data
• I need to learn a new query language to interact with my big data infrastructure
• Queries over large data sources can be SLOW
- 53. Query Language: Best Practices
• Standardize on SQL as the query language for business data
• Implement a SQL interface for your data lake
• Correlate data sources using simple SQL joins
• Materialize query results in your data lake for future reuse
• Invest in in-memory technologies to optimize performance
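The join-and-materialize pattern above can be shown concretely. SQLite again stands in for the lake's SQL engine; the tables and amounts are illustrative:

```python
# Minimal sketch: correlate two sources with a plain SQL join, then
# materialize the result as a new table for future reuse. SQLite is a
# stand-in for the data lake's SQL engine; all data is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
conn.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "emea"), (2, "apac")])
conn.executemany("INSERT INTO invoices VALUES (?, ?)",
                 [(1, 100.0), (1, 50.0), (2, 75.0)])

# A simple join correlates the sources; CREATE TABLE AS materializes
# the result so later queries can reuse it without re-joining.
conn.execute("""
    CREATE TABLE revenue_by_region AS
    SELECT c.region, SUM(i.amount) AS revenue
    FROM customers c JOIN invoices i ON i.customer_id = c.id
    GROUP BY c.region
""")
rows = conn.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
).fetchall()
print(rows)  # [('apac', 75.0), ('emea', 150.0)]
```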
- 56. Goals & Challenges
I would like to…
• Trust corporate data for my applications
• Actively merge new and historical data
• Integrate new data back into line-of-business systems
But…
• Data in line-of-business systems is poorly curated
• Some data records need to be validated or cleansed
• Some data records need to be enriched with additional data points
- 57. Data Quality: Best Practices
• Implement a data quality process
• Leverage your data catalog as the main user interface to control data
quality
• Trust the wisdom of the crowds to manage data quality
• Provide a great user experience for data quality
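A data quality process boils down to three operations named on the previous slide: validate, cleanse, enrich. A minimal sketch with illustrative rules and field names:

```python
# Minimal sketch of a data quality step: validate, cleanse, and enrich
# records before applications are asked to trust them. The rules and
# field names are illustrative.
def clean(record, region_lookup):
    record = dict(record)
    # Validate: reject records missing a customer id.
    if not record.get("customer_id"):
        return None
    # Cleanse: normalize free-text country codes.
    record["country"] = record.get("country", "").strip().upper() or "UNKNOWN"
    # Enrich: add a region derived from the country.
    record["region"] = region_lookup.get(record["country"], "other")
    return record

regions = {"DE": "emea", "JP": "apac"}
raw = [{"customer_id": 7, "country": " de "},
       {"customer_id": None, "country": "JP"}]
curated = [r for r in (clean(x, regions) for x in raw) if r]
print(curated)  # [{'customer_id': 7, 'country': 'DE', 'region': 'emea'}]
```

In practice the rejected records would be routed to the data catalog UI for human review rather than silently dropped.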
- 60. Goals & Challenges
I would like to…
• Execute efficient queries against my corporate data
• Discover patterns and trends about business data sources
• Rapidly adapt to new data sources added to our business processes
But…
• There is no simple way to understand corporate data sources
• We rely on users to determine which queries to execute
• New data patterns and trends often go undetected
- 61. Understanding your Data: Best Practices
• Leverage machine learning algorithms to understand business data
sources
• Leverage clustering algorithms to detect interesting patterns from
your business data
• Leverage classification algorithms to place data records in well-
defined groups
• Leverage statistical distribution algorithms to reveal interesting
information about your data
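To make the clustering idea concrete, here is a tiny one-dimensional k-means that separates order amounts into "small" and "large" groups. Production systems would use an ML library; this pure-Python version and its data are illustrative:

```python
# Minimal sketch of clustering: one-dimensional k-means grouping order
# amounts into two clusters. The data and starting centers are illustrative.
def kmeans_1d(values, centers, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            i = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[i].append(v)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

amounts = [5, 7, 6, 95, 102, 99]
centers, clusters = kmeans_1d(amounts, centers=[0.0, 50.0])
print([round(c, 1) for c in centers])  # [6.0, 98.7]
```

The two centers that emerge are exactly the kind of "interesting pattern" the slide refers to: the data itself reveals a small-order and a large-order population without anyone writing that query.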
- 64. Goals & Challenges
I would like to…
• Efficiently predict well-known variables in my business data
• Adapt results to future predictions
• Take actions based on the predicted outcomes
But…
• Our analytics are based on after-the-fact reports
• Traditional predictive analytics technologies don’t work well with semi-structured and unstructured data
• Traditional predictive analytics require complex infrastructure
- 65. Predict: Best Practices
• Implement a modern predictive analytics platform
• Leverage the data lake as the main source of information to predictive
analytics algorithms
• Leverage classification and clustering algorithms as the main
mechanisms to train predictions
• Expose predictions to other applications for future reuse
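Training predictions from classification, as above, can be illustrated with a nearest-centroid classifier. The churn scenario, features, and data are assumptions for the sketch, not examples from the talk:

```python
# Minimal sketch of classification-based prediction: a nearest-centroid
# classifier labeling customers "churn" or "stay" from two features.
# Real systems would use an ML library; the data is illustrative.
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(labeled):
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append(features)
    # The "model" is one centroid per label.
    return {label: centroid(pts) for label, pts in by_label.items()}

def predict(model, features):
    # Predict the label whose centroid is closest (squared distance).
    return min(model, key=lambda lbl: sum((a - b) ** 2
                                          for a, b in zip(features, model[lbl])))

training = [((1, 1), "stay"), ((2, 1), "stay"),
            ((8, 9), "churn"), ((9, 8), "churn")]
model = train(training)
print(predict(model, (8, 8)))  # churn
```

Exposing `predict` behind an API is what the last bullet means by making predictions reusable by other applications.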
- 68. Goals & Challenges
I would like to…
• Take actions on my business data without having to read a report
• Model automatic actions based on well-defined data rules
• Evaluate the effectiveness of the rules and adapt
But…
• Data results are mostly communicated via reports and dashboards
• There is no interface to design rules against business data
• Actions are implemented based on human interpretation of data
- 69. Take Actions: Best Practices
• Implement a rules engine on top of your data infrastructure
• Model automatic actions based on well-defined data rules
• Trigger actions directly from data conditions rather than human interpretation of reports
• Measure the effectiveness of each rule and adapt
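Slide 68's goal of modeling automatic actions on well-defined data rules can be sketched as a tiny rules engine. The rules, conditions, and actions below are illustrative:

```python
# Minimal sketch of rule-driven actions: well-defined data rules fire
# actions automatically, instead of a human reading a report and
# deciding. Rule names, conditions, and actions are illustrative.
triggered = []  # stand-in for real side effects (tickets, workflows, alerts)

rules = [
    {"name": "large-order-review",
     "condition": lambda r: r["amount"] > 1000,
     "action": lambda r: triggered.append(("review", r["order_id"]))},
    {"name": "vip-fast-track",
     "condition": lambda r: r["customer_tier"] == "vip",
     "action": lambda r: triggered.append(("fast_track", r["order_id"]))},
]

def evaluate(record):
    """Run every matching rule's action; return the names of rules that fired."""
    fired = []
    for rule in rules:
        if rule["condition"](record):
            rule["action"](record)
            fired.append(rule["name"])
    return fired

print(evaluate({"order_id": 42, "amount": 1500, "customer_tier": "vip"}))
# ['large-order-review', 'vip-fast-track']
```

Logging which rules fire (the `fired` list) is what makes the "evaluate effectiveness and adapt" goal measurable.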
- 72. Goals & Challenges
I would like to…
• Leverage data analyses in new applications
• Help developers embrace the corporate data infrastructure
• Expose data analyses to new channels such as mobile or IoT
But…
• Data results are mostly communicated via reports and dashboards
• Data analysis efforts are typically led by non-developers
• There is no easy way to organically discover and reuse corporate data sources
- 73. Leverage Developers: Best Practices
• Expose data sources and analyses via APIs
• Leverage industry standards to integrate with third-party tools
• Provide data access samples and SDKs for different environments such as mobile and IoT clients
• Incorporate developers’ feedback into your data sources
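Exposing a data source via an API can be as small as an HTTP endpoint returning JSON. A standard-library sketch; the dataset name, path, and payload are assumptions for illustration:

```python
# Minimal sketch of exposing a data source as an API: a tiny HTTP
# endpoint (standard library only) serving a dataset as JSON so web,
# mobile, or IoT clients can consume it. Dataset and payload are
# illustrative.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

DATA = {"sales_orders": [{"order_id": 1, "amount": 120.0}]}

class DataAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        name = self.path.strip("/")
        if name in DATA:
            body = json.dumps(DATA[name]).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), DataAPI)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/sales_orders"
payload = json.loads(urllib.request.urlopen(url).read())
print(payload)  # [{'order_id': 1, 'amount': 120.0}]
server.shutdown()
```

A production version would add authentication and pagination, but the shape — dataset in, JSON over HTTP out — is the whole idea.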
- 76. Goals & Challenges
I would like to…
• Process large volumes of real time data
• Aggregate real time and historical data
• Detect and filter conditions in my real time data before it goes into corporate systems
But…
• There is no infrastructure to query real time data
• We process real time and historical data using the same models
• Large data volumes affect performance
- 77. Real Time Data Processing: Best Practices
• Implement a stream analytics platform
• Model queries over real time data streams
• Add the results of the aggregated queries into the data lake
• Replay data streams to simulate real time conditions
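A query modeled over a stream is typically a windowed aggregate. A minimal sketch, with illustrative sensor readings and window size, whose results would be the rows appended to the data lake:

```python
# Minimal sketch of stream analytics: a sliding-window aggregate over a
# real time event stream. The readings and window size are illustrative.
from collections import deque

def windowed_average(events, window_size):
    """Yield the running average of the last `window_size` readings."""
    window = deque(maxlen=window_size)
    for value in events:
        window.append(value)
        yield sum(window) / len(window)

sensor_stream = [10, 12, 11, 50, 13]   # e.g. temperature readings
aggregates = list(windowed_average(sensor_stream, window_size=3))
print([round(a, 2) for a in aggregates])  # [10.0, 11.0, 11.0, 24.33, 24.67]
```

Replaying a recorded stream through the same generator, as the last bullet suggests, lets you test the query offline under simulated real time conditions.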
- 81. Create a Killer User Experience
• Design matters
• Invest in an easy way for users to interact with corporate data sources
• Leverage modern UX principles that work across channels (mobile, web)
• Make data discoverable
• Leverage metadata
• Facilitate collaboration
- 83. Test Test Test
• Incorporate test models into your data sources
• Simulate real world conditions at the data level
• Assume everything will fail
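"Assume everything will fail" can be exercised directly in tests by simulating failures at the data level. A sketch with an illustrative source and fallback policy:

```python
# Minimal sketch of failure-aware data access: a reader wrapped with a
# fallback, exercised against a source simulated to fail. Source names
# and the fallback policy are illustrative.
def flaky_source():
    raise ConnectionError("simulated outage of the upstream system")

def read_with_fallback(source, fallback):
    try:
        return source()
    except ConnectionError:
        # Real-world condition simulated at the data level: the consumer
        # degrades to the last known-good snapshot instead of crashing.
        return fallback

snapshot = [{"order_id": 1, "amount": 120.0}]
result = read_with_fallback(flaky_source, fallback=snapshot)
print(result)  # [{'order_id': 1, 'amount': 120.0}]
```

Swapping `flaky_source` in for the real reader is the test model: the assertion is that consumers keep working when a source goes down.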
- 85. Integrate with Third Party Tools
• Integrate your data lake with mainstream tools like Tableau or Excel
• Use industry standards so that data sources can be incorporated
- 87. Collaborate
• Integrate data sources with modern messaging and collaboration tools such as Slack and Yammer
• Distribute updates via email, push notifications, and SMS
- 88. Other things to consider
• On-premise, cloud or hybrid?
• Apply agile development practices to your data science
infrastructure
• Infrastructure is cool but usability is more important
- 89. Summary
• Data science is not magic; the “magic” is an illusion
• Implementing data science in the enterprise is about solving two
problems
• Building a great data infrastructure
• Solving the last mile usability challenge
• Today this can be done with commodity technology
• Data scientists are just “people” ;)