This document discusses how data science models have transitioned to the cloud to take advantage of greater computing resources. It notes that data science models are resource-intensive and traditionally required powerful local machines. The cloud allows data scientists to run models on cloud infrastructure for lower costs than high-end laptops and with access to many GPUs. Several major cloud platforms - Azure, AWS, and Google Cloud - are discussed and compared in terms of their machine learning offerings. The document also introduces Microsoft's Team Data Science Process, which aims to help data science teams collaborate more effectively on projects in the cloud.
This session was recorded in San Francisco on February 5th, 2019 and can be viewed here: https://youtu.be/CgoxjmdyMiU This session will discuss how to get up and running quickly with containerized H2O environments (H2O Flow, Sparkling Water, and Driverless AI) at scale, in a multi-tenant architecture with a shared pool of resources using CPUs and/or GPUs. See how you can spin up (and tear down) your H2O environments on-demand, with just a few mouse clicks. Find out how to enable quota management of GPU resources for greater efficiency, and easily connect your compute to your datasets for large-scale distributed machine learning. Learn how to operationalize your machine learning pipelines and deliver faster time-to-value for your AI initiative, while ensuring enterprise-grade security and high performance. Bio: Nanda Vijaydev is senior director of solutions at BlueData (now HPE), where she leverages technologies like Hadoop, Spark, and TensorFlow to build solutions for enterprise analytics and machine learning use cases. Nanda has 10 years of experience in data management and data science. Previously, she worked on data science and big data projects in multiple industries, including healthcare and media; was a principal solutions architect at Silicon Valley Data Science; and served as director of solutions engineering at Karmasphere. Nanda has an in-depth understanding of the data analytics and data management space, particularly in the areas of data integration, ETL, warehousing, reporting, and machine learning.
The document discusses machine learning concepts and approaches for practical implementation in enterprises. It defines key terms like business analytics, predictive analytics, and machine learning. Business analytics answers questions about past data through queries, while predictive analytics uses algorithms to predict future probabilities and outcomes. The document also outlines challenges to enterprise adoption of machine learning and how vendors are helping to address skills gaps through cloud-based tools and services.
2018 Women in Analytics Conference https://www.womeninanalytics.org/ Over the last year I’ve become obsessed with learning how to be a better "cloud computing evangelist to data scientists" - specifically to the R community. I’ve learned that this isn’t often an easy undertaking. Most people (data scientists or not) are skeptical of changing up the tools and workflows they’ve come to rely on when those systems seem to be working. Resistance to change increases even further with barriers to quick adoption, such as having to teach yourself a completely new technology or framework. I’d like to give a talk about how working in the cloud changes data science and how exploring these tools can lead to a world of new possibilities within the intersection of DevOps and Data Analytics. Topics to discuss: - Working through functionality/engineering challenges with R in a cloud environment - Opportunities to customize and craft your ideal version of R/RStudio - Making and embracing a decision on what is “real" about your analysis or daily work (Chapter 6 in R for Data Science) - Running multiple R instances in the cloud (why would you want to do this?) - Becoming an R/Data Science Collaboration wizard: Building APIs with Plumber in the Cloud
Big Data analytics is well known to uncover hidden insights that give an organization an edge over the competition. But data does not need to be big in order to be useful. Smaller companies and startups may lack the volume of data that qualifies as big data, yet the variety of data can still yield a trove of insights that helps drive a company's business strategies. Startups may also lack the resources to fund an additional, seemingly expensive development project. The key is simplicity: start small and simple, and architect for scalability and performance. But how do you start? In this presentation, we share our experience building a cost-effective, AWS serverless data analytics platform that became an invaluable tool for sales, marketing, and operational efficiencies. Serverless architectures simplify development work: servers and software are managed by a third-party cloud provider. Developers can focus on just building the data wrangling and data analysis logic, while critical aspects like scalability and high availability are guaranteed by the cloud provider. In addition, serverless services offer a pay-as-you-go model, where you pay only for the resources you use. This is another attractive aspect, as costs can be managed based on usage. In this presentation we will focus on techniques and best practices to build a big data analytics platform using AWS serverless services like Lambda, DynamoDB, S3, Kinesis, Athena, QuickSight and Amazon ML. We will highlight the strengths of each of these services and what role each plays in the data analytics pipeline. We compare and contrast these services with some of the other popularly used big data technologies like Hadoop, Spark and Kafka. We also demonstrate the usage of these services to build intelligent components that detect anomalies, yield recommendations, simulate chat bots and generate predictive analytics.
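To make the serverless pattern concrete, here is a minimal sketch of an anomaly-detecting component of the kind described above. The handler signature mirrors an AWS Lambda entry point consuming a Kinesis-style batch of records, but the detection logic (a z-score over a sliding window) is plain Python and illustrative; the window size, threshold, and field names are assumptions, not the presenters' actual implementation.

```python
from collections import deque
from statistics import mean, stdev

# Sliding window of recent metric values; 20 is an illustrative size.
WINDOW = deque(maxlen=20)

def is_anomaly(value, threshold=3.0):
    """Flag a value whose z-score against the sliding window exceeds threshold."""
    if len(WINDOW) >= 5:
        mu, sigma = mean(WINDOW), stdev(WINDOW)
        anomalous = sigma > 0 and abs(value - mu) / sigma > threshold
    else:
        anomalous = False  # not enough history to judge yet
    WINDOW.append(value)
    return anomalous

def handler(event, context=None):
    """Lambda-style entry point: score each record in a Kinesis-like batch."""
    return [{"value": r["value"], "anomaly": is_anomaly(r["value"])}
            for r in event.get("records", [])]
```

In a real deployment the window state would live in DynamoDB rather than module memory, since Lambda invocations are stateless; keeping the scoring logic this small is what makes the pay-as-you-go model economical.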
Ayush Gaur has extensive experience and skills in big data analytics, cloud computing, and data science. He holds an M.S. in Computer Science with a concentration in data science from UT Dallas and a B.E. in Computer Science from Chitkara University in India. He has professional experience as an instructor for big data and analytics and as a senior associate focusing on big data, analytics, and cloud computing at Infosys. He has strong technical skills in Apache Spark, Hadoop, Python, and cloud platforms like AWS.
This document discusses democratizing AI using Apache Spark on Databricks. It first discusses how AI is changing the world through advances like AlphaGo but that AI hasn't been fully democratized. It then discusses how Databricks uses Apache Spark to close gaps in managing big data infrastructure, establishing production-ready applications, and empowering teams. Specifically, Databricks provides an integrated workspace, just-in-time data platform, and automated Spark management to accelerate developing and deploying AI applications. The document concludes by discussing how Databricks enables faster and easier deep learning through features like TensorFlow, TensorFrames, GPU support, and a full stack for data ingestion, model training, and productionization.
How can you bring machine learning to the masses? Machine learning projects struggle with finding talent, the time it takes to build and deploy models, and trusting the models that get built. How can multiple teams across your organization create accurate ML models without being experts in data science or machine learning? Curious about the different flavors of AutoML? H2O Driverless AI packages the techniques of expert data scientists into an easy-to-use application that helps scale your data science efforts. Driverless AI lets data scientists work on projects faster, using automation and the state-of-the-art computing power of GPUs to accomplish in minutes tasks that used to take months. With H2O Driverless AI, everyone, including expert and junior data scientists, domain scientists, and data engineers, can develop trusted machine learning models. This next-generation machine learning platform delivers unique, advanced functionality for data visualization, feature engineering, model interpretability, and low-latency deployment. H2O Driverless AI provides: * Automatic data visualization * Grandmaster-level automatic feature engineering * Automatic model selection * Automatic model tuning and training * Automatic parallelization across multiple CPUs or GPUs * Automatic model ensembling * Automatic machine learning interpretability (MLI) * Automatic scoring-code generation. Want to try it yourself? You can get a free trial here: H2O Driverless AI trial. Come to this session to find out how to get started with automatic machine learning using H2O Driverless AI, and build powerful models with just a few clicks. See you soon!
About H2O.ai: H2O.ai is a visionary Silicon Valley open source software company that created and reimagined what is possible. We are a company of makers who brought new platforms and technologies to market to drive the artificial intelligence movement. We are the creators of H2O, the leading open source data science and machine learning platform, used by nearly half of the Fortune 500 and trusted by more than 14,000 organizations and hundreds of thousands of data scientists around the world.
The document discusses building real-time targeting capabilities at Capital One. It introduces two speakers, Ryan Zotti and Subbu Thiruppathy, and describes challenges around striving for speed in everything. It then covers how to achieve fast model data, training, deployment, and scoring through techniques like using the most up-to-date data, distributed computing in the cloud, automatic model refitting, and response times under 100 milliseconds.
The rapid expansion of mobile phone usage in low-income and middle-income countries has created unprecedented opportunities for applying AI to improve individual and population health. At benshi.ai, a non-profit funded by the Bill and Melinda Gates Foundation, the goal is to transform health outcomes in resource-poor countries through advanced AI applications. We aim to do so by providing personalized predictions and recommendations to support diagnosis for medical care teams and frontline workers, as well as to nudge patients through personalized incentives towards improved disease treatment management and general wellness. To this end, we have built an operational machine learning platform that provides personalized content and interventions in real time. Multiple engineering and machine learning decisions have been made to overcome different challenges and to build an experimentation engine and a centralized data and model management system for global health. Databricks served as a cornerstone upon which all our data/ML services were built. In particular, MLflow and dbx (an open-source tool from Databricks) have been crucial for the training, tracking and management of our end-to-end model pipelines. From the data science perspective, our challenges involved causal inference analysis, behavioral time series forecasting, micro-randomized trials, and contextual bandits-based experimentation at the individual level. This talk will focus on how we overcame the technical challenges to build a state-of-the-art machine learning platform that serves to improve global health outcomes.
A plethora of data processing tools, most of them open source, is available to us. But who actually runs data pipelines? What about dynamically allocating resources to data pipeline components? In this talk we will discuss options to operate elastic data pipelines with modern, cloud native platforms such as DC/OS with Apache Mesos, Kubernetes and Docker Swarm. We will review good practices, from containerizing workloads to making things resilient and show elastic data pipelines in action.
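The "dynamically allocating resources" idea above boils down to a scaling rule of the kind orchestrators like Kubernetes or DC/OS apply to containerized pipeline stages. Here is a minimal sketch of such a rule in Python; the per-worker capacity and min/max bounds are illustrative assumptions, not values from the talk.

```python
import math

def desired_workers(queue_depth, per_worker=10, min_workers=1, max_workers=8):
    """Size an elastic worker pool to the pending work, clamped to a
    min/max range -- the same shape of rule an autoscaler (e.g. the
    Kubernetes Horizontal Pod Autoscaler) evaluates on each tick."""
    needed = math.ceil(queue_depth / per_worker)
    return max(min_workers, min(max_workers, needed))
```

The clamping is what makes the pipeline both elastic and resilient: the pool never drops to zero (so latency-sensitive stages stay warm) and never exceeds the resource quota the platform has reserved for it.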
Machine learning projects often fail to make it from development to production. Looking at the full machine learning lifecycle is essential for success. The lifecycle includes development, deployment, infrastructure, monitoring, automation, standardization, lineage and reproducibility. A machine learning operations (MLOps) platform can provide an end-to-end system view for increased efficiency, collaboration, and trust across the lifecycle. The key takeaway is to focus on what is important, avoiding both extremes: doing nothing, which fails to scale, and doing everything, which stifles progress.
Successfully deploying a working machine learning prototype to a production application is a challenging task, fraught with difficulties not experienced in traditional software deployments. In this talk, you will learn techniques to successfully deploy ML applications in a scalable, maintainable, and automated way.
Sharon Dashet (Sr. Data Analytics Solution Lead) @ Google Cloud: The worlds of traditional RDBMS and Data Lake Hadoop systems are converging and moving to public cloud and SaaS offerings. In this session, Sharon will share her personal journey as a data professional since the 90s, woven into the history of data management systems. The session will also cover the differences between on-premises and cloud Data Lakes.
This document discusses Pivotal's real-time business platform for maximizing the value of data investments. It recommends identifying business problems with high ROI potential, then focusing data solutions on high-speed ingestion, consolidation, real-time queries, and analytics to drive real-time insights. The platform combines GemFire for fast transactions with Greenplum for analytics. Use cases discussed include predictive maintenance, fraud detection, and recommendation engines. The platform provides a complete solution from data capture and analytics to application integration.
This document summarizes the challenges faced by SocGen, a large French bank, in implementing machine learning at scale using Spark and MLflow. Some key challenges included: 1) Keeping data and models local for regulatory reasons while performing training and prediction, 2) Ensuring reliability when moving models between prototyping and production phases, 3) Managing different Python package dependencies, 4) Tracking and managing many models, and 5) Ensuring high availability of the tracking server. The presentation provided a concrete example of using Spark, MLflow, and Kafka to periodically retrain a model for scoring news articles and handling user feedback in a scalable and reliable way.
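The "tracking and managing many models" challenge above is what an experiment-tracking server addresses: every retraining cycle is recorded as a run with its parameters and metrics, so models can be compared, promoted, or rolled back. The sketch below is a dependency-free stand-in that mirrors the shape of MLflow's tracking API (`mlflow.start_run()`, `mlflow.log_param()`, `mlflow.log_metric()`); the class and method names are illustrative, not SocGen's code.

```python
import time

class RunTracker:
    """Minimal stand-in for an MLflow-style tracking client."""

    def __init__(self):
        self.runs = []

    def start_run(self, **params):
        """Record a new training run with its hyperparameters."""
        run = {"params": params, "metrics": {}, "timestamp": time.time()}
        self.runs.append(run)
        return run

    def log_metric(self, run, name, value):
        """Attach an evaluation metric to a run."""
        run["metrics"][name] = value

    def best_run(self, metric):
        """Pick the run with the highest value of the given metric,
        e.g. to decide which periodically retrained model to promote."""
        return max(self.runs,
                   key=lambda r: r["metrics"].get(metric, float("-inf")))
```

In the periodic-retraining setup described in the talk, each scheduled Spark job would open a run, log the training parameters and validation score, and a promotion step would query the tracker for the best run before serving it; MLflow adds persistence, a UI, and the high-availability concerns the presentation discusses.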
March 15, 2017, Excel and Power BI Group. Topic: Predictive analytics in Excel. Speaker: Robert Luong. Next, we will welcome Robert Luong, who will talk to us about predictive analytics with Azure ML and its integration with Excel. Azure ML is a cloud-based predictive analytics service that lets you quickly build and deploy predictive models as analytics solutions. Azure ML not only provides tools to model predictive analytics, but also a fully managed service that you can use to deploy your predictive models as web services.
Azure Machine Learning brings enterprise-class machine learning and data mining to the cloud. The presenter will cover 1) what AzureML is, 2) a technical overview of AzureML for application development, 3) a reminder to consider SQL Server Data Mining, and 4) a recommended path for resources and next steps.
Learn how you can leverage the elastic, on-demand processing power of Microsoft Azure to create faster, more applicable analytics by viewing this informative webinar. Data Scientist and Author, Ahmed Sherif, demonstrates key analytic use cases that can be spun up quickly with minimal effort and maximum return on investment. To watch the full recording of this webinar, visit http://ccgbi.com/resources/webinars/driving-customer-loyalty-with-AML