The Polyglot
Data Scientist
Adventures with R, Python, and SQL
Audience Survey
• How many here have used:
– SQL?
– Python?
– R?
• What job titles do people have?
What We Won’t Cover
• Theories behind data science and machine learning
• Deep dive into Python
• Deep dive into R
• Deep dive into SQL Server
Azure Support
There is a data science VM available on Azure. It won’t be covered in this presentation.
See https://docs.microsoft.com/en-us/sql/advanced-analytics/getting-started-with-machine-learning-services for details.
What We Will Cover
• The Problem with Being a Polyglot
• What SQL Server + R or SQL Server + Python Solves
• A Glance at these in Action
Not a Microsoft sales person…
• Microsoft MVP in Visual Studio
• Been into exploring data most of my life
• Been in tech over 20 years
• Practitioner and hobbyist, not researcher
Sample Problem: Sensor Data
• Domain: House of Sadukie
• Problem: Temperature data is
stored miserably
• Goal: Display data in a
visualization that makes sense
Current Outcome – via MySQL & R
Polyglot
Knowing or using several languages
SQL Server
Data Scientist
A person employed to analyze and
interpret complex digital data, such as
the usage statistics of a website,
especially in order to assist a business
in its decision-making
Multi-Faceted Data Science
• Various categories:
– Statistics – modeling, sampling, clustering, reduction
– Mathematics – NSA, astronomers, military
– Data engineering – database/memory/file optimization, Hadoop, data flows
– Machine learning and algorithms
– Business – ROI optimization, decision sciences
– Software engineering – primarily polyglots in production code
– Visualization
– Spatial
Source: https://www.datasciencecentral.com/profiles/blogs/six-categories-of-data-scientists
The Problem with Being a Polyglot
• Understanding strengths and weaknesses of the languages
• Knowing which language is appropriate for what situation
Multiple tools… multiple solutions… how many programs do I have to use?!?
And wouldn’t it be awesome if I could use one tool to do most of the work?
What R and Python Have to Offer
for SQL
• Libraries specialized to handle data science domain problems
including:
– Visualization
– Data exploration
– Statistical and Mathematical Analysis
– Trending
– Regression
• Libraries + Data right from the source = quicker exploratory analysis
• Python and R work well when you start from one large table and branch out in different directions
– Which can inspire additional analyses
Sample Problem: Sensor Data
• Number of rows: 400k+
• 1 Table
• Questions to look into:
– What are temperature trends over
time?
– When are sensors going offline?
– What temperatures look spot on?
– What sensors are wavering in reads
and showing inconsistencies?
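Two of the questions above (offline sensors and wavering reads) can be sketched in plain Python. This is only an illustration: the readings, the 30-minute gap threshold, and the 5-degree spread threshold are made-up assumptions, not values from the real sensor table.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import pstdev

# Hypothetical readings: (sensor_id, timestamp, temperature)
readings = [
    ("attic", datetime(2018, 1, 1, 0, 0), 61.0),
    ("attic", datetime(2018, 1, 1, 0, 10), 61.2),
    ("attic", datetime(2018, 1, 1, 1, 30), 60.9),  # long silence before this read
    ("porch", datetime(2018, 1, 1, 0, 0), 30.0),
    ("porch", datetime(2018, 1, 1, 0, 10), 45.0),
    ("porch", datetime(2018, 1, 1, 0, 20), 28.0),
]

def group_by_sensor(rows):
    """Collect (timestamp, temperature) pairs per sensor, sorted by time."""
    grouped = defaultdict(list)
    for sensor_id, ts, temp in rows:
        grouped[sensor_id].append((ts, temp))
    for series in grouped.values():
        series.sort()
    return grouped

def find_gaps(series, max_gap=timedelta(minutes=30)):
    """Return (last_seen, next_seen) pairs where a sensor went quiet too long."""
    return [(a[0], b[0]) for a, b in zip(series, series[1:]) if b[0] - a[0] > max_gap]

def wavering_sensors(grouped, max_stdev=5.0):
    """Return sensors whose temperature spread exceeds the assumed threshold."""
    return sorted(
        sensor for sensor, series in grouped.items()
        if pstdev([temp for _, temp in series]) > max_stdev
    )

grouped = group_by_sensor(readings)
print(find_gaps(grouped["attic"]))
print(wavering_sensors(grouped))
```

A fixed threshold is the simplest possible rule; a real analysis would likely tune it per sensor.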
Bringing the Computation
to the Data
Advanced Analytics
in
SQL Server 2016/2017
• SQL Server 2016
– SQL Server R Services
• SQL Server 2017
– SQL Server Machine Learning Services (formerly R Services)
– Python support
Sample Problem: Sensor Data
• Possible Strategy:
– Use SQL to gather the data into the dataset that has the most data to observe.
– Use Python or R to manipulate the results, allowing for easy analysis and substantial predictions based on observations.
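The two-step strategy above can be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite database as a stand-in for SQL Server; the column names (sensor_id, read_at, temperature) and the sample rows are hypothetical.

```python
import sqlite3
from statistics import mean

# Stand-in database: SQLite in memory instead of SQL Server (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sensors (sensor_id TEXT, read_at TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO Sensors VALUES (?, ?, ?)",
    [
        ("attic", "2018-01-01 00:00", 60.0),
        ("attic", "2018-01-01 12:00", 64.0),
        ("attic", "2018-01-02 00:00", 58.0),
        ("porch", "2018-01-01 00:00", 30.0),
        ("porch", "2018-01-02 00:00", None),  # a failed read
    ],
)

# Step 1 (SQL): gather every usable row into one dataset
rows = conn.execute(
    "SELECT sensor_id, substr(read_at, 1, 10) AS day, temperature "
    "FROM Sensors WHERE temperature IS NOT NULL "
    "ORDER BY sensor_id, day"
).fetchall()

# Step 2 (Python): reshape the result and compute a daily mean per sensor
daily = {}
for sensor_id, day, temp in rows:
    daily.setdefault((sensor_id, day), []).append(temp)
daily_mean = {key: mean(temps) for key, temps in daily.items()}

for (sensor_id, day), value in sorted(daily_mean.items()):
    print(sensor_id, day, value)
```

The split keeps filtering in SQL, where the data lives, and leaves the reshaping to Python.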
Not Just Windows!
R Server for Windows
R Server for Linux
- CentOS
- RHEL
- Ubuntu
- SUSE
R Server for Hadoop – cluster in the cloud
R Server for Teradata – not as Machine Learning
Server
SQL Server as our Base
R and/or Python on Top
Additional pieces provided by Microsoft ML:
Microsoft Machine Learning Services, RevoScaleR, RevoScalePy
Microsoft
Machine Learning
Services
Machine Learning Services in SQL
Server
• Allows integration of other languages in SQL Server
– SQL Server 2016 can work with R
– SQL Server 2017 introduces Python support
• Scalable in that you can develop and test on a single machine
and then deploy to distributed or parallel processing platforms.
Platforms include:
– SQL Server on Windows
– Hadoop
– Spark
SQL Server Machine Learning
Services (In-Database)
• SQL Server R Services (In-Database) started in SQL Server 2016
• With SQL Server 2017, SQL Server Machine Learning Services (In-Database) allows us to use R and Python within SQL Server
• No need to open both an IDE and SQL tools to accomplish the work – no context switching needed!
• Can call libraries from Python or R to process data right within
SQL
Python vs R?
• SQL Server 2016? R
• SQL Server 2017? R and/or Python
• What are you familiar with?
• Look at tutorials – what makes sense?
• What features do you need and how are they supported by
Microsoft ML?
Python Support
• CPython 3.5
• revoscalepy – Python equivalents of RevoScaleR
• Remote compute contexts
• Also supports familiar libraries such as:
– scikit-learn
– TensorFlow
– Caffe
– Theano/Keras
R Code in SQL
-- R script: SqlData is the input data frame; SensorData is returned as the result set
DECLARE @rscript NVARCHAR(MAX);
SET @rscript = N'
SensorData <- SqlData;
print(summary(SensorData))';
-- Query whose rows are passed to the R script as SqlData
DECLARE @sqlscript NVARCHAR(MAX);
SET @sqlscript = N'
SELECT * FROM Sensors;';
EXEC sp_execute_external_script
@language = N'R',
@script = @rscript,
@input_data_1 = @sqlscript,
@input_data_1_name = N'SqlData',
@output_data_1_name = N'SensorData';
Python Code in SQL
EXECUTE sp_execute_external_script
@language = N'Python',
@script = N'
# InputDataSet is the pandas DataFrame built from @input_data_1
summary = InputDataSet.describe()
print(summary.transpose())
',
@input_data_1 = N'SELECT * FROM Sensors';
GO
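When these calls are generated from client code, the script and the query both travel as T-SQL string literals, so embedded single quotes must be doubled. The helper below is a hypothetical sketch (build_external_script_call is not part of any Microsoft library, and the Location column is made up); it only assembles the text of the batch and does not connect to a server.

```python
def build_external_script_call(script, input_query, language="Python"):
    """Return a T-SQL batch that runs `script` via sp_execute_external_script."""

    def quote(text):
        # Inside an N'...' literal, T-SQL escapes a single quote by doubling it
        return "N'" + text.replace("'", "''") + "'"

    return (
        "EXEC sp_execute_external_script\n"
        f"    @language = {quote(language)},\n"
        f"    @script = {quote(script)},\n"
        f"    @input_data_1 = {quote(input_query)};"
    )

batch = build_external_script_call(
    script="print(InputDataSet.describe().transpose())",
    input_query="SELECT * FROM Sensors WHERE Location = 'attic'",
)
print(batch)
```

In production code a parameterized call through a database driver is usually preferable to string building; the sketch is only meant to show which pieces the stored procedure expects.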
RevoScaleR and
RevoScalePy
What is RevoScaleR?
• A library written in R that includes functions for importing,
transforming, and analyzing data
• Scalable, portable, and easily distributable
• Things it can do include:
– Descriptive statistics
– Generalized linear models
– Logistic Regression
– Classification trees
– Decision forest
• Multithreaded and multinode
Running RevoScaleR
• Part of the Machine Learning Server and Microsoft R products
• Can use any R IDE to write scripts that use RevoScaleR
• Needs to be run on a computer with the interpreter and libraries
• Two modalities:
– Locally
– Remote compute context: shift execution to the server (Windows server, Hadoop, or Spark)
Prediction
• Linear models
• Logistic regression models
• Generalized linear models
• Covariance and correlation
• Decision forest
• K-means clustering
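As a rough picture of the simplest item in this list, a linear model, here is ordinary least squares for one predictor written in plain Python. It is not RevoScaleR code, and the hourly temperatures are invented for the example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours since midnight and the temperature read at each hour
hours = [0, 1, 2, 3, 4]
temps = [60.0, 61.0, 62.0, 63.0, 64.0]

slope, intercept = fit_line(hours, temps)
predicted = slope * 5 + intercept  # extrapolate one hour past the data
print(slope, intercept, predicted)
```

The library versions add what this sketch leaves out, such as multiple predictors and processing data in chunks.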
Understanding Data with
RevoScaleR
Typical Workflow with RevoScaleR
• Move Data: Import / Export
• Tidy Data: Clean, Manipulate, Transform
• Present Data: Visualize
• Make Decisions: Analyze, Learn, Predict
Key Pieces for Analysis with RevoScaleR
• Data Source
• Compute Context
• Analytic Function
Data Sources
• Comma-delimited text data
• SAS
• SPSS
• XDF
• ODBC
• Teradata
• SQL Server
Graphing
with
RevoScaleR
• rxHistogram
• rxLinePlot
• rxLorenz
• rxRocCurve
Descriptive Statistics
• rxQuantile
• rxSummary
• rxCrossTabs
• rxCube
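To give a feel for what these functions report, the sketch below computes a summary, quartiles, and a cross-tabulation with the Python standard library. It is only loosely analogous to rxSummary, rxQuantile, and rxCrossTabs, and the readings and the status column are hypothetical.

```python
from collections import Counter
from statistics import mean, quantiles

# Hypothetical readings: (sensor_id, status, temperature)
readings = [
    ("attic", "ok", 58.0),
    ("attic", "ok", 60.0),
    ("attic", "stale", 62.0),
    ("porch", "ok", 64.0),
    ("porch", "stale", 66.0),
]

temps = [temp for _, _, temp in readings]

# Summary of one numeric column
summary = {"n": len(temps), "min": min(temps), "mean": mean(temps), "max": max(temps)}

# Quartile cut points (statistics.quantiles requires Python 3.8 or later)
q1, median, q3 = quantiles(temps, n=4)

# Cross-tabulation: count of rows per (sensor, status) pair
crosstab = Counter((sensor, status) for sensor, status, _ in readings)

print(summary)
print(q1, median, q3)
print(dict(crosstab))
```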
Two Use Cases for Remote Compute Context
• Running R in T-SQL scripts or stored procedures
• Calling RevoScaleR in R from a SQL context
Visual Studio 2017: One IDE with
Common Tools
• Python Tools for Visual Studio
• R Tools for Visual Studio
• SQL Server capabilities within Visual Studio
Additional Support
Polyglot Data Scientist Presentation
Resources
• R Services in SQL Server 2016 (Channel 9)
• Built-in machine learning in Microsoft SQL Server 2017 with Python
(Build 2017)
• MicrosoftML 1.3.0: What’s new for machine learning in Microsoft
R Server (Channel 9)
• Using Visual Studio for Machine Learning (Build 2017)
• Performance patterns for machine learning services in SQL Server
(Microsoft Ignite 2017)
Learn More
Resources
• Kaggle: The Home of Data Science and Machine Learning
• DataCamp: Learn R, Python, and Data Science Online
• Difference between Machine Learning, Data Science, AI, Deep
Learning, and Statistics – Vincent Granville
• Python Tutorial from Mode Analytics
• Coursera
– Mastering Software Development in R Specialization
– Data Science Specialization
– Applied Data Science with Python Specialization
– Executive Data Science Specialization
Contact Me
• Twitter: @sadukie
• Blog: http://codinggeekette.com
• Email: sarah@cletechconsulting.com
Sarah Dutkiewicz
Cleveland Tech Consulting, LLC
Owner
