Azure Data Lake Intro (SQLBits 2016)

SQLBits 2016
Azure Data Lake & U-SQL
Michael Rys, @MikeDoesBigData
http://www.azure.com/datalake
{mrys, usql}@microsoft.com

Traditional data warehousing approach
Data sources, ETL, Data warehouse, BI and analytics
Understand Corporate Strategy
Gather Requirements
• Business Requirements
• Technical Requirements
Implement Data Warehouse
• Dimension Modelling, Physical Design
• ETL Design, ETL Development
• Reporting & Analytics Design, Reporting & Analytics Development
• Setup Infrastructure, Install and Tune

The Data Lake approach
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis using analytic engines like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
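The "store in native format without schema definition" idea is often described as schema-on-read. The following is a minimal Python sketch of that idea, not anything taken from the talk: the sample log lines, the `read_with_schema` helper, and the column names are all hypothetical and exist only for illustration.

```python
import csv
import io

# Raw data kept exactly as ingested; no schema is declared at write time.
# (Hypothetical sample lines, for illustration only.)
RAW_LOG = "2016-05-04,search,120\n2016-05-04,click,45\n2016-05-05,search,98\n"


def read_with_schema(raw_text, schema):
    """Apply a schema at read time.

    `schema` is a list of (column_name, cast_function) pairs that is
    matched positionally against each raw record.
    """
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append(
            {name: cast(value) for (name, cast), value in zip(schema, record)}
        )
    return rows


# The analysis, not the store, decides how the raw text is interpreted.
typed = read_with_schema(RAW_LOG, [("day", str), ("event", str), ("count", int)])
total_search = sum(r["count"] for r in typed if r["event"] == "search")
```

A different analysis could read the same raw text with a different schema (for example, treating every column as a string) without rewriting the stored data.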

MICROSOFT DOUBLES SEARCH SHARE
Search share by year: 2009: 9%, 2010: 11%, 2011: 15%, 2012: 16%, 2013: 18%, 2014: 19%, 2015: 20%
Source: ComScore 2009-2015 Search Report US
How Microsoft has used Big Data
We needed to better leverage data and analytics to win in search
We changed our approach
• More experiments by more people!
So we…
• Built an Exabyte-scale data lake for everyone to put their data.
• Built tools approachable by any developer.
• Built machine learning tools for collaborating across large experiment models.

Introducing Azure Data Lake
Big Data Made Easy

Cortana Analytics Suite
Big Data & Advanced Analytics

Azure Data Lake
Analytics: HDInsight (“managed clusters”), Azure Data Lake Analytics
Storage: Azure Data Lake Storage

Azure Data Lake Storage Service

Azure Data Lake Store
A hyper scale repository for big data analytics workloads
IN PREVIEW
• No limits to SCALE
• Store ANY DATA in its native format
• HADOOP FILE SYSTEM (HDFS) for the cloud
• ENTERPRISE GRADE access control, encryption at rest
• Optimized for analytic workload PERFORMANCE

Data Lake Store: Built for the cloud
Secure: Must be highly secure to prevent unauthorized access (especially as all data is in one place).
Native format: Must permit data to be stored in its ‘native format’ to track lineage and for data provenance.
Low latency: Must have low latency for high-frequency operations.
Multiple analytic frameworks: Must support multiple analytic frameworks (batch, real-time, streaming, machine learning, etc.). No one analytic framework can work for all data and all types of analysis.
Details: Must be able to store data with all details; aggregation may lead to loss of details.
Throughput: Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark.
Reliable: Must be highly available and reliable (no permanent loss of data).
Scalable: Must be highly scalable. When storing all data indefinitely, data volumes can quickly add up.
All sources: Must be able to ingest data from a variety of sources: LOB/ERP, logs, devices, social networks, etc.

Four pillars of security and compliance

Azure HDInsight
Hadoop Platform as a Service on Azure
• FULLY SUPPORTED Hadoop for the cloud
• Available on LINUX and WINDOWS
• Works on AZURE STORAGE or DATA LAKE STORE
• 100% OPEN SOURCE Apache Hadoop (HDP 2.3)
• Clusters up and RUNNING IN MINUTES
• Use familiar BI TOOLS FOR ANALYSIS like Excel

Azure Data Lake Analytics Service

Azure Data Lake (Store, HDInsight, Analytics)
Analytics: ADL Analytics (U-SQL), HDInsight (Hive)
YARN
WebHDFS
Storage: ADL Store

ADLA complements HDInsight
Target the same scenarios, tools, and customers
HDInsight
• For developers familiar with open source tools: Java, Eclipse, Hive, etc.
• Clusters offer customization, control, and flexibility in a managed Hadoop cluster
ADLA
• Enables customers to leverage existing experience with C#, SQL & PowerShell
• Offers convenience, efficiency, automatic scale, and management in a “job service” form factor

Azure Data Lake Analytics
A distributed analytics service built on Apache YARN that dynamically scales to your needs
IN PREVIEW
• No limits to SCALE
• Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#
• Optimized to work with ADL STORE
• FEDERATED QUERY across Azure data sources
• ENTERPRISE GRADE role-based access control and auditing
• Pay PER QUERY and scale PER QUERY

Work across all cloud data
Azure Data Lake Analytics
• Azure SQL DW
• Azure SQL DB
• Azure Storage Blobs
• Azure Data Lake Store
• SQL DB in an Azure VM

Simplified management and administration
• Web-based management in Azure Portal
• Automate tasks using PowerShell
• Role-based access control with Azure AD
• Monitor service operations and activity

Get started
1. Log in to Azure
2. Create an ADLA account
3. Write and submit an ADLA job with U-SQL (or Hive/Pig)
4. The job reads and writes data from storage (ADLS, Azure Blobs, Azure DB, …)
30 seconds

Account Management
Create new account
List accounts
Update account properties
Delete account
Transferring Data
Upload into store from local disk
Download from store to local disk
Files and Folders
List contents of folder
Create
Move
Delete
Does file exist
Security
Get ACLs
Update ACLs
Get Owner
Set Owner
File Content
Set file content
Append file content
Get file content
Merge files
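Since the store is described as HDFS for the cloud and exposes WebHDFS, several of the file and folder operations listed above can be pictured as WebHDFS REST calls. The sketch below only builds request URLs and sends nothing. The operation names (`LISTSTATUS`, `RENAME`, and so on) are from the Apache Hadoop WebHDFS REST API; the pairing of each one with a slide label, the `build_request` helper, the account name, and the `azuredatalakestore.net` host pattern are assumptions for illustration and should be checked against the service documentation.

```python
from urllib.parse import quote, urlencode

# Slide operation -> (HTTP method, WebHDFS op).
# Op names follow the Apache Hadoop WebHDFS REST API; the pairing with the
# slide's labels is an assumption made for this sketch.
WEBHDFS_OPS = {
    "list_folder": ("GET", "LISTSTATUS"),
    "create_folder": ("PUT", "MKDIRS"),
    "move": ("PUT", "RENAME"),
    "delete": ("DELETE", "DELETE"),
    "exists": ("GET", "GETFILESTATUS"),
    "get_acls": ("GET", "GETACLSTATUS"),
    "set_owner": ("PUT", "SETOWNER"),
    "append": ("POST", "APPEND"),
    "get_content": ("GET", "OPEN"),
    "merge": ("POST", "CONCAT"),
}


def build_request(account, action, path, **params):
    """Return (http_method, url) for one file-system action.

    `account` is a hypothetical store account name; the host pattern
    below is assumed, not taken from the slides.
    """
    method, op = WEBHDFS_OPS[action]
    base = (
        f"https://{account}.azuredatalakestore.net"
        f"/webhdfs/v1/{quote(path.lstrip('/'))}"
    )
    query = urlencode({"op": op, **params})
    return method, f"{base}?{query}"


# Example: move /raw/a.csv to /archive/a.csv (no request is sent here).
method, url = build_request(
    "myaccount", "move", "/raw/a.csv", destination="/archive/a.csv"
)
```

A real call would also need authentication (for example, an Azure AD token sent with the request), which this sketch leaves out.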

Account Management
Create new account
List accounts
Update account properties
Delete account
Data Sources
Add a data source
List data sources
Update data source
Delete data source
Compute
List jobs
Submit job
Cancel job
Catalog Items
List items in U-SQL catalog
Update item
Catalog Secrets
Create catalog secret
List catalog secrets
Delete catalog secrets

Your application
ADL .NET SDKs, ADL Node.js SDK, ADL Java SDK, ADL PowerShell, ADL XPlat CLI
Azure and ADL REST APIs

Analytics
• Management: Create and manage ADLA accounts
• Jobs: Submit and manage jobs
• Catalog: Explore catalog items
Store
• Management: Create and manage ADLS accounts
• File System: Upload, download, list, delete, rename, append (WebHDFS)

SDKs NuGet packages
Analytics .NET SDK
• Management
• Catalog
• Jobs
Store .NET SDK
• Management
• Filesystem
• Uploader
