© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Architecting a Serverless Data Lake on AWS
What is a Data Lake?
A Data Lake allows you to store all your structured and
unstructured data, in one centralized repository, and at
any scale. With a Data Lake, you can store your data as-is,
without having to first structure the data, based on
potential questions you may have in the future. Data Lakes
also allow you to run different types of analytics on your
data like SQL queries, big data analytics, full text search,
real-time analytics, and machine learning to guide better
decisions.
Characteristics of a Data Lake
• Future Proof
• Flexible Access
• Dive in Anywhere
• Collect Anything
What is Serverless computing?
No Server Management
High Availability
No Idle Capacity
Flexible Scaling
Building a Data Lake on AWS
Analytics | Machine Learning
Real-time
Data Movement
On-premises
Data Movement
Data Lake on AWS
Storage | Archival Storage | Data Catalog
Data Movement From on-premises Datacenters
AWS Snowball, Snowball Edge, and Snowmobile
Petabyte- and exabyte-scale data transport solutions that use secure
appliances to transfer large amounts of data into and out of the AWS cloud
AWS Direct Connect
Establish a dedicated network connection from your premises to AWS; reduce
your network costs, increase bandwidth throughput, and get a more consistent
network experience than Internet-based connections
AWS Storage Gateway
Lets your on-premises applications use AWS for storage; includes a highly
optimized data transfer mechanism, bandwidth management, and a local cache
AWS Database Migration Service
Migrate databases from the most widely used commercial and open-source
offerings to AWS quickly and securely with minimal downtime to applications
Data Movement From Real-time Sources
Amazon Kinesis
Video Streams
Securely stream video
from connected devices
to AWS for analytics,
machine learning (ML),
and other processing
Amazon Kinesis Data
Firehose
Capture, transform, and
load data streams into
AWS data stores for near
real-time analytics with
existing business
intelligence tools.
Amazon Kinesis Data
Streams
Build custom, real-time
applications that process
data streams using
popular stream
processing frameworks
AWS IoT Core
Supports billions of
devices and trillions of
messages, and can
process and route those
messages to AWS
endpoints and to other
devices reliably and
securely
Amazon S3—Object Storage
Secure, highly scalable, durable object storage with millisecond latency for data access
Store any type of data–web sites, mobile apps, corporate applications, and IoT sensors
Security and
Compliance
Three different forms of
encryption; encrypts data
in transit when
replicating across regions;
log and monitor with
CloudTrail, use ML to
discover and protect
sensitive data with Macie
Flexible Management
Classify, report, and
visualize data usage
trends; objects can be
tagged to see storage
consumption, cost, and
security; build lifecycle
policies to automate
tiering, and retention
Durability, Availability
& Scalability
Built for eleven nines of
durability; data
distributed across 3
physical facilities in an
AWS region; can be
automatically replicated
to any other AWS region
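To make "eleven nines" concrete, here is a minimal arithmetic sketch. It assumes the figure means that 99.999999999% of stored objects survive a given year; the object count is a hypothetical value chosen for illustration.

```python
from fractions import Fraction

# Eleven nines of durability, read as an annual per-object loss rate of 1e-11.
# Fraction keeps the arithmetic exact (no floating-point rounding).
ANNUAL_LOSS_RATE = Fraction(1, 10**11)


def expected_annual_losses(object_count: int) -> Fraction:
    """Average number of objects expected to be lost per year."""
    return object_count * ANNUAL_LOSS_RATE


def years_per_single_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


if __name__ == "__main__":
    stored = 10_000_000  # hypothetical number of stored objects
    print(years_per_single_loss(stored))
```

Under these assumptions, storing ten million objects means losing one object roughly every 10,000 years on average.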
Query in Place
Run analytics & ML on
data lake without data
movement; S3 Select can
retrieve a subset of an object's data,
improving analytics
performance by up to 400%
Amazon Glacier—Backup and Archive
Secure, durable, and extremely low-cost storage for data archiving and long-term backup
Store data at $0.004/GB/month
Durability, Availability
& Scalability
Built for eleven nines of
durability; data
distributed across 3
physical facilities in an
AWS region;
automatically replicated
to any other AWS region
Secure
Log and monitor with
CloudTrail, Vault Lock
enables WORM storage
capabilities, helping
satisfy compliance
requirements
Retrieves data in
minutes
Three retrieval options to
fit your use case;
expedited retrievals with
Glacier Select can return
data in minutes
Inexpensive
Lowest cost AWS object
storage class, allowing
you to archive large
amounts of data at a very
low cost
Storing is Not Enough, Data Needs to Be Discoverable
Dark data are the information
assets organizations collect,
process, and store during
regular business activities,
but generally fail to use for
other purposes (for example,
analytics, business relationships
and direct monetizing).
[Figure: data categories: traditional enterprise data, big data, dark data. Example sources: CRM, ERP, data warehouse, mainframe data, web, social, log files, machine data. Data types: semi-structured, unstructured.]
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data
AWS Glue—Data Catalog
• Automatically discovers data and stores schema
• Catalog makes data searchable, and available for ETL
• Catalog contains table and job definitions
• Computes statistics to make queries efficient
Glue
Data Catalog
Discover data and
extract schema
Compliance
AWS Glue—ETL Service
• Automatically generates ETL code
• Code is customizable with Python
and Spark
• Endpoints provided to edit, debug,
test code
• Jobs are scheduled or event-based
• Serverless
Amazon Redshift—Data Warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
Fast at scale
Columnar storage
technology to improve
I/O efficiency and scale
query performance
Secure
Audit everything; encrypt
data end-to-end;
extensive certification
and compliance
Open file formats
Analyze optimized data
formats on the latest
SSD, and all open data
formats in Amazon S3
Inexpensive
As low as $1,000 per
terabyte per year, 1/10th
the cost of traditional
data warehouse
solutions; start at $0.25
per hour
Amazon Redshift Spectrum
S3 data lake | Redshift data
Redshift Spectrum query engine
• Exabyte-scale Redshift SQL queries against S3
• Join data across Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Grok, Avro, & Parquet data formats
• Pay only for the amount of data scanned
Amazon EMR—Big Data Processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
Low cost
Flexible billing with per-
second billing, EC2 spot,
reserved instances and
auto-scaling to reduce
costs 50–80%
Easy
Launch fully managed
Hadoop & Spark in
minutes; no cluster
setup, node provisioning,
cluster tuning
Latest versions
Updated with the latest
open source frameworks
within 30 days of release
Use S3 storage
Process data directly in
the S3 data lake securely
with high performance
using the EMRFS
connector
Amazon Elasticsearch Service
Easy to deploy, secure, operate, and scale Elasticsearch
Customers use Elasticsearch for log analytics, full-text search & application monitoring
Easy to Use
Fully managed;
Deploy production-ready
clusters in minutes
Secure
Secure access with VPC to
keep all traffic within
AWS network
Open
Direct access to
Elasticsearch open-source
APIs; supports Logstash
and Kibana
Available
Zone awareness
replicates data between
two AZs; automatically
monitors & replaces
failed nodes
Amazon Kinesis Data Analytics
Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Query Instantly
Zero setup cost; just
point to S3 and
start querying
SQL
Open
ANSI SQL interface,
JDBC/ODBC drivers,
multiple formats,
compression types,
and complex joins and
data types
Easy
Serverless: zero
infrastructure, zero
administration
Integrated with
QuickSight
Pay per query
Pay only for queries
run; save 30–90% on
per-query costs
through compression
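The pay-per-query model can be illustrated with a short cost sketch. The price of $5 per terabyte scanned is an assumption not taken from the slides (check current pricing), and the function names are hypothetical.

```python
PRICE_PER_TB_SCANNED = 5.00  # USD; assumed list price, not from the slides
TB = 1024 ** 4               # bytes in one terabyte (binary)


def query_cost(bytes_scanned: float) -> float:
    """Cost in USD of a query that scans the given number of bytes."""
    return bytes_scanned / TB * PRICE_PER_TB_SCANNED


def savings_from_compression(raw_bytes: float, compression_ratio: float) -> float:
    """Fraction of query cost saved when data shrinks by compression_ratio (4.0 means 4:1)."""
    compressed_bytes = raw_bytes / compression_ratio
    return 1 - query_cost(compressed_bytes) / query_cost(raw_bytes)


if __name__ == "__main__":
    print(query_cost(TB))
    print(savings_from_compression(TB, 4.0))
```

Because cost scales with bytes scanned, a 4:1 compression ratio cuts the per-query cost by 75% in this model, which falls inside the 30–90% range quoted on the slide.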
Amazon QuickSight
Fast, easy to use, serverless analytics at 1/10 the cost of traditional BI
• Empower everyone
• Seamless connectivity
• Fast analysis
• Serverless
Machine Learning on AWS
APPLICATION SERVICES: Rekognition, Transcribe, Translate, Polly, Comprehend, Lex
PLATFORM SERVICES: Amazon SageMaker, AWS DeepLens
FRAMEWORKS & INTERFACES: AWS Deep Learning AMIs with Caffe2, CNTK, Apache MXNet, PyTorch, TensorFlow, Torch, Keras, Gluon
INFRASTRUCTURE: CPU, GPU (P3), IoT & Edge, Mobile
Demo
Are we all ready to build a Data Lake?
Let's do that right here, right now!
What are we building?
Ingest: Kinesis Data Generator (transactions) → Kinesis Data Firehose delivery stream
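The ingest step can be sketched as follows. This is a minimal illustration, not the demo's actual generator: the transaction fields and the stream name are hypothetical, and the `send` helper needs boto3 and AWS credentials, so only the payload builder runs here.

```python
import json


def to_firehose_record(transaction: dict) -> dict:
    """Serialize one transaction as a newline-terminated JSON record."""
    line = json.dumps(transaction, sort_keys=True) + "\n"
    return {"Data": line.encode("utf-8")}


def send(transaction: dict, stream_name: str) -> None:
    """Send one record to a delivery stream (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    firehose = boto3.client("firehose")
    firehose.put_record(
        DeliveryStreamName=stream_name,  # hypothetical stream name supplied by the caller
        Record=to_firehose_record(transaction),
    )


if __name__ == "__main__":
    print(to_firehose_record({"id": 1, "amount": 9.5}))
```

Ending each record with a newline keeps the objects that Firehose writes to S3 readable as one JSON document per line.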
Kinesis Data Firehose – How it Works
Ingest → Transform → Deliver
Sources: AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, Apache Kafka
Destinations: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service
Key Features
Data durability:
• Data backup to S3 upon delivery or transformation failure
• 3X data replication in delivery stream for high data durability
Up to 24 hours data retention in delivery stream to absorb backpressure
from destinations
Serverless Data Transformation
Kinesis Firehose AWS Lambda
Pre-Built Data Transformation Blueprints
• General Processing
• Apache Log to JSON
• Apache Log to CSV
• Syslog to JSON
• Syslog to CSV
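A transformation function in the spirit of these blueprints might look like the sketch below. It follows the Firehose transformation contract (base64-encoded `data`, a `recordId`, and a `result` per record) but is not one of the pre-built blueprints; the CSV field layout is a hypothetical example.

```python
import base64
import json

FIELDS = ("transaction_id", "customer_id", "amount")  # hypothetical CSV layout


def lambda_handler(event, context):
    """Convert each incoming CSV line into a JSON line."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        values = line.split(",")
        if len(values) != len(FIELDS):
            # Malformed input: return the original data marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        payload = json.dumps(dict(zip(FIELDS, values))) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

Every incoming `recordId` appears exactly once in the response, so the delivery stream can tell which records were transformed and which failed.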
What are we building?
Ingest: Kinesis Data Generator (transactions) → Kinesis Data Firehose delivery stream
Store & Catalog (Data Lake): Amazon S3 (Raw) holding transactions and reference data; AWS Glue (Data Catalog)
AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon
Redshift, etc.) into a single categorized list that is searchable
• Unified Metadata Repository
across Data Stores
• Schema Versioning
• Shared across AWS Glue, Amazon
Athena, Amazon Redshift
Spectrum and Amazon EMR
What are Crawlers?
Crawlers automatically build your Data Catalog and keep it in sync.
• Scan your data stored in various data stores, extract metadata and data
statistics, and add table definitions to your Data Catalog
• Classify data using built-in and custom classifiers
• You can write your own using Grok expressions
• Discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
• Run ad hoc or on a schedule; serverless – only pay when crawler runs
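Hive-style partitions are `name=value` segments in an object key. The sketch below shows how such segments can be read from a key; it illustrates the naming convention only, not how Glue crawlers are implemented, and the example key is hypothetical.

```python
def parse_hive_partitions(key: str) -> dict:
    """Return the name=value path segments of an object key as a dict."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


if __name__ == "__main__":
    key = "raw/transactions/year=2018/month=06/day=14/part-0000.json"  # hypothetical key
    print(parse_hive_partitions(key))
```

Each recovered name becomes a partition column, which query engines can use to skip data that a query does not need.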
Custom Classifiers
You can write a custom classifier by providing a Grok
pattern and a classification string for the matched
schema
A Grok pattern is a named set of regular expressions
(regex) that are used to match data one line at a time.
Example:
%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}
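To show how a Grok pattern matches one line at a time, the sketch below expands each `%{NAME:field}` token into a named regular-expression group. The sub-pattern definitions are simplified stand-ins and the log line is hypothetical; both are assumptions, not the definitions Glue ships.

```python
import re

# Simplified stand-ins for the Grok sub-patterns (assumptions for illustration).
SUBPATTERNS = {
    "TIMESTAMP_ISO8601": r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?",
    "MESSAGEPREFIX": r"[^\]]+",
    "CRAWLERLOGLEVEL": r"BENCHMARK|ERROR|WARN|INFO|TRACE",
    "GREEDYDATA": r".*",
}

GROK = (r"%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] "
        r"%{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}")


def grok_to_regex(pattern: str) -> str:
    """Replace each %{NAME:field} token with a named regex group."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{SUBPATTERNS[name]})"
    return re.sub(r"%\{(\w+):(\w+)\}", expand, pattern)


if __name__ == "__main__":
    line = "2018-06-14T10:15:30Z [crawler-1] INFO : Crawl finished"  # hypothetical log line
    match = re.fullmatch(grok_to_regex(GROK), line)
    print(match.groupdict() if match else None)
```

The field names after the colon become the column names of the schema that the classifier reports.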
What are we building?
Ingest: Kinesis Data Generator (transactions) → Kinesis Data Firehose delivery stream
Store & Catalog (Data Lake): Amazon S3 (Raw) holding transactions and reference data; AWS Glue (Data Catalog)
Process: AWS Glue (Transform) enriches the data → Amazon S3 (Processed) holding enriched data
Job Authoring: Automatic Code Generation
1. Customize the mappings
2. Glue generates transformation graph and Python or Scala code
3. Customize the code based on your requirements
Job Authoring: Developer Endpoints
• Environment to iteratively explore data with Apache Spark SQL
• Develop and test ETL code
• Connect your IDE or notebook (e.g., Zeppelin) to a Glue development endpoint
• When you are satisfied with the results, you can create an ETL job that runs your code
[Figure labels: Glue's Apache Spark environment; remote interpreter; interpreter server]
DynamicFrame Transforms
• ResolveChoice(): resolve a column with ambiguous types by projecting to one type, casting, or separating into columns
• ApplyMapping(): map source fields to target fields
• 15+ transforms out of the box
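The idea behind a mapping transform can be sketched in plain Python. This is not the Glue API: it mimics the (source, source type, target, target type) mapping tuples on a list of dicts, and the field names and supported types are hypothetical.

```python
# Target types supported by this sketch (an assumption, not Glue's full type list).
CASTS = {"string": str, "int": int, "double": float}


def apply_mapping(records, mappings):
    """Rename and cast fields; each mapping is (source, source_type, target, target_type)."""
    result = []
    for record in records:
        row = {}
        for source, _source_type, target, target_type in mappings:
            if source in record:
                row[target] = CASTS[target_type](record[source])
        result.append(row)  # fields without a mapping are dropped
    return result


if __name__ == "__main__":
    raw = [{"id": "7", "amt": "9.50", "extra": 1}]
    print(apply_mapping(raw, [("id", "string", "transaction_id", "int"),
                              ("amt", "string", "amount", "double")]))
```

Declaring the mapping as data rather than code is what lets a tool generate and later regenerate the transformation script.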
Relationalize() Transform
Semi-structured schema Relational schema
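[Figure: a semi-structured schema with nested fields (C.X, C.Y) and an array is mapped to relational tables linked by primary key (PK) and foreign key (FK), with offset and value columns for array elements]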
• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing
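The flattening step can be illustrated with a simplified plain-Python sketch. It is not Glue's implementation: table and column names (`root`, `id`, `index`, `val`) are chosen for illustration, and the row position stands in for a generated key.

```python
def relationalize(records, root="root"):
    """Flatten nested dicts into dotted columns and move lists into child tables."""
    tables = {root: []}
    for pk, record in enumerate(records):
        row = {}
        _flatten(record, "", row, tables, root, pk)
        tables[root].append(row)
    return tables


def _flatten(value, prefix, row, tables, root, pk):
    for key, item in value.items():
        column = f"{prefix}{key}"
        if isinstance(item, dict):
            # Nested struct: recurse and prefix the child column names.
            _flatten(item, column + ".", row, tables, root, pk)
        elif isinstance(item, list):
            # Array: the parent row keeps a key, the elements go to a child table.
            row[column] = pk
            child = tables.setdefault(f"{root}_{column}", [])
            for offset, element in enumerate(item):
                child.append({"id": pk, "index": offset, "val": element})
        else:
            row[column] = item


if __name__ == "__main__":
    print(relationalize([{"A": 1, "B": [10, 20], "C": {"X": "x", "Y": "y"}}]))
```

The child table's `id` column plays the role of the foreign key, so the original nesting can be rebuilt with a join.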
Job Bookmarks
Bookmarks are per-job checkpoints that track persisted state from
previous runs. They track the state of sources, transforms, and sinks.
Suppose you want to run a job periodically and:
• avoid reprocessing previous input
• avoid generating duplicate output
Examples:
• Process githubarchive files daily
• Process Firehose files hourly
• Track timestamps or primary keys in DBs
• Track generated foreign keys for normalization
[Figure: run 1, run 2, run 3]
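The checkpoint idea can be sketched as follows, assuming the bookmark is simply the modification timestamp of the newest file handled so far. This is an illustration of the concept, not how Glue stores bookmark state; the file names and timestamps are hypothetical.

```python
def run_job(available_files, bookmark):
    """Process files newer than the bookmark and return (processed_names, new_bookmark).

    available_files: iterable of (name, modified_timestamp) pairs.
    bookmark: timestamp of the newest file handled by earlier runs, or None.
    """
    pending = sorted(
        (f for f in available_files if bookmark is None or f[1] > bookmark),
        key=lambda f: f[1],
    )
    processed = [name for name, _ in pending]
    new_bookmark = pending[-1][1] if pending else bookmark
    return processed, new_bookmark


if __name__ == "__main__":
    files = [("a.json", 1), ("b.json", 2)]
    done, mark = run_job(files, None)   # run 1 handles both files
    files.append(("c.json", 3))
    done, mark = run_job(files, mark)   # run 2 handles only the new file
    print(done, mark)
```

Persisting the returned bookmark between runs is what prevents reprocessed input and duplicate output.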
Job Execution: Scheduling and Monitoring
Compose jobs globally with event-based dependencies
• Easy to reuse and leverage work across organization boundaries
Multiple triggering mechanisms
• Schedule-based: e.g., time of day
• Event-based: e.g., job completion, job failure, job stopping events
• On-demand: e.g., AWS Lambda
• More coming soon!
Logs and alerts are available in Amazon CloudWatch
[Figure labels: Marketing: ad spend by customer segment; Sales: revenue by customer segment; Central: ROI by customer segment; weekly sales; triggers: event-based, Lambda trigger, schedule, data-based]
Job Execution: Serverless
• Auto-configure VPC and role-based access
• Customers can specify the capacity that gets allocated to each job
• You pay only for the resources you consume while consuming them
There is no need to provision, configure, or manage servers
[Figure labels: customer VPC; compute instances]
What are we building?
Ingest: Kinesis Data Generator (transactions) → Kinesis Data Firehose delivery stream
Store & Catalog (Data Lake): Amazon S3 (Raw) holding transactions and reference data; AWS Glue (Data Catalog)
Process: AWS Glue (Transform) enriches the data → Amazon S3 (Processed) holding enriched data
Consume: Amazon Athena (explore)
Amazon Athena: Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• Athena supports multiple data formats
• Text, CSV, TSV, JSON, weblogs, AWS service logs
• Or convert to an optimized form like ORC or Parquet for the best performance and lowest
cost
• No ETL required
• Stream data directly from Amazon S3
• Take advantage of Amazon S3 durability and availability
Use ANSI SQL
• Start writing ANSI SQL
• Support for complex joins, nested
queries & window functions
• Support for complex data types
(arrays, structs)
• Support for partitioning of data by
any key
• (date, time, custom keys)
• e.g., Year, Month, Day, Hour or
Customer Key, Date
Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI SQL compatible with extensions
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
What are we building?
Ingest: Kinesis Data Generator (transactions) → Kinesis Data Firehose delivery stream
Store & Catalog (Data Lake): Amazon S3 (Raw) holding transactions and reference data; AWS Glue (Data Catalog)
Process: AWS Glue (Transform) enriches the data → Amazon S3 (Processed) holding enriched data
Consume: Amazon Athena (explore), Amazon QuickSight
QuickSight: Connect to Data Wherever It Is
QuickSight is natively integrated with AWS data sources, as well as on-premises and hosted
databases and third-party business applications
On-premises
Securely connect to on-premises databases and flat files like Excel and CSV
• Excel
• CSV
• Teradata
• MySQL
• SQL Server
• PostgreSQL
In the cloud
Connect to hosted databases, big data formats, and secure VPCs
• Redshift
• RDS
• S3
• Athena
• Aurora
• Teradata
• MySQL
• Presto
• Spark
• SQL Server
• PostgreSQL
• MariaDB
• Snowflake
Applications
Connect directly to third-party business applications
• Salesforce
• Square
• Adobe Analytics
• Jira
• ServiceNow
• Twitter
• GitHub
SPICE
QuickSight is powered by SPICE, a super-fast calculation engine that delivers
performance and scale, regardless of how many users are active.
SPICE | Your Data Source
Data Governance
Create managed datasets that give power users and authors the flexibility to
perform self-serve analytics on data that you control.
Create datasets that:
• Can be shared with any user
• Automatically refresh
• Have row level security
• Users cannot modify
• Dynamically update
with changes
User Management and AD Integration
QuickSight Enterprise Edition can integrate with your Active Directory to
dynamically manage users and groups.
Thank you

More Related Content

Architecting a Serverless Data Lake on AWS

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Architecting a Serverless Data Lake on AWS
  • 2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is a Data Lake? A Data Lake allows you to store all your structured and unstructured data, in one centralized repository, and at any scale. With a Data Lake, you can store your data as- is, without having to first structure the data, based on potential questions you may have in the future. Data Lakes also allow you to run different types of analytics on your data like SQL queries, big data analytics, full text search, real-time analytics, and machine learning to guide better decisions.
  • 3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Characteristics of a Data Lake Future Proof Flexible Access Dive in Anywhere Collect Anything
  • 4. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is Serverless computing? No Server Management High Availability No Idle Capacity $ Flexible Scaling
  • 5. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Building a Data Lake on AWS AnalyticsMachine Learning Real-time Data Movement On-premises Data Movement Data Lake on AWS Storage | Archival Storage | Data Catalog
  • 6. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Building a Data Lake on AWS AnalyticsMachine Learning Real-time Data Movement On-premises Data Movement Data Lake on AWS Storage | Archival Storage | Data Catalog
  • 7. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Movement From on-premises Datacenters AWS Snowball, Snowball Edge and Snowmobile Petabyte and Exabyte- scale data transport solution that uses secure appliances to transfer large amounts of data into and out of the AWS cloud AWS Direct Connect Establish a dedicated network connection from your premises to AWS; reduces your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet- based connections AWS Storage Gateway Lets your on-premises applications to use AWS for storage; includes a highly-optimized data transfer mechanism, bandwidth management, along with local cache AWS Database Migration Service Migrate database from the most widely-used commercial and open- source offerings to AWS quickly and securely with minimal downtime to applications
  • 8. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Movement From Real-time Sources Amazon Kinesis Video Streams Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing Amazon Kinesis Data Firehose Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools. Amazon Kinesis Data Streams Build custom, real-time applications that process data streams using popular stream processing frameworks AWS IoT Core Supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely
  • 9. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Building a Data Lake on AWS AnalyticsMachine Learning Real-time Data Movement On-premises Data Movement Data Lake on AWS Storage | Archival Storage | Data Catalog
  • 10. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon S3—Object Storage Secure, highly scalable, durable object storage with millisecond latency for data access Store any type of data–web sites, mobile apps, corporate applications, and IoT sensors Security and Compliance Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail, use ML to discover and protect sensitive data with Macie Flexible Management Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering, and retention Durability, Availability & Scalability Built for eleven nine’s of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region Query in Place Run analytics & ML on data lake without data movement; S3 Select can retrieve subset of data, improving analytics performance by 400%
  • 11. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Glacier—Backup and Archive Secure, durable, and extremely low-cost storage for data archiving and long-term backup Store data at $0.004/GB/month Durability, Availability & Scalability Built for eleven nine’s of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region Secure Log and monitor with CloudTrail, Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements Retrieves data in minutes Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes Inexpensive Lowest cost AWS object storage class, allowing you to archive large amounts of data at a very low cost $
  • 12. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Storing is Not Enough, Data Needs to Be Discoverable Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing). Traditional enterprise data Big data Dark data CRM ERP Data warehouse Mainframe data Web Social Log files Machine data Semi- structured Unstructured Gartner IT Glossary, 2018 https://www.gartner.com/it-glossary/dark-data
  • 13. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue—Data Catalog • Automatically discovers data and stores schema • Catalog makes data searchable, and available for ETL • Catalog contains table and job definitions • Computes statistics to make queries efficient Glue Data Catalog Discover data and extract schema Compliance
  • 14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue—ETL Service • Automatically generates ETL code • Code is customizable with Python and Spark • Endpoints provided to edit, debug, test code • Jobs are scheduled or event-based • Serverless
  • 15. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Building a Data Lake on AWS AnalyticsMachine Learning Real-time Data Movement On-premises Data Movement Data Lake on AWS Storage | Archival Storage | Data Catalog
  • 16. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Redshift—Data Warehousing Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost Massively parallel, scale from gigabytes to petabytes Fast at scale Columnar storage technology to improve I/O efficiency and scale query performance Secure Audit everything; encrypt data end-to-end; extensive certification and compliance Open file formats Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3 Inexpensive As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour $
  • 17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Redshift Spectrum S3 data lakeRedshift data Redshift Spectrum query engine • Exabyte Redshift SQL queries against S3 • Join data across Redshift and S3 • Scale compute and storage separately • Stable query performance and unlimited concurrency • CSV, ORC, Grok, Avro, & Parquet data formats • Pay only for the amount of data scanned
  • 18. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon EMR—Big Data Processing Analytics and ML at scale 19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more Enterprise-grade security Low cost Flexible billing with per- second billing, EC2 spot, reserved instances and auto-scaling to reduce costs 50–80% $ Easy Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning Latest versions Updated with the latest open source frameworks within 30 days of release Use S3 storage Process data directly in the S3 data lake securely with high performance using the EMRFS connector Data Lake 100110000100101011 100101010111001010 100000111100101100 101010001100001
  • 19. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Elasticsearch Service Easy to deploy, secure, operate, and scale Elasticsearch Customers use Elasticsearch for log analytics, full-text search & application monitoring Easy to Use Fully managed; Deploy production-ready clusters in minutes Secure Secure access with VPC to keep all traffic within AWS network Open Direct access to Elasticsearch open-source APIs; supports Logstash and Kibana Available Zone awareness replicates data between two AZs; automatically monitors & replaces failed nodes
  • 20. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Kinesis Data Analytics
  • 21. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Athena—Interactive Analysis Interactive query service to analyze data in Amazon S3 using standard SQL No infrastructure to set up or manage and no data to load Query Instantly Zero setup cost; just point to S3 and start querying SQL Open ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types Easy Serverless: zero infrastructure, zero administration Integrated with QuickSight Pay per query Pay only for queries run; save 30–90% on per-query costs through compression $
  • 22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon QuickSight Fast, easy to use, serverless analytics at 1/10 the cost of traditional BI Empower everyone Seamless connectivity Fast analysis Serverless
  • 23. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Building a Data Lake on AWS AnalyticsMachine Learning Real-time Data Movement On-premises Data Movement Data Lake on AWS Storage | Archival Storage | Data Catalog
  • 24. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning on AWS PLATFORM SERVICES APPLICATION SERVICES FRAMEWORKS & INTERFACES Caffe2 CNTK Apache MXNet PyTorch TensorFlo w Torch Keras Gluon AWS Deep Learning AMIs Amazon SageMaker AWS DeepLens Rekognition Transcribe Translate Polly Comprehend Lex INFRASTRUCTURE CPU IoT & EdgeGPU (P3) Mobile
  • 25. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Demo Are we all ready to build a Data Lake?
  • 26. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Demo Lets Do That Right Here…..Right Now!
  • 27. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What are we building? Kinesis Data Firehose Delivery Stream Kinesis Data Generator Transactions Ingest
  • 28. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kinesis Data Firehose – How it Works Ingest Transform Deliver Amazon S3 Amazon Redshift Amazon Elasticsearch Service AWS IoT Amazon Kinesis Agent Amazon Kinesis Streams Amazon CloudWatch Logs Amazon CloudWatch Events Apache Kafka
  • 29. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Key Features Data durability: • Data backup to S3 upon delivery or transformation failure • 3X data replication in delivery stream for high data durability Up to 24 hours data retention in delivery stream to absorb backpressure from destinations
  • 30. Serverless Data Transformation: Kinesis Firehose with AWS Lambda. Pre-built data transformation blueprints: • General Processing • Apache Log to JSON • Apache Log to CSV • Syslog to JSON • Syslog to CSV
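As a rough illustration of what a general-processing transformation might look like, here is a minimal sketch of a Lambda handler for Firehose records. It relies on the Firehose transformation contract (base64-encoded `data`, and a `recordId`, `result`, and `data` returned per record); the JSON payload shape and the `processed` field are hypothetical and are not taken from the demo.

```python
import base64
import json


def lambda_handler(event, context):
    """Decode each Firehose record, enrich it, and hand it back re-encoded."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical enrichment step: mark the record as processed.
            payload["processed"] = True
            line = json.dumps(payload) + "\n"  # newline-delimited JSON
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": data}
            )
        except (ValueError, TypeError):
            # Unparseable records are flagged as failed and returned unchanged.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

Returning failed records with `ProcessingFailed` rather than raising lets the rest of the batch continue; the failed records can then be handled by the stream's failure path.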
  • 31. What are we building? (diagram) Ingest: Kinesis Data Generator (Transactions), Kinesis Data Firehose Delivery Stream. Store & Catalog (Data Lake): Amazon S3 (Raw), AWS Glue (Data Catalog), holding • Transactions • Reference
  • 32. AWS Glue Data Catalog: bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable. • Unified metadata repository across data stores • Schema versioning • Shared across AWS Glue, Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR
  • 33. What are Crawlers? Crawlers automatically build your Data Catalog and keep it in sync. • Scan your data stored in various data stores, extract metadata and data statistics, and add table definitions to your Data Catalog • Classify data using built-in and custom classifiers (you can write your own using Grok expressions) • Discover new data and extract schema definitions • Detect schema changes and version tables • Detect Hive-style partitions on Amazon S3 • Run ad hoc or on a schedule; serverless, so you pay only when the crawler runs
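To show what "Hive-style partitions" means in practice, the sketch below pulls `key=value` path segments out of an S3 object key. It only illustrates the naming convention; it is not how the Glue crawler is implemented, and the example key is hypothetical.

```python
def parse_hive_partitions(key):
    """Return the key=value partition segments found in an S3 object key."""
    partitions = {}
    # The last path segment is the object (file) name, so skip it.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions
```

For a key such as `raw/transactions/year=2018/month=03/day=05/part-0000.json`, this yields the partition columns `year`, `month`, and `day`, which a query engine can use to skip data outside the requested range.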
  • 34. Custom Classifiers: you can write a custom classifier by providing a Grok pattern and a classification string for the matched schema. A Grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time. Example: %{TIMESTAMP_ISO8601:timestamp} [%{MESSAGEPREFIX:message_prefix}] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}
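Since a Grok pattern is a set of named regular expressions, the example above can be approximated with a plain Python regex using named groups. The sub-patterns here are simplified, hand-written stand-ins (assumptions) for TIMESTAMP_ISO8601, MESSAGEPREFIX, CRAWLERLOGLEVEL, and GREEDYDATA, not the real Grok definitions, and the literal square brackets are escaped as a regex requires.

```python
import re

# Simplified stand-ins for the Grok sub-patterns (assumptions, not the
# actual Grok library definitions).
LOG_LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"
    r"(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?) "
    r"\[(?P<message_prefix>[^\]]+)\] "
    r"(?P<loglevel>[A-Z]+) : "
    r"(?P<message>.*)"
)


def parse_line(line):
    """Match one log line and return its named fields, or None if no match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None
```

Each named group plays the role of one `%{PATTERN:field}` token, so the returned dictionary has the same field names the classifier would map to columns.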
  • 35. What are we building? (diagram) Ingest: Kinesis Data Generator (Transactions), Kinesis Data Firehose Delivery Stream. Store & Catalog (Data Lake): Amazon S3 (Raw), AWS Glue (Data Catalog), holding • Transactions • Reference • Enriched. Process: AWS Glue (Transform) enriches the data into Amazon S3 (Processed).
  • 36. Job Authoring: Automatic Code Generation. 1. Customize the mappings 2. Glue generates a transformation graph and Python or Scala code 3. Customize the code based on your requirements
  • 37. Job Authoring: Developer Endpoints. • Environment to iteratively explore data with Apache Spark SQL • Develop and test ETL code • Connect your IDE or notebook (e.g., Zeppelin) to a Glue development endpoint • When you are satisfied with the results, you can create an ETL job that runs your code. (Diagram: a remote interpreter connects to the interpreter server in Glue's Apache Spark environment.)
  • 38. DynamicFrame Transforms: 15+ transforms out of the box. ResolveChoice(): project, cast, or separate into cols. ApplyMapping(): map source columns to target columns.
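The idea behind these two transforms can be sketched in plain Python over lists of dictionaries. This is not the Glue API: the function names, the three-element mapping tuples, and the `CASTS` table are simplifications invented for illustration.

```python
# Hypothetical type names mapped to Python casts (illustration only).
CASTS = {"string": str, "int": int, "double": float}


def apply_mapping(records, mappings):
    """Rename and cast fields; mappings are (source, target, target_type)."""
    out = []
    for rec in records:
        row = {}
        for source, target, target_type in mappings:
            if source in rec:
                row[target] = CASTS[target_type](rec[source])
        out.append(row)
    return out


def resolve_choice_cast(records, field, target_type):
    """Resolve a field that holds mixed types by casting it to one type."""
    cast = CASTS[target_type]
    return [
        {**rec, field: cast(rec[field])} if field in rec else dict(rec)
        for rec in records
    ]
```

The "cast" option shown here is only one way to resolve an ambiguous column; the slide also lists projecting to one type and separating the variants into distinct columns.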
  • 39. Relationalize() Transform: semi-structured schema to relational schema. (Diagram: nested fields A, B, C with X and Y, and array D become flat columns A, B, C.X, C.Y plus a child table linked by PK/FK with Value and Offset columns.) • Transforms and adds new columns, types, and tables on-the-fly • Tracks keys and foreign keys across runs • SQL on the relational schema is orders of magnitude faster than JSON processing
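A toy version of the relationalize idea, written in plain Python rather than with Glue: nested objects become dotted columns and arrays move to a child table keyed back to the parent row. The table and column names (`root`, `pk`, `fk`, `offset`, `value`) are assumptions chosen to mirror the diagram, and arrays of nested objects are not flattened further in this sketch.

```python
def relationalize(records, root="root"):
    """Flatten nested dicts into dotted columns; move lists to child tables."""
    tables = {root: []}
    for pk, rec in enumerate(records):
        row = {"pk": pk}
        _flatten(rec, "", row, tables, root, pk)
        tables[root].append(row)
    return tables


def _flatten(value, prefix, row, tables, root, pk):
    for key, val in value.items():
        col = f"{prefix}{key}"
        if isinstance(val, dict):
            # Nested object: recurse and extend the dotted column name.
            _flatten(val, col + ".", row, tables, root, pk)
        elif isinstance(val, list):
            # Array: one child row per element, linked by foreign key.
            child = tables.setdefault(f"{root}_{col}", [])
            for offset, item in enumerate(val):
                child.append({"fk": pk, "offset": offset, "value": item})
        else:
            row[col] = val
```

The resulting tables can be joined on `pk` and `fk`, which is what makes ordinary SQL usable on data that started out as nested JSON.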
  • 40. Job Bookmarks. Suppose you want to periodically run a job and • avoid reprocessing previous input • avoid generating duplicate output. Examples: • Process githubarchive files daily • Process Firehose files hourly • Track timestamps or primary keys in DBs • Track generated foreign keys for normalization. Bookmarks are per-job checkpoints that track persisted state from previous runs (run 1, run 2, run 3). They track the state of sources, transforms, and sinks.
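The checkpoint idea can be sketched without Glue at all: persist which inputs earlier runs handled, and skip them next time. The JSON bookmark file, the `run_job` function, and its parameters are hypothetical; Glue's actual bookmark storage and granularity are not described on the slide.

```python
import json
from pathlib import Path


def run_job(input_keys, bookmark_path, process):
    """Process only keys not seen by earlier runs, then persist the state."""
    path = Path(bookmark_path)
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    new_keys = [key for key in input_keys if key not in seen]
    for key in new_keys:
        process(key)
    # Save the checkpoint only after all new keys have been processed.
    path.write_text(json.dumps(sorted(seen | set(new_keys))))
    return new_keys
```

Running the job twice over a growing list of files processes each file once, which is the "avoid reprocessing, avoid duplicate output" behaviour the slide describes.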
  • 41. Job Execution: Scheduling and Monitoring. Compose jobs globally with event-based dependencies • Easy to reuse and leverage work across organization boundaries. Multiple triggering mechanisms • Schedule-based: e.g., time of day • Event-based: e.g., job completion, job failure, job stopping events • On-demand: e.g., AWS Lambda • More coming soon! Logs and alerts are available in Amazon CloudWatch. (Diagram: Marketing: ad-spend by customer segment; Sales: revenue by customer segment; Central: ROI by customer segment; weekly sales; triggers shown are event-based, Lambda trigger, schedule, and data-based.)
  • 42. Job Execution: Serverless. • Auto-configure VPC and role-based access • Customers can specify the capacity that gets allocated to each job • You pay only for the resources you consume while consuming them. There is no need to provision, configure, or manage servers. (Diagram: compute instances connected to customer VPCs.)
  • 43. What are we building? (diagram) Ingest: Kinesis Data Generator (Transactions), Kinesis Data Firehose Delivery Stream. Store & Catalog (Data Lake): Amazon S3 (Raw), AWS Glue (Data Catalog), holding • Transactions • Reference • Enriched. Process: AWS Glue (Transform) enriches the data into Amazon S3 (Processed). Consume: explore with Amazon Athena.
  • 44. Amazon Athena: Data Directly from Amazon S3. • No loading of data • Query data in its raw format • Athena supports multiple data formats: text, CSV, TSV, JSON, weblogs, AWS service logs • Or convert to an optimized form like ORC or Parquet for the best performance and lowest cost • No ETL required • Stream data directly from Amazon S3 • Take advantage of Amazon S3 durability and availability
  • 45. Use ANSI SQL. • Start writing ANSI SQL • Support for complex joins, nested queries, and window functions • Support for complex data types (arrays, structs) • Support for partitioning of data by any key (date, time, custom keys), e.g., Year, Month, Day, Hour or Customer Key, Date
  • 46. Familiar Technologies Under the Covers. Used for SQL queries (Presto): in-memory distributed query engine; ANSI-SQL compatible with extensions. Used for DDL functionality (Apache Hive): complex data types; multitude of formats; supports data partitioning.
  • 47. What are we building? (diagram) Ingest: Kinesis Data Generator (Transactions), Kinesis Data Firehose Delivery Stream. Store & Catalog (Data Lake): Amazon S3 (Raw), AWS Glue (Data Catalog), holding • Transactions • Reference • Enriched. Process: AWS Glue (Transform) enriches the data into Amazon S3 (Processed). Consume: explore with Amazon Athena and Amazon QuickSight.
  • 48. QuickSight: Connect to Data Wherever It Is. QuickSight is natively integrated with AWS data sources, as well as on-premises and hosted databases and third-party business applications. On-premises: securely connect to on-premises databases and flat files like Excel and CSV • Excel • CSV • Teradata • MySQL • SQL Server • PostgreSQL. In the cloud: connect to hosted databases, big data formats, and secure VPCs • Redshift • RDS • S3 • Athena • Aurora • Teradata • MySQL • Presto • Spark • SQL Server • PostgreSQL • MariaDB • Snowflake. Applications: connect directly to third-party business applications • Salesforce • Square • Adobe Analytics • Jira • ServiceNow • Twitter • GitHub
  • 49. SPICE: QuickSight is powered by SPICE, a super-fast calculation engine that delivers performance and scale, regardless of how many users are active. (Diagram: your data source feeds SPICE.)
  • 50. Data Governance: create managed datasets that give power users and authors the flexibility to perform self-serve analytics on data that you control. Create datasets that: • Can be shared with any user • Automatically refresh • Have row-level security • Users cannot modify • Dynamically update with changes
  • 51. User Management and AD Integration: QuickSight Enterprise Edition can integrate with your Active Directory to dynamically manage users and groups.
  • 52. Thank you