Architecting a Serverless Data Lake on AWS
- 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Architecting a Serverless Data Lake on AWS
- 2.
What is a Data Lake?
A Data Lake lets you store all of your structured and unstructured data in one centralized repository, at any scale. You can store your data as-is, without first structuring it around the questions you might ask in the future. A Data Lake also lets you run different types of analytics on that data, such as SQL queries, big data analytics, full-text search, real-time analytics, and machine learning, to guide better decisions.
- 3.
Characteristics of a Data Lake
• Future proof
• Flexible access
• Dive in anywhere
• Collect anything
- 4.
What is Serverless computing?
• No server management
• High availability
• No idle capacity
• Flexible scaling
- 5.
Building a Data Lake on AWS
[Diagram: on-premises data movement and real-time data movement feed the Data Lake on AWS (Storage | Archival Storage | Data Catalog), which serves Analytics and Machine Learning]
- 7.
Data Movement From On-premises Datacenters
• AWS Snowball, Snowball Edge, and Snowmobile: petabyte- and exabyte-scale data transport solutions that use secure appliances to transfer large amounts of data into and out of the AWS cloud
• AWS Direct Connect: establishes a dedicated network connection from your premises to AWS; can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections
• AWS Storage Gateway: lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism and bandwidth management, along with a local cache
• AWS Database Migration Service: migrates databases from the most widely used commercial and open-source offerings to AWS quickly and securely, with minimal downtime to applications
- 8.
Data Movement From Real-time Sources
• Amazon Kinesis Video Streams: securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing
• Amazon Kinesis Data Firehose: capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools
• Amazon Kinesis Data Streams: build custom, real-time applications that process data streams using popular stream processing frameworks
• AWS IoT Core: supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely
- 10.
Amazon S3: Object Storage
Secure, highly scalable, durable object storage with millisecond latency for data access
Store any type of data: websites, mobile apps, corporate applications, and IoT sensors
• Security and compliance: three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
• Flexible management: classify, report, and visualize data usage trends; tag objects to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
• Durability, availability, and scalability: built for eleven nines (99.999999999%) of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
• Query in place: run analytics and ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%
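The lifecycle policies mentioned under flexible management can be sketched as a small helper that builds a rule moving objects to Glacier and later expiring them. This is a minimal sketch: the prefix, the day counts, and the bucket name in the comment are hypothetical, and the dict follows the shape that boto3's `put_bucket_lifecycle_configuration` call accepts.

```python
def build_lifecycle_config(prefix, archive_after_days, expire_after_days):
    """Build an S3 lifecycle configuration with one tiering + retention rule."""
    if expire_after_days <= archive_after_days:
        raise ValueError("expiration must come after the archive transition")
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Tiering: move objects to the Glacier storage class.
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                # Retention: delete objects after the retention period.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Hypothetical values: archive raw data after 90 days, delete after 365.
config = build_lifecycle_config("raw/", 90, 365)

# Applying it needs boto3 and AWS credentials, for example:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake-bucket", LifecycleConfiguration=config)
```

Keeping the rule builder separate from the API call makes the policy easy to review and unit test before it touches a bucket.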
- 11.
Amazon Glacier: Backup and Archive
Secure, durable, and extremely low-cost storage for data archiving and long-term backup
Store data at $0.004/GB/month
• Durability, availability, and scalability: built for eleven nines (99.999999999%) of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
• Secure: log and monitor with CloudTrail; Vault Lock enables WORM (write once, read many) storage capabilities, helping satisfy compliance requirements
• Retrieves data in minutes: three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes
• Inexpensive: lowest-cost AWS object storage class, allowing you to archive large amounts of data at a very low cost
- 12.
Storing is Not Enough, Data Needs to Be Discoverable
Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data
[Diagram: traditional enterprise data (CRM, ERP, data warehouse, mainframe data), big data (web, social, log files, machine data; semi-structured and unstructured), and dark data]
- 13.
AWS Glue: Data Catalog
• Automatically discovers data and stores schema
• Catalog makes data searchable and available for ETL
• Catalog contains table and job definitions
• Computes statistics to make queries efficient
[Diagram: Glue Data Catalog; discover data and extract schema; compliance]
- 14.
AWS Glue: ETL Service
• Automatically generates ETL code
• Code is customizable with Python and Spark
• Endpoints provided to edit, debug, and test code
• Jobs are scheduled or event-based
• Serverless
- 16.
Amazon Redshift: Data Warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
• Fast at scale: columnar storage technology to improve I/O efficiency and scale query performance
• Secure: audit everything; encrypt data end-to-end; extensive certification and compliance
• Open file formats: analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
• Inexpensive: as low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
- 17.
Amazon Redshift Spectrum
[Diagram: the Redshift Spectrum query engine sits between Redshift data and the S3 data lake]
• Exabyte-scale Redshift SQL queries against S3
• Join data across Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Grok, Avro, & Parquet data formats
• Pay only for the amount of data scanned
- 18.
Amazon EMR: Big Data Processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
• Low cost: flexible billing with per-second billing, EC2 Spot, Reserved Instances, and auto-scaling to reduce costs 50–80%
• Easy: launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning
• Latest versions: updated with the latest open source frameworks within 30 days of release
• Use S3 storage: process data directly in the S3 data lake securely with high performance using the EMRFS connector
- 19.
Amazon Elasticsearch Service
Easy to deploy, secure, operate, and scale Elasticsearch
Customers use Elasticsearch for log analytics, full-text search & application monitoring
• Easy to use: fully managed; deploy production-ready clusters in minutes
• Secure: secure access with VPC to keep all traffic within the AWS network
• Open: direct access to Elasticsearch open-source APIs; supports Logstash and Kibana
• Available: zone awareness replicates data between two AZs; automatically monitors & replaces failed nodes
- 20.
Amazon Kinesis Data Analytics
- 21.
Amazon Athena: Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
• Query instantly: zero setup cost; just point to S3 and start querying
• Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
• Easy: serverless, with zero infrastructure and zero administration; integrated with QuickSight
• Pay per query: pay only for queries run; save 30–90% on per-query costs through compression
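A quick worked example of the pay-per-query point: because billing follows the bytes scanned, compressing data reduces cost in proportion. The sketch below assumes a price of $5 per TB scanned and a 4:1 compression ratio; both numbers are illustrative assumptions, not figures from this deck, so check current pricing before relying on them.

```python
PRICE_PER_TB_USD = 5.00  # assumed price per TB scanned; verify against current pricing


def query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Cost of one query when billing is proportional to bytes scanned."""
    return bytes_scanned / 1024**4 * price_per_tb


raw_bytes = 1024**4                # a query that scans 1 TB of uncompressed data
compressed_bytes = raw_bytes // 4  # the same data at a hypothetical 4:1 ratio

raw_cost = query_cost(raw_bytes)
compressed_cost = query_cost(compressed_bytes)
savings = 1 - compressed_cost / raw_cost

print(f"raw: ${raw_cost:.2f}, compressed: ${compressed_cost:.2f}, savings: {savings:.0%}")
```

With these assumed numbers the saving lands inside the 30–90% range quoted on the slide; columnar formats can reduce scanned bytes further because a query reads only the columns it needs.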
- 22.
Amazon QuickSight
Fast, easy to use, serverless analytics at 1/10 the cost of traditional BI
• Empower everyone
• Seamless connectivity
• Fast analysis
• Serverless
- 24.
Machine Learning on AWS
• Application services: Rekognition, Transcribe, Translate, Polly, Comprehend, Lex
• Platform services: Amazon SageMaker, AWS DeepLens
• Frameworks & interfaces: Caffe2, CNTK, Apache MXNet, PyTorch, TensorFlow, Torch, Keras, Gluon; AWS Deep Learning AMIs
• Infrastructure: CPU, GPU (P3), IoT & Edge, Mobile
- 25.
Demo
Are we all ready to build a Data Lake?
- 26.
Demo
Let's do that right here, right now!
- 27.
What are we building?
[Diagram: Ingest: Kinesis Data Generator (transactions) → Kinesis Data Firehose delivery stream]
- 28.
Kinesis Data Firehose: How it Works
Ingest → Transform → Deliver
• Sources: AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, Apache Kafka
• Destinations: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service
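The ingest step can be sketched as a producer that sends one JSON record to a delivery stream. `put_record` with `DeliveryStreamName` and `Record={"Data": ...}` is the boto3 Firehose call; the stream name, the transaction fields, and the `RecordingClient` stand-in (used so the sketch runs without AWS credentials) are hypothetical.

```python
import json


def send_transaction(firehose_client, stream_name, transaction):
    """Send one transaction to a Firehose delivery stream as a JSON line."""
    # A trailing newline keeps records separable once Firehose
    # concatenates them into a single S3 object.
    payload = json.dumps(transaction) + "\n"
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload.encode("utf-8")},
    )


class RecordingClient:
    """Offline stand-in for boto3.client('firehose'); records calls only."""

    def __init__(self):
        self.calls = []

    def put_record(self, DeliveryStreamName, Record):
        self.calls.append((DeliveryStreamName, Record))
        return {"RecordId": str(len(self.calls))}


client = RecordingClient()
response = send_transaction(client, "transactions-stream", {"id": 1, "amount": 9.99})
```

In a real producer, pass `boto3.client("firehose")` in place of `RecordingClient()`; the function body stays the same.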
- 29.
Key Features
Data durability:
• Data backup to S3 upon delivery or transformation failure
• 3X data replication in delivery stream for high data durability
Up to 24 hours data retention in delivery stream to absorb backpressure from destinations
- 30.
Serverless Data Transformation
Kinesis Firehose AWS Lambda
Pre-Built Data Transformation Blueprints
• General Processing
• Apache Log to JSON
• Apache Log to CSV
• Syslog to JSON
• Syslog to CSV
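A transformation function in the spirit of the "General Processing" blueprint can be sketched as below. Firehose passes a batch of base64-encoded records to Lambda and expects each one back with the same `recordId`, a `result`, and re-encoded `data`. The `amount` and `amount_cents` fields are hypothetical, chosen only to show an enrichment step.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and report per-record status."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical enrichment: add the amount in integer cents.
            payload["amount_cents"] = round(float(payload["amount"]) * 100)
            data = base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": data}
            )
        except (ValueError, KeyError, TypeError):
            # Hand unparseable records back unchanged and mark them failed.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}


# Local invocation with one valid and one malformed record.
good = base64.b64encode(json.dumps({"id": 1, "amount": 12.5}).encode()).decode()
bad = base64.b64encode(b"not json").decode()
result = lambda_handler(
    {"records": [{"recordId": "1", "data": good}, {"recordId": "2", "data": bad}]},
    None,
)
```

Marking bad records as `ProcessingFailed` rather than raising keeps one malformed record from failing the whole batch.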
- 31.
What are we building?
[Diagram: Ingest: Kinesis Data Generator (transactions) → Kinesis Data Firehose delivery stream. Store & Catalog (Data Lake): Amazon S3 (raw: transactions, reference) → AWS Glue (Data Catalog)]
- 32.
AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable
• Unified metadata repository across data stores
• Schema versioning
• Shared across AWS Glue, Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR
- 33.
What are Crawlers?
Crawlers automatically build your Data Catalog and keep it in sync.
• Scan your data stored in various data stores, extract metadata and data statistics, and add table definitions to your Data Catalog
• Classify data using built-in and custom classifiers
• You can write your own using Grok expressions
• Discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
• Run ad hoc or on a schedule; serverless, so you pay only when the crawler runs
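To make "Hive-style partitions" concrete: partition columns are encoded in the object key as `name=value` path segments. The sketch below only illustrates that layout with a hypothetical key; it is not how a crawler is implemented.

```python
def parse_hive_partitions(key):
    """Return the partition columns encoded as name=value path segments."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, separator, value = segment.partition("=")
        if separator and name and value:
            partitions[name] = value
    return partitions


# Hypothetical S3 object key written by the delivery stream.
parts = parse_hive_partitions(
    "raw/transactions/year=2018/month=05/day=17/part-0000.json"
)
print(parts)
```

A table partitioned this way lets a query engine skip whole prefixes when a query filters on `year`, `month`, or `day`.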
- 34.
Custom Classifiers
You can write a custom classifier by providing a Grok pattern and a classification string for the matched schema.
A Grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time.
Example:
%{TIMESTAMP_ISO8601:timestamp} [%{MESSAGEPREFIX:message_prefix}] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}
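How a Grok pattern matches one line at a time can be sketched by expanding each `%{NAME:field}` into a named regex group. The sub-pattern definitions below are illustrative assumptions, not Glue's built-in pattern library, and the square brackets are escaped here so the regex engine treats them literally. The sample log line is hypothetical.

```python
import re

# Illustrative definitions only; a real Grok library ships its own.
GROK_PATTERNS = {
    "TIMESTAMP_ISO8601": (
        r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"
        r"(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?"
    ),
    "MESSAGEPREFIX": r"[^\]]+",
    "CRAWLERLOGLEVEL": r"BENCHMARK|ERROR|WARN|INFO|TRACE",
    "GREEDYDATA": r".*",
}


def grok_to_regex(grok):
    """Expand %{NAME:field} references into named regex groups."""

    def expand(match):
        pattern_name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{GROK_PATTERNS[pattern_name]})"

    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, grok))


pattern = grok_to_regex(
    r"%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] "
    r"%{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}"
)

line = "2018-05-17T12:34:56Z [crawler-1] INFO : Crawl started"
fields = pattern.match(line).groupdict()
print(fields)
```

The field names after each colon become the column names of the schema the classifier reports.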
- 35.
What are we building?
[Diagram: Ingest: Kinesis Data Generator (transactions) → Kinesis Data Firehose delivery stream. Store & Catalog (Data Lake): Amazon S3 (raw: transactions, reference) → AWS Glue (Data Catalog). Process: AWS Glue (transform, enrich) → Amazon S3 (processed: enriched)]
- 36.
Job Authoring: Automatic Code Generation
1. Customize the mappings
2. Glue generates transformation graph and Python or Scala code
3. Customize the code based on your requirements
- 37.
Job Authoring: Developer Endpoints
Environment to iteratively explore data with Apache Spark SQL and to develop and test ETL code.
Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint.
When you are satisfied with the results, you can create an ETL job that runs your code.
[Diagram: a remote interpreter connects to the interpreter server in Glue's Apache Spark environment]
- 38.
DynamicFrame Transforms
[Diagram: ResolveChoice() resolves an ambiguous column B by project, cast, or separate into cols; Apply Mapping() remaps columns A, X, Y, and C]
15+ transforms out of the box
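The idea behind resolving a choice type can be sketched in plain Python on a column that holds both integers and strings. This is a simplified stand-in, not the Glue API: it covers only two of the options on the slide (cast, and separate into columns), the suffix naming is an assumption, and the rows are hypothetical.

```python
def resolve_choice(rows, column, action, target_type=None):
    """Resolve a mixed-type column by casting it or splitting it by type."""
    resolved = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        value = row.pop(column)
        if action == "cast":
            # Convert every value to the single target type.
            row[column] = target_type(value)
        elif action == "make_cols":
            # Keep the value but move it to a per-type column.
            row[f"{column}_{type(value).__name__}"] = value
        else:
            raise ValueError(f"unsupported action: {action}")
        resolved.append(row)
    return resolved


# Hypothetical rows in which "price" arrived as both int and str.
rows = [{"id": 1, "price": 10}, {"id": 2, "price": "12"}]
as_int = resolve_choice(rows, "price", "cast", int)
split = resolve_choice(rows, "price", "make_cols")
```

Casting gives one clean column when every value converts; splitting keeps the original values when they do not.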
- 39.
Relationalize() Transform
[Diagram: a semi-structured schema (A, B, nested C with X and Y, array D [ ]) becomes a relational schema with columns such as PK, A, B, C.X, and C.Y, plus a table with FK, Offset, and Value columns]
• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing
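The flattening that Relationalize() performs can be sketched in plain Python: nested fields become dotted columns on a root table, and each array moves to a child table keyed back to its parent row. This is a simplified illustration, not Glue's implementation; the `pk`, `fk`, `offset`, and `value` column names follow the labels in the slide's diagram, and the table naming is an assumption.

```python
def relationalize(records, root="root"):
    """Flatten nested records into a root table plus one table per array."""
    tables = {root: []}
    for pk, record in enumerate(records):
        row = {"pk": pk}
        _flatten(record, "", row, tables, root, pk)
        tables[root].append(row)
    return tables


def _flatten(value, prefix, row, tables, root, pk):
    for key, item in value.items():
        name = f"{prefix}{key}"
        if isinstance(item, dict):
            # Nested struct: recurse and use dotted column names.
            _flatten(item, name + ".", row, tables, root, pk)
        elif isinstance(item, list):
            # Array: one child row per element, linked by foreign key.
            child = tables.setdefault(f"{root}_{name}", [])
            for offset, element in enumerate(item):
                child.append({"fk": pk, "offset": offset, "value": element})
        else:
            row[name] = item


# Hypothetical record shaped like the slide's A, B, C{X, Y}, D[ ] example.
tables = relationalize([{"A": 1, "B": "b", "C": {"X": 2, "Y": 3}, "D": [10, 20]}])
```

Once the data is in this shape, the array can be joined back to its parent with an ordinary SQL join on `fk = pk`.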
- 40.
Job Bookmarks
Suppose you want to run a job periodically and:
• avoid reprocessing previous input
• avoid generating duplicate output
Examples:
• Process githubarchive files daily
• Process Firehose files hourly
• Track timestamps or primary keys in DBs
• Track generated foreign keys for normalization
Bookmarks are per-job checkpoints that track persisted state from previous runs. They track the state of sources, transforms, and sinks.
[Diagram: run 1, run 2, run 3]
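The checkpoint idea can be sketched with a small job runner that persists the set of inputs it has already processed. This is a conceptual illustration only: Glue manages bookmark state for you, whereas the JSON file, the file names, and the `run_job` helper here are hypothetical.

```python
import json
import os
import tempfile


def run_job(input_files, bookmark_path, process):
    """Process only inputs not seen by earlier runs, then save the bookmark."""
    seen = set()
    if os.path.exists(bookmark_path):
        with open(bookmark_path) as f:
            seen = set(json.load(f))
    new_files = [name for name in input_files if name not in seen]
    for name in new_files:
        process(name)
    # Persist state only after processing so a failed run is retried.
    with open(bookmark_path, "w") as f:
        json.dump(sorted(seen | set(new_files)), f)
    return new_files


bookmark = os.path.join(tempfile.mkdtemp(), "bookmark.json")
processed = []
run1 = run_job(["f1", "f2"], bookmark, processed.append)
run2 = run_job(["f1", "f2", "f3"], bookmark, processed.append)
print(run1, run2)
```

The second run skips the two files handled by the first, which is the "avoid reprocessing previous input" behavior described above.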
- 41.
Job Execution: Scheduling and Monitoring
Compose jobs globally with event-based dependencies
• Easy to reuse and leverage work across organization boundaries
Multiple triggering mechanisms
• Schedule-based: e.g., time of day
• Event-based: e.g., job completion, job failure, job stopping events
• On-demand: e.g., AWS Lambda
• …More coming soon!
Logs and alerts are available in Amazon CloudWatch
[Diagram: example job dependencies. Marketing: ad spend by customer segment; Sales: revenue by customer segment; Central: ROI by customer segment; weekly sales. Triggers are labelled event-based (Lambda trigger), schedule, and data-based]
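The on-demand trigger can be sketched as a Lambda handler that starts a Glue job when a new object lands in S3. `start_job_run` with `JobName` and `Arguments` is the boto3 Glue call and the event shape is the standard S3 notification; the job name, the `--input_key` argument, and the `FakeGlue` stand-in (used so the sketch runs offline) are hypothetical.

```python
def make_handler(glue_client, job_name):
    """Build a Lambda handler that starts a Glue job for each S3 event."""

    def lambda_handler(event, context):
        # Forward the key of the newly created object as a job argument.
        key = event["Records"][0]["s3"]["object"]["key"]
        response = glue_client.start_job_run(
            JobName=job_name,
            Arguments={"--input_key": key},
        )
        return response["JobRunId"]

    return lambda_handler


class FakeGlue:
    """Offline stand-in for boto3.client('glue'); records calls only."""

    def __init__(self):
        self.calls = []

    def start_job_run(self, JobName, Arguments):
        self.calls.append((JobName, Arguments))
        return {"JobRunId": f"jr_{len(self.calls)}"}


glue = FakeGlue()
handler = make_handler(glue, "enrich-transactions")
event = {"Records": [{"s3": {"object": {"key": "raw/transactions/part-0000.json"}}}]}
run_id = handler(event, None)
```

In a deployed function, build the handler with `boto3.client("glue")` at module load so the client is reused across invocations.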
- 42.
Job Execution: Serverless
• Auto-configure VPC and role-based access
• Customers can specify the capacity that gets allocated to each job
• You pay only for the resources you consume while consuming them
• There is no need to provision, configure, or manage servers
[Diagram: compute instances attached to customer VPCs]
- 43.
What are we building?
[Diagram: Ingest: Kinesis Data Generator (transactions) → Kinesis Data Firehose delivery stream. Store & Catalog (Data Lake): Amazon S3 (raw: transactions, reference) → AWS Glue (Data Catalog). Process: AWS Glue (transform, enrich) → Amazon S3 (processed: enriched). Consume: Amazon Athena (explore)]
- 44.
Amazon Athena: Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• Athena supports multiple data formats
• Text, CSV, TSV, JSON, weblogs, AWS service logs
• Or convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No ETL required
• Stream data directly from Amazon S3
• Take advantage of Amazon S3 durability and availability
- 45.
Use ANSI SQL
• Start writing ANSI SQL
• Support for complex joins, nested queries & window functions
• Support for complex data types (arrays, structs)
• Support for partitioning of data by any key
• (date, time, custom keys)
• e.g., Year, Month, Day, Hour or Customer Key, Date
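Partitioning by any key comes down to how objects are laid out in S3. The sketch below builds a key partitioned by customer key and then by year, month, day, and hour; the prefix, column names, and file name are hypothetical.

```python
from datetime import datetime


def partitioned_key(prefix, customer_key, event_time, filename):
    """Build an S3 key partitioned by customer and by event time."""
    return (
        f"{prefix}/customer_key={customer_key}"
        f"/year={event_time:%Y}/month={event_time:%m}"
        f"/day={event_time:%d}/hour={event_time:%H}/{filename}"
    )


key = partitioned_key(
    "processed/transactions",
    "c42",
    datetime(2018, 5, 17, 9, 30),
    "part-0000.parquet",
)
print(key)
```

Putting the most frequently filtered column first in the path tends to let queries skip the most data, since a filter on that column prunes entire prefixes.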
- 46.
Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
- 47.
What are we building?
[Diagram: Ingest: Kinesis Data Generator (transactions) → Kinesis Data Firehose delivery stream. Store & Catalog (Data Lake): Amazon S3 (raw: transactions, reference) → AWS Glue (Data Catalog). Process: AWS Glue (transform, enrich) → Amazon S3 (processed: enriched). Consume: Amazon Athena (explore) → Amazon QuickSight]
- 48.
QuickSight: Connect to data wherever it is
QuickSight is natively integrated with AWS data sources, as well as on-premises and hosted databases and third-party business applications
• On-premises: securely connect to on-premises databases and flat files like Excel and CSV (Excel, CSV, Teradata, MySQL, SQL Server, PostgreSQL)
• In the cloud: connect to hosted databases, big data formats, and secure VPCs (Redshift, RDS, S3, Athena, Aurora, Teradata, MySQL, Presto, Spark, SQL Server, PostgreSQL, MariaDB, Snowflake)
• Applications: connect directly to third-party business applications (Salesforce, Square, Adobe Analytics, Jira, ServiceNow, Twitter, GitHub)
- 49.
SPICE
QuickSight is powered by SPICE, a super-fast calculation engine that delivers performance and scale, regardless of how many users are active.
[Diagram: your data source → SPICE]
- 50.
Data Governance
Create managed datasets that give power users and authors the flexibility to perform self-serve analytics on data that you control.
Create datasets that:
• Can be shared with any user
• Automatically refresh
• Have row-level security
• Users cannot modify
• Dynamically update with changes
- 51.
User Management and AD Integration
QuickSight Enterprise Edition can integrate with your Active Directory to dynamically manage users and groups.
- 52.
Thank you