Architecting a Serverless Data Lake on AWS

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

What is a Data Lake?
A Data Lake allows you to store all your structured and
unstructured data, in one centralized repository, and at
any scale. With a Data Lake, you can store your data as-is,
without having to first structure the data, based on
potential questions you may have in the future. Data Lakes
also allow you to run different types of analytics on your
data like SQL queries, big data analytics, full text search,
real-time analytics, and machine learning to guide better
decisions.

Characteristics of a Data Lake
Future Proof
Flexible Access
Dive in Anywhere
Collect Anything

What is Serverless computing?
No Server Management
High Availability
No Idle Capacity
Flexible Scaling

Building a Data Lake on AWS
Analytics | Machine Learning
Real-time Data Movement
On-premises Data Movement
Data Lake on AWS: Storage | Archival Storage | Data Catalog

Data Movement From on-premises Datacenters
AWS Snowball, Snowball Edge, and Snowmobile
Petabyte- and exabyte-scale data transport solutions that use secure appliances to transfer large amounts of data into and out of the AWS cloud
AWS Direct Connect
Establishes a dedicated network connection from your premises to AWS; reduces your network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections
AWS Storage Gateway
Lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism and bandwidth management, along with a local cache
AWS Database Migration Service
Migrate databases from the most widely used commercial and open-source offerings to AWS quickly and securely, with minimal downtime to applications

Data Movement From Real-time Sources
Amazon Kinesis Video Streams
Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing
Amazon Kinesis Data Firehose
Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools
Amazon Kinesis Data Streams
Build custom, real-time applications that process data streams using popular stream processing frameworks
AWS IoT Core
Supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely

Amazon S3—Object Storage
Secure, highly scalable, durable object storage with millisecond latency for data access
Store any type of data–web sites, mobile apps, corporate applications, and IoT sensors
Security and Compliance
Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
Flexible Management
Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
Durability, Availability & Scalability
Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
Query in Place
Run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%
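
The lifecycle policies mentioned under Flexible Management could look roughly like the sketch below. It only builds a rule document in the shape that the S3 lifecycle configuration API accepts. The `raw/` prefix, the 90-day and 365-day thresholds, and the helper name `build_lifecycle_rules` are illustrative assumptions, not values from this deck.

```python
def build_lifecycle_rules(prefix, archive_after_days, expire_after_days):
    """Build a lifecycle configuration: tier to Glacier, then expire."""
    if expire_after_days <= archive_after_days:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "Rules": [
            {
                "ID": f"tier-and-expire-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Tiering: move objects to the Glacier storage class.
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                # Retention: delete objects after this many days.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }

# Hypothetical values: archive raw data after 90 days, delete after a year.
config = build_lifecycle_rules("raw/", archive_after_days=90, expire_after_days=365)
```

With boto3, a dictionary like this would typically be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`.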

Amazon Glacier—Backup and Archive
Secure, durable, and extremely low-cost storage for data archiving and long-term backup
Store data at $0.004/GB/month
Durability, Availability & Scalability
Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
Secure
Log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements
Retrieves data in minutes
Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes
Inexpensive
Lowest cost AWS object storage class, allowing you to archive large amounts of data at a very low cost
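
To put the quoted $0.004/GB/month in perspective, here is a small back-of-the-envelope sketch. It covers storage only (retrieval and request charges are ignored), and both the 100 TB volume and the 1 TB = 1024 GB convention are assumptions made for illustration.

```python
GLACIER_PRICE_PER_GB_MONTH = 0.004  # figure quoted on the slide

def monthly_archive_cost(terabytes, price_per_gb_month=GLACIER_PRICE_PER_GB_MONTH):
    """Return the monthly storage cost in dollars for `terabytes` of data."""
    gigabytes = terabytes * 1024  # assumes 1 TB = 1024 GB
    return gigabytes * price_per_gb_month

cost = monthly_archive_cost(100)  # hypothetical 100 TB archive
print(f"100 TB archived: ${cost:,.2f} per month")  # 100 TB archived: $409.60 per month
```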

Storing is Not Enough, Data Needs to Be Discoverable
Dark data are the information
assets organizations collect,
process, and store during
regular business activities,
but generally fail to use for
other purposes (for example,
analytics, business relationships
and direct monetizing).
Traditional enterprise data: CRM, ERP, data warehouse, mainframe data
Big data: web, social, log files, machine data, semi-structured, unstructured
Dark data
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data

AWS Glue—Data Catalog
• Automatically discovers data and stores schema
• Catalog makes data searchable, and available for ETL
• Catalog contains table and job definitions
• Computes statistics to make queries efficient

AWS Glue—ETL Service
• Automatically generates ETL code
• Code is customizable with Python and Spark
• Endpoints provided to edit, debug, and test code
• Jobs are scheduled or event-based
• Serverless

Amazon Redshift—Data Warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
Fast at scale
Columnar storage technology to improve I/O efficiency and scale query performance
Secure
Audit everything; encrypt data end-to-end; extensive certification and compliance
Open file formats
Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
Inexpensive
As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour

Amazon Redshift Spectrum
S3 data lake | Redshift data | Redshift Spectrum query engine
• Exabyte-scale Redshift SQL queries against S3
• Join data across Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Grok, Avro, & Parquet data formats
• Pay only for the amount of data scanned

Amazon EMR—Big Data Processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
Low cost
Flexible billing with per-second billing, EC2 Spot, Reserved Instances, and auto-scaling to reduce costs 50–80%
Easy
Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning
Latest versions
Updated with the latest open source frameworks within 30 days of release
Use S3 storage
Process data directly in the S3 data lake securely with high performance using the EMRFS connector

Amazon Elasticsearch Service
Easy to deploy, secure, operate, and scale Elasticsearch
Customers use Elasticsearch for log analytics, full-text search & application monitoring
Easy to Use
Fully managed; deploy production-ready clusters in minutes
Secure
Secure access with VPC to keep all traffic within the AWS network
Open
Direct access to Elasticsearch open-source APIs; supports Logstash and Kibana
Available
Zone awareness replicates data between two AZs; automatically monitors & replaces failed nodes

Amazon Kinesis Data Analytics

Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Query Instantly
Zero setup cost; just point to S3 and start querying
Open
ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
Easy
Serverless: zero infrastructure, zero administration; integrated with QuickSight
Pay per query
Pay only for queries run; save 30–90% on per-query costs through compression
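
The 30–90% savings figure follows from billing by the amount of data scanned: compressed data means fewer bytes read per query. The sketch below works through that arithmetic. The flat price of $5.00 per terabyte scanned and the 2 TB scan size are assumptions for illustration and do not come from this deck.

```python
def query_cost(bytes_scanned, price_per_tb=5.00):
    """Cost of one query, assuming a flat price per terabyte scanned."""
    return bytes_scanned / 1024**4 * price_per_tb

def compression_savings(raw_bytes, compression_ratio):
    """Fraction of per-query cost saved when data shrinks by `compression_ratio`."""
    compressed_bytes = raw_bytes / compression_ratio
    return 1 - query_cost(compressed_bytes) / query_cost(raw_bytes)

raw = 2 * 1024**4  # a hypothetical 2 TB uncompressed scan
print(f"raw scan: ${query_cost(raw):.2f}")
print(f"3:1 compression saves {compression_savings(raw, 3):.0%}")
print(f"10:1 compression saves {compression_savings(raw, 10):.0%}")
```

Under these assumptions a 3:1 compression ratio saves roughly two thirds of the per-query cost and a 10:1 ratio saves about 90%, which brackets the range quoted above.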

Amazon QuickSight
Fast, easy to use, serverless analytics at 1/10 the cost of traditional BI
Empower everyone
Seamless connectivity
Fast analysis
Serverless

Machine Learning on AWS
APPLICATION SERVICES: Rekognition, Transcribe, Translate, Polly, Comprehend, Lex
PLATFORM SERVICES: Amazon SageMaker, AWS DeepLens
FRAMEWORKS & INTERFACES: AWS Deep Learning AMIs (Caffe2, CNTK, Apache MXNet, PyTorch, TensorFlow, Torch, Keras, Gluon)
INFRASTRUCTURE: CPU, GPU (P3), IoT & Edge, Mobile

Demo
Are we all ready to build a
Data Lake?

Demo
Let's Do That
Right Here... Right Now!

What are we building?
Ingest: Kinesis Data Generator (Transactions) to Kinesis Data Firehose Delivery Stream

Kinesis Data Firehose – How it Works
Ingest, Transform, Deliver
Sources: AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, Apache Kafka
Destinations: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service
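
As a rough illustration of the ingest step, a producer could push one transaction at a time into a delivery stream. This is a minimal sketch, not the demo's actual generator: the helper name `send_transaction`, the stream name, and the record fields are hypothetical, and the client is passed in so the function can be exercised without AWS credentials.

```python
import json

def send_transaction(firehose_client, stream_name, transaction):
    """Send one transaction to a delivery stream as newline-delimited JSON."""
    # A trailing newline keeps records separable once Firehose
    # concatenates them into a single S3 object.
    payload = (json.dumps(transaction) + "\n").encode("utf-8")
    response = firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload},
    )
    return response["RecordId"]

# Usage with a real client (requires AWS credentials; names are hypothetical):
#   import boto3
#   firehose = boto3.client("firehose")
#   send_transaction(firehose, "transactions-stream", {"id": 1, "amount": 9.99})
```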

Key Features
Data durability:
• Data backup to S3 upon delivery or transformation failure
• 3X data replication in delivery stream for high data durability
• Up to 24 hours data retention in delivery stream to absorb backpressure from destinations

Serverless Data Transformation
Kinesis Firehose AWS Lambda
Pre-Built Data Transformation Blueprints
• General Processing
• Apache Log to JSON
• Apache Log to CSV
• Syslog to JSON
• Syslog to CSV
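
A transformation in the spirit of the "Apache Log to JSON" blueprint might look like the sketch below. It is not the blueprint's actual code. It follows the Firehose transformation contract, in which each incoming record carries a `recordId` and base64-encoded `data`, and each returned record reports a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The regular expression is a simplified assumption about the Apache common log format.

```python
import base64
import json
import re

# Simplified Apache common log format; real logs may need a looser pattern.
APACHE_LOG = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)$'
)

def lambda_handler(event, context):
    """Convert each Apache log line in a Firehose batch to a JSON line."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        match = APACHE_LOG.match(line)
        if match is None:
            # Hand the record back unchanged and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        transformed = json.dumps(match.groupdict()) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

Records flagged as `ProcessingFailed` are the ones Firehose backs up to S3, as described under Key Features.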

Data Lake
Ingest: Kinesis Data Generator (Transactions) to Delivery Stream
Store & Catalog: Amazon S3 (Raw) and AWS Glue (Data Catalog)
• Transactions
• Reference

AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon
Redshift, etc.) into a single categorized list that is searchable
• Unified Metadata Repository across Data Stores
• Schema Versioning
• Shared across AWS Glue, Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR

What are Crawlers?
Crawlers automatically build your Data Catalog and keep it in sync.
• Scan your data stored in various data stores, extract metadata and data statistics, and add table definitions to your Data Catalog
• Classify data using built-in and custom classifiers
  • You can write your own using Grok expressions
• Discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
• Run ad hoc or on a schedule; serverless, so you only pay when the crawler runs
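
To make "Hive-style partitions" concrete: they are `name=value` folder segments in an S3 key. The sketch below shows one way such segments could be recognized. It is only an illustration of the naming convention, not the crawler's implementation, and the key and function name are hypothetical.

```python
def parse_hive_partitions(s3_key):
    """Return (partitions, filename) for a key such as
    'transactions/year=2018/month=06/day=12/part-0000.json'."""
    *folders, filename = s3_key.split("/")
    partitions = {}
    for folder in folders:
        name, sep, value = folder.partition("=")
        # Only folders of the form name=value count as partitions.
        if sep and name and value:
            partitions[name] = value
    return partitions, filename

parts, name = parse_hive_partitions(
    "transactions/year=2018/month=06/day=12/part-0000.json"
)
print(parts)  # {'year': '2018', 'month': '06', 'day': '12'}
```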

Custom Classifiers
You can write a custom classifier by providing a Grok pattern and a classification string for the matched schema.
A Grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time.
Example:
%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}
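
One way to see what a Grok pattern does is to expand it into an ordinary regular expression with named groups. The sketch below does that for the example pattern, with the literal square brackets escaped. The sub-pattern bodies are simplified assumptions: the real `TIMESTAMP_ISO8601` and `GREEDYDATA` definitions are more permissive, and `MESSAGEPREFIX` and `CRAWLERLOGLEVEL` are custom patterns whose definitions are not shown in this deck. The sample log line is also made up.

```python
import re

# Assumed, simplified bodies for the named patterns used in the example.
PATTERNS = {
    "TIMESTAMP_ISO8601": r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"
                         r"(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?",
    "MESSAGEPREFIX": r"[^\]]+",
    "CRAWLERLOGLEVEL": r"BENCHMARK|ERROR|WARN|INFO|TRACE",
    "GREEDYDATA": r".*",
}

def grok_to_regex(grok):
    """Expand each %{NAME:field} into a named regex group."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{PATTERNS[name]})"
    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, grok))

GROK = (r"%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] "
        r"%{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}")

line = "2018-06-12T10:15:30 [crawler-run-42] INFO : Crawl started"
fields = grok_to_regex(GROK).match(line).groupdict()
print(fields["loglevel"], "|", fields["message"])  # INFO | Crawl started
```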

Data Lake
Ingest: Kinesis Data Generator (Transactions) to Delivery Stream
Store & Catalog: Amazon S3 (Raw) and AWS Glue (Data Catalog)
• Transactions
• Reference
• Enriched
Process: AWS Glue (Transform) enriches data into Amazon S3 (Processed)

Job Authoring: Automatic Code Generation
1. Customize the mappings
2. Glue generates transformation graph and Python or Scala code
3. Customize the code based on your requirements

Job Authoring: Developer Endpoints
• Environment to iteratively explore data with Apache Spark SQL
• Develop and test ETL code
• Connect your IDE or notebook (e.g., Zeppelin) to a Glue development endpoint
• When you are satisfied with the results, you can create an ETL job that runs your code
Diagram: a remote interpreter connects to the interpreter server in Glue's Apache Spark environment

DynamicFrame Transforms
ResolveChoice(): resolves a column with ambiguous types by project, cast, or separate into cols
ApplyMapping(): maps source columns (A, X, Y) to a target schema
15+ transforms out of the box
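
The idea behind these two transforms can be illustrated without a Spark cluster. The following is a plain-Python analogy over lists of dictionaries, not the Glue `DynamicFrame` API. The function names, the `CASTS` table, and the sample rows are assumptions, while the mapping tuples use the (source name, source type, target name, target type) shape that Glue's ApplyMapping expects.

```python
CASTS = {"string": str, "int": int, "long": int, "double": float}

def resolve_choice_cast(records, column, target_type):
    """Cast a column whose values arrive with mixed types to one type."""
    cast = CASTS[target_type]
    return [{**row, column: cast(row[column])} for row in records]

def apply_mapping(records, mappings):
    """Rename and retype columns; unmapped columns are dropped.

    Each mapping is (source_name, source_type, target_name, target_type).
    """
    result = []
    for row in records:
        mapped = {}
        for source, _source_type, target, target_type in mappings:
            if source in row:
                mapped[target] = CASTS[target_type](row[source])
        result.append(mapped)
    return result

# "id" arrives as both string and int: the kind of ambiguity ResolveChoice handles.
rows = [{"id": "1", "amount": 9.5}, {"id": 2, "amount": "12"}]
rows = resolve_choice_cast(rows, "id", "long")
rows = apply_mapping(rows, [
    ("id", "long", "transaction_id", "long"),
    ("amount", "string", "amount_usd", "double"),
])
print(rows)
# [{'transaction_id': 1, 'amount_usd': 9.5}, {'transaction_id': 2, 'amount_usd': 12.0}]
```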

Relationalize() Transform
Semi-structured schema: A, B, C {X, Y}, D [ ]
Relational schema: a root table (A, B, C.X, C.Y, FK) and a child table (PK, Offset, Value)
• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing
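
The effect of flattening can be sketched in plain Python: nested structs become dotted columns, and an array moves to a child table linked by a generated key. This is an analogy under simplifying assumptions (one level of nesting, a single array field, made-up column names `id`, `offset`, `value`), not the Glue implementation.

```python
def relationalize(records, array_field):
    """Flatten nested dicts into dotted columns and move one array field
    into a child table linked by a generated key."""
    root_rows, child_rows = [], []
    for key, record in enumerate(records, start=1):
        row = {}
        for name, value in record.items():
            if name == array_field:
                row[name] = key  # foreign key into the child table
                for offset, item in enumerate(value):
                    child_rows.append({"id": key, "offset": offset, "value": item})
            elif isinstance(value, dict):
                for sub_name, sub_value in value.items():
                    row[f"{name}.{sub_name}"] = sub_value
            else:
                row[name] = value
        root_rows.append(row)
    return root_rows, child_rows

root, child = relationalize(
    [{"A": 1, "B": "b", "C": {"X": 10, "Y": 20}, "D": ["p", "q"]}],
    array_field="D",
)
print(root)   # [{'A': 1, 'B': 'b', 'C.X': 10, 'C.Y': 20, 'D': 1}]
print(child)  # [{'id': 1, 'offset': 0, 'value': 'p'}, {'id': 1, 'offset': 1, 'value': 'q'}]
```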

Job Bookmarks
Suppose you want to periodically run a job:
• Avoid reprocessing previous input
• Avoid generating duplicate output
Examples:
• Process githubarchive files daily
• Process firehose files hourly
• Track timestamps or primary keys in DBs
• Track generated foreign keys for normalization
Bookmarks are per-job checkpoints that track persisted state from previous runs.
They track state of sources, transforms, and sinks.
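
The checkpoint idea can be sketched as a small file-based bookmark. This is a simplified stand-in, not how Glue stores bookmarks internally: it only tracks source file names, and the file names, `run_job`, and the temporary bookmark location are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def run_job(input_files, bookmark_path, process):
    """Process only files not recorded by earlier runs, then save the bookmark."""
    path = Path(bookmark_path)
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    new_files = [name for name in sorted(input_files) if name not in seen]
    for name in new_files:
        process(name)
    # Persist state only after the whole batch succeeded.
    path.write_text(json.dumps(sorted(seen | set(new_files))))
    return new_files

bookmark = Path(tempfile.mkdtemp()) / "bookmark.json"
processed = []
run_job(["f1.json", "f2.json"], bookmark, processed.append)             # run 1
run_job(["f1.json", "f2.json", "f3.json"], bookmark, processed.append)  # run 2
print(processed)  # ['f1.json', 'f2.json', 'f3.json']
```

The second run sees all three files but processes only `f3.json`, which is the "avoid reprocessing previous input" behavior described above.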

Job Execution: Scheduling and Monitoring
Compose jobs globally with event-based dependencies
• Easy to reuse and leverage work across organization boundaries
Multiple triggering mechanisms
• Schedule-based: e.g., time of day
• Event-based: e.g., job completion, job failure, job stopping events
• On-demand: e.g., AWS Lambda
• More coming soon
Logs and alerts are available in Amazon CloudWatch
Diagram: Marketing: Ad-spend by customer segment (event-based, Lambda trigger); Sales: Revenue by customer segment (schedule, weekly sales); Central: ROI by customer segment (data-based)
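
The on-demand path via AWS Lambda could be sketched as a handler that starts a Glue job run whenever a new object lands in S3. This is an assumption-laden illustration: the job name and the `--input_path` job argument are hypothetical, and the Glue client is injected so the logic can be tried with a stub.

```python
def make_handler(glue_client, job_name):
    """Return a Lambda-style handler that starts a Glue job per new S3 object."""
    def handler(event, context):
        run_ids = []
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            response = glue_client.start_job_run(
                JobName=job_name,
                # "--input_path" is a made-up argument the job script would read.
                Arguments={"--input_path": f"s3://{bucket}/{key}"},
            )
            run_ids.append(response["JobRunId"])
        return run_ids
    return handler

# Usage with a real client (job name is hypothetical):
#   import boto3
#   lambda_handler = make_handler(boto3.client("glue"), "enrich-transactions")
```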

Job Execution: Serverless
• Auto-configure VPC and role-based access
• Customers can specify the capacity that gets allocated to each job
• You pay only for the resources you consume while consuming them
There is no need to provision, configure, or manage servers
Diagram: compute instances attached to customer VPCs

Data Lake
Ingest: Kinesis Data Generator (Transactions) to Delivery Stream
Store & Catalog: Amazon S3 (Raw) and AWS Glue (Data Catalog)
• Transactions
Process: AWS Glue (Transform) enriches data into Amazon S3 (Processed)
Consume: Amazon Athena (Explore)

Amazon Athena: Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• Athena supports multiple data formats
  • Text, CSV, TSV, JSON, weblogs, AWS service logs
  • Or convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No ETL required
• Stream data directly from Amazon S3
• Take advantage of Amazon S3 durability and availability

Use ANSI SQL
• Start writing ANSI SQL
• Support for complex joins, nested queries & window functions
• Support for complex data types (arrays, structs)
• Support for partitioning of data by any key
  • (date, time, custom keys)
  • e.g., Year, Month, Day, Hour or Customer Key, Date
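
Partitioning pays off when queries filter on the partition keys, because only the matching S3 prefixes need to be scanned. The sketch below builds such a query and submits it through an injected Athena client. The table, database, and output location names are hypothetical, and the naive string interpolation is for illustration only (values are not escaped).

```python
def partition_query(table, partitions, limit=10):
    """Build a SELECT that filters on partition columns so that only the
    matching S3 prefixes are scanned."""
    predicates = " AND ".join(
        f"{column} = '{value}'" for column, value in partitions.items()
    )
    return f"SELECT * FROM {table} WHERE {predicates} LIMIT {limit}"

def run_query(athena_client, sql, database, output_location):
    """Submit the query and return its execution id."""
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]

sql = partition_query("transactions", {"year": "2018", "month": "06", "day": "12"})
print(sql)
# SELECT * FROM transactions WHERE year = '2018' AND month = '06' AND day = '12' LIMIT 10

# Usage with a real client (names are hypothetical):
#   import boto3
#   run_query(boto3.client("athena"), sql, "datalake", "s3://my-athena-results/")
```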

Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
Data Lake
Ingest: Kinesis Data Generator (Transactions) to Delivery Stream
Store & Catalog: Amazon S3 (Raw) and AWS Glue (Data Catalog)
• Transactions
Process: AWS Glue (Transform) enriches data into Amazon S3 (Processed)
Consume: Amazon Athena (Explore) and Amazon QuickSight

QuickSight: Connect to data wherever it is
QuickSight is natively integrated with AWS data sources, as well as on-premises and hosted databases and third-party business applications
On-premises
Securely connect to on-premises databases and flat files like Excel and CSV
• Excel
• CSV
• Teradata
• MySQL
• SQL Server
• PostgreSQL
In the cloud
Connect to hosted databases, big data formats, and secure VPCs
• Redshift
• RDS
• S3
• Athena
• Aurora
• Teradata
• MySQL
• Presto
• Spark
• SQL Server
• PostgreSQL
• MariaDB
• Snowflake
Applications
Connect directly to third-party business applications
• Salesforce
• Square
• Adobe Analytics
• Jira
• ServiceNow
• Twitter
• GitHub

SPICE
QuickSight is powered by SPICE, a super-fast calculation engine that delivers
performance and scale, regardless of how many users are active.
Your Data Source | SPICE

Data Governance
Create managed datasets that give power users and authors the flexibility to
perform self-serve analytics on data that you control.
Create datasets that:
• Can be shared with any user
• Automatically refresh
• Have row level security
• Users cannot modify
• Dynamically update with changes

User Management and AD Integration
QuickSight Enterprise Edition can integrate with your Active Directory to
dynamically manage users and groups.

Thank you
