Building Data Lakes in the AWS Cloud
- 2. Rethink how to become a data-driven business
• Business outcomes - start with the insights and actions you
want to drive, then work backwards to a streamlined design
• Experimentation - start small, test many ideas, keep the
good ones and scale those up, paying only for what you
consume
• Agile and timely - deploy data processing infrastructure in
minutes, not months; take advantage of a rich platform of
services to respond quickly to changing business needs
- 3. Finding Value in Data is a Journey
Business Monitoring
Business Insights
New Business Opportunity
Business Optimization
Business Transformation
Evolving Tools and Infrastructure
- 4. Often Undertaken with Silos of Tools and Data
[Diagram: disconnected silos of tools and data - Hadoop, Spark, and NoSQL clusters; storage arrays; databases; a data warehouse holding structured data queried with SQL; raw data; ETL pipelines; advanced analytics]
- 5. Legacy Data Warehouses & RDBMS
• Complex to set up and manage
• Do not scale
• Take months to add new data sources
• Queries take too long
• Cost $MM upfront
- 6. This Leads to Friction & Pain
• Challenging to move data across silos
• Forced to keep multiple copies of data
• Complex data transformation & governance
• Users struggle to find data they need
• Slows innovation and evolution
• Expensive
- 7. Enter the Data Lake Architecture
The data lake is a new and increasingly popular architecture
for storing and analyzing massive volumes of heterogeneous data.
Benefits of a Data Lake
• All Data in One Place
• Quick Ingest & Transformation
• Bring Functionality to the Data
• Schema on Read
- 8. Legacy Data Architectures Are Monolithic
Multiple layers of functionality all on a single cluster
[Diagram: a Hadoop master node coordinating worker nodes, each bundling CPU, memory, and HDFS storage]
- 10. Consolidate Data / Separate Storage & Compute
• Use Amazon S3 as the data lake storage tier, not a single analytics
tool like Hadoop or a data warehouse
• Decoupled storage and compute is cheaper and more efficient to operate
• Decoupled storage and compute allows evolution to clusterless
architectures (e.g., AWS Lambda, Amazon Athena, Redshift Spectrum,
AWS Glue, Amazon Macie)
• Do not build data silos in Hadoop or an EDW
• Gain the flexibility to use all the analytics tools in the ecosystem
around S3 and future-proof the architecture
- 11. An AWS Data Lake Architecture
Serverless compute: AWS Glue (ETL & Data Catalog), AWS Lambda (trigger-based code execution), Amazon Redshift Spectrum (fast @ exabyte scale), Amazon Athena (interactive query)
Data processing: Amazon EMR (managed Hadoop applications), Amazon Redshift (petabyte-scale data warehousing)
Storage: Amazon S3 (exabyte-scale object storage), AWS Glue Data Catalog (Hive-compatible metastore)
- 12. • Nasdaq implements an S3 data lake + Redshift data warehouse
architecture
• The most recent two years of data are kept in the Redshift data
warehouse and snapshotted into S3 for disaster recovery
• Data between two and five years old is kept in S3
• Presto on EMR is used for ad hoc queries of data in S3
• Transitioned from an on-premises data warehouse to an Amazon
Redshift & S3 data lake architecture
• Over 1,000 tables migrated
• Average daily ingest of over 7B rows
• Migrated off the legacy DW to AWS (start to finish) in 7 man-months
• AWS costs were 43% of the legacy budget for the same data set
(~1,100 tables)
- 13. Nasdaq uses Presto on Amazon EMR and
Amazon Redshift as a tiered data lake
Full Presentation: https://www.youtube.com/watch?v=LuHxnOQarXU
Why Choose Amazon S3 for a Data Lake?
• Durable: designed for 11 9s of durability
• Secure: multiple encryption options; robust, highly flexible access controls
• High performance: multipart upload, range GETs, scalable throughput
• Scalable & affordable: store as much as you need; scale storage and compute independently; scale without limits
• Integrated: Amazon EMR, Amazon Redshift/Spectrum, Amazon DynamoDB, Amazon Athena, Amazon Rekognition, AWS Glue
• Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies, simple management tools, Hadoop compatibility
- 15. Optimize Costs with Data Tiering
• Use HDFS for very frequently accessed (hot) data
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard-IA for less frequently accessed data
• Use Amazon Glacier for archiving cold data
• Use Amazon S3 Analytics for storage class analysis
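The tiering above can be automated with S3 lifecycle rules. A minimal sketch, assuming a hypothetical bucket and a `raw/` prefix; the boto3 call is shown commented because it requires AWS credentials:

```python
# Sketch: lifecycle rules implementing the tiering above.
# Bucket name and prefix are hypothetical.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # S3 Standard-IA
                {"Days": 90, "StorageClass": "GLACIER"},      # Amazon Glacier
            ],
        }
    ]
}

# Applying it would look like:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-data-lake", LifecycleConfiguration=lifecycle_config
# )
```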
- 16. Implement the Right Security Controls in S3
Security: IAM policies; bucket policies; access control lists (ACLs); private VPC endpoints to Amazon S3; SSL endpoints
Encryption: server-side encryption (SSE-S3); server-side encryption with provided keys (SSE-C, SSE-KMS); client-side encryption
Compliance: bucket access logs; lifecycle management policies; access control lists (ACLs); versioning & MFA deletes; certifications (HIPAA, PCI, SOC 1/2/3, etc.)
- 17. Manage Your Data with S3 Object Tags
Manage storage based on object tags:
• Classify your data
• Tag your objects with key-value pairs
• Write policies once based on the type of data
Benefits: discoverability, lifecycle policy, access control
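A sketch of tagging in practice; bucket, key, and tag names are hypothetical, and the actual write would use the S3 `put_object_tagging` API (shown commented since it needs AWS credentials):

```python
# Classify an object with key-value tags so lifecycle and access
# policies can key off the data class. All names are hypothetical.
tags = {"classification": "HIPAA", "team": "analytics"}

tag_set = {"TagSet": [{"Key": k, "Value": v} for k, v in tags.items()]}

# import boto3
# boto3.client("s3").put_object_tagging(
#     Bucket="example-data-lake", Key="raw/patients.csv", Tagging=tag_set
# )
```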
- 18. Manage S3 Security
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::EXAMPLE-BUCKET-NAME/*",
      "Condition": {"StringEquals": {"s3:ExistingObjectTag/HIPAA": "True"}}
    }
  ]
}
Manage permissions with tags
- 19. Access control by cluster tag and IAM roles
IAM user: MyUser
Tag: user = MyUser
EMR role
EC2 role
SSH key
- 20. Macie: A New Approach
Amazon Macie
• Understand your data: natural language processing (NLP)
• Understand data access: machine learning
- 21. Amazon Macie Uses Machine Learning
• Uses behavioral analytics to baseline normal behavior
• Trains and develops contextualized alerts by understanding
the value of the data being accessed
• Provides context for content
- 22. Business Critical Data in Amazon S3
• Static website content
• Source code
• SSL certificates, private keys
• iOS and Android app signing keys
• Database backups
• OAuth and cloud SaaS API keys
- 25. Choose the Right Ingestion Methods
• AWS Snowball & Snowmobile: accelerate PB-scale transfers with AWS-provided appliances; 50, 80, and 100 TB Snowball models; 100 PB Snowmobile
• AWS Storage Gateway: instant hybrid cloud; up to 120 MB/s cloud upload rate (4x improvement)
• Amazon Kinesis Firehose: ingest device streams directly into AWS data stores
• AWS Direct Connect: colo to AWS; use native copy tools
• Native/ISV connectors: Sqoop, Flume, DistCp; Commvault, Veritas, etc.
• Amazon S3 Transfer Acceleration: move data up to 300% faster using AWS’s private network
- 26. Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Redshift,
and Elasticsearch
• Zero administration: capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure
• Direct-to-data-store integration: batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations
• Seamless elasticity: scales to match data throughput without intervention
• Serverless ETL using AWS Lambda: Firehose can invoke your Lambda function to transform incoming source data
Capture and submit streaming data → Firehose loads it continuously into Amazon S3, Redshift, and Elasticsearch → analyze the streaming data using your favorite BI tools
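A sketch of the producer side, assuming a hypothetical delivery stream name; newline-delimited JSON keeps the objects Firehose writes to S3 easy to parse downstream (the boto3 call is commented because it needs AWS credentials):

```python
import json

# Build one clickstream event as a newline-delimited JSON record.
event = {"user_id": 42, "page": "/pricing", "ts": "2017-11-13T10:00:00Z"}
record = {"Data": (json.dumps(event) + "\n").encode("utf-8")}

# import boto3
# boto3.client("firehose").put_record(
#     DeliveryStreamName="clickstream-to-s3",  # hypothetical stream name
#     Record=record,
# )
```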
- 27. Catalog Your Data
Data sources → put data in S3 → extract metadata with AWS Lambda → store it in Amazon DynamoDB → index it in Amazon Elasticsearch Service for search capabilities
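A sketch of the "extract metadata with Lambda" step, assuming the standard S3 event notification shape; the DynamoDB and Elasticsearch writes are omitted:

```python
# Parse the S3 PutObject notification and produce catalog entries.
def extract_metadata(event):
    entries = []
    for rec in event["Records"]:
        s3 = rec["s3"]
        entries.append({
            "bucket": s3["bucket"]["name"],
            "key": s3["object"]["key"],
            "size_bytes": s3["object"]["size"],
            "event_time": rec["eventTime"],
        })
    return entries

# A trimmed sample event in the standard S3 notification format:
sample_event = {
    "Records": [{
        "eventTime": "2017-11-13T10:00:00Z",
        "s3": {
            "bucket": {"name": "example-data-lake"},
            "object": {"key": "raw/date=13/events.json", "size": 1048576},
        },
    }]
}
```

In a Lambda handler this would run as `extract_metadata(event)`, with the resulting entries written to DynamoDB and indexed in Elasticsearch.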
- 29. AWS Glue Data Catalog
Manage table metadata through a Hive metastore API or Hive SQL.
Supported by tools like Hive, Presto, Spark, etc.
Extensions added by Glue:
• Search over metadata for data discovery
• Connection info: JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata is updated
Populate using Hive DDL, bulk import, or automatically through crawlers.
- 30. Data Catalog: Crawlers
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
• Built-in classifiers for popular types; custom classifiers using Grok expressions
• Run ad hoc or on a schedule; serverless, so you pay only when a crawler runs
Crawlers automatically build your Data Catalog and keep it in sync
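A sketch of defining such a crawler through the Glue API; the crawler name, role, database, and S3 path are all hypothetical (the boto3 call is commented because it needs AWS credentials):

```python
# Parameters for a nightly crawler over an S3 prefix.
crawler_params = {
    "Name": "raw-events-crawler",
    "Role": "AWSGlueServiceRole-datalake",
    "DatabaseName": "datalake",
    "Targets": {"S3Targets": [{"Path": "s3://example-data-lake/raw/"}]},
    "Schedule": "cron(0 3 * * ? *)",  # 03:00 UTC daily
}

# import boto3
# boto3.client("glue").create_crawler(**crawler_params)
```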
- 31. AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single
categorized list that is searchable
- 34. Data Catalog: Detecting partitions
[Diagram: crawlers walk the S3 bucket hierarchy (e.g., month=Nov / date=10, date=15 … / file 1 … file N) and estimate schema similarity among files at each level (sim = .99, .95, .93) to handle semi-structured logs and schema evolution. The resulting table definition records partition columns (month, date as str) alongside data columns (col 1: int, col 2: float).]
- 36. Getting high Throughput Performance with S3
• S3 can scale to many thousands of requests per second
• You need a good key naming scheme
• Only at scale do you need to consider your key naming scheme
• What are partitions, and why do they matter?
• Spread keys lexicographically
• The goal of partitioning is to spread the heat
• Prevent hot spots
- 37. Distributing key names
Add randomness to the beginning of the key name…
<my_bucket>/6213-2013_11_13.jpg
<my_bucket>/4653-2013_11_13.jpg
<my_bucket>/9873-2013_11_13.jpg
<my_bucket>/4657-2013_11_13.jpg
<my_bucket>/1256-2013_11_13.jpg
<my_bucket>/8345-2013_11_13.jpg
<my_bucket>/0321-2013_11_13.jpg
<my_bucket>/5654-2013_11_13.jpg
<my_bucket>/2345-2013_11_13.jpg
<my_bucket>/7567-2013_11_13.jpg
<my_bucket>/3455-2013_11_13.jpg
<my_bucket>/4313-2013_11_13.jpg
Partitions:
<my_bucket>/0
<my_bucket>/1
<my_bucket>/2
<my_bucket>/3
<my_bucket>/4
<my_bucket>/5
<my_bucket>/6
<my_bucket>/7
<my_bucket>/8
<my_bucket>/9
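A deterministic way to generate such prefixes is to hash a stable object identifier; a sketch, where the identifier scheme is hypothetical:

```python
import hashlib

# Derive a short hash prefix so keys spread lexicographically across
# partitions instead of piling onto one date prefix.
def randomized_key(object_id, filename):
    prefix = hashlib.md5(object_id.encode("utf-8")).hexdigest()[:4]
    return "{}-{}".format(prefix, filename)
```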
- 38. Data Recommendations for EMR and S3
Performance best practices:
• Reduce the number of S3 objects by aggregating small files
into larger ones (S3DistCp --groupBy option)
• Goal: files > 128 MB
• Use EMRFS with consistent view
• Parquet with Snappy compression is emerging as the best
format/codec combination
• Reverse the partition scheme to HOUR, DAY, MONTH, YEAR
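One reading of the reversed-partition recommendation above, sketched in Python: lead with the fastest-changing element so concurrent writers spread across key prefixes (the path layout is illustrative):

```python
from datetime import datetime

# Build a partition path ordered hour -> day -> month -> year.
def reversed_partition_key(ts, filename):
    return "hour={:02d}/day={:02d}/month={:02d}/year={}/{}".format(
        ts.hour, ts.day, ts.month, ts.year, filename
    )
```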
- 39. Use the Right Data Formats
• Pay by the amount of data scanned per query
• Use Compressed Columnar Formats
• Parquet
• ORC
• Easy to integrate with wide variety of tools
Dataset                          Size on Amazon S3   Query run time   Data scanned   Cost
Logs stored as text files        1 TB                237 seconds      1.15 TB        $5.75
Logs stored in Apache Parquet*   130 GB              5.13 seconds     2.69 GB        $0.013
Savings                          87% less with Parquet   34x faster   99% less data scanned   99.7% cheaper
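The cost column follows directly from Athena's pay-per-scan pricing ($5 per TB scanned at the time of this talk):

```python
# Athena bills per TB of data scanned, so columnar formats that scan
# less data cost proportionally less.
PRICE_PER_TB = 5.00  # USD, pricing at the time of this talk

def query_cost(tb_scanned):
    return tb_scanned * PRICE_PER_TB

text_cost = query_cost(1.15)            # full 1.15 TB scanned -> $5.75
parquet_cost = query_cost(2.69 / 1024)  # only 2.69 GB scanned -> ~$0.013
```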
- 41. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon Analytics End-to-End Architecture
[Diagram: data arrives from Kinesis, the Database Migration Service, RDS, and other sources into S3; the Glue Data Catalog spans the lake; Athena, EMR, and Redshift Spectrum query in place; Glue performs ETL; Amazon ML / MXNet and QuickSight consume results; IAM governs access]
- 42. Explore Your Data Without ETL
Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL
- 44. Query Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• Athena supports multiple data formats
• Text, CSV, TSV, JSON, weblogs, AWS service logs
• Or convert to an optimized form like ORC or Parquet for the
best performance and lowest cost
• No ETL required
• Stream data directly from Amazon S3
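A sketch of issuing such a query through the Athena API; the database, table, and results bucket are hypothetical, and the boto3 call is commented because it needs AWS credentials:

```python
# Parameters for a SQL query over cataloged S3 data.
query_params = {
    "QueryString": (
        "SELECT page, COUNT(*) AS hits "
        "FROM datalake.clickstream "
        "GROUP BY page ORDER BY hits DESC LIMIT 10"
    ),
    "QueryExecutionContext": {"Database": "datalake"},
    "ResultConfiguration": {
        "OutputLocation": "s3://example-query-results/athena/"
    },
}

# import boto3
# boto3.client("athena").start_query_execution(**query_params)
```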
- 45. Familiar Technologies Under the Covers
Presto: used for SQL queries; in-memory distributed query engine; ANSI SQL compatible, with extensions
Hive: used for DDL functionality; complex data types; multitude of formats; supports data partitioning
- 47. ETL is the most time-consuming part of analytics
ETL Data Warehousing Business Intelligence
80% of time
spent here
Amazon Redshift Amazon QuickSight
- 48. AWS Glue
Simple, flexible, cost-effective ETL
AWS Glue is a fully managed ETL (extract, transform, and load) service:
categorize your data, clean it, enrich it, and move it reliably
between various data stores.
Once cataloged, your data is immediately searchable and queryable
across your data silos.
Simple and cost-effective
Serverless; runs on a fully managed, scale-out Spark environment
- 50. Amazon Redshift: Fully Managed Petabyte-Scale Data Warehouse
• Relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• Supports standard ANSI SQL
• High performance: a lot faster, a lot simpler, a lot cheaper
- 51. Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
• Fast @ exabyte scale
• Elastic & highly available
• On-demand, pay-per-query
• High concurrency: multiple clusters access the same data
• No ETL: query data in place using open file formats
• Full Amazon Redshift SQL support
- 52. Real-Time Analytics with Amazon EMR and Amazon Kinesis
Stream: Amazon Kinesis ingests the stream and fans out to processors - a KCL app, AWS Lambda, Spark Streaming on Amazon EMR, and Amazon Kinesis Analytics
Process: notifications via Amazon SNS, real-time prediction via Amazon ML, alerts, KPIs
Store: Amazon ElastiCache (Redis) for app state, Amazon DynamoDB, Amazon RDS, Amazon ES, logs in Amazon S3
- 53. Case Study: Clickstream Analysis
Hearst Corporation monitors trending content for over 250 digital properties
worldwide and processes more than 30TB of data per day, using an architecture
that includes Amazon Kinesis and Spark running on Amazon EMR.
Store → Process | Analyze → Answers
- 54. [Architecture: clickstream → Amazon Kinesis → Spark on Amazon EMR → Amazon Redshift and Elasticsearch]
- 57. Choose the Right Tools
Amazon Redshift, Spectrum
Enterprise Data Warehouse
Amazon EMR
Hadoop/Spark
Amazon Athena
Clusterless SQL
AWS Glue
Clusterless ETL
Amazon Aurora
Managed Relational Database
Amazon Machine Learning
Predictive Analytics
Amazon QuickSight
Business Intelligence/Visualization
Amazon Elasticsearch Service
Elasticsearch
Amazon ElastiCache
Redis In-memory Datastore
Amazon DynamoDB
Managed NoSQL Database
Amazon Rekognition & Amazon Polly
Image Recognition & Text-to-Speech AI APIs
Amazon Lex
Voice or Text Chatbots
- 58. AWS Data Lake: You Don’t Have to Choose
• Storage: Amazon S3 data lake
• Ingest: Amazon Kinesis Streams & Firehose
• Clusterless ETL: AWS Glue
• Streaming analytics: AWS Lambda, Spark Streaming on EMR, Apache Storm on EMR, Apache Flink on EMR, Amazon Kinesis Analytics
• Hadoop/Spark: Amazon EMR; data science sandbox with any open source tool of choice on EC2
• Data warehouse: Amazon Redshift
• Clusterless SQL query: Amazon Athena
• Serving tier: Amazon DynamoDB (NoSQL database), Amazon Aurora (relational database), Amazon Elasticsearch Service, Amazon ElastiCache (Redis)
• Predictive analytics: Amazon Machine Learning
• Visualization / reporting
• Data sources: transactional data and more
- 59. Evolve as Needed
• Use S3 as the storage repository for your data lake, instead of a
Hadoop cluster or data warehouse
• Decoupled storage and compute is cheaper and more efficient to operate
• Decoupled storage and compute allows evolution to clusterless
architectures like Athena
• Do not build data silos in Hadoop or the enterprise DW
• Gain flexibility to use all the analytics tools in the ecosystem
around S3 and future-proof the architecture