Building Data Lakes in the AWS Cloud
- 2. Rethink how to become a data-driven business
• Business outcomes - start with the insights and actions you
want to drive, then work backwards to a streamlined design
• Experimentation - start small, test many ideas, keep the
good ones and scale those up, paying only for what you
consume
• Agile and timely - deploy data processing infrastructure in
minutes, not months; take advantage of a rich platform of
services to respond quickly to changing business needs
- 3. Finding Value in Data is a Journey
Business Monitoring
Business Insights
New Business Opportunity
Business Optimization
Business Transformation
Evolving Tools and Infrastructure
- 4. Often Undertaken with Silos of Tools and Data
[Diagram: disconnected silos of tools and data - Hadoop, Spark, and NoSQL clusters; storage arrays; databases; a data warehouse holding structured data queried with SQL; raw data; ETL pipelines; advanced analytics]
- 5. Legacy Data Warehouses & RDBMS
• Complex to set up and manage
• Do not scale
• Take months to add new data sources
• Queries take too long
• Cost $MM upfront
- 6. This Leads to Friction & Pain
• Challenging to move data across silos
• Forced to keep multiple copies of data
• Complex data transformation & governance
• Users struggle to find data they need
• Slows innovation and evolution
• Expensive
- 7. Enter the Data Lake Architecture
The data lake is a new and increasingly popular architecture
for storing and analyzing massive volumes of heterogeneous data.
Benefits of a Data Lake
• All Data in One Place
• Quick Ingest & Transformation
• Bring Functionality to the Data
• Schema on Read
- 8. Legacy Data Architectures Are Monolithic
Multiple layers of functionality all on a single cluster
[Diagram: a Hadoop master node coordinating worker nodes, each bundling CPU, memory, and HDFS storage]
- 10. Consolidate Data / Separate Storage & Compute
• Use Amazon S3 as the data lake storage tier, not a single analytics
tool like Hadoop or a data warehouse
• Decoupled storage and compute is cheaper and more efficient to operate
• Decoupled storage and compute allows evolution to clusterless
architectures (e.g., AWS Lambda, Amazon Athena, Redshift Spectrum,
AWS Glue, Amazon Macie)
• Do not build data silos in Hadoop or an EDW
• Gain the flexibility to use all the analytics tools in the ecosystem
around S3 and future-proof the architecture
- 11. An AWS Data Lake Architecture
Serverless compute: AWS Glue (ETL & Data Catalog), AWS Lambda (trigger-based code execution), Amazon Redshift Spectrum (fast @ exabyte scale), Amazon Athena (interactive query)
Data processing: Amazon EMR (managed Hadoop applications), Amazon Redshift (petabyte-scale data warehousing)
Storage: Amazon S3 (exabyte-scale object storage), AWS Glue Data Catalog (Hive-compatible metastore)
- 12. • Nasdaq implements an S3 data lake + Redshift data warehouse
architecture
• The most recent two years of data are kept in the Redshift data
warehouse and snapshotted into S3 for disaster recovery
• Data between two and five years old is kept in S3
• Presto on EMR is used for ad hoc queries of data in S3
• Transitioned from an on-premises data warehouse to an Amazon
Redshift & S3 data lake architecture
• Over 1,000 tables migrated
• Average daily ingest of over 7B rows
• Migrated off the legacy DW to AWS (start to finish) in 7 man-months
• AWS costs were 43% of the legacy budget for the same data set
(~1,100 tables)
- 13. Nasdaq uses Presto on Amazon EMR and
Amazon Redshift as a tiered data lake
Full Presentation: https://www.youtube.com/watch?v=LuHxnOQarXU
Why Choose Amazon S3 for a Data Lake?
• Durable: designed for 11 9s of durability
• Secure: multiple encryption options; robust, highly flexible access controls
• High performance: multipart upload, range GETs, scalable throughput
• Scalable & affordable: store as much as you need; scale storage and compute independently; scale without limits
• Integrated: Amazon EMR, Amazon Redshift/Spectrum, Amazon DynamoDB, Amazon Athena, Amazon Rekognition, AWS Glue
• Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies, simple management tools, Hadoop compatibility
- 15. Optimize Costs with Data Tiering
• Use HDFS for very frequently accessed (hot) data
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard-IA for less frequently accessed data
• Use Amazon Glacier for archiving cold data
• Use Amazon S3 Analytics for storage class analysis
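The tiering above can be automated with S3 lifecycle rules. A minimal sketch, assuming a hypothetical bucket and a `raw/` prefix; the boto3 call is shown commented because it requires AWS credentials:

```python
# Sketch: lifecycle rules implementing the tiering above.
# Bucket name and prefix are hypothetical.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # S3 Standard-IA
                {"Days": 90, "StorageClass": "GLACIER"},      # Amazon Glacier
            ],
        }
    ]
}

# Applying it would look like:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-data-lake", LifecycleConfiguration=lifecycle_config
# )
```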
- 16. Implement the Right Security Controls in S3
Security: IAM policies; bucket policies; access control lists (ACLs); private VPC endpoints to Amazon S3; SSL endpoints
Encryption: server-side encryption (SSE-S3); server-side encryption with provided keys (SSE-C, SSE-KMS); client-side encryption
Compliance: bucket access logs; lifecycle management policies; access control lists (ACLs); versioning & MFA deletes; certifications (HIPAA, PCI, SOC 1/2/3, etc.)
- 17. Manage Your Data with S3 Object Tags
Manage storage based on object tags:
• Classify your data
• Tag your objects with key-value pairs
• Write policies once based on the type of data
Benefits: discoverability, lifecycle policy, access control
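A sketch of tagging in practice; bucket, key, and tag names are hypothetical, and the actual write would use the S3 `put_object_tagging` API (shown commented since it needs AWS credentials):

```python
# Classify an object with key-value tags so lifecycle and access
# policies can key off the data class. All names are hypothetical.
tags = {"classification": "HIPAA", "team": "analytics"}

tag_set = {"TagSet": [{"Key": k, "Value": v} for k, v in tags.items()]}

# import boto3
# boto3.client("s3").put_object_tagging(
#     Bucket="example-data-lake", Key="raw/patients.csv", Tagging=tag_set
# )
```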
- 18. Manage S3 Security
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::EXAMPLE-BUCKET-NAME/*",
      "Condition": {"StringEquals": {"s3:ExistingObjectTag/HIPAA": "True"}}
    }
  ]
}
Manage permissions with tags
- 19. Access control by cluster tag and IAM roles
IAM user: MyUser
Tag: user = MyUser
EMR role
EC2 role
SSH key
- 20. Macie: A New Approach
Amazon Macie
• Understand your data: natural language processing (NLP)
• Understand data access: machine learning
- 21. Amazon Macie Uses Machine Learning
• Uses behavioral analytics to baseline normal behavior
• Trains and develops contextualized alerts by understanding
the value of the data being accessed
• Provides context for content
- 22. Business Critical Data in Amazon S3
• Static website content
• Source code
• SSL certificates, private keys
• iOS and Android app signing keys
• Database backups
• OAuth and cloud SaaS API keys
- 25. Choose the Right Ingestion Methods
• AWS Snowball & Snowmobile: accelerate PB-scale transfers with AWS-provided appliances; 50, 80, and 100 TB Snowball models; 100 PB Snowmobile
• AWS Storage Gateway: instant hybrid cloud; up to 120 MB/s cloud upload rate (4x improvement)
• Amazon Kinesis Firehose: ingest device streams directly into AWS data stores
• AWS Direct Connect: colo to AWS; use native copy tools
• Native/ISV connectors: Sqoop, Flume, DistCp; Commvault, Veritas, etc.
• Amazon S3 Transfer Acceleration: move data up to 300% faster using AWS’s private network
- 26. Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Redshift,
and Elasticsearch
• Zero administration: capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure
• Direct-to-data-store integration: batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations
• Seamless elasticity: scales to match data throughput without intervention
• Serverless ETL using AWS Lambda: Firehose can invoke your Lambda function to transform incoming source data
Capture and submit streaming data → Firehose loads it continuously into Amazon S3, Redshift, and Elasticsearch → analyze the streaming data using your favorite BI tools
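A sketch of the producer side, assuming a hypothetical delivery stream name; newline-delimited JSON keeps the objects Firehose writes to S3 easy to parse downstream (the boto3 call is commented because it needs AWS credentials):

```python
import json

# Build one clickstream event as a newline-delimited JSON record.
event = {"user_id": 42, "page": "/pricing", "ts": "2017-11-13T10:00:00Z"}
record = {"Data": (json.dumps(event) + "\n").encode("utf-8")}

# import boto3
# boto3.client("firehose").put_record(
#     DeliveryStreamName="clickstream-to-s3",  # hypothetical stream name
#     Record=record,
# )
```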
- 27. Catalog Your Data
Data sources → put data in S3 → extract metadata with AWS Lambda → store it in Amazon DynamoDB → index it in Amazon Elasticsearch Service for search capabilities
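A sketch of the "extract metadata with Lambda" step, assuming the standard S3 event notification shape; the DynamoDB and Elasticsearch writes are omitted:

```python
# Parse the S3 PutObject notification and produce catalog entries.
def extract_metadata(event):
    entries = []
    for rec in event["Records"]:
        s3 = rec["s3"]
        entries.append({
            "bucket": s3["bucket"]["name"],
            "key": s3["object"]["key"],
            "size_bytes": s3["object"]["size"],
            "event_time": rec["eventTime"],
        })
    return entries

# A trimmed sample event in the standard S3 notification format:
sample_event = {
    "Records": [{
        "eventTime": "2017-11-13T10:00:00Z",
        "s3": {
            "bucket": {"name": "example-data-lake"},
            "object": {"key": "raw/date=13/events.json", "size": 1048576},
        },
    }]
}
```

In a Lambda handler this would run as `extract_metadata(event)`, with the resulting entries written to DynamoDB and indexed in Elasticsearch.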
- 29. AWS Glue Data Catalog
Manage table metadata through a Hive metastore API or Hive SQL.
Supported by tools like Hive, Presto, Spark, etc.
Extensions added by Glue:
• Search over metadata for data discovery
• Connection info: JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata is updated
Populate using Hive DDL, bulk import, or automatically through crawlers.
- 30. Data Catalog: Crawlers
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
• Built-in classifiers for popular types; custom classifiers using Grok expressions
• Run ad hoc or on a schedule; serverless, so you pay only when a crawler runs
Crawlers automatically build your Data Catalog and keep it in sync
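A sketch of defining such a crawler through the Glue API; the crawler name, role, database, and S3 path are all hypothetical (the boto3 call is commented because it needs AWS credentials):

```python
# Parameters for a nightly crawler over an S3 prefix.
crawler_params = {
    "Name": "raw-events-crawler",
    "Role": "AWSGlueServiceRole-datalake",
    "DatabaseName": "datalake",
    "Targets": {"S3Targets": [{"Path": "s3://example-data-lake/raw/"}]},
    "Schedule": "cron(0 3 * * ? *)",  # 03:00 UTC daily
}

# import boto3
# boto3.client("glue").create_crawler(**crawler_params)
```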
- 31. AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single
categorized list that is searchable
- 34. Data Catalog: Detecting partitions
[Diagram: crawlers walk the S3 bucket hierarchy (e.g., month=Nov / date=10, date=15 … / file 1 … file N) and estimate schema similarity among files at each level (sim = .99, .95, .93) to handle semi-structured logs and schema evolution. The resulting table definition records partition columns (month, date as str) alongside data columns (col 1: int, col 2: float).]
- 36. Getting high Throughput Performance with S3
• S3 can scale to many thousands of requests per second
• You need a good key naming scheme
• Only at scale do you need to consider your key naming scheme
• What are partitions, and why do they matter?
• Spread keys lexicographically
• The goal of partitioning is to spread the heat
• Prevent hot spots
- 37. Distributing key names
Add randomness to the beginning of the key name…
<my_bucket>/6213-2013_11_13.jpg
<my_bucket>/4653-2013_11_13.jpg
<my_bucket>/9873-2013_11_13.jpg
<my_bucket>/4657-2013_11_13.jpg
<my_bucket>/1256-2013_11_13.jpg
<my_bucket>/8345-2013_11_13.jpg
<my_bucket>/0321-2013_11_13.jpg
<my_bucket>/5654-2013_11_13.jpg
<my_bucket>/2345-2013_11_13.jpg
<my_bucket>/7567-2013_11_13.jpg
<my_bucket>/3455-2013_11_13.jpg
<my_bucket>/4313-2013_11_13.jpg
Partitions:
<my_bucket>/0
<my_bucket>/1
<my_bucket>/2
<my_bucket>/3
<my_bucket>/4
<my_bucket>/5
<my_bucket>/6
<my_bucket>/7
<my_bucket>/8
<my_bucket>/9
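A deterministic way to generate such prefixes is to hash a stable object identifier; a sketch, where the identifier scheme is hypothetical:

```python
import hashlib

# Derive a short hash prefix so keys spread lexicographically across
# partitions instead of piling onto one date prefix.
def randomized_key(object_id, filename):
    prefix = hashlib.md5(object_id.encode("utf-8")).hexdigest()[:4]
    return "{}-{}".format(prefix, filename)
```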
- 38. Data Recommendations for EMR and S3
Performance best practices:
• Reduce the number of S3 objects by aggregating small files
into larger ones (S3DistCp --groupBy option)
• Goal: files > 128 MB
• Use EMRFS with consistent view
• Parquet with Snappy compression is emerging as the best
format/codec combination
• Reverse the partition scheme to HOUR, DAY, MONTH, YEAR
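One reading of the reversed-partition recommendation above, sketched in Python: lead with the fastest-changing element so concurrent writers spread across key prefixes (the path layout is illustrative):

```python
from datetime import datetime

# Build a partition path ordered hour -> day -> month -> year.
def reversed_partition_key(ts, filename):
    return "hour={:02d}/day={:02d}/month={:02d}/year={}/{}".format(
        ts.hour, ts.day, ts.month, ts.year, filename
    )
```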
- 39. Use the Right Data Formats
• Pay by the amount of data scanned per query
• Use Compressed Columnar Formats
• Parquet
• ORC
• Easy to integrate with wide variety of tools
Dataset                          Size on Amazon S3   Query run time   Data scanned   Cost
Logs stored as text files        1 TB                237 seconds      1.15 TB        $5.75
Logs stored in Apache Parquet*   130 GB              5.13 seconds     2.69 GB        $0.013
Savings                          87% less with Parquet   34x faster   99% less data scanned   99.7% cheaper
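The cost column follows directly from Athena's pay-per-scan pricing ($5 per TB scanned at the time of this talk):

```python
# Athena bills per TB of data scanned, so columnar formats that scan
# less data cost proportionally less.
PRICE_PER_TB = 5.00  # USD, pricing at the time of this talk

def query_cost(tb_scanned):
    return tb_scanned * PRICE_PER_TB

text_cost = query_cost(1.15)            # full 1.15 TB scanned -> $5.75
parquet_cost = query_cost(2.69 / 1024)  # only 2.69 GB scanned -> ~$0.013
```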
- 41. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon Analytics End-to-End Architecture
[Diagram: data arrives from Kinesis, the Database Migration Service, RDS, and other sources into S3; the Glue Data Catalog spans the lake; Athena, EMR, and Redshift Spectrum query in place; Glue performs ETL; Amazon ML / MXNet and QuickSight consume results; IAM governs access]
- 42. Explore Your Data Without ETL
Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL
- 44. Query Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• Athena supports multiple data formats
• Text, CSV, TSV, JSON, weblogs, AWS service logs
• Or convert to an optimized form like ORC or Parquet for the
best performance and lowest cost
• No ETL required
• Stream data directly from Amazon S3
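A sketch of issuing such a query through the Athena API; the database, table, and results bucket are hypothetical, and the boto3 call is commented because it needs AWS credentials:

```python
# Parameters for a SQL query over cataloged S3 data.
query_params = {
    "QueryString": (
        "SELECT page, COUNT(*) AS hits "
        "FROM datalake.clickstream "
        "GROUP BY page ORDER BY hits DESC LIMIT 10"
    ),
    "QueryExecutionContext": {"Database": "datalake"},
    "ResultConfiguration": {
        "OutputLocation": "s3://example-query-results/athena/"
    },
}

# import boto3
# boto3.client("athena").start_query_execution(**query_params)
```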
- 45. Familiar Technologies Under the Covers
Presto: used for SQL queries; in-memory distributed query engine; ANSI SQL compatible, with extensions
Hive: used for DDL functionality; complex data types; multitude of formats; supports data partitioning
- 47. ETL is the most time-consuming part of analytics
ETL Data Warehousing Business Intelligence
80% of time
spent here
Amazon Redshift Amazon QuickSight
- 48. AWS Glue
Simple, flexible, cost-effective ETL
AWS Glue is a fully managed ETL (extract, transform, and load) service:
categorize your data, clean it, enrich it, and move it reliably
between various data stores.
Once cataloged, your data is immediately searchable and queryable
across your data silos.
Simple and cost-effective
Serverless; runs on a fully managed, scale-out Spark environment
- 50. Amazon Redshift: Fully Managed Petabyte-Scale Data Warehouse
• Relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• Supports standard ANSI SQL
• High performance: a lot faster, a lot simpler, a lot cheaper
- 51. Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
• Fast @ exabyte scale
• Elastic & highly available
• On-demand, pay-per-query
• High concurrency: multiple clusters access the same data
• No ETL: query data in place using open file formats
• Full Amazon Redshift SQL support
- 52. Real-Time Analytics with Amazon EMR and Amazon Kinesis
Stream: Amazon Kinesis ingests the stream and fans out to processors - a KCL app, AWS Lambda, Spark Streaming on Amazon EMR, and Amazon Kinesis Analytics
Process: notifications via Amazon SNS, real-time prediction via Amazon ML, alerts, KPIs
Store: Amazon ElastiCache (Redis) for app state, Amazon DynamoDB, Amazon RDS, Amazon ES, logs in Amazon S3
- 53. Case Study: Clickstream Analysis
Hearst Corporation monitors trending content for over 250 digital properties
worldwide and processes more than 30TB of data per day, using an architecture
that includes Amazon Kinesis and Spark running on Amazon EMR.
Store → Process | Analyze → Answers
- 54. [Architecture: clickstream → Amazon Kinesis → Spark on Amazon EMR → Amazon Redshift and Elasticsearch]
- 57. Choose the Right Tools
Amazon Redshift, Spectrum
Enterprise Data Warehouse
Amazon EMR
Hadoop/Spark
Amazon Athena
Clusterless SQL
AWS Glue
Clusterless ETL
Amazon Aurora
Managed Relational Database
Amazon Machine Learning
Predictive Analytics
Amazon QuickSight
Business Intelligence/Visualization
Amazon Elasticsearch Service
Elasticsearch
Amazon ElastiCache
Redis In-memory Datastore
Amazon DynamoDB
Managed NoSQL Database
Amazon Rekognition & Amazon Polly
Image Recognition & Text-to-Speech AI APIs
Amazon Lex
Voice or Text Chatbots
- 58. AWS Data Lake: You Don’t Have to Choose
• Storage: Amazon S3 data lake
• Ingest: Amazon Kinesis Streams & Firehose
• Clusterless ETL: AWS Glue
• Streaming analytics: AWS Lambda, Spark Streaming on EMR, Apache Storm on EMR, Apache Flink on EMR, Amazon Kinesis Analytics
• Hadoop/Spark: Amazon EMR; data science sandbox with any open source tool of choice on EC2
• Data warehouse: Amazon Redshift
• Clusterless SQL query: Amazon Athena
• Serving tier: Amazon DynamoDB (NoSQL database), Amazon Aurora (relational database), Amazon Elasticsearch Service, Amazon ElastiCache (Redis)
• Predictive analytics: Amazon Machine Learning
• Visualization / reporting
• Data sources: transactional data and more
- 59. Evolve as Needed
• Use S3 as the storage repository for your data lake, instead of a
Hadoop cluster or data warehouse
• Decoupled storage and compute is cheaper and more efficient to operate
• Decoupled storage and compute allows evolution to clusterless
architectures like Athena
• Do not build data silos in Hadoop or the enterprise DW
• Gain flexibility to use all the analytics tools in the ecosystem
around S3 and future-proof the architecture