John Mallory
7 September 2017
Building Data Lakes with AWS
Rethink how to become a data-driven business
• Business outcomes - start with the insights and actions you
want to drive, then work backwards to a streamlined design
• Experimentation - start small, test many ideas, keep the
good ones and scale those up, paying only for what you
consume
• Agile and timely - deploy data processing infrastructure in
minutes, not months. Take advantage of a rich platform of
services to respond quickly to changing business needs
Finding Value in Data is a Journey
Business Monitoring
Business Insights
New Business Opportunity
Business Optimization
Business Transformation
Evolving Tools and Infrastructure
Often Undertaken with Silos of Tools and Data
(Diagram: isolated silos of tools and data: Hadoop, Spark, and NoSQL engines working on raw data; storage arrays; databases and a data warehouse holding structured data; each silo with its own ETL, SQL, and advanced analytics paths.)
Legacy Data Warehouses & RDBMS
• Complex to set up and manage
• Do not scale
• Take months to add new data sources
• Queries take too long
• Cost $MM upfront
This Leads to Friction & Pain
• Challenging to move data across silos
• Forced to keep multiple copies of data
• Complex data transformation & governance
• Users struggle to find data they need
• Slows innovation and evolution
• Expensive
Enter the Data Lake Architecture
A data lake is a new and increasingly
popular architecture for storing and
analyzing massive volumes of
heterogeneous data.
Benefits of a Data Lake
• All Data in One Place
• Quick Ingest & Transformation
• Bring Functionality to the Data
• Schema on Read
Legacy Data Architectures Are Monolithic
Multiple layers of functionality
all on a single cluster
(Diagram: a Hadoop master node in front of worker nodes that each bundle CPU, memory, and HDFS storage on the same hardware, so compute and storage can only scale together.)
Consideration 1 – S3 for the Data Lake
Consolidate Data / Separate Storage & Compute
• Amazon S3 as the data lake storage tier; not a single analytics tool
like Hadoop or a data warehouse
• Decoupled storage and compute is cheaper and more efficient to
operate
• Decoupled storage and compute allows us to evolve to clusterless
architectures (e.g., AWS Lambda, Amazon Athena, Redshift Spectrum,
AWS Glue, Amazon Macie)
• Do not build data silos in Hadoop or an EDW
• Gain the flexibility to use all the analytics tools in the ecosystem
around S3 & future-proof the architecture
An AWS Data Lake Architecture
AWS Glue
ETL & Data Catalog
Serverless
Compute
AWS Lambda
Trigger-based Code Execution
Amazon Redshift Spectrum
Fast @ Exabyte scale
Amazon Athena
Interactive Query
Data
Processing
Amazon EMR
Managed Hadoop Applications
Amazon Redshift
Petabyte-scale Data Warehousing
Storage
Amazon S3
Exabyte-scale Object Storage
AWS Glue Data Catalog
Hive-compatible Metastore
• Nasdaq implements an S3 data lake + Redshift data warehouse
architecture
• The most recent two years of data are kept in the Redshift data
warehouse and snapshotted into S3 for disaster recovery
• Data between two and five years old is kept in S3
• Presto on EMR is used for ad hoc queries against data in S3
• Transitioned from an on-premises data warehouse to Amazon
Redshift & S3 data lake architecture
• Over 1,000 tables migrated
• Average daily ingest of over 7B rows
• Migrated off legacy DW to AWS (start to finish) in 7 man-months
• AWS costs were 43% of legacy budget for the same data set
(~1100 tables)
Nasdaq uses Presto on Amazon EMR and
Amazon Redshift as a tiered data lake
Full Presentation: https://www.youtube.com/watch?v=LuHxnOQarXU
Why Choose Amazon S3 for the Data Lake?
Durable: designed for 11 9s of durability
Secure: multiple encryption options; robust, highly flexible access controls
High performance: multipart upload; range GETs; scalable throughput
Scalable: store as much as you need; scale storage and compute independently; scale without limits; affordable
Integrated: Amazon EMR, Amazon Redshift/Spectrum, Amazon DynamoDB, Amazon Athena, Amazon Rekognition, AWS Glue
Easy to use: simple REST API; AWS SDKs; read-after-write consistency for new objects; event notifications; lifecycle policies; simple management tools; Hadoop compatibility
Optimize Costs with Data Tiering
• Use HDFS for very frequently accessed
(hot) data
• Use Amazon S3 Standard for frequently
accessed data
• Use Amazon S3 Standard – IA for less
frequently accessed data
• Use Amazon Glacier for archiving cold data
• Use Amazon S3 Analytics for storage class
analysis
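As a rough illustration of automating this tiering, the boto3 sketch below defines a lifecycle rule that moves objects to Standard-IA after 30 days and to Glacier after 90 days. The bucket name, prefix, and day thresholds are illustrative placeholders, not values from the deck.

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under a hypothetical "logs/" prefix to Standard-IA after
# 30 days and to Glacier after 90 days; "example-data-lake" is a placeholder.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-cold-data",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```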
Implement the right security controls in S3
Security:
 Identity and Access Management (IAM) policies
 Bucket policies
 Access Control Lists (ACLs)
 Private VPC endpoints to Amazon S3
 SSL endpoints
Encryption:
 Server-Side Encryption (SSE-S3)
 S3 Server-Side Encryption with provided keys (SSE-C, SSE-KMS)
 Client-side encryption
Compliance:
 Bucket access logs
 Lifecycle management policies
 Access Control Lists (ACLs)
 Versioning & MFA Delete
 Certifications – HIPAA, PCI, SOC 1/2/3, etc.
Manage your data with S3 Object Tags
Manage storage based on object tags
• Classify your data
• Tag your objects with key-value pairs
• Write policies once based on the type of data
Use cases: discoverability, access control, lifecycle policies
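A minimal boto3 sketch of tagging an object at ingest time follows; the bucket, key, and tag values are hypothetical. The policy on the next slide can then grant or deny access based on the HIPAA tag.

```python
import boto3

s3 = boto3.client("s3")

# Tag an existing object so access and lifecycle policies can key off it.
# Bucket, key, and tag values here are illustrative placeholders.
s3.put_object_tagging(
    Bucket="example-data-lake",
    Key="patients/2017/09/records.parquet",
    Tagging={"TagSet": [
        {"Key": "HIPAA", "Value": "True"},
        {"Key": "classification", "Value": "restricted"},
    ]},
)
```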
Manage S3 Security
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::EXAMPLE-BUCKET-NAME/*",
      "Condition": {"StringEquals": {"s3:ExistingObjectTag/HIPAA": "True"}}
    }
  ]
}
Manage permissions with tags
Access control by cluster tag and IAM roles
(Diagram: IAM user MyUser is mapped to an EMR cluster tagged user = MyUser through the EMR service role, the EC2 instance role, and an SSH key.)
Macie: A New Approach
Amazon Macie
Understand Your Data: Natural Language Processing (NLP)
Understand Data Access: Machine Learning
Amazon Macie Uses Machine Learning
• Uses behavioral analytics to baseline normal access behavior
• Trains and develops contextualized alerts by understanding the
value of the data being accessed
• Provides context for content
Business Critical Data in Amazon S3
• Static website content
• Source code
• SSL certificates, private
keys
• iOS and Android app
signing keys
• Database backups
• OAuth and cloud SaaS
API keys
Building Data Lakes in the AWS Cloud
Consideration 2 – Ingest & Catalog
AWS Snowball & Snowmobile
• Accelerate PBs with AWS-provided
appliances
• 50, 80, 100 TB models
• 100PB Snowmobile
AWS Storage Gateway
• Instant hybrid cloud
• Up to 120 MB/s cloud upload rate
(4x improvement)
Choose the Right Ingestion Methods
Amazon Kinesis Firehose
• Ingest device streams directly
into AWS data stores
AWS Direct Connect
• COLO to AWS
• Use native copy tools
Native/ISV Connectors
• Sqoop, Flume, DistCp
• Commvault, Veritas, etc
Amazon S3 Transfer
Acceleration
• Move data up to 300% faster
using CloudFront edge locations and AWS's backbone network
Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Redshift
and Elasticsearch
Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift, and other
destinations without writing an application or managing infrastructure.
Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data
destinations in as little as 60 secs using simple configurations.
Seamless elasticity: Seamlessly scales to match data throughput without intervention
Serverless ETL using AWS Lambda - Firehose can invoke your Lambda function to transform incoming
source data.
Capture and submit
streaming data
Analyze streaming data using
your favorite BI tools
Firehose loads streaming data
continuously into Amazon S3, Redshift
and Elasticsearch
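A hedged boto3 sketch of submitting a single event to a Firehose delivery stream is shown below; the delivery stream name and the event fields are hypothetical, and the stream is assumed to already deliver into S3.

```python
import json
import boto3

firehose = boto3.client("firehose")

# Submit one clickstream event to a delivery stream that buffers and writes
# to S3. "example-clickstream" is a placeholder delivery stream name.
event = {"user_id": "u-123", "page": "/home", "ts": "2017-09-07T12:00:00Z"}
firehose.put_record(
    DeliveryStreamName="example-clickstream",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```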
Catalog Your Data
(Flow: data sources put data into S3, a Lambda function extracts object metadata, and the metadata is stored in Amazon DynamoDB and indexed in Amazon Elasticsearch Service to provide search capabilities; see the sketch below.)
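One possible shape for the metadata-extraction function, assuming an S3 PUT event trigger and a hypothetical DynamoDB table named data-lake-catalog (a similar handler could index into Elasticsearch instead):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("data-lake-catalog")  # hypothetical catalog table

def handler(event, context):
    """Triggered by S3 PUT events; records object metadata for discovery."""
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        head = s3.head_object(Bucket=bucket, Key=key)
        table.put_item(Item={
            "object_key": key,
            "bucket": bucket,
            "size_bytes": head["ContentLength"],
            "content_type": head.get("ContentType", "unknown"),
            "last_modified": head["LastModified"].isoformat(),
        })
```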
Catalog with AWS Glue
Glue data catalog
Manage table metadata through a Hive metastore API or Hive SQL.
Supported by tools like Hive, Presto, Spark etc.
We added a few extensions:
 Search over metadata for data discovery
 Connection info – JDBC URLs, credentials
 Classification for identifying and parsing files
 Versioning of table metadata as schemas evolve and other metadata are
updated
Populate using Hive DDL, bulk import, or automatically through Crawlers.
Data Catalog: Crawlers
Automatically discovers new data and extracts schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless – only pay when crawler runs
Crawlers automatically build your Data Catalog and keep it in sync
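A minimal boto3 sketch of registering and starting a crawler over an S3 prefix; the crawler name, IAM role, database, path, and schedule are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Register a crawler over an S3 prefix; it infers schemas and partitions and
# writes table definitions into the Glue Data Catalog.
glue.create_crawler(
    Name="clickstream-crawler",
    Role="AWSGlueServiceRole-example",
    DatabaseName="datalake",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/clickstream/"}]},
    Schedule="cron(0 * * * ? *)",  # run hourly
)
glue.start_crawler(Name="clickstream-crawler")
```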
AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single
categorized list that is searchable
Data Catalog: Table details
Table schema
Table properties
Data statistics
Nested fields
Data Catalog: Version control
Keep a list of table versions and compare schema versions
Data Catalog: Detecting partitions
(Diagram: an S3 bucket hierarchy with prefixes such as month=Nov/date=10/ and date=15/, each holding files 1…N, mapped to a table definition with partition columns month and date (both strings) plus data columns col 1 (int) and col 2 (float). Glue estimates schema similarity among files at each level, e.g. sim=.99, .95, .93, to handle semi-structured logs and schema evolution.)
Consideration 3 – Optimizing Performance
Getting High Throughput Performance with S3
• S3 can scale to many thousands of requests per second
• You need a good key naming scheme
• Only at scale do you need to consider your key naming scheme
• What are partitions? Why do they matter?
• Spread keys lexicographically
• The goal of partitioning is to spread the heat
• Prevent hot spots
Distributing key names
Add randomness to the beginning of the key name…
<my_bucket>/6213-2013_11_13.jpg
<my_bucket>/4653-2013_11_13.jpg
<my_bucket>/9873-2013_11_13.jpg
<my_bucket>/4657-2013_11_13.jpg
<my_bucket>/1256-2013_11_13.jpg
<my_bucket>/8345-2013_11_13.jpg
<my_bucket>/0321-2013_11_13.jpg
<my_bucket>/5654-2013_11_13.jpg
<my_bucket>/2345-2013_11_13.jpg
<my_bucket>/7567-2013_11_13.jpg
<my_bucket>/3455-2013_11_13.jpg
<my_bucket>/4313-2013_11_13.jpg
Partitions:
<my_bucket>/0
<my_bucket>/1
<my_bucket>/2
<my_bucket>/3
<my_bucket>/4
<my_bucket>/5
<my_bucket>/6
<my_bucket>/7
<my_bucket>/8
<my_bucket>/9
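One deterministic way to generate such prefixes is to hash the original key and keep a short prefix; the helper below is a sketch of that idea (the partition count and key names are illustrative).

```python
import hashlib

def prefixed_key(original_key: str, partitions: int = 10) -> str:
    """Prepend a short hash-derived prefix so keys spread across partitions."""
    digest = hashlib.md5(original_key.encode("utf-8")).hexdigest()
    prefix = int(digest, 16) % partitions
    return "{}-{}".format(prefix, original_key)

print(prefixed_key("2013_11_13.jpg"))  # prints something like "7-2013_11_13.jpg"
```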
Data Recommendations for EMR and S3
Performance Best Practices:
• Reduce the number of S3 objects by aggregating small files
into larger ones (s3-dist-cp with the --groupBy option; see the sketch below)
• Goal: files > 128 MB
• Use EMRFS with Consistent View
• Parquet with Snappy compression is emerging as the
best-performing format and compression combination
• Reverse the partition scheme to HOUR, DAY, MONTH,
YEAR so the fastest-changing component comes first in the key
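A hedged sketch of submitting an s3-dist-cp compaction step to an existing EMR cluster with boto3; the cluster ID, S3 paths, grouping pattern, and target size are all placeholders.

```python
import boto3

emr = boto3.client("emr")

# Add an s3-dist-cp step that compacts small objects into ~128 MB files,
# grouped by a date pattern in the key. All identifiers are placeholders.
emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLE12345",
    Steps=[{
        "Name": "compact-small-files",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://example-data-lake/raw/clickstream/",
                "--dest", "s3://example-data-lake/compacted/clickstream/",
                "--groupBy", ".*(\\d{4}-\\d{2}-\\d{2}).*",
                "--targetSize", "128",
            ],
        },
    }],
)
```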
Use the Right Data Formats
• Pay by the amount of data scanned per query
• Use Compressed Columnar Formats
• Parquet
• ORC
• Easy to integrate with a wide variety of tools
Logs stored as text files: 1 TB on Amazon S3; query run time 237 seconds; 1.15 TB scanned; cost $5.75
Logs stored in Apache Parquet format*: 130 GB on Amazon S3; query run time 5.13 seconds; 2.69 GB scanned; cost $0.013
Savings: 87% less storage with Parquet; 34x faster; 99% less data scanned; 99.7% cheaper
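A short PySpark sketch of the kind of conversion behind those numbers: raw CSV logs rewritten as partitioned, Snappy-compressed Parquet. The paths and the year/month partition columns are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# Convert raw CSV logs into partitioned, Snappy-compressed Parquet so engines
# like Athena scan far less data. Paths and column names are placeholders.
spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

logs = spark.read.csv("s3://example-data-lake/raw/logs/",
                      header=True, inferSchema=True)
(logs.write
     .mode("overwrite")
     .partitionBy("year", "month")
     .option("compression", "snappy")
     .parquet("s3://example-data-lake/curated/logs/"))
```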
Consideration 4 – Query in Place
Amazon Analytics End-to-End Architecture
(Diagram: an end-to-end stack centered on Amazon S3 and the Glue Data Catalog; Kinesis, the Database Migration Service, and other sources feed data in; EMR, Athena, Redshift Spectrum, and Glue process and query it; Amazon ML / MXNet, RDS, and QuickSight sit downstream; IAM provides access control.)
Explore Your Data Without ETL
Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL
Athena is Serverless
• No infrastructure or administration
• Zero spin-up time
• Transparent upgrades
Query Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• Athena supports multiple data formats
• Text, CSV, TSV, JSON, weblogs, AWS service logs
• Or convert to an optimized form like ORC or Parquet for the
best performance and lowest cost
• No ETL required
• Stream data directly from Amazon S3
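A minimal boto3 sketch of running an ad hoc Athena query against data already in S3; the database, table, query, and results bucket are hypothetical and must exist in your account.

```python
import boto3

athena = boto3.client("athena")

# Run an ad hoc SQL query directly against data in S3. The database, table,
# and results location below are placeholders.
resp = athena.start_query_execution(
    QueryString=("SELECT status, COUNT(*) AS hits "
                 "FROM weblogs GROUP BY status ORDER BY hits DESC LIMIT 10"),
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(resp["QueryExecutionId"])
```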
Familiar Technologies Under the Covers
Presto (used for SQL queries): in-memory distributed query engine; ANSI-SQL compatible with extensions
Hive (used for DDL functionality): complex data types; multitude of formats; supports data partitioning
What About ETL?
Raw Data Assets Transformed Into Usable Ones
ETL is the most time-consuming part of analytics
ETL Data Warehousing Business Intelligence
80% of time
spent here
Amazon Redshift Amazon QuickSight
AWS Glue
Simple, flexible, cost-effective ETL
 AWS Glue is a fully managed ETL (extract, transform, and load) service
 Categorize your data, clean it, enrich it and move it reliably
between various data stores
 Once catalogued, your data is immediately searchable and queryable
across your data silos
 Simple and cost-effective
 Serverless; runs on a fully managed, scale-out Spark environment
Build event-driven ETL pipelines
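A minimal Glue ETL job sketch: read a catalog table, drop null fields, and write Parquet back to S3. The database, table, and output path are placeholders, and the job is assumed to be created through the Glue console or API.

```python
import sys
from awsglue.transforms import DropNullFields
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Read a table registered in the Glue Data Catalog, clean it, and write
# Parquet to a curated S3 prefix. Names and paths are placeholders.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

raw = glue_context.create_dynamic_frame.from_catalog(
    database="datalake", table_name="clickstream_raw")
cleaned = DropNullFields.apply(frame=raw)

glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-data-lake/curated/clickstream/"},
    format="parquet",
)
```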
Amazon Redshift: Fully Managed Petabyte-scale Data Warehouse
• Relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• Supports standard ANSI SQL
• High performance
A lot faster, a lot simpler, a lot cheaper
Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast at exabyte scale; elastic & highly available; on-demand, pay-per-query
• High concurrency: multiple clusters access the same data
• No ETL: query data in place using open file formats
• Full Amazon Redshift SQL support
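A hedged sketch of using Spectrum from Python: register the Glue Data Catalog as an external schema, then join external S3 data with a local Redshift table. Connection details, the IAM role ARN, and table names are placeholders.

```python
import psycopg2

# Connect to a Redshift cluster (all connection values are placeholders).
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="...")
conn.autocommit = True
cur = conn.cursor()

# Expose the Glue Data Catalog database "datalake" as an external schema.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'datalake'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
""")

# Join external S3 data with a local Redshift table.
cur.execute("""
    SELECT c.region, COUNT(*) AS clicks
    FROM spectrum.clickstream_raw s
    JOIN customers c ON c.user_id = s.user_id
    GROUP BY c.region
""")
print(cur.fetchall())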
Real-time Analytics
(Diagram: streaming data is captured by Amazon Kinesis and fanned out to a KCL app, AWS Lambda, Spark Streaming on Amazon EMR, and Amazon Kinesis Analytics for processing; downstream targets include Amazon SNS for notifications and alerts, Amazon ML for real-time predictions, Amazon ElastiCache (Redis) and Amazon DynamoDB for app state, Amazon RDS and Amazon ES for KPIs, and Amazon S3 for log storage.)
Case Study: Clickstream Analysis
Hearst Corporation monitors trending content for over 250 digital properties
worldwide and processes more than 30TB of data per day, using an architecture
that includes Amazon Kinesis and Spark running on Amazon EMR.
Store → Process | Analyze → Answers
(Diagram: clickstream data flows through Amazon Kinesis to Spark on Amazon EMR for processing, with results landing in Amazon EMR, Amazon Redshift, and Elasticsearch.)
Interactive & Batch Analytics
(Diagram: files and streams are ingested via Amazon Kinesis Firehose and Amazon Kinesis Analytics and stored in Amazon S3; Amazon EMR (Hive, Pig, Spark) handles batch processing, Amazon EMR (Presto, Spark) and Amazon Athena serve interactive queries, Amazon Redshift provides warehousing, and Amazon ML delivers batch and real-time predictions for downstream consumption.)
(Diagram: an Ingest/Collect → Store → Process/Analyze → Consume/Visualize pipeline for a real-estate use case. Data about users, properties, and agents is ingested through Amazon Kinesis into an Amazon S3 data lake, processed with Amazon EMR and Amazon Redshift, and served through Amazon DynamoDB and BI/reporting tools to deliver answers and insights: user profile recommendations, hot homes, similar homes, agent follow-up, agent scorecards, marketing, A/B testing, and real-time data.)
Choose the Right Tools
Amazon Redshift, Spectrum
Enterprise Data Warehouse
Amazon EMR
Hadoop/Spark
Amazon Athena
Clusterless SQL
AWS Glue
Clusterless ETL
Amazon Aurora
Managed Relational Database
Amazon Machine Learning
Predictive Analytics
Amazon QuickSight
Business Intelligence/Visualization
Amazon Elasticsearch Service
Elasticsearch
Amazon ElastiCache
Redis In-memory Datastore
Amazon DynamoDB
Managed NoSQL Database
Amazon Rekognition & Amazon Polly
Image Recognition & Text-to-Speech AI APIs
Amazon Lex
Voice or Text Chatbots
Amazon S3
Data Lake
Amazon Kinesis
Streams & Firehose
Hadoop / Spark
Streaming Analytics Tools
Amazon Redshift
Data Warehouse
Amazon DynamoDB
NoSQL Database
AWS Lambda
Spark Streaming
on EMR
Amazon
Elasticsearch Service
Relational Database
Amazon EMR
Amazon Aurora
Amazon Machine Learning
Predictive Analytics
Any Open Source Tool
of Choice on EC2
AWS Data Lake
You Don’t
Have to
Choose
Data Science Sandbox
Visualization /
Reporting
Apache Storm
on EMR
Apache Flink
on EMR
Amazon Kinesis
Analytics
Serving Tier
Clusterless SQL Query
Amazon Athena
Data Sources
Transactional Data
Amazon Glue
Clusterless ETL
Amazon ElastiCache
Redis
Use S3 as the storage repository for your data lake, instead
of a Hadoop cluster or data warehouse
Decoupled storage and compute is cheaper and more efficient
to operate
Decoupled storage and compute allow us to evolve to
clusterless architectures like Athena
Do not build data silos in Hadoop or the Enterprise DW
Gain flexibility to use all the analytics tools in the ecosystem
around S3 & future-proof the architecture
Evolve as Needed