© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Russell Nash – AWS Solutions Architect, AWS
Building
a Modern Data Architecture
on AWS
In partnership with:
SCALABLE
FLEXIBLE
MANAGEABLE
COST EFFECTIVE
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
Modern Data Architecture
AWS CloudTrail
AWS IAM
Amazon CloudWatch
AWS KMS
Database Analytics Flat File
Processing
Real-time
Pipeline
Data Lake
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
Database Analytics
Amazon
Redshift
Source
Database
MPP SQL Database
Optimised for Analytics
Gigabytes to Petabytes
Fully relational
Amazon
Redshift
Building a Modern Data Architecture on AWS - Webinar
ID Name
1 John Smith
2 Jane Jones
3 Peter Black
4 Pat Partridge
5 Sarah Cyan
6 Brian Snail
1 John Smith
4 Pat Partridge
2 Jane Jones
5 Sarah Cyan
3 Peter Black
6 Brian Snail
1 John Smith
4 Pat Partridge
2 Jane Jones
5 Sarah Cyan
3 Peter Black
6 Brian Snail
SQL
SQL SQL SQL
Results Results Results
Results
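The six-row table on this slide is split into pairs (1 and 4, 2 and 5, 3 and 6), which suggests rows being dealt out round-robin to three slices, with each slice running the same SQL on its own share before the partial results are combined. A minimal Python sketch of that idea follows. The function names and the three-slice count are illustrative assumptions, and this is not a description of Redshift internals.

```python
from concurrent.futures import ThreadPoolExecutor

# The six rows shown on the slide.
ROWS = [
    (1, "John Smith"), (2, "Jane Jones"), (3, "Peter Black"),
    (4, "Pat Partridge"), (5, "Sarah Cyan"), (6, "Brian Snail"),
]


def distribute_round_robin(rows, slice_count):
    """Deal rows out to slices in turn, like dealing cards."""
    slices = [[] for _ in range(slice_count)]
    for position, row in enumerate(rows):
        slices[position % slice_count].append(row)
    return slices


def scan_slice(slice_rows, predicate):
    """Each slice filters only the rows it holds."""
    return [row for row in slice_rows if predicate(row)]


def parallel_query(slices, predicate):
    """Run the same scan on every slice in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(lambda s: scan_slice(s, predicate), slices))
    merged = [row for partial in partials for row in partial]
    return sorted(merged)


slices = distribute_round_robin(ROWS, 3)
# Hypothetical query: people whose surname starts with "S".
surname_s = parallel_query(slices, lambda row: row[1].split()[1].startswith("S"))
print([[row[0] for row in s] for s in slices])
print(surname_s)
```

Adding slices shrinks each slice's share of the scan, which is the intuition behind scaling the same SQL from gigabytes to petabytes.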
160 GB
2 PB
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
ETL
Amazon
Redshift
Source
Database
Database Analytics
AWS Database
Migration Service
Amazon
Redshift
Source
Database
ETL
Data Integration
Partners
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
ELT
Amazon
Redshift
Amazon
Redshift
Source
Database
Database Analytics
https://aws.amazon.com/solutions/case-studies/boingo-wireless/
Database Analytics Flat File
Processing
Real-time
Pipeline
Data Lake
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
Batch Processing
Flat
Files
Amazon
S3
Amazon
S3
Object Storage
Low Cost
Highly Scalable
Eleven 9s of durability (99.999999999%)
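To put the durability figure in perspective, the arithmetic below treats the 99.999999999% design target as an average annual loss rate per object. The object count of 10 million is an illustrative assumption, and the function names are hypothetical.

```python
from fractions import Fraction

# 99.999999999% expressed exactly, to avoid floating-point rounding.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count):
    """Average number of objects lost per year at the stated durability."""
    return object_count * (1 - DURABILITY)


def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


# Assumed workload: 10 million stored objects.
print(years_per_lost_object(10_000_000))
```

Under these assumptions, storing 10 million objects corresponds to losing one object on average every 10,000 years.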
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
Flat
Files
Amazon
S3
Batch Processing
AWS
Snowball
AWS
CLI & SDK
In pioneer days, they used oxen for heavy pulling,
and when one ox couldn’t budge a log,
they didn’t try to grow a bigger ox.
Grace Hopper
PIG
Infrastructure
Data Layer
Process Layer
Framework
Applications
PIG
SQL
Infrastructure
Data Layer
Process Layer
Framework
Applications
PIG
SQL
Amazon
EMR
PIG
SQL
Amazon
EMR
Amazon
S3
EMRFS
Amazon
EMR
• Managed Hadoop
• Optimized with S3
• Open Source Support
Compute Flexibility
Compute Memory Storage
Machine Learning
C4 Family
C3 Family
X1 Family
R3 Family
Interactive Analysis
D2 Family
I2 Family
Large HDFS
General
Batch Process
M4 Family
M3 Family
Cost & Time
# CPUs
Time
# CPUs
Time
Wall clock time: 1 hour
Wall clock time: 10 hours
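The two charts make the point that the same amount of work costs the same whether it runs on few CPUs for a long time or many CPUs for a short time. The sketch below shows that arithmetic, assuming a perfectly parallel job billed per node-hour. The node counts and the price of 1.00 per node-hour are hypothetical values chosen only to reproduce the 10-hour and 1-hour wall clock times.

```python
def cluster_run(total_node_hours, node_count, price_per_node_hour):
    """Return (wall_clock_hours, cost) for a perfectly parallel job."""
    wall_clock_hours = total_node_hours / node_count
    cost = node_count * wall_clock_hours * price_per_node_hour
    return wall_clock_hours, cost


# Assumed job size: 10 node-hours of work at a hypothetical 1.00 per node-hour.
small_cluster = cluster_run(total_node_hours=10, node_count=1, price_per_node_hour=1.0)
large_cluster = cluster_run(total_node_hours=10, node_count=10, price_per_node_hour=1.0)
print(small_cluster)  # 10 hours of wall clock time
print(large_cluster)  # 1 hour of wall clock time, same cost
```

Real jobs rarely scale perfectly, so treat this as the best case rather than a prediction.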
Spot Price – M3.2XL
On-Demand: $0.75
Spot Price: $0.08
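The saving implied by these two figures can be worked out directly. The sketch below assumes that $0.75 is the hourly On-Demand price and $0.08 the hourly Spot price for the M3.2XL instance on the slide. Spot prices fluctuate, so the result is a snapshot rather than a guaranteed discount.

```python
from decimal import Decimal

# Assumed reading of the slide: hourly prices in US dollars.
ON_DEMAND_PRICE = Decimal("0.75")
SPOT_PRICE = Decimal("0.08")


def spot_saving_percent(on_demand, spot):
    """Percentage saved by paying the Spot price instead of On-Demand."""
    return (1 - spot / on_demand) * 100


saving = round(spot_saving_percent(ON_DEMAND_PRICE, SPOT_PRICE), 1)
print(f"Spot saving: {saving}%")
```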
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
Flat
Files
Amazon
S3
Batch Processing
Amazon
EMR
Amazon
S3
AWS
Glue
AWS
Snowball
AWS
CLI & SDK
AWS
Glue
• Managed Transform Engine
• Job Scheduler
• Data Catalog
• Built on Apache Spark
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
Flat
Files
Amazon
S3
Batch Processing
Amazon
EMR
Amazon
S3
AWS
Glue
Amazon
Redshift
Amazon
EMR
AWS
Snowball
AWS
CLI & SDK
PIG
SQL
Amazon
EMR
Amazon
S3
EMRFS
R
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
Flat
Files
Amazon
S3
Batch Processing
Amazon
EMR
Amazon
S3
AWS
Glue
Amazon
Redshift
Amazon
EMR
Amazon
Athena
AWS
Snowball
AWS
CLI & SDK
Amazon
Athena
Query S3 data with SQL
Serverless
Instant Spin-Up
Pay per Query
Athena
S3
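As a rough illustration of querying S3 data with SQL through Athena, the sketch below builds the request for the `start_query_execution` call in the AWS SDK for Python (boto3) and submits it. The table name, database name, and bucket are hypothetical placeholders, and boto3 is a third-party package that must be installed and configured with credentials before `start_query` will work.

```python
def build_athena_request(sql, database, output_location):
    """Assemble the keyword arguments for Athena's start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(sql, database, output_location):
    """Submit the query and return its execution ID (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above has no dependencies

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **build_athena_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]


# Hypothetical names throughout: weblogs, example_db, example-bucket.
request = build_athena_request(
    "SELECT status, COUNT(*) FROM weblogs GROUP BY status",
    "example_db",
    "s3://example-bucket/athena-results/",
)
print(request)
```

There is no cluster to start first: the query is submitted and the results land in the S3 output location.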
Comparison of SQL Processing engines
Engine | Data Structure | Languages | Data Store
(engine logo not captured) | Semi | SQL, HiveQL | S3/HDFS
(engine logo not captured) | Semi | SQL | S3/HDFS
Amazon Athena | Semi | SQL | S3
Amazon Redshift | Full | SQL | Local
Performance
Comparison of SQL Processing engines
Engine | Use Case
(engine logo not captured) | Transformation
(engine logo not captured) | SQL Queries for S3/HDFS
Amazon Redshift | Fully Featured SQL Database
Amazon Athena | Serverless SQL Queries for S3
https://aws.amazon.com/solutions/case-studies/finra/
Database Analytics Flat File
Processing
Real-time
Pipeline
Data Lake
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
Real-time Pipeline
Amazon
Kinesis
Machines
Devices
Mobile
Clickstream
Availability
Zone
Availability
Zone
Availability
Zone
Amazon Kinesis
Stream
AWS Lambda
KCL App
Amazon EMR
Streaming
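One of the consumers shown here is AWS Lambda. A Lambda function triggered by a Kinesis stream receives batches of records whose payloads arrive base64-encoded under `Records[].kinesis.data`. The sketch below decodes such a batch, assuming the producers send JSON. The sample event, partition key, and field names are made up for illustration.

```python
import base64
import json


def decode_kinesis_records(event):
    """Decode the base64 payload of every Kinesis record in a Lambda event."""
    payloads = []
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw))  # assumes producers send JSON
    return payloads


def handler(event, context):
    """Lambda entry point: decode the batch and report how many records it held."""
    clicks = decode_kinesis_records(event)
    # A real function would forward these on, for example to S3 or a dashboard.
    return {"processed": len(clicks)}


# Hypothetical clickstream event, shaped like the batches Lambda delivers.
sample_event = {
    "Records": [
        {
            "kinesis": {
                "partitionKey": "device-1",
                "data": base64.b64encode(
                    json.dumps({"page": "/home"}).encode("utf-8")
                ).decode("ascii"),
            }
        }
    ]
}
print(handler(sample_event, None))
```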
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
Amazon
Kinesis
AWS Lambda
Application
Amazon EMR
Streaming
S3
(Log)
Amazon
Elasticsearch
(Dashboard)
Real-time Pipeline
Amazon
Elasticsearch
• Search and Analytics
• Scalable
• Fully Managed
• Integrated – Logstash, Kibana
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
Amazon
Kinesis
AWS Lambda
Application
Amazon EMR
Streaming
S3
(Logs)
Amazon
Elasticsearch
(Dashboards)
Amazon EMR
(Predictions)
ML
Amazon SNS
(Alerts)
Real-time Pipeline
Amazon
Redshift
(Analytics)
StreamAlert
https://medium.com/airbnb-engineering/streamalert-real-time-data-analysis-and-alerting-e8619e3e5043
Database Analytics Flat File
Processing
Real-time
Pipeline
Data Lake
Any data
Any analysis
Data Lake
Ingest Serving
Speed (Real-time)
Scale (Batch)
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Sources
Amazon
Kinesis
AWS Lambda
Application
Amazon EMR
Streaming
Amazon
EMR
Data Lake
Amazon
Redshift
ETL
Amazon
Athena
EC2
AWS
CLI & SDK
Amazon
S3
Amazon
EMR
Amazon
S3
AWS CloudTrail
AWS IAM
Amazon CloudWatch
AWS KMS
New X1 Instance - Tons of Memory
• Large-scale, in-memory applications
• Intel® Xeon® E7 8880 v3 Haswell processors
• Up to 2TB of memory
• Up to 128 vCPUs per instance
Intel® Processor Technologies
Intel® AVX – Dramatically increases performance for highly parallel HPC workloads such as life science engineering, data mining, financial analysis, media processing
Intel® AES-NI – Enhances security with new encryption instructions that reduce the performance penalty associated with encrypting/decrypting data
Intel® Turbo Boost Technology – Increases computing power with performance that adapts to spikes in workloads
Intel® Transactional Synchronization Extensions (TSX) – Enables execution of transactions that are independent to accelerate throughput
P state & C state control – Provides granular performance tuning for cores and sleep states to improve overall application performance
REGISTER NOW
http://amzn.to/2jFt11N
Complimentary labs are available only until 31 March 2017
Get hands-on experience working with AWS technology.
Access the complimentary Big Data on AWS self-paced labs
