AWS APAC Webinar Week - Big Data on AWS. Redshift, EMR, & IoT
- 4. What to expect from this session
• Big data challenges
• How to simplify big data processing
• What technologies should you use?
• Why?
• How?
• Reference architecture
• Design patterns
- 7. v
Plethora of Tools
Amazon Glacier, S3, DynamoDB, RDS, EMR, Amazon Redshift, Data Pipeline, Amazon Kinesis, Cassandra, CloudSearch, Kinesis-enabled app, Lambda, ML, SQS, ElastiCache, DynamoDB Streams
- 8. v
Is there a reference architecture?
What tools should I use?
How?
Why?
- 9. v
Architectural Principles
• Decoupled “data bus”
• Data → Store → Process → Answers
• Use the right tool for the job
• Data structure, latency, throughput, access patterns
• Use Lambda architecture ideas
• Immutable (append-only) log, batch/speed/serving layer
• Leverage AWS managed services
• No/low admin
• Big data ≠ big cost
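The Lambda architecture idea listed above (an immutable, append-only log feeding batch, speed, and serving layers) can be illustrated with a minimal in-memory sketch. The class name, the (user, amount) event shape, and the sample values are assumptions made for illustration only; a real deployment would use services such as those named in this deck rather than Python lists.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of the batch, speed, and serving layers (illustrative only)."""

    def __init__(self):
        self.log = []                # immutable, append-only master data
        self.batch_view = Counter()  # recomputed from the full log
        self.speed_view = Counter()  # covers events since the last batch run

    def append(self, user, amount):
        """Append an event to the log and update the speed layer."""
        self.log.append((user, amount))
        self.speed_view[user] += amount

    def run_batch(self):
        """Batch layer: rebuild the view from the whole log, reset speed view."""
        self.batch_view = Counter()
        for user, amount in self.log:
            self.batch_view[user] += amount
        self.speed_view.clear()

    def query(self, user):
        """Serving layer: merge the batch view with the recent speed view."""
        return self.batch_view[user] + self.speed_view[user]


store = LambdaSketch()
store.append("alice", 10)
store.append("bob", 5)
store.run_batch()          # batch view now covers the first two events
store.append("alice", 3)   # only the speed view has seen this event so far
alice_total = store.query("alice")
```

Because the log is never modified in place, the batch view can always be rebuilt from scratch, which is what makes the speed view disposable.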
- 10. v
Simplify Big Data Processing
[Diagram: Data → Collect → Store → Process → Analyze → Answers]
Stage labels: Data Collection and Storage, Data Processing, Event Processing, Data Analysis
Services shown: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon RDS (Aurora), AWS Lambda, KCL Apps, Amazon EMR, Amazon Redshift, Amazon Machine Learning
- 12. v
Types of Data
• Transactional
• Database reads & writes (OLTP)
• Cache
• Search
• Logs
• Streams
• File
• Log files (/var/log)
• Log collectors & frameworks
• Stream
• Log records
• Sensors & IoT data
[Diagram: Collect → Store. Sources: applications (web apps, mobile apps on iOS and Android), logging (Logstash), IoT. Data types: transactional data, search data, file data, stream data. Stores: database, search, file storage, stream storage.]
- 15. v
Stream Storage Options
• AWS managed services
• Amazon Kinesis → streams
• DynamoDB Streams → table + streams
• Amazon SQS → queue
• Amazon SNS → pub/sub
• Unmanaged
• Apache Kafka → stream
- 16. v
Why Stream Storage?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Streaming MapReduce
• Parallel consumption
[Diagram: Producers 1 to N write numbered, color-coded records (1 to 4 per color) to a Kafka topic, DynamoDB stream, or Kinesis stream with two shards/partitions (Shard 1 / Partition 1, Shard 2 / Partition 2). Consumer 1 reports count of red = 4 and count of violet = 4; Consumer 2 reports count of blue = 4 and count of green = 4.]
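The ordering and parallel-consumption points on this slide can be sketched in a few lines: records with the same key always land in the same shard, so per-key order is preserved, and one consumer per shard can count its own keys independently (the "streaming MapReduce" idea). This is a simplified in-memory model, not the Kinesis or Kafka API; the MD5-based `shard_for` function, the two-shard setup, and the use of a color as the partition key are assumptions for illustration.

```python
import hashlib
from collections import Counter, defaultdict

NUM_SHARDS = 2  # assumption: two shards/partitions, as drawn on the slide


def shard_for(partition_key: str) -> int:
    """Map a partition key to a shard using a stable hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_SHARDS


def produce(shards, color, seq):
    """Append a record to the shard that owns its key (here, the color)."""
    shards[shard_for(color)].append((color, seq))


shards = defaultdict(list)
for color in ["red", "violet", "blue", "green"]:
    for seq in range(1, 5):
        produce(shards, color, seq)

# One consumer per shard: each counts only the keys found in its own shard.
counts_per_shard = {
    shard: Counter(color for color, _ in records)
    for shard, records in shards.items()
}

# Merging the per-shard counts gives the global result.
total = sum(counts_per_shard.values(), Counter())
```

Which shard a given color maps to depends on the hash, so the split between consumers may differ from the diagram; the property that matters is that every color is counted by exactly one consumer.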
- 17. v
What About Queues & Pub/Sub ?
• Decouple producers & consumers/subscribers
• Persistent buffer
• Collect multiple streams
• No client ordering
• No parallel consumption for Amazon SQS
• Amazon SNS can route to multiple queues or AWS Lambda functions
• No streaming MapReduce
[Diagram: Producers → Amazon SQS queue → Consumers. Producers → Amazon SNS topic → Amazon SQS queue, AWS Lambda function, and Subscriber.]
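The difference between a queue and a pub/sub topic can be shown with a small in-memory model. The `Queue` and `Topic` classes below are hypothetical stand-ins written for this sketch, not the Amazon SQS or Amazon SNS APIs: a queue hands each message to a single receiver, while a topic delivers a copy to every subscriber, which may be another queue or a function.

```python
from collections import deque


class Queue:
    """Stand-in for a queue: each message is received exactly once."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Stand-in for a pub/sub topic: every subscriber gets a copy."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        for handler in self._subscribers:
            handler(message)


billing_queue, audit_queue = Queue(), Queue()
handled_by_function = []

topic = Topic()
topic.subscribe(billing_queue.send)          # topic -> first queue
topic.subscribe(audit_queue.send)            # topic -> second queue
topic.subscribe(handled_by_function.append)  # topic -> function

topic.publish("order-1")
```

After the publish call, each queue holds its own copy of the message and the function has been invoked once; a second `receive()` on either queue returns nothing.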
- 18. v
Which stream storage should I use?
Attribute | Amazon Kinesis | DynamoDB Streams | Amazon SQS / Amazon SNS | Kafka
Managed | Yes | Yes | Yes | No
Ordering | Yes | Yes | No | Yes
Delivery | at-least-once | exactly-once | at-least-once | at-least-once
Lifetime | 7 days | 24 hours | 14 days | Configurable
Replication | 3 AZ | 3 AZ | 3 AZ | Configurable
Throughput | No limit | No limit | No limit | ~ Nodes
Parallel clients | Yes | Yes | No (SQS) | Yes
MapReduce | Yes | Yes | No | Yes
Record size | 1 MB | 400 KB | 256 KB | Configurable
Cost | Low | Higher (table cost) | Low–Medium | Low (+ admin)
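One way to apply the comparison on this slide is to encode a few of its rows as data and filter on the properties a workload requires. The sketch below transcribes only the Managed, Ordering, and Parallel clients rows; the dictionary layout and the `candidates` helper are illustrative assumptions, not part of any AWS tooling.

```python
# Values transcribed from the Managed, Ordering, and Parallel clients rows.
STREAM_STORES = {
    "Amazon Kinesis": {"managed": True, "ordering": True, "parallel_clients": True},
    "DynamoDB Streams": {"managed": True, "ordering": True, "parallel_clients": True},
    "Amazon SQS / Amazon SNS": {"managed": True, "ordering": False, "parallel_clients": False},
    "Kafka": {"managed": False, "ordering": True, "parallel_clients": True},
}


def candidates(**required):
    """Return the stores whose properties match every required value."""
    return [
        name
        for name, props in STREAM_STORES.items()
        if all(props[key] == value for key, value in required.items())
    ]


managed_and_ordered = candidates(managed=True, ordering=True)
ordered_only = candidates(ordering=True)
```

A filter like this only narrows the field; rows such as lifetime, record size, and cost still need to be weighed against the workload.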
- 20. v
Why Is Amazon S3 Good for Big Data?
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot instances
• Multiple distinct (Spark, Hive, Presto) clusters can use the same data
• Unlimited number of objects
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policy
• Secure – SSL, client/server-side encryption at rest
• Low cost
- 21. v
What about HDFS & Amazon Glacier?
• Use HDFS for very frequently accessed (hot) data
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard – IA for infrequently accessed data
• Use Amazon Glacier for archiving cold data
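Moving data from S3 Standard to Standard – IA and then to Amazon Glacier is typically expressed as an S3 lifecycle configuration. The sketch below only builds the configuration as a Python dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; it does not call AWS. The rule ID, the 30-day and 90-day thresholds, and the function name are assumptions chosen for illustration.

```python
def build_lifecycle_configuration(ia_after_days=30, glacier_after_days=90, prefix=""):
    """Build a lifecycle configuration that tiers objects as they age.

    The day thresholds are illustrative defaults, not recommendations.
    """
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": "tier-by-age",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = build_lifecycle_configuration()

# With boto3 installed and credentials configured, a dictionary like this
# could be applied to a bucket (the bucket name here is hypothetical):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```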
- 22. v
Database + Search Tier
[Diagram: Collect → Store. Sources: applications (web apps, mobile apps on iOS and Android), logging (Logstash), IoT. Data types: transactional data, search data, file data, stream data. Stores by category: Search (Amazon ES), SQL (Amazon RDS), NoSQL (Amazon DynamoDB), Cache (Amazon ElastiCache), stream storage (Apache Kafka, Amazon Kinesis, Amazon DynamoDB), file storage (Amazon S3, Amazon Glacier).]
- 24. v
Best Practice — Use the Right Tool for the Job
Data Tier (Database + Search Tier)
• Search: Amazon Elasticsearch Service, Amazon CloudSearch
• Cache: Redis, Memcached
• SQL: Amazon Aurora, MySQL, PostgreSQL, Oracle, SQL Server
• NoSQL: Cassandra, Amazon DynamoDB, HBase, MongoDB
- 26. v
What Data Store Should I Use?
• Data structure → Fixed schema, JSON, key-value
• Access patterns → Store data in the format you will access it
• Data / access characteristics → Hot, warm, cold
• Cost → Right cost
- 27. v
Data Structure and Access Patterns
Access patterns | What to use?
Put/Get (key, value) | Cache, NoSQL
Simple relationships (1:N, M:N) | NoSQL
Cross-table joins, transactions, SQL | SQL
Faceting, search | Search

Data structure | What to use?
Fixed schema | SQL, NoSQL
Schema-free (JSON) | NoSQL, Search
(Key, value) | Cache, NoSQL
- 29. v
Data / Access Characteristics: Hot, Warm, Cold
Characteristic | Hot | Warm | Cold
Volume | MB–GB | GB–TB | PB
Item size | B–KB | KB–MB | KB–TB
Latency | ms | ms, sec | min, hrs
Durability | Low–High | High | Very High
Request rate | Very High | High | Low
Cost/GB | $$–$ | $–¢¢ | ¢
- 31. v
What Data Store Should I Use?
Attribute | Amazon ElastiCache | Amazon DynamoDB | Amazon Aurora | Amazon Elasticsearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier
Average latency | ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~ size) | hrs
Data volume | GB | GB–TBs (no limit) | GB–TB (64 TB max) | GB–TB | GB–PB (~ nodes) | MB–PB (no limit) | GB–PB (no limit)
Item size | B–KB | KB (400 KB max) | KB (64 KB) | KB (1 MB max) | MB–GB | KB–GB (5 TB max) | GB (40 TB max)
Request rate | High – Very High | Very High (no limit) | High | High | Low – Very High | Low – Very High (no limit) | Very Low
Storage cost (GB/month) | $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢ | ¢/10
Durability | Low – Moderate | Very High | Very High | High | High | Very High | Very High
Columns run from hot data (left) through warm data to cold data (right).
- 34. v
Process / Analyze
• Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
• Examples
• Interactive dashboards → Interactive analytics
• Daily/weekly/monthly reports → Batch analytics
• Billing/fraud alerts, 1 minute metrics → Real-time analytics
• Sentiment analysis, prediction models → Machine learning
- 36. v
Batch Analysis
• Takes a large amount of (warm/cold) data
• Takes minutes or hours to get answers back
• Example: Generating daily, weekly, or monthly reports
- 37. v
Real-Time Analytics
• Takes a small amount of hot data and asks questions
• Takes a short amount of time (milliseconds or seconds) to get your answer back
• Real-time (event)
• Real-time response to events in data streams
• Example: Billing/Fraud Alerts
• Near real-time (micro-batch)
• Near real-time operations on small batches of events in data streams
• Example: 1 Minute Metrics
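The two flavors of real-time analytics described above can be contrasted in a short sketch: an event-style check reacts to each record as it arrives, while a micro-batch aggregates records into one-minute tumbling windows. The event tuples, the alert threshold, and the variable names are assumptions made for illustration; in practice the records would arrive from a stream rather than a list.

```python
from collections import defaultdict

# Hypothetical events: (epoch_seconds, account, amount)
events = [
    (0, "a", 20.0),
    (15, "b", 5000.0),
    (59, "a", 30.0),
    (61, "a", 10.0),
    (125, "b", 7.5),
]

ALERT_THRESHOLD = 1000.0  # assumption: amounts above this trigger an alert
WINDOW_SECONDS = 60       # one-minute metrics

alerts = []
per_minute_totals = defaultdict(float)

for ts, account, amount in events:
    # Real-time (event): respond to each record individually.
    if amount > ALERT_THRESHOLD:
        alerts.append((ts, account, amount))
    # Near real-time (micro-batch): accumulate into tumbling windows,
    # keyed by the second at which each window starts.
    window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
    per_minute_totals[window_start] += amount
```

The alert is available as soon as the offending record is seen, whereas a window total is only final once its minute has passed.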
- 38. v
Predictions via Machine Learning
• ML gives computers the ability to learn without being explicitly programmed
• Machine Learning Algorithms:
- Supervised Learning ← “teach” program
- Classification ← Is this transaction fraud? (Yes/No)
- Regression ← Customer lifetime value?
- Unsupervised Learning ← let it learn by itself
- Clustering ← Market Segmentation
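The contrast between supervised and unsupervised learning can be made concrete with two toy routines in plain Python. The first learns from labeled transaction amounts and classifies new ones by nearest class mean; the second groups unlabeled values into two clusters with a simple one-dimensional k-means. The data, labels, and function names are invented for illustration, and these are deliberately simplistic stand-ins for the algorithms a library such as Spark ML or Amazon ML would provide.

```python
from statistics import mean

# Supervised: labeled examples "teach" the program.
labeled = [(12.0, "ok"), (15.0, "ok"), (900.0, "fraud"), (1100.0, "fraud")]


def train_centroids(examples):
    """Compute the mean amount for each label."""
    by_label = {}
    for amount, label in examples:
        by_label.setdefault(label, []).append(amount)
    return {label: mean(amounts) for label, amounts in by_label.items()}


def classify(centroids, amount):
    """Return the label whose mean is closest to the given amount."""
    return min(centroids, key=lambda label: abs(centroids[label] - amount))


centroids = train_centroids(labeled)


# Unsupervised: no labels; group similar values (1-D k-means with k = 2).
def two_means(values, iterations=10):
    low, high = min(values), max(values)
    left, right = list(values), []
    for _ in range(iterations):
        left = [v for v in values if abs(v - low) <= abs(v - high)]
        right = [v for v in values if abs(v - low) > abs(v - high)]
        if not right:
            break  # all values fell into one cluster
        low, high = mean(left), mean(right)
    return sorted(left), sorted(right)


segments = two_means([1.0, 2.0, 3.0, 50.0, 52.0])
```

Here `classify(centroids, 20.0)` falls on the "ok" side and `classify(centroids, 800.0)` on the "fraud" side, while `two_means` separates the small values from the large ones without ever being told what the groups mean.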
- 39. v
Analysis Tools and Frameworks
• Machine Learning
• Mahout, Spark ML, Amazon ML
• Interactive Analytics
• Amazon Redshift, Presto, Impala, Spark
• Batch Processing
• MapReduce, Hive, Pig, Spark
• Stream Processing
• Micro-batch: Spark Streaming, KCL, Hive, Pig
• Real-time: Storm, AWS Lambda, KCL
[Diagram: Analyze stage, grouped into stream processing, batch, interactive, and ML. Tools shown: Amazon Redshift, Impala, Pig, Amazon Machine Learning, Streaming, Amazon Kinesis, AWS Lambda, Amazon Elastic MapReduce.]
- 40. v
What Stream Processing Technology Should I Use?
Attribute | Spark Streaming | Apache Storm | Amazon Kinesis Client Library | AWS Lambda | Amazon EMR (Hive, Pig)
Scale / throughput | ~ Nodes | ~ Nodes | ~ Nodes | Automatic | ~ Nodes
Batch or real-time | Real-time | Real-time | Real-time | Real-time | Batch
Manageability | Yes (Amazon EMR) | Do it yourself | Amazon EC2 + Auto Scaling | AWS managed | Yes (Amazon EMR)
Fault tolerance | Single AZ | Configurable | Multi-AZ | Multi-AZ | Single AZ
Programming languages | Java, Python, Scala | Any language via Thrift | Java, via MultiLangDaemon (.NET, Python, Ruby, Node.js) | Node.js, Java | Hive, Pig, Streaming languages
Query Latency
High
- 41. v
What Data Processing Technology Should I Use?
Attribute | Amazon Redshift | Impala | Presto | Spark | Hive
Query latency | Low | Low | Low | Low | Medium (Tez) – High (MapReduce)
Durability | High | High | High | High | High
Data volume | 1.6 PB max | ~ Nodes | ~ Nodes | ~ Nodes | ~ Nodes
Managed | Yes | Yes (EMR) | Yes (EMR) | Yes (EMR) | Yes (EMR)
Storage | Native | HDFS / S3A* | HDFS / S3 | HDFS / S3 | HDFS / S3
SQL compatibility | High | Medium | High | Low (SparkSQL) | Medium (HQL)
HighMedium
- 44. v
Collect → Store → Analyze → Consume
[Diagram: end-to-end pipeline.
Collect: applications (web apps, mobile apps on iOS and Android), logging (Logstash), IoT; producing transactional data, search data, file data, and stream data.
Store: Search (Amazon ES), SQL (Amazon RDS), NoSQL (Amazon DynamoDB), Cache (Amazon ElastiCache), file storage (Amazon S3, Amazon Glacier), stream storage (Apache Kafka, Amazon Kinesis, Amazon DynamoDB); stores are labeled hot, warm, or cold.
Analyze (ETL): stream processing, batch, interactive, and ML, labeled from fast to slow; tools shown are Amazon Redshift, Impala, Pig, Amazon ML, Streaming, Amazon Kinesis, AWS Lambda, and Amazon Elastic MapReduce.
Consume: analysis & visualization (Amazon QuickSight), notebooks, predictions, apps & APIs, IDE.]
- 45. v
Consume
• Predictions
• Analysis and Visualization
• Notebooks
• IDE
• Applications & API
[Diagram: Store → Analyze (ETL) → Consume. Consumption options: analysis & visualization (Amazon QuickSight), notebooks, predictions, apps & APIs, IDE. Audiences: business users; data scientists and developers.]
- 47. v
Reference Architecture
[Diagram: the complete Collect → Store → Analyze → Consume pipeline. It combines the data sources (applications, logging, IoT), the storage categories (search, SQL, NoSQL, cache, file storage, stream storage) labeled hot, warm, and cold, the analysis categories (stream processing, batch, interactive, ML) labeled fast to slow, and the consumption options (Amazon QuickSight, notebooks, predictions, apps & APIs, IDE).]
- 49. v
Multi-Stage Decoupled “Data Bus”
• Multiple stages
• Storage decoupled from processing
[Diagram: Data → Store → Process → Store → Process → Answers, alternating store and process stages]
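The multi-stage pattern can be sketched as processing steps that communicate only through stores, so each stage can be rerun, replaced, or scaled without touching the others. The in-memory lists below stand in for real stores, and the function names and sample records are assumptions made for illustration.

```python
raw_store = []      # first store: records exactly as collected
cleaned_store = []  # second store: output of the first processing stage


def ingest(record):
    """Collect: write incoming data to the first store."""
    raw_store.append(record)


def clean():
    """First process stage: read raw records, write normalized records.

    The output store is rebuilt on each run, so rerunning is safe.
    """
    cleaned_store[:] = [record.strip().lower() for record in raw_store]


def answer():
    """Second process stage: read cleaned records and produce an answer."""
    return len(set(cleaned_store))


for record in [" Red ", "red", "Blue"]:
    ingest(record)

clean()
distinct_values = answer()
```

Neither `clean` nor `answer` calls the other; each reads from one store and, where needed, writes to the next, which is the decoupling the slide describes.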
- 50. v
Multiple Processing Applications (or Connectors) Can Read from or Write to Multiple Data Stores
[Diagram: Data → Amazon Kinesis (store) → AWS Lambda (process) → Amazon DynamoDB (store); Amazon Kinesis (store) → Amazon Kinesis S3 Connector (process) → Amazon S3 (store)]
- 51. v
Processing Frameworks (KCL, Storm, Hive, Spark, etc.) Could Read from Multiple Data Stores
[Diagram: Data → Amazon Kinesis → AWS Lambda → Amazon DynamoDB; Amazon Kinesis → Amazon Kinesis S3 Connector → Amazon S3. Storm, Hive, and Spark read from these stores to produce answers. Stages alternate between store and process.]
- 52. Data Temperature vs Processing Latency
[Chart: processing latency (low to high) plotted against data temperature (hot to cold), from data to answers.
Stores shown: Amazon Kinesis, Apache Kafka, Amazon DynamoDB, Amazon S3, Amazon EMR (HDFS).
Processing tools shown: Spark Streaming, Apache Storm, AWS Lambda, KCL, Amazon Redshift, Spark, Impala, Presto, Hive.
Other labels: Native, Batch.]
- 57. Online Labs & Training
Gain confidence and hands-on experience with AWS. Watch free Instructional Videos and explore Self-Paced Labs.
Instructor Led Classes
Learn how to design, deploy and operate highly available, cost-effective and secure applications on AWS in courses led by qualified AWS instructors.
AWS Certification
Validate your technical expertise with AWS and use practice exams to help you prepare for AWS Certification.
More info at http://aws.amazon.com/training