BDA303 Serverless big data architectures: Design patterns and best practices
- 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ben Snively, AWS Sr. SA, Data & Analytics
April 19, 2017
Serverless Big Data Architectures
Serverless Streaming Data Analytics
- 4. Serverless characteristics
• No servers to provision or manage
• Scales with usage
• Never pay for idle
• Availability and fault tolerance built in
- 5. Data and analytics flow
Ingest/Collect → Store → Analyze/Process → Visualization/Consume
Orchestrate/Transform
- 7. AWS Big Data services
Stages: Ingest/Collect, Store, Analyze/Process, Visualization/Consume, Orchestration/Transform
• Batch ETL/ELT
• Realtime ETL/ELT
• Transactional/CDC
• B.I. Tools
• Data Science Notebooks
• Bulk Transport
• File/Object Upload
• Streaming Ingest
• Commits
• Transactional
• NoSQL
• Data Lake
• Streaming Storage
• Dashboards
• Batch Analytics
• Interactive Querying
• Machine Learning/Deep Learning
• Realtime Analytics
…
- 8. AWS Big Data services
[Figure: the capability map from slide 7, repeated with a legend: Serverless, Managed, Virtualized]
- 9. AWS Big Data services
Stages: Ingest/Collect, Store, Analyze/Process, Visualization/Consume, Orchestration/Transform
Legend: = Serverless
• EMR
• EC2
• S3
• Amazon Redshift
• DynamoDB
• AWS DMS (CDC)
• AWS Lambda
• Kinesis Analytics
• Amazon Athena
• Amazon QuickSight
• Aurora
• AWS Glue
• AWS Step Functions
• Kinesis Streams
• AWS Snowball
• ISV Connectors
• Kinesis Firehose
• S3 Transfer Acceleration
• Amazon Elasticsearch
- 11. Big Data storage for virtually all AWS services
Amazon S3
• Store anything
• Object storage
• Scalable
• 99.999999999% durability
• Extremely low cost
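A minimal sketch of how objects might be laid out when S3 is used as big data storage. The Hive-style `year=/month=/day=` layout, the dataset name, and the helper names are assumptions for illustration, not something the slides prescribe. The `upload` helper expects a `boto3.client("s3")` object and uses its `put_object` call.

```python
from datetime import datetime, timezone


def partitioned_key(dataset: str, event_time: datetime, filename: str) -> str:
    """Build a Hive-style partitioned object key (year=/month=/day=).

    The layout is an assumed convention; event_time should be timezone-aware.
    """
    t = event_time.astimezone(timezone.utc)
    return (
        f"{dataset}/year={t.year:04d}/month={t.month:02d}/day={t.day:02d}/{filename}"
    )


def upload(s3_client, bucket: str, key: str, body: bytes) -> None:
    """Upload one object; s3_client is expected to be boto3.client("s3")."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
```

For an event timestamped April 19, 2017 (UTC), `partitioned_key("clickstream", ..., "part-0000.json")` yields `clickstream/year=2017/month=04/day=19/part-0000.json`. Keys shaped like this let query engines skip partitions that a query does not touch.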
- 14. Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
Amazon Kinesis Streams
• For technical developers
• Build your own custom applications that process or analyze streaming data
Amazon Kinesis Firehose
• For all developers, data scientists
• Easily load massive volumes of streaming data into S3, Amazon Redshift, and Amazon Elasticsearch Service
Amazon Kinesis Analytics
• For all developers, data scientists
• Easily analyze data streams using standard SQL queries
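As a rough sketch of the "capture" side of Kinesis Streams, a producer might batch events before sending them. The code below assumes a `boto3.client("kinesis")` object is passed in and uses its `put_records` call, which accepts up to 500 records per request. The stream name, the event fields, and the helper names are hypothetical.

```python
import json
from typing import Iterable, Iterator, List

MAX_RECORDS_PER_CALL = 500  # PutRecords per-request record limit


def to_entries(events: Iterable[dict], key_field: str) -> Iterator[dict]:
    """Turn each event into a PutRecords entry, partitioned by key_field."""
    for event in events:
        yield {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(event[key_field]),
        }


def batches(entries: Iterable[dict], size: int = MAX_RECORDS_PER_CALL) -> Iterator[List[dict]]:
    """Group entries into lists of at most `size` items."""
    batch: List[dict] = []
    for entry in entries:
        batch.append(entry)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def send(kinesis_client, stream_name: str, events: Iterable[dict], key_field: str) -> int:
    """Send all events and return the number of records reported as failed."""
    failed = 0
    for batch in batches(to_entries(events, key_field)):
        response = kinesis_client.put_records(StreamName=stream_name, Records=batch)
        failed += response.get("FailedRecordCount", 0)
    return failed
```

A production producer would normally retry the failed records reported in each response; that step is left out here to keep the sketch short.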
- 15. AWS Lambda
Serverless compute
• Run your code in the cloud - fully managed and highly available
• Triggered through API or state changes in your setup
• Scales automatically to match the incoming event rate
• Node.js (JavaScript), Python, Java, and C#
• Charged per 100 ms of execution time
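To illustrate the event-triggered model, here is a minimal sketch of a Python Lambda handler fed by a Kinesis stream. Lambda delivers Kinesis payloads base64-encoded under `Records[].kinesis.data`; the JSON payload shape and the `type` field are assumptions made up for this example.

```python
import base64
import json


def handler(event, context):
    """Decode Kinesis records delivered to Lambda and count them by type."""
    counts = {}
    for record in event.get("Records", []):
        # Kinesis payloads arrive base64-encoded inside the event.
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)
        kind = message.get("type", "unknown")  # "type" is a hypothetical field
        counts[kind] = counts.get(kind, 0) + 1
    return counts
```

Because the handler keeps no state between invocations, the service can run as many copies in parallel as the incoming event rate requires.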
- 17. AWS Glue
Fully managed ETL service
• Catalog data sources
• Identify data formats & data types
• Error handling
• Manage and scale resources
• Generate ETL code
• Schedule and execute ETL jobs
- 18. AWS Glue: Services
Data catalog
Hive metastore-compatible metadata repository of data sources.
Crawls data sources to infer tables, data types, and partition formats.
Job execution
Runs jobs in Spark containers – automatic scaling based on SLA.
AWS Glue is serverless – only pay for the resources you consume.
Job authoring
Generates Python code to move data from source to destination.
Edit with your favorite IDE; share code snippets using Git.
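The crawling idea can be illustrated with a toy schema inference pass over sample records. This is only a sketch of the general technique, not the algorithm AWS Glue uses; the function names and the widening rules are assumptions, and the type names follow Hive conventions.

```python
from typing import Dict, Iterable


def infer_type(value) -> str:
    """Map a Python value to a Hive-style type name."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_schema(records: Iterable[dict]) -> Dict[str, str]:
    """Infer one type per column, widening when samples disagree."""
    schema: Dict[str, str] = {}
    for record in records:
        for column, value in record.items():
            if value is None:
                continue  # nulls carry no type information
            seen = infer_type(value)
            previous = schema.get(column)
            if previous is None or previous == seen:
                schema[column] = seen
            elif {previous, seen} == {"bigint", "double"}:
                schema[column] = "double"  # widen integers to doubles
            else:
                schema[column] = "string"  # fall back to the most general type
    return schema
```

A column that holds both integers and floats is widened to `double`; any other conflict falls back to `string`.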
- 19. Amazon QuickSight
Business Intelligence
• Fast and cloud-powered
• Easy to use, no infrastructure to manage
• Scales to hundreds of thousands of users
• Quick calculations with SPICE
• 1/10th the cost of legacy BI software
- 22. Interactive Queries
Stages: Ingest/Collect, Store, Analyze/Process, Visualization/Consume
Legend: Serverless, Managed, Virtualized
• Producer
• Amazon S3
• Amazon Redshift
• Amazon EMR (Presto, Impala, Spark)
• Amazon Athena
• QuickSight
Diagram label: Interactive
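For the serverless path in this flow (S3 queried through Athena), a query can be submitted and polled roughly as follows. This is a sketch that assumes a `boto3.client("athena")` object is passed in; it uses the `start_query_execution` and `get_query_execution` calls. The database name, SQL text, and output location in any real use are up to the caller, and `run_query` itself is a hypothetical helper.

```python
import time


def run_query(athena_client, sql, database, output_location,
              poll_seconds=1.0, sleep=time.sleep):
    """Start an Athena query and poll until it reaches a terminal state.

    Returns (query_execution_id, final_state).
    """
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        sleep(poll_seconds)  # still queued or running
```

The `sleep` parameter exists only so the polling loop can be exercised without real delays; results land in the S3 location given by `output_location`.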
- 24. Data Lake and Real-time Analytics
• Data Sources; Transactional Data
• Amazon S3: Data Lake
• Amazon Kinesis Streams & Firehose
• Streaming Analytics Tools: AWS Lambda, Spark Streaming on EMR, Apache Storm on EMR, Apache Flink on EMR, Amazon Kinesis Analytics
• Amazon EMR: Hadoop / Spark
• Amazon Redshift: Data Warehouse
• Amazon DynamoDB: NoSQL Database
• Amazon Aurora: Relational Database
• Amazon Elasticsearch Service
• Amazon ElastiCache: Redis
• Amazon Machine Learning: Predictive Analytics
• Data Science Sandbox: Any Open Source Tool of Choice on EC2
• Amazon Athena: Clusterless SQL Query
• AWS Glue: Clusterless ETL
• Serving Tier
• Visualization / Reporting
- 25. Serverless ETL
Stages: Store, Transform, Store, Analyze/Process, Visualize/Consume
Store (source): Amazon S3, Apache Kafka, Kinesis Streams
Transform: Amazon EMR (Spark, Flink), AWS Glue, AWS Lambda, ISV
Store (target): Amazon S3, Apache Kafka, Amazon Redshift, Kinesis Streams
Also shown: Data Catalog, AWS Glue, DynamoDB Streams, DynamoDB, Hive M/D
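One way the pieces on this slide could fit together is a Lambda function that reads DynamoDB Streams records and flattens them into JSON lines ready to be written to S3. The pipeline shape is an assumption made for illustration; the slide does not spell it out. The sketch relies on the stream event layout (`eventName`, `dynamodb.NewImage`, typed attributes such as `{"S": ...}` and `{"N": ...}`) and handles only a few attribute types.

```python
import json


def plain(attr: dict):
    """Convert one DynamoDB typed attribute, e.g. {"N": "3"}, to a Python value."""
    tag, value = next(iter(attr.items()))
    if tag == "S":
        return value
    if tag == "N":
        try:
            return int(value)
        except ValueError:
            return float(value)
    if tag == "BOOL":
        return value
    if tag == "NULL":
        return None
    return attr  # other types (lists, maps, sets) are passed through untouched


def to_json_lines(event: dict) -> str:
    """Flatten INSERT/MODIFY stream records into newline-delimited JSON."""
    lines = []
    for record in event.get("Records", []):
        image = record.get("dynamodb", {}).get("NewImage")
        if record.get("eventName") not in ("INSERT", "MODIFY") or image is None:
            continue  # skip deletes and records without a new image
        row = {name: plain(attr) for name, attr in image.items()}
        lines.append(json.dumps(row, sort_keys=True))
    return "\n".join(lines)
```

Newline-delimited JSON is a convenient target format because engines such as Athena and Spark can read it directly from S3.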
- 26. Serverless fits nicely into big data platforms
• AWS serverless Big Data services
• Complements existing big data flows
• Focus on the analytics and not on infrastructure or servers
• Don’t spend effort on scaling, availability, and undifferentiated heavy lifting
• Pay only for what you use
• Easily try out different tools, analytics, and solutions