Big data and serverless
Marek Kuczynski
Sr. Solutions Architect – Startups
@marekq
marekku@amazon.nl
AWS User Group Netherlands Meetup
Various choices for compute on AWS
Amazon EC2
Virtual server instances
in the cloud
Amazon ECS,
EKS, and Fargate
Container management service
for running
Docker on a managed
cluster of EC2
AWS Lambda
Serverless compute
for stateless code execution in
response to triggers
Event-based architectures
SERVICES (ANYTHING)
Changes in
data state
Requests to
endpoints
Changes in
resource state
EVENT SOURCE
FUNCTION
Node.js
Python
Java
C#
Go
Ruby
PowerShell
Bring your own runtime
Common Lambda use cases
Web
Applications
• Static
websites
• Complex web
apps
• Packages for
Flask and
Express
Data
Processing
• Real time
• MapReduce
• Batch
Chatbots
• Powering
chatbot logic
Backends
• Apps &
services
• Mobile
• IoT
Amazon
Alexa
• Powering
voice-enabled
apps
• Alexa Skills Kit
IT
Automation
• Policy engines
• Extending AWS
services
• Infrastructure
management
AWS serverless portfolio
COMPUTE AND DATASTORES
AWS
Lambda
AWS
Fargate
Amazon
API Gateway
Amazon
SNS
Amazon
MQ
Amazon
SQS
AWS
Step Functions
APPLICATION INTEGRATION
DEVELOPER TOOLS
SECURITY AND ADMINISTRATION
Amazon Aurora
Serverless
Amazon
S3
Amazon
DynamoDB
AWS
AppSync
AWS
IAM
Amazon
Cognito
Amazon
Inspector
Amazon
VPC
Amazon
GuardDuty
AWS
CloudFormation
AWS
Cloud9
AWS
CloudTrail
Amazon
CloudWatch
AWS
X-Ray
AWS
CodePipeline
AWS
Config
AWS
SSO
AWS
Shield
AWS
WAF
Amazon
Kinesis
AWS Serverless
Application
Repository
Serverless is a spectrum
More operations Less operations
Build well architected
• Scalability
• Is scalability seamless, semi-automatic or a manual process?
• Resilience
• To what degree can we (automatically) recover from issues on infrastructure?
• Cost
• Can we control cost based on pricing per operation/invocation?
• Maintenance and operations
• How much OS/software maintenance will be needed going forward?
• Security
• How do I keep infrastructure secure and handle authentication/authorization?
A serverless, three tier application
Data stored in
DynamoDB
Dynamic content in
AWS Lambda
Amazon API
Gateway
Browser
Amazon
CloudFront
Amazon S3
Amazon Cognito
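The "dynamic content in AWS Lambda" tier can be sketched as a small handler behind Amazon API Gateway. This is a minimal illustration, assuming the Lambda proxy integration; the `name` query parameter and the greeting are hypothetical and not taken from the talk.

```python
import json


def handler(event, context):
    """Return a response in the shape the API Gateway Lambda proxy integration expects."""
    # queryStringParameters is None when the request carries no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")  # hypothetical parameter for illustration
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


if __name__ == "__main__":
    print(handler({"queryStringParameters": {"name": "meetup"}}, None))
```

In this layout the static assets would stay in Amazon S3 behind CloudFront, and only API calls reach the function.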
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Demo of a serverless blog – https://marek.rocks
Monthly costs of running the blog
The website has been running stably for 3+ years with a few
hundred visitors every month.
• Route 53 hosted zone $0.50
• Lambda function cost $0.30
• DynamoDB costs $0.20
• API Gateway costs $0.10
• Email costs $0.02
• Domain name $1.00
No maintenance (patching, scaling, backups) is required.
TCO is at least 10x cheaper than running this on EC2.
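For reference, the line items above add up as follows. This is only a small arithmetic sketch; amounts are kept in whole cents to avoid floating-point rounding, and the yearly figure is simply twelve times the monthly total.

```python
# Monthly line items from the slide, in US cents.
MONTHLY_COSTS_CENTS = {
    "Route 53 hosted zone": 50,
    "Lambda function": 30,
    "DynamoDB": 20,
    "API Gateway": 10,
    "Email": 2,
    "Domain name": 100,
}


def total_cents(costs):
    """Sum all line items (in cents)."""
    return sum(costs.values())


def format_usd(cents):
    """Render a whole number of cents as a dollar amount."""
    return f"${cents // 100}.{cents % 100:02d}"


monthly = total_cents(MONTHLY_COSTS_CENTS)
print("Monthly:", format_usd(monthly))      # Monthly: $2.12
print("Yearly:", format_usd(monthly * 12))  # Yearly: $25.44
```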
Building and orchestrating a
serverless data lake
AWS solutions to build a serverless data lake
Amazon
S3
bucket(s)
Amazon ES
AWS
Glue
Amazon
DynamoDB
Catalog & search
AWS Key
Management
Service (AWS KMS)
AWS
CloudTrail
IAM
Amazon
Macie
Security & auditing
Amazon
Cognito
Amazon
API
Gateway
IAM
API/UI
Amazon
Athena
Amazon
QuickSight
Aurora
Serverless or
Redshift
Analytics & processing
AWS
Glue
AWS
Lambda
Amazon
Kinesis
Data
Streams
Amazon
Kinesis
Data
Firehose
AWS
Direct
Connect
Ingest
Ingestion using Kinesis
Amazon CloudWatch:
Delivery metrics
Amazon S3:
Buffered files
Kinesis
Agent
Record
producers
Amazon Redshift or Aurora:
Table loads
Amazon Elasticsearch Service:
Domain loads
Amazon S3:
Source record backup
AWS Lambda:
Transformations &
enrichment
Amazon DynamoDB:
Lookup tables
Raw records
Lookup
Transformed records
Transformed records
Raw records
Kinesis Data Firehose:
Delivery stream
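The "AWS Lambda: Transformations & enrichment" step can be sketched as a Kinesis Data Firehose transformation function. Firehose hands the function base64-encoded records and expects each one back with the same `recordId`, a `result`, and re-encoded `data`. The lookup table below is a hypothetical in-memory stand-in for the DynamoDB lookup tables in the diagram, and the `country_code` field is an assumption for illustration.

```python
import base64
import json

# Hypothetical lookup table; in the architecture above this would live in DynamoDB.
COUNTRY_BY_CODE = {"NL": "Netherlands", "DE": "Germany"}


def handler(event, context):
    """Enrich each raw JSON record and return it in the format Firehose expects."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            # Undecodable input: hand the record back unchanged and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        # Enrichment step: add a readable country name from the lookup table.
        payload["country"] = COUNTRY_BY_CODE.get(payload.get("country_code"), "unknown")
        encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": encoded.decode("utf-8"),
        })
    return {"records": output}
```

The trailing newline per record is a common choice so that the buffered files in S3 end up as line-delimited JSON.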
Architecture patterns to push or pull data
S3
bucket
object
Lambda
function
1. File put into bucket
2. Lambda invoked
Lambda
function
2. Lambda invoked
SNS
topic
1. Data published to a
topic
Data
1. Message inserted
into a queue
message
Amazon
SQS
Lambda
function
3. Function
removes
message from
queue
2. Lambda polls
queue and
invokes function
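The three patterns above deliver differently shaped events to the function. A minimal sketch of a single handler that tells them apart, assuming the standard S3, SNS and SQS event formats (the output strings are purely illustrative):

```python
def describe_record(record):
    """Summarize one event record, depending on which service produced it."""
    # S3 and SQS use the key "eventSource"; SNS uses "EventSource".
    source = record.get("eventSource") or record.get("EventSource")
    if source == "aws:s3":
        s3 = record["s3"]
        return f"s3 object {s3['bucket']['name']}/{s3['object']['key']}"
    if source == "aws:sns":
        return f"sns message {record['Sns']['Message']}"
    if source == "aws:sqs":
        return f"sqs message {record['body']}"
    return "unknown source"


def handler(event, context):
    """Process every record in the batch and return the summaries."""
    return [describe_record(record) for record in event.get("Records", [])]
```

With the SQS pattern, no explicit delete call is needed in this sketch: when the handler returns without raising, the Lambda service removes the batch from the queue.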
Recent launch: richer workflows using Step Functions
Simplify building workloads such as
order processing, report generation,
and data analysis
Add services in minutes
Write and maintain less code
AWS
Step Functions
AWS
Lambda
Amazon
ECS
AWS
Fargate
AWS
Batch
Amazon
SageMaker
AWS
Glue
Amazon
DynamoDB
Amazon
SNS
Amazon
SQS
Simpler integration, less code
With serverless polling: AWS Lambda functions
With new service integration: no Lambda functions
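As an illustration of the service integrations, the sketch below builds a Step Functions state machine definition (Amazon States Language) that writes to DynamoDB and publishes to SNS without any Lambda function in between. The table name, topic ARN, account ID and the `orderId` input field are hypothetical placeholders.

```python
import json

# Hypothetical placeholders: table name, topic ARN and input field are not from the talk.
definition = {
    "Comment": "Store an order and notify subscribers without a Lambda function",
    "StartAt": "StoreOrder",
    "States": {
        "StoreOrder": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:putItem",
            "Parameters": {
                "TableName": "orders",
                "Item": {"orderId": {"S.$": "$.orderId"}},
            },
            # Keep the original input and store the service response next to it.
            "ResultPath": "$.storeResult",
            "Next": "NotifyTeam",
        },
        "NotifyTeam": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:eu-west-1:123456789012:orders",
                "Message.$": "$.orderId",
            },
            "End": True,
        },
    },
}


def to_asl(state_machine):
    """Serialize the definition to the JSON document Step Functions accepts."""
    return json.dumps(state_machine, indent=2)


if __name__ == "__main__":
    print(to_asl(definition))
```

The resulting JSON could be passed as the definition when creating a state machine; both task states call the target service directly through the `arn:aws:states:::` integration resources.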
Serverless data lakes - Analytics
Analytics
Various choices for analytics of your data
• S3 Select on CSV, JSON and Apache Parquet objects
• Amazon Athena
• AWS Lambda
• Predictions with Amazon SageMaker
• Amazon EMR
• AWS Glue (ETL)
Analytics & processing
S3 Select – selecting fields from individual files
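A minimal sketch of an S3 Select call from Python, assuming boto3 is installed and a CSV object with a header row. The bucket, key and column names you pass in are your own; nothing here is taken from the slides. The request builder and the response reader are kept separate from the network call so they can be tried without AWS credentials.

```python
def build_select_request(bucket, key, expression):
    """Build the arguments for S3 SelectObjectContent on a CSV object with a header row."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"JSON": {}},
    }


def collect_records(event_stream):
    """Join the payload bytes of all 'Records' events in the response stream."""
    chunks = []
    for event in event_stream:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")


def run_select(bucket, key, expression):
    """Run the query against S3; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without the SDK

    s3 = boto3.client("s3")
    response = s3.select_object_content(**build_select_request(bucket, key, expression))
    return collect_records(response["Payload"])
```

A call such as `run_select("my-bucket", "sales.csv", "SELECT s.custid FROM S3Object s WHERE s.custid = '157231'")` would return only the matching fields, so less data leaves S3 than with a full download.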
Athena – running a query on files in S3 buckets
Run time: 44.66 seconds
Data scanned: 169.53 GB
Cost: $5/TB or $0.005/GB, so 169.53 GB × $0.005 ≈ $0.85
SELECT custid, year, SUM(count) FROM sales
WHERE custid = '157231'
GROUP BY custid, year ORDER BY year ASC;
Analytics & processing
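The cost figure on this slide follows directly from the amount of data scanned. A small sketch of that arithmetic, using the slide's price of $5 per TB expressed as $0.005 per GB:

```python
from decimal import Decimal, ROUND_HALF_UP

PRICE_PER_GB = Decimal("0.005")  # the slide's $5/TB expressed per GB


def query_cost_usd(scanned_gb):
    """Cost of one query in dollars, rounded to whole cents."""
    cost = Decimal(str(scanned_gb)) * PRICE_PER_GB
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(query_cost_usd(169.53))  # 0.85
```

Because the price depends on bytes scanned, columnar formats such as Parquet and partitioned data reduce the bill for the same query.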
Run a data science framework in Lambda
• pandas
• SciPy
• NumPy
• matplotlib
Just released: S3 Batch Operations
Amazon S3
Lambda Function (shown nine times in the diagram)
This new feature can:
• Modify the ACLs or tags of
objects on S3 at scale.
• Copy objects to a new bucket
while preserving properties.
• Let Lambda (re)process all your
files stored on S3.
AWS takes care of running the
operations, even if your bucket
has billions of objects.
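The "let Lambda (re)process all your files" option can be sketched as follows, assuming the S3 Batch Operations invocation schema version 1.0, in which each invocation carries a list of tasks and expects one result per task. `process_object` is a hypothetical placeholder for the real per-object work.

```python
def process_object(bucket_arn, key):
    """Hypothetical placeholder for the real per-object work."""
    return f"processed {key}"


def handler(event, context):
    """Handle one S3 Batch Operations invocation and report a result per task."""
    results = []
    for task in event["tasks"]:
        try:
            # Note: object keys in the event may arrive URL-encoded.
            message = process_object(task["s3BucketArn"], task["s3Key"])
            code = "Succeeded"
        except Exception as exc:
            message, code = str(exc), "PermanentFailure"
        results.append({
            "taskId": task["taskId"],
            "resultCode": code,
            "resultString": message,
        })
    return {
        "invocationSchemaVersion": event["invocationSchemaVersion"],
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": results,
    }
```

S3 Batch Operations takes care of listing the objects, invoking the function and collecting these results into a completion report.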
Why relational and not NoSQL?
Sometimes it’s not possible to use a NoSQL database:
• You need to integrate with other backend applications that run on a
relational database (e.g. WordPress) or are hard to modify.
• You need access to complex queries that are harder to do with NoSQL
(e.g. multiple joins, fuzzy searches).
• There may be other database features that your application requires
(logging, ACID compliance).
How does Aurora Serverless work?
Availability zone 1
Region
App on
EC2 or
Lambda
Shared distributed storage volume
Multi-tenant proxy layer
Warm-pool of Aurora instances
Monitoring service
Introducing Amazon Relational Database Service Data API
• Simple web service protocol for database access
• SQL statements packaged as HTTP requests
• Access your database from Lambda and AppSync
• Access your database from the AWS SDK & CLI
Data API Service
Aurora Serverless
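A minimal sketch of calling the Data API from Python, assuming boto3 and an Aurora Serverless cluster with the Data API enabled. The ARNs, database name and the `sales` table are hypothetical values supplied by the caller. The request is built separately from the network call so it can be inspected without AWS access.

```python
def build_statement(cluster_arn, secret_arn, database, custid):
    """Build an ExecuteStatement request with a named SQL parameter."""
    return {
        "resourceArn": cluster_arn,  # ARN of the Aurora Serverless cluster
        "secretArn": secret_arn,     # Secrets Manager secret holding the DB credentials
        "database": database,
        "sql": "SELECT * FROM sales WHERE custid = :custid",
        "parameters": [{"name": "custid", "value": {"stringValue": custid}}],
    }


def run_statement(request):
    """Send the statement over HTTPS; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so build_statement works without the SDK

    client = boto3.client("rds-data")
    return client.execute_statement(**request)["records"]
```

Since each statement is an HTTPS request, a Lambda function using this approach does not need to manage database connections or a connection pool.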
Introducing RDS Console Query Editor
• Access your database from
AWS Management Console
• No database client application
or terminal required
• The same requests can be
made using the AWS SDK or
CLI.
A serverless, relational three tier application
Data stored in
Aurora Serverless
Dynamic content in
AWS Lambda
Amazon API
Gateway
Browser
Amazon
CloudFront
Amazon S3
Amazon Cognito
Search and Data Catalog
• Use DynamoDB as a metadata repository
• Optionally use Amazon Elasticsearch Service for
more complex queries
AWS Lambda
Metadata Index
(DynamoDB)
Search Index
(Amazon ES)
ObjectCreated
ObjectDeleted
PutItem
Update Index
S3 Bucket
https://aws.amazon.com/answers/big-data/data-lake-solution/
Catalog & Search
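The metadata flow in this diagram can be sketched as a function that turns S3 event records into catalog actions. Note that in actual S3 notifications the delete events are named `ObjectRemoved:*`. The item attributes (`object_id`, `size`, `event_time`) are hypothetical; a real function would follow each action with a DynamoDB `put_item` or `delete_item` call and an update of the search index.

```python
from urllib.parse import unquote_plus


def catalog_action(record):
    """Map one S3 event record to a (action, item) pair for the metadata index."""
    bucket = record["s3"]["bucket"]["name"]
    # Object keys are URL-encoded in S3 event notifications.
    key = unquote_plus(record["s3"]["object"]["key"])
    object_id = f"{bucket}/{key}"
    event_name = record["eventName"]
    if event_name.startswith("ObjectCreated"):
        item = {
            "object_id": object_id,
            "size": record["s3"]["object"].get("size", 0),
            "event_time": record["eventTime"],
        }
        return ("put", item)
    if event_name.startswith("ObjectRemoved"):
        return ("delete", {"object_id": object_id})
    return ("ignore", {"object_id": object_id})


def handler(event, context):
    """Return the catalog actions for a batch of S3 event records."""
    return [catalog_action(record) for record in event.get("Records", [])]
```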
AWS Glue
Crawlers
AWS Glue
Data Catalog
Amazon
QuickSight
Amazon
Redshift
Spectrum
Amazon
Athena
S3
Bucket(s)
Catalog & Search
Use Glue Crawlers to build a data catalogue
AWS Lake Formation (in preview)
Build, secure, and manage data lakes, reducing the set up time from months to days
AWS CodeStar
• Quickly develop, build, and deploy
applications on AWS
• Start developing on AWS in minutes
• Work across your team, securely
• Manage software delivery easily
• Choose from a variety of project
templates
AWS CodeStar
Project templates for EC2, AWS Lambda, and Elastic Beanstalk
Services deployed for you when using CodeStar
Source Build Test Deploy Monitor
AWS CodeBuild +
Third Party
AWS CodeCommit AWS CodeBuild AWS CodeDeploy
AWS CodePipeline
AWS X-Ray
Amazon
CloudWatch
<-THIS
BECOMES THIS->
SAM Template
Use AWS X-Ray to debug functions
Further reading and events
Well Architected Lens for serverless
https://d1.awsstatic.com/whitepapers/architecture/AWS-Serverless-Applications-Lens.pdf
Serverless Application Repository
https://serverlessrepo.aws.amazon.com/
Free developer event - AWS DevDay on June 19th in Utrecht
https://aws.amazon.com/events/Devdays-Utrecht/
No server is easier to manage than "no server."
Werner Vogels—Amazon CTO
Thank you!
Marek Kuczynski
Sr. Solutions Architect - startups
@marekq
marekku@amazon.nl
