Patterns and Practices
for building resilient
serverless applications
presented by Yan Cui
“the capacity to recover quickly from difficulties; toughness.”
it’s not about
preventing failures!

everything fails, all the time
we need to build applications that can withstand failures
don’t run your application on one server…

entire data centers can
go down…
run your application in multiple AZs and regions
Failures on load: exhaustion of resources
CPU saturation
Failures in distributed systems
Service A Service B Service C
HTTP 502

You suck!
microservices death stars circa 2015
Yan Cui
AWS user for 10 years
Lambda execution environment
Serverless - multiple AZs out of the box
Total resources created:
1 API Gateway
1 Lambda
don’t pay for idle
redundant resources!
Load balancing

Data replication across different AZs
Global Tables
There is throttling everywhere!
Beware of timeout mismatch
API Gateway integration timeout
Default: 29s
Lambda timeout
Max: 15 minutes

SQS visibility timeout
Default: 30s
Min: 0s
Max: 12 hours
Lambda timeout
Max: 15 minutes
set VisibilityTimeout to
6x Lambda timeout
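The 6x rule is simple arithmetic, so a small helper can keep the queue and function settings in step. This is a minimal sketch, not part of the original slides: the function name and constants are hypothetical, and the result is clamped to the 12-hour SQS maximum.

```python
# Limits in seconds: Lambda max timeout is 15 minutes, SQS max visibility timeout is 12 hours.
LAMBDA_MAX_TIMEOUT_S = 15 * 60
SQS_MAX_VISIBILITY_TIMEOUT_S = 12 * 60 * 60


def recommended_visibility_timeout(lambda_timeout_s: int, multiplier: int = 6) -> int:
    """Return a VisibilityTimeout of `multiplier` x the Lambda timeout (hypothetical helper)."""
    if not 1 <= lambda_timeout_s <= LAMBDA_MAX_TIMEOUT_S:
        raise ValueError("Lambda timeout must be between 1s and 15 minutes")
    # Clamp to the SQS maximum so the value is always accepted by the queue.
    return min(lambda_timeout_s * multiplier, SQS_MAX_VISIBILITY_TIMEOUT_S)
```

For example, a function with a 30s timeout would pair with a 180s visibility timeout, and a function at the 15-minute maximum with 90 minutes.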
Offload computing operations to queues
better absorb downstream problems
need a way to replay DLQ events
great for fire-and-forget tasks
“what if the client is waiting for a response?”

“Decoupled Invocation”
task results
task id    created at    result
xxx        xxx           <null>
xxx        xxx           <null>
…          …             …
not ready…
reporting for duty!
working hard…
not ready…

task results
task id    created at    result
xxx        xxx           <null>
xxx        xxx           { … }
…          …             …
{ … }
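The polling flow can be sketched with an in-memory stand-in for the task results table. This is an illustration only: the `TaskTable` class and its method names are hypothetical, and a real deployment would presumably back it with a database table and have a queue-triggered worker call `complete`.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict


class TaskTable:
    """In-memory stand-in for the 'task results' table (task id, created at, result)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Any]] = {}

    def submit(self) -> str:
        # The API accepts the request, records a row with a null result,
        # and returns the task id straight away.
        task_id = str(uuid.uuid4())
        self._rows[task_id] = {"created_at": datetime.now(timezone.utc), "result": None}
        return task_id

    def complete(self, task_id: str, result: Any) -> None:
        # The background worker writes the result once it has finished.
        self._rows[task_id]["result"] = result

    def poll(self, task_id: str) -> Dict[str, Any]:
        # The client polls with the task id until the result is filled in.
        row = self._rows[task_id]
        if row["result"] is None:
            return {"status": "not ready"}
        return {"status": "done", "result": row["result"]}
```

The client gets "not ready" on every poll until the worker has stored a result, which keeps the request path fast even when the work itself is slow.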
a distributed transaction
needs rollback


how do you implement distributed transactions?
The Saga pattern
A pattern for managing failures where each action
has a compensating action for rollback
Begin transaction
Start book hotel request
End book hotel request
Start book flight request
End book flight request
Start book car rental request
End book car rental request
End transaction

model both actions and
compensating actions as
Lambda functions
use Step Functions as the
coordinator for the saga
no distributed
do the work here

24 hours data retention
need alerting to ensure
issues are addressed quickly

Mind the poison message
needs to deal with
poison messages
6, 3, 1, 1, 1, 1, …
only count the “same” batch
have to fetch
from the stream
do it before they expire
from the stream!
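One way to read the shrinking batch sizes above is as bisecting a failed batch until the poison message is isolated. The following is a local simulation of that idea, assuming a handler that fails the whole batch when any record is bad. It is not how the Lambda event source mapping is implemented, and `process_with_bisect` and `sample_handler` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple


def process_with_bisect(
    batch: Sequence[str], handler: Callable[[Sequence[str]], None]
) -> Tuple[List[str], List[str]]:
    """Return (processed, poison) by splitting a failing batch in half and retrying."""
    if not batch:
        return [], []
    try:
        handler(batch)
        return list(batch), []
    except Exception:
        if len(batch) == 1:
            # A single record that still fails is the poison message.
            return [], list(batch)
        mid = len(batch) // 2
        ok_left, bad_left = process_with_bisect(batch[:mid], handler)
        ok_right, bad_right = process_with_bisect(batch[mid:], handler)
        return ok_left + ok_right, bad_left + bad_right


def sample_handler(batch: Sequence[str]) -> None:
    # Hypothetical handler: the whole batch fails if it contains a bad record.
    if "bad" in batch:
        raise ValueError("cannot process record")
```

Records on the healthy side of each split are processed normally, and only the isolated record needs to be set aside for inspection.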

Mind the partial failures
SQS Poller Lambda
batch fails as a unit
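When the batch fails as a unit, messages that were already handled successfully are retried along with the failed ones. A minimal sketch of one mitigation follows, assuming the SQS event source mapping has the `ReportBatchItemFailures` response type enabled (the slides do not mention this setting). `process_message` is a hypothetical placeholder for the business logic.

```python
import json
from typing import Any, Callable, Dict, List


def process_message(body: str) -> None:
    # Hypothetical business logic: here, just require the body to be valid JSON.
    json.loads(body)


def handler(
    event: Dict[str, Any],
    context: Any = None,
    process: Callable[[str], None] = process_message,
) -> Dict[str, List[Dict[str, str]]]:
    """Process each SQS record and report only the ones that failed."""
    failures: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        try:
            process(record["body"])
        except Exception:
            # Only this message should become visible on the queue again.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty `batchItemFailures` list tells the poller the whole batch succeeded. Without the response type enabled, the return value is ignored and the batch still succeeds or fails as a unit.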
Mind the retry storm
Service A
retry storm
circuit breaker pattern
After X consecutive timeouts, trip the circuit
When circuit is open, fail fast
but, allow 1 request through every Y mins
If request succeeds, close the circuit
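The four rules above fit in a small class. This is a minimal in-memory sketch, assuming timeouts surface as Python's built-in `TimeoutError`. The class name, parameters and injectable clock are hypothetical, with `max_timeouts` standing for X and `probe_interval_s` for Y expressed in seconds.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(
        self,
        max_timeouts: int,
        probe_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_timeouts = max_timeouts
        self._probe_interval_s = probe_interval_s
        self._clock = clock
        self._consecutive_timeouts = 0
        self._last_trip_or_probe = 0.0
        self.is_open = False

    def call(self, fn: Callable[[], T]) -> T:
        now = self._clock()
        if self.is_open:
            if now - self._last_trip_or_probe < self._probe_interval_s:
                # Circuit is open: fail fast without calling the dependency.
                raise CircuitOpenError("circuit is open")
            # Interval elapsed: let this single request through as a probe.
            self._last_trip_or_probe = now
        try:
            result = fn()
        except TimeoutError:
            self._consecutive_timeouts += 1
            if not self.is_open and self._consecutive_timeouts >= self._max_timeouts:
                # Too many consecutive timeouts: trip the circuit.
                self.is_open = True
                self._last_trip_or_probe = now
            raise
        # A successful request closes the circuit and resets the counter.
        self._consecutive_timeouts = 0
        self.is_open = False
        return result
```

Errors other than timeouts pass through without changing the circuit state here, which is a design choice rather than a requirement of the pattern.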
where do I keep the state of the circuit?
in-memory
no dependency on external service
takes longer & more requests to stop all traffic
new containers would generate more traffic

external service
minimizes no. of total requests to trip circuit
new containers respect collective decision
dependency on an external service
which approach should I use?
It depends. Maybe start with the simplest solution first?
multi-region, active-active
Route53 API Gateway Lambda DynamoDB

SQS Lambda DynamoDB Lambda API Gateway
SQS Lambda DynamoDB Lambda API Gateway
Global Table
Multi-region architecture - benefits & tradeoffs
Protection against regional failures
Higher complexity
Very hard to test

“the discipline of experimenting on a system in order to build confidence in the
system’s capability to withstand turbulent conditions in production”
“You don't choose the moment, the moment chooses you!
You only choose how prepared you are when it does.”
Fire Chief Mike Burtch

“what if DynamoDB has an elevated error rate?”
“what if service X has elevated latency?”
identify weaknesses before they manifest in system-wide, aberrant behaviors

everything fails, all the time
“the capacity to recover quickly from difficulties; toughness.”
Advise Training Delivery
“Fundamentally, Yan has improved our team by increasing our
ability to derive value from AWS and Lambda in particular.”
Nick Blair
Tech Lead

Learn GraphQL and AppSync by building a
Twitter clone with these technologies
