Expect the unexpected: Anticipate and prepare for failures in microservices-based architectures
- 2. Introduction
• Senior Software Engineer at Blue Jeans
Network
• Worked at Sun Microsystems/Oracle for 13
years
• Committer to numerous open source projects
including GlassFish Application Server
- 6. What you will learn
• Monoliths vs. microservices
• Challenges at scale
• Preventing Cascading failures
• Resilience planning at various stages
• Dealing with latencies in response
• Real world examples
- 12. Microservices
• Disadvantages
– Not a free lunch!
– Distributed systems prone to failures
– Eventual consistency
– More effort in terms of deployments and release
management
– Challenges in testing services that evolve
independently, regression tests, etc.
- 14. Resilient system
• Processes transactions, even when there are
transient impulses or persistent stresses
• Functions even when there are component
failures disrupting normal processing
• Accepts failures will happen
• Designs for crumple zones
- 15. Kinds of failures
• Challenges at scale
• Integration point failures
– Network errors
– Semantic errors
– Slow responses
– Outright hang
– GC issues
- 18. Anticipate failures at scale
• Anticipate growth
• Design for next order of magnitude
• Design for 10x, plan to rewrite for 100x
- 19. Resiliency planning Stage 1
• When developing code
– Avoiding Cascading failures
• Circuit breaker
• Timeouts
• Retry
• Bulkhead
• Cache optimizations
– Avoid malicious clients
• Rate limiting
- 23. Cascading failures
Caused by chain reactions
For example
One node in a load-balanced group fails
The other nodes need to pick up its work
Eventually performance can degrade
- 27. Timeouts
• Clients may prefer a prompt response of any kind
– failure
– success
– job queued for later
All aggregation requests to microservices should
have reasonable timeouts set
- 28. Types of Timeouts
• Connection timeout
– Max time allowed to establish a connection before
an error is raised
• Socket timeout
– Max time of inactivity between two packets once
connection is established
- 29. Timeouts pattern
• Timeouts + Retries go together
• Transient failures can be remedied with fast
retries
• However, problems in the network can last for a
while, so there is a probability of retries failing too
- 30. Timeouts in code
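In JAX-RS (Jersey client)
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import org.glassfish.jersey.client.ClientProperties;
// Both timeouts are in milliseconds
Client client = ClientBuilder.newClient();
client.property(ClientProperties.CONNECT_TIMEOUT, 5000);
client.property(ClientProperties.READ_TIMEOUT, 5000);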
- 31. Retry pattern
• Retry in case of network failures,
timeouts or server errors
• Helps transient network errors such as
dropped connections or server fail over
- 32. Retry pattern
• If one of the services is slow or malfunctioning
and other services keep retrying then the
problem becomes worse
• Solution
– Exponential backoff
– Circuit breaker pattern
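The retry and exponential backoff ideas above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name Retry, the method withBackoff and its parameters are hypothetical and do not come from any library mentioned in the talk.

```java
import java.util.concurrent.Callable;

class Retry {
    // Calls the task up to maxAttempts times, doubling the wait after each failure.
    // Hypothetical helper for illustration; not part of any library named in the slides.
    static <T> T withBackoff(Callable<T> task, int maxAttempts, long baseDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = baseDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw new RuntimeException("gave up after " + attempt + " attempts", e);
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("interrupted while backing off", ie);
                }
                delay *= 2; // exponential growth: base, 2x base, 4x base, ...
            }
        }
    }
}
```

Capping the number of attempts matters as much as the growing delay: without a cap, retrying clients keep adding load to a service that is already struggling.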
- 33. Circuit breaker pattern
A circuit breaker is an electrical device used in an
electrical panel that monitors and controls the amount of amperes
(amps) being sent through the wiring
- 34. Circuit breaker pattern
• Safety device
• If a power surge occurs in the electrical wiring,
the breaker will trip.
• Flips from “On” to “Off” and shuts electrical
power from that breaker
- 35. Circuit breaker
• Netflix Hystrix follows circuit breaker pattern
• If a service’s error rate exceeds a threshold it
will trip the circuit breaker and block the
requests for a specific period of time
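The trip-and-block behaviour can be illustrated without Hystrix. The sketch below is not the Hystrix API; it is a simplified stand-in that counts consecutive failures instead of measuring an error rate, and the class name, constructor parameters and injectable clock are all assumptions made for the example.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injectable so the timing is easy to test
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    // synchronized keeps the sketch simple, at the cost of serialising calls.
    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (openedAt >= 0 && clock.getAsLong() - openedAt < openMillis) {
            return fallback.get(); // open: fail fast without touching the service
        }
        try {
            T result = action.get(); // closed, or a trial call after the open period
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong(); // trip (or re-trip) the breaker
            }
            return fallback.get();
        }
    }
}
```

A caller would typically pass System::currentTimeMillis as the clock and a cached or default value as the fallback.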
- 38. Bulkhead
• An example of bulkhead could be isolating the
database dependencies per service
• Similarly other infrastructure components can
be isolated such as cache infrastructure
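The slide describes bulkheads at the infrastructure level. The same isolation idea can be applied inside a single process by giving each dependency its own bounded pool of permits, so that one slow dependency cannot use up every request thread. The sketch below assumes this in-process variant; the Bulkhead class and its method names are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the action if a permit is free; otherwise rejects at once with the fallback.
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always hand the permit back, even on failure
        }
    }
}
```

One instance per dependency (for example one for the database and one for the cache) keeps a saturated dependency from affecting calls to the others.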
- 39. Rate Limiting
• Restricting the number of requests that can be
made by a client
• Client can be identified based on the access
token used
• Additionally clients can be identified based on
IP address
- 40. Rate Limiting
• With JAX-RS Rate limiting can be implemented
as a filter
• This filter can check the access count for a
client and if within limit accept the request
• Otherwise return a 429 (Too Many Requests) error
• Code at https://github.com/bhakti-mehta/samples/tree/master/ratelimiting
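The counting logic that such a filter would delegate to can be sketched in plain Java. This is not the code from the linked repository. It is a minimal fixed-window counter written for illustration, with a hypothetical RateLimiter class, an injectable clock, and the client identified by any string key such as an access token or IP address.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

class RateLimiter {
    static final int OK = 200;
    static final int TOO_MANY_REQUESTS = 429;

    private final int limit;
    private final long windowMillis;
    private final LongSupplier clock;
    // clientId -> {window start time, requests accepted in this window}
    private final Map<String, long[]> windows = new ConcurrentHashMap<>();

    RateLimiter(int limit, long windowMillis, LongSupplier clock) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    // Returns true if the client is still within its limit for the current window.
    boolean tryAcquire(String clientId) {
        long now = clock.getAsLong();
        boolean[] allowed = {false};
        windows.compute(clientId, (id, cur) -> {
            if (cur == null || now - cur[0] >= windowMillis) {
                cur = new long[] {now, 0}; // start a fresh window
            }
            if (cur[1] < limit) {
                cur[1]++;
                allowed[0] = true;
            }
            return cur;
        });
        return allowed[0];
    }

    // The status a filter could send back for this request.
    int statusFor(String clientId) {
        return tryAcquire(clientId) ? OK : TOO_MANY_REQUESTS;
    }
}
```

A fixed window is the simplest scheme; it allows short bursts at window boundaries, which a sliding window or token bucket would smooth out.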
- 41. Cache optimizations
• Stores response information related to
requests in a temporary storage for a specific
period of time
• Ensures that server is not burdened
processing those requests in future when
responses can be fulfilled from the cache
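A minimal sketch of a cache that keeps responses for a fixed period, assuming an in-memory map is sufficient. The TtlCache class, its loader argument and the injectable clock are hypothetical names chosen for the example; a production service would more likely use a dedicated cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;

        Entry(V value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Returns the cached value while it is fresh; otherwise calls the loader and caches the result.
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.compute(key, (k, cur) ->
                (cur != null && now < cur.expiresAt)
                        ? cur
                        : new Entry<V>(loader.apply(k), now + ttlMillis));
        return entry.value;
    }
}
```

The loader stands in for the expensive call to the backing service; it runs only on a miss or after the entry has expired.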
- 43. Dealing with latencies in response
• Have a timeout for the aggregation service
• Dispatch requests in parallel and collect
responses
• Associate a priority with all the responses
collected
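Parallel dispatch with a timeout can be sketched with CompletableFuture (Java 9 or later, for completeOnTimeout). The Aggregator class, the gather method and the fallback value are assumptions made for the example. Results come back in the order the calls were supplied, so a caller can list the calls by priority.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class Aggregator {
    // Dispatches all calls in parallel. A call that fails or exceeds the timeout
    // contributes the fallback, so the caller still gets a partial result.
    static <T> List<T> gather(List<Supplier<T>> calls, long timeoutMillis, T fallback) {
        List<CompletableFuture<T>> futures = new ArrayList<>();
        for (Supplier<T> call : calls) {
            futures.add(CompletableFuture.supplyAsync(call)
                    .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                    .exceptionally(e -> fallback));
        }
        List<T> results = new ArrayList<>();
        for (CompletableFuture<T> future : futures) {
            results.add(future.join()); // cannot block much longer than the timeout
        }
        return results;
    }
}
```

The sketch uses the common fork-join pool for brevity; a real aggregation service would usually supply its own executor so that slow downstream calls cannot starve unrelated work.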
- 44. Handling partial failures best practices
• One service calls another which can be slow or
unavailable
• Never block indefinitely waiting for the service
• Try to return partial results
• Provide a caching layer and return cached
data
- 45. Asynchronous Patterns
• Pattern to deal with long running jobs
• Some resources may take a long time to
provide results
• The client does not need to wait for the response
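One way to avoid making the client wait is to accept the job, hand back an identifier immediately, and let the client poll for the outcome. The sketch below assumes that approach; the JobService class, its status strings and the use of the common thread pool are illustrative choices, not something the slides prescribe.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class JobService {
    private final Map<String, CompletableFuture<String>> jobs = new ConcurrentHashMap<>();

    // Starts the work in the background and returns a job id straight away.
    String submit(Supplier<String> work) {
        String id = UUID.randomUUID().toString();
        jobs.put(id, CompletableFuture.supplyAsync(work));
        return id;
    }

    // Lets the client poll: PENDING, DONE, FAILED, or UNKNOWN for an unrecognised id.
    String status(String id) {
        CompletableFuture<String> job = jobs.get(id);
        if (job == null) {
            return "UNKNOWN";
        }
        if (!job.isDone()) {
            return "PENDING";
        }
        return job.isCompletedExceptionally() ? "FAILED" : "DONE";
    }

    // Blocks until the job finishes; a client would normally call this only once status is DONE.
    String result(String id) {
        CompletableFuture<String> job = jobs.get(id);
        if (job == null) {
            throw new IllegalArgumentException("unknown job id: " + id);
        }
        return job.join();
    }
}
```

In an HTTP API the same flow is commonly expressed as a 202 Accepted response carrying the job's location, followed by polling requests against it.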
- 47. Asynchronous API
• Reactive patterns
• Message Passing
– Akka actor model
• Message queues
– Communication between services via shared
message queues
– Websockets
- 48. Logging
• Complex distributed systems introduce many
points of failure
• Logging helps link events/transactions between
various components that make an application or
a business service
• ELK stack
• Splunk, syslog
• Loggly
• LogEntries
- 49. Logging best practices
• Use a detailed, consistent pattern across
service logs
• Obfuscate sensitive data
• Identify caller or initiator as part of logs
• Do not log payloads by default
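These practices can be sketched as a small formatting helper. The field names (service, requestId, caller), the list of sensitive keys and the LogLine class are all assumptions for illustration; a real service would pick its own fields and route the line through its logging framework.

```java
import java.util.regex.Pattern;

class LogLine {
    // Hypothetical list of sensitive keys; adjust to the fields a service actually logs.
    private static final Pattern SENSITIVE =
            Pattern.compile("(?i)(password|token|secret)=([^\\s,;]+)");

    // Replaces the value of any sensitive key=value pair with asterisks.
    static String obfuscate(String message) {
        return SENSITIVE.matcher(message).replaceAll("$1=****");
    }

    // One consistent pattern for every service, including who initiated the call.
    static String format(String service, String requestId, String caller, String message) {
        return String.format("service=%s requestId=%s caller=%s msg=\"%s\"",
                service, requestId, caller, obfuscate(message));
    }
}
```

Carrying the same requestId through every service that handles a request is what makes it possible to link the events of one transaction across components.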
- 50. Best practices when designing APIs for
mobile clients
– Avoid chattiness
– Use aggregator pattern
- 58. Metrics
• Response times, throughput
– Identify slow running DB queries
• GC rate and pause duration
– Garbage collection can cause slow responses
• Monitor unusual activity
• Third party library metrics
– For example Couchbase hits
– atop
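On the JVM, GC counts and cumulative collection time are available through the standard management beans. The sketch below shows one way to sample them; the GcMetrics class and its snapshot method are hypothetical, and how the numbers reach a metrics system is left out.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcMetrics {
    // Returns {total collections, total collection time in ms} summed over all collectors.
    static long[] snapshot() {
        long count = 0;
        long timeMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            count += Math.max(0, gc.getCollectionCount());
            timeMillis += Math.max(0, gc.getCollectionTime());
        }
        return new long[] {count, timeMillis};
    }
}
```

Taking two snapshots a fixed interval apart and subtracting them gives a GC rate and the share of time spent collecting over that interval.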
- 59. Rollout of new features
• Phasing rollout of new features
• Have a way to turn features off if not behaving
as expected
• Alerts and more alerts!
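A phased rollout with an off switch can be sketched as a percentage-based feature flag. The FeatureFlags class and its method names are hypothetical; real deployments usually keep the percentages in a configuration service so they can be changed without a redeploy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FeatureFlags {
    private final Map<String, Integer> rolloutPercent = new ConcurrentHashMap<>();

    // Sets the share of users (0 to 100) who see the feature; 0 acts as the off switch.
    void setRollout(String feature, int percent) {
        rolloutPercent.put(feature, Math.max(0, Math.min(100, percent)));
    }

    // Stable per-user decision: a user keeps the same bucket as the percentage grows.
    boolean isEnabled(String feature, String userId) {
        int percent = rolloutPercent.getOrDefault(feature, 0);
        int bucket = Math.floorMod((feature + ":" + userId).hashCode(), 100);
        return bucket < percent;
    }
}
```

Because the bucket depends only on the feature and user, raising the percentage adds users without removing anyone who already has the feature, and setting it to 0 turns the feature off for everyone.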
- 60. Real world examples
• Netflix's Simian Army induces failures of
services and even datacenters during the
working day to test both the application's
resilience and monitoring.
• Latency Monkey to simulate slow running
requests
• Wiremock to mock services
• Saboteur to create deliberate network
mayhem
Editor's Notes
- The service will have a caching layer and a database layer