Devoxx2017
- 2. INTRODUCTION
➤ Platform@Atlassian
➤ In the past Platform Lead at BlueJeans Network
➤ Worked at Sun Microsystems/Oracle for 13 years
➤ Committer to numerous open source projects, including GlassFish Application Server
- 7. PATH TO MICROSERVICES
➤ Advantages
➤ Simplicity
➤ Isolation of problems
➤ Scale up and scale down
➤ Easy deployment
➤ Polyglotism and heterogeneity
- 11. RESILIENT SYSTEM
➤ Processes transactions even under transient impulses or persistent stresses
➤ Functions even when component failures disrupt normal processing
➤ Accepts that failures will happen
➤ Design for crumple zones
- 12. RESILIENT SYSTEM
Be the duck
Behave normally, even when the system is not performing as expected in the face of outages
How should the customer perceive you? Behaving normally
- 14. KINDS OF FAILURES
➤ Challenges at scale
➤ Integration point failures
➤ Network errors
➤ Semantic errors
➤ Slow responses
➤ Outright hang
➤ GC issues
- 15. THE NEW WAY OF LIFE
You build it
You run it!!
You own it
You plan for it!!!
- 18. THINGS THAT WENT WRONG
➤ Bad node in load balancer group
➤ Deployment of new code
➤ Gradual increase in latency
➤ Abuse by clients
➤ Not enough production-like data in staging
➤ No easy way to trigger stale/lenient fallbacks
➤ Too few alerts
- 20. ACTION PLAN
➤ Circuit breakers
➤ Fallback (lenient acceptable values)
➤ Predictive caching
➤ Reduce the surface area exposed to clients
➤ Load tests
➤ Failure injection testing
➤ Monitor
➤ Alerts
➤ Applies at three stages: development time, before a deploy, and post deploy
- 21. The more you sweat on the field
the less you bleed in war!!!
- 22. RESILIENCY PLANNING STAGE 1
➤ When developing code
➤ Avoiding Cascading failures
➤ Circuit breaker
➤ Timeouts
➤ Retry
➤ Bulkhead
➤ Cache optimisations
➤ Avoid malicious clients
➤ Rate limiting
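
Of the patterns listed above, the bulkhead is the one without its own slide, so here is a minimal sketch of the idea: cap the number of concurrent calls to one dependency and reject the overflow immediately. The names `Bulkhead` and `BulkheadFullError` are hypothetical and not taken from any library mentioned in the talk.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the bulkhead has no free slot for another call."""


class Bulkhead:
    """Caps the number of concurrent calls to one dependency (illustrative)."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so a slow dependency
        # cannot tie up every worker thread in the service.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` instance keeps one saturated dependency from starving calls to the others.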
- 23. RESILIENCY PLANNING STAGE 2
➤ Plan for dealing with failures before deploying to production
➤ Load tests
➤ A/B tests
➤ Longevity tests
➤ Dark launch features
- 26. CASCADING FAILURES
Caused by chain reactions
For example:
➤ One node in a load-balanced group fails
➤ The other nodes need to pick up its work
➤ Eventually performance can degrade
- 27. HYSTRIX: CIRCUIT BREAKER PATTERN
• Fault tolerance pattern as a library
• Automatic fail fast
• Automatic fail over
• Metrics: circuit breaker open, calls/sec, execution time (median, 90th, 95th, and 99th percentile)
• If a command has had a high failure rate in the last 10 seconds, it is unlikely to succeed now
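
The fail-fast and fail-over behaviour described above can be illustrated with a small sketch. This is not Hystrix and does not use its API: Hystrix is a Java library that trips on an error percentage over a rolling window, whereas this simplified version trips after a number of consecutive failures. `CircuitBreaker`, `CircuitOpenError`, and all parameter names are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Simplified breaker: opens after N consecutive failures (illustrative)."""

    def __init__(self, failure_threshold=5, reset_timeout=10.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback=None):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast, or fail over to the fallback if given.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # (re)open the circuit
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The `fallback` callable is where a stale or lenient value would be returned while the dependency recovers.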
- 29. RETRY PATTERN AND TIMEOUTS
➤ Retry on network failures, timeouts, or server errors
➤ Helps with transient network errors such as dropped connections or server failover
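
A minimal sketch of the retry pattern, assuming the wrapped call enforces its own timeout and signals transient problems with `ConnectionError` or `TimeoutError`. The exponential backoff with jitter is an addition for illustration (the cheat sheet later lists "Jitter with Retries"); the function name and its parameters are hypothetical.

```python
import random
import time


def retry(func, attempts=3, base_delay=0.1, max_delay=2.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func, retrying only on the transient errors in retry_on."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff capped at max_delay, with full jitter so
            # many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

Errors outside `retry_on`, such as semantic errors, propagate immediately, since repeating the same request is unlikely to fix them.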
- 32. RATE LIMITING
➤ Restricts the number of requests a client can make
➤ Client can be identified based on the access token used
➤ Additionally clients can be identified based on IP address
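
One common way to implement this is a token bucket per client. The sketch below is an assumption about how such a limiter could look, not the implementation used in the talk; the client key stands in for an access token or an IP address, and `RateLimiter` with its parameters is hypothetical.

```python
import time


class RateLimiter:
    """Token bucket per client key (illustrative)."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate      # tokens added per second
        self.burst = burst    # bucket capacity
        self._clock = clock
        self._buckets = {}    # client key -> (tokens, last refill time)

    def allow(self, client_key):
        now = self._clock()
        tokens, last = self._buckets.get(client_key, (self.burst, now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[client_key] = (tokens - 1, now)
            return True
        self._buckets[client_key] = (tokens, now)
        return False
```

A request for which `allow` returns `False` would typically be rejected with an HTTP 429 response.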
- 34. TALE OF THE NEVER LEAVING CACHE ENTRIES
➤ Longer TTL
➤ Not evicted soon enough
➤ Bottlenecks
➤ Failures
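
To make the TTL and eviction points concrete, here is a small sketch of a cache that expires entries on read and supports explicit invalidation. It is illustrative only; `TTLCache` and its methods are hypothetical names, and a production cache would also need a size bound and background eviction.

```python
import time


class TTLCache:
    """Cache whose entries expire after ttl_seconds (illustrative)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def put(self, key, value):
        self._entries[key] = (value, self._clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # evict the stale entry on read
            return default
        return value

    def invalidate(self, key):
        # Explicit invalidation for when the source of truth changes.
        self._entries.pop(key, None)
```

A shorter TTL bounds how long a stale entry can be served, and `invalidate` covers the case where waiting for expiry is not acceptable.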
- 35. LOGGING BEST PRACTICES
➤ Use a detailed, consistent pattern across service logs
➤ Obfuscate sensitive data
➤ Identify caller or initiator as part of logs
➤ Do not log payloads
➤ Request tracing across services
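
Several of these practices can be combined in one logging filter. The sketch below uses Python's standard `logging` and `contextvars` modules; the log format, the `token=` pattern, and the names `ContextFilter`, `build_logger`, `request_id_var`, and `caller_var` are assumptions for illustration, not the setup described in the talk.

```python
import contextvars
import logging
import re

# Set once per incoming request, e.g. from a trace header (assumed).
request_id_var = contextvars.ContextVar("request_id", default="-")
caller_var = contextvars.ContextVar("caller", default="-")

# Hypothetical pattern for one kind of sensitive value.
TOKEN_PATTERN = re.compile(r"(token=)\S+")


class ContextFilter(logging.Filter):
    """Adds request id and caller to each record and masks tokens."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        record.caller = caller_var.get()
        # Render the message first, then mask, so arguments are covered.
        record.msg = TOKEN_PATTERN.sub(r"\1***", record.getMessage())
        record.args = None
        return True


def build_logger(name, stream):
    """Logger with one consistent format intended for every service."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s service=%(name)s request_id=%(request_id)s "
        "caller=%(caller)s %(message)s"))
    handler.addFilter(ContextFilter())
    logger.addHandler(handler)
    return logger
```

Passing the same request id to downstream services, and logging it there in the same field, is what makes a request traceable across service logs.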
- 43. METRICS
➤ Response times, throughput
➤ Identify slow running DB queries
➤ GC rate and pause duration
➤ Garbage collection can cause slow responses
➤ Monitor unusual activity
➤ Create alerts when thresholds are exceeded
➤ Run books for actions to be taken on alerts
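
As a rough illustration of turning response-time metrics into alerts, the sketch below computes nearest-rank percentiles over a batch of latencies and reports every threshold that is exceeded. The function names and the threshold mapping are hypothetical; a real setup would normally rely on the monitoring system's own percentile and alerting features.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_thresholds(latencies_ms, thresholds_ms):
    """Return one alert message per percentile that exceeds its limit.

    thresholds_ms maps a percentile (e.g. 95) to a limit in milliseconds.
    """
    alerts = []
    for pct, limit in sorted(thresholds_ms.items()):
        observed = percentile(latencies_ms, pct)
        if observed > limit:
            alerts.append(f"p{pct} latency {observed} ms exceeds {limit} ms")
    return alerts
```

Each alert message would ideally link to the run book entry that says what to do about it.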
- 44. Thoughts of the on-call person paged at 3 am, debugging an issue in your code
- 46. SAVED BY THE METRICS AND ALERTS
➤ MaxDBConnection alert
➤ CPU Utilisation spiking up
➤ Analysed slow running queries
➤ Some SELECT queries were very slow: average 718 ms, 95th percentile 2030 ms
➤ The cause, unidentified at first, was a bug fix that introduced pagination: the ORDER BY clause needed to match a function-based index
- 47. ROLLOUT OF NEW FEATURES
➤ Phasing rollout of new features
➤ Dark launch features
➤ Have a way to turn features off if not behaving as expected
➤ Alerts and more alerts!
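
A phased rollout with an off switch can be sketched as a percentage-based feature flag. This is one possible approach, not the mechanism used in the talk; `FeatureFlag` and its fields are hypothetical names.

```python
import hashlib


class FeatureFlag:
    """Percentage rollout with a kill switch (illustrative)."""

    def __init__(self, name, rollout_percent=0, enabled=True):
        self.name = name
        self.rollout_percent = rollout_percent  # 0..100
        self.enabled = enabled                  # kill switch

    def is_on(self, user_id):
        if not self.enabled:
            return False
        # Hash flag name plus user id so each user lands in a stable
        # bucket from 0 to 99, independent of other flags.
        key = f"{self.name}:{user_id}".encode()
        bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
        return bucket < self.rollout_percent
```

Because buckets are stable, raising `rollout_percent` only adds users, and setting `enabled` to `False` turns the feature off for everyone at once.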
- 48. AWS S3 OUTAGE
➤ S3 outage in US East
➤ Number of services affected
➤ Third-party services we depend on had degraded performance
➤ Lots of key takeaways from this
- 49. Cheat sheet
A Alerts
B Bulkheads
C Circuit Breakers
D Data obfuscation
E Eventual consistency
F Fallbacks & Hystrix
G GC settings
H Health checks
I Injecting failure
J Jitter with Retries
K Key invalidations
L Logging
M Metrics & monitoring
N Network latencies
O Optimizing queries
P Phased rollouts
Q Queues bounded
R Run books
S Staged deployments
T Timeouts
- 50. TAKEAWAY
➤ Inevitability of failures
➤ Expect systems will fail
➤ Failure prevention: plan for failures (not if, but when)
➤ Automate
Keep Calm and Cloud On!