In a large-scale distributed system, detecting and pinpointing failures gets exponentially harder as the architecture grows more complex. Netflix's cloud architecture is composed of thousands of services and hundreds of thousands of VMs and containers. Failures can happen at any level and often cascade quickly: some cause massive outages across several systems, while others break only one or two. This creates a needle-in-a-haystack problem that requires automated and precise detection. Zuul, as the front door for all of Netflix's cloud traffic, sees every request and response and is ideally positioned to identify and isolate only the broken paths in the maze of microservices.
We leveraged Zuul to stream real-time events for each request and response, and built an anomaly detector to automatically identify and alert on services in trouble. We scaled this detector to thousands of nodes, handling millions of requests, without a single line of machine learning. Sometimes you need machine learning and sometimes you don't. Although it's in vogue to apply machine learning to every problem, it can be more practical and approachable to solve certain problems with old-fashioned math!
In this talk, we'll discuss how we built this system with stream processing, anomaly detection algorithms, and a rules engine. We will also deep-dive into the anomaly detection algorithm and show how sometimes a simple, elegant algorithm can be just as good as any sophisticated machine learning.
Scalable Anomaly Detection (with Zero Machine Learning)
22. Need for Speed
Metrics processing delay
Rolling counts to prevent false positives
Several loops of investigation and paging
Slower resolutions mean massive impact
24. Need for Context
Alerts fire on a single metric in a single app
Don't get or look at other teams' alerts
All suspects will be paged
Need to show what, where, when in alert
25. Need for Accuracy
Static thresholds don't catch all issues
Built for obvious high impact issues
Need more adaptive thresholds
27. What's the Solution?
Alerting in real-time
Combine several anomalies into one alert
Orient operators to shorten investigation
Indicate problem system to reduce pages
29. Mantis Framework
Collect and process real-time events
All critical services send events to Mantis
Handles millions of events per second
Events are queried and pushed on demand
36. Requirements
Cheap real-time analysis
Stable enough to detect sharp spikes
Dynamic enough to adjust to new trends
Lenient enough to filter false positives
38. Median Estimation Algorithm
The "Cody" algorithm (aka MAD)
Effectively stochastic gradient descent
If streaming_value > median_estimate
median_estimate++
If streaming_value < median_estimate
median_estimate--
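Below is a minimal runnable sketch of this estimator in Java, assuming a fixed step size per observation; the class and method names are illustrative, not Netflix's actual code.

// Minimal sketch of the streaming median estimator described above.
// The fixed step size is an assumption for illustration.
public final class StreamingMedianEstimator {
    private double estimate;
    private final double step;

    public StreamingMedianEstimator(double initialEstimate, double step) {
        this.estimate = initialEstimate;
        this.step = step;
    }

    // Nudge the estimate toward each incoming value; with equal mass
    // above and below, the estimate settles near the stream's median.
    public double update(double value) {
        if (value > estimate) {
            estimate += step;
        } else if (value < estimate) {
            estimate -= step;
        }
        return estimate;
    }

    public double estimate() {
        return estimate;
    }
}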
41. Simple and Useful
Simple codebase
Predictable behavior
Very cheap to calculate over thousands of metrics and permutations
42. Recovery Detection
Critical to see if alert is ongoing or transient
Real-time recovery notice prevents unneeded action
Adds additional information for root cause
Don't need to alert if it's recovered
43. How to Detect Recovery?
Calculate distribution of pre-alert set
Do the same for the recent range of values
Can we confidently say the recent range is part of the same distribution?
44. Distribution Boundaries
Hoeffding Bound
For range r and sample size n, the true mean of the distribution is within ϵ of the empirical mean with certainty 1 - δ, where
ϵ = √( r² ln(1/δ) / (2n) )
46. Recovery Algorithm
Hoeffding bound on pre-alert distribution
Mean on last 30 seconds of data points
Is recent mean below the Hoeffding Bound?
If yes, we're back within a healthy range
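A sketch of how this recovery check could look, under the assumption that "below the Hoeffding bound" means the recent mean is within ϵ of the pre-alert empirical mean; all names here are illustrative.

import java.util.List;

// Sketch of the recovery check described above: compute the Hoeffding
// bound ϵ from the pre-alert window, then test whether the mean of the
// most recent values (e.g. last 30 seconds) is back within ϵ of the
// pre-alert mean. Names are illustrative, not Netflix's code.
public final class RecoveryDetector {

    // ϵ = √( r² ln(1/δ) / (2n) ) for value range r, sample size n,
    // and confidence 1 - δ.
    static double hoeffdingEpsilon(double range, int n, double delta) {
        return Math.sqrt((range * range * Math.log(1.0 / delta)) / (2.0 * n));
    }

    static double mean(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    // Recovered when the recent mean has dropped back to within ϵ of
    // the pre-alert (healthy) mean.
    static boolean hasRecovered(List<Double> preAlert, List<Double> recent,
                                double range, double delta) {
        double epsilon = hoeffdingEpsilon(range, preAlert.size(), delta);
        return mean(recent) <= mean(preAlert) + epsilon;
    }
}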
50. Zuul Aggregation
Select status for all requests
Aggregate values of status
Group by Zuul cluster and destination origin
Window for 10s with 1s interval
Groupings: cluster, origin, cluster_origin
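A rough sketch of this aggregation for one window, assuming each Zuul event exposes its cluster, origin, and status; the event shape and class names are hypothetical and stand in for the actual Mantis query.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the per-window aggregation: count statuses per
// grouping key (cluster, origin, and cluster_origin).
// The ZuulEvent shape is hypothetical, not the actual Mantis event.
record ZuulEvent(String cluster, String origin, String status) {}

final class StatusAggregator {
    // key -> status -> count, for one 10-second window of events
    static Map<String, Map<String, Long>> aggregate(List<ZuulEvent> window) {
        Map<String, Map<String, Long>> counts = new HashMap<>();
        for (ZuulEvent e : window) {
            for (String key : List.of(
                    "cluster:" + e.cluster(),
                    "origin:" + e.origin(),
                    "cluster_origin:" + e.cluster() + "_" + e.origin())) {
                counts.computeIfAbsent(key, k -> new HashMap<>())
                      .merge(e.status(), 1L, Long::sum);
            }
        }
        return counts;
    }
}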
51. Detection On All Data
Each grouping will return a set of values
Flatten groupings into a list of all combos
Detect for each combination of groupings
Anomaly on any of the values gives us info about which grouping is in trouble
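Continuing the sketch, one detector instance per (grouping, status) combination is enough to localize trouble. The wiring below reuses the StreamingMedianEstimator sketch from earlier and is illustrative only; the deviation threshold is an assumption, not the production rule.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative wiring only: one estimator per (grouping key, status)
// combination, fed the aggregated count once per window.
final class PerGroupingDetector {
    private final Map<String, StreamingMedianEstimator> estimators = new HashMap<>();

    // Returns the combinations whose latest count sits far above the
    // running median estimate (the anomaly signal on this slide).
    List<String> detect(Map<String, Map<String, Long>> windowCounts, double threshold) {
        List<String> anomalies = new ArrayList<>();
        windowCounts.forEach((group, statuses) -> statuses.forEach((status, count) -> {
            String key = group + "|" + status;
            StreamingMedianEstimator estimator =
                estimators.computeIfAbsent(key, k -> new StreamingMedianEstimator(count, 1.0));
            double median = estimator.update(count);
            if (count > median + threshold) {
                anomalies.add(key);
            }
        }));
        return anomalies;
    }
}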
54. Making Detection Useful
Most anomalies aren't really anomalies
How do we assess impact?
Has there been a recovery?
How can we filter out the noise?
55. Rules Engine
Open a window for all anomalies
At the end of the window, run rules on each
Rules tag events with actionable details
Make decisions on tagged events
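A bare-bones sketch of the shape this implies: each rule inspects an anomaly event collected in the window and attaches tags, and alerting or paging decisions read only the tags. The types and fields below are hypothetical.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shapes for the rules engine: rules annotate anomaly
// events with tags; downstream decisions read only the tags.
interface Rule {
    void apply(AnomalyEvent event, Map<String, String> tags);
}

class AnomalyEvent {
    final String grouping;   // e.g. a cluster, origin, or cluster_origin key
    final double impactPct;  // share of traffic affected
    AnomalyEvent(String grouping, double impactPct) {
        this.grouping = grouping;
        this.impactPct = impactPct;
    }
}

final class RulesEngine {
    private final List<Rule> rules = new ArrayList<>();

    void register(Rule rule) { rules.add(rule); }

    // Run every rule against one anomaly from the window and return the
    // tags that later decide whether to alert or page.
    Map<String, String> evaluate(AnomalyEvent event) {
        Map<String, String> tags = new HashMap<>();
        for (Rule rule : rules) {
            rule.apply(event, tags);
        }
        return tags;
    }
}

For example, an impact rule could tag any anomaly affecting more than some share of traffic (25% here, purely as an example) as pageable: engine.register((event, tags) -> { if (event.impactPct > 25.0) tags.put("page", "true"); });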
56. Useful Rules
Time - start of anomaly and recovery
Impact - percentage of traffic affected
Description - sentence describing errors
Alert - filter out anomalies that aren't real
Page - page on critical errors
57. Timeline of Events
Open window on first anomaly
Close window after 3 minutes
End up with rich timeline of events
Time-based correlation will indicate issue
59. Timeline of Events
7:11:33 AM - Zuul-Website (cluster), API Service (origin) started having connectivity issues at 13.08%
7:12:10 AM - API Service (origin) started failing with timeouts at 1.29%
7:12:22 AM - Zuul-Website (cluster) started failing with timeouts at 1.78%
7:14:50 AM - API Service (origin) started throttling retries (e.g. retry storm) at 43.86%
7:14:52 AM - Zuul-API (cluster), API Service (origin) started throttling retries (e.g. retry storm) at 63.12%
7:14:52 AM - Zuul-API (cluster) started throttling retries (e.g. retry storm) at 60.24%
62. Raju for API
Hundreds of Hystrix commands
Aggregate Hystrix response statuses
Group by command name and API cluster
Same timeline as Zuul!
63. Raju for API
11:31:13 AM - API - API Service (cluster), Authenticate Customer (hystrix cmd) started failing with status FAILURE at 21.26%
11:33:50 AM - API - Authenticate Customer (hystrix cmd) started failing with status FAILURE at 21.35%
11:38:16 AM - Zuul started failing with request read timeouts for Zuul-API (cluster), API Service (origin) at 1.18%
11:39:58 AM - Zuul - Zuul-Website (cluster), Website UI Service (origin) started throttling requests at 32.63%
11:40:52 AM - Zuul - Zuul-Website (cluster), Website UI Service (origin) started failing with read timeouts at 2.83%
11:41:02 AM - API - API Service (cluster), Get Customer Permissions (hystrix cmd) started failing with status FAILURE at 5.11%
11:43:17 AM - Zuul - Zuul-API (cluster) started throttling retries (e.g. retry storm) at 1.92%
67. Emailing the Culprits
Origins should be aware of upstream issues
May not be aware of any problems
They need the notification more than we do
Follow up with recovery notifications
71. Tradeoffs
Not easily transferable across services
Rules need to be created manually
Alerting decisions need to be tweaked
Need deep operational knowledge