Scalable
Anomaly Detection
(with Zero Machine Learning)
Arthur Gonigberg
Cloud Gateway
Routing
Monitoring
Security
Resiliency
Flexibility
Zuul.
[Diagram: request traffic flows from Zuul to the API layer and on to mid-tier services]
Needle
3:00AM
[Chart: anomaly vs. static threshold (Zuul and origin errors)]
[Chart: anomaly vs. dynamic threshold]
[Diagram: Origins 1 and 2 behind Cluster 1; Origins 3 and 4 behind Cluster 2]
Raju: Rule-based Anomaly Judgement and Understanding
Raju
Detect anomalies in real time
Build a timeline of failures across systems
Orient on-call operators
Indicate relationships and causality
No A.I.
Anomaly Agenda
Why Static Alerts Suck
Real-Time Event Processing
Detecting Anomalies at Scale
Making Alerts Useful
Need for Speed
Static Alerts Suck
Small Incident
0m - Incident happens
5m - Pager fires
15m - Zuul on-call responds
30m - Metrics investigation points to downstream issue
30m - Page downstream on-call
35m - Downstream on-call responds
45m - Issue identified
50m - Incident mitigated
Need for Speed
Metrics processing delay
Rolling counts to prevent false positives
Several loops of investigation and paging
Slower resolutions mean massive impact
Timing is Critical
[Chart: customer impact, 2013 to 2018; time to major customer impact shrank from ~2 hours to ~5 minutes]
Need for Context
Alerts fire on a single metric in a single app
Teams don't receive or look at other teams' alerts
All suspects will be paged
Need to show what, where, when in alert
Need for Accuracy
Static thresholds don't catch all issues
Built for obvious high impact issues
Need more adaptive thresholds
Operational Burden
Pagers constantly firing
Alert fatigue
Endless dashboards
Difficult on-call rotations
What's the Solution?
Alerting in real-time
Combine several anomalies into one alert
Orient operators to shorten investigation
Indicate problem system to reduce pages
Stream Processing
Real-Time Events
Mantis Framework
Collect and process real time events
All critical services send events to Mantis
Handles millions of events per second
Events are queried and pushed on demand
[Diagram: Zuul pushes request events to a Mantis request source; Mantis jobs subscribe and queries run against the stream]
Submit SQL-style queries on arbitrary data
Sample any portion of traffic on any service
Insanely useful feature for debugging
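As an illustration only (the real Mantis query language and client API aren't shown here), a SQL-style query over request events boils down to filter, project, and sample. A minimal, self-contained sketch in plain Java:

import java.util.List;
import java.util.Map;

// Minimal sketch of what an on-demand, SQL-style query over request events
// amounts to: filter + project (+ sample). The commented query syntax is
// illustrative only, not the actual Mantis query language.
public class QuerySketch {
    // e.g. "SELECT path, status FROM requestEvents WHERE status >= 500"
    public static void main(String[] args) {
        List<Map<String, Object>> events = List.of(
            Map.of("path", "/api/play", "status", 200),
            Map.of("path", "/api/play", "status", 503),
            Map.of("path", "/api/browse", "status", 500));

        events.stream()
              .filter(e -> ((Integer) e.get("status")) >= 500)   // WHERE status >= 500
              .map(e -> Map.of("path", e.get("path"),
                               "status", e.get("status")))       // SELECT path, status
              .forEach(System.out::println);
    }
}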
Stream Processing
Mantis jobs subscribe to request source
Then receive a stream of events
Arbitrary code can be executed on each
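A minimal sketch of that subscribe-and-process model, assuming a stand-in RequestSource class rather than the real Mantis source job:

import java.util.Map;
import java.util.function.Consumer;

// Sketch of the subscribe-and-process model: a job registers a callback on a
// request-event stream and arbitrary code runs for every event.
public class StreamProcessingSketch {

    static class RequestSource {
        private Consumer<Map<String, Object>> subscriber = e -> {};

        void subscribe(Consumer<Map<String, Object>> job) { this.subscriber = job; }

        void emit(Map<String, Object> event) { subscriber.accept(event); }
    }

    public static void main(String[] args) {
        RequestSource source = new RequestSource();

        // "Job": arbitrary per-event logic, here just printing failures.
        source.subscribe(event -> {
            String status = (String) event.get("status");
            if (status.startsWith("FAILURE")) {
                System.out.println("failure seen: " + event);
            }
        });

        source.emit(Map.of("status", "SUCCESS", "origin", "api"));
        source.emit(Map.of("status", "FAILURE_READ_TIMEOUT", "origin", "api"));
    }
}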
Aggregate
Build complex result sets in real time
Aggregate and group by any key
Reduce large data sets into something useful
Job Chaining
Chain several Mantis jobs together
Build a stream of map and reduce jobs
Consolidate events to make decisions
Finding Needles in Haystacks
Anomaly Detection
Requirements
Cheap real time analysis
Stable enough to detect sharp spikes
Dynamic enough to adjust to new trends
Lenient enough to filter false positives
Median Estimation
Median is a robust indicator
Avoids outliers well
Cheap to calculate
Median Estimation Algorithm
The "Cody" algorithm (aka MAD)
Effectively stochastic gradient descent
If streaming_value > median_estimate
    median_estimate++
If streaming_value < median_estimate
    median_estimate--
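A runnable sketch of that estimator; the step size and the sample stream below are illustrative choices:

// Streaming median estimator as described above: nudge the estimate up or down
// by a fixed step for every observed value.
public class MedianEstimator {
    private double estimate = 0.0;
    private final double step;

    MedianEstimator(double step) { this.step = step; }

    double observe(double value) {
        if (value > estimate) {
            estimate += step;   // streaming_value > median_estimate -> estimate++
        } else if (value < estimate) {
            estimate -= step;   // streaming_value < median_estimate -> estimate--
        }
        return estimate;
    }

    public static void main(String[] args) {
        MedianEstimator est = new MedianEstimator(1.0);
        double[] errorCounts = {3, 5, 4, 6, 5, 40, 5, 4, 6, 5}; // one spike in the stream
        for (double v : errorCounts) {
            System.out.printf("value=%.0f estimate=%.1f%n", v, est.observe(v));
        }
    }
}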
Simple and Useful
Simple codebase
Predictable behavior
Very cheap to calculate over thousands of metrics and permutations
Recovery Detection
Critical to see if an alert is ongoing or transient
Real-time recovery notice prevents unnecessary action
Adds additional information for root cause
Don't need to alert if it's recovered
How to Detect Recovery?
Calculate distribution of pre-alert set
Do the same for the recent range of values
Can we confidently say the recent range is part of the same distribution?
Distribution Boundaries
Hoeffding Bound
For range r and sample size n,
The true mean of the distribution is within ϵ
of the empirical mean with 1 - 𝛿 certainty
ϵ = √( r² · log(1/δ) / (2n) )
[Chart: Hoeffding bound at certainty levels (1-𝛿) of 99.9%, 99%, and 95%]
Recovery Algorithm
Hoeffding bound on pre-alert distribution
Mean on last 30 seconds of data points
Is recent mean below the Hoeffding Bound?
If yes, we're back within a healthy range
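A sketch of that recovery check, using ϵ = √(r²·log(1/δ)/(2n)) from the previous slide; the window contents and the 99% certainty level below are illustrative choices:

import java.util.Arrays;

// Recovery check: compute the Hoeffding bound epsilon for the pre-alert
// (healthy) window, then ask whether the recent mean falls back within
// epsilon of the healthy mean.
public class RecoveryCheck {

    // epsilon = sqrt( r^2 * ln(1/delta) / (2n) )
    static double hoeffdingEpsilon(double range, double delta, int n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0.0);
    }

    static boolean recovered(double[] preAlert, double[] recent, double delta) {
        double range = Arrays.stream(preAlert).max().orElse(0.0)
                     - Arrays.stream(preAlert).min().orElse(0.0);
        double epsilon = hoeffdingEpsilon(range, delta, preAlert.length);
        // Recovered if the recent mean is back within epsilon of the healthy mean.
        return mean(recent) <= mean(preAlert) + epsilon;
    }

    public static void main(String[] args) {
        double[] preAlert = {4, 5, 6, 5, 4, 5, 6, 5, 4, 5};  // healthy error counts
        double[] stillBad = {40, 38, 42};
        double[] backToNormal = {6, 5, 4};

        System.out.println(recovered(preAlert, stillBad, 0.01));      // false
        System.out.println(recovered(preAlert, backToNormal, 0.01));  // true
    }
}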
[Chart: pre-alert range vs. recovery ranges]
Making Alerts Useful
Building Raju
[Diagram: Zuul Request Source → Aggregator → Anomaly Detector → Raju]
Zuul Aggregation
Select status for all requests
Aggregate values of status
Group by Zuul cluster and destination origin
Window for 10s with 1s interval
Groupings: cluster, origin, cluster_origin
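A sketch of that aggregation step for a single window; the RequestEvent record and field names are assumptions, and the real job slides a 10-second window every second:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Count request statuses within a window for each grouping:
// cluster, origin, and cluster_origin.
public class ZuulAggregationSketch {

    record RequestEvent(String cluster, String origin, String status) {}

    static Map<String, Map<String, Long>> aggregate(List<RequestEvent> window) {
        Map<String, Map<String, Long>> counts = new HashMap<>();
        for (RequestEvent e : window) {
            for (String grouping : List.of(e.cluster(),
                                           e.origin(),
                                           e.cluster() + "_" + e.origin())) {
                counts.computeIfAbsent(grouping, g -> new HashMap<>())
                      .merge(e.status(), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<RequestEvent> window = List.of(
            new RequestEvent("zuul-cluster-1", "api", "SUCCESS"),
            new RequestEvent("zuul-cluster-1", "api", "FAILURE_THROTTLED"),
            new RequestEvent("zuul-cluster-2", "api", "SUCCESS"));

        // e.g. {zuul-cluster-1={SUCCESS=1, FAILURE_THROTTLED=1}, api={...}, ...}
        System.out.println(aggregate(window));
    }
}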
Detection On All Data
Each grouping will return a set of values
Flatten groupings into a list of all combinations
Detect for each combination of groupings
Anomaly on any of the values gives us info
about which grouping is in trouble
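A sketch of flattening groupings into per-combination keys and running one detector per key; the simple threshold test here is an illustrative stand-in for the median-based detector described earlier:

import java.util.HashMap;
import java.util.Map;

// One estimator per "grouping|status" combination; an anomaly on any
// combination points at the grouping that is in trouble.
public class PerGroupingDetection {

    private final Map<String, Double> estimates = new HashMap<>();

    boolean isAnomaly(String grouping, String status, long count) {
        String key = grouping + "|" + status;
        double estimate = estimates.getOrDefault(key, (double) count);
        // Nudge the estimate toward the observed count (median-style update).
        estimates.put(key, estimate + Math.signum(count - estimate));
        return count > 3 * Math.max(estimate, 1.0);
    }

    public static void main(String[] args) {
        PerGroupingDetection detector = new PerGroupingDetection();
        // Warm up on normal traffic, then feed a spike for one combination.
        for (int i = 0; i < 100; i++) {
            detector.isAnomaly("zuul-cluster-1", "FAILURE_THROTTLED", 40);
        }
        System.out.println(
            detector.isAnomaly("zuul-cluster-1", "FAILURE_THROTTLED", 1000)); // true
    }
}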
[Diagram: Origins 1 and 2 behind Cluster 1; Origins 3 and 4 behind Cluster 2]
Anomalous Groupings
{"metric":"zuul_status",
"predictions": [
{"value":"SUCCESS",
"count":4551,
"grouping":"zuul-cluster-1",
"isAnomaly": false},
{"value":"FAILURE_READ_TIMEOUT",
"count":40,
"grouping":"zuul-cluster-1",
"isAnomaly": false},
{"value":"FAILURE_THROTTLED",
"count":1000,
"grouping":"zuul-cluster-1",
"isAnomaly": true},
{"value":"SUCCESS",
"count":100,
"grouping":"zuul-cluster-2",
"isAnomaly": false},
{"value":"FAILURE_READ_TIMEOUT",
"count":100,
"grouping":"zuul-cluster-1_api",
"isAnomaly": false},
{"value":"SUCCESS",
"count":100,
"grouping":"zuul-cluster-2_api",
"isAnomaly": false}
{"value":"FAILURE_500",
"count":100,
"grouping":"api",
"isAnomaly": false}]}
Making Detection Useful
Most anomalies aren't really anomalies
How do we assess impact?
Has there been a recovery?
How can we filter out the noise?
Rules Engine
Open a window for all anomalies
At the end of the window, run rules on each
Rules tag events with actionable details
Make decisions on tagged events
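A sketch of the rules-engine shape; the Rule interface, tag names, and example rules are illustrative, not the actual Raju types:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// At the end of the anomaly window, every rule inspects each anomaly and
// tags it with actionable details; decisions are made on the tagged event.
public class RulesEngineSketch {

    interface Rule {
        void apply(Map<String, Object> anomaly);
    }

    public static void main(String[] args) {
        Map<String, Object> anomaly = new HashMap<>(Map.of(
            "grouping", "zuul-cluster-1",
            "value", "FAILURE_THROTTLED",
            "count", 1000L));

        List<Rule> rules = List.of(
            a -> a.put("description", a.get("grouping") + " started throttling requests"),
            a -> a.put("alert", ((Long) a.get("count")) > 100));  // filter out tiny blips

        for (Rule rule : rules) {
            rule.apply(anomaly);
        }
        System.out.println(anomaly);   // tagged event, ready for alerting decisions
    }
}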
Useful Rules
Time - start of anomaly and recovery
Impact - percentage of traffic affected
Description - sentence describing errors
Alert - filter out anomalies that aren't real
Page - page on critical errors
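As an example, the core of an Impact-style rule is just a percentage of the grouping's traffic; the numbers below echo the anomalous-groupings JSON above and are otherwise illustrative:

// Impact rule core: what share of the grouping's traffic is anomalous?
public class ImpactRule {

    static double impactPercent(long anomalousCount, long totalCountForGrouping) {
        if (totalCountForGrouping == 0) return 0.0;
        return 100.0 * anomalousCount / totalCountForGrouping;
    }

    public static void main(String[] args) {
        // e.g. 1000 throttled requests out of 5591 total for zuul-cluster-1
        System.out.printf("impact = %.2f%%%n", impactPercent(1000, 5591));
    }
}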
Timeline of Events
Open window on first anomaly
Close window after 3 minutes
End up with rich timeline of events
Time-based correlation will indicate issue
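A sketch of that windowing behavior: the first anomaly opens the window, later anomalies are appended, and anything past the 3-minute mark closes it and emits the timeline. The TimelineEntry record and timestamps are illustrative:

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Open a window on the first anomaly, collect anomalies for 3 minutes,
// then emit the time-ordered timeline.
public class TimelineWindow {

    record TimelineEntry(Instant time, String description) {}

    private Instant windowStart;
    private final List<TimelineEntry> entries = new ArrayList<>();

    List<TimelineEntry> onAnomaly(Instant time, String description) {
        if (windowStart == null) {
            windowStart = time;                       // first anomaly opens the window
        }
        if (Duration.between(windowStart, time).compareTo(Duration.ofMinutes(3)) > 0) {
            List<TimelineEntry> timeline = List.copyOf(entries);   // close the window
            entries.clear();
            windowStart = time;
            entries.add(new TimelineEntry(time, description));
            return timeline;                          // emit the finished timeline
        }
        entries.add(new TimelineEntry(time, description));
        return List.of();                             // window still open
    }

    public static void main(String[] args) {
        TimelineWindow window = new TimelineWindow();
        Instant t0 = Instant.parse("2018-06-01T07:11:33Z");
        window.onAnomaly(t0, "Zuul-Website / API Service connectivity issues at 13.08%");
        window.onAnomaly(t0.plusSeconds(37), "API Service timeouts at 1.29%");
        // An anomaly past the 3-minute mark closes and prints the timeline.
        System.out.println(window.onAnomaly(t0.plusSeconds(200), "next incident"));
    }
}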
Timeline of Events
7:11:33 AM - Zuul-Website (cluster), API Service (origin) started having
connectivity issues at 13.08%
7:12:10 AM - API Service (origin) started failing with timeouts at 1.29%
7:12:22 AM - Zuul-Website (cluster) started failing with timeouts at 1.78%
7:14:50 AM - API Service (origin) started throttling retries (e.g. retry
storm) at 43.86%
7:14:52 AM - Zuul-API (cluster), API Service (origin) started throttling
retries (e.g. retry storm) at 63.12%
7:14:52 AM - Zuul-API (cluster) started throttling retries (e.g. retry
storm) at 60.24%
[Diagram: API architecture with script and platform layers and Hystrix commands (H) in front of Apps 1-4; the needle hides in here]
Raju for API
Hundreds of Hystrix commands
Aggregate Hystrix response statuses
Group by command name and API cluster
Same timeline as Zuul!
Raju for API
11:31:13 AM - API - API Service (cluster), Authenticate Customer (hystrix cmd) started failing with
status FAILURE at 21.26%
11:33:50 AM - API - Authenticate Customer (hystrix cmd) started failing with status FAILURE at 21.35%
11:38:16 AM - Zuul started failing with request read timeouts for Zuul-API (cluster), API
Service (origin) at 1.18%
11:39:58 AM - Zuul - Zuul-Website (cluster), Website UI Service (origin) started throttling requests
at 32.63%
11:40:52 AM - Zuul - Zuul-Website (cluster), Website UI Service (origin) started failing with read
timeouts at 2.83%
11:41:02 AM - API - API Service (cluster), Get Customer Permissions (hystrix cmd) started failing with
status FAILURE at 5.11%
11:43:17 AM - Zuul - Zuul-API (cluster) started throttling retries (e.g. retry storm) at 1.92%
[Charts: start-play drop events alongside iOS UI Service metrics, Zuul cluster metrics, and Spinnaker deployment events]
Emailing the Culprits
Origins should be aware of upstream issues
They may not be aware of any problems
They need the notification more than we do
Follow up with recovery notifications
Is it better?
Machine Learning
Hard to Train
Comparatively few outages
Very hard to train a reinforcement-learning model
False positive rate would be very high
Non-stationary data set
Benefits
Small, simple codebase
Predictable behavior
Hand-written rules = high true-positive rate
Cheap processing for thousands of signals
Tradeoffs
Not easily transferable across services
Rules need to be created manually
Alerting decisions need to be tweaked
Need deep operational knowledge
Simplicity FTW
Thank you.
Arthur Gonigberg
Cloud Gateway Team
Twitter @agonigberg
Github artgon
