In a large-scale distributed system, detecting and pinpointing failures gets exponentially harder as the architecture grows more complex. Netflix's cloud architecture is composed of thousands of services and hundreds of thousands of VMs and containers. Failures can happen at any level and often cascade quickly: some cause massive outages across several systems, while others break only one or two. This creates a needle-in-a-haystack problem that requires automated and precise detection. Zuul, as the front door for all of Netflix's cloud traffic, sees every request and response and is ideally positioned to identify and isolate only the broken paths in the maze of microservices.
We leveraged Zuul to stream real-time events for each request and response, and built an anomaly detector to automatically identify and alert on services in trouble. We scaled this detector to thousands of nodes, handling millions of requests, without a single line of machine learning. Sometimes you need machine learning and sometimes you don't. Although it's in vogue to apply machine learning to every problem, it can be more practical and approachable to solve certain problems with old-fashioned math!
In this talk, we'll discuss how we built this system with stream processing, anomaly detection algorithms, and a rules engine. We will also deep-dive into the anomaly detection algorithm and show how sometimes a simple, elegant algorithm can be just as good as any sophisticated machine learning.
Scalable Anomaly Detection (with Zero Machine Learning)
22. Need for Speed
Metrics processing delay
Rolling counts to prevent false positives
Several loops of investigation and paging
Slower resolutions mean massive impact
24. Need for Context
Alerts fire on a single metric in a single app
Don't get or look at other teams' alerts
All suspects will be paged
Need to show what, where, when in alert
25. Need for Accuracy
Static thresholds don't catch all issues
Built for obvious high impact issues
Need more adaptive thresholds
27. What's the Solution?
Alerting in real-time
Combine several anomalies into one alert
Orient operators to shorten investigation
Indicate problem system to reduce pages
29. Mantis Framework
Collect and process real-time events
All critical services send events to Mantis
Handles millions of events per second
Events are queried and pushed on demand
36. Requirements
Cheap real-time analysis
Stable enough to detect sharp spikes
Dynamic enough to adjust to new trends
Lenient enough to filter false positives
38. Median Estimation Algorithm
The "Cody" algorithm (aka MAD)
Effectively stochastic gradient descent
If streaming_value > median_estimate
median_estimate++
If streaming_value < median_estimate
median_estimate--
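Below is a minimal runnable sketch of this estimator in Java, assuming a fixed step size per observation; the class and method names are illustrative, not Netflix's actual code.

// Minimal sketch of the streaming median estimator described above.
// The fixed step size is an assumption for illustration.
public final class StreamingMedianEstimator {
    private double estimate;
    private final double step;

    public StreamingMedianEstimator(double initialEstimate, double step) {
        this.estimate = initialEstimate;
        this.step = step;
    }

    // Nudge the estimate toward each incoming value; with equal mass
    // above and below, the estimate settles near the stream's median.
    public double update(double value) {
        if (value > estimate) {
            estimate += step;
        } else if (value < estimate) {
            estimate -= step;
        }
        return estimate;
    }

    public double estimate() {
        return estimate;
    }
}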
41. Simple and Useful
Simple codebase
Predictable behavior
Very cheap to calculate over thousands of metrics and permutations
42. Recovery Detection
Critical to see if alert is ongoing or transient
Real-time recovery notice prevents unneeded action
Adds additional information for root cause
Don't need to alert if it's recovered
43. How to Detect Recovery?
Calculate distribution of pre-alert set
Do the same for the recent range of values
Can we confidently say the recent range is part of the same distribution?
44. Distribution Boundaries
Hoeffding Bound
For range r and sample size n, the true mean of the distribution is within ϵ of the empirical mean with certainty 1 - δ, where
ϵ = √( r² ln(1/δ) / (2n) )
46. Recovery Algorithm
Hoeffding bound on pre-alert distribution
Mean on last 30 seconds of data points
Is recent mean below the Hoeffding Bound?
If yes, we're back within a healthy range
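A sketch of how this recovery check could look, under the assumption that "below the Hoeffding bound" means the recent mean is within ϵ of the pre-alert empirical mean; all names here are illustrative.

import java.util.List;

// Sketch of the recovery check described above: compute the Hoeffding
// bound ϵ from the pre-alert window, then test whether the mean of the
// most recent values (e.g. last 30 seconds) is back within ϵ of the
// pre-alert mean. Names are illustrative, not Netflix's code.
public final class RecoveryDetector {

    // ϵ = √( r² ln(1/δ) / (2n) ) for value range r, sample size n,
    // and confidence 1 - δ.
    static double hoeffdingEpsilon(double range, int n, double delta) {
        return Math.sqrt((range * range * Math.log(1.0 / delta)) / (2.0 * n));
    }

    static double mean(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    // Recovered when the recent mean has dropped back to within ϵ of
    // the pre-alert (healthy) mean.
    static boolean hasRecovered(List<Double> preAlert, List<Double> recent,
                                double range, double delta) {
        double epsilon = hoeffdingEpsilon(range, preAlert.size(), delta);
        return mean(recent) <= mean(preAlert) + epsilon;
    }
}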
50. Zuul Aggregation
Select status for all requests
Aggregate values of status
Group by Zuul cluster and destination origin
Window for 10s with 1s interval
Groupings: cluster, origin, cluster_origin
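A rough sketch of this aggregation for one window, assuming each Zuul event exposes its cluster, origin, and status; the event shape and class names are hypothetical and stand in for the actual Mantis query.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the per-window aggregation: count statuses per
// grouping key (cluster, origin, and cluster_origin).
// The ZuulEvent shape is hypothetical, not the actual Mantis event.
record ZuulEvent(String cluster, String origin, String status) {}

final class StatusAggregator {
    // key -> status -> count, for one 10-second window of events
    static Map<String, Map<String, Long>> aggregate(List<ZuulEvent> window) {
        Map<String, Map<String, Long>> counts = new HashMap<>();
        for (ZuulEvent e : window) {
            for (String key : List.of(
                    "cluster:" + e.cluster(),
                    "origin:" + e.origin(),
                    "cluster_origin:" + e.cluster() + "_" + e.origin())) {
                counts.computeIfAbsent(key, k -> new HashMap<>())
                      .merge(e.status(), 1L, Long::sum);
            }
        }
        return counts;
    }
}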
51. Detection On All Data
Each grouping will return a set of values
Flatten groupings into a list of all combos
Detect for each combination of groupings
Anomaly on any of the values gives us info about which grouping is in trouble
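Continuing the sketch, one detector instance per (grouping, status) combination is enough to localize trouble. The wiring below reuses the StreamingMedianEstimator sketch from earlier and is illustrative only; the deviation threshold is an assumption, not the production rule.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative wiring only: one estimator per (grouping key, status)
// combination, fed the aggregated count once per window.
final class PerGroupingDetector {
    private final Map<String, StreamingMedianEstimator> estimators = new HashMap<>();

    // Returns the combinations whose latest count sits far above the
    // running median estimate (the anomaly signal on this slide).
    List<String> detect(Map<String, Map<String, Long>> windowCounts, double threshold) {
        List<String> anomalies = new ArrayList<>();
        windowCounts.forEach((group, statuses) -> statuses.forEach((status, count) -> {
            String key = group + "|" + status;
            StreamingMedianEstimator estimator =
                estimators.computeIfAbsent(key, k -> new StreamingMedianEstimator(count, 1.0));
            double median = estimator.update(count);
            if (count > median + threshold) {
                anomalies.add(key);
            }
        }));
        return anomalies;
    }
}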
54. Making Detection Useful
Most anomalies aren't really anomalies
How do we assess impact?
Has there been a recovery?
How can we filter out the noise?
55. Rules Engine
Open a window for all anomalies
At the end of the window, run rules on each
Rules tag events with actionable details
Make decisions on tagged events
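A bare-bones sketch of the shape this implies: each rule inspects an anomaly event collected in the window and attaches tags, and alerting or paging decisions read only the tags. The types and fields below are hypothetical.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shapes for the rules engine: rules annotate anomaly
// events with tags; downstream decisions read only the tags.
interface Rule {
    void apply(AnomalyEvent event, Map<String, String> tags);
}

class AnomalyEvent {
    final String grouping;   // e.g. a cluster, origin, or cluster_origin key
    final double impactPct;  // share of traffic affected
    AnomalyEvent(String grouping, double impactPct) {
        this.grouping = grouping;
        this.impactPct = impactPct;
    }
}

final class RulesEngine {
    private final List<Rule> rules = new ArrayList<>();

    void register(Rule rule) { rules.add(rule); }

    // Run every rule against one anomaly from the window and return the
    // tags that later decide whether to alert or page.
    Map<String, String> evaluate(AnomalyEvent event) {
        Map<String, String> tags = new HashMap<>();
        for (Rule rule : rules) {
            rule.apply(event, tags);
        }
        return tags;
    }
}

For example, an impact rule could tag any anomaly affecting more than some share of traffic (25% here, purely as an example) as pageable: engine.register((event, tags) -> { if (event.impactPct > 25.0) tags.put("page", "true"); });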
56. Useful Rules
Time - start of anomaly and recovery
Impact - percentage of traffic affected
Description - sentence describing errors
Alert - filter out anomalies that aren't real
Page - page on critical errors
57. Timeline of Events
Open window on first anomaly
Close window after 3 minutes
End up with rich timeline of events
Time-based correlation will indicate issue
59. Timeline of Events
7:11:33 AM - Zuul-Website (cluster), API Service (origin) started having connectivity issues at 13.08%
7:12:10 AM - API Service (origin) started failing with timeouts at 1.29%
7:12:22 AM - Zuul-Website (cluster) started failing with timeouts at 1.78%
7:14:50 AM - API Service (origin) started throttling retries (e.g. retry storm) at 43.86%
7:14:52 AM - Zuul-API (cluster), API Service (origin) started throttling retries (e.g. retry storm) at 63.12%
7:14:52 AM - Zuul-API (cluster) started throttling retries (e.g. retry storm) at 60.24%
62. Raju for API
Hundreds of Hystrix commands
Aggregate Hystrix response statuses
Group by command name and API cluster
Same timeline as Zuul!
63. Raju for API
11:31:13 AM - API - API Service (cluster), Authenticate Customer (hystrix cmd) started failing with status FAILURE at 21.26%
11:33:50 AM - API - Authenticate Customer (hystrix cmd) started failing with status FAILURE at 21.35%
11:38:16 AM - Zuul started failing with request read timeouts for Zuul-API (cluster), API Service (origin) at 1.18%
11:39:58 AM - Zuul - Zuul-Website (cluster), Website UI Service (origin) started throttling requests at 32.63%
11:40:52 AM - Zuul - Zuul-Website (cluster), Website UI Service (origin) started failing with read timeouts at 2.83%
11:41:02 AM - API - API Service (cluster), Get Customer Permissions (hystrix cmd) started failing with status FAILURE at 5.11%
11:43:17 AM - Zuul - Zuul-API (cluster) started throttling retries (e.g. retry storm) at 1.92%
67. Emailing the Culprits
Origins should be aware of upstream issues
May not be aware of any problems
They need the notification more than we do
Follow up with recovery notifications
71. Tradeoffs
Not easily transferable across services
Rules need to be created manually
Alerting decisions need to be tweaked
Need deep operational knowledge