Why Nobody Cares About Your Anomaly Detection
Baron Schwartz - InfluxDays 2017 San Francisco - November 2017
https://www.flickr.com/photos/muelebius/14113267399
Skepticism From John Allspaw
“… your attempts to detect anomalies perfectly, at the right time, is not possible…”
https://www.kitchensoap.com/2015/05/01/openlettertomonitoringproducts/
…And Ewaschuk and Beyer
“In general, Google has trended toward simpler and faster monitoring
systems, with better tools for post hoc analysis. We avoid ‘magic’ systems
that try to learn thresholds or automatically detect causality.”

— The Google SRE book: Monitoring Distributed Systems Chapter
… But Not This Vendor
What Good Is Anomaly Detection?
• How does it work?

• Why is it so hard?

• What’s it good for anyway?
A Rose By Any Other Name
• “Machine Learning”

• “Dynamic Baselining”

• “Automatic Thresholds”

• “Adaptive Self-Learning Serverless IoT Big Data Blockchain”
How Anomaly Detection Works
• An anomaly is usually defined as “something abnormal.”

• Normal is usually defined by a mathematical model.

• Anomaly detection, in this sense, is really prediction/forecasting.
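
A minimal sketch of that forecast-and-compare loop (illustrative only: the trailing-average “model” and the 3-sigma threshold are assumptions, not any particular product’s method):

    import statistics

    def detect_anomalies(series, window=30, k=3.0):
        """Flag points whose residual against a trailing-average forecast
        exceeds k standard deviations of the past residuals."""
        anomalies, residuals = [], []
        for i in range(window, len(series)):
            forecast = statistics.fmean(series[i - window:i])  # the "model of normal"
            residual = series[i] - forecast
            if len(residuals) >= window:
                sigma = statistics.stdev(residuals)
                if sigma > 0 and abs(residual) > k * sigma:
                    anomalies.append(i)  # "abnormal" = far from the forecast
            residuals.append(residual)
        return anomalies

Every choice above (the window, the multiplier k, the model itself) is an implicit answer to the question the next slide asks: what’s normal?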
What’s Normal?
• Most people answer this question reflexively, with lots of unconscious
biases.

• The answer is usually “if a measurement is ± two standard deviations…”

• What’s implicit/assumed is:

• What’s the model that produces the forecast?

• What assumptions does it make about the data?

• What’s the cost/benefit of correct/incorrect predictions?
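
To make the hidden assumption concrete: the “± two standard deviations” rule implies roughly Gaussian data, where about 5% of points fall outside the band. A quick check with synthetic data (both distributions are made up, chosen only to illustrate):

    import random, statistics

    random.seed(42)
    gaussian = [random.gauss(0, 1) for _ in range(100_000)]
    # A spikier stand-in: mostly quiet, with rare heavy-tailed bursts,
    # which is closer to how production metrics behave.
    spiky = [random.gauss(0, 1) +
             (random.paretovariate(2) if random.random() < 0.01 else 0.0)
             for _ in range(100_000)]

    for name, xs in (("gaussian", gaussian), ("spiky", spiky)):
        mu, sigma = statistics.fmean(xs), statistics.stdev(xs)
        frac = sum(abs(x - mu) > 2 * sigma for x in xs) / len(xs)
        print(f"{name}: {frac:.2%} of points outside the 2-sigma band")

The two fractions differ, so the same rule fires at very different rates depending on the shape of the data.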
The Ad Nauseam Anomaly Picture
Pretty pictures with shaded bands! :-)
A More Useful Definition of Anomaly
An anomaly is an event that has impact greater than the cost of remediation,
and which is actionable by a person.

Restated: people always think they want to know what’s abnormal/weird, but
they really want to know what’s wrong and what to fix.
They don’t realize this till they experience being notified of abnormalities.
Why Is It Hard?
#1: Real-Time Often Isn’t
• We often assume anomaly detection “in real time” is possible/desirable.

• But what does that mean? People’s definitions vary wildly.

“Why checking your KPI several times a day? To detect problems as fast
as possible.”
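
Some back-of-the-envelope arithmetic shows why “real time” is slippery. Every number below is made up but typical; the point is that rollup windows, noise-suppressing confirmation logic, and pipeline lag all add up before any alert fires:

    # Hypothetical but typical pipeline parameters.
    aggregation_window_s = 300  # detector evaluates 5-minute rollups
    confirmation_samples = 3    # require 3 consecutive breaches to suppress noise
    pipeline_delay_s = 30       # collection, ingest, and evaluation lag

    latency_s = (aggregation_window_s                      # first rollup must close
                 + (confirmation_samples - 1) * aggregation_window_s
                 + pipeline_delay_s)
    print(f"~{latency_s / 60:.0f} minutes from incident onset to alert")  # ~16 minutes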
#2: Real-Time Data Is Noisy
The beautiful charts always seem to come from long timescales, on the order
of days or weeks. At the 1-second time scale, systems are incredibly noisy.
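
A synthetic illustration: averaging shrinks noise by roughly the square root of the number of samples, which is why week-scale charts look smooth and second-scale charts look like static (the numbers here are invented):

    import random, statistics

    random.seed(1)
    # One hour of noisy per-second measurements around a true rate of 100.
    per_second = [random.gauss(100, 30) for _ in range(3600)]
    # The same data rolled up into 5-minute (300-sample) averages.
    rollups = [statistics.fmean(per_second[i:i + 300]) for i in range(0, 3600, 300)]

    print("stdev at 1 second :", round(statistics.stdev(per_second), 1))  # ~30
    print("stdev at 5 minutes:", round(statistics.stdev(rollups), 1))     # ~30 / sqrt(300), i.e. ~1.7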
#3: Cost/Benefit Asymmetry
• What’s the benefit of a true positive or true negative? What’s the cost?

• The sensitivity/specificity tradeoff is very unbalanced.

• And because your systems are much noisier than you think, you’re
probably wrong about the number of false positives/negatives you’ll get.

• The signal-to-noise ratio turns out to be really poor.

• Even if the anomaly detection isn’t wrong, if it’s not actionable, it’s still
damaging.
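
The base-rate arithmetic makes the asymmetry concrete. With hypothetical numbers (one real incident per 10,000 minutes, and a detector that looks excellent on paper):

    base_rate = 1 / 10_000  # fraction of minutes containing a real incident
    recall = 0.99           # chance the detector catches a real incident
    false_pos_rate = 0.01   # chance it alerts on a normal minute anyway

    true_alerts = base_rate * recall
    false_alerts = (1 - base_rate) * false_pos_rate
    precision = true_alerts / (true_alerts + false_alerts)
    print(f"fraction of alerts that are real: {precision:.1%}")  # ~1%

Roughly 99 out of every 100 alerts are false, from a detector that is 99% accurate by both of the usual measures.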
#4: Results Aren’t Interpretable
• Most anomaly detection techniques use complex models that are black
boxes combining many moving pieces, many of which are
nondeterministic.

• It’s often nearly impossible to agree or disagree with the outcome.

• Even a simple exponential moving average can be hard to audit.
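
For example, an exponential moving average folds every prior point into its current value, so you can’t reconstruct why it alerted from the recent chart alone. A small sketch:

    def ewma(series, alpha=0.1):
        """Exponentially weighted moving average: each output depends, with
        geometrically decaying weight, on every point that came before it."""
        avg, out = series[0], []
        for x in series:
            avg = alpha * x + (1 - alpha) * avg
            out.append(avg)
        return out

    # Two histories whose recent data is identical, but whose pasts differ,
    # produce different "baselines" for the same moment.
    a = [0.0] * 50 + [10.0] * 10
    b = [20.0] * 50 + [10.0] * 10
    print(ewma(a)[-1], ewma(b)[-1])  # ~6.5 vs ~13.5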
#5: High Cognitive Load
• Systems that abstract/process data and present black-box outcomes are
difficult for engineering teams to act on.

• In firefights, uncertainty, stress, time pressure, and consequences are all
at very high levels.

• Engineers generally will work to reduce these factors, which means they
ignore abstract, non-auditable conclusions they aren’t sure whether to
trust.

• Engineers usually want interpretable, raw data.
#6: Highly Dynamic Systems
• Most systems exhibit trainable periodicity on the scale of weeks, but many such systems have useful lifetimes on the order of hours or days before the underlying model disappears or changes.

• This means a lot of anomaly detection techniques are obsolete before
they’re even usable.
#7: Stored Baselines
• If a product calculates “baselines,” should it store them or calculate them on the fly?

• If stored, they become obsolete if the system’s parameters/model
changes, or if the algorithm is upgraded.

• If derived, they’re often not practically computable, or unavailable for use
in many popular tools that can only read “real” metrics from storage.
#8: Anomalies Skew Forecasts
• Most feasible models predict things like trend and seasonality.

• Anomalies will perturb these models and cause them to forecast repeated
anomalies.

• Compensating for these factors makes the models a lot less feasible and
understandable.
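
A concrete sketch with a seasonal-naive stand-in (simpler than Holt-Winters, but it shows the same failure): one genuine spike in the training window drags the “expected” value for that hour upward, so a repeat of the spike looks normal and normal traffic looks low:

    import statistics

    period = 24  # hourly data with daily seasonality

    # Two weeks of a flat metric, plus one genuine anomaly at 10:00 on day 3.
    history = [100.0] * (14 * period)
    history[3 * period + 10] = 500.0

    # Seasonal forecast: the mean of the same hour across all past days.
    forecast_10am = statistics.fmean(history[h] for h in range(10, len(history), period))
    print(forecast_10am)  # ~128.6, not 100: the one-off spike is now "expected"

Swapping the mean for a median fixes this particular case, but every such compensation makes the model harder to audit.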
#9: Vendor Hype
When the vendor obviously uses Holt-Winters Forecasting, but calls it
“machine learning” (presumably ML is used to choose params?)…

When a familiar technique like K-Means Clustering is called Artificial
Intelligence…

… we all lose confidence and credibility in the eyes of users.

… and our users have expectations we can’t realistically meet.
What’s It Good For?
First - Why Do People Want It?
1. They’ve got a LOT of metrics and can’t look at them all.

2. Vendors and conference thought-leaders told them anomaly detection
worked well.

3. They’ve had problems, noticed a metric spiking, and thought “if only
we’d known sooner about that.”

4. They’re engineers, so they think “this has to be a solvable problem.”
#1: Very Specific, Targeted Uses
• You have an absolutely critical, sensitive high-level KPI like pageviews

• Fast-moving data that’s extremely predictable and consistent

• You have validated the exact behavior and expect it to be immortal
#2: Capacity Planning
• This is forecasting, not anomaly detection.

• This is an important use case for Netflix,
Twitter, and others.

• Question: is a Christmas spike an anomaly?
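
A minimal capacity-forecasting sketch (all numbers are hypothetical): fit a trend and extrapolate to a known limit, with no anomaly detection anywhere in sight:

    import statistics

    # Hypothetical daily disk-usage samples in GB over two weeks.
    usage = [412, 418, 421, 430, 433, 441, 446, 452, 455, 463, 467, 474, 478, 485]
    days = list(range(len(usage)))

    fit = statistics.linear_regression(days, usage)  # Python 3.10+
    capacity_gb = 1000
    days_left = (capacity_gb - usage[-1]) / fit.slope
    print(f"~{days_left:.0f} days until the disk fills at the current growth rate")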
#3: You Have A Team Of Data Scientists
It’s not a coincidence that many of the anomaly detection success stories come from companies with dedicated, full-time data science teams. With PhDs.
#4: Context, Not Detection
• When you’re troubleshooting an incident, and you see a spike in a metric, a great question is “what does this metric normally do?”

• On-the-fly calculation and visualization of that answer can be helpful.

• The mistake is to take it one step too far and think “I wish I could set an
alert on this…”
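
A sketch of that on-the-fly context calculation (the helper name and the quartile band are illustrative choices, not any product’s feature):

    import statistics
    from collections import defaultdict

    def normal_band(samples, minute_of_day):
        """samples: (minute_of_day, value) pairs from the past several days.
        Returns a (p25, median, p75) band for one minute of the day, meant to
        be drawn behind the live line purely as visual context."""
        by_minute = defaultdict(list)
        for minute, value in samples:
            by_minute[minute].append(value)
        p25, p50, p75 = statistics.quantiles(by_minute[minute_of_day], n=4)
        return p25, p50, p75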
“What Does This Metric Normally Do?”
[Charts of the same metric at 1-hour and 12-hour timescales]
#5: You Have A Specific Question
In my experience, a lot of the ills have come from thinking anomaly detection
is an answer, when the question/problem isn’t clear yet.
#6: If You Can’t Get It Any Other Way
Are you sure you need anomaly detection?

• Scenario: “Our rate of new-account signups per minute is a business KPI,
and we want to know if it’s broken for any reason. It’s highly cyclical and
predictable.”

• Solution 1: “This sounds ideal for time-series prediction, maybe with Holt-
Winters, and anomaly detection when there’s a deviation from the
prediction.”

• Solution 2: “Calculate the pageview:signup conversion rate by dividing two series, and alert if it drops, using a static threshold.” (See the sketch below, and the next slide.)
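
A sketch of Solution 2 (the names and the 0.2% floor are made up): dividing the two series cancels the daily cycle they share, so a plain static threshold does the job:

    def conversion_alert(pageviews, signups, floor=0.002):
        """Alert when the signup:pageview conversion rate drops below a static
        floor. Both series rise and fall together through the day, so their
        ratio stays steady unless signups actually break."""
        if pageviews == 0:
            return False  # no traffic at all is a different alert's job
        return signups / pageviews < floor

    print(conversion_alert(pageviews=50_000, signups=140))  # False: ~0.28% is healthy
    print(conversion_alert(pageviews=50_000, signups=60))   # True: ~0.12%, something broke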
Ask A 2-Dimensional Question
Instead of “what’s this metric’s behavior?” you’re asking “what’s this metric’s relationship to another?”

https://www.vividcortex.com/blog/correlating-metrics
Perl Monitoring Problems
$problems =~ s/regular expressions?/anomaly detection/gi
https://xkcd.com/1171/
A War Story
At VividCortex, we have (had) two kinds of anomaly detection.

• First, we built adaptive fault detection. It applies anomaly detection to a model
based on Little’s Law and queueing theory. It assigns specific meaning to a few
specific metrics that have an underlying physical basis. 

• The outcome has a well defined meaning too: “work is queueing up.”

• It turned out to be really hard to get the false positive rate down, even in this well-controlled setting. It requires machine learning (!!).

• The result is still more difficult for customers to interpret than we’d like. “Can I set
my own threshold? What does it mean for this one to be bigger than that one?
What does the score really mean? What should I do about these? Can’t you just…”
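
The slides don’t give the algorithm, but the core Little’s Law idea can be sketched like this (a rough illustration of the concept only, not VividCortex’s implementation; the slack factor is invented):

    def work_is_queueing(throughput, latency_s, observed_concurrency, slack=1.5):
        """Little's Law: N = lambda * R. Expected concurrency equals throughput
        (requests/sec) times residence time (sec). Observed concurrency running
        well above that expectation means work is queueing up."""
        expected = throughput * latency_s
        return observed_concurrency > slack * expected

    # 400 req/s at 25 ms should keep about 10 requests in flight on average.
    print(work_is_queueing(400, 0.025, observed_concurrency=12))  # False: near expectation
    print(work_is_queueing(400, 0.025, observed_concurrency=40))  # True: queue is building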
Traditional Dynamic Baselines
At VividCortex we also built limited “dynamic baselining” on top of modified
Holt-Winters prediction.

• We baselined latency and error rate of the most frequent and time-consuming queries in the system.

• Customers don’t use it, even though it remains a constant hypothetical request (“I’d like to be alerted when important queries have significant latency spikes”).

• This is probably a case of customers asking for a faster horse. It’s also
possible that we just didn’t implement it well enough.
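
For reference, the textbook additive Holt-Winters this kind of baselining builds on (a plain sketch, not the modified version mentioned above; it needs at least two full seasons of data):

    def holt_winters(y, m, alpha=0.3, beta=0.05, gamma=0.2):
        """Additive Holt-Winters (triple exponential smoothing), season
        length m. Returns one-step-ahead forecasts for y[m:], which serve
        as the "baseline" a dynamic-baselining alert compares against."""
        level = sum(y[:m]) / m
        trend = (sum(y[m:2 * m]) - sum(y[:m])) / (m * m)
        season = [v - level for v in y[:m]]
        forecasts = []
        for t in range(m, len(y)):
            s = season[t % m]
            forecasts.append(level + trend + s)  # predict before seeing y[t]
            prev_level = level
            level = alpha * (y[t] - s) + (1 - alpha) * (level + trend)
            trend = beta * (level - prev_level) + (1 - beta) * trend
            season[t % m] = gamma * (y[t] - level) + (1 - gamma) * s
        return forecasts

Alert when y[t] stays far from its forecast, and note how many tunable constants already sit inside this “simple” model.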
Okay, There Was A Third…
• The brilliant CEO built “Baggins” anomaly detection, then turned it off in
horror at the spam it generated.

• The cleverest thing about it was the name.
Some Books
