This document discusses observability and incident management. It notes that incidents are expensive and reduce credibility. Common causes of outages include changes, network failures, bugs, human errors, hardware failures, and unspecified issues. The timeline of an outage includes detection, investigation, escalation, and fixing. Many companies have a "zoo" of monitoring solutions that are difficult to manage. Common anti-patterns include an exponential growth of metrics that nobody understands. The document advocates focusing on key performance indicator metrics and using time-series databases, distributed tracing, and machine learning to more quickly detect anomalies and reduce incident timelines. It describes an open source project called Timetrix that combines metrics, events and traces for improved observability.
2. About me
More than 19 years of professional experience
FinTech and Data Science background
From Developer to SRE Engineer
Solved and automated some problems in Operations at scale
3. What are Incidents
• Something that has an impact at the operational/business level
• Incidents are expensive
• Incidents come with credibility costs
4. COST OF AN HOUR OF DOWNTIME (2017–2018)
https://www.statista.com/statistics/753938/worldwide-enterprise-server-hourly-downtime-cost/
5. Causes of an outage
• Change
• Network Failure
• Bug
• Human Factor
• Unspecified
• Hardware Failure
7. What is it all about?
• Any reduction of the outage/incident timeline results in a significant positive financial impact
• It is about credibility as well
• And your DevOps teams feel less pain and toil along the way
8. Overall problem
• A zoo of monitoring solutions in large enterprises, often distributed around the world
• M&A transactions or distributed teams make central management impossible or ineffective
• For small enterprises or startups the key question is finding the best solution
• A lot of companies have failed this way
• A lot of anti-patterns have developed
9. Managing a Zoo
• A lot of independent teams
• Everyone has some sort of solution
• It is hard to get an overall picture of operations
• It is hard to orchestrate and make changes
11. Common Anti-patterns
• It is tempting to keep everything recorded just in case
• The amount of metrics in monitoring grows exponentially
• Nobody understands such a huge bunch of metrics
• Engineering complexity grows as well
12. Uber case – 9 billion metrics / 1,000+ instances for monitoring
13. IF YOU NEED 9 BILLION METRICS, YOU ARE PROBABLY WRONG
14. Dashboards problem
• A proliferating amount of metrics leads to unusable dashboards
• How can one observe 9 billion metrics?
• Quite often it looks like spaghetti
• It is OK to pursue this anti-pattern for approx. 1.5 years
• GitLab Dashboards are a good example
19. Actually not
• Dashboards are very useful when you know where and when to watch
• Our brain can recognize and process visual patterns more effectively
• But only when you know what you are looking for and when
20. Queries vs. Dashboards
• Querying your data requires more cognitive effort than a quick look at dashboards
• Metrics are a low-resolution view of your system’s dynamics
• Metrics should not replace logs
• It is not necessary to have millions of them
22. Metrics
• It is almost impossible to operate on billions of metrics
• Even under normal system behavior there will always be outliers in real production data
• Therefore, not all outliers should be flagged as anomalous incidents
• The Etsy Kale project case
24. Paradigm Shift
• The main paradigm shift comes from the fields of infrastructure and architecture
• Cloud architectures, microservices, Kubernetes, and immutable infrastructure have changed the way companies build and operate systems
• Virtualization, containerization, and orchestration frameworks are responsible for providing computational resources and handling failures, which creates an abstraction layer over the underlying infrastructure
• Moving towards abstraction from the underlying hardware and networking means that we must focus on ensuring that our applications work as intended in the context of our business processes
25. KPI monitoring
• KPI metrics are related to the core business operations
• They could be logins, active sessions, or any domain-specific operations
• Heavily seasonal
• Static thresholds can’t help here
26. Our Solution
• Narrowing down the number of metrics to the defined KPI metrics
• We combined the push/pull model:
• Local push
• Central pull
• And we created an ML-based system which learns your metrics’ behavior (a rough sketch of the idea follows below)
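A minimal, hypothetical Java sketch of what such a learned, seasonality-aware baseline could look like. This is not the production system: the class name, the one-slot-per-hour-of-week layout, the smoothing factor, and the tolerance are assumptions for illustration only.

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

/**
 * Illustrative sketch only: learn a separate baseline per hour-of-week for a
 * KPI metric (capturing daily and weekly seasonality), update it with an
 * exponential moving average, and flag values that deviate strongly from the
 * baseline for that time slot.
 */
public class SeasonalKpiModel {
    private static final int SLOTS = 7 * 24;          // one slot per hour of the week
    private final double[] baseline = new double[SLOTS];
    private final double[] deviation = new double[SLOTS];
    private final boolean[] seen = new boolean[SLOTS];
    private final double alpha = 0.1;                  // assumed smoothing factor
    private final double tolerance = 4.0;              // assumed multiple of the average deviation

    /** Feed one KPI sample; returns true if it looks anomalous for this time slot. */
    public boolean observe(Instant timestamp, double value) {
        ZonedDateTime t = timestamp.atZone(ZoneOffset.UTC);
        int slot = (t.getDayOfWeek().getValue() - 1) * 24 + t.getHour();

        boolean anomalous = false;
        if (seen[slot]) {
            double error = Math.abs(value - baseline[slot]);
            anomalous = deviation[slot] > 0 && error > tolerance * deviation[slot];
            baseline[slot] = (1 - alpha) * baseline[slot] + alpha * value;
            deviation[slot] = (1 - alpha) * deviation[slot] + alpha * error;
        } else {
            baseline[slot] = value;
            deviation[slot] = Math.abs(value) * 0.1 + 1e-9; // crude initial deviation
            seen[slot] = true;
        }
        return anomalous;
    }
}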
28. Overwhelming results
• Red area – Customer Detection
• Blue area – Own Observation (toil)
• Orange line – Central Grafana introduced
• Green line – ML-based solution in production
Customer Detection has dropped to a few percentage points
29. General view
• Finding anomalies in metrics
• Finding regularities at a higher level
• Combining events from the organization’s internals (changes/deployments)
• Stream processing architectures
30. Storage
• We need to combine different signals
• One consistent storage gives benefits
• A time-series database is oriented toward these requirements
31. Why do we need time-series storage?
• We have unpredictable delays on the network
• Operating worldwide is a problem
• CAP theorem
• You can receive signals from the past
• But you should look into the future too
• How far into the future should this window extend?
32. Why not Kafka and all the classical streaming frameworks?
• Frameworks like Storm and Flink are oriented toward tuples, not time-ordered events
• We do not want to process everything
• A lot of events are needed only on demand
• It is OK to lose some signals in favor of performance
• And we still have signals from the past (see the window sketch below)
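To make the windowing question from these two slides concrete, here is a tiny Java sketch, under assumed window sizes, of accepting signals that arrive late from the past or slightly from the future and dropping everything else in favor of performance. It is not code from the talk.

import java.time.Duration;
import java.time.Instant;

/**
 * Illustrative sketch only: accept signals whose timestamps fall inside a
 * bounded window around "now" (some lateness from the past, a small allowance
 * for clock skew into the future) and drop the rest.
 */
public class SignalWindow {
    private final Duration maxLateness; // how far into the past we still accept
    private final Duration maxSkew;     // how far into the future we tolerate

    public SignalWindow(Duration maxLateness, Duration maxSkew) {
        this.maxLateness = maxLateness;
        this.maxSkew = maxSkew;
    }

    /** Returns true if the signal should be processed, false if it is dropped. */
    public boolean accept(Instant signalTime, Instant now) {
        return !signalTime.isBefore(now.minus(maxLateness))
            && !signalTime.isAfter(now.plus(maxSkew));
    }
}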
33. Why InfluxDB 2.0
• Flux
• Better isolation
• Central storage for metrics, events, and traces
• Same streaming paradigm
• There is no mismatch between meta-querying and querying
34. Taking a higher-level picture
• Finding anomalies at a lower level
• Tracing
• Event logs
• Finding regularities between them
• Building a topology
• We can call it AIOps as well
35. OpenTracing
• Tracing is a higher-resolution view of your system’s dynamics
• Distributed tracing can show you unknown unknowns
• It reduces the investigation part of the incident timeline
• Jaeger is a good OSS implementation (a minimal instrumentation sketch follows below)
• InfluxDB 2.0 is the supported backend storage
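A hedged example of what OpenTracing instrumentation with the Jaeger Java client typically looks like. The service and operation names are placeholders, and the trace backend (e.g. InfluxDB 2.0, as mentioned above) is configured on the collector side rather than in this code.

import io.jaegertracing.Configuration;
import io.opentracing.Scope;
import io.opentracing.Span;
import io.opentracing.Tracer;

/**
 * Illustrative sketch: build a Jaeger tracer from JAEGER_* environment
 * variables and wrap one unit of work in a span.
 */
public class TracedHandler {
    private final Tracer tracer = Configuration.fromEnv("checkout-service").getTracer();

    public void handleRequest() {
        Span span = tracer.buildSpan("handle-request").start();
        try (Scope scope = tracer.activateSpan(span)) {
            span.setTag("component", "demo");
            // ... business logic; child spans created here join the same trace
        } finally {
            span.finish();
        }
    }
}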
36. Jaeger with InfluxDB 2.0 as backend storage
• Real production case
• Approx. 8,000 traces every minute
• Performance issue with a limitation on I/O operations/connections
• Bursts of context switches at the kernel level
37. Impact on a particular execution flow
• The DB query time is quite constant
• Processing time in the normal case: 1–3 ms
• After a process context switch: more than 40 ms
38. Flux
• Multi-source joining
• Same functional composition paradigm
• Easy to test hypotheses
• You can combine metrics, event logs, and traces
• Data transformation based on conditions (see the query sketch below)
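As an illustration of that composition style, here is a small sketch using the open-source InfluxDB 2.0 Java client. The URL, token, org, bucket, measurement, and the query itself are placeholder assumptions, not artifacts from the talk.

import com.influxdb.client.InfluxDBClient;
import com.influxdb.client.InfluxDBClientFactory;
import com.influxdb.query.FluxRecord;
import com.influxdb.query.FluxTable;

import java.util.List;

/**
 * Illustrative sketch: run a Flux pipeline (filter -> window -> aggregate)
 * from Java and print the resulting points.
 */
public class FluxQueryExample {
    public static void main(String[] args) {
        String flux =
            "from(bucket: \"kpi\")\n" +
            "  |> range(start: -1h)\n" +
            "  |> filter(fn: (r) => r._measurement == \"logins\")\n" +
            "  |> aggregateWindow(every: 1m, fn: mean)";

        InfluxDBClient client = InfluxDBClientFactory.create(
                "http://localhost:9999", "my-token".toCharArray(), "my-org");
        List<FluxTable> tables = client.getQueryApi().query(flux);
        for (FluxTable table : tables) {
            for (FluxRecord record : table.getRecords()) {
                System.out.println(record.getTime() + " -> " + record.getValue());
            }
        }
        client.close();
    }
}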
41. • Let’s check the relations between them
• Looks more like a stationary time series
• Easier to model
42. Random Walk
• Processes have a lot of random factors
• Random Walk modelling
• X(t) = X(t-1) + Er(t)
• Er(t) = X(t) - X(t-1)
• A stationary time series is very easy to model
• No need for statistical models
• Just a reservoir with variance (a rough sketch follows below)
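A rough Java sketch of the “reservoir with variance” idea, with assumed reservoir size and tolerance: difference the raw series to get Er(t) = X(t) - X(t-1), keep recent differences in a reservoir, and flag a point whose difference falls far outside the variance observed so far. This is an illustration, not the actual implementation.

import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative sketch of anomaly detection on the differenced (stationary) series. */
public class RandomWalkDetector {
    private final Deque<Double> reservoir = new ArrayDeque<>();
    private final int capacity = 500;   // assumed reservoir size
    private final double k = 4.0;       // assumed tolerance in standard deviations
    private Double previous;            // X(t-1)

    /** Feed X(t); returns true if Er(t) looks anomalous. */
    public boolean observe(double x) {
        if (previous == null) {
            previous = x;
            return false;
        }
        double diff = x - previous;     // Er(t): roughly stationary if X is a random walk
        previous = x;

        boolean anomalous = false;
        if (reservoir.size() >= capacity) {
            double mean = reservoir.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            double variance = reservoir.stream()
                    .mapToDouble(d -> (d - mean) * (d - mean)).average().orElse(0.0);
            double std = Math.sqrt(variance);
            anomalous = std > 0 && Math.abs(diff - mean) > k * std;
            reservoir.removeFirst();
        }
        reservoir.addLast(diff);
        return anomalous;
    }
}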
49. • It is all about semantics
• Datacenters, sites, services
• Graph topology based on time-series data (one possible approach is sketched below)
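One possible way to derive such a topology from time-series data (an assumption for illustration, not necessarily the talk’s actual algorithm) is to connect two nodes whose time-aligned KPI series correlate strongly over the same window, as in this Java sketch.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative sketch: add an undirected edge between two nodes (services,
 * sites, datacenters) when their aligned series are strongly correlated.
 */
public class TopologyBuilder {
    private final Map<String, Set<String>> graph = new HashMap<>();

    public void link(String a, String b, double[] seriesA, double[] seriesB, double minCorrelation) {
        if (Math.abs(correlation(seriesA, seriesB)) >= minCorrelation) {
            graph.computeIfAbsent(a, k -> new HashSet<>()).add(b);
            graph.computeIfAbsent(b, k -> new HashSet<>()).add(a);
        }
    }

    /** Pearson correlation of two equally long, time-aligned series. */
    private static double correlation(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n; meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - meanX) * (y[i] - meanY);
            varX += (x[i] - meanX) * (x[i] - meanX);
            varY += (y[i] - meanY) * (y[i] - meanY);
        }
        return cov / Math.sqrt(varX * varY);
    }

    public Map<String, Set<String>> edges() { return graph; }
}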
50. Timetrix
• As a lot of people from different companies are involved in it
• We decided to open-source the core engine
• Integrations that are specific to particular domains or companies can easily be added
• We plan to launch in Q3/Q4 2019
• The core engine is written in Java
• Great kudos to the bonitoo.io team for the great drivers