Using Time Series for Full Observability of a SaaS Platform
- 1. Using Time Series for Full Observability of a SaaS Platform
Aleksandr Tavgen, Playtech; co-founder, Timetrix
- 2. About me
More than 19 years of professional experience
FinTech and Data Science background
From Developer to SRE Engineer
Solved and automated problems in Operations at scale
- 4. Overall problem
• A zoo of monitoring solutions in large enterprises, often distributed around the world
• M&A transactions or distributed teams make central management impossible or ineffective
• For small enterprises or startups, the key question is finding the best solution
• A lot of companies have failed this way
• A lot of anti-patterns have developed
- 5. Managing a Zoo
• A lot of independent teams
• Everyone has some sort of solution
• It is hard to get an overall picture of operations
• It is hard to orchestrate and make changes
- 7. Common Anti-patterns
It is tempting to keep everything recorded just in case
The number of metrics in monitoring grows exponentially
Nobody understands such a huge bunch of metrics
Engineering complexity grows as well
- 8. Uber case: 9 billion metrics / 1,000+ instances for the monitoring solution
- 9. Dashboards problem
• A proliferating number of metrics leads to unusable dashboards
• How can one observe 9 billion metrics?
• Quite often it looks like spaghetti
• It is possible to pursue this anti-pattern for approx. 1.5 years
• GitLab's dashboards are a good example
- 10. IF YOU NEED 9 BILLION METRICS, YOU ARE PROBABLY WRONG
- 15. Actually not
• Dashboards are very useful when you know where and when to watch
• Our brain can recognize and process visual patterns more effectively
• But only when you know what you are looking for, and when
- 16. Queries vs. Dashboards
Querying your data requires more cognitive effort than a quick look at dashboards
Metrics are a low-resolution view of your system's dynamics
Metrics should not replace logs
It is not necessary to have millions of them
- 18. COST OF AN HOUR OF DOWNTIME, 2017-2018
https://www.statista.com/statistics/753938/worldwide-enterprise-server-hourly-downtime-cost/
- 19. Causes of outage
• Change
• Network Failure
• Bug
• Human Factor
• Hardware Failure
• Unspecified
- 22. What is it all about?
• Any reduction of the outage/incident timeline results in a significant positive financial impact
• It is about credibility as well
• And your DevOps teams feel less pain and toil along the way
- 24. Metrics
• It is almost impossible to operate on billions of metrics
• Even under normal system behavior there will always be outliers in real production data
• Therefore, not all outliers should be flagged as anomalous incidents
• The Etsy Kale project is a case in point
- 26. Paradigm Shift
• The main paradigm shift comes from the fields of infrastructure and architecture
• Cloud architectures, microservices, Kubernetes, and immutable infrastructure have changed the way companies build and operate systems
• Virtualization, containerization, and orchestration frameworks abstract away the infrastructure level
• Moving towards abstraction from the underlying hardware and networking means that we must focus on ensuring that our applications work as intended in the context of our business processes
- 27. KPI monitoring
• KPI metrics are related to core business operations
• These could be logins, active sessions, or any domain-specific operations
• Heavily seasonal
• Static thresholds can't help here
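To make the last two bullets concrete, here is a minimal Java sketch (not from the talk) of checking a KPI against a seasonal baseline instead of a static threshold. The hour-of-week slot scheme, the warm-up count, the tolerance, and all names and numbers are hypothetical.

import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: a per-slot baseline for a seasonal KPI such as logins per minute.
public class SeasonalBaseline {
    // running mean and sample count per slot (hour of week: 0..167)
    private final Map<Integer, double[]> slots = new HashMap<>();

    // Update the running mean for this slot with an observed KPI value.
    public void observe(int hourOfWeek, double value) {
        double[] s = slots.computeIfAbsent(hourOfWeek, k -> new double[2]);
        s[1] += 1;                      // count
        s[0] += (value - s[0]) / s[1];  // incremental mean
    }

    // A static threshold ignores seasonality; this check is relative to the slot's own history.
    public boolean isSuspicious(int hourOfWeek, double value, double tolerance) {
        double[] s = slots.get(hourOfWeek);
        if (s == null || s[1] < 10) return false;  // not enough history yet
        return Math.abs(value - s[0]) > tolerance * s[0];
    }

    public static void main(String[] args) {
        SeasonalBaseline b = new SeasonalBaseline();
        for (int week = 0; week < 12; week++) b.observe(10, 5000 + week * 10);  // Mondays, 10:00
        System.out.println(b.isSuspicious(10, 1200, 0.5));  // true: far below the Monday-morning norm
    }
}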
- 56. Overwhelming results
• Red area: Customer Detection
• Blue area: Own Observation (toil)
• Orange line: Central Grafana introduced
• Green line: ML-based solution in prod
Customer Detection has dropped to low percentage points
- 57. General view
• Finding anomalies on metrics
• Finding regularities on a higher level
• Combining events from organization internals (changes/deployments)
• Stream processing architectures
- 58. Why do we need time-series storage?
• We have unpredictable delays on the network
• Operating worldwide is a problem
• CAP theorem
• You can receive signals from the past
• But you should look into the future too
• How long should this window into the future be?
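One way to read the "window into the future" question is sketched below in Java: keep each window open for some lateness allowance past its nominal end, so that signals stamped in the past but delivered late can still be admitted, and accept that anything later is lost. This is an illustrative assumption about the architecture, not the talk's implementation; all names are invented.

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a time window that tolerates late-arriving signals.
public class LateTolerantWindow {
    private final Instant start;
    private final Duration length;
    private final Duration lateness;  // how long we keep looking "into the future"
    private final List<Double> values = new ArrayList<>();

    public LateTolerantWindow(Instant start, Duration length, Duration lateness) {
        this.start = start;
        this.length = length;
        this.lateness = lateness;
    }

    // Accept a signal stamped in the past as long as the window has not yet been finalized.
    public boolean offer(Instant eventTime, double value, Instant now) {
        boolean inWindow = !eventTime.isBefore(start) && eventTime.isBefore(start.plus(length));
        boolean stillOpen = now.isBefore(start.plus(length).plus(lateness));
        if (inWindow && stillOpen) {
            values.add(value);
            return true;
        }
        return false;  // too late: dropped, which this design tolerates
    }
}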
- 59. Why not Kafka and all that classical streaming?
• Frameworks like Storm and Flink are oriented to tuples, not time-ordered events
• We do not want to process everything
• A lot of events are needed on demand
• It is ok to lose some signals in favor of performance
• And we still have signals from the past
- 60. Why Influx v2.0
• Flux
• Better isolation
• Central storage for metrics, events, traces
• Streaming paradigm
- 61. Taking a higher-level picture
• Finding anomalies on a lower level
• Tracing
• Event logs
• Finding regularities between them
• Building a topology
• We can call it AIOps as well
- 62. OpenTracing
• Tracing is a higher resolution of your system's dynamics
• Distributed tracing can show you unknown-unknowns
• It reduces the Investigation part of the Incident Timeline
• There is a good OSS implementation, Jaeger
• Influx v2.0 is a supported backend storage
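For context on instrumenting a service, a minimal sketch with the jaeger-client Java library; the service and operation names are invented. Pointing Jaeger at Influx v2.0 as its backend storage is a Jaeger backend configuration concern and does not appear in client code like this.

import io.jaegertracing.Configuration;
import io.opentracing.Span;
import io.opentracing.Tracer;

public class TracingExample {
    public static void main(String[] args) {
        // Sample every trace here; production would sample a fraction.
        Configuration.SamplerConfiguration sampler =
                new Configuration.SamplerConfiguration().withType("const").withParam(1);
        Configuration.ReporterConfiguration reporter =
                new Configuration.ReporterConfiguration().withLogSpans(true);

        Tracer tracer = new Configuration("checkout-service")
                .withSampler(sampler)
                .withReporter(reporter)
                .getTracer();

        Span span = tracer.buildSpan("process-payment").start();
        try {
            // ... the business logic being traced ...
        } finally {
            span.finish();
        }
    }
}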
- 63. Jaeger with Influx v2.0 as backend storage
• Real prod case
• Approx. 8,000 traces every minute
• Performance issue with a limitation on I/O connections
• Bursts of context switches at the kernel level
- 64. Impact on a particular execution flow
• DB query time is quite constant
• Processing time in the normal case: 1-3 ms
• After a process context switch: more than 40 ms
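A toy Java sketch of the detection idea on this slide: spans inside the normal 1-3 ms processing band pass, while a context-switched span at 40+ ms stands far outside it. The record type, the sample values, and the factor of 10 are illustrative assumptions (Java 16+ for records).

import java.util.List;

// Hypothetical sketch: flag spans whose processing time escapes the normal band.
public class SpanLatencyCheck {
    record SpanRecord(String operation, double durationMs) {}

    // Normal processing is 1-3 ms; a context-switched span shows up at more than 40 ms.
    static final double NORMAL_UPPER_MS = 3.0;

    static List<SpanRecord> outliers(List<SpanRecord> spans, double factor) {
        return spans.stream()
                .filter(s -> s.durationMs() > NORMAL_UPPER_MS * factor)
                .toList();
    }

    public static void main(String[] args) {
        List<SpanRecord> spans = List.of(
                new SpanRecord("db-query", 2.1),
                new SpanRecord("process", 1.8),
                new SpanRecord("process", 43.5));  // burst of kernel-level context switches
        System.out.println(outliers(spans, 10));   // prints only the 43.5 ms span
    }
}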
- 65. Flux
• Multi-source joining
• The same functional composition paradigm
• Easy to test a hypothesis
• You can combine metrics, event logs, and traces
• Data transformation based on conditions
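A hypothetical example of the multi-source joining bullet: a Flux script, held in a Java constant since the deck's core engine is Java, that joins a metric stream with a deployment-event stream so anomalies can be lined up against changes. Buckets, measurements, and the final condition are invented.

public class FluxJoinExample {
    // Hypothetical Flux: join metrics with deployment events on time.
    static final String FLUX_JOIN = String.join("\n",
            "metrics = from(bucket: \"metrics\")",
            "  |> range(start: -1h)",
            "  |> filter(fn: (r) => r._measurement == \"logins\")",
            "",
            "events = from(bucket: \"events\")",
            "  |> range(start: -1h)",
            "  |> filter(fn: (r) => r._measurement == \"deployments\")",
            "",
            "join(tables: {m: metrics, e: events}, on: [\"_time\"])",
            "  |> filter(fn: (r) => r._value_m > 100.0)");
}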
- 68. • Let's check relations between them
• Looks more like a stationary time series
• Easier to model
- 69. Random Walk
• Processes have a lot of random factors
• Random Walk modelling
• X(t) = X(t-1) + Er(t)
• Er(t) = X(t) - X(t-1)
• A stationary time series is very easy to model
• No need for statistical models
• Just a reservoir with variance
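A minimal sketch of the "reservoir with variance" idea under the slide's model: difference X(t) to recover Er(t), keep a running mean and variance of the differences (Welford's online update, chosen here for illustration; the talk does not name an algorithm), and flag steps that land more than k sigmas out. The warm-up count of 30 is an assumption.

// Sketch: detect anomalous steps in a random-walk series X(t) = X(t-1) + Er(t).
public class DifferenceReservoir {
    private double mean = 0, m2 = 0;
    private long n = 0;
    private Double last = null;

    // Feed X(t); returns true when Er(t) = X(t) - X(t-1) is a more-than-k-sigma outlier.
    public boolean offer(double x, double k) {
        if (last == null) { last = x; return false; }
        double er = x - last;  // Er(t), the stationary part
        last = x;
        boolean anomalous = n > 30 && Math.abs(er - mean) > k * Math.sqrt(m2 / (n - 1));
        // Welford's online update of the mean and variance of the differences
        n++;
        double delta = er - mean;
        mean += delta / n;
        m2 += delta * (er - mean);
        return anomalous;
    }
}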
- 71. On a larger scale
• Simple to model
• Cheap in-memory reservoir models
• Very fast
- 72. Security case
• The failed-logins ratio is related to overall statistical activity
• People make typos
• Simple thresholds do not work
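A toy Java illustration of why a ratio beats a raw threshold for failed logins: typo-driven failures scale with overall activity, so the failure ratio stays flat during a busy hour but spikes under a credential attack. All numbers, including the 10% cut-off, are invented.

// Hypothetical sketch: alert on the failed-login ratio, not a raw count.
public class FailedLoginRatio {
    // A fixed count threshold would fire on the busy hour and could miss the quiet one.
    static boolean suspicious(long failed, long total) {
        if (total == 0) return false;
        double ratio = (double) failed / total;
        return ratio > 0.10;  // placeholder; in practice compare against a learned baseline
    }

    public static void main(String[] args) {
        System.out.println(suspicious(1_600, 50_000));  // false: 3.2% is normal typo noise
        System.out.println(suspicious(900, 2_000));     // true: 45% suggests an attack
    }
}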
- 76. • It is all about semantics
• Datacenters, sites, services
• Graph topology based on time-series data
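A hypothetical sketch of deriving a graph topology from time-series data: link two components (datacenters, sites, services) when their series are strongly correlated over aligned samples. Component names, sample series, and the 0.9 threshold are invented for illustration.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: build a service topology graph from correlated time series.
public class TopologyFromSeries {
    static double correlation(double[] a, double[] b) {
        int n = a.length;
        double ma = 0, mb = 0;
        for (int i = 0; i < n; i++) { ma += a[i] / n; mb += b[i] / n; }
        double cov = 0, va = 0, vb = 0;
        for (int i = 0; i < n; i++) {
            cov += (a[i] - ma) * (b[i] - mb);
            va += (a[i] - ma) * (a[i] - ma);
            vb += (b[i] - mb) * (b[i] - mb);
        }
        return cov / Math.sqrt(va * vb);
    }

    public static void main(String[] args) {
        Map<String, double[]> series = Map.of(
                "frontend", new double[]{10, 12, 30, 11, 28},
                "auth",     new double[]{ 9, 11, 29, 10, 27},
                "billing",  new double[]{ 5,  5,  4,  6,  5});

        Map<String, Set<String>> edges = new HashMap<>();
        for (String s : series.keySet())
            for (String t : series.keySet())
                if (s.compareTo(t) < 0 && correlation(series.get(s), series.get(t)) > 0.9)
                    edges.computeIfAbsent(s, k -> new HashSet<>()).add(t);
        System.out.println(edges);  // {auth=[frontend]}: only co-moving components are linked
    }
}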
- 77. Timetrix
• A lot of people from different companies are involved in it
• We decided to open-source the core engine
• Integrations specific to a company's domain can easily be added
• We plan to launch in Q3/Q4 2019
• The core engine is written in Java
• Great kudos to the bonitoo.io team for the great drivers