Using Time Series for Full Observability of a SaaS Platform
- 1. Using Time Series for Full Observability of a SaaS Platform
Aleksandr Tavgen, Playtech; co-founder, Timetrix
- 2. About me
More than 19 years of professional experience
FinTech and Data Science background
From Developer to SRE Engineer
Solved and automated problems in Operations at scale
- 4. Overall problem
• A zoo of monitoring solutions in large enterprises, often distributed around the world
• M&A transactions or distributed teams make central management impossible or ineffective
• For small enterprises or startups, the key question is finding the best solution
• A lot of companies have failed this way
• A lot of anti-patterns have developed
- 5. Managing a Zoo
• A lot of independent teams
• Everyone has some sort of solution
• It is hard to get an overall picture of operations
• It is hard to orchestrate and make changes
- 7. Common Anti-patterns
It is tempting to keep everything recorded just in case
The number of metrics in monitoring grows exponentially
Nobody understands such a huge bunch of metrics
Engineering complexity grows as well
- 8. Uber case: 9 billion metrics / 1,000+ instances for the monitoring solution
- 9. Dashboards problem
• A proliferating number of metrics leads to unusable dashboards
• How can one observe 9 billion metrics?
• Quite often it looks like spaghetti
• It is possible to pursue this anti-pattern for approx. 1.5 years
• GitLab's dashboards are a good example
- 10. IF YOU NEED 9 BILLION METRICS, YOU ARE PROBABLY WRONG
- 15. Actually not
• Dashboards are very useful when you know where and when to watch
• Our brain can recognize and process visual patterns more effectively
• But only when you know what you are looking for, and when
- 16. Queries vs. Dashboards
Querying your data requires more cognitive effort than a quick look at dashboards
Metrics are a low-resolution view of your system's dynamics
Metrics should not replace logs
It is not necessary to have millions of them
- 18. COST OF AN HOUR OF DOWNTIME, 2017-2018
https://www.statista.com/statistics/753938/worldwide-enterprise-server-hourly-downtime-cost/
- 19. Causes of outage
• Change
• Network Failure
• Bug
• Human Factor
• Hardware Failure
• Unspecified
- 22. What is it all about?
• Any reduction of the outage/incident timeline results in a significant positive financial impact
• It is about credibility as well
• And your DevOps teams feel less pain and toil along the way
- 24. Metrics
• It is almost impossible to operate on billions of metrics
• Even under normal system behavior there will always be outliers in real production data
• Therefore, not all outliers should be flagged as anomalous incidents
• The Etsy Kale project is a case in point
- 26. Paradigm Shift
• The main paradigm shift comes from the fields of infrastructure and architecture
• Cloud architectures, microservices, Kubernetes, and immutable infrastructure have changed the way companies build and operate systems
• Virtualization, containerization, and orchestration frameworks abstract away the infrastructure level
• Moving towards abstraction from the underlying hardware and networking means that we must focus on ensuring that our applications work as intended in the context of our business processes
- 27. KPI monitoring
• KPI metrics are related to core business operations
• These could be logins, active sessions, or any domain-specific operations
• Heavily seasonal
• Static thresholds can't help here
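To make the last two bullets concrete, here is a minimal Java sketch (not from the talk) of checking a KPI against a seasonal baseline instead of a static threshold. The hour-of-week slot scheme, the warm-up count, the tolerance, and all names and numbers are hypothetical.

import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: a per-slot baseline for a seasonal KPI such as logins per minute.
public class SeasonalBaseline {
    // running mean and sample count per slot (hour of week: 0..167)
    private final Map<Integer, double[]> slots = new HashMap<>();

    // Update the running mean for this slot with an observed KPI value.
    public void observe(int hourOfWeek, double value) {
        double[] s = slots.computeIfAbsent(hourOfWeek, k -> new double[2]);
        s[1] += 1;                      // count
        s[0] += (value - s[0]) / s[1];  // incremental mean
    }

    // A static threshold ignores seasonality; this check is relative to the slot's own history.
    public boolean isSuspicious(int hourOfWeek, double value, double tolerance) {
        double[] s = slots.get(hourOfWeek);
        if (s == null || s[1] < 10) return false;  // not enough history yet
        return Math.abs(value - s[0]) > tolerance * s[0];
    }

    public static void main(String[] args) {
        SeasonalBaseline b = new SeasonalBaseline();
        for (int week = 0; week < 12; week++) b.observe(10, 5000 + week * 10);  // Mondays, 10:00
        System.out.println(b.isSuspicious(10, 1200, 0.5));  // true: far below the Monday-morning norm
    }
}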
- 56. Overwhelming results
• Red area: Customer Detection
• Blue area: Own Observation (toil)
• Orange line: Central Grafana introduced
• Green line: ML-based solution in prod
Customer Detection has dropped to low percentage points
- 57. General view
• Finding anomalies on metrics
• Finding regularities on a higher level
• Combining events from organization internals (changes/deployments)
• Stream processing architectures
- 58. Why do we need time-series storage?
• We have unpredictable delays on the network
• Operating worldwide is a problem
• CAP theorem
• You can receive signals from the past
• But you should look into the future too
• How long should this window into the future be?
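One way to read the "window into the future" question is sketched below in Java: keep each window open for some lateness allowance past its nominal end, so that signals stamped in the past but delivered late can still be admitted, and accept that anything later is lost. This is an illustrative assumption about the architecture, not the talk's implementation; all names are invented.

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a time window that tolerates late-arriving signals.
public class LateTolerantWindow {
    private final Instant start;
    private final Duration length;
    private final Duration lateness;  // how long we keep looking "into the future"
    private final List<Double> values = new ArrayList<>();

    public LateTolerantWindow(Instant start, Duration length, Duration lateness) {
        this.start = start;
        this.length = length;
        this.lateness = lateness;
    }

    // Accept a signal stamped in the past as long as the window has not yet been finalized.
    public boolean offer(Instant eventTime, double value, Instant now) {
        boolean inWindow = !eventTime.isBefore(start) && eventTime.isBefore(start.plus(length));
        boolean stillOpen = now.isBefore(start.plus(length).plus(lateness));
        if (inWindow && stillOpen) {
            values.add(value);
            return true;
        }
        return false;  // too late: dropped, which this design tolerates
    }
}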
- 59. Why not Kafka and all that classical streaming?
• Frameworks like Storm and Flink are oriented to tuples, not time-ordered events
• We do not want to process everything
• A lot of events are needed on demand
• It is ok to lose some signals in favor of performance
• And we still have signals from the past
- 60. Why Influx v2.0
• Flux
• Better isolation
• Central storage for metrics, events, traces
• Streaming paradigm
- 61. Taking a higher-level picture
• Finding anomalies on a lower level
• Tracing
• Event logs
• Finding regularities between them
• Building a topology
• We can call it AIOps as well
- 62. OpenTracing
• Tracing is a higher resolution of your system's dynamics
• Distributed tracing can show you unknown-unknowns
• It reduces the Investigation part of the Incident Timeline
• There is a good OSS implementation, Jaeger
• Influx v2.0 is a supported backend storage
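For context on instrumenting a service, a minimal sketch with the jaeger-client Java library; the service and operation names are invented. Pointing Jaeger at Influx v2.0 as its backend storage is a Jaeger backend configuration concern and does not appear in client code like this.

import io.jaegertracing.Configuration;
import io.opentracing.Span;
import io.opentracing.Tracer;

public class TracingExample {
    public static void main(String[] args) {
        // Sample every trace here; production would sample a fraction.
        Configuration.SamplerConfiguration sampler =
                new Configuration.SamplerConfiguration().withType("const").withParam(1);
        Configuration.ReporterConfiguration reporter =
                new Configuration.ReporterConfiguration().withLogSpans(true);

        Tracer tracer = new Configuration("checkout-service")
                .withSampler(sampler)
                .withReporter(reporter)
                .getTracer();

        Span span = tracer.buildSpan("process-payment").start();
        try {
            // ... the business logic being traced ...
        } finally {
            span.finish();
        }
    }
}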
- 63. Jaeger with Influx v2.0 as backend storage
• Real prod case
• Approx. 8,000 traces every minute
• Performance issue with a limitation on I/O connections
• Bursts of context switches at the kernel level
- 64. Impact on a particular execution flow
• DB query time is quite constant
• Processing time in the normal case: 1-3 ms
• After a process context switch: more than 40 ms
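A toy Java sketch of the detection idea on this slide: spans inside the normal 1-3 ms processing band pass, while a context-switched span at 40+ ms stands far outside it. The record type, the sample values, and the factor of 10 are illustrative assumptions (Java 16+ for records).

import java.util.List;

// Hypothetical sketch: flag spans whose processing time escapes the normal band.
public class SpanLatencyCheck {
    record SpanRecord(String operation, double durationMs) {}

    // Normal processing is 1-3 ms; a context-switched span shows up at more than 40 ms.
    static final double NORMAL_UPPER_MS = 3.0;

    static List<SpanRecord> outliers(List<SpanRecord> spans, double factor) {
        return spans.stream()
                .filter(s -> s.durationMs() > NORMAL_UPPER_MS * factor)
                .toList();
    }

    public static void main(String[] args) {
        List<SpanRecord> spans = List.of(
                new SpanRecord("db-query", 2.1),
                new SpanRecord("process", 1.8),
                new SpanRecord("process", 43.5));  // burst of kernel-level context switches
        System.out.println(outliers(spans, 10));   // prints only the 43.5 ms span
    }
}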
- 65. Flux
• Multi-source joining
• The same functional composition paradigm
• Easy to test a hypothesis
• You can combine metrics, event logs, and traces
• Data transformation based on conditions
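A hypothetical example of the multi-source joining bullet: a Flux script, held in a Java constant since the deck's core engine is Java, that joins a metric stream with a deployment-event stream so anomalies can be lined up against changes. Buckets, measurements, and the final condition are invented.

public class FluxJoinExample {
    // Hypothetical Flux: join metrics with deployment events on time.
    static final String FLUX_JOIN = String.join("\n",
            "metrics = from(bucket: \"metrics\")",
            "  |> range(start: -1h)",
            "  |> filter(fn: (r) => r._measurement == \"logins\")",
            "",
            "events = from(bucket: \"events\")",
            "  |> range(start: -1h)",
            "  |> filter(fn: (r) => r._measurement == \"deployments\")",
            "",
            "join(tables: {m: metrics, e: events}, on: [\"_time\"])",
            "  |> filter(fn: (r) => r._value_m > 100.0)");
}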
- 68. • Let's check relations between them
• Looks more like a stationary time series
• Easier to model
- 69. Random Walk
• Processes have a lot of random factors
• Random Walk modelling
• X(t) = X(t-1) + Er(t)
• Er(t) = X(t) - X(t-1)
• A stationary time series is very easy to model
• No need for statistical models
• Just a reservoir with variance
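A minimal sketch of the "reservoir with variance" idea under the slide's model: difference X(t) to recover Er(t), keep a running mean and variance of the differences (Welford's online update, chosen here for illustration; the talk does not name an algorithm), and flag steps that land more than k sigmas out. The warm-up count of 30 is an assumption.

// Sketch: detect anomalous steps in a random-walk series X(t) = X(t-1) + Er(t).
public class DifferenceReservoir {
    private double mean = 0, m2 = 0;
    private long n = 0;
    private Double last = null;

    // Feed X(t); returns true when Er(t) = X(t) - X(t-1) is a more-than-k-sigma outlier.
    public boolean offer(double x, double k) {
        if (last == null) { last = x; return false; }
        double er = x - last;  // Er(t), the stationary part
        last = x;
        boolean anomalous = n > 30 && Math.abs(er - mean) > k * Math.sqrt(m2 / (n - 1));
        // Welford's online update of the mean and variance of the differences
        n++;
        double delta = er - mean;
        mean += delta / n;
        m2 += delta * (er - mean);
        return anomalous;
    }
}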
- 71. On a larger scale
• Simple to model
• Cheap in-memory reservoir models
• Very fast
- 72. Security case
• The failed-logins ratio is related to overall statistical activity
• People make typos
• Simple thresholds do not work
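A toy Java illustration of why a ratio beats a raw threshold for failed logins: typo-driven failures scale with overall activity, so the failure ratio stays flat during a busy hour but spikes under a credential attack. All numbers, including the 10% cut-off, are invented.

// Hypothetical sketch: alert on the failed-login ratio, not a raw count.
public class FailedLoginRatio {
    // A fixed count threshold would fire on the busy hour and could miss the quiet one.
    static boolean suspicious(long failed, long total) {
        if (total == 0) return false;
        double ratio = (double) failed / total;
        return ratio > 0.10;  // placeholder; in practice compare against a learned baseline
    }

    public static void main(String[] args) {
        System.out.println(suspicious(1_600, 50_000));  // false: 3.2% is normal typo noise
        System.out.println(suspicious(900, 2_000));     // true: 45% suggests an attack
    }
}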
- 76. • It is all about semantics
• Datacenters, sites, services
• Graph topology based on time-series data
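A hypothetical sketch of deriving a graph topology from time-series data: link two components (datacenters, sites, services) when their series are strongly correlated over aligned samples. Component names, sample series, and the 0.9 threshold are invented for illustration.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: build a service topology graph from correlated time series.
public class TopologyFromSeries {
    static double correlation(double[] a, double[] b) {
        int n = a.length;
        double ma = 0, mb = 0;
        for (int i = 0; i < n; i++) { ma += a[i] / n; mb += b[i] / n; }
        double cov = 0, va = 0, vb = 0;
        for (int i = 0; i < n; i++) {
            cov += (a[i] - ma) * (b[i] - mb);
            va += (a[i] - ma) * (a[i] - ma);
            vb += (b[i] - mb) * (b[i] - mb);
        }
        return cov / Math.sqrt(va * vb);
    }

    public static void main(String[] args) {
        Map<String, double[]> series = Map.of(
                "frontend", new double[]{10, 12, 30, 11, 28},
                "auth",     new double[]{ 9, 11, 29, 10, 27},
                "billing",  new double[]{ 5,  5,  4,  6,  5});

        Map<String, Set<String>> edges = new HashMap<>();
        for (String s : series.keySet())
            for (String t : series.keySet())
                if (s.compareTo(t) < 0 && correlation(series.get(s), series.get(t)) > 0.9)
                    edges.computeIfAbsent(s, k -> new HashSet<>()).add(t);
        System.out.println(edges);  // {auth=[frontend]}: only co-moving components are linked
    }
}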
- 77. Timetrix
• A lot of people from different companies are involved in it
• We decided to open-source the core engine
• Integrations specific to a company's domain can easily be added
• We plan to launch in Q3/Q4 2019
• The core engine is written in Java
• Great kudos to the bonitoo.io team for the great drivers