Building an Observability Platform in 389 Difficult Steps
Who am I?
David Worth
Sr. SRE and Engineering Manager at Strava
Previously Sr. Engineer and Engineering Manager at
DigitalOcean in Compute
Building an Observability
Platform in 383 Difficult Steps
As you may recall - I’ve talked about the Venn/Euler diagram of observability before:
[Diagram: overlapping domains of Logging, Distributed Tracing, Error Handling, and Instrumentation, whose intersections include Error Logs, Error Rates & Timing Information, Error Rates, and Call Traces and outcomes.]
Let’s get started!
Warning.
This won’t be easy.
This is an Investment in Capabilities.
Those capabilities will pay dividends
in operations, customer value, and
business continuity.
Observability Considerations
A brief survey.
Observability
Considerations
“Build, Buy, or Operate”
For each of these services there
are standard engineering
tradeoffs around building your
own, operating an OSS service,
and paying a 3rd party.
Observability
Considerations
“Stack”
The observability space is extremely polyglot - even in a fairly homogeneous ecosystem like Prometheus (Golang), the clients will be “stack” dependent.
Are you comfortable relying on tools written in a language your team may have limited familiarity with?
Observability
Considerations
Where to keep the data?
You have a few options for these
services:
Cloud Provider managed
3rd Party managed
On-Prem externally-managed
On-Prem self-managed
Observability
Considerations
Retention vs. GDPR / CCPA / ...
If you are bound by privacy compliance, ensure your retention of any controlled data (PII) is no longer than regulations allow.
If using a 3rd party - how do they ensure compliance?
Observability
Considerations
Data Recording vs. Regulations
Never record passwords, password hashes, CC#s, or CVVs during a transaction, in logs or otherwise, for PCI/HIPAA/etc.
If you use a 3rd party for logging, etc. and you have ever accidentally logged any of those things, how can you remediate that?
Do you have a Data Protection Officer (DPO)?
Your DPO can help you, and your engineers, navigate the complex requirements not just of observability platforms but also of what you provide to your customers, and to internal customers such as data analysts and business development teams.
Find one.
What are we actually talking about?
What even is an
“Observability
Platform”?
An “Observability Platform” is a set of shared tools your organization uses to understand the state of your system at any given time, with some historical context, to diagnose and improve the product.
Data Sources:
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions

Platform:
● Exception Handling
● Logging
● Metrics
● Tracing
● Specialized Domains: DBs, etc.
● APM

Sinks:
● Humans đŸ‘„: Engineers, Product Owners, Analysts, Finance Team
● Robots đŸ€–: Chat Bots, AIML, Alerting
Other Sources
● Remote Clients: Mobile (Native) Applications and Browsers
● Short-lived batch jobs (cron?)
● Long-lived but inconsistently run batch jobs (Spark!)
● Networking Devices (routers, switches, firewalls, load-balancers, etc.)
● IoT Devices
Let’s actually build some observability
tools!
Let’s start with what we all have ...
Let’s start with what you have (Data Sources):
Hosts / Containers / Functions
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions
each of which produces some or many
Errors / Logs / Metrics / API calls

Now let’s talk about what you can do with them:
Aggregate Errors / Logs / Metrics / API Call information
into
A Unified Observability Platform
Errors!
We do! We do have Errors!
panic: really bad error
goroutine 1 [running]:
main.main()
$GOPATH/src/github.com/daveworth/foobar/main.go:14
+0x7b
exit status 2
Exception Handling Pipeline
An Application raises an Exception đŸ’„ together with Context:
● Inputs (Query Parameters)
● Request ID *
● Environment Variables
● Stack Trace
which is reported through an Exception Handler Client to an Exception Handler Service.
Exception Handling Services
Sentry
Airbrake
Rollbar
Managed or On-Premise
Bugsnag
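As a concrete illustration, here is a minimal sketch of reporting an exception with request context to a hosted exception handling service, using Sentry’s Go SDK (one of the services listed above); the DSN, request ID, and query parameters are placeholders, and the same pattern applies to the other services.

package main

import (
	"errors"
	"log"
	"time"

	"github.com/getsentry/sentry-go"
)

func main() {
	// Point the client at your exception handling service (placeholder DSN).
	if err := sentry.Init(sentry.ClientOptions{Dsn: "https://examplePublicKey@o0.ingest.sentry.io/0"}); err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
	// Make sure buffered events are delivered before the process exits.
	defer sentry.Flush(2 * time.Second)

	// Attach the context from the slide: inputs, request ID, etc.
	sentry.WithScope(func(scope *sentry.Scope) {
		scope.SetTag("request_id", "1234")
		scope.SetExtra("query_params", map[string]string{"user_id": "42"})
		sentry.CaptureException(errors.New("really bad error"))
	})
}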
OK. We also have (lots of) Logs!
Aug 21 18:34:39 openvpn-access-server-sfo3 systemd[1]: Starting Daily apt
download activities...
Aug 21 18:34:39 openvpn-access-server-sfo3 systemd[1]: Started Daily apt download
activities.
Aug 21 19:17:01 openvpn-access-server-sfo3 CRON[21214]: (root) CMD ( cd / &&
run-parts --report /etc/cron.hourly)
Aug 21 20:17:01 openvpn-access-server-sfo3 CRON[21309]: (root) CMD ( cd / &&
run-parts --report /etc/cron.hourly)
Aug 21 21:17:01 openvpn-access-server-sfo3 CRON[21352]: (root) CMD ( cd / &&
(Overly Simplified) Logging Pipeline Overview:
Source(s) → Log Collector → UI / Visualization → đŸ‘„ / đŸ€–
(A More Realistic) Logging Pipeline Overview:
Source → Filter → Log Collector → Broker → Log Aggregator → Indexer and Query → UI / Visualization → đŸ‘„ / đŸ€– / đŸ•”â€â™€
(with Ad-Hoc Stream Queries against the Broker)
Log Sources:
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions

Log Collectors → Brokers (Redis, Kafka, Kinesis) → Log Aggregators → Storage / Access (ElasticSearch, Loki)
Collectors - can “push” to either Brokers or Log Storage:
● Logspout
● FluentD / FluentBit
● Filebeat
● Promtail
● rsyslog

Log Collectors often are Log Aggregators.

Aggregators - pull from Brokers or systems and push to Log Storage:
● Logstash
● FluentD / FluentBit
● Promtail
To get the most out of a (Centralized) Logging Platform you need to structure your logs in such a way that they can be best consumed by your Sinks.
A standard format is MITRE’s Common Event Expression (CEE), represented as JSON. JSON is well supported by essentially every programming language and has the advantage of being both human and robot parsable. Emitting Structured Logs means humans, programs written by humans, and Centralized Logging Platforms can consume logs on equal footing.
Not all 3rd party applications you run in your ecosystem may support your logging format - you may have to ingest suboptimal logs or write transformers for them.
Logging Aside - Wire Format
{
"level": "error",
"ts": 1598044449.8620532,
"caller": "zappings/main.go:47",
"msg": "This is an ERROR message",
...
}
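For reference, a minimal sketch of emitting structured JSON logs like the snippet above with Uber’s zap library; the fields beyond level/ts/caller/msg are illustrative.

package main

import "go.uber.org/zap"

func main() {
	// zap.NewProduction emits JSON with level, ts, caller, and msg fields,
	// matching the wire format shown above.
	logger, err := zap.NewProduction()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	// Structured fields stay machine-parsable for the platform and readable by humans.
	logger.Error("This is an ERROR message",
		zap.String("request_id", "1234"),
		zap.Int("attempt", 3),
	)
}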
Log Storage and Usage
Where are the logs actually going?
Storage / Access: ElasticSearch, Loki
UI / Visualization: Kibana, Grafana
Sinks:
● Humans đŸ‘„: Engineers, Product Owners, Analysts, Finance Team
● Robots đŸ€–: Chat Bots, AIML, Alerting
● Legal đŸ•”â€â™€: Audit
Log Processing and Collection
How do we get (logs) there?
Log Processing Infrastructure
Log Collectors → Brokers (Redis, Kafka, Kinesis) → Log Aggregators → Storage / Access (ElasticSearch, Loki)
Log Collector Deployment Patterns - Hosts
On Bare Metal + Applications or Services, VMs + Applications or Services, and Container Orchestration hosts, a Log Collector runs as a service locally on the host and ships logs to the Log Processing Infrastructure.
Log Collector Deployment Patterns - Containers
Under Container Orchestration, each Container + Application runs with a Logging Sidecar, or a Log Collector runs as a service locally on the node; either way, logs flow to the Log Processing Infrastructure.
Log Collector Deployment Patterns - Functions
Serverless Functions emit to the Cloud Provider’s native logging; a Log Collector running as a separate service forwards those logs to the Log Processing Infrastructure.
OK 
 What about Metrics?
# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045
# A histogram, which has a pretty complex representation
in the text format:
# HELP http_request_duration_seconds A histogram of the
request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
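A minimal sketch of producing metrics like these from a Go service with the official Prometheus client library; the metric name and buckets follow the example above, while the handler, the simulated work, and the port are arbitrary.

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A histogram "instrument" for request durations, exposed in the text format shown above.
var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Help:    "A histogram of the request duration.",
	Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1},
})

func main() {
	prometheus.MustRegister(requestDuration)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		time.Sleep(20 * time.Millisecond) // stand-in for real work
		requestDuration.Observe(time.Since(start).Seconds())
		w.Write([]byte("ok"))
	})

	// The collector (e.g. a Prometheus server) scrapes this endpoint.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}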
Application Metrics “Instruments”: “Timers”, Counters, Gauges
What the Source instruments:
● Database Accesses
● External API Calls
● RPCs
● Critical Code Paths
● Runtime Metrics
● Host Metrics
(Overly Simplified) Metrics Pipeline Overview:
Source (“Instruments”: “Timers”, Counters, Gauges) → Metrics Collector → UI / Visualization → đŸ‘„ / đŸ€–
(A More Realistic) Metrics Pipeline Overview:
Source (“Instruments”: “Timers”, Counters, Gauges) → Metrics Collector → Metrics Aggregator → Long-Term Metrics Storage → UI / Visualization → đŸ‘„ / đŸ€–
Metrics Sources:
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions

Metrics Collectors: StatsD, Prometheus Push-Gateway, Prometheus
Metrics Aggregators / Long Term Storage: Graphite, Thanos, Cortex
Query / UI: Graphite, Grafana, Prometheus
Metrics Storage and Usage
Where are we going?
Query / UI: Graphite, Grafana, Prometheus
Metrics Aggregators & Long Term Storage: Graphite, Thanos, Cortex
Sinks:
● Humans đŸ‘„: Engineers, Product Owners, Analysts, Finance Team
● Robots đŸ€–: Chat Bots, AIML, Alerting
Metric Ingest and Collection
How do we get (metrics) there?
Metric Ingest Infrastructure
Metrics Collectors (StatsD, Prometheus Push-Gateway, Prometheus) → Metrics Aggregators / Long Term Storage (Graphite, Thanos, Cortex)
Metrics Collector Deployment Patterns - Hosts
On Bare Metal + Applications or Services, VMs + Applications or Services, and Container Orchestration hosts, metrics collectors and exporters run locally on the host and feed the Metrics Ingest Infrastructure.
Metrics Collector Deployment Patterns - Containers
Under Container Orchestration, each Containerized Application (Service) runs with metrics collector and exporter Sidecar(s); Service Discovery lets the Metrics Ingest Infrastructure find and scrape them.
Metrics Collector Deployment Patterns - Functions
Serverless Functions emit Cloud Provider native metrics; a Metrics Collector running as a separate service ingests those native metrics into the Metric Ingest Infrastructure.
But is it a
“Platform”?
It is only a “Platform” when every aspect of your business, from tactical engineering metrics to business metrics, is enabled “for free”, i.e. your team does not have to remember to integrate with the platform.
This is doubly true if you are talking
about “Tracing”
You had this...
Data Sources:
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions
Now you have this 

● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions
Whoa! WAY too many Data Sources
You need this ...
Data Sources:
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions

Platform:
● Exception Handling
● Logging
● Metrics
● Tracing

Sinks:
● Humans đŸ‘„: Engineers, Product Owners, Analysts, Finance Team
● Robots đŸ€–: Chat Bots, AIML, Alerting


 to serve them!
And you build this by ...
Choosing a unified set of platform tools for each domain (Exception Handling / Logging / Metrics / etc.) and building curated libraries that all of your applications consume to ensure they integrate into your platform.
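One way to read “curated libraries”: a hypothetical internal package (here called o11y; the name and scope are made up for illustration) that every service imports, so logging, and eventually metrics and tracing, come pre-wired to the platform.

// Package o11y is a hypothetical curated observability library: every service
// calls o11y.Init once and gets logging wired to the shared platform with
// consistent fields. A fuller version would also register metrics and set up
// trace propagation in the same call.
package o11y

import "go.uber.org/zap"

// Init returns a structured logger pre-tagged with the service name so every
// log line is attributable across the platform.
func Init(service string) (*zap.Logger, error) {
	logger, err := zap.NewProduction()
	if err != nil {
		return nil, err
	}
	return logger.With(zap.String("service", service)), nil
}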
And there just
isn’t one “best”
answer
I have my favorites

 but the answer is you still have
homework to do. I’ve named a
few of my favorites during this
talk - maybe they help?
Let’s briefly talk about tracing...
and why it is a great “forcing function” for
building a true Platform
Tracing only really works when 

 literally every system in your entire ecosystem integrates with it. Every blind spot is magnified in tracing.
A Distributed Tracing Primer
[Diagram: a request (Request - ID: 1234) recorded from Start Time to End Time + Success/Failure, with nested spans: Cache Miss, Service Call (ID: 1234), Query Wrapper, Database Query, DB Query, Hard Calculation Coordinator, Calculation Pipelines, and Render HTML.]
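As a sketch of what spans like these look like in code, here is a minimal example using the OpenTelemetry Go API; the span names, attributes, and error are illustrative, and it assumes a tracer provider and exporter have been configured elsewhere (otherwise the global tracer is a no-op).

package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

func handleRequest(ctx context.Context) {
	tracer := otel.Tracer("example-service")

	// Root span for the incoming request, tagged with the request ID.
	ctx, span := tracer.Start(ctx, "handle-request")
	span.SetAttributes(attribute.String("request.id", "1234"))
	defer span.End()

	// Child span for a downstream call; nesting mirrors the call tree in the diagram.
	_, dbSpan := tracer.Start(ctx, "database-query")
	if err := errors.New("cache miss"); err != nil {
		dbSpan.RecordError(err)
		dbSpan.SetStatus(codes.Error, err.Error())
	}
	dbSpan.End()
}

func main() {
	handleRequest(context.Background())
}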
(Distributed) Tracing Pipeline Overview:
Request-ID-Aware Source(s) → Trace Collector → Sampling → Trace Storage → UI / Visualization → đŸ‘„
Tracing Sources:
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions

Trace Collectors / Trace Storage / UI / Visualizations:
These Spaces Intentionally Left Blank.
So why is Tracing
the “forcing
function”?
Every single time you have a Trace with a “blind spot” it creates red herrings and diverts attention from the real problem. So tracing lets you fix it by touching all of your systems and eliminating those blind spots by ...
Building an Observability Platform in 389 Difficult Steps
Eliminate those blind spots by ...

 building good and standard Exception Handling libraries.

 logging uniformly, with structure, using those libraries.

 exposing good instrumentation primitives in that library.

 adding tracing primitives via that library.
Eliminate those blind spots so 


 all of your engineers “get observability for free”

 and they can get more specialized observability with very little work.
Take what you have, process it and centralize it. Ensure that everyone
has the same tools and the same systems to consume them. Solve the
problems you have today and prepare to solve the ones you will have
soon.
Build a platform.
383 Difficult Steps Distilled
And if you’ve done that - you have an Observability Platform!
The End.
Thank you!
Questions? đŸ€”đŸ€”đŸ€”