Building an Observability Platform in 389 Difficult Steps
Who am I?
David Worth
Sr. SRE and Engineering Manager at Strava
Previously Sr. Engineer and Engineering Manager at
DigitalOcean in Compute
Building an Observability
Platform in 383 Difficult Steps
As you may recall - I’ve talked about the Venn/Euler diagram of observability before:
[Diagram: overlapping domains of Logging, Distributed Tracing, Error Handling, and Instrumentation, whose intersections include Error Logs, Error Rates & Timing Information, Error Rates, and Call Traces and outcomes.]
Let’s get started!
Warning.
This won’t be easy.
This is an Investment in Capabilities.
Those capabilities will pay dividends
in operations, customer value, and
business continuity.
Observability Considerations
A brief survey.
Observability
Considerations
“Build, Buy, or Operate”
For each of these services there
are standard engineering
tradeoffs around building your
own, operating an OSS service,
and paying a 3rd party.
Observability
Considerations
“Stack”
The observability space is extremely polyglot - even in a fairly homogeneous ecosystem like Prometheus (Golang), the clients will be “stack” dependent.
Are you comfortable relying on tools written in a language your team may have limited familiarity with?
Observability
Considerations
Where to keep the data?
You have a few options for these
services:
Cloud Provider managed
3rd Party managed
On-Prem externally-managed
On-Prem self-managed
Observability
Considerations
Retention vs. GDPR / CCPA / ...
If you are bound by privacy compliance, ensure your retention of any controlled data (PII) is no longer than regulations allow.
If using a 3rd party - how do they ensure compliance?
Observability
Considerations
Data Recording vs. Regulations
Never record passwords, password hashes, CC#s, or CVVs during a transaction, in logs or otherwise, for PCI/HIPAA/etc.
If you use a 3rd party for logging, etc. and you have ever accidentally logged any of those things, how can you remediate that?
Do you have a Data Protection Officer (DPO)?
Your DPO can help you, and your engineers, navigate the complex requirements not just of observability platforms but also of what you provide to your customers, and to internal customers such as data analysts and business development teams.
Find one.
What are we actually talking about?
What even is an
“Observability
Platform”?
An “Observability Platform” is a set of shared tools your organization uses to understand the state of your system at any given time, with some historical context, to diagnose and improve the product.
Data Sources:
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions

Platform:
● Exception Handling
● Logging
● Metrics
● Tracing
● Specialized Domains: DBs, etc.
● APM

Sinks:
● Humans đŸ‘„: Engineers, Product Owners, Analysts, Finance Team
● Robots đŸ€–: Chat Bots, AIML, Alerting
Other Sources
● Remote Clients: Mobile (Native) Applications and Browsers
● Short-lived batch jobs (cron?)
● Long-lived but inconsistently run batch jobs (Spark!)
● Networking Devices (routers, switches, firewalls, load-balancers, etc.)
● IoT Devices
Let’s actually build some observability
tools!
Let’s start with what we all have ...
Let’s start with what you have (Data Sources):
Hosts / Containers / Functions
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions
each of which produces some or many
Errors / Logs / Metrics / API calls

Now let’s talk about what you can do with them:
Aggregate Errors / Logs / Metrics / API Call information
into
A Unified Observability Platform
Errors!
We do! We do have Errors!
panic: really bad error
goroutine 1 [running]:
main.main()
$GOPATH/src/github.com/daveworth/foobar/main.go:14
+0x7b
exit status 2
Exception Handling Pipeline
An Application raises an Exception đŸ’„ together with Context:
● Inputs (Query Parameters)
● Request ID *
● Environment Variables
● Stack Trace
which is reported through an Exception Handler Client to an Exception Handler Service.
Exception Handling Services
Sentry
Airbrake
Rollbar
Managed or On-Premise
Bugsnag
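As a concrete illustration, here is a minimal sketch of reporting an exception with request context to a hosted exception handling service, using Sentry’s Go SDK (one of the services listed above); the DSN, request ID, and query parameters are placeholders, and the same pattern applies to the other services.

package main

import (
	"errors"
	"log"
	"time"

	"github.com/getsentry/sentry-go"
)

func main() {
	// Point the client at your exception handling service (placeholder DSN).
	if err := sentry.Init(sentry.ClientOptions{Dsn: "https://examplePublicKey@o0.ingest.sentry.io/0"}); err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
	// Make sure buffered events are delivered before the process exits.
	defer sentry.Flush(2 * time.Second)

	// Attach the context from the slide: inputs, request ID, etc.
	sentry.WithScope(func(scope *sentry.Scope) {
		scope.SetTag("request_id", "1234")
		scope.SetExtra("query_params", map[string]string{"user_id": "42"})
		sentry.CaptureException(errors.New("really bad error"))
	})
}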
OK. We also have (lots of) Logs!
Aug 21 18:34:39 openvpn-access-server-sfo3 systemd[1]: Starting Daily apt
download activities...
Aug 21 18:34:39 openvpn-access-server-sfo3 systemd[1]: Started Daily apt download
activities.
Aug 21 19:17:01 openvpn-access-server-sfo3 CRON[21214]: (root) CMD ( cd / &&
run-parts --report /etc/cron.hourly)
Aug 21 20:17:01 openvpn-access-server-sfo3 CRON[21309]: (root) CMD ( cd / &&
run-parts --report /etc/cron.hourly)
Aug 21 21:17:01 openvpn-access-server-sfo3 CRON[21352]: (root) CMD ( cd / &&
(Overly Simplified) Logging Pipeline Overview:
Source(s) → Log Collector → UI / Visualization → đŸ‘„ / đŸ€–
(A More Realistic) Logging Pipeline Overview:
Source → Filter → Log Collector → Broker → Log Aggregator → Indexer and Query → UI / Visualization → đŸ‘„ / đŸ€– / đŸ•”â€â™€
(with Ad-Hoc Stream Queries against the Broker)
Log Sources:
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions

Log Collectors → Brokers (Redis, Kafka, Kinesis) → Log Aggregators → Storage / Access (ElasticSearch, Loki)
Collectors - can “push” to either Brokers or Log Storage:
● Logspout
● FluentD / FluentBit
● Filebeat
● Promtail
● rsyslog

Log Collectors often are Log Aggregators.

Aggregators - pull from Brokers or systems and push to Log Storage:
● Logstash
● FluentD / FluentBit
● Promtail
To get the most out of a (Centralized) Logging Platform you need to structure your logs in such a way that they can be best consumed by your Sinks.
A standard format is MITRE’s Common Event Expression (CEE), represented as JSON. JSON is well supported by essentially every programming language and has the advantage of being both human and robot parsable. Emitting Structured Logs means humans, programs written by humans, and Centralized Logging Platforms can consume logs on equal footing.
Not all 3rd party applications you run in your ecosystem may support your logging format - you may have to ingest suboptimal logs or write transformers for them.
Logging Aside - Wire Format
{
"level": "error",
"ts": 1598044449.8620532,
"caller": "zappings/main.go:47",
"msg": "This is an ERROR message",
...
}
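For reference, a minimal sketch of emitting structured JSON logs like the snippet above with Uber’s zap library; the fields beyond level/ts/caller/msg are illustrative.

package main

import "go.uber.org/zap"

func main() {
	// zap.NewProduction emits JSON with level, ts, caller, and msg fields,
	// matching the wire format shown above.
	logger, err := zap.NewProduction()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	// Structured fields stay machine-parsable for the platform and readable by humans.
	logger.Error("This is an ERROR message",
		zap.String("request_id", "1234"),
		zap.Int("attempt", 3),
	)
}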
Log Storage and Usage
Where are the logs actually going?
Storage / Access: ElasticSearch, Loki
UI / Visualization: Kibana, Grafana
Sinks:
● Humans đŸ‘„: Engineers, Product Owners, Analysts, Finance Team
● Robots đŸ€–: Chat Bots, AIML, Alerting
● Legal đŸ•”â€â™€: Audit
Log Processing and Collection
How do we get (logs) there?
Log Processing Infrastructure
Log Collectors → Brokers (Redis, Kafka, Kinesis) → Log Aggregators → Storage / Access (ElasticSearch, Loki)
Log Collector Deployment Patterns - Hosts
On Bare Metal + Applications or Services, VMs + Applications or Services, and Container Orchestration hosts, a Log Collector runs as a service locally on the host and ships logs to the Log Processing Infrastructure.
Log Collector Deployment Patterns - Containers
Under Container Orchestration, each Container + Application runs with a Logging Sidecar, or a Log Collector runs as a service locally on the node; either way, logs flow to the Log Processing Infrastructure.
Log Collector Deployment Patterns - Functions
Serverless Functions emit to the Cloud Provider’s native logging; a Log Collector running as a separate service forwards those logs to the Log Processing Infrastructure.
OK 
 What about Metrics?
# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045
# A histogram, which has a pretty complex representation
in the text format:
# HELP http_request_duration_seconds A histogram of the
request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
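A minimal sketch of producing metrics like these from a Go service with the official Prometheus client library; the metric name and buckets follow the example above, while the handler, the simulated work, and the port are arbitrary.

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A histogram "instrument" for request durations, exposed in the text format shown above.
var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Help:    "A histogram of the request duration.",
	Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1},
})

func main() {
	prometheus.MustRegister(requestDuration)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		time.Sleep(20 * time.Millisecond) // stand-in for real work
		requestDuration.Observe(time.Since(start).Seconds())
		w.Write([]byte("ok"))
	})

	// The collector (e.g. a Prometheus server) scrapes this endpoint.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}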
Application Metrics “Instruments”: “Timers”, Counters, Gauges
What the Source instruments:
● Database Accesses
● External API Calls
● RPCs
● Critical Code Paths
● Runtime Metrics
● Host Metrics
(Overly Simplified) Metrics Pipeline Overview:
Source (“Instruments”: “Timers”, Counters, Gauges) → Metrics Collector → UI / Visualization → đŸ‘„ / đŸ€–
(A More Realistic) Metrics Pipeline Overview:
Source (“Instruments”: “Timers”, Counters, Gauges) → Metrics Collector → Metrics Aggregator → Long-Term Metrics Storage → UI / Visualization → đŸ‘„ / đŸ€–
Metrics Sources:
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions

Metrics Collectors: StatsD, Prometheus Push-Gateway, Prometheus
Metrics Aggregators / Long Term Storage: Graphite, Thanos, Cortex
Query / UI: Graphite, Grafana, Prometheus
Metrics Storage and Usage
Where are we going?
Query / UI: Graphite, Grafana, Prometheus
Metrics Aggregators & Long Term Storage: Graphite, Thanos, Cortex
Sinks:
● Humans đŸ‘„: Engineers, Product Owners, Analysts, Finance Team
● Robots đŸ€–: Chat Bots, AIML, Alerting
Metric Ingest and Collection
How do we get (metrics) there?
Metric Ingest Infrastructure
Metrics Collectors (StatsD, Prometheus Push-Gateway, Prometheus) → Metrics Aggregators / Long Term Storage (Graphite, Thanos, Cortex)
Metrics Collector Deployment Patterns - Hosts
On Bare Metal + Applications or Services, VMs + Applications or Services, and Container Orchestration hosts, metrics collectors and exporters run locally on the host and feed the Metrics Ingest Infrastructure.
Metrics Collector Deployment Patterns - Containers
Under Container Orchestration, each Containerized Application (Service) runs with metrics collector and exporter Sidecar(s); Service Discovery lets the Metrics Ingest Infrastructure find and scrape them.
Metrics Collector Deployment Patterns - Functions
Serverless Functions emit Cloud Provider native metrics; a Metrics Collector running as a separate service ingests those native metrics into the Metric Ingest Infrastructure.
But is it a
“Platform”?
It is only a “Platform” when every aspect of your business, from tactical engineering metrics to business metrics, is enabled “for free”, i.e. your team does not have to remember to integrate with the platform.
This is doubly true if you are talking
about “Tracing”
You had this...
Data Sources:
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions
Now you have this 

● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions
Whoa! WAY too many Data Sources
You need this ...
Data Sources:
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions

Platform:
● Exception Handling
● Logging
● Metrics
● Tracing

Sinks:
● Humans đŸ‘„: Engineers, Product Owners, Analysts, Finance Team
● Robots đŸ€–: Chat Bots, AIML, Alerting


 to serve them!
And you build this by ...
Choosing a unified set of platform tools for each domain (Exception Handling / Logging / Metrics / etc.) and building curated libraries that all of your applications consume to ensure they integrate into your platform.
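One way to read “curated libraries”: a hypothetical internal package (here called o11y; the name and scope are made up for illustration) that every service imports, so logging, and eventually metrics and tracing, come pre-wired to the platform.

// Package o11y is a hypothetical curated observability library: every service
// calls o11y.Init once and gets logging wired to the shared platform with
// consistent fields. A fuller version would also register metrics and set up
// trace propagation in the same call.
package o11y

import "go.uber.org/zap"

// Init returns a structured logger pre-tagged with the service name so every
// log line is attributable across the platform.
func Init(service string) (*zap.Logger, error) {
	logger, err := zap.NewProduction()
	if err != nil {
		return nil, err
	}
	return logger.With(zap.String("service", service)), nil
}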
And there just
isn’t one “best”
answer
I have my favorites

 but the answer is you still have
homework to do. I’ve named a
few of my favorites during this
talk - maybe they help?
Let’s briefly talk about tracing...
and why it is a great “forcing function” for
building a true Platform
Tracing only really works when 

 literally every system in your entire ecosystem integrates with it. Every blind spot is magnified in tracing.
A Distributed Tracing Primer
[Diagram: a request (Request - ID: 1234) recorded from Start Time to End Time + Success/Failure, with nested spans: Cache Miss, Service Call (ID: 1234), Query Wrapper, Database Query, DB Query, Hard Calculation Coordinator, Calculation Pipelines, and Render HTML.]
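As a sketch of what spans like these look like in code, here is a minimal example using the OpenTelemetry Go API; the span names, attributes, and error are illustrative, and it assumes a tracer provider and exporter have been configured elsewhere (otherwise the global tracer is a no-op).

package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

func handleRequest(ctx context.Context) {
	tracer := otel.Tracer("example-service")

	// Root span for the incoming request, tagged with the request ID.
	ctx, span := tracer.Start(ctx, "handle-request")
	span.SetAttributes(attribute.String("request.id", "1234"))
	defer span.End()

	// Child span for a downstream call; nesting mirrors the call tree in the diagram.
	_, dbSpan := tracer.Start(ctx, "database-query")
	if err := errors.New("cache miss"); err != nil {
		dbSpan.RecordError(err)
		dbSpan.SetStatus(codes.Error, err.Error())
	}
	dbSpan.End()
}

func main() {
	handleRequest(context.Background())
}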
(Distributed) Tracing Pipeline Overview:
Request-ID-Aware Source(s) → Trace Collector → Sampling → Trace Storage → UI / Visualization → đŸ‘„
Tracing Sources:
● Bare Metal + Applications or Services
● VMs + Applications or Services
● Container Orchestration
● Containers + Applications
● Serverless Functions

Trace Collectors / Trace Storage / UI / Visualizations:
These Spaces Intentionally Left Blank.
So why is Tracing
the “forcing
function”?
Every single time you have a Trace with a “blind spot” it creates red herrings and diverts attention from the real problem. So tracing lets you fix it by touching all of your systems and eliminating those blind spots by ...
Building an Observability Platform in 389 Difficult Steps
Eliminate those blind spots by ...

 building good and standard Exception Handling libraries.

 logging uniformly, with structure, using those libraries.

 exposing good instrumentation primitives in that library.

 adding tracing primitives via that library.
Eliminate those blind spots so 


 all of your engineers “get observability for free”

 and they can get more specialized observability with very little work.
Take what you have, process it and centralize it. Ensure that everyone
has the same tools and the same systems to consume them. Solve the
problems you have today and prepare to solve the ones you will have
soon.
Build a platform.
383 Difficult Steps Distilled
And if you’ve done that - you have an Observability Platform!
The End.
Thank you!
Questions? đŸ€”đŸ€”đŸ€”