Watch this Tech Talk: https://do.co/video_dworth
Dave Worth, Engineering Manager at Strava, lays out a strategy for choosing the right tech stack depending on your business and team needs. Watch as he guides you through tool sets that navigate around business constraints and regulatory concerns.
About the Presenter
Dave Worth is a web and backend engineer who specialized in observability while building reliable distributed systems at Strava and, previously, DigitalOcean. In his spare time, Dave loves cycling, jiu jitsu, and searching for another great math book to only read the first 50 pages of.
Building an Observability Platform in 389 Difficult Steps
2. Who am I?
David Worth
Sr. SRE and Engineering Manager at Strava
Previously Sr. Engineer and Engineering Manager at
DigitalOcean in Compute
4. As you may recall - I’ve talked about
the Venn Euler Diagram of
Observability before:
[Diagram: overlapping circles for Logging, Distributed Tracing, Error Handling, and Instrumentation; the overlaps are labeled Error Logs, Error Rates & Timing Information, Error Rates, and Call Traces and outcomes]
6. Warning.
This won’t be easy.
This is an Investment in Capabilities.
Those capabilities will pay dividends
in operations, customer value, and
business continuity.
8. Observability
Considerations
“Build, Buy, or Operate”
For each of these services there
are standard engineering
tradeoffs around building your
own, operating an OSS service,
and paying a 3rd party.
9. Observability
Considerations
“Stack”
The observability space is
extremely polyglot - even in fairly
homogeneous ecosystems like
Prometheus (Golang), clients will
be “stack” dependent.
Are you comfortable relying on
tools written in a language your
team may have limited familiarity
with?
10. Observability
Considerations
Where to keep the data?
You have a few options for these
services:
Cloud Provider managed
3rd Party managed
On-Prem externally-managed
On-Prem self-managed
11. Observability
Considerations
Retention vs. GDPR / CCPA / ...
If you are bound by privacy
compliance, ensure you retain
any controlled data (PII) no longer
than regulations allow.
If using a 3rd party - how do they
ensure compliance?
12. Observability
Considerations
Data Recording vs. Regulations
Never record passwords,
password hashes, CC#s, or CVVs
during a transaction, in logs or
otherwise, for PCI/HIPAA/etc.
If you use a 3rd party for logging,
etc. and you have ever
accidentally logged any of those
things, how can you remediate
that?
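One way to reduce the chance of recording such values is to scrub them before any log handler sees them. The following is a minimal sketch, not a complete control: the key names in SENSITIVE_KEYS are hypothetical, and the pattern only catches plain `key=value` or `key: value` text. Values inside structured fields should be dropped at the source instead.

```python
import logging
import re

# Hypothetical field names; extend to match what your own services log.
SENSITIVE_KEYS = ("password", "password_hash", "cc_number", "cvv")

# Matches "key=value" or "key: value" pairs for the sensitive keys above.
_PATTERN = re.compile(
    r"(?i)\b(" + "|".join(SENSITIVE_KEYS) + r")\b(\s*[=:]\s*)(\S+)"
)


def redact(text: str) -> str:
    """Replace the value of any sensitive key with a fixed marker."""
    return _PATTERN.sub(r"\1\2[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Scrub sensitive values from a record before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = None  # the message is already fully formatted
        return True
```

Attaching the filter to each handler (`handler.addFilter(RedactingFilter())`) rather than to a single logger means records from every logger that reaches that handler are scrubbed.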
13. Do you have a Data Protection Officer (DPO)?
Your DPO can help you and your engineers navigate the complex
requirements not just of observability platforms but also of what
you provide to your customers and to internal customers such as data
analysts and business development teams.
Find one.
15. What even is an
“Observability
Platform”?
An “Observability Platform” is a
set of shared tools your
organization uses to understand
the state of your system at any
given time, with some historical
context, to diagnose and improve
the product.
16. [Diagram: Data Sources feed the Platform, which serves the Sinks]
Data Sources: Bare Metal + Applications or Services; VM + Applications or Services; Container Orchestration; Container + Application; Serverless Function
Platform: Exception Handling; Logging; Metrics; Tracing; Specialized Domains: DBs, etc.; APM
Sinks: Humans 👥 (Engineers, Product Owners, Analysts, Finance Team); Robots 🤖 (Chat Bots, AI/ML, Alerting)
17. Other Sources
● Remote Clients: Mobile (Native) Applications and Browsers
● Short lived batch jobs (cron?)
● Long lived but inconsistently run batch jobs (Spark!)
● Networking Devices (routers, switches, firewalls, load-balancers, etc.)
● IoT Devices
20. Let’s start with what you have:
Hosts / Containers / Functions
each of which produces some or many
Errors / Logs / Metrics / API calls
[Diagram: Data Sources: Bare Metal + Applications or Services; VM + Applications or Services; Container Orchestration; Container + Application; Serverless Function]
Now let’s talk about what you can do with them:
Aggregate Errors / Logs / Metrics / API Call information
into
a Unified Observability Platform
21. Errors!
We do! We do have Errors!
panic: really bad error
goroutine 1 [running]:
main.main()
$GOPATH/src/github.com/daveworth/foobar/main.go:14 +0x7b
exit status 2
27. [Diagram: log pipeline]
Log Sources: Bare Metal + Applications or Services; VM + Applications or Services; Container Orchestration; Container + Application; Serverless Function
Log Collectors
Brokers: Redis; Kafka; Kinesis
Log Aggregators
Storage / Access: ElasticSearch; Loki
28. Collectors - can “push” to either Brokers or Log Storage
Logspout
FluentD / FluentBit
Filebeat
Promtail
rsyslog
Log Collectors often are Log Aggregators
Aggregators - pull from Brokers or systems and push to Log Storage
Logstash
FluentD / FluentBit
Promtail
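The collector / broker / aggregator split can be sketched in a few lines. This is an illustration only: an in-process `queue.Queue` stands in for a broker such as Kafka, a plain list stands in for log storage, and the function names `collect` and `aggregate` are hypothetical.

```python
import json
import queue


def collect(lines, broker: queue.Queue) -> None:
    """Collector: push raw log lines from one source to the broker."""
    for line in lines:
        broker.put(line)


def aggregate(broker: queue.Queue, storage: list) -> None:
    """Aggregator: pull from the broker, parse, and push to storage."""
    while True:
        try:
            line = broker.get_nowait()
        except queue.Empty:
            return  # broker drained
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            event = None
        if not isinstance(event, dict):
            # Unstructured line: keep it, but mark it so it can be found later.
            event = {"msg": line, "unstructured": True}
        storage.append(event)
```

Keeping the broker between the two stages lets collectors stay small and lets the aggregator fall behind briefly without losing lines at the source.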
29. Logging Aside - Wire Format
To get the most out of a (Centralized) Logging
Platform you need to structure your logs in such a way
that they can be best consumed by your Sinks.
A standard format is MITRE’s Common Event
Expression (CEE), represented as JSON. JSON is
well supported by essentially every programming
language and has the advantage of being both human
and robot parsable. Emitting Structured Logs means
humans, programs written by humans, and Centralized
Logging Platforms can consume logs on equal footing.
Not all 3rd party applications you run in your
ecosystem may support your logging format - you may
have to ingest suboptimal logs or write transformers
for them.
{
"level": "error",
"ts": 1598044449.8620532,
"caller": "zappings/main.go:47",
"msg": "This is an ERROR message",
...
}
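As a rough sketch of how an application might emit events shaped like the one above, the following uses only the Python standard library. The `JSONFormatter` and `get_logger` names, and the convention of passing extra fields under a `fields` key, are assumptions made for this example rather than any particular library's API.

```python
import json
import logging


class JSONFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        # Hypothetical convention: extra structured fields travel in `fields`.
        event = dict(getattr(record, "fields", {}))
        event.update(
            {
                "level": record.levelname.lower(),
                "ts": record.created,  # seconds since the epoch, as a float
                "caller": f"{record.filename}:{record.lineno}",
                "msg": record.getMessage(),
            }
        )
        return json.dumps(event, default=str)


def get_logger(name: str) -> logging.Logger:
    """Return a logger that writes structured JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JSONFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

A call such as `get_logger("zappings").error("This is an ERROR message", extra={"fields": {"request_id": "1234"}})` then produces a single JSON line that both people and log pipelines can parse.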
34. Log Collector Deployment Patterns - Hosts
[Diagram: each host (Bare Metal + Applications or Services; VM + Applications or Services; Container Orchestration) runs a Log Collector as a service locally on the host, feeding the Log Processing Infrastructure]
36. Log Collector Deployment Patterns - Functions
[Diagram: Serverless Function; Cloud Provider native logging; Log Collector running as a separate service; Log Processing Infrastructure]
37. OK - What about Metrics?
# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045
# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
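To make the histogram representation less mysterious, here is a small sketch that produces the same text layout from raw observations. It is a toy, not a Prometheus client library: the `Histogram` class and its methods are hypothetical names for this example. The point it illustrates is that each `le` bucket is cumulative, counting every observation less than or equal to its bound.

```python
import bisect


class Histogram:
    """A cumulative histogram that renders the text exposition format."""

    def __init__(self, name: str, help_text: str, bounds: list):
        self.name = name
        self.help_text = help_text
        self.bounds = sorted(bounds)
        # One counter per finite bucket plus a final +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound >= value, i.e. its `le` bucket.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def render(self) -> str:
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} histogram",
        ]
        running = 0
        labels = [repr(b) for b in self.bounds] + ["+Inf"]
        for le, count in zip(labels, self.counts):
            running += count  # buckets are cumulative
            lines.append(f'{self.name}_bucket{{le="{le}"}} {running}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {running}")
        return "\n".join(lines) + "\n"
```

For example, with bounds `[0.05, 0.1]` and observations of 0.03, 0.05, 0.07, and 0.5 seconds, the three bucket lines carry the cumulative counts 2, 3, and 4.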
46. Metrics Collector Deployment Patterns - Hosts
[Diagram: each host (Bare Metal + Applications or Services; VM + Applications or Services; Container Orchestration) runs Metrics collectors and exporters locally on the host, feeding the Metrics Ingest Infrastructure]
47. Metrics Collector Deployment Patterns - Containers
[Diagram: Container Orchestration running a Containerized Application as several Service + Sidecar(s) pairs; Metrics collectors and exporters; Service Discovery; Metrics Ingest Infrastructure]
48. Metrics Collector Deployment Patterns - Functions
[Diagram: Serverless Function; Cloud Provider native metrics; Metrics Collector running as a separate service ingesting native metrics; Metric Ingest Infrastructure]
49. But is it a
“Platform”?
It is only a “Platform” when every
aspect of your business, from
tactical engineering metrics to
business metrics, is enabled “for
free”, i.e. your team does not
have to remember to integrate
with the platform.
50. This is doubly true if you are talking
about “Tracing”
51. You had this...
[Diagram: Data Sources: Bare Metal + Applications or Services; VM + Applications or Services; Container Orchestration; Container + Application; Serverless Function]
52. Now you have this…
[Diagram: many of each: Bare Metal + Applications or Services; VMs + Applications or Services; Container Orchestration; Containers + Application; Serverless Functions]
Whoa! WAY too many Data Sources
53. [Diagram: Data Sources, Platform, and Sinks]
Data Sources: Bare Metal + Applications or Services; VM + Applications or Services; Container Orchestration; Container + Application; Serverless Function
You need this ...
Platform: Exception Handling; Logging; Metrics; Tracing
to serve them!
Sinks: Humans 👥 (Engineers, Product Owners, Analysts, Finance Team); Robots 🤖 (Chat Bots, AI/ML, Alerting)
54. And you build this
by ...
Choosing a unified set of
platform tools for each domain
(Exception Handling / Logging /
Metrics / etc…) and building
curated libraries that all of your
applications consume to ensure
they integrate into your platform.
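A curated library can be as small as one decorator that every service applies to its handlers. The sketch below is only illustrative: the `observed` name is hypothetical, and the three in-process counters stand in for whichever metrics backend your platform actually uses.

```python
import functools
import logging
import time
from collections import Counter

# In-process stand-ins for a real metrics backend (an assumption here).
CALLS = Counter()
ERRORS = Counter()
SECONDS = Counter()

log = logging.getLogger("platform")


def observed(func):
    """Wrap a handler so it is counted, timed, and its errors are logged."""
    name = func.__name__

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        CALLS[name] += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            ERRORS[name] += 1
            log.exception("unhandled error in %s", name)
            raise  # observing must not change behavior
        finally:
            SECONDS[name] += time.perf_counter() - start

    return wrapper
```

An engineer who writes `@observed` above a function gets call counts, error counts, timing, and error logs without having to remember each integration separately, which is the "for free" property the platform is aiming for.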
55. And there just
isn’t one “best”
answer
I have my favorites
… but the answer is you still have
homework to do. I’ve named a
few of my favorites during this
talk - maybe they help?
56. Let’s briefly talk about tracing...
and why it is a great “forcing function” for
building a true Platform
58. A Distributed Tracing Primer
[Diagram: a trace for Request - ID: 1234, running from Start Time to End Time + Success/Failure, broken into spans: Cache Miss; Service Call (ID: 1234); Render HTML; Query Wrapper; Database Query; DB Query; Hard Calculation Coordinator; Calculation Pipelines]
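The ingredients of a trace (a shared ID, a start time, an end time, and a success or failure outcome per unit of work) can be sketched with a toy in-process tracer. The `Span` and `Tracer` names are hypothetical and this is not the API of any real tracing library; a real tracer would also propagate the ID across service calls and export spans to a collector.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    name: str
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    ok: Optional[bool] = None
    children: list = field(default_factory=list)


class Tracer:
    """Toy tracer that nests spans opened inside one another."""

    def __init__(self):
        self._stack = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name: str, trace_id: Optional[str] = None):
        parent = self._stack[-1] if self._stack else None
        if trace_id is None:
            # Children inherit the request's ID; roots get a fresh one.
            trace_id = parent.trace_id if parent is not None else uuid.uuid4().hex
        current = Span(trace_id=trace_id, name=name)
        if parent is not None:
            parent.children.append(current)
        self._stack.append(current)
        try:
            yield current
            current.ok = True
        except Exception:
            current.ok = False
            raise
        finally:
            current.end = time.time()
            self._stack.pop()
```

Opening a root span with `tracer.span("request", trace_id="1234")` and nesting further `tracer.span(...)` blocks inside it yields a tree in which every child carries the same ID as the request.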
60. [Diagram: tracing pipeline]
Tracing Sources: Bare Metal + Applications or Services; VM + Applications or Services; Container Orchestration; Container + Application; Serverless Function
Trace Collectors; Trace Storage; UI / Visualizations: These Spaces Intentionally Left Blank.
61. So why is Tracing
the “forcing
function”?
Every single time you have a
Trace with a “blind spot” it
creates red herrings and diverts
attention from the real problem.
So tracing lets you fix it by
touching all of your systems and
eliminating those blind spots by
...
63. Eliminate those blind spots by ...
… building good and standard Exception Handling libraries.
… logging uniformly, with structure, with those libraries.
… exposing good instrumentation primitives in that library.
… adding tracing primitives via that library.
64. Eliminate those blind spots so …
… all of your engineers “get observability for free”
… and they can get more specialized observability with very little work.
65. 389 Difficult Steps Distilled
Take what you have, process it, and centralize it. Ensure that everyone
has the same tools and the same systems to consume them. Solve the
problems you have today and prepare to solve the ones you will have
soon.
Build a platform.
66. And if you’ve done that - you have an Observability Platform!