DockerCon SF 2019 - Observability Workshop
- 2. Observability Workshop w/ Jaeger and Prometheus
https://bit.ly/ot-ee-workshop
- 3. What is Observability?
A system is observable if the behavior of the entire system can be determined by looking only at its inputs and outputs.
(Kalman, 1961: "On the General Theory of Control Systems")
Lesson: control theory is a well-documented discipline we can learn from, rather than reinventing it ourselves.
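For the linear time-invariant case, Kalman's notion has a precise test. This is a standard control-theory result, included here for context rather than taken from the slide:

\dot{x} = Ax + Bu, \qquad y = Cx

\text{the system is observable} \iff \operatorname{rank}
\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix} = n

In other words, the internal state x can be reconstructed from the outputs y alone exactly when the observability matrix has full rank.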
- 4. "Observability aims to provide highly granular insights
into the behavior of production systems along with
rich context, perfect for debugging and performance
analysis purposes.” – Cindy Sridharan @copyconstruct
What is the goal of observability?
- 5. Why does my organization need Observability?
How many of you are running staging environments?
- 6. Why does my organization need Observability?
Now, how many of you actually trust your staging environments?
- 10. What are the goals of this workshop?
• Gain a basic understanding of Distributed Tracing and how it works
• Implement Metrics and Tracing in a small microservice app using FOSS tools
• Understand how metrics and distributed tracing can help your organization manage complexity
• Understand the limitations of FOSS and the challenges ahead
- 11. Agenda
• Workshop (1-1.5 hours)
• How Distributed Tracing works
• Challenges with FOSS monitoring
• Advanced use cases w/ Distributed Tracing (Single Music)
• Q&A
- 12. Workshop Overview
Lab 01 - Setting up Kubernetes in Docker Enterprise Edition
Lab 02 - Setting up GitLab, our Microservice Application Repository, and Kubernetes Integration
Lab 03 - Deploying our Microservice Application and Adding Observability
Lab 04 - Monitoring Application Metrics with Grafana / Prometheus
Lab 05 - Observing with Jaeger and Breaking Things
Lab 06 - Advanced Analytics and Use Cases with Automated Distributed Tracing
- 13. Observability Workshop w/ Jaeger and Prometheus
https://bit.ly/ot-ee-workshop
- 15. It's literally just headers / metadata
At runtime, custom headers / metadata are injected into each request; these include identifiers that enable trace backends to correlate spans across requests:
• X-B3-TraceId: 128 or 64 lower-hex encoded bits
• X-B3-SpanId: 64 lower-hex encoded bits
• X-B3-ParentSpanId: 64 lower-hex encoded bits
• X-B3-Sampled: boolean
• X-B3-Flags: "1" indicates debug
- 16. HTTP Request Example
service-a requests:
GET service-b:8080/api/groceries
X-B3-TraceId: af38bc9
X-B3-SpanId: b9ca
X-B3-ParentSpanId: nil
service-b receives:
GET service-b:8080/api/groceries
X-B3-TraceId: af38bc9
X-B3-SpanId: b9ca
X-B3-ParentSpanId: nil
service-b requests:
GET service-c:8080/api/products
X-B3-TraceId: af38bc9
X-B3-SpanId: a3bc
X-B3-ParentSpanId: b9ca
service-c receives:
GET service-c:8080/api/products
X-B3-TraceId: af38bc9
X-B3-SpanId: a3bc
X-B3-ParentSpanId: b9ca
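
The propagation above takes only a few lines to implement by hand. Here is a minimal Python sketch of the idea; the helper name and header handling are illustrative assumptions, not code from the workshop labs:

import secrets

def b3_headers_for_downstream(incoming):
    """Build B3 headers for an outgoing request.

    The incoming span becomes the parent of the new span, and the
    trace id is carried through the whole request chain unchanged.
    """
    # Start a brand-new 128-bit trace if we are at the edge of the system.
    trace_id = incoming.get("X-B3-TraceId") or secrets.token_hex(16)
    outgoing = {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": secrets.token_hex(8),  # fresh 64-bit span id for this hop
    }
    if incoming.get("X-B3-SpanId"):
        outgoing["X-B3-ParentSpanId"] = incoming["X-B3-SpanId"]
    return outgoing

# service-b receives a request from service-a, then calls service-c:
incoming = {"X-B3-TraceId": "af38bc9", "X-B3-SpanId": "b9ca"}
print(b3_headers_for_downstream(incoming))
# e.g. {'X-B3-TraceId': 'af38bc9', 'X-B3-SpanId': '3f9c2a61d0b47e15', 'X-B3-ParentSpanId': 'b9ca'}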
- 18. Challenges of FOSS monitoring
Is anything ever truly free?
• Correlation is nearly impossible across multiple vendors / solutions (Logging, Metrics, Traces)
• Large-scale applications require equally large-scale monitoring (cpu/mem, i/o, distributed systems, clustered storage, sharded TSDB)
- 19. Current solutions only collect / display; there is no analysis of the data
• Distributed tracing exposes a lot of data which goes unanalyzed by FOSS tools
• The same holds true for Metrics and Logging
• … and Alerting
Actually, can I just show you what is possible?
- 21. • Operated by 3 engineers (1 FE / 1 BE / 1 SRE)
• Over 20k transactions per hour, 20+ integrations, 100k LOC, with less than 15% test coverage
• Launched in 2018 with 15 microservices on Docker Swarm; has since expanded to over 28 microservices with zero additional engineering personnel
- 25. What can analyzing Distributed Traces tell us?
What happens if we aggregate timing, error rate, and number of requests?
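
As a sketch of what such aggregation looks like, here is a hypothetical Python example that rolls a batch of spans up into RED-style metrics (Rate, Errors, Duration). The Span fields are assumptions for illustration, not any specific tracer's schema:

import math
from dataclasses import dataclass

@dataclass
class Span:
    operation: str
    duration_ms: float
    error: bool

def aggregate(spans):
    """Aggregate spans into request count, error rate, and p95 latency."""
    count = len(spans)
    if count == 0:
        return {"requests": 0, "error_rate": 0.0, "p95_latency_ms": 0.0}
    errors = sum(1 for s in spans if s.error)
    durations = sorted(s.duration_ms for s in spans)
    # Nearest-rank p95: the smallest duration >= 95% of all observations.
    p95 = durations[max(0, math.ceil(0.95 * count) - 1)]
    return {"requests": count, "error_rate": errors / count, "p95_latency_ms": p95}

spans = [
    Span("GET /api/groceries", 42.0, False),
    Span("GET /api/groceries", 350.0, True),
    Span("GET /api/groceries", 55.0, False),
]
print(aggregate(spans))
# {'requests': 3, 'error_rate': 0.3333333333333333, 'p95_latency_ms': 350.0}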
- 31. Rise in Latency + Processing Time
• A DBO (Hibernate query) was causing an O(n log n) rise in latency and processing time
• The application dashboard indicated an issue with overall latency increasing
• A fix was deployed and improvement was observed immediately
- 34. Caching solved one problem … but caused another
• We implemented Redis for caching, and processing time went down
• However, we didn't account for token policies changing, and tokens suddenly began to expire after 30 seconds
• Alerting on error rates for this endpoint raised our awareness of the issue
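
A minimal sketch of the lesson here, assuming redis-py and a hypothetical fetch_token callable (neither is shown in the talk): tie the cache TTL to the token's actual expiry rather than a fixed value, so a policy change cannot leave stale tokens in the cache.

import redis

r = redis.Redis()

def get_token(fetch_token):
    """Return a cached auth token, refreshing it when the Redis TTL lapses.

    fetch_token is a hypothetical callable returning (token, expires_in_seconds)
    from the upstream auth provider.
    """
    cached = r.get("auth:token")
    if cached is not None:
        return cached.decode()

    token, expires_in = fetch_token()
    # Expire the cache entry a few seconds before the token itself expires,
    # so we never serve a token the provider already rejects.
    r.setex("auth:token", max(1, expires_in - 5), token)
    return token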
- 38. Metrics are not standalone; they have relationships
Context is critical when doing Contributing Factor Analysis
- 45. Custom Dashboards deliver peace of mind
We utilize a mix of Instana, Logz.io, and Grafana to manage our systems
- 48. Focus on what matters to your business … at Single Music we focus on delivering music
• Using FOSS monitoring is a great way to both learn and demonstrate the value of observability to your peers
• Understand the limitations of FOSS and be prepared to invest in either 3rd-party tooling or managing your own monitoring infrastructure
- 49. Want to learn more?
Schedule a meeting with me!
Come visit us at Instana Booth #S23
- 50. Rate & Share
Rate this session in the DockerCon App
Follow me @notsureifkevin and share #DockerCon