KEVIN CRAWLEY
Instana + Single Music
Observability Workshop
w/ Jaeger and Prometheus
Observability Workshop
https://bit.ly/ot-ee-workshop
A system is observable if the behavior of the
entire system can be determined by only looking
at its inputs and outputs.
Lesson: control theory is a well-documented field
we can learn from rather than reinventing it
What is Observability?
Kalman, 1961: "On the general theory of control systems"
"Observability aims to provide highly granular insights
into the behavior of production systems along with
rich context, perfect for debugging and performance
analysis purposes.” – Cindy Sridharan @copyconstruct
What is the goal of observability?
How many of you are running
staging environments?
Why does my organization need
Observability?
Now, how many of you
actually trust your staging
environments?
Why does my organization need
Observability?
This is your staging environment
… and this is your prod environment
• Gain a basic understanding of Distributed Tracing and
“How it works”
• Implement Metrics and Tracing in a small microservice
app using FOSS tools
• Understand how metrics and distributed tracing can help
your organization manage complexity
• Understand the limitations of FOSS and the challenges
ahead
What are the goals of this workshop?
• Workshop (1-1.5 hours)
• How Does Distributed Tracing Work
• Challenges with FOSS monitoring
• Advanced Use Cases w/ Distributed Tracing
(Single Music)
• Q&A
Agenda
Lab 01 - Setting up Kubernetes in Docker Enterprise Edition Lab
Lab 02 - Setting up GitLab, our Microservice Application
Repository, and Kubernetes Integration
Lab 03 - Deploying our Microservice Application and Adding
Observability
Lab 04 - Monitoring Application Metrics with Grafana / Prometheus
Lab 05 - Observing with Jaeger and Breaking Things
Lab 06 - Advanced Analytics and Use Cases with Automated
Distributed Tracing
Workshop Overview
w/ Jaeger and Prometheus
Observability Workshop
https://bit.ly/ot-ee-workshop
How does distributed tracing work?
At runtime, custom headers / metadata are injected into
each request; these include identifiers that enable trace
backends to correlate spans across requests (see the
sketch after this list):
• X-B3-TraceId: 128 or 64 lower-hex encoded bits
• X-B3-SpanId: 64 lower-hex encoded bits
• X-B3-ParentSpanId: 64 lower-hex encoded bits
• X-B3-Sampled: Bool
• X-B3-Flags: “1” includes DEBUG
It’s literally just headers / metadata
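To make the mechanics concrete, here is a minimal Python sketch of B3 propagation, assuming a hypothetical service that reads the inbound headers and builds headers for its outbound call. The function names and the dict-based context are illustrative, not the workshop's code:

```python
# Minimal B3 propagation sketch (hypothetical service, not the workshop code).
# Inbound B3 headers are read, and a child span's headers are built for any
# outbound call: same TraceId, fresh SpanId, ParentSpanId = the current span.
import secrets


def extract_b3(headers: dict) -> dict:
    """Read B3 context from an incoming request, starting a new trace if absent."""
    return {
        "trace_id": headers.get("X-B3-TraceId") or secrets.token_hex(16),  # 128-bit
        "span_id": headers.get("X-B3-SpanId") or secrets.token_hex(8),     # 64-bit
        "sampled": headers.get("X-B3-Sampled", "1"),
    }


def inject_b3(ctx: dict) -> dict:
    """Build B3 headers for an outbound request: a child span of ctx."""
    return {
        "X-B3-TraceId": ctx["trace_id"],        # unchanged for the whole trace
        "X-B3-SpanId": secrets.token_hex(8),    # fresh id for the child span
        "X-B3-ParentSpanId": ctx["span_id"],    # links the child to its parent
        "X-B3-Sampled": ctx["sampled"],         # propagate the sampling decision
    }


if __name__ == "__main__":
    inbound = {"X-B3-TraceId": "af38bc9", "X-B3-SpanId": "b9ca"}
    print(inject_b3(extract_b3(inbound)))
```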
HTTP Request Example
service-a requests:
GET service-b:8080/api/groceries
X-B3-TraceId: af38bc9
X-B3-SpanId: b9ca
X-B3-ParentSpanId: nil
service-b receives:
GET service-b:8080/api/groceries
X-B3-TraceId: af38bc9
X-B3-SpanId: b9ca
X-B3-ParentSpanId: nil
service-b requests:
GET service-c:8080/api/products
X-B3-TraceId: af38bc9
X-B3-SpanId: a3bc
X-B3-ParentSpanId: b9ca
service-c receives:
GET service-c:8080/api/products
X-B3-TraceId: af38bc9
X-B3-SpanId: a3bc
X-B3-ParentSpanId: b9ca
Gantt Chart
GET /api/groceries 800ms
  GET /api/groceries 550ms
    GET /api/products 400ms
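A trace backend can rebuild exactly this chart from the collected spans: group by TraceId, then attach each span to its ParentSpanId. A toy Python sketch, with an illustrative span model rather than Jaeger's actual one:

```python
# Sketch of how a backend could rebuild the Gantt chart above: index spans by
# ParentSpanId, then walk the tree from the root. Field names are illustrative.
from collections import defaultdict

spans = [
    {"trace_id": "af38bc9", "span_id": "b9ca", "parent_id": None,
     "name": "GET /api/groceries", "ms": 800},
    {"trace_id": "af38bc9", "span_id": "a3bc", "parent_id": "b9ca",
     "name": "GET /api/products", "ms": 400},
]

children = defaultdict(list)
roots = []
for span in spans:
    if span["parent_id"] is None:
        roots.append(span)                         # no parent: root of the trace
    else:
        children[span["parent_id"]].append(span)   # index child spans by parent


def render(span, depth=0):
    """Print the span tree with indentation mirroring the Gantt chart."""
    print("  " * depth + f"{span['name']} {span['ms']}ms")
    for child in children[span["span_id"]]:
        render(child, depth + 1)


for root in roots:
    render(root)
```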
• Correlation is nearly impossible across
multiple vendors / solutions (Logging, Metrics,
Traces)
• Large scale applications require equally large
scale monitoring (cpu/mem, i/o, distributed
systems, clustered storage, sharded TSDB)
Challenges of FOSS monitoring
Is anything ever truly free?
• Distributed tracing exposes a lot of data which
goes unanalyzed by FOSS tools
• The same holds true for Metrics and Logging
• … and Alerting
Actually, can I just show you what is possible?
Current solutions only collect / display
There is no analysis of the data
How Distributed Tracing and
Log Insights empower Single
Music
Advanced Use Cases
• Operated by 3 engineers (1 FE/1 BE/1 SRE)
• Over 20k transactions / hour, 20+ integrations,
100k LOC, with less than 15% test coverage
• Launched in 2018 with 15 microservices on
Docker Swarm – has since expanded to over 28
microservices with zero additional engineering
personnel
Visualizing Large and Complex
Environments
What happens if we aggregate
timing, error rate, and # of reqs?
What can analyzing
Distributed Traces
tell us?
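As a sketch of what that aggregation looks like on the metrics side, here is a minimal Python example using the prometheus_client library to expose request count, error count, and timing per endpoint. The metric and label names are illustrative, not from the workshop app:

```python
# Minimal RED-style instrumentation sketch with prometheus_client:
# a Counter for requests/errors and a Histogram for timing, per endpoint.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Requests served",
                   ["endpoint", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    ["endpoint"])


def handle(endpoint: str) -> None:
    start = time.time()
    time.sleep(random.uniform(0.05, 0.3))                 # simulated work
    status = "500" if random.random() < 0.05 else "200"   # simulated 5% errors
    REQUESTS.labels(endpoint=endpoint, status=status).inc()
    LATENCY.labels(endpoint=endpoint).observe(time.time() - start)


if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes /metrics on :8000
    while True:
        handle("/api/groceries")
```

From just these two series, request rate, error rate, and latency percentiles can all be derived at query time in Prometheus and graphed in Grafana.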
Database Optimizations, Caching,
and Concurrency
What problems has
Distributed Tracing
helped solve?
Slow Death
of a Service
• DBO (Hibernate query) causing an O(n log n)
rise in latency and processing time
• Application Dashboard indicated an issue with
overall latency increasing
• Fix deployed and improvement was observed
immediately
Rise in Latency + Processing Time
Issue
Resolved
• We implemented Redis for caching, and
processing time went down
• However, we didn’t account for token policies
changing and they suddenly began to expire
after 30 seconds
• Alerting around error rates for this endpoint
raised our awareness around this issue
Caching solved one problem
… but caused another
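The lesson generalizes: a cache entry's TTL should be derived from the cached token's own expiry, not hard-coded. A hypothetical Python sketch using the redis library; fetch_token, the key scheme, and the 5-second safety margin are all illustrative, not Single Music's code:

```python
# Hypothetical token cache: the Redis TTL follows the token's actual lifetime,
# so a policy change (e.g. tokens now expiring after 30s) cannot leave stale
# tokens sitting in the cache.
import json

import redis

r = redis.Redis(host="localhost", port=6379)


def fetch_token(service: str) -> dict:
    """Stand-in for a real token request; returns the token and its lifetime."""
    return {"token": "abc123", "expires_in": 30}  # provider now says 30 seconds


def get_token(service: str) -> str:
    cached = r.get(f"token:{service}")
    if cached is not None:
        return json.loads(cached)["token"]
    fresh = fetch_token(service)
    # Expire the cache entry slightly before the token itself expires.
    ttl = max(1, fresh["expires_in"] - 5)
    r.setex(f"token:{service}", ttl, json.dumps(fresh))
    return fresh["token"]
```

Pairing a cache like this with alerting on per-endpoint error rates, as described above, is what surfaced the mismatch quickly.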
Context is critical when doing
Contributing Factor Analysis
Metrics are not
standalone; they
have relationships
Logs can benefit from analytics too!
Let’s not forget about
Logging
We utilize a mix of Instana, Logz.io
and Grafana to manage our systems
Custom Dashboards
deliver peace of mind
• Using FOSS monitoring is a great way to both
learn and demonstrate the value of
observability to your peers
• Understand the limitations of FOSS and be
prepared to invest in either 3rd-party tooling or
managing your own monitoring infrastructure
Focus on what matters to your business
… at Single Music we focus on delivering music
Schedule a meeting with me!
Want to learn more?
Come visit us at the Instana booth, #S23
Rate & Share
Rate this session in the
DockerCon App
Follow me @notsureifkevin
and share #DockerCon