DockerCon SF 2019 - Observability Workshop
- 2. Observability Workshop w/ Jaeger and Prometheus
https://bit.ly/ot-ee-workshop
- 3. What is Observability?
A system is observable if the behavior of the entire system can be determined by looking only at its inputs and outputs.
(Kalman, 1961: "On the General Theory of Control Systems")
Lesson: control theory is a well-documented discipline we can learn from, rather than reinventing it ourselves.
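For the linear time-invariant case, Kalman's notion has a precise test. This is a standard control-theory result, included here for context rather than taken from the slide:

\dot{x} = Ax + Bu, \qquad y = Cx

\text{the system is observable} \iff \operatorname{rank}
\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix} = n

In other words, the internal state x can be reconstructed from the outputs y alone exactly when the observability matrix has full rank.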
- 4. "Observability aims to provide highly granular insights
into the behavior of production systems along with
rich context, perfect for debugging and performance
analysis purposes.” – Cindy Sridharan @copyconstruct
What is the goal of observability?
- 5. Why does my organization need Observability?
How many of you are running staging environments?
- 6. Why does my organization need Observability?
Now, how many of you actually trust your staging environments?
- 10. What are the goals of this workshop?
• Gain a basic understanding of Distributed Tracing and how it works
• Implement Metrics and Tracing in a small microservice app using FOSS tools
• Understand how metrics and distributed tracing can help your organization manage complexity
• Understand the limitations of FOSS and the challenges ahead
- 11. Agenda
• Workshop (1-1.5 hours)
• How Distributed Tracing works
• Challenges with FOSS monitoring
• Advanced use cases w/ Distributed Tracing (Single Music)
• Q&A
- 12. Workshop Overview
Lab 01 - Setting up Kubernetes in Docker Enterprise Edition
Lab 02 - Setting up GitLab, our Microservice Application Repository, and Kubernetes Integration
Lab 03 - Deploying our Microservice Application and Adding Observability
Lab 04 - Monitoring Application Metrics with Grafana / Prometheus
Lab 05 - Observing with Jaeger and Breaking Things
Lab 06 - Advanced Analytics and Use Cases with Automated Distributed Tracing
- 13. Observability Workshop w/ Jaeger and Prometheus
https://bit.ly/ot-ee-workshop
- 15. It's literally just headers / metadata
At runtime, custom headers / metadata are injected into each request; these include identifiers that enable trace backends to correlate spans across requests:
• X-B3-TraceId: 128 or 64 lower-hex encoded bits
• X-B3-SpanId: 64 lower-hex encoded bits
• X-B3-ParentSpanId: 64 lower-hex encoded bits
• X-B3-Sampled: boolean
• X-B3-Flags: "1" indicates debug
- 16. HTTP Request Example
service-a requests:
GET service-b:8080/api/groceries
X-B3-TraceId: af38bc9
X-B3-SpanId: b9ca
X-B3-ParentSpanId: nil
service-b receives:
GET service-b:8080/api/groceries
X-B3-TraceId: af38bc9
X-B3-SpanId: b9ca
X-B3-ParentSpanId: nil
service-b requests:
GET service-c:8080/api/products
X-B3-TraceId: af38bc9
X-B3-SpanId: a3bc
X-B3-ParentSpanId: b9ca
service-c receives:
GET service-c:8080/api/products
X-B3-TraceId: af38bc9
X-B3-SpanId: a3bc
X-B3-ParentSpanId: b9ca
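
The propagation above takes only a few lines to implement by hand. Here is a minimal Python sketch of the idea; the helper name and header handling are illustrative assumptions, not code from the workshop labs:

import secrets

def b3_headers_for_downstream(incoming):
    """Build B3 headers for an outgoing request.

    The incoming span becomes the parent of the new span, and the
    trace id is carried through the whole request chain unchanged.
    """
    # Start a brand-new 128-bit trace if we are at the edge of the system.
    trace_id = incoming.get("X-B3-TraceId") or secrets.token_hex(16)
    outgoing = {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": secrets.token_hex(8),  # fresh 64-bit span id for this hop
    }
    if incoming.get("X-B3-SpanId"):
        outgoing["X-B3-ParentSpanId"] = incoming["X-B3-SpanId"]
    return outgoing

# service-b receives a request from service-a, then calls service-c:
incoming = {"X-B3-TraceId": "af38bc9", "X-B3-SpanId": "b9ca"}
print(b3_headers_for_downstream(incoming))
# e.g. {'X-B3-TraceId': 'af38bc9', 'X-B3-SpanId': '3f9c2a61d0b47e15', 'X-B3-ParentSpanId': 'b9ca'}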
- 18. Challenges of FOSS monitoring
Is anything ever truly free?
• Correlation is nearly impossible across multiple vendors / solutions (Logging, Metrics, Traces)
• Large-scale applications require equally large-scale monitoring (cpu/mem, i/o, distributed systems, clustered storage, sharded TSDB)
- 19. Current solutions only collect / display; there is no analysis of the data
• Distributed tracing exposes a lot of data which goes unanalyzed by FOSS tools
• The same holds true for Metrics and Logging
• … and Alerting
Actually, can I just show you what is possible?
- 21. • Operated by 3 engineers (1 FE / 1 BE / 1 SRE)
• Over 20k transactions per hour, 20+ integrations, 100k LOC, with less than 15% test coverage
• Launched in 2018 with 15 microservices on Docker Swarm; has since expanded to over 28 microservices with zero additional engineering personnel
- 25. What can analyzing Distributed Traces tell us?
What happens if we aggregate timing, error rate, and number of requests?
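
As a sketch of what such aggregation looks like, here is a hypothetical Python example that rolls a batch of spans up into RED-style metrics (Rate, Errors, Duration). The Span fields are assumptions for illustration, not any specific tracer's schema:

import math
from dataclasses import dataclass

@dataclass
class Span:
    operation: str
    duration_ms: float
    error: bool

def aggregate(spans):
    """Aggregate spans into request count, error rate, and p95 latency."""
    count = len(spans)
    if count == 0:
        return {"requests": 0, "error_rate": 0.0, "p95_latency_ms": 0.0}
    errors = sum(1 for s in spans if s.error)
    durations = sorted(s.duration_ms for s in spans)
    # Nearest-rank p95: the smallest duration >= 95% of all observations.
    p95 = durations[max(0, math.ceil(0.95 * count) - 1)]
    return {"requests": count, "error_rate": errors / count, "p95_latency_ms": p95}

spans = [
    Span("GET /api/groceries", 42.0, False),
    Span("GET /api/groceries", 350.0, True),
    Span("GET /api/groceries", 55.0, False),
]
print(aggregate(spans))
# {'requests': 3, 'error_rate': 0.3333333333333333, 'p95_latency_ms': 350.0}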
- 31. Rise in Latency + Processing Time
• A DBO (Hibernate query) was causing an O(n log n) rise in latency and processing time
• The application dashboard indicated an issue with overall latency increasing
• A fix was deployed and improvement was observed immediately
- 34. Caching solved one problem … but caused another
• We implemented Redis for caching, and processing time went down
• However, we didn't account for token policies changing, and tokens suddenly began to expire after 30 seconds
• Alerting on error rates for this endpoint raised our awareness of the issue
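
A minimal sketch of the lesson here, assuming redis-py and a hypothetical fetch_token callable (neither is shown in the talk): tie the cache TTL to the token's actual expiry rather than a fixed value, so a policy change cannot leave stale tokens in the cache.

import redis

r = redis.Redis()

def get_token(fetch_token):
    """Return a cached auth token, refreshing it when the Redis TTL lapses.

    fetch_token is a hypothetical callable returning (token, expires_in_seconds)
    from the upstream auth provider.
    """
    cached = r.get("auth:token")
    if cached is not None:
        return cached.decode()

    token, expires_in = fetch_token()
    # Expire the cache entry a few seconds before the token itself expires,
    # so we never serve a token the provider already rejects.
    r.setex("auth:token", max(1, expires_in - 5), token)
    return token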
- 38. Metrics are not standalone; they have relationships
Context is critical when doing Contributing Factor Analysis
- 45. Custom Dashboards deliver peace of mind
We utilize a mix of Instana, Logz.io, and Grafana to manage our systems
- 48. Focus on what matters to your business … at Single Music we focus on delivering music
• Using FOSS monitoring is a great way to both learn and demonstrate the value of observability to your peers
• Understand the limitations of FOSS and be prepared to invest in either 3rd-party tooling or managing your own monitoring infrastructure
- 49. Want to learn more?
Schedule a meeting with me!
Come visit us at Instana Booth #S23
- 50. Rate & Share
Rate this session in the DockerCon App
Follow me @notsureifkevin and share #DockerCon