Debugging Microservices
Key challenges and techniques
• Staff engineer at Lohika
• More than 17 years in IT
• Primary focus on JVM-based languages, big data, and microservices
About me
• Key challenges of debugging microservices
• Observability
– Logging
– Monitoring
– Tracing
• Debugging tools for Kubernetes
– Telepresence v1 and v2
Agenda
The challenge
Monolithic application
• Single process
• Holistic view
• Simple infrastructure
• Can be deployed/debugged
locally
Microservice application
• Multiple processes
• Fragmented view
• Complex infrastructure
• Local deployment/debug
can be an issue
The challenge (most optimistic figures)
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.370.9611&rep=rep1&type=pdf
Observability
Monitoring
• Provides a high-level view of system health and performance (Grafana, Prometheus, VictoriaMetrics)
Logging
• Keeps a record of input data, processing, and results in the application (Elasticsearch, Fluent Bit, Kibana)
Tracing
• Provides insights into specific operations (OpenTracing, Jaeger)
Monitoring
Why
• A way to get a bird's-eye view of infrastructure and service health
• A way to get information about the performance of the system and its individual components
• A way to be alerted on SLA/SLO violations
What
• Infrastructure health and resource utilization
• Application and individual service health and resource utilization
• Application and individual service performance
• Application and individual service errors
Monitoring – How?
• Define naming conventions for the metrics (see the sketch after this list)
• Structure dashboards
• Build dashboards to be used with predefined techniques, e.g., layer peeling and exemplars
• Build dashboards for infrastructure and applications, i.e., follow a methodology (USE, RED)
• Build dashboards for specific services, e.g., Java and Spring
• Avoid having too many custom dashboards and too much data
• Avoid high cardinality when using tags
• Avoid false-positive alerts
• Look for predefined dashboards, e.g., for Spring
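A minimal sketch of the naming and tagging advice, using Micrometer (a common metrics facade on the JVM); the metric name, tag names, and service name below are illustrative, not a prescribed scheme:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class OrderMetrics {

    private final Counter ordersCreated;

    OrderMetrics(MeterRegistry registry) {
        // One agreed convention, e.g. <noun>_<action>_<unit>, all lower case.
        // Tags carry only low-cardinality dimensions (service name, outcome),
        // never unbounded values such as user ids or order ids.
        this.ordersCreated = Counter.builder("orders_created_total")
                .tag("service", "order-service")
                .tag("outcome", "success")
                .register(registry);
    }

    void onOrderCreated() {
        ordersCreated.increment();
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        OrderMetrics metrics = new OrderMetrics(registry);
        metrics.onOrderCreated();
        System.out.println(registry.get("orders_created_total").counter().count()); // 1.0
    }
}
```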
USE and RED
USE
• Utilization: the proportion of the resource that is used; 100% utilization means no more work can be accepted
• Saturation: the degree to which the resource has extra work that it can't service, often queued
• Errors: the count of error events
RED
• Rate: the number of requests our service is serving
• Errors: the number of failed requests
• Duration: the amount of time it takes to process a request
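As an illustration, all three RED signals can come from a single Micrometer Timer: its count over time gives the rate, an outcome tag separates errors, and the recorded latencies give the duration. A sketch, with a hypothetical request handler:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class RedMetrics {

    static final MeterRegistry registry = new SimpleMeterRegistry();

    static void handle() {
        String outcome = "success";
        Timer.Sample sample = Timer.start(registry);
        try {
            process(); // hypothetical request handler
        } catch (RuntimeException e) {
            outcome = "error"; // Errors: failed requests get their own tag value
            throw e;
        } finally {
            // Rate comes from the timer's count over time; Duration from the
            // recorded latencies (and their percentiles/histograms).
            sample.stop(Timer.builder("http_server_requests")
                    .tag("outcome", outcome)
                    .register(registry));
        }
    }

    static void process() { /* ... */ }

    public static void main(String[] args) {
        handle();
        Timer timer = registry.get("http_server_requests").tags("outcome", "success").timer();
        System.out.println(timer.count()); // 1
    }
}
```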
Logging
Why
• Monitoring and troubleshooting the application (for engineers)
• Helping operations
• Security and compliance
• A way to be alerted on SLA/SLO violations
What
• Application events:
– Availability events (startup/shutdown)
– Resources (connectivity issues)
– Threats
– Errors
– Processing events
• Security/audit and compliance events (highly dependent on requirements):
– Login/logout
– Attempts to access unauthorized data
– User actions
Logging – How?
• Centralize logging
• Align on the log format and levels
• Use structured logs
• Make it possible to correlate a request across services (see the sketch below)
• Log messages, like code, will be read by other engineers: write them with those readers in mind
• Do not trust clocks
• Do not log sensitive information
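A minimal sketch of structured, correlated logging with SLF4J's MDC; it assumes a log encoder (e.g., a JSON layout) that emits MDC entries as fields, and the field names are illustrative:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class PaymentService {

    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    void processPayment(String traceId, String spanId, String paymentId) {
        // Put the trace context into the MDC so every log line written while
        // handling this request carries the same correlation ids.
        MDC.put("trace_id", traceId);
        MDC.put("span_id", spanId);
        try {
            // Structured message: stable text plus key=value arguments,
            // and no sensitive data (card numbers, tokens) in the payload.
            log.info("payment processing started paymentId={}", paymentId);
        } finally {
            MDC.clear(); // do not leak context to the next request on this thread
        }
    }
}
```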
Tracing with Jaeger
Why
• A way to get details about an individual request/event
• A way to get insights into performance
• A way to see cross-service dependencies
• Statistics on time spent
• Compare traces
• Share traces
What
• Timings and logs for:
– Database calls
– Calls to other services
– Message queues
– Heavy processing
Tracing – How?
• Pick either OpenTracing or OpenTelemetry:
– OpenTelemetry is a merger of OpenTracing and OpenCensus
– OpenTelemetry is newer and provides a metrics API as well
• Key concepts (see the sketch below):
– Span: a named, timed operation representing a piece of the workflow
– A span contains: operation name, start and finish timestamps, tags, logs, and context
– A span may contain other spans
– Tracer: the Tracer interface creates Spans and understands how to Inject (serialize) and Extract (deserialize) their metadata across process boundaries
– A new trace is started whenever a new Span is created without a reference to a parent Span
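A minimal sketch of these concepts with the OpenTracing API and the Jaeger Java client; the service and operation names are illustrative:

```java
import io.jaegertracing.Configuration;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.tag.Tags;

public class TracingExample {

    public static void main(String[] args) {
        // Configuration.fromEnv reads the JAEGER_* environment variables.
        Tracer tracer = Configuration.fromEnv("order-service").getTracer();

        // A span created without a parent reference starts a new trace.
        Span parent = tracer.buildSpan("process-order").start();
        Tags.COMPONENT.set(parent, "order-service");

        // A child span nests inside the parent via asChildOf.
        Span child = tracer.buildSpan("load-customer").asChildOf(parent).start();
        child.log("customer loaded from database"); // timestamped span log
        child.finish();

        parent.finish();
    }
}
```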
Tracing – How?
• Add OpenTracing support to your application, e.g., opentracing-spring-jaeger-cloud-starter
• Add additional libraries where needed, e.g., for gRPC (opentracing-grpc)
• If the application is written in several languages, align on span tags and names, and implement decorators (see the sketch below)
• Ensure the trace id and span id are used as correlation ids in logs
• If you have a service mesh, inter-service communication tracing comes for free and can be integrated with Jaeger; you may also look at tools like Kiali
• If your application uses Zipkin, it can still be switched to Jaeger easily
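One way to keep names and tags aligned across polyglot services is a tiny shared decorator per language. A hypothetical sketch (the helper and the agreed tag vocabulary are assumptions, not a standard API):

```java
import io.opentracing.Span;
import io.opentracing.Tracer;

// Hypothetical helper: each codebase (Java, Go, Node, ...) ships an equivalent,
// so span names and tag keys stay identical across services.
public final class SpanDecorator {

    private SpanDecorator() {
    }

    public static Span startServerSpan(Tracer tracer, String operation, String peerService) {
        Span span = tracer.buildSpan(operation).start();
        span.setTag("span.kind", "server");       // agreed tag vocabulary
        span.setTag("peer.service", peerService); // who is calling us
        return span;
    }
}
```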
Tracing – How?
• Install and configure Jaeger:
– Client: libraries that implement the OpenTracing API and send data onward (see the sketch below)
– Agent: a network daemon that listens on UDP and forwards data to the collector
– Collector: receives spans and writes them to storage
– Storage: holds the spans (Cassandra, Elasticsearch, Kafka)
– Query: provides an API to read trace data from storage
– Ingester: reads data from Kafka and stores it in the storage
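From the application's side, this chain starts with the client pointing at an agent. A minimal sketch with the Jaeger Java client; the agent host name and the log-spans choice are illustrative:

```java
import io.jaegertracing.Configuration;
import io.jaegertracing.Configuration.ReporterConfiguration;
import io.jaegertracing.Configuration.SenderConfiguration;
import io.opentracing.Tracer;

public class JaegerClientSetup {

    public static void main(String[] args) {
        // Point the client at the agent's UDP endpoint (default port 6831).
        SenderConfiguration sender = new SenderConfiguration()
                .withAgentHost("jaeger-agent")
                .withAgentPort(6831);

        ReporterConfiguration reporter = ReporterConfiguration.fromEnv()
                .withSender(sender)
                .withLogSpans(true); // also log spans locally, handy while debugging

        Tracer tracer = new Configuration("order-service")
                .withReporter(reporter)
                .getTracer();
        System.out.println("tracer ready: " + tracer);
    }
}
```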
Tracing – How?
• Configure sampling (see the sketch below):
– Constant
– Probabilistic
– Rate limiting
– Remote
• Configure autoscaling for the collectors
• Provide enough resources to the storage
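With the Jaeger Java client the sampler can be chosen in code or via the JAEGER_SAMPLER_TYPE / JAEGER_SAMPLER_PARAM environment variables; the 1% probability below is only an example value:

```java
import io.jaegertracing.Configuration;
import io.jaegertracing.Configuration.SamplerConfiguration;
import io.opentracing.Tracer;

public class SamplingSetup {

    public static void main(String[] args) {
        // "const", "probabilistic", "ratelimiting" and "remote" map to the
        // sampler types listed above; param 0.01 keeps roughly 1% of traces.
        SamplerConfiguration sampler = SamplerConfiguration.fromEnv()
                .withType("probabilistic")
                .withParam(0.01);

        Tracer tracer = new Configuration("order-service")
                .withSampler(sampler)
                .getTracer();
        System.out.println("tracer ready: " + tracer);
    }
}
```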
Recap
• So:
– The complex infrastructure is monitored, and there is visibility into it
– Jaeger attempts to provide a holistic view
– Centralized logging and OpenTracing make it possible to trace a request through multiple processes
• How to troubleshoot, then:
– Identify what version is deployed
– Punish people who use latest instead of a specific deployment version
– Use metrics to check service and infrastructure health and resource consumption
– Find the error(s) in the logs and, filtering by trace id, find the root operation
– Find the corresponding operations in Jaeger, analyze the calls, and compare them with the logs
– Build a hypothesis, then test or debug it
How can we debug services in Kubernetes?
• Port forwarding and remote debugging
• Tools like Telepresence and Squash
• Use cases:
– Issues reproduced only on the cluster
– Services accessible only from the cluster
– No ability to run the service(s) locally
– Cloud-native technologies
What does Telepresence try to solve?
How does it solve it?
• Telepresence v1:
– Exports env vars and swaps the deployment's container for a proxy
– Forwards the ports that the service exposes
– Routes all traffic through the proxy
• To achieve that, run:
– telepresence --swap-deployment {serviceName} --namespace {namespaceName} --env-json ~/telepresence-legacy.json
• In other words:
– The service runs locally but has access to all the resources in the cluster; no debugging information is passed over the network, and no time is spent on container build/upload/deploy
Telepresence v1
• Telepresence v1 is a cool and reliable tool that does not require any cluster configuration
• Telepresence v1 is great but has significant limitations:
– Only one service at a time can be debugged
– The service is fully replaced, so all traffic goes to your machine
• Thus, Telepresence v2 was implemented
Telepresence v2
• Access all resources in the cluster as if your machine were deployed there
– telepresence connect
• Debug multiple services at a time
– Execute multiple intercept commands and point them to different local ports
• Intercept specific ports
– telepresence list
– kubectl get service example-service --output yaml
– telepresence intercept example-service --port 8080:http --env-file ~/example-service-intercept.env
• Intercept specific requests
– telepresence intercept example-service --port 8080:http --env-file ~/example-service-intercept.env --preview-url=true
• Share dev environments
Telepresence v2
• Requires cluster-level configuration
• Is not as stable as v1
Telepresence v2 cons
• Brew updates you to the latest version by default, which may require cluster reconfiguration
• It cannot intercept more than one port on a service
• It does not substitute the pod, so if you consume messages your breakpoint may not be hit
• It does not work with certain service meshes
So, what should I use?
• Use both 🙂
• V1 suits cases where:
– there is more than one port to intercept
– you need to consume messages from queues or Kafka
– it is OK to swap the deployment
• V2 suits cases where you need to:
– connect to cluster resources without extra port forwards
– intercept a specific port
– intercept specific requests
So, how would I do that?
• Install v2:
– To install a specific version (2.3.5), use:
– sudo curl -fL https://app.getambassador.io/download/tel2/darwin/amd64/2.3.5/telepresence -o /usr/local/bin/telepresence
– sudo chmod a+x /usr/local/bin/telepresence
• Install v1:
– brew install --cask macfuse
– brew install datawire/blackbird/telepresence-legacy
– ln -s /usr/local/Cellar/telepresence-legacy/0.109/bin/telepresence /usr/local/bin/tel
Thank You!
Tracing – Trace timeline
Tracing – Dependency graph