Observability
Shivagami Gugan
Technology Transformation Leader
SRE * DevOps * Practitioner
Performance Impacts the Business
1. Walmart found that for every 1
second improvement in page
load time, conversions
increased by 2%
2. Mobify found that each 100ms
improvement in their
homepage's load time resulted
in a 1.11% increase in
conversion
“SLOW is the new DOWN”
Performance in Complex Architectures
● Systems have become inherently very complex
● There is a whitespace in the area of “Integrated Visibility”
Distributedness
Practitioner’s view of Observability
If you miss the state changes, you will not know which workload is being
serviced by which resource.
With transience, the entity changes with every spin-up of resources and
with every state change.
Remember, aggregation is the biggest enemy: it “kills” variety, making the
information totally useless.
Technology, frameworks, and analytics that cater to high cardinality, fast
retrieval, and meaningful deciphering in near real time are key.
Complexity
Metadata variance due to high cardinality
Distributedness & transaction depth
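To make the point about aggregation concrete, here is a minimal sketch. The pod names and latency values are invented for illustration and are not from the talk. It compares a fleet-wide average with the same data grouped by the pod that served each request.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request latencies in milliseconds, tagged with the serving pod.
samples = [
    ("pod-a", 40), ("pod-a", 42), ("pod-a", 38),
    ("pod-b", 41), ("pod-b", 39), ("pod-b", 43),
    ("pod-c", 400), ("pod-c", 420), ("pod-c", 380),  # one unhealthy pod
]


def overall_mean(rows):
    """Aggregated view: a single number for the whole fleet."""
    return mean(latency for _, latency in rows)


def mean_by_pod(rows):
    """High-cardinality view: keep the pod dimension."""
    grouped = defaultdict(list)
    for pod, latency in rows:
        grouped[pod].append(latency)
    return {pod: mean(values) for pod, values in grouped.items()}


print(overall_mean(samples))  # one blended number, matching no real pod
print(mean_by_pod(samples))   # the slow pod stands out immediately
```

The blended average sits far from every pod's actual behavior, while the per-pod view keeps the variety needed to see which resource is serving the slow workload.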
Logs, Events, Metrics and Tracing
Digital Business
• Business Metrics View
– Checkout Abandonment
– Customer Churn
– Revenue per Location
Demand & Workload
• RED Metrics View
– Request throughput
– Errors
– Duration (Latency, Response
time)
Resources
• USE Metrics View
– Utilization
– Saturation
– Errors
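As a rough illustration of the USE view, the sketch below computes the three values for one resource. It assumes utilization is the fraction of time the resource was busy and saturation is the average backlog. The sample layout and the `use_metrics` name are hypothetical.

```python
# Hypothetical one-second samples for a single worker:
# (busy_seconds, queued_jobs, error_events)
samples = [(0.9, 0, 0), (1.0, 2, 0), (1.0, 5, 1), (0.7, 0, 0)]


def use_metrics(rows, interval_seconds=1.0):
    """Summarize Utilization, Saturation, and Errors over the sampled window."""
    total_time = interval_seconds * len(rows)
    return {
        # Fraction of the window the resource spent servicing work.
        "utilization": sum(busy for busy, _, _ in rows) / total_time,
        # Average number of jobs waiting because the resource was busy.
        "saturation": sum(queued for _, queued, _ in rows) / len(rows),
        # Count of error events in the window.
        "errors": sum(errors for _, _, errors in rows),
    }


print(use_metrics(samples))
```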
Context
• Distributed Tracing
– Dependency on downstream
– Service Maps
– End-to-End Transaction (hotspots,
logic flaws)
Google’s Golden Signals
• Latency
• Traffic
• Errors
• Saturation
As applications become more distributed and ephemeral, with multiple
dependencies:
BUILD BETTER INSIGHTS INTO YOUR SYSTEM
Law of Requisite Variety
“If a system is to be stable, the number of states of its control
mechanism must be greater than or equal to the number of states in
the system being controlled”
- W. Ross Ashby
What are the Varieties?
Version changes: deployed upgrades of service versions
Topological changes: new components that appear and disappear in the system
landscape and affect dependencies between existing running components.
Component property changes: changing labels and tags of components
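One way to picture the three varieties is to diff two snapshots of the system landscape. This is only a sketch: the snapshot layout, component names, and `diff_snapshots` function are assumptions made for illustration.

```python
# Hypothetical landscape snapshots: component name -> version and labels.
before = {
    "cart": {"version": "1.4.0", "labels": {"tier": "backend"}},
    "checkout": {"version": "2.0.1", "labels": {"tier": "backend"}},
    "search": {"version": "3.2.0", "labels": {"tier": "backend"}},
}
after = {
    "cart": {"version": "1.5.0", "labels": {"tier": "backend"}},    # upgraded
    "checkout": {"version": "2.0.1", "labels": {"tier": "edge"}},    # relabeled
    "recommendations": {"version": "0.1.0", "labels": {"tier": "backend"}},  # new
}


def diff_snapshots(old, new):
    """Classify differences as topology, version, or property changes."""
    changes = []
    for name in sorted(new.keys() - old.keys()):
        changes.append(("topology", name, "appeared"))
    for name in sorted(old.keys() - new.keys()):
        changes.append(("topology", name, "disappeared"))
    for name in sorted(old.keys() & new.keys()):
        if old[name]["version"] != new[name]["version"]:
            detail = f'{old[name]["version"]} -> {new[name]["version"]}'
            changes.append(("version", name, detail))
        if old[name]["labels"] != new[name]["labels"]:
            changes.append(("property", name, "labels changed"))
    return changes


for change in diff_snapshots(before, after):
    print(change)
```

An observer that records only one of these change types has fewer states than the system it watches, which is the gap the law warns about.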
Instrumenting with Agents vs. Instrumenting with Libraries
• Instrumenting with Agents
• Outside-in Approach
• External agent loads with your
application to introspect your
code at run time
• Decides what calls to measure
and what metadata to extract
based on specifications from
external configs
• More complete but often loses
context
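The outside-in idea can be sketched in a few lines. This is a toy stand-in, not how any particular agent product works. The `app` object, `AGENT_CONFIG`, and `instrument` are hypothetical names. The application code is left untouched and an external config decides which calls get wrapped at run time.

```python
import functools
import time
import types

# Stand-in for application code that knows nothing about instrumentation.
app = types.SimpleNamespace()
app.load_cart = lambda user_id: {"user": user_id, "items": 3}
app.health = lambda: "ok"

# Stand-in for the agent's external config: which calls to measure.
AGENT_CONFIG = {"measure": ["load_cart"]}

measurements = []  # (call name, elapsed seconds)


def instrument(target, config, sink):
    """Wrap the configured attributes of `target` with timing code."""
    for name in config["measure"]:
        original = getattr(target, name)

        @functools.wraps(original)
        def wrapper(*args, _original=original, _name=name, **kwargs):
            start = time.perf_counter()
            try:
                return _original(*args, **kwargs)
            finally:
                sink.append((_name, time.perf_counter() - start))

        setattr(target, name, wrapper)


instrument(app, AGENT_CONFIG, measurements)
app.load_cart("u-1")  # measured, because the config lists it
app.health()          # not measured
print(measurements)
```

The wrapper sees the call and its timing but not why the call was made, which is one reading of "more complete but often loses context".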
● Instrumenting with Libraries
● Inside-out Approach
● Developer includes a trace library and configures
spans that allow the code to participate in
distributed tracing
● When the app runs, trace spans are generated
asynchronously and dispatched to a persistence
store, preferably hooked to a backend analytics
engine
● Highly context-driven, and hence leaves
breadcrumbs along the code path
“These are not cannibalistic approaches, they can well play in
concert”
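For the inside-out approach, the following sketch shows the general shape of library instrumentation using only the standard library. It is not the API of any real tracing library. The `span` helper, the `finished_spans` list standing in for the persistence store, and the `checkout` function are all hypothetical. Spans are appended synchronously here for simplicity, whereas the slide describes asynchronous dispatch.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

# Tracks the span that is currently active in this execution context.
_current_span = contextvars.ContextVar("current_span", default=None)
finished_spans = []  # stand-in for the persistence store


@contextmanager
def span(name, **attributes):
    """Record a timed span and link it to the enclosing span, if any."""
    parent = _current_span.get()
    record = {
        "span_id": uuid.uuid4().hex[:8],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "attributes": attributes,
        "start": time.perf_counter(),
    }
    token = _current_span.set(record)
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - record["start"]
        _current_span.reset(token)
        finished_spans.append(record)


def checkout(order_id):
    # The developer decides where spans go and what context they carry.
    with span("checkout", order_id=order_id):
        with span("charge_card", provider="example"):
            time.sleep(0.01)


checkout("o-42")
for s in finished_spans:
    print(s["name"], s["parent_id"], s["attributes"])
```

Because the developer attaches attributes such as the order ID at the point of use, each span carries the business context that an external agent would have to guess.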
Inflection point - Observability-driven development?
• Evolving Technology, Evolving user friendly analytics – The Observer technology should be
more competent than the Observed technology
• Dev and Ops? The Dev way: there are so many fundamental changes that Ops can’t keep
pace
• What is Staging like Prod? Does it exist?
• Developers need to own the code, with the ability to deploy it and debug/test in Prod
• Good practice - Merge will happen only when proper Observability hooks are baked in the
code
• Never accept a PR until you understand its instrumentation
• Distributed tracing and building breadcrumbs are fundamental for building reliable systems
• Observability Driven Design makes DevOps and SRE principles fuller
• Give Developers the privilege to “You Build, You Run, You Monitor”


Editor's Notes

  1. Strong website performance has a direct impact on conversion rates; a good customer experience means a load time of 2 seconds or less.
     Pages that loaded in 2.4 seconds had a 1.9% conversion rate.
     At 3.3 seconds, the conversion rate was 1.5%.
     At 4.2 seconds, the conversion rate was less than 1%.
     At 5.7+ seconds, the conversion rate was 0.6%.
  2. An application is made up of multiple services and service instances that are running on multiple machines. Requests often span multiple service instances. A single workload is carried out by 1000 nodes
  3. Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. It is about transforming a black box into a white box. Observability means different things to different people: developers, customers, analysts, bloggers.
     BETTER INSIGHT INTO THE APPLICATION
     Logging: An immutable, timestamped record of events that helps explain what changed in the system/application behavior when things went wrong. For example, using Grafana Loki to log certain events.
     Metrics: A value pertaining to your system/application at a point in time; a number that represents data measured over time. For example, using Grafana to understand resource utilization, or app performance metrics like throughput and response time.
     Tracing: A representation of a single user’s journey through an application transaction. For example, using Jaeger to understand the call flows between services or how much time it takes a user to finish a transaction.
     USE (Brendan Gregg)
     A summary of USE is “For every resource, check utilization, saturation, and errors.” Brendan defines the terminology:
     Utilization: the average time the resource was busy servicing work
     Saturation: the degree to which the resource has extra work which it can’t service, often queued (backlog)
     Errors: the count of error events
     This disambiguates utilization and saturation, making it clear utilization is “busy time %” and saturation is “backlog.”
     RED
     The RED method, on the other hand, is about the workload itself, and treats the service as a black box. It’s an externally visible view of the behavior of the workload as serviced by the resources.
     R (Rate) = Request throughput, in requests per second
     E (Errors) = Request error rate, as either a throughput metric or a fraction of overall throughput
     D (Duration) = Latency, residence time, or response time; all three are widely used
  4. Observability is about understanding the internal state of the system by looking at external outputs: moving from black-box to white-box monitoring. The tool or technology you use should be able to represent the variety of states of the observed system.
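
The RED definitions in note 3 can be sketched as a small calculation over a window of requests. The request data, the 10-second window, and the choice to count only 5xx responses as errors are assumptions made for this example.

```python
from statistics import median

# Hypothetical request log for a 10-second window: (status_code, duration_ms).
WINDOW_SECONDS = 10
requests = [
    (200, 35), (200, 40), (500, 120), (200, 38), (200, 45),
    (200, 36), (404, 20), (200, 42), (503, 300), (200, 39),
]


def red_metrics(rows, window_seconds):
    """Compute Rate, Errors, and Duration for one window of requests."""
    durations = [duration for _, duration in rows]
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for status, _ in rows if status >= 500)
    return {
        "rate_rps": len(rows) / window_seconds,
        "error_ratio": errors / len(rows),
        "duration_median_ms": median(durations),
        "duration_max_ms": max(durations),
    }


print(red_metrics(requests, WINDOW_SECONDS))
```

Reporting the maximum alongside the median keeps the slow outliers visible instead of averaging them away.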