The document discusses the importance of observability in complex distributed systems. It makes three key points:
1. Performance impacts business outcomes, so it is important to monitor performance metrics such as page load time, because improvements there can increase conversions.
2. As systems become more distributed and complex, there is a need for "integrated visibility" to understand how workloads are being handled by resources across the system.
3. Instrumenting code with libraries allows for gathering more context about transactions and dependencies compared to using external agents, but both approaches can be used together to gain better insights.
2. Performance Impacts the Business
1. Walmart found that for every 1-second improvement in page load time, conversions increased by 2%
2. Mobify found that each 100 ms improvement in their homepage's load time resulted in a 1.11% increase in conversion
“SLOW is the new DOWN”
3. Performance in Complex Architectures
● Systems have become inherently very complex
● There is a whitespace in the area of “Integrated Visibility”
Distributedness
4. Practitioner’s view of Observability
● If you miss the state changes, you will not know which workload is being serviced by which resource.
● With transience, entities change with every spin-up of resources and with every state change.
● Remember, aggregation is the biggest enemy: it will “kill” variety, making the information totally useless.
● Technology, frameworks, and analytics that cater to high cardinality, fast retrieval, and meaningful deciphering in near real time are key.
Complexity
Metadata Variance due to high cardinality
Distributedness & Transaction depth
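The point about aggregation can be illustrated with a small sketch. The instance names and latency values below are made up for illustration; the only claim is that a single aggregate hides the one dimension (the instance) that explains the problem.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical latency samples in milliseconds, tagged by service instance.
samples = [
    ("checkout-1", 40), ("checkout-1", 45), ("checkout-1", 50),
    ("checkout-2", 42), ("checkout-2", 48),
    ("checkout-3", 900), ("checkout-3", 950),  # one unhealthy instance
]


def overall_mean(rows):
    """Aggregate everything into one number; the instance tag is lost."""
    return mean(value for _, value in rows)


def mean_by_instance(rows):
    """Keep the instance dimension so the outlier stays visible."""
    grouped = defaultdict(list)
    for instance, value in rows:
        grouped[instance].append(value)
    return {instance: mean(values) for instance, values in grouped.items()}


print(overall_mean(samples))      # a single number, somewhere near 300 ms
print(mean_by_instance(samples))  # shows that only checkout-3 is slow
```

The overall mean suggests every request is moderately slow, while the per-instance view shows two healthy instances and one failing one.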
5. Logs, Events, Metrics and Tracing
Digital Business
• Business Metrics View
– Checkout Abandonment
– Customer Churn
– Revenue per Location
Demand & Workload
• RED Metrics View
– Request throughput
– Errors
– Duration (Latency, Response time)
Resources
• USE Metrics View
– Utilization
– Saturation
– Errors
Context
• Distributed Tracing
– Dependency on downstream
– Service Maps
– End-to-End Transaction (hotspots, logic flaws)
Google’s Golden Signals: Latency, Traffic, Errors, Saturation
As applications become more distributed and ephemeral, with multiple dependencies,
BUILD BETTER INSIGHTS INTO YOUR SYSTEM
6. Law of Requisite Variety
“If a system is to be stable, the number of states of its control mechanism must be greater than or equal to the number of states in the system being controlled”
- W. Ross Ashby
What are the Varieties?
Version changes: deployed upgrades of service versions
Topological changes: new components that appear and disappear in the system landscape and affect dependencies between existing running components.
Component property changes: changing labels and tags of components
7. Instrumenting with Agents vs. Instrumenting with Libraries
• Instrumenting with Agents
• Outside-in Approach
• External agent loads with your application to introspect your code at run time
• Decides what calls to measure and what metadata to extract based on specifications from external configs
• More complete but often loses context
● Instrumenting with Libraries
● Inside-out Approach
● Developer includes a trace library and configures spans that allow the code to participate in distributed tracing
● When the app runs, trace spans are generated asynchronously and dispatched to a persistence store, preferably hooked to a backend analytics engine
● Highly context-driven, and hence leaves breadcrumbs along the code path
“These are not cannibalistic approaches, they can well play in concert”
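The inside-out approach can be sketched roughly as follows. This is not any real tracing library: `span`, `finished_spans`, and `checkout` are hypothetical names, the store is an in-memory list, and spans are recorded synchronously rather than dispatched asynchronously as the slide describes. It only shows how spans configured in the code carry parent/child context.

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory stand-in for a persistence store / analytics backend.
finished_spans = []
# Stack of currently open spans (single-threaded sketch only).
_active = []


@contextmanager
def span(name, **tags):
    """Open a span, link it to the enclosing span, and record it on exit."""
    parent = _active[-1] if _active else None
    record = {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "tags": tags,
    }
    _active.append(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - start
        _active.pop()
        finished_spans.append(record)


def checkout(order_id):
    """Hypothetical business function instrumented from the inside."""
    with span("checkout", order_id=order_id):
        with span("reserve_inventory"):
            pass  # call to a downstream service would go here
        with span("charge_card", provider="example"):
            pass  # call to a payment provider would go here


checkout("A-1001")
for s in finished_spans:
    print(s["name"], s["parent_id"], round(s["duration_s"], 6))
```

Because the developer chooses the span names and tags, each record carries business context (here, the order ID) that an external agent would have to guess from configuration.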
8. Inflection point - Observability-driven development?
• Evolving technology, evolving user-friendly analytics – the Observer technology should be more competent than the Observed technology
• Dev and Ops? The Dev way, because there are so many fundamental changes that Ops can’t keep pace
• What is Staging like Prod? Does it exist?
• Developers need to own the code, with the ability to deploy it and debug/test in Prod
• Good practice - a merge happens only when proper Observability hooks are baked into the code
• Never accept a PR until you understand its instrumentation
• Distributed tracing and building breadcrumbs are fundamental for building reliable systems
• Observability Driven Design makes DevOps and SRE principles fuller
• Give Developers the privilege to “You Build, You Run, You Monitor”
Editor's Notes
Strong website performance has a direct impact on conversion rates; for a good customer experience, pages should load in 2 seconds or less
Pages that loaded in 2.4 seconds had a 1.9% conversion rate
At 3.3 seconds, conversion rate was 1.5%
At 4.2 seconds, conversion rate was less than 1%
At 5.7+ seconds, conversion rate was 0.6%
An application is made up of multiple services and service instances that are running on multiple machines. Requests often span multiple service instances.
A single workload can be carried out by 1,000 nodes
Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. It is about transforming a black box into a white box
Observability means different things to different people: developers, customers, analysts, bloggers
BETTER INSIGHT INTO THE APPLICATION
Logging – An immutable, timestamped record of events that helps you understand what changed in the system/application behavior when things went wrong. For example, using Grafana Loki to log certain events.
Metrics – A number that represents a value pertaining to your system/application, measured over time. For example, using Grafana to understand resource utilization, or app performance metrics like throughput and response time.
Tracing – A representation of a single user’s journey through an application transaction. For example, using Jaeger to understand the call flows between services or how much time it takes a user to finish a transaction.
Brendan Gregg - USE
A summary of USE is “For every resource, check utilization, saturation, and errors.” What do those things mean? Brendan defines the terminology:
Utilization: the average time the resource was busy servicing work
Saturation: the degree to which the resource has extra work which it can’t service, often queued (backlog)
Errors: the count of error events
This disambiguates utilization and saturation, making it clear utilization is “busy time %” and saturation is “backlog.”
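As a minimal sketch of the three USE values for one resource, assuming busy time, queue length, and error count have already been collected for a sampling interval (the `ResourceSample` fields and the numbers are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """Hypothetical raw readings for one resource over one interval."""
    busy_seconds: float      # time the resource spent servicing work
    interval_seconds: float  # length of the sampling interval
    queued_items: int        # work waiting that could not be serviced yet
    error_events: int        # error events seen in the interval


def use_metrics(sample):
    """Return utilization (busy time %), saturation (backlog), and errors."""
    if sample.interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {
        "utilization_pct": 100.0 * sample.busy_seconds / sample.interval_seconds,
        "saturation": sample.queued_items,
        "errors": sample.error_events,
    }


disk = ResourceSample(busy_seconds=45.0, interval_seconds=60.0,
                      queued_items=7, error_events=2)
print(use_metrics(disk))  # utilization 75%, backlog of 7, 2 errors
```

Keeping utilization and saturation as separate fields reflects the distinction above: a resource can be 75% busy and still have a backlog.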
RED
The RED method, on the other hand, is about the workload itself, and treats the service as a black box. It’s an externally-visible view of the behavior of the workload as serviced by the resources.
RATE
ERROR
DURATION
R = Request Throughput, in requests per second
E = Request Error Rate, as either a throughput metric or a fraction of overall throughput
D = Latency, Residence Time, or Response Time; all three are widely used
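The R, E, and D definitions can be computed from a request log along these lines. The log format, the sample values, and the 10-second window are assumptions for illustration; the error rate is expressed here as a fraction of overall throughput, and duration is summarized by mean and maximum.

```python
from statistics import mean

# Hypothetical request log for one window: (duration in seconds, succeeded?)
requests = [
    (0.120, True),
    (0.080, True),
    (0.450, False),
    (0.100, True),
    (0.250, True),
]


def red_metrics(rows, window_seconds):
    """Compute Rate, Errors, and Duration for one observation window."""
    total = len(rows)
    failed = sum(1 for _, ok in rows if not ok)
    durations = [duration for duration, _ in rows]
    return {
        "rate_rps": total / window_seconds,
        "error_fraction": failed / total if total else 0.0,
        "mean_duration_s": mean(durations) if durations else 0.0,
        "max_duration_s": max(durations) if durations else 0.0,
    }


print(red_metrics(requests, window_seconds=10))
```

Nothing here looks inside the service: all three values come from requests as seen from outside, which is what makes RED a black-box view of the workload.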
Observability is about understanding the internal state of the system by looking at external outputs
Moving from blackbox to whitebox monitoring
The tool or technology you use should be able to represent the variety of states of the observed system