The Bumpy Road to Actionable SLOs

Confidential and Proprietary Information for Instana, Inc.
The Bumpy Journey to Actionable SLOs
Curtis Hrischuk, Ph.D.
Yury Oleynik, Ph.D.
Technical Product Managers

“In theory there is no difference between theory and practice. In practice, there is.”
- Yogi Berra
Here we consider the SLO journeys of 20 companies.

Agenda:
● Promise of SLO Methodology
● The Practical Journey to SLO use
● Lessons Learned Along the Way
● Polling question: Where are you ..

Promises of SLO Methodology
● Meet customer SLAs for higher satisfaction
● Company-wide, consistent reliability reporting
● Decision making tool
● Investment becomes data-driven

1: Getting Excited About SLOs
Anti-pattern:
● Thinking it is only a technical problem
● Assuming it is a straight line to the end goal
● Greatly underestimating the effort and time
“Decrease toil of on-call colleagues”
“Address reliability in a more systematic way”
“Focus on end-user/business impact and not just alerts”
“Reliability always comes last (vs new features)”
“Show that we are doing well in terms of reliability”

2: Let’s Hack Some SLOs Quickly!
“Let’s define an SLO for each user-facing service”
“Looks like we are missing some data to measure SLIs…”
Anti-pattern:
● Not involving Product Owner / Product Manager
● Significant time invested into getting additional measurements
● A straight path to SLO without teeth...
“Where is this going … what is the outcome?”
“This is more complex than I thought...”
“But when user-facing SLOs are broken, how do we know why? Let’s add some SLOs for backend services!”
“Let’s add more instrumentation, so that we have more data”
“Let’s start with what we have”

3: Let’s Pitch SLOs To PM
“Yeah, right… I am not committing to anything here”
“We should stop all feature development if the SLO for GET /cat-frontend is violated!”
“What is this SLO? Which service? Why is this relevant?”
Anti-pattern:
● Technical focus of the SLO methodology is difficult for the business side
● There are two different languages here - no alignment in terms of value
● SRE’s choices are: use up political capital, find an executive sponsor, or win the PM’s interest
● Many SLO initiatives die here or the SLI becomes yet another metric to alert on
“No. No. No. No. How is it going to help us make our business more successful?”
“Fine, we will just use it in the engineering department then!”
“BTW, since you are here, I would like a new business KPI on a dashboard.”

How To Interest PO/PM/BS Stakeholders?
● Set more accurate expectations for features by taking into account the reliability work that needs to happen
● Easier to discuss, compromise, and prioritise features against technical debt
● Show that SLIs are also related to important business KPIs
○ Which countries are burning error budget?
● Combine implementing SLIs with other relevant business KPIs ;)
● SLIs and SLO fulfillment can inform business decisions
○ e.g. whether to launch a marketing campaign now or wait for fixes that increase site reliability
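The question "Which countries are burning error budget?" can be answered with a simple grouping of request outcomes. The following is a minimal sketch, not a definitive implementation: the 99% target, the `(country, succeeded)` record shape, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

SLO_TARGET = 0.99  # assumed target: 99% of requests succeed


def budget_burn_by_country(requests, slo_target=SLO_TARGET):
    """Return {country: fraction of that country's error budget consumed}.

    `requests` is an iterable of (country, succeeded) pairs (hypothetical shape).
    A value above 1.0 means the country has overspent its error budget.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    totals = defaultdict(int)
    failures = defaultdict(int)
    for country, succeeded in requests:
        totals[country] += 1
        if not succeeded:
            failures[country] += 1

    burn = {}
    for country, total in totals.items():
        allowed_failures = total * (1.0 - slo_target)
        burn[country] = failures[country] / allowed_failures
    return burn


def worst_offenders(burn):
    """Sort countries from most to least error budget consumed."""
    return sorted(burn.items(), key=lambda item: item[1], reverse=True)
```

Feeding the sorted result into a dashboard next to a per-country conversion rate gives business stakeholders a view they can relate to their own KPIs.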

How To Get Stakeholder Interest?
Pattern:
● Show that SLIs are related to important business KPIs
● Combine implementing SLIs with other relevant business KPIs ;)
● Show that SLO fulfillment can inform business decisions
● Possibility of discussion around compromise, trade-off, and priorities is evident
“Dang. Those things are eating into our conversion rate.”
“Do you know which countries are burning error budget?”
“Are you sure we should do the big marketing campaign now or after the fixes for reliability are in?”

4: Let’s Restart From the Beginning
“Can you describe the most critical user-journeys for you?”
“but you have to promise me to also add these business KPIs”
Pattern:
● Recover by reusing already modeled user-journeys, e.g. for testing purposes or synthetics
● SREs need to listen carefully and patiently because this will take time
● Discussion will evolve so budget for the time and be flexible
“Maybe this could be helpful .. I still need my KPIs. This could increase our conversion rate $$$$”
“We need a common abstraction level to talk about. They always talk about ‘user-journey’ so ...”
“OK, so SLIs are like business KPIs for reliability”
“… What if user journeys are unavailable? …aren’t performant?”

4: How Some Did it Right
1. Review customer-facing user journeys together with the Product Manager
2. Rank them by business criticality
3. Correlate SLIs with existing business KPIs (e.g. conversion rates for purchase) to understand the signal quality
4. Correlate past SLI violations with an increase in support or incident tickets to validate the signal
5. Address top 3 user journeys
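Step 3 above can be sketched with a plain Pearson correlation between a daily SLI series and a daily business KPI series. This is an illustrative sketch only: the choice of Pearson correlation and the function name are assumptions, since the slides do not say how the companies measured the correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)
```

A coefficient near +1 between, say, a checkout success-rate SLI and the purchase conversion rate suggests the SLI is a useful signal; a value near 0 suggests the SLI may need to be redefined before an SLO is built on it.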

5. Map User-Journeys to an Observability Model
“Prometheus isn’t enough. We’ll need APM, synthetics, logs, ...”
“Mr. Technical Leader let’s figure out how to map and measure the user-journeys, KPIs, and SLOs using our existing observability model / tools.”
“This is a lot of work … we’re going to have blind spots and have missing data”
“Isn't there a tool that does this already?”

Anti-Pattern:
● Model getting too complex with too much effort to implement and validate
● Data is getting too granular
● Not leveraging Synthetics or EUM / Mobile because of a fear of root-causing noisy violations

6. Implement SLIs & KPIs, Instrument, Validate, Create Dashboards
“Our current availability is 82%. Only 75% of requests are meeting the latency SLO.”
“I need to explain this carefully or it will be a political disaster.”
“I’ll get ahead of this by building a plan with dedicated resources to meet our SLOs”
“Do you know if they have been higher or lower in the past year?”
“This is a lot of work”
“How to troubleshoot this … user journey, service dependencies, call graphs, infrastructure?”

Step 6 Activities
● Gathering necessary data by adding instrumentation
● Implementing SLIs and business KPIs requested by PM
● Validating that SLIs correlate with business KPIs and/or major incidents / problem tickets
● SRE builds dashboards and validates the results
● Implementing/extending business dashboards with KPIs
● Making sure you can root-cause service level violations (downstream and infra)!
● First insights into actual service level numbers are available
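The "first insights" mentioned above usually come from two ratios: an availability SLI and a latency SLI. Here is a minimal sketch of both, assuming a hypothetical `Request` record with a status code and a latency, and assuming that only 5xx responses count as failures; real definitions should be agreed with the product manager.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical record shape)
    latency_ms: float  # response time in milliseconds


def availability_sli(requests):
    """Share of requests that did not fail with a server error (5xx)."""
    if not requests:
        raise ValueError("no requests to evaluate")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms):
    """Share of requests answered within `threshold_ms`."""
    if not requests:
        raise ValueError("no requests to evaluate")
    good = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return good / len(requests)
```

Both functions return a fraction between 0 and 1, so a statement like "only 75% of requests are meeting the latency SLO" corresponds to `latency_sli(...)` returning 0.75.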

7: Define SLOs, Share First Results
“If I ask for 110% success rate then maybe we’ll get to 99.9%.”
“What do you mean we don’t want 100% success rate?”
“I hope this SLO isn’t too low or too high but just right”
Anti-Pattern:
● SLIs need to be redefined since they do not capture the expected user impact
● Product manager wants to be aggressive in SLO values to make sure the KPIs are met
● Historical data doesn’t exist to provide context and avoid unreasonable SLOs
● Premature org-wide promotion of SLOs before value is proven or context is available
● Uncertain how to explain the first results to business stakeholders without causing panic

7: Define SLOs, Share First Results
“Historically availability was 82% and latency was 75% successful. So let’s start there.”
Pattern:
● Use past data to derive thresholds for SLOs
● Engage PMs with the historical data to create sustainable SLOs
● Discuss if activities are needed to increase the reliability
● Then this is the time to go forward and demonstrate the first success to build support for the project
“We’re leaving money $$$$ on the table so let’s do some reliability work ...”
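One way to "use past data to derive thresholds" is to pick a value the service already met in most past periods. The sketch below uses a nearest-rank percentile over historical SLI values; the percentile method, the default of 10, and the function name are assumptions for illustration, not a method described in the slides.

```python
import math


def initial_slo(history, percentile=10):
    """Suggest a starting SLO from past SLI values (e.g. one value per week).

    Uses the nearest-rank percentile: with percentile=10, roughly 90% of
    the past periods achieved at least the returned value.
    """
    if not history:
        raise ValueError("need at least one historical SLI value")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(history)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]
```

A starting point derived this way is sustainable by construction; whether it is good enough for the business is the discussion to have with the PM afterwards.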

8. Define Error Budget Policy
Disclaimer: Only a few organizations got here so far

Pattern:
● The product team commits to defending the SLOs
● Business stakeholders, product team, and SREs agree to a formal action policy
● Need regular check-ins on SLO fulfillment inside the product team and with business stakeholders
“Let’s add some teeth to turn this into a decision making tool … We’ll need buy in from ...”
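A formal action policy can be as small as a lookup from remaining error budget to an agreed action. The following sketch is purely illustrative: the thresholds and the action texts are hypothetical and would have to be negotiated between business stakeholders, the product team, and SREs.

```python
def remaining_budget(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = total_events * (1 - slo_target)
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


# Hypothetical policy: (minimum remaining budget, action), checked top to bottom.
POLICY = [
    (0.5, "ship features as planned"),
    (0.0, "prioritise reliability work in the next sprint"),
]
EXHAUSTED_ACTION = "freeze feature releases until the budget recovers"


def policy_action(remaining):
    """Return the agreed action for the given remaining budget fraction."""
    for floor, action in POLICY:
        if remaining > floor:
            return action
    return EXHAUSTED_ACTION
```

Writing the policy down as data makes it easy to review in the regular check-ins and keeps the agreed actions out of ad hoc debate when an SLO is actually violated.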

9. Set Up and Tune Alerting
“You need to be alerted when the error budget burn rate increases greatly”
“I’m glad I got that Ph.D. for ‘Error Budget Burn Rate Dynamic Adjustment Engineering’”
“But won’t that be noisy? Will it give us enough mitigation time?”
Pattern:
● Teams will be asked to defend SLOs so they need Alerting
● Start with simple threshold alerting on SLI before it reaches SLO
● In order to be able to defend SLO you need to alert on trends (e.g., error budget burn rate changes)
● Use a tool that can help you adjust the error budget burn rate alert so you don’t need a Ph.D.
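Burn rate is commonly defined as the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below shows one widely used way to reduce noise: alert only when both a long and a short window exceed the same threshold. The window contents, the threshold value, and the function names are assumptions to be tuned per service.

```python
def burn_rate(bad, total, slo_target):
    """Observed error rate divided by the error rate the SLO allows."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (bad / total) / (1 - slo_target)


def should_alert(long_window, short_window, slo_target, threshold):
    """Alert only if both windows burn faster than `threshold`.

    Each window is a (bad_events, total_events) tuple. The long window
    shows the burn is significant; the short window shows it is still
    happening, which avoids paging for problems that already stopped.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)
```

For example, with a 99.9% SLO and a threshold of 14.4 (an assumed value), `should_alert((200, 10000), (30, 1000), 0.999, 14.4)` fires, while the same long window paired with a recovered short window of `(1, 1000)` does not.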

10: Congratulations, You Have Made It!
“Now we can iterate, improve, and scale.”

Expectations and Lessons Learned
Aside from the lessons learned already mentioned ...
● It will take several weeks for a single team to get started
● It takes 2-3 quarters for a larger organization
● Plan to spend 2-4 weeks for education of the first team
● If there is no action upon SLO violation, the whole thing is wasted
● You need buy-in / support from the executive / VP level
● You need enough historical data to derive, set up, and validate SLIs and SLOs

What You Need To Make Your SLO Journey Smooth
Pattern:
● Low cost discovery, setup, and care of user-journeys
● Easy dashboard creation and maintenance
○ Show services that are the biggest EB offenders
○ Show impacted user-journeys and by how much
● Alert on error budget burn rate or an endangered SLO
● Troubleshooting that drills down from user journeys
“I’m glad we have tools and processes that support the SLI / SLO methodology as a first-class citizen”
