Confidential and Proprietary Information for Instana, Inc.
The Bumpy Journey to Actionable SLOs
Curtis Hrischuk, Ph.D.
Yury Oleynik, Ph.D.
Technical Product Managers
“In theory there is no difference between theory and practice. In practice, there is.”
- Yogi Berra
Considering the SLO journey of 20 companies here
Agenda:
● Promise of SLO Methodology
● The Practical Journey to SLO use
● Lessons Learned Along the Way
● Polling question: Where are you ..
Promises of SLO Methodology
● Meet customer SLAs for higher satisfaction
● Company wide, consistent reliability reporting
● Decision making tool
● Investment becomes data-driven
1: Getting Excited About SLOs
Anti-pattern:
● Thinking it is only a technical problem
● Assuming it is a straight line to the end goal
● Greatly underestimating the effort and time
Decrease toil of on-call colleagues
Address reliability in a more systematic way
Focus on end-user/business impact and not just alerts
Reliability always comes last (vs new features)
Show that we are doing well in terms of reliability
2: Let’s Hack Some SLOs Quickly!
Let’s define an SLO for each user-facing service
Looks like we are missing some data to measure SLIs…
Anti-pattern:
● Not involving Product Owner / Product Manager
● Significant time invested into getting additional measurements
● A straight path to SLO without teeth...
Where is this going … what is the outcome?
This is more complex than I thought...
But when user-facing SLOs are broken, how do we know why? Let’s add some SLOs for backend services!
Let’s add more instrumentation, so that we have more data
Let’s start with what we have
Yeah, right… I am not committing to anything here
3: Let’s Pitch SLOs to the PM
We should stop all feature development if the SLO for GET /cat-frontend is violated!
What is this SLO? Which service? Why is this relevant?
Anti-pattern:
● Technical focus of the SLO methodology is difficult for the business side
● There are two different languages here - no alignment in terms of value
● SRE’s choices are: use up political capital, find an executive sponsor, or win the PM’s interest
● Many SLO initiatives die here or the SLI becomes yet another metric to alert on
No. No. No. No. How is it going to help us make our business more successful?
Fine, we will just use it in the engineering department then!
BTW, since you are here, I would like a new business KPI on a dashboard?
How to Interest PO/PM/Business Stakeholders?
● Set more accurate expectations for features by accounting for the reliability work that needs to happen
● Easier to discuss, compromise, and prioritise features against technical debt
● Show that SLIs are also related to important business KPIs
○ Which countries are burning error budget?
● Combine implementing SLIs with other relevant business KPIs ;)
● SLIs and SLO fulfillment can inform business decisions
○ e.g. whether to release a marketing campaign now or wait for fixes that increase site reliability
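The question "Which countries are burning error budget?" can be answered with a simple grouping. The following is a minimal sketch, assuming request records are available as (country, succeeded) pairs and that one SLO target applies to every country. The function name, the record shape, and the sample numbers are hypothetical and not taken from the talk.

```python
from collections import defaultdict


def budget_burned_by_country(requests, slo_target):
    """Return the fraction of error budget consumed per country.

    requests: iterable of (country, succeeded) pairs (hypothetical shape).
    slo_target: e.g. 0.99 means up to 1% of requests may fail.
    A value above 1.0 means the country has overspent its budget.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for country, succeeded in requests:
        totals[country] += 1
        if not succeeded:
            failures[country] += 1

    allowed_failure_ratio = 1.0 - slo_target
    burned = {}
    for country, total in totals.items():
        allowed_failures = total * allowed_failure_ratio
        burned[country] = failures[country] / allowed_failures
    return burned


# Made-up sample data: DE fails 2 of 100 requests, US fails 1 of 200.
sample = (
    [("DE", True)] * 98 + [("DE", False)] * 2
    + [("US", True)] * 199 + [("US", False)] * 1
)
result = budget_burned_by_country(sample, slo_target=0.99)
worst = max(result, key=result.get)
print(worst, result)
```

With this sample, DE has spent roughly twice its budget while US has spent about half, which is the kind of statement a business stakeholder can act on.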
How To Get Stakeholder Interest?
Pattern:
● Show that SLIs are related to important business KPIs
● Combine implementing SLIs with other relevant business KPIs ;)
● Show that SLO fulfillment can inform business decisions
● Make it evident that compromises, trade-offs, and priorities can be discussed
Dang. Those things are eating into our conversion rate.
Do you know which countries are burning error budget?
Should we do the big marketing campaign now, or after the reliability fixes are in?
4: Let’s Restart From the Beginning
Can you describe the most critical user-journeys for you?
But you have to promise me to also add these business KPIs
Pattern:
● Recover by reusing already modeled user-journeys, e.g. for testing purposes or synthetics
● SREs need to listen carefully and patiently because this will take time
● Discussion will evolve so budget for the time and be flexible
Maybe this could be helpful .. I still need my KPIs. This could increase our conversion rate $$$$
We need a common abstraction level to talk about. They always talk about “user-journey” so ...
OK, so SLIs are like business KPIs for reliability
… What if user journeys are unavailable? … aren’t performant?
4: How Some Did It Right
1. Review customer-facing user journeys together with the Product Manager
2. Rank them by business criticality
3. Correlate SLIs with existing business KPIs (e.g. conversion rates for purchase) to understand the signal quality
4. Correlate past SLI violations with an increase in support or incident tickets to validate the signal
5. Address top 3 user journeys
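Step 3 above (correlating an SLI with a business KPI) can be sketched with a plain Pearson correlation. This is only an illustration under stated assumptions: the daily series are made-up numbers, the function name is hypothetical, and a high correlation is a hint about signal quality, not proof of cause.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Made-up daily values: checkout success ratio (SLI) and conversion rate (KPI).
daily_success_ratio = [0.99, 0.97, 0.90, 0.98, 0.85]
daily_conversion_rate = [0.031, 0.030, 0.024, 0.031, 0.020]

r = pearson(daily_success_ratio, daily_conversion_rate)
print(f"correlation between SLI and KPI: {r:.2f}")
```

A coefficient close to 1 suggests the SLI tracks something the business already cares about. A coefficient near 0 suggests the SLI may need to be redefined before an SLO is set on it.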
Prometheus isn’t enough.
Mr. Technical Leader, let’s figure out how to map and measure the user-journeys, KPIs, and SLOs using our existing observability model / tools.
We’ll need APM, synthetics, logs, ...
This is a lot of work … we’re going to have blind spots and missing data
Isn’t there a tool that does this already?
5. Map User-Journeys to an Observability Model
Anti-Pattern:
● Model getting too complex with too much effort to implement and validate
● Data is getting too granular
● Not leveraging Synthetics or EUM / Mobile because of a fear of root-causing noisy violations
6. Implement SLIs & KPIs, Instrument, Validate, Create Dashboards
Our current availability is 82%. Only 75% of requests are meeting the latency SLO.
I need to explain this carefully or it will be a political disaster.
I’ll get ahead of this by building a plan with dedicated resources to meet our SLOs
Do you know if they have been higher or lower in the past year?
This is a lot of work
How to troubleshoot this … user journey, service dependencies, call graphs, infrastructure?
Step 6 Activities
● Gathering the necessary data by adding instrumentation
● Implementing SLIs and business KPIs requested by PM
● Validating that SLIs correlate with business KPIs and/or major incidents / problem tickets
● SRE is building dashboards as well as validating the results
● Implementing/extending business dashboards with KPIs
● You want to make sure you can root-cause service level violations (downstream and infra)!
● First insights into actual service level numbers are available
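The first service level numbers mentioned above (availability and the share of requests meeting a latency target) can be computed from raw request records. The following is a minimal sketch. The `Request` record, the rule "5xx means failed", and the 500 ms threshold are assumptions made for illustration, not definitions from the talk.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record shape; real data would come from APM or logs.
    status_code: int
    latency_ms: float


def availability_sli(requests):
    """Share of requests that did not fail with a server error (5xx)."""
    if not requests:
        raise ValueError("no requests in the window")
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms):
    """Share of requests answered within threshold_ms."""
    if not requests:
        raise ValueError("no requests in the window")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


# Made-up sample window of five requests.
window = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(500, 700.0),
    Request(200, 950.0),
    Request(200, 300.0),
]
availability = availability_sli(window)
latency = latency_sli(window, threshold_ms=500.0)
print(f"availability={availability:.0%} latency={latency:.0%}")
```

Both SLIs follow the same "good events divided by all events" shape, which keeps them easy to explain next to business KPIs on a dashboard.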
7: Define SLOs, Share First Results
If I ask for 110% success rate then maybe we’ll get to 99.9%.
What do you mean we don’t want 100% success rate?
I hope this SLO isn’t too low or too high but just right
Anti-Pattern:
● SLIs need to be redefined since they do not capture the expected user impact
● Product manager wants to be aggressive in SLO values to make sure the KPIs are met
● Historical data doesn’t exist to provide context and avoid unreasonable SLOs
● Premature org-wide promotion of SLOs before value is proven or context is available
● Uncertain how to explain the first results to business stakeholders without causing panic
7: Define SLOs, Share First Results
Historically, availability was 82% and 75% of requests met the latency target. So let’s start there.
Pattern:
● Use past data to derive thresholds for SLOs
● Engage PMs with the historical data to create sustainable SLOs
● Discuss if activities are needed to increase the reliability
● Then this is the time to go forward and demonstrate the first success to build support for the project
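One way to "use past data to derive thresholds" is to pick a low percentile of the historical SLI values, so that the starting SLO would have been met in most past periods. This is a sketch of that idea only. The nearest-rank percentile, the choice of the 20th percentile, and the weekly numbers are assumptions for illustration.

```python
import math


def starting_slo(historical_sli, percentile):
    """Nearest-rank percentile of past SLI values, used as a first SLO target."""
    if not historical_sli:
        raise ValueError("no historical data")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(historical_sli)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def fraction_of_periods_met(historical_sli, target):
    """Share of past periods in which the SLI was at or above the target."""
    met = sum(1 for value in historical_sli if value >= target)
    return met / len(historical_sli)


# Made-up weekly availability values, scattered around the 82% from the slide.
weekly_availability = [0.82, 0.85, 0.80, 0.84, 0.83, 0.81, 0.86, 0.79, 0.82, 0.84]

target = starting_slo(weekly_availability, percentile=20)
met = fraction_of_periods_met(weekly_availability, target)
print(f"starting SLO {target:.0%}, met in {met:.0%} of past weeks")
```

A target derived this way is sustainable by construction. Whether it is good enough for the business is the discussion to have with the PM.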
We’re leaving money $$$$ on the table so let’s do some reliability work ...
8. Define Error Budget Policy
Disclaimer: Only a few organizations have gotten this far
8. Define Error Budget Policy
Pattern:
● The product team commits to defending the SLOs
● Business stakeholders, product team, and SREs agree to a formal action policy
● Hold regular check-ins on SLO fulfillment within the product team and with business stakeholders
Let’s add some teeth to turn this into a decision making tool … We’ll need buy in from ...
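A formal action policy can be written down as a small mapping from remaining error budget to an agreed action. The following is a sketch under stated assumptions: the 25% tier and the action texts are hypothetical examples, and the real tiers have to be agreed between business stakeholders, the product team, and SREs.

```python
def error_budget_remaining(slo_target, total_events, bad_events):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("SLO target and traffic leave no error budget")
    return 1.0 - bad_events / allowed_bad


def policy_action(remaining):
    """Map remaining budget to an action. Tiers are hypothetical examples."""
    if remaining <= 0.0:
        return "freeze feature releases; reliability work only"
    if remaining < 0.25:
        return "prioritize reliability work in the next sprint"
    return "continue feature work as planned"


# Made-up numbers: a 90% SLO over 1000 requests allows about 100 failures.
for bad in (50, 90, 150):
    remaining = error_budget_remaining(0.90, total_events=1000, bad_events=bad)
    print(f"{bad} failures -> {remaining:.0%} budget left -> {policy_action(remaining)}")
```

Encoding the policy this way makes the consequence of a violation explicit before it happens, which is what gives the SLO its teeth.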
9. Set Up and Tune Alerting
You need to be alerted when the error budget burn rate increases greatly
I’m glad I got that Ph.D. for “Error Budget Burn Rate Dynamic Adjustment Engineering”
But won’t that be noisy? Will it give us enough mitigation time?
Pattern:
● Teams will be asked to defend SLOs so they need alerting
● Start with simple threshold alerting on the SLI before it reaches the SLO threshold
● In order to be able to defend an SLO you need to alert on trends (e.g., error budget burn rate changes)
● Use a tool that can help you adjust the error budget burn rate alert so you don’t need a Ph.D.
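Alerting on the burn rate can be sketched as follows. The burn rate is the observed error ratio divided by the error ratio the SLO allows, so a value of 1 spends the budget exactly over the SLO period. Requiring both a long and a short window to exceed the threshold is one common way to reduce noise. It is shown here as an assumption, not as the method used in the talk, and the threshold and window contents are made-up values.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed error ratio relative to the error ratio allowed by the SLO."""
    if total_events == 0:
        return 0.0
    error_ratio = bad_events / total_events
    return error_ratio / (1.0 - slo_target)


def should_alert(long_window, short_window, slo_target, threshold):
    """Alert only if both windows burn faster than the threshold.

    long_window and short_window are (bad_events, total_events) pairs.
    The long window shows the burn is significant, the short window
    shows it is still happening right now.
    """
    long_rate = burn_rate(long_window[0], long_window[1], slo_target)
    short_rate = burn_rate(short_window[0], short_window[1], slo_target)
    return long_rate >= threshold and short_rate >= threshold


# Made-up example: 99.9% SLO, hypothetical threshold of 14.4.
ongoing = should_alert((20, 1000), (5, 100), slo_target=0.999, threshold=14.4)
recovered = should_alert((20, 1000), (0, 100), slo_target=0.999, threshold=14.4)
print(ongoing, recovered)
```

In the second call the long window still looks bad, but the short window shows the problem has stopped, so no alert fires. That addresses the "won’t that be noisy?" concern raised above.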
10: Congratulations, You Have Made It!
Now we can iterate, improve, and scale.
Expectations and Lessons Learned
Aside from the lessons learned already mentioned ...
● It will take several weeks for a single team to get started
● It takes 2-3 quarters for a larger organization
● Plan to spend 2-4 weeks for education of the first team
● If there is no action upon SLO violation, the whole thing is wasted
● You need buy-in / support from the executive / VP level
● You need enough historical data to derive, set up, and validate SLIs and SLOs
What You Need To Make Your SLO Journey Smooth
Pattern:
● Low cost discovery, setup, and care of user-journeys
● Easy dashboard creation and maintenance
○ Show services that are the biggest error budget (EB) offenders
○ Show impacted user-journeys and by how much
● Alert on error budget burn rate or an endangered SLO
● Troubleshooting that drills down from user journeys
I’m glad we have tools and processes that support the SLI / SLO methodology as a first-class citizen
The Bumpy Road to Actionable SLOs
Now for your questions ….