Practical Service Level
Objectives With Error
Budgeting
Fred Moyer @phredmoyer
BayLISA May 16, 2019
Are Errors important?
Is Latency Important?
How many errors in your app last week?
How many requests over 500ms last week?
Your error/request ratio last week?
Are slow requests errors?
Hi I’m Fred
● @phredmoyer
● Monitoring Nerd
● Writing code 20 years
● And breaking prod
● Likes Go, Perl, C, Pg
● Likes SLOs
● Doesn’t like errors
Talk Agenda
● SLOs and Error Budgets
● Calculating Error Budgets with Logs
● Calculating Error Budgets with Metrics
What is an Error Budget?
Zero Errors!
Happy Users!
What is an Error Budget?
Too much risk = Too many errors
Too many errors = Unhappy users
Too little risk = No code shipped
No code shipped = Unhappy users
What is an Error Budget?
Too much risk = Unhappy users
Just enough risk = Happy users
Too little risk = Unhappy users
What is an Error Budget?
Error budget = Acceptable risk
Acceptable risk = 100%-SLO
Error budget = 100%-SLO
SLOs, How Do They Work?
SLIs, SLOs, SLAs, oh my!
https://www.youtube.com/watch?v=tEylFyxbDLE
@lizthegrey ⇔ @sethvargo
SLI: 95th %ile requests over 5 min < 300ms
SLO: 95th %ile SLI for 1 month succeeds 99.9%
SLA: 95th %ile SLI for 1 month succeeds 99.5%
or you have to refund money
What is an Error Budget?
SLI: 95th %ile req over 5 min < 300ms
SLO: 95th %ile SLI for 1 month succeeds 99.9%
1M reqs in one month
Error Budget = (1-0.999)*1M = 1k requests
1k requests can exceed 300ms
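The arithmetic on this slide can be sketched in a few lines of Python. The function name `error_budget` and its parameters are illustrative, not from the talk; the SLO is passed as a fraction (0.999 for 99.9%).

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of requests allowed to miss the SLI during the period."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    # Budget = (1 - SLO) * requests, rounded to a whole request count.
    return round((1.0 - slo) * total_requests)


# Slide example: 99.9% SLO over 1M requests in the period.
budget = error_budget(0.999, 1_000_000)
print(budget)
```

With the slide's numbers this gives a budget of 1,000 requests that may exceed 300ms.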
What is an Error Budget?
Site Reliability Engineering (the Google SRE book), Chapter 3: Embracing Risk
Calculating Error Budgets with Logs
Latency
Calculating Error Budgets with Logs - Latency
Error Budget = (100%-SLO) * requests = (1-0.999)*1M = 1k
Error Budget = 1k requests/day > 300ms (at 1M requests/day)
LogFormat "%h %l %u %O \"%{User-Agent}i\" %D"
%D - Request duration in microseconds (300ms = 300,000µs)
For each request:
If duration > SLI (300ms), error_budget++
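A minimal sketch of that per-request loop follows. It assumes the duration is the last whitespace-separated field of each access log line, as in the format above. The function name, the sample lines, and the `units_per_ms` parameter are hypothetical; `units_per_ms` defaults to 1000 on the assumption that the logged duration is in microseconds, as with Apache's `%D`, so pass 1 if your log records milliseconds.

```python
from typing import Iterable, Tuple


def count_slow_requests(lines: Iterable[str], threshold_ms: float = 300.0,
                        units_per_ms: float = 1000.0) -> Tuple[int, int]:
    """Return (slow, total) for log lines whose last field is the duration."""
    slow = total = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        try:
            duration_ms = float(fields[-1]) / units_per_ms
        except ValueError:
            continue  # skip lines without a numeric duration
        total += 1
        if duration_ms > threshold_ms:
            slow += 1  # this request spends error budget
    return slow, total


# Made-up sample lines: 120ms, 450ms, and exactly 300ms.
sample = [
    '127.0.0.1 - - 512 "curl/7.64" 120000',
    '127.0.0.1 - - 2048 "curl/7.64" 450000',
    '127.0.0.1 - - 128 "curl/7.64" 300000',
]
print(count_slow_requests(sample))
```

Only the 450ms request counts against the budget here, because the comparison is strictly greater than the threshold.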
Calculating Error Budgets with Logs - Errors
Errors
Error Budget = 1k requests/day > 300ms
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1]
client denied by server: /export/home/live/ap/htdocs/test
For each error log entry, error_budget++
If req duration > SLI (300ms), error_budget++
Alert if error_budget/total_reqs > 80% * (1-SLO)
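The alert rule can be sketched as a single comparison. The function and parameter names are illustrative; `bad_events` stands for the running `error_budget` count on the slide, and the 80% factor is applied to the whole budget fraction (1 - SLO).

```python
def should_alert(bad_events: int, total_requests: int, slo: float = 0.999,
                 burn_fraction: float = 0.8) -> bool:
    """True once the observed bad-event ratio exceeds 80% of the budget."""
    if total_requests <= 0:
        return False  # nothing served yet, so nothing to alert on
    budget_ratio = 1.0 - slo  # allowed fraction of bad requests
    return bad_events / total_requests > burn_fraction * budget_ratio


print(should_alert(700, 1_000_000))  # 0.07% bad, below 80% of a 0.1% budget
print(should_alert(900, 1_000_000))  # 0.09% bad, above 80% of a 0.1% budget
```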
Calculating Error Budgets with Logs
Cumulative sum functionality required
● Splunk
● ELK
● Mtail
○ https://github.com/google/mtail
● Honeycomb.io
● Circonus Logwatch
○ https://github.com/circonus-labs/circonus-logwatch
Calculating Error Budgets with Metrics
Errors
Calculating Error Budgets with Metrics
Use a counter metric (uint32/uint64)
Error Budget = 1k requests/day > 300ms
For each app error, error_budget++
If req duration > SLI (300ms), error_budget++
Alert if error_budget/total_reqs > 80% * (1-SLO)
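One way to picture the counter approach is a pair of monotonically increasing counters updated on every request. The class below is a hypothetical in-process sketch, not a specific metrics library. Python integers do not overflow, so the uint32/uint64 sizing from the slide does not apply here. It also counts a request once even when it is both an error and slow, which is an assumption; the slide increments for each condition separately.

```python
from dataclasses import dataclass


@dataclass
class BudgetCounters:
    total_requests: int = 0  # every request served
    budget_spent: int = 0    # requests that were errors or too slow

    def record(self, duration_ms: float, is_error: bool,
               threshold_ms: float = 300.0) -> None:
        """Update both counters for a single finished request."""
        self.total_requests += 1
        if is_error or duration_ms > threshold_ms:
            self.budget_spent += 1

    def bad_ratio(self) -> float:
        """Fraction of requests that spent budget (0.0 if none served)."""
        if self.total_requests == 0:
            return 0.0
        return self.budget_spent / self.total_requests


counters = BudgetCounters()
for duration_ms, is_error in [(120, False), (450, False), (80, True), (500, True)]:
    counters.record(duration_ms, is_error)
print(counters.budget_spent, counters.total_requests, counters.bad_ratio())
```

The ratio returned by `bad_ratio()` is the left-hand side of the alert comparison on the slide.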
Calculating Error Budgets with Metrics (and Logs)
Problems:
● SLI fixed threshold
● Inability to introspect historical data
● Difficult to compare different SLI behavior
Calculating Error Budgets with Metrics - Histograms
Use a histogram
Image source
http://www.brendangregg.com/FrequencyTrails/modes.html
Calculating Error Budgets with Metrics - Histograms
Linear, Cumulative, Log-Linear, Approximate…
High dynamic range, log-linear recommended
http://hdrhistogram.org/
https://github.com/circonus-labs/circonusllhist
Calculating Error Budgets with Metrics - Histograms
Error Budget = 1k requests/day > Xms
For each histogram bin >= X:
error_budget += bin_sample_count
Alert if error_budget/total_reqs > 80% * (1-SLO)
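The bin summation can be sketched as below. It assumes the histogram is available as a mapping from each bin's lower bound to its sample count; that layout, the function name, and the sample data are illustrative and do not reflect the API of any particular histogram library.

```python
from typing import Mapping


def budget_spent_from_histogram(bins: Mapping[float, int],
                                threshold: float) -> int:
    """Sum sample counts of all bins whose lower bound is >= threshold."""
    return sum(count for lower, count in bins.items() if lower >= threshold)


# Made-up latency histogram: lower bound in ms -> number of requests.
latency_bins = {100: 500, 200: 300, 300: 40, 400: 10}
print(budget_spent_from_histogram(latency_bins, 300))
print(budget_spent_from_histogram(latency_bins, 200))
```

Because the stored histogram keeps the whole distribution, the same data answers the question for any threshold X that falls on a bin boundary, without re-instrumenting the application.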
Calculating Error Budgets with Metrics - Histograms
Choose bin boundary for SLI (preferred) or
interpolate within boundaries
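When the SLI threshold does not sit on a bin boundary, the count has to be estimated. The sketch below uses linear interpolation, which assumes samples are spread evenly within a bin; that assumption, the `(lower, upper, count)` tuple layout, and the names are illustrative rather than taken from the talk.

```python
from typing import Sequence, Tuple


def count_above(bins: Sequence[Tuple[float, float, int]],
                threshold: float) -> float:
    """Estimate how many samples exceed threshold."""
    total = 0.0
    for lower, upper, count in bins:
        if lower >= threshold:
            total += count  # whole bin is above the threshold
        elif upper > threshold:
            # Threshold falls inside this bin: take the fraction above it.
            total += count * (upper - threshold) / (upper - lower)
    return total


# Made-up bins: (lower ms, upper ms, sample count).
latency_bins = [(100, 200, 500), (200, 300, 300), (300, 400, 40)]
print(count_above(latency_bins, 300))  # threshold on a bin boundary
print(count_above(latency_bins, 250))  # threshold in the middle of a bin
```

A threshold on a boundary gives an exact count, which is why the slide prefers it; the interpolated figure is only as good as the even-spread assumption.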
Calculating Error Budgets with Metrics - Histograms
Error Budget ~ 1k requests/day > 1,800µs
Calculating Error Budgets with Metrics - Histograms
Error Budget ~ 1k requests/day > 2,400µs
Calculating Error Budgets with Metrics - Histograms
Benefits:
● SLI variable threshold
● Ability to analyze historical data
● Examine error budgets for different SLIs
Questions?
Thanks!
https://slideshare.net/redhotpenguin
https://twitter.com/phredmoyer
https://linkedin.com/in/redhotpenguin
https://github.com/redhotpenguin
Appendix - SLOs, How Do They Work?
● Chapter 4
○ Service Level Objectives
● 99% Get RPC calls < 100ms
● https://landing.google.com/sre/sre-book/toc/index.html
● Ch 2: Implementing SLOs
● Ch 3: SLO Eng case studies
● Ch 5: Alerting on SLOs
● https://landing.google.com/sre/workbook/toc
Appendix - SLOs, How Do They Work?
● Chapter 21
○ The Art and Science of the Service Level Objective

Intro to Amazon Web Services (AWS) and Gen AI
 
introduction of Ansys software and basic and advance knowledge of modelling s...
introduction of Ansys software and basic and advance knowledge of modelling s...introduction of Ansys software and basic and advance knowledge of modelling s...
introduction of Ansys software and basic and advance knowledge of modelling s...
 
Software development... for all? (keynote at ICSOFT'2024)
Software development... for all? (keynote at ICSOFT'2024)Software development... for all? (keynote at ICSOFT'2024)
Software development... for all? (keynote at ICSOFT'2024)
 
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
 
ThaiPy meetup - Indexes and Django
ThaiPy meetup - Indexes and DjangoThaiPy meetup - Indexes and Django
ThaiPy meetup - Indexes and Django
 
Break data silos with real-time connectivity using Confluent Cloud Connectors
Break data silos with real-time connectivity using Confluent Cloud ConnectorsBreak data silos with real-time connectivity using Confluent Cloud Connectors
Break data silos with real-time connectivity using Confluent Cloud Connectors
 
WEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service ProvidersWEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service Providers
 
Shivam Pandit working on Php Web Developer.
Shivam Pandit working on Php Web Developer.Shivam Pandit working on Php Web Developer.
Shivam Pandit working on Php Web Developer.
 
MVP Mobile Application - Codearrest.pptx
MVP Mobile Application - Codearrest.pptxMVP Mobile Application - Codearrest.pptx
MVP Mobile Application - Codearrest.pptx
 
Safe Work Permit Management Software for Hot Work Permits
Safe Work Permit Management Software for Hot Work PermitsSafe Work Permit Management Software for Hot Work Permits
Safe Work Permit Management Software for Hot Work Permits
 
Top 10 Tips To Get Google AdSense For Your Website
Top 10 Tips To Get Google AdSense For Your WebsiteTop 10 Tips To Get Google AdSense For Your Website
Top 10 Tips To Get Google AdSense For Your Website
 
dachnug51 - All you ever wanted to know about domino licensing.pdf
dachnug51 - All you ever wanted to know about domino licensing.pdfdachnug51 - All you ever wanted to know about domino licensing.pdf
dachnug51 - All you ever wanted to know about domino licensing.pdf
 

Practical service level objectives with error budgeting

  • 1. Practical Service Level Objectives With Error Budgeting Fred Moyer @phredmoyer BayLISA May 16, 2019
  • 4. How many errors in your app last week?
  • 5. How many requests over 500ms last week?
  • 6. Your error/request ratio last week?
  • 7. Are slow requests errors?
  • 8. Hi I’m Fred ● @phredmoyer ● Monitoring Nerd ● Writing code 20 years ● And breaking prod ● Likes Go, Perl, C, Pg ● Likes SLOs ● Doesn’t like errors
  • 9. Talk Agenda ● SLOs and Error Budgets ● Calculating Error Budgets with Logs ● Calculating Error Budgets with Metrics
  • 10. What is an Error Budget? Zero Errors! Happy Users!
  • 11. What is an Error Budget?
      Too much risk = Too many errors
      Too many errors = Unhappy users
      Too little risk = No code shipped
      No code shipped = Unhappy users
  • 14. What is an Error Budget?
      Too much risk = Unhappy users
      Just enough risk = Happy users
      Too little risk = Unhappy users
  • 15. What is an Error Budget?
      Error budget = Acceptable risk
      Acceptable risk = 100%-SLO
      Error budget = 100%-SLO
  • 17. SLOs, How Do They Work?
      SLIs, SLOs, SLAs, oh my! https://www.youtube.com/watch?v=tEylFyxbDLE @lizthegrey ⇔ @sethvargo
      SLI: 95th %ile requests over 5 min < 300ms
      SLO: 95th %ile SLI for 1 month succeeds 99.9%
      SLA: 95th %ile SLI for 1 month succeeds 99.5%, or you have to refund money
  • 18. What is an Error Budget?
      SLI: 95th %ile req over 5 min < 300ms
      SLO: 95th %ile SLI for 1 month succeeds 99.9%
      1M reqs in one month
      Error Budget = (1-0.999)*1M = 1k requests
      1k requests can exceed 300ms
  • 19. What is an Error Budget? Chapter 3: Embracing Risk (Google SRE book)
  • 20. Talk Agenda ● SLOs and Error Budgets ● Calculating Error Budgets with Logs ● Calculating Error Budgets with Metrics
  • 21. Calculating Error Budgets with Logs - Latency
  • 22. Calculating Error Budgets with Logs - Latency
      Error Budget = 100%-SLO = (1-0.999)*1M = 1k
      Error Budget = 1k requests/day > 300ms
      EventLog "%h %l %u %O \"%{User-Agent}i\" %D"
      %D - Request duration (Apache's mod_log_config reports this in microseconds, so 300ms = 300,000µs)
      For each request: if duration > SLI (300ms), error_budget++
  • 23. Calculating Error Budgets with Logs - Errors
  • 24. Calculating Error Budgets with Logs - Errors
      Error Budget = 1k requests/day > 300ms
      [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server: /export/home/live/ap/htdocs/test
      For each error log entry, error_budget++
      If req duration > SLI (300ms), error_budget++
      Alert if error_budget/total_reqs > 80% * (1-SLO)
  • 25. Calculating Error Budgets with Logs: Cumulative sum functionality required ● Splunk ● ELK ● Mtail ○ https://github.com/google/mtail ● Honeycomb.io ● Circonus Logwatch ○ https://github.com/circonus-labs/circonus-logwatch
  • 26. Talk Agenda ● SLOs and Error Budgets ● Calculating Error Budgets with Logs ● Calculating Error Budgets with Metrics
  • 27. Calculating Error Budgets with Metrics - Errors
  • 28. Calculating Error Budgets with Metrics
      Use a counter metric (uint32/uint64)
      Error Budget = 1k requests/day > 300ms
      For each app error, error_budget++
      If req duration > SLI (300ms), error_budget++
      Alert if error_budget/total_reqs > 80% * (1-SLO)
  • 29. Calculating Error Budgets with Metrics (and Logs): Problems: ● SLI fixed threshold ● Inability to introspect historical data ● Difficult to compare different SLI behavior
  • 30. Calculating Error Budgets with Metrics - Histograms: Use a histogram (image source: http://www.brendangregg.com/FrequencyTrails/modes.html)
  • 31. Calculating Error Budgets with Metrics - Histograms: Linear, Cumulative, Log-Linear, Approximate… High dynamic range, log-linear recommended ● http://hdrhistogram.org/ ● https://github.com/circonus-labs/circonusllhist
  • 32. Calculating Error Budgets with Metrics - Histograms
      Error Budget = 1k requests/day > Xms
      For each histogram bin >= X: error_budget += bin_sample_count
      Alert if error_budget/total_reqs > 80% * (1-SLO)
  • 33. Calculating Error Budgets with Metrics - Histograms: Choose bin boundary for SLI (preferred) or interpolate within boundaries
  • 34. Calculating Error Budgets with Metrics - Histograms: Error Budget ~ 1k requests/day > 1,800µs
  • 35. Calculating Error Budgets with Metrics - Histograms: Error Budget ~ 1k requests/day > 2,400µs
  • 36. Calculating Error Budgets with Metrics - Histograms: Benefits: ● SLI variable threshold ● Ability to analyze historical data ● Examine error budgets for different SLIs
  • 37. Talk Agenda ● SLOs and Error Budgets ● Calculating Error Budgets with Logs ● Calculating Error Budgets with Metrics
  • 40. Appendix - SLOs, How Do They Work? ● Chapter 4 ○ Service Level Objectives ● 99% Get RPC calls < 100ms ● https://landing.google.com/sre/sre-book/toc/index.html
  • 41. Appendix - SLOs, How Do They Work? ● Ch 2: Implementing SLOs ● Ch 3: SLO Eng case studies ● Ch 5: Alerting on SLOs ● https://landing.google.com/sre/workbook/toc
  • 42. Appendix - SLOs, How Do They Work? ● Chapter 21 ○ The Art and Science of The Service Level Objective
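
The counting rules from slides 18, 22 to 28, and 32 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: the function names are hypothetical, durations are assumed to be already converted to milliseconds, and the histogram is assumed to be a pair of parallel lists (ascending bin lower bounds and per-bin sample counts) with a bin boundary sitting exactly on the SLI threshold, as slide 33 prefers.

```python
from bisect import bisect_left


def error_budget(slo, total_requests):
    """Requests allowed to miss the SLI: (1 - SLO) * total (slide 18)."""
    return (1.0 - slo) * total_requests


def spent_from_log(durations_ms, threshold_ms, error_entries=0):
    """Log or counter style (slides 22 to 28): one unit of budget per
    request slower than the SLI threshold, plus one per logged error."""
    slow = sum(1 for d in durations_ms if d > threshold_ms)
    return slow + error_entries


def spent_from_histogram(bin_lower_bounds_ms, bin_counts, threshold_ms):
    """Histogram style (slide 32): add up the sample counts of every bin
    whose lower bound is >= the threshold. Bounds must be ascending."""
    first = bisect_left(bin_lower_bounds_ms, threshold_ms)
    return sum(bin_counts[first:])


def should_alert(spent, total_requests, slo, fraction=0.8):
    """Alert once the spent ratio exceeds 80% of the allowed ratio (1 - SLO)."""
    return spent / total_requests > fraction * (1.0 - slo)


if __name__ == "__main__":
    # Hypothetical data: 1,000 requests binned by latency lower bound in ms.
    bounds = [0, 100, 200, 300, 400]
    counts = [500, 300, 150, 40, 10]
    total = sum(counts)
    for threshold in (200, 300):
        spent = spent_from_histogram(bounds, counts, threshold)
        print(threshold, spent, should_alert(spent, total, slo=0.9))
```

Because the histogram keeps the whole latency distribution, the same stored data can be re-evaluated against a different threshold after the fact, which is the benefit slide 36 lists over a fixed-threshold counter.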