Practical service level objectives with error budgeting

Practical Service Level Objectives With Error Budgeting
Fred Moyer @phredmoyer
BayLISA May 16, 2019

Are Errors Important?

Is Latency Important?

How many errors in your app last week?

How many requests over 500ms last week?

Your error/request ratio last week?

Are slow requests errors?

Hi, I’m Fred
● @phredmoyer
● Monitoring Nerd
● Writing code for 20 years
● And breaking prod
● Likes Go, Perl, C, Pg
● Likes SLOs
● Doesn’t like errors

Talk Agenda
● SLOs and Error Budgets
● Calculating Error Budgets with Logs
● Calculating Error Budgets with Metrics

What is an Error Budget?
Zero Errors!
Happy Users!

Too much risk = Too many errors
Too many errors = Unhappy users
Too little risk = No code shipped
No code shipped = Unhappy users

Too much risk = Unhappy users
Just enough risk = Happy users
Too little risk = Unhappy users

Error budget = Acceptable risk
Acceptable risk = 100%-SLO
Error budget = 100%-SLO

SLOs, How Do They Work?
SLIs, SLOs, SLAs, oh my!
https://www.youtube.com/watch?v=tEylFyxbDLE
@lizthegrey ⇔ @sethvargo
SLI: 95th %ile requests over 5 min < 300ms
SLO: 95th %ile SLI for 1 month succeeds 99.9%
SLA: 95th %ile SLI for 1 month succeeds 99.5%
or you have to refund money

SLI: 95th %ile req over 5 min < 300ms
SLO: 95th %ile SLI for 1 month succeeds 99.9%
1M reqs in one month
Error Budget = (1-0.999)*1M = 1k requests
1k requests can exceed 300ms
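
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration only; the function name error_budget_requests is made up for this example and does not come from the talk.

```python
def error_budget_requests(slo: float, total_requests: int) -> int:
    """Requests allowed to miss the SLI while the SLO is still met."""
    # round() absorbs floating-point noise: 1 - 0.999 is not exactly 0.001.
    return round((1.0 - slo) * total_requests)


# The slide's numbers: a 99.9% SLO over 1M requests.
budget = error_budget_requests(0.999, 1_000_000)
print(budget)  # 1000
```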

Chapter 3
Embracing Risk

Calculating Error Budgets with Logs
Latency

Calculating Error Budgets with Logs - Latency
Error Budget = (100%-SLO) * requests = (1-0.999)*1M = 1k
Error Budget = 1k requests/day > 300ms
LogFormat "%h %l %u %O \"%{User-Agent}i\" %D"
%D - Request duration in microseconds
For each request:
If duration > SLI threshold (300ms = 300,000µs), error_budget++
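
A minimal sketch of that loop in Python, assuming the request duration is the last whitespace-separated field on each line (Apache's %D, which is reported in microseconds). The function name and the sample lines are invented for illustration and are not real log output.

```python
SLI_THRESHOLD_US = 300_000  # 300 ms expressed in microseconds (the unit of %D)


def count_slow_requests(lines, threshold_us=SLI_THRESHOLD_US):
    """Return (total_requests, requests_over_threshold)."""
    total = slow = 0
    for line in lines:
        # Split off the last field only; the User-Agent may contain spaces.
        fields = line.rsplit(None, 1)
        if len(fields) != 2 or not fields[1].isdigit():
            continue  # skip blank or malformed lines
        total += 1
        if int(fields[1]) > threshold_us:
            slow += 1
    return total, slow


# Hypothetical lines in the format: %h %l %u %O "%{User-Agent}i" %D
sample = [
    '127.0.0.1 - - 512 "curl/7.64.0" 120000',
    '127.0.0.1 - - 2048 "Mozilla/5.0 (X11; Linux)" 450000',
    '127.0.0.1 - - 128 "curl/7.64.0" 300000',
]
total, slow = count_slow_requests(sample)
print(total, slow)  # 3 1
```

A request that takes exactly 300,000µs is not counted here, because the comparison is strictly greater than the threshold.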

Calculating Error Budgets with Logs - Errors
Errors

Calculating Error Budgets with Logs - Errors
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1]
client denied by server: /export/home/live/ap/htdocs/test
For each error log entry, error_budget++
If req duration > SLI (300ms), error_budget++
Alert if error_budget/total_reqs > 80% * (1-SLO)
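
The alert condition can be sketched as a small predicate. The function and parameter names are assumptions for illustration; the 80% figure is the one from the slide.

```python
def should_alert(bad_events, total_requests, slo=0.999, burn_fraction=0.8):
    """True once bad events exceed burn_fraction of the error budget."""
    if total_requests == 0:
        return False  # no traffic, nothing to alert on
    allowed_ratio = burn_fraction * (1.0 - slo)
    return bad_events / total_requests > allowed_ratio


# With a 99.9% SLO, 80% of the budget is 0.08% of requests.
print(should_alert(900, 1_000_000))  # True
print(should_alert(700, 1_000_000))  # False
```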

Calculating Error Budgets with Logs
Cumulative sum functionality required
● Splunk
● ELK
● Mtail
○ https://github.com/google/mtail
● Honeycomb.io
● Circonus Logwatch
○ https://github.com/circonus-labs/circonus-logwatch

Calculating Error Budgets with Metrics
Errors

Calculating Error Budgets with Metrics
Use a counter metric (uint32/uint64)
For each app error, error_budget++
If req duration > SLI (300ms), error_budget++
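
A minimal in-process sketch of the counter approach, assuming a hypothetical ErrorBudgetCounter class rather than any particular metrics library. One deliberate difference from the slide: a request that is both slow and an error is counted once here, not twice.

```python
import threading


class ErrorBudgetCounter:
    """Counts total requests and requests that consume error budget."""

    def __init__(self, threshold_ms=300.0):
        self.threshold_ms = threshold_ms
        self.total = 0
        self.bad = 0
        self._lock = threading.Lock()  # counters may be updated from many threads

    def observe(self, duration_ms, is_error=False):
        with self._lock:
            self.total += 1
            # An application error or a slow request both spend budget.
            if is_error or duration_ms > self.threshold_ms:
                self.bad += 1


counter = ErrorBudgetCounter()
counter.observe(120)                 # fast, OK
counter.observe(450)                 # slow
counter.observe(50, is_error=True)   # fast, but an application error
counter.observe(500, is_error=True)  # slow and an error, counted once
print(counter.total, counter.bad)    # 4 3
```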

Calculating Error Budgets with Metrics (and Logs)
Problems:
● SLI fixed threshold
● Inability to introspect historical data
● Difficult to compare different SLI behavior

Calculating Error Budgets with Metrics - Histograms
Use a histogram
Image source
http://www.brendangregg.com/FrequencyTrails/modes.html

Linear, Cumulative, Log-Linear, Approximate…
High dynamic range, log-linear recommended
http://hdrhistogram.org/
https://github.com/circonus-labs/circonusllhist

Error Budget = 1k requests/day > Xms
For each histogram bin >= X:
error_budget += bin_sample_count
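
The bin summation can be sketched as follows, assuming the histogram is a plain dict that maps each bin's lower bound (in µs) to its sample count. The bin values are invented for illustration. Because the raw distribution is kept, the same data can be re-evaluated against a different threshold later.

```python
def budget_consumed(bins, threshold):
    """Sum the sample counts of every bin whose lower bound is >= threshold."""
    return sum(count for lower, count in bins.items() if lower >= threshold)


# Hypothetical histogram: bin lower bound in microseconds -> sample count.
bins = {1000: 500, 1200: 300, 1800: 120, 2400: 60, 3000: 20}

print(budget_consumed(bins, 1800))  # 200
print(budget_consumed(bins, 2400))  # 80
```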

Choose bin boundary for SLI (preferred) or
interpolate within boundaries
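
When the SLI threshold falls inside a bin, one option is linear interpolation. The sketch below assumes samples are spread uniformly within each bin, which is an approximation and not something the talk specifies; the bin layout and the function name are hypothetical.

```python
def count_above(bins, threshold):
    """Estimate how many samples exceed threshold.

    bins is a list of (lower, upper, count) tuples. A bin straddling the
    threshold contributes the fraction of its width above the threshold.
    """
    total = 0.0
    for lower, upper, count in bins:
        if lower >= threshold:
            total += count  # whole bin is above the threshold
        elif upper > threshold:
            total += count * (upper - threshold) / (upper - lower)
    return total


# Hypothetical bins in microseconds.
bins = [(1000, 2000, 400), (2000, 3000, 100), (3000, 4000, 20)]

print(count_above(bins, 2000))  # 120.0, threshold sits on a bin boundary
print(count_above(bins, 2500))  # 70.0, half of the middle bin plus the last bin
```

Placing the threshold on a bin boundary avoids the estimate entirely, which is why the slide prefers that option.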

Error Budget ~ 1k requests/day > 1,800µs

Error Budget ~ 1k requests/day > 2,400µs

Benefits:
● SLI variable threshold
● Ability to analyze historical data
● Examine error budgets for different SLIs

Thanks!
https://slideshare.net/redhotpenguin
https://twitter.com/phredmoyer
https://linkedin.com/in/redhotpenguin
https://github.com/redhotpenguin

Appendix - SLOs, How Do They Work?
● Chapter 4
○ Service Level Objectives
● 99% Get RPC calls < 100ms
● https://landing.google.com/sre/sre-book/toc/index.html

● Ch 2: Implementing SLOs
● Ch 3: SLO Eng case studies
● Ch 5: Alerting on SLOs
● https://landing.google.com/sre/workbook/toc

● Chapter 21
○ The Art and Science of the Service Level Objective
