All,

We've just started on our SRE journey and are trying to define SLIs / SLOs for our application. It is an ETL application where 1. feeds (e.g. start of day, end of day data feeds) come from various upstream systems and get loaded with some transformation. 2. Once the feeds are loaded, some jobs process the data and populate other tables. 3. Based on points 1 and 2, the data is made available for downstream applications and users.

In this case, we considered the user experience to be:

  1. Data availability for downstream applications at a specific time for a specific region
  2. Data availability for users at a specific time for a specific region

We defined the time for the two points above as the time by which the data is required to be available. Based on this, we created SLOs for ourselves: to make sure we deliver data at the right time, we need at least 1.5 hrs to investigate and reprocess the feeds, i.e. (SLA time for downstream/users) - 1.5 hrs. This is what we considered as the SLI.

To measure this, we capture how many times we are unable to get the feeds processed by the SLO time and express it as a percentage, i.e. 95% of the time we should be able to deliver data before the SLO time.
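For illustration, here is a rough sketch of how this measurement could be computed. Only the 1.5 hr buffer and the 95% target come from the description above; the function names, the `(sla_deadline, completed_at)` record shape, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta

# From the description above: 1.5 hrs of buffer before the SLA time, 95% target.
BUFFER = timedelta(hours=1, minutes=30)
SLO_TARGET = 0.95


def slo_deadline(sla_deadline: datetime) -> datetime:
    """Internal deadline: the downstream/user SLA time minus the buffer."""
    return sla_deadline - BUFFER


def on_time_ratio(runs):
    """Fraction of runs that finished at or before the internal deadline.

    `runs` is a list of (sla_deadline, completed_at) pairs (hypothetical shape).
    """
    if not runs:
        raise ValueError("no runs to measure")
    on_time = sum(1 for sla, done in runs if done <= slo_deadline(sla))
    return on_time / len(runs)


# Made-up sample data: SLA time is 08:00, so the internal deadline is 06:30.
runs = [
    (datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 6, 10)),  # on time
    (datetime(2024, 1, 2, 8, 0), datetime(2024, 1, 2, 6, 45)),  # late
    (datetime(2024, 1, 3, 8, 0), datetime(2024, 1, 3, 6, 30)),  # exactly on the deadline
    (datetime(2024, 1, 4, 8, 0), datetime(2024, 1, 4, 5, 55)),  # on time
]

sli = on_time_ratio(runs)
print(f"SLI = {sli:.0%}, SLO met: {sli >= SLO_TARGET}")
# prints "SLI = 75%, SLO met: False"
```

In this sketch the measured on-time percentage plays the role of the indicator and the 95% figure plays the role of the objective.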

Is this the right approach to define SLI / SLO?

Most of the documentation, videos, etc. on SLIs / SLOs focus on microservice responses and measure them on success / failure, latency, and so on, but I couldn't find anything about ETL or reporting applications.

I might be wrong in the above approach, so I'm looking for expert advice to improve my understanding.

Thank you in advance for your help.

PS: I couldn't find a tag for SRE or any related practices, so I have tagged the question with DevOps.

1 Answer

The terms "SLI," "SLO," and "SLA" have precise meanings that apply across the spectrum of scale, domain, and abstraction. Most literature focuses on microservices, but that's because microservices are "hot" right now. To understand the concepts more fundamentally, look at the last word in each acronym:

  • An indicator is a measure. It's something that you look at, a piece of data that can answer a question. "How fast are our responses?" "How many errors are occurring in the ETL process?" "What is our cache hit ratio?"
  • An objective is a goal. This is where you want to be. Objectives can be aspirational (e.g. we currently have 20 errors per day in the ETL process, but our goal is 5), or they can be steady-state (e.g. our response time is 200ms and it must not rise above 250ms). There is no enforcement or liability tied to these goals.
  • An agreement is a legally binding commitment. If an agreement is broken, financial (or even legal) penalties can be on the line. For example, if you agree to an annual SLA of five nines of uptime (99.999%), you have roughly 5.26 minutes of allowed downtime per year, so a single outage longer than that will blow your SLA and may constitute a breach of contract.
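The five-nines figure in the last bullet can be checked with a little arithmetic. This is a minimal sketch, assuming a 365-day year; the function name is made up for illustration.

```python
def allowed_downtime_minutes(availability: float, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted over the period at the given availability."""
    return (1.0 - availability) * period_days * 24 * 60


# Five nines over an assumed 365-day year.
print(round(allowed_downtime_minutes(0.99999), 2))
# prints 5.26
```

The same calculation works for any target and period, for example a 30-day window by passing `period_days=30`.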

Hopefully that helps.
