All,
We've just started on our SRE journey and are trying to define SLIs / SLOs for our application. It is an ETL application where:
1. Feeds (e.g. start-of-day and end-of-day data feeds) come from various upstream systems and get loaded with some transformation.
2. Once the feeds are loaded, some jobs process the data and populate other tables.
3. Based on points 1 and 2, the data is made available to downstream applications and users.
In this case, we defined the user experience as:
- Data availability for downstream applications at a specific time for a specific region
- Data availability for users at a specific time for a specific region
We set the times for the two points above based on when the data is required to be available. From this we derived SLOs for ourselves: to make sure we deliver data at the right time, we need at least 1.5 hrs to investigate and reprocess the feeds if something goes wrong, so our SLO time is the SLA time for downstream/users minus 1.5 hrs. This is what we considered as the SLI.
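To make the arithmetic concrete, here is a minimal sketch of how the internal SLO time could be derived from the downstream SLA time. The 08:00 SLA, the date, and the names `RECOVERY_BUFFER` and `internal_slo_time` are made up for illustration; only the 1.5 hr buffer comes from our setup.

```python
from datetime import datetime, timedelta

# The 1.5 hrs we need to investigate and reprocess feeds.
RECOVERY_BUFFER = timedelta(hours=1, minutes=30)


def internal_slo_time(sla_time: datetime) -> datetime:
    """Return the internal SLO time: the downstream/user SLA time minus the buffer."""
    return sla_time - RECOVERY_BUFFER


# Hypothetical example: a region whose downstream SLA is 08:00.
sla = datetime(2024, 1, 15, 8, 0)
print(internal_slo_time(sla))  # 2024-01-15 06:30:00
```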
To measure this, we capture how many times we are unable to get the feeds processed by the SLO time and express it as a percentage, i.e. 95% of the time we should be able to deliver data before the SLO time.
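A rough sketch of that measurement, in case it helps clarify what we mean. The delivery records below are invented sample data, the names `on_time_ratio` and `SLO_TARGET` are illustrative, and treating a delivery that lands exactly at the SLO time as on time is an assumption.

```python
from datetime import datetime

SLO_TARGET = 0.95  # 95% of deliveries should land by the SLO time


def on_time_ratio(deliveries):
    """deliveries: list of (actual_ready_time, slo_time) pairs, one per feed run."""
    if not deliveries:
        raise ValueError("no deliveries to measure")
    # Assumption: a delivery exactly at the SLO time still counts as on time.
    good = sum(1 for ready, slo in deliveries if ready <= slo)
    return good / len(deliveries)


# Hypothetical sample data: four daily runs against a 06:30 SLO time.
deliveries = [
    (datetime(2024, 1, 15, 6, 10), datetime(2024, 1, 15, 6, 30)),
    (datetime(2024, 1, 16, 6, 45), datetime(2024, 1, 16, 6, 30)),  # late
    (datetime(2024, 1, 17, 6, 0), datetime(2024, 1, 17, 6, 30)),
    (datetime(2024, 1, 18, 6, 30), datetime(2024, 1, 18, 6, 30)),
]

ratio = on_time_ratio(deliveries)
print(f"{ratio:.0%} on time, SLO met: {ratio >= SLO_TARGET}")
# 75% on time, SLO met: False
```

Here the ratio of on-time runs to total runs is the measured value, and the 95% figure is the target it is compared against.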
Is this the right approach to define SLI / SLO?
Most of the documentation, videos, etc. on SLIs / SLOs focus on microservice responses, measured by success / failure or latency, but we couldn't find anything about ETL or reporting applications.
I might be wrong in the above approach, so I'm looking forward to expert advice to get a better understanding.
Thank you in advance for your help.
PS: I couldn't find a tag for SRE or any related practices, so I have tagged the question with DevOps.