This document summarizes a presentation about implementing service level objectives (SLOs) and error budgets at scale. It discusses establishing service level indicators (SLIs) to define good and bad service, setting SLOs as targets for SLIs over time periods, and calculating error budgets as the complement of SLOs. The presentation provides examples of SLIs, SLOs, and error budgets for latency and availability. It also discusses challenges including variance from real users and different stakeholders' needs, and recommends approaches like flexible latency metrics and measuring as close to users as possible.
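The SLO/error-budget relationship described above is simple enough to sketch; a minimal illustration (function names are my own, not from the presentation):

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """An error budget is the complement of the SLO: the number of
    requests allowed to fail while still meeting the target."""
    return round(total_requests * (1.0 - slo_target))

# A 99.9% availability SLO over 1,000,000 monthly requests
# leaves a budget of 1,000 failed requests.
budget = error_budget(0.999, 1_000_000)
```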
Testing & deploying microservices - XP Days Ukraine 2014 (Sam Newman)
The document discusses testing and deploying microservices. It describes microservices as small autonomous services that work together. It discusses testing strategies including unit, service, and UI testing following Mike Cohn's test pyramid. It also discusses deploying microservices using container virtualization with Docker, immutable infrastructure, and deployment techniques like having one server per host and using image/Docker-based artifacts.
Preparing for CDN failure: Why and how (Aaron Peters)
This document discusses preparing for content delivery network (CDN) failures and how to monitor CDN performance. It provides examples of past CDN outages and failures. It then covers different methods for monitoring CDN performance, including synthetic monitoring and real user monitoring. It emphasizes the importance of measuring failure rates not just speeds. The document also discusses mitigating CDN failures through a multi-CDN approach with dynamic traffic steering based on performance data. It notes some challenges in decision making and with low volume data. Finally, it shares a story about responding to an outage at a company.
Presented at NDC London, December 2014
Microservice architectures can lead to easier to change, more maintainable systems which can be more secure, performant and stable than previous designs. But what are the practical concerns associated with running more fine-grained systems, and what are the new things you’ll need to know if you want to embrace the power of smaller services without the new sources of complexity making your life a nightmare? This talk will delve deeper into the characteristics of well-behaved services, and will define some clear principles your services should follow. It will also discuss in more depth some of the challenges associated with managing and monitoring more complex distributed systems. We’ll discuss how you can design services to be more fault-tolerant, what technologies may exist in your own platform to get you started. We’ll end by giving some pointers as to when you should consider microservice architectures, and how you should go about introducing them in your own organisation.
Forecasting using Monte Carlo simulations (Daniel Ploeg)
In a combined meetup between the LimitedWiP Society Melbourne and the Leadership and Project Delivery group, this presentation on forecasting aims to mature the conversation and catalyse change around predicting the likely outcome of project and product development knowledge work.
This document discusses optimization of web applications. It begins with an introduction of the author and their background. It then provides an overview of how the Apache Prefork model handles requests and responses. It discusses past projects that failed due to various challenges like component failure, peaks in traffic, unproven code, and stress tests not being representative. Key learnings are summarized as designing for 60% usage, having backups, and understanding hardware, software, human, and marketing factors. The document concludes that optimization requires understanding how all pieces work together and is an ongoing process.
These are the slides I gave from my recent talk at Javazone. It's an update of my 'Practical Considerations For Microservices' talk. You can see the accompanying video here: http://vimeo.com/105751281
Bulk API allows for parallel loading of large amounts of data into Salesforce faster than sequential loading. It does this by splitting the data into separate jobs that are processed concurrently by multiple threads. This reduces the total load time compared to loading all of the data with a single thread. In a demo, 1 million records were loaded in 41 minutes using Bulk API (roughly 24,400 records per minute), much faster than sequential loading would have been.
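The speed-up comes from the split-into-batches, load-concurrently pattern; a hedged, generic sketch of that pattern (the `load_batch` function below is a placeholder, not the actual Salesforce Bulk API):

```python
from concurrent.futures import ThreadPoolExecutor

def load_batch(batch):
    # Placeholder for one bulk-load job; a real loader would submit the
    # batch to the Bulk API here and return the count of accepted records.
    return len(batch)

def parallel_load(records, batch_size=10_000, workers=8):
    """Split records into batches and load them concurrently."""
    batches = [records[i:i + batch_size]
               for i in range(0, len(records), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(load_batch, batches))
```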
The deck for Practical Microservices as presented at YOW 2013 in Brisbane. Minor changes from the Melbourne event.
Bonus point if you can spot the typo!
This document appears to be a transcript from a presentation on application security and microservices. The summary includes:
1) The presentation discusses security challenges and strategies for microservices architectures, including transport security, authentication, authorization, encryption of data at rest, and perimeter security approaches.
2) Prevention, detection, response and recovery are emphasized as important aspects of a security strategy, along with practices like short-lived credentials, patching, and "repaving" or rebuilding systems on deployments.
3) Managing security risks across polyglot systems is highlighted as a challenge, as is the need to automate security practices and conduct thorough post-mortem analyses of incidents.
MeasureWorks - Why people hate to wait for your website to load (and how to f... (MeasureWorks)
My slides from DrupalJam 2014... About why users abandon your website and best practices to align content and speed to create a fast user experience, and continue to keep it aligned for every release
The more we are connected and the more others are connected to us, the more important the reliability of your sites becomes. Site Reliability Engineering is an engineering discipline devoted to helping an organization sustainably achieve the appropriate level of reliability in its systems, services, and products. But what does this mean, and how do you get started with it? In this session I will talk about the concepts of Site Reliability Engineering and use Microsoft Azure to implement some of the concepts and practices.
VSLive Orlando 2019 - When "We are down" is not good enough. SRE on Azure (Rene Van Osnabrugge)
The more we are connected and the more others are connected to us, the more important the reliability of your sites becomes.
Site Reliability Engineering is an engineering discipline devoted to helping an organization sustainably achieve the appropriate level of reliability in its systems, services, and products. But what does this mean, and how do you get started with it?
In this session I will talk about the concepts of Site Reliability Engineering and use Microsoft Azure to implement some of the concepts and practices.
You will learn:
What is Site Reliability Engineering?
How can you get started with SRE?
How to use Azure to implement some of the SRE concepts?
DOES16 London - Better Faster Cheaper .. How? (John Willis)
This document discusses how to achieve better, faster, and cheaper outcomes through DevOps practices. It argues that high-performing organizations deploy software 30x to 200x more frequently with 60x to 168x higher success rates compared to average performers. The document outlines several strategies to achieve these outcomes, including: establishing a culture of collaboration between Dev and Ops; automating processes; measuring outcomes; and promoting sharing of knowledge. It also discusses adopting service-aligned delivery teams, building everything through a standardized software development lifecycle (SDLC), making work visible, using immutable infrastructure, developing using a microservices architecture, and respecting people. The overall message is that DevOps practices can enable organizations to deliver value faster, at higher quality, and at lower cost.
Ruby on Rails Performance Tuning. Make it faster, make it better (WindyCityRa... (John McCaffrey)
(reposting with clearer title)
Performance tuning presentation from WindyCityRails 2010.
Why performance matters
The right way to approach it
Front end testing tools
Automated testing tools
Common problems and the ways to solve them in Rails
Rails specific tools
bullet
slim_scrooge
rack-bug
request-log-analyzer
rails_indexes
Mobile User Experience: Auto Drive through Performance Metrics (Andreas Grabner)
Believe it or not - 85% of mobile apps are removed after first usage! In this presentation - given at the APM Meetup in Singapore in April 2015 - I talked about the challenges, best practices and especially metrics to avoid this situation.
Key Points of the Presentation
The two key trends "Internet of Things" and "DevOps" play a big role in our lives when we talk about user experience, especially mobile user experience. In this presentation I tell you what metrics to use to make sure you deliver your ideas faster to your mobile end users, while also ensuring the right quality and user experience so that your users stay loyal and don't delete the mobile app after first usage.
Hidden Costs of Chasing the Mythical 'Five Nines' (DevOpsDays DFW)
“Five Nines” refers to 99.999% availability, a figure often treated as synonymous with “highly available.” Does every highly available service require five nines? Not by a long shot. Yet the general state of the practice is to chase this typically unrealistic goal almost blindly, often leading to unnecessarily high costs in both operational and development resources. Even less aggressive availability goals are often over-specified compared to true business drivers.
This talk will cover:
* The history of “five nines”
* Common reasons why many organizations often inadvertently over-specify availability requirements
* The costs of such over-specification
* How service agility is negatively affected
* Examples of highly available systems with reasonable availability requirements
* Techniques on how to avoid over-specification based on Site Reliability Engineering principles
* Ways to spend your Error Budget (once you have one) most effectively
Applying these techniques should result in a more cost-effective service that keeps end users and management happy, and fewer alerts to the on-call DevOps engineer.
The document provides an overview of Daniel Austin's Web Performance Boot Camp. The class aims to (1) provide an understanding of web performance, (2) empower attendees to identify and resolve performance issues, and (3) demonstrate common performance tools. The class covers topics such as the impact of performance on business, definitions of performance, statistical analysis, queuing theory, the OSI model, and the MPPC model for analyzing the multiple components that determine overall web performance. Attendees will learn tools like Excel, web testing tools, browser debugging tools, and optional tools like R and Mathematica.
Understanding what happens on the client side is not easy. When your user visits your website, you need to check their location, device, connection speed, browser, and which page they are visiting.
After gathering all this data, you also need to check what happened. How long did it take for them to see the page? How long until the page was fully loaded and working? If there was a JS error, what was it, and why can't you replicate it? Most users don't have powerful machines with fast connections. In this talk we will analyze the tools you can use to profile the client, synthetic and RUM analysis, and how you can improve performance on the client side. Basic and more advanced tips with real examples.
Case Study: Appriss Supercharges ITSM Efficiency With Process Automation to... (CA Technologies)
Learn how Appriss leverages advanced process automation in IT service management to save lives of crime victims across the United States. Time equals lives. Process efficiency saves time. Learn how you can automate CA Service Desk processes for optimal efficiency.
For more information, please visit http://cainc.to/Nv2VOe
London web perfug_performancefocused_devops_feb2014 (Andreas Grabner)
The document discusses best practices for performance-focused DevOps including metrics for measuring performance throughout the development and deployment process. It provides examples of companies that deploy software frequently and with few errors and outlines the importance of testing, monitoring and addressing performance issues. The document advocates taking a data-driven approach to identifying and resolving problems in order to improve development efficiency and software quality.
This document discusses how scaling teams to support big data growth at Yelp can negatively impact deployment speed due to an exponential increase in the probability of failures as the number of developers increases. It proposes service-oriented architecture and focusing on mean time to recovery rather than just preventing failures as ways to mitigate these risks and maintain rapid iteration. Continuous delivery, reliable but not exhaustive testing, and treating all processes as distributed are also recommended to support scaling teams while preserving deployment speed.
Nagios Conference 2014 - Nate Broderick - SLA - The Marriage of an Effective ... (Nagios)
Nate Broderick's presentation on SLA - The Marriage of an Effective Tool With a Well Planned Architecture.
The presentation was given during the Nagios World Conference North America held Oct 13th - Oct 16th, 2014 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/conference
The size of the pull request is more important than you think (Rodrigo Miguel)
Does the size of the pull request matter? If yes, what should be its ideal size? In this talk, I will explain why it is important to be concerned with the size of the pull requests and what should be taken into account to understand what is the ideal size. We will look at the impact on the quality and speed of development that the size of the pull requests can cause, understand the costs associated with working with large or small PRs, and show the relationship with the queueing theory.
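The queueing-theory relationship the talk mentions can be illustrated with Little's law (L = λW); the numbers below are hypothetical:

```python
def average_time_open(open_prs: float, merge_rate_per_day: float) -> float:
    """Little's law rearranged: W = L / lambda, the average time a PR
    stays open given current work-in-progress and throughput."""
    return open_prs / merge_rate_per_day

# With 10 PRs open and 5 merged per day, a PR waits ~2 days on average;
# smaller PRs that merge faster shrink both the queue and the wait.
wait_days = average_time_open(10, 5)
```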
Principles of microservices - XP Days Ukraine (Sam Newman)
The document outlines principles of microservices, including modeling services around business domains, having a culture of automation, hiding implementation details, decentralizing systems, isolating failures, deploying independently, making systems highly observable, and other principles. The presentation provides examples and discusses strategic goals and architectural practices for designing fine-grained microservice systems.
If you are working on a serious project, you want it to scale. The thing about scale is, you only focus on it once you really need it. I’m the CTO of a soccer social network based in Brazil. To put it mildly, soccer is big in my country. This summer, we focused our marketing on the World Cup, preparing our application to support as many users as possible. To do that, we had to benchmark and improve, but how could we load test? What tool should we use? Those are just some of the questions I'll go through in this talk, which will show you how to address this challenge and stress test your app.
Performance hosting with Ninefold for Spree Apps and Stores (Andrew Sharpe)
Our relationship started when Ninefold chose Spree as the app to performance test our platform. We chose Spree because for Spree apps every millisecond matters. As part of the trip, I presented at Spree's inaugural webinar series on why performance matters for your Spree app and how hosting can help.
This is just the start of exciting work we are doing together. We discussed how Ninefold and Spree can bring better performance to Spree stores: Spree on the technology and Hub side, Ninefold from hosting.
Decreasing false positives in automated testing (Sauce Labs)
QASource presented on reducing false positives in automated testing. Some key points:
1. False positives occur when tests are incorrectly marked as failed when they should have passed. Common causes include reliance on UI elements, synchronization issues, and unstable test code.
2. False positives can impact automation by wasting time investigating failures, decreasing productivity, and obscuring real bugs.
3. Strategies to reduce false positives include using stable locators, short independent tests, dynamic synchronization, teardown logic, and re-execution of failed tests.
4. Eliminating false positives leads to more certainty in test results and reduced costs of automation.
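One of the strategies above, re-execution of failed tests, can be sketched as a simple wrapper (a generic illustration, not QASource's implementation):

```python
import time

def run_with_retries(test_fn, attempts=3, delay_s=1.0):
    """Re-run a flaky test before declaring it failed. This masks only
    transient failures; a consistently failing test still surfaces."""
    last_error = None
    for _ in range(attempts):
        try:
            return test_fn()
        except AssertionError as exc:
            last_error = exc
            time.sleep(delay_s)
    raise last_error
```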
Similar to Reliable observability at scale: Error Budgets for 1,000+ (20)
This document discusses best practices for defining and measuring latency service level objectives (SLOs). It recommends computing SLOs directly from raw log data using histograms, which allow arbitrary percentiles to be derived and are better than averaging sample percentiles. Histograms can also be aggregated over time and used to count the number of requests above a latency threshold regardless of what the threshold was set to initially. Common histogram implementations like HDR-Histogram and t-digest are suggested.
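A toy version of the histogram approach, using fixed-width buckets for brevity (real implementations such as HDR-Histogram use log-linear bucketing to bound relative error):

```python
from collections import Counter

class LatencyHistogram:
    """Toy latency histogram with fixed-width buckets (default 10ms)."""
    def __init__(self, bucket_ms=10):
        self.bucket_ms = bucket_ms
        self.buckets = Counter()
        self.total = 0

    def record(self, latency_ms):
        self.buckets[latency_ms // self.bucket_ms] += 1
        self.total += 1

    def percentile(self, p):
        """Upper bound of the bucket containing the p-th percentile;
        any percentile can be derived after the fact."""
        rank = p / 100 * self.total
        seen = 0
        for bucket in sorted(self.buckets):
            seen += self.buckets[bucket]
            if seen >= rank:
                return (bucket + 1) * self.bucket_ms
        return None

    def count_above(self, threshold_ms):
        """Requests at or above a threshold, regardless of what the
        threshold was when the data was recorded."""
        return sum(n for b, n in self.buckets.items()
                   if b * self.bucket_ms >= threshold_ms)
```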
Comprehensive Container Based Service Monitoring with Kubernetes and Istio (Fred Moyer)
This document summarizes Fred Moyer's talk on comprehensive container-based service monitoring with Kubernetes and Istio. The talk covered Istio architecture and deployment, using the Istio sample bookinfo application, and monitoring the application with Istio metrics and Grafana dashboards. It also discussed Istio Mixer metrics adapters, math and statistics concepts like histograms and quantiles, and monitoring concepts like service level objectives, indicators, and agreements. The talk provided exercises for attendees to deploy sample applications and create custom metrics adapters.
Comprehensive container based service monitoring with kubernetes and istio (Fred Moyer)
The document provides an overview of using Kubernetes and Istio to monitor microservices. It discusses using Istio to collect telemetry data on requests, including rate, errors, and duration. This data can be visualized in Grafana dashboards to monitor key performance indicators. Histograms are recommended to capture request durations as they allow calculating percentiles over time for service level indicators. An Istio metrics adapter is also described that sends telemetry data to Circonus for long-term storage and alerting.
This document provides an overview of key statistical concepts including:
1. The average (arithmetic mean) is calculated by summing all values and dividing by the number of samples.
2. The median is the middle value of a data set when values are sorted from lowest to highest.
3. The 90th percentile represents the value where 90% of values are below it.
4. Standard deviation measures how spread out values are from the average; in a normal distribution, about 68% of values fall within one standard deviation of the average.
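The four concepts above map directly onto Python's standard library (the percentile helper is a simple nearest-rank sketch; the sample data is made up):

```python
import statistics

samples = [120, 150, 160, 180, 200, 210, 250, 300, 450, 900]  # latencies, ms

mean = statistics.mean(samples)      # sum of values / number of samples
median = statistics.median(samples)  # middle value of the sorted data
stdev = statistics.stdev(samples)    # spread of values around the mean

def percentile(data, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(data)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

p90 = percentile(samples, 90)  # 9 of the 10 samples are at or below this
```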
Fred Moyer from Circonus presented on IRONdb and Grafana. IRONdb is a time series database that can replace existing TSDBs without changes to ingestion or visualizations. It provides scale, reliability, and ease of operations. IRONdb is distributed, replicated across multiple datacenters for reliability, and can store years of high-cardinality histogram and metric data. The upcoming IRONdb data source for Grafana will support histograms, stream tags, and Prometheus storage. Attendees could sign up for early access and preview accounts.
Better service monitoring through histograms (Fred Moyer)
This document discusses using histograms and percentiles to better monitor service performance. It begins by noting the limitations of synthetic monitoring and outlines how real user data can provide a more accurate picture. Percentiles like the median and 90th percentile are explained as useful metrics for understanding performance. Histograms of request latency data over time are presented as a way to detect non-normal distributions that could indicate issues. Calculating alerting thresholds based on percentiles rather than averages is advocated to avoid missing multiple high samples. Examples are given of how percentile-based alerting can more effectively detect performance problems and avoid unnecessary alerts.
The Breakup - Logically Sharding a Growing PostgreSQL Database (Fred Moyer)
The document discusses the process of logically sharding a growing PostgreSQL database. It describes the stages involved: diagnosing which tables are largest; evaluating options like account, geographic or hardware-based sharding; scoping the solution by separating tables between a main and marks database; implementing changes including managing transactions and configuration across databases; releasing the changes; and cleaning up afterwards. It emphasizes testing rollback processes, managing technical debt, and bringing empathy to understanding legacy code and configurations.
The document discusses differences between Perl and Go for Perl programmers. It covers Go topics like goroutines (threads), channels (queues), formatting code with gofmt, defining structs instead of hashes/objects, using slices instead of arrays, maps instead of hashes, error handling, importing packages instead of using Perl modules, browsing documentation with godoc instead of perldoc, and getting code with go get instead of cpanminus. It also provides Golang web resources for learning more.
Netfilter was used to solve performance and scalability issues with an existing captive portal solution. A netfilter module was developed that removed port numbers from HTTP requests, allowing most static content to be fetched directly from origin servers rather than through a proxy. This avoided proxying all traffic and achieved better performance than alternatives like Tinyproxy. The netfilter solution worked well technically but did not prove viable long-term for business reasons.
This document discusses Apache::Dispatch, a lightweight abstraction layer for mod_perl applications. It maps URIs to application resources via method handlers, providing the power of mod_perl handlers with a painless migration. The document reviews how Apache::Dispatch works, provides examples of configuration, method handlers, and testing with Apache::Test. It also covers additional Apache::Dispatch features like pre/post-dispatch handlers, inheritance, autoloading, and filtering.
This document discusses the Data::FormValidator module, which provides a simplified way to validate form data in Perl. It allows defining validation profiles that specify required and optional fields, as well as custom and built-in constraint methods. The module takes request parameters, runs validation according to the profile, and returns results that can be easily integrated into templates to display error messages.
Seamless PostgreSQL to Snowflake Data Transfer in 8 Simple Steps (Estuary Flow)
Unlock the full potential of your data by effortlessly migrating from PostgreSQL to Snowflake, the leading cloud data warehouse. This comprehensive guide presents an easy-to-follow 8-step process using Estuary Flow, an open-source data operations platform designed to simplify data pipelines.
Discover how to seamlessly transfer your PostgreSQL data to Snowflake, leveraging Estuary Flow's intuitive interface and powerful real-time replication capabilities. Harness the power of both platforms to create a robust data ecosystem that drives business intelligence, analytics, and data-driven decision-making.
Key Takeaways:
1. Effortless Migration: Learn how to migrate your PostgreSQL data to Snowflake in 8 simple steps, even with limited technical expertise.
2. Real-Time Insights: Achieve near-instantaneous data syncing for up-to-the-minute analytics and reporting.
3. Cost-Effective Solution: Lower your total cost of ownership (TCO) with Estuary Flow's efficient and scalable architecture.
4. Seamless Integration: Combine the strengths of PostgreSQL's transactional power with Snowflake's cloud-native scalability and data warehousing features.
Don't miss out on this opportunity to unlock the full potential of your data. Read & Download this comprehensive guide now and embark on a seamless data journey from PostgreSQL to Snowflake with Estuary Flow!
Try it Free: https://dashboard.estuary.dev/register
Are you wondering how to migrate to the Cloud? At the ITB session, we addressed the challenge of managing multiple ColdFusion licenses and AWS EC2 instances. Discover how you can consolidate with just one EC2 instance capable of running over 50 apps using CommandBox ColdFusion. This solution supports both ColdFusion flavors and includes cb-websites, a GoLang binary for managing CommandBox websites.
React Native vs Flutter (SSTech System)
Your project needs and long-term objectives will ultimately determine whether to use React Native or Flutter. For applications built with JavaScript and modern web technologies in particular, React Native is a mature and trustworthy choice. For projects that value performance and customizability across many platforms, Flutter, on the other hand, provides outstanding performance and a unified UI development experience.
In this talk, we will explore strategies to optimize the success rate of storing and retaining new information. We will discuss scientifically proven ideal learning intervals and content structures. Additionally, we will examine how to create an environment that improves our focus while you remain in the “flow”. Lastly we will also address the influence of AI on learning capabilities.
In the dynamic field of software development, this knowledge will empower you to accelerate your learning curve and support others in their learning journeys.
20. EXAMPLE SLIS
95th percentile home page latency over 5 minutes < 500ms
Home page request response code != 5xx
Home page request served in < 100ms
#observabilitysummit @phredmoyer
21. EXAMPLE SLIS
95th percentile home page latency over 5 minutes < 500ms
Home page request response code != 5xx
Home page request served in < 100ms
Metric Identifier
[Metric Identifier] [Operator] [Metric Value]
22. EXAMPLE SLIS
95th percentile home page latency over 5 minutes < 500ms
Home page request response code != 5xx
Home page request served in < 100ms
Operator
[Metric Identifier] [Operator] [Metric Value]
23. EXAMPLE SLIS
95th percentile home page latency over 5 minutes < 500ms
Home page request response code != 5xx
Home page request served in < 100ms
Metric Value
[Metric Identifier] [Operator] [Metric Value]
24. EXAMPLE SLIS
95th percentile home page latency over 5 minutes < 500ms
Home page request response code != 5xx
Home page request served in < 100ms
[Metric Identifier] [Operator] [Metric Value]
27. EXAMPLE SLOS
99% of 95th percentile home page latency over 5 minutes < 500ms over the trailing month
99% of home page request response code != 5xx over last 7 days
95% of home page requests served in < 100ms over last 24 hours
28. [Success Objective] [SLI] [Period]
Success Objective
99% of 95th percentile home page latency
over 5 minutes < 500ms over the trailing
month
99% of home page request response code
!= 5xx over last 7 days
95% of home page requests served in <
100ms over last 24 hours
EXAMPLE SLOS
#observabilitysummit @phredmoyer
29. EXAMPLE SLOS
[Success Objective] [SLI] [Period]
SLI
99% of 95th percentile home page latency
over 5 minutes < 500ms over the trailing
month
99% of home page request response code
!= 5xx over last 7 days
95% of home page requests served in <
100ms over last 24 hours
#observabilitysummit @phredmoyer
30. EXAMPLE SLOS
99% of 95th percentile home page latency
over 5 minutes < 500ms over the trailing
month
99% of home page request response code
!= 5xx over last 7 days
95% of home page requests served in <
100ms over last 24 hours
[Success Objective] [SLI] [Period]
Period
#observabilitysummit @phredmoyer
31. EXAMPLE SLOS
99% of 95th percentile home page latency
over 5 minutes < 500ms over the trailing
month
99% of home page request response code
!= 5xx over last 7 days
95% of home page requests served in <
100ms over last 24 hours
[Success Objective] [SLI] [Period]
#observabilitysummit @phredmoyer
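A minimal sketch of checking a [Success Objective] [SLI] [Period] SLO, assuming you already have counters of good and total SLI evaluations for the period (the function name is my own, not from the deck):

```python
# Hypothetical sketch: an SLO is met when the fraction of good
# SLI evaluations over the period reaches the success objective.
def slo_met(good_count, total_count, objective):
    """objective is a fraction, e.g. 0.99 for '99% of requests'."""
    if total_count == 0:
        return True  # no traffic in the period, nothing violated
    return good_count / total_count >= objective

# "95% of home page requests served in < 100ms over last 24 hours"
print(slo_met(good_count=9_600, total_count=10_000, objective=0.95))  # True
print(slo_met(good_count=9_400, total_count=10_000, objective=0.95))  # False
```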
34. EXAMPLE EBS
Allow 1% failure of 95th percentile home page latency over 5 minutes < 500ms, over the trailing month
Allow 1% failure of home page request response codes != 5xx, over the last 7 days
Allow 5% failure of home page requests served in < 100ms, over the last 24 hours

Each error budget follows the pattern:
[Error Budget] [SLI] [Period]
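The error budget is the complement of the SLO: a 99% objective allows 1% failure. A minimal sketch of tracking how much of that budget has been burned (names are illustrative, not from the deck):

```python
def error_budget_remaining(bad_count, total_count, slo_objective):
    """Fraction of the period's error budget still unspent (can go negative)."""
    budget = 1.0 - slo_objective          # e.g. 0.01 for a 99% SLO
    burned = bad_count / total_count      # observed failure rate
    return (budget - burned) / budget     # 1.0 = untouched, 0.0 = exhausted

# 99% SLO -> 1% budget; 40 bad out of 10,000 requests burns 0.4% of
# traffic, i.e. 40% of the budget, leaving roughly 60% remaining.
print(error_budget_remaining(40, 10_000, 0.99))
```

A negative result means the budget is overspent and the SLO is violated for the period.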
39. Keys to Error Budget Democratization
Real world examples that are easy to reference
Formulas that can be parsed by humans and code
Be explicit; small details make big differences
43. StatsD - not just for servers
Measuring service performance is (mostly) easy
Client apps are more difficult
Disconnects
Caching (CDN, Proxy)
Large browser & device variance
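The StatsD wire protocol itself is simple enough to emit from almost anywhere, which is why it is "not just for servers". A minimal sketch of sending a counter datagram (the metric names are hypothetical; client apps would typically report via a collection endpoint that relays to StatsD rather than hitting it directly):

```python
import socket

# Minimal sketch of the StatsD wire protocol: counters are plain
# UDP datagrams of the form "name:value|c".
def statsd_incr(name, value=1, host="127.0.0.1", port=8125):
    payload = f"{name}:{value}|c".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

# Count one good and one bad home-page request (hypothetical names):
statsd_incr("homepage.request.good")
statsd_incr("homepage.request.bad")
```

Because the transport is fire-and-forget UDP, instrumentation adds negligible latency to the measured request path.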
44. Logs, Traces, Metrics
Conway’s Law; experts for each ‘pillar’
Democratize Expertise
#ask-sre
Reliability Champions
`Observability 101`
`Hands On With Datadog`
46. Metrics for SLIs
Lies, Darn Lies, and Percentiles
Easy to get the math wrong
Missing the X Factor - Sample Volume
Many vendors have bugs in percentile tools
Can’t aggregate them (well, most of them)
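The aggregation problem can be shown in a few lines: the p95 of combined traffic is not the average of per-host p95s. A deterministic toy example (hosts and values are invented for illustration):

```python
# The "can't aggregate" problem in miniature.
def p95(samples):
    s = sorted(samples)
    return s[int(0.95 * len(s)) - 1]

host_a = [10] * 95 + [100] * 5    # healthy host: p95 = 10ms
host_b = [10] * 50 + [100] * 50   # degraded host: p95 = 100ms

avg_of_p95s = (p95(host_a) + p95(host_b)) / 2   # 55.0 -- misleading
true_p95 = p95(host_a + host_b)                 # 100  -- actual tail
print(avg_of_p95s, true_p95)
```

Averaging the per-host percentiles reports 55ms while the real tail of the combined traffic is 100ms, which is exactly the kind of math error the slide warns about.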
47. Metrics for SLIs
Counters
Easy to understand
Easy to implement
Easy to aggregate
Easy to get the math right
48. Metrics for SLIs
Latency SLIs via counters
Request time < 500ms
Count 'em up, divide by total reqs
Add success objective and time range for SLO
99% of request times < 500ms over trailing week
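The counter-based approach above can be sketched directly: count requests under the threshold, divide by the total, and compare against the success objective (the sample latencies here are invented):

```python
# Counter-based latency SLI: two counters are all the state needed.
good = 0
total = 0

def observe(latency_ms, threshold_ms=500):
    global good, total
    total += 1
    if latency_ms < threshold_ms:
        good += 1

for ms in [120, 480, 90, 650, 200, 510, 75, 300, 410, 95]:
    observe(ms)

ratio = good / total
print(f"{ratio:.0%} of request times < 500ms")  # 80% for this sample
print("SLO met" if ratio >= 0.99 else "SLO missed")
```

Two integers per SLI aggregate trivially across hosts, which is the property percentiles lack.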
50. Metrics for SLIs
Flexible Latency SLIs
Histogram based
# reqs 100-200ms, 200-300ms, etc
One time series for each latency band
zen.app.request.sli{path:/foo;bin:gt_500_le_600}
51. Metrics for SLIs
Flexible Latency SLIs
10..20...100ms (10ms steps)
100..200...1,000ms (100ms steps)
1,000..1,500...10,000ms (500ms steps)
10,000..15,000...60,000ms (5,000ms steps)
Latency == 547ms maps to metric tag `gt_500_le_600` (upper bound `le_600`)
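A sketch of the banding scheme above, with fine bins at low latency and coarser bins as latency grows. The exact edge list is my reconstruction of the slide's ranges, and the helper names are hypothetical:

```python
import bisect

# Band edges: 10ms steps to 100ms, 100ms steps to 1s,
# 500ms steps to 10s, 5,000ms steps to 60s.
EDGES = (list(range(10, 101, 10))
         + list(range(200, 1001, 100))
         + list(range(1500, 10001, 500))
         + list(range(15000, 60001, 5000)))

def band_tag(latency_ms):
    """Map a latency to its 'gt_<low>_le_<high>' band tag."""
    i = bisect.bisect_left(EDGES, latency_ms)
    if i == 0:
        return f"le_{EDGES[0]}"
    if i == len(EDGES):
        return f"gt_{EDGES[-1]}"
    return f"gt_{EDGES[i-1]}_le_{EDGES[i]}"

print(band_tag(547))  # gt_500_le_600, matching the slide's example
```

Each tag value becomes one time series, so the total series count is bounded by the number of bands times the number of tagged paths.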
52. Metrics for SLIs
Flexible Latency SLIs
Low approximation error within each latency band
Not as precise as HDR Histograms
Possible cardinality expansion issues
Can implement on any monitoring vendor or TSDB
62. Different SLOs/EBs for Different Folks
99% of home page requests < 500ms over...
5 minutes - NOC / SRE
1 hour - Product Engineers
1 week - Product Managers
1 month - VPs
1 quarter - CXOs
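One SLI stream can serve every audience above by evaluating it over different trailing windows. A minimal sketch, assuming per-minute (good, total) buckets; the audience-to-window mapping mirrors the slide, and the function names are my own:

```python
from collections import deque
import itertools

# Trailing-window sizes in minutes for three of the slide's audiences.
WINDOWS_MIN = {"NOC/SRE": 5, "Product Eng": 60, "PM": 7 * 24 * 60}

def attainment(buckets, window):
    """SLO attainment over the last `window` minute buckets."""
    recent = list(itertools.islice(reversed(buckets), window))
    good = sum(g for g, _ in recent)
    total = sum(t for _, t in recent)
    return good / total if total else 1.0

# 60 minutes of traffic at 100 req/min, with a bad final 5 minutes.
buckets = deque([(100, 100)] * 55 + [(50, 100)] * 5)
for audience, window in WINDOWS_MIN.items():
    w = min(window, len(buckets))
    print(audience, f"{attainment(buckets, w):.1%}")
```

The same incident reads as a 50% attainment page for the NOC but only a small dent in the hourly number, which is exactly why each audience gets its own window.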
63. Keys to Error Budgets at Scale
Give everyone a formula to follow for SLIs/SLOs/EBs
Use simple tools that can deliver rich results
Use latency bands (histograms) for duration data
Measure SLIs as close to the client as possible
Use EBs with appropriate time ranges for audiences