True Observability
Why, What & Architecture
Jeremy Proffitt
Ally Financial
Director DevOps & SRE C3
Even the most
dependable employee
is not 100% reliable
We’ll discuss logs vs metrics, Active Checks, infrastructure beyond
just memory, cpu and hard drive - and APM monitoring of
applications
What We Monitor & Why
01
Discussion on how uptime and outages impact customer
satisfaction and revenue.
Production Reliability
02
The largest gap in the industry today is the lack of tools that bring
together our ability to inventory our systems and validate monitoring.
Auditing Your Monitoring
03
The reality of getting started, the tools and what they give you, and finally,
what gives the best return on your investment.
How Do I Get Started?
04
Agenda
Logs vs Metrics
Imagine collecting information such as when your
application throws an error, the time required to
perform actions, or client activity such as logging in.
Logs allow us to capture events, in detail, and store
them for analysis. Among the most common logs are
access logs for web services, i.e. a record of who has
visited your website, what page they visited, their browser
type and even how long it took to return information to the
user.
Metrics are the aggregation of events in a time period,
for example, 1,045 website visits in the last minute.
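To make the distinction concrete, here is a minimal Python sketch that turns individual log events into a per-minute metric. The event layout, the sample data and the name `visits_per_minute` are assumptions for illustration, not the format of any particular logging tool.

```python
from collections import Counter
from datetime import datetime

# Hypothetical access-log events: (ISO timestamp, path, response time in ms).
ACCESS_LOG = [
    ("2024-01-01T10:00:05", "/login", 120),
    ("2024-01-01T10:00:40", "/home", 85),
    ("2024-01-01T10:01:10", "/home", 430),
]

def visits_per_minute(events):
    """Aggregate individual log events (logs) into a count per minute (a metric)."""
    counts = Counter()
    for timestamp, _path, _elapsed_ms in events:
        # Truncate each event's timestamp to the start of its minute.
        minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
        counts[minute] += 1
    return dict(counts)
```

The log keeps every detail of each visit, while the metric keeps only the count. Once aggregated, the individual events can no longer be recovered from the metric.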

Logging
The raw truth
Because each log entry is individual, logs allow us
to pick out singular events, such
as a single error or delayed
response. Log entries are often chained
together with other logs to form a
story showing how users and
information flow through your
system.
We don’t just look for errors. The absence of logs from a source,
either looked at individually or as a count by server, can provide us
valuable insight when a server or service goes offline.
Logs provide us raw data to aggregate in differing ways,
allowing us to generate queries to answer specific questions.
By watching for specific events in logs, for example high response times
or errors, we can build alerting to inform us when systems are not
performing correctly.
Not only can we look for the absence of logs, but we can track
users page to page to determine where users are leaving, or even whether a
website isn’t allowing users to advance to the next step.
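Watching for the absence of logs can be sketched in a few lines of Python. The function name `silent_sources`, the five minute default and the idea of keeping a "last seen" timestamp per source are all assumptions made for this example.

```python
from datetime import datetime, timedelta

def silent_sources(last_seen, now, max_silence=timedelta(minutes=5)):
    """Return the sources whose most recent log line is older than max_silence.

    last_seen maps a source name (for example a server) to the timestamp
    of the last log entry received from it.
    """
    return sorted(
        name for name, seen in last_seen.items()
        if now - seen > max_silence
    )
```

A source that normally logs every few seconds and has been quiet for several minutes is a strong hint that the server or service is offline, even though no error was ever logged.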
Metrics
The truth in
aggregate
When we look at logs, the
majority of the time we’re looking
at an aggregate view: how many
of what in what amount of time.
Multiple cloud providers now
provide this information as an easy-to-use
metric. For example, the
number of visitors, average
response time or even the number of
times error pages are displayed.
In the cloud, the same metric is often available across multiple provider
offerings. For example, HTTP 500-level web errors are a metric common to Load
Balancers, CloudFront and other web services in AWS.
Unlike logs, aggregated metrics are specifically set to only
count in a singular, boilerplate, out-of-the-box way.
Alerting on metrics provides us with a quick and simple method of asking: do
I have any errors? Is my web server running slow? Am I using too much
memory, CPU or disk space?
Metrics are less expensive to generate and keep than logs. They are
a low-cost method of wrapping our systems in layers of alerting, with
the downside of not being able to drill into a singular event.
Watching Infrastructure - aka “The Servers”.
Memory usage on a server should be monitored.
Some systems will automatically reset your
container or application when out of memory, and
yet others simply produce unpredictable results.
Hard drives, or
where the data often
goes, are critical.
Imagine not being
able to write a new
customer’s order to
a database because
you ran out of space.
Host Health - is the
physical box
healthy, is the
operating system
working?
CPU on a server should be monitored. Low
CPU availability can lead to longer wait
times for customers and even timeouts or
failure to process items in a set amount of
time.
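As a small illustration of the disk space check, the sketch below uses Python's standard `shutil.disk_usage`. The function names and the 90% limit are assumptions chosen for the example. Memory and CPU readings need platform-specific sources and are left out.

```python
import shutil

def disk_percent_used(path="/"):
    """Percentage of the filesystem holding `path` that is currently used."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def is_low_on_space(path="/", limit_percent=90.0):
    """True when usage has reached the (assumed) limit and someone should look."""
    return disk_percent_used(path) >= limit_percent
```

In practice a monitoring agent or the cloud provider collects this number for you. The point is only that the underlying check is simple, so there is little excuse for discovering a full disk from a failed customer order.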
Serverless architecture, regardless of provider or advertised features, must still run
on a server. Lambdas still require memory and processing time - metrics which
must be kept in check. Because you pay for what you use, if your Lambdas take
twice as long now as they did yesterday, they will cost you twice as much. And
serverless databases are often constrained by queries per minute and concurrent
connections, and can automatically scale without your consent.
While there are many advantages to Serverless, in some ways it’s like giving a
teenager a cell phone - be careful, the overages can kill you!!
Memory
Monitor memory usage for your
serverless applications.
Processing Time
Time is literally money, the longer
your process takes, the more it
costs.
Connections & Queries
Be very mindful of the limitations
and cost of serverless data
storage mechanisms.
Scaling up is Paying Up
When traffic increases, the more
serverless resources increase,
which means the more you pay.
Hey! But ...
I’m SERVERLESS!
YO! It still runs on a server.
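The "time is money" point can be worked through with a little arithmetic. This sketch assumes a pay-per-use model billed on memory multiplied by duration (GB-seconds). The price used below is a made-up illustrative number, so check your provider's current pricing before relying on any figure.

```python
def invocation_cost(duration_ms, memory_mb, price_per_gb_second):
    """Cost of one invocation under an assumed GB-second billing model."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price_per_gb_second

# Hypothetical price, for illustration only.
PRICE = 0.00002

yesterday = invocation_cost(duration_ms=200, memory_mb=512, price_per_gb_second=PRICE)
today = invocation_cost(duration_ms=400, memory_mb=512, price_per_gb_second=PRICE)
```

With the memory setting unchanged, doubling the duration doubles the cost of every invocation, which is why duration and memory deserve alerts even when there is no server to log into.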

APM
Monitoring Your Code
WARNING!!!!
APMs often will not capture any error a
developer has already caught in a try/catch
statement! Imagine the frustration of
not seeing all the errors you were promised!
Most APMs provide a method of making a
code change to send the exception to your
APM - but it substantially increases the
rollout cost of an APM.
APM, or Application Performance Monitoring, is fast becoming a
new tool for improving reliability and development cycles. With
the ability to simply drop modules into your code, server or
container, APMs promise the visibility to track errors and
slowdowns in your code, along with inter-system operability.
What?
APM often hooks into the “back door” of code, providing us
essential information such as how long an application waits for
an API call or database call to complete, the memory
usage, often broken out by different types or heaps of memory,
and of course, application exceptions and errors.
APM can also allow you to see transactions as they flow from
and to different applications if they all have the same APM
instrumentation, allowing you to see a map or path traveled and
assisting in pinpointing issues in larger systems.
It should be noted that APMs are often metric-based systems, and
are often known to use only a sampling or subset of the actual
data to represent information on the screen. While still useful,
this can introduce subtle differences in APMs, making on-the-fence
or hard-to-find issues even more difficult to track.
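The warning about handled exceptions can be illustrated with a short Python sketch. `report_to_apm` is a hypothetical stand-in for whatever "record this exception" call your APM agent offers. It is not a real vendor API, and `save_order` is invented for the example.

```python
import logging

logger = logging.getLogger("orders")

def report_to_apm(exc):
    """Hypothetical stand-in for an APM agent's 'record this exception' call."""
    logger.error("reported to APM: %r", exc)

def save_order(order, write, report=report_to_apm):
    """Try to persist an order; handled failures are reported explicitly."""
    try:
        write(order)
        return True
    except OSError as exc:
        # The exception is handled here, so it never propagates to the place
        # where an APM agent would normally notice it. Report it by hand.
        report(exc)
        return False
```

Every handler like this one needs the same kind of change, which is where the extra rollout cost comes from.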
Active Monitoring
Synthetic Monitoring
Actively accessing your site, simulating a user
experience. Often includes a robotic login,
and clicking about to validate functionality of a
site
Watch Dogs
To ensure a system is alive, we often have it send
an outbound signal, or make an HTTP web
call, every x minutes. If no call comes in
after a period of time, an alert is fired.
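The watch dog idea can be sketched as a small Python class. The class name, its methods and the choice to treat "never heard from" as overdue are assumptions for this example.

```python
from datetime import datetime, timedelta

class Watchdog:
    """Tracks check-ins from a system and reports when they stop arriving."""

    def __init__(self, max_gap):
        self.max_gap = max_gap          # longest silence we tolerate
        self.last_heartbeat = None      # no check-in received yet

    def heartbeat(self, when):
        """Record that the monitored system checked in at `when`."""
        self.last_heartbeat = when

    def is_overdue(self, now):
        """True when an alert should fire because check-ins have stopped."""
        if self.last_heartbeat is None:
            return True                 # assumption: never seen counts as overdue
        return now - self.last_heartbeat > self.max_gap
```

The useful property is that the alert fires on silence, so it still works when the monitored system is too broken to report its own failure.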
Simple Web Monitoring
Simply accessing a web page or API and
checking for specific results or result codes (like
HTTP 200 success responses) provides a simple
way of asking: is my website or API alive?
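A simple web check of this kind might look like the following Python sketch, using only the standard library. The function name, its parameters and the `opener` argument (there so the check can be exercised without a network) are assumptions for illustration.

```python
import urllib.error
import urllib.request

def check_endpoint(url, expected_status=200, must_contain=None,
                   timeout=5.0, opener=urllib.request.urlopen):
    """Return (healthy, detail) for one HTTP GET against `url`."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # urlopen raises for 4xx/5xx responses; report the status code.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, timeouts and refused connections.
        return False, f"unreachable: {exc}"
    if status != expected_status:
        return False, f"unexpected status {status}"
    if must_contain is not None and must_contain not in body:
        return False, "expected text missing"
    return True, "ok"
```

Checking for expected text as well as the status code catches the case where the server happily returns 200 with an error page.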
Certificate Validation
Actively checking the HTTPS secure
certificates and providing notification 15 to 30
days before they expire will save the
embarrassment of expired certificates causing
revenue and user experience impacts.
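Certificate validation can be sketched with Python's standard `ssl` and `socket` modules. The function names and the 30 day default are assumptions for this example. Note that with the default context, a certificate that has already expired fails the TLS handshake with an error instead of returning a date.

```python
import socket
import ssl
from datetime import datetime, timezone

def fetch_not_after(host, port=443, timeout=5.0):
    """Connect over TLS and return the server certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]

def days_until_expiry(not_after, now=None):
    """Whole days left before a notAfter timestamp such as 'Jun  1 00:00:00 2030 GMT'."""
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days

def needs_renewal(not_after, warn_days=30, now=None):
    """True when the certificate expires within the warning window."""
    return days_until_expiry(not_after, now) <= warn_days
```

A daily job could call `needs_renewal(fetch_not_after("www.example.com"))` for each public hostname and raise a warning during working hours.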
The IMPACT!
Revenue & Customer Experience
The false belief that downtime is preventable!
Repeat after me,
“Downtime will happen.”,
say it again,
“Downtime will happen!”
There is only one promise I can make as we go through this presentation today - at
some point, systems have downtime - it’s absolutely unavoidable. How we plan to
recover from those outages, the decisions to build out redundancy, and the impact not just
on revenue but on reputation are all part of a complex equation.
Beyond the discussion of architecture of applications and systems, redundancy through
the use of multiple servers, resources, data centers and networking layers is all part
of a larger business discussion. I’ll say that again: for the most part, reliability through
redundant resources is a business decision, not a technical one.
When reviewing downtime, customer and revenue impact should be thought of as a
time-based equation - and downtime for customers can have crippling impacts on
app ratings, word of mouth or, worse, customer rants warding off new
customers.
Be aware of the 100% uptime
promise! It’s often riddled with
exceptions for “emergency”
maintenance windows used to
cover up production issues.
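The time-based view of downtime lends itself to a quick worked calculation. The sketch below is a simplification under stated assumptions: a 365 day year, and revenue that is lost evenly per minute of outage. The function names and the revenue figure are invented for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def allowed_downtime_minutes(uptime_percent):
    """Minutes of downtime per year that an uptime target still permits."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)

def outage_revenue_impact(outage_minutes, revenue_per_minute):
    """Direct revenue lost, assuming revenue accrues evenly over time."""
    return outage_minutes * revenue_per_minute
```

A 99.9% target still permits roughly 525.6 minutes, or under nine hours, of downtime per year. At a hypothetical 200 in revenue per minute, a 30 minute outage costs 6,000 directly, before counting any damage to reputation.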

The IMPACT!
Communication - Internal & External
The false belief that downtime is preventable!
Communication is about perception - and is perhaps one of the most important
aspects of both an outage and career advancement.
Communication is a balance - do we communicate a 5 minute database
slowdown to our customers and CEO? Likely no. The general rule I’ve used is: if it’s over
before the communication can be sent, or the impact is close to resolution, we’ll
typically not send communication beyond IT leadership. Now this rule takes a fair amount
of hands-on learning for your organization. Be flexible, and validate your communication is
appropriate and accurate.
We communicate to our direct supervisor and hopefully up the chain to our CIO/CTO
because you never want these people to be asked “what’s going on?” without
immediately having an answer for other members of the C-Suite. This empowers your
managers and will have a positive impact on your career.
Communication to customers is a business decision, and for this I would bring in
marketing and legal for a discussion of pre-canned messages. This is extremely
critical for larger outages - and remember, think about how you’d feel as a customer.
When you have an outage, nothing
is better than the truth. Want to
know how to do it right? Research
Johnson and Johnson’s Tylenol
Recall.
Monitoring alerts can be defined as code and pushed out through
APIs to different systems, saving the time of manual implementation
and reducing the cost of errors made during manual implementation.
Monitoring as code
Cloud providers have the capability of giving you a list of all the
architecture in your account via an API call, library or command line
interface. This allows us to ask, what servers do I have?
Cloud Systems and APIs
Tagging on architecture is typically descriptive, like “MySQL Customer
Database”. If we look at this deeper, we find example after example of
using tags to define how other parts of architecture interact.
Tagging - the what is it
Not only can we ask a cloud provider for a list of all cloud architecture
elements, we can do the same in our monitoring system. By
comparing these lists, we can validate that all systems are being monitored.
The power of the Audit
Monitoring Auditing - Beyond Monitoring as Code
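The audit itself comes down to comparing two lists. In this Python sketch, both lists are assumed to have been fetched already (one from the cloud provider's inventory API, one from the monitoring system), and the function and key names are hypothetical.

```python
def audit_monitoring(inventory, monitored):
    """Compare what exists in the cloud account with what is being monitored."""
    inventory, monitored = set(inventory), set(monitored)
    return {
        # Exists in the cloud account but has no monitoring: the dangerous gap.
        "unmonitored": sorted(inventory - monitored),
        # Monitored but no longer exists: stale alerts worth cleaning up.
        "stale": sorted(monitored - inventory),
    }
```

Run on a schedule, a check like this answers "is everything monitored?" with data instead of assumption.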
Monitoring as Code - Rolling out modifiable templates
Templates
Step through it!
Roll It Out!
Rinse and Repeat
Monitoring as Code
Monitoring is often rolled out based
on separate applications and needs,
and as multiple
independent systems.
But there is a more elegant way to
roll out automated monitoring using
code - here’s how.
Start with Monitoring Templates
Review all architecture you use or
want to use, and generate
alerts to address your business
needs. From these, you can build
a standard set of alerts.
Step through your
architecture!
Step through each object in
your cloud account -
capturing both the object
name, and tags defining
custom template
adjustments.
Roll it out!
Now that you have the
objects and their custom
adjustments, alerting can be
created or updated using the
templates.
Rinse and Repeat
Repeating the rollout daily
ensures manual changes are
rolled back and new
architecture is monitored
quickly - and without
intervention.
Did you know?
None of the current leaders in
Monitoring offer a system so simple
or clean - and one has to wonder
why?
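The template rollout can be sketched in Python as follows. Everything named here is an assumption for illustration: the template contents, the `monitor_template` tag that selects a template, and the `alert_<metric>` tags that override a threshold. A real rollout would then push the resulting alerts to the monitoring system's API.

```python
# Hypothetical standard alert templates, keyed by the kind of architecture.
TEMPLATES = {
    "database": [
        {"metric": "disk_used_percent", "threshold": 80},
        {"metric": "cpu_percent", "threshold": 90},
    ],
    "web": [
        {"metric": "http_5xx_count", "threshold": 10},
    ],
}

def build_alerts(resources, templates=TEMPLATES):
    """Step through each resource and apply its template plus tag overrides."""
    alerts = []
    for resource in resources:
        tags = resource.get("tags", {})
        for rule in templates.get(tags.get("monitor_template"), []):
            alert = dict(rule, resource=resource["name"])
            # A tag such as alert_disk_used_percent=70 adjusts the template.
            override = tags.get(f"alert_{rule['metric']}")
            if override is not None:
                alert["threshold"] = float(override)
            alerts.append(alert)
    return alerts
```

Because the output is derived entirely from the templates and tags, running it daily recreates any alert that was changed by hand and picks up new architecture automatically.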

Routing your Monitoring Alerts
Ensure your alerts include metadata
that can be used to determine
importance and team ownership.
Alerts Generated
Processing alerts can be complex:
routing alerts to teams and setting
hours for specific levels of alerting.
Alert Processing (PagerDuty)
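Routing on alert metadata can be sketched as a small Python function. The metadata keys (`team`, `environment`, `severity`) and the three outcomes are assumptions for this example and do not reflect PagerDuty's actual configuration model.

```python
def route_alert(alert, business_hours):
    """Decide how an alert is delivered, based on its metadata.

    Returns (action, team). Only critical production alerts page someone;
    everything else waits for working hours.
    """
    team = alert.get("team", "unowned")
    if alert.get("environment") == "production" and alert.get("severity") == "critical":
        return ("page", team)
    if business_hours:
        return ("ticket", team)
    return ("queue_for_next_business_day", team)
```

Alerts missing the ownership metadata fall to an "unowned" bucket, which is itself worth reviewing, since an alert nobody owns is an alert nobody fixes.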
Alert Fatigue, the cry wolf of IT, is when alerts which don’t require
immediate resolution are routed incorrectly, or their importance is not in
alignment with the company's needs. I.e. when you get alerts that you
shouldn’t be getting and they continuously wake you up at 3am, those LinkedIn
invites for a new job become more and more tempting.
Alert Fatigue - the fastest way to start searching for new teammates
When you reach out to employees after hours, you
should have an escalation policy to reach out to the backup
person automatically.
It’s very important that expectations are set with the
entire team: does a laptop travel with on-call engineers?
Do engineers come in late if they’ve been up all night?
When do you wake up your entire team for a team
response?
Set Paging Escalation and Expectations
Alerts vs Warnings! Consider adding
warnings for most alerts, for example hard
drive space. This allows support during working
hours, which can prevent the 3am wake up.
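The warning-before-alert idea amounts to two thresholds on the same metric. A minimal Python sketch, with the function name and the example thresholds chosen purely for illustration:

```python
def classify(value, warn_at, alert_at):
    """Map a metric reading to 'ok', 'warning' or 'alert'."""
    if value >= alert_at:
        return "alert"      # needs someone now, even at 3am
    if value >= warn_at:
        return "warning"    # can wait for working hours
    return "ok"
```

For disk space, a warning at 75% and an alert at 90% (both hypothetical values) gives the team time to act on the warning during the day, so the alert rarely fires at all.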
Getting Started
Monitoring and Alerting has the highest return on investment you’ll likely see in
your business, if you keep the costs under control.
Starting with the alerting capabilities of the cloud provider is the quickest and least
expensive way to start, and it helps you understand the data available from your
systems when and if you purchase a third party tool.
Generating a script to apply templates should be quick and simple. Adding tag-based
rule exceptions, or even entire template versioning, provides a level of
coverage quickly and at a low cost.
Ensure that when monitoring is triggered, you are treating the alert appropriately,
i.e. dev servers shouldn’t wake you at 3am. Keeping
production alerts with direct and immediate impact going to engineers, while
capturing other alerts for processing the next business day, will not only keep
your staff happy, but prevent churn in an industry where demand is very high.
Finally, talk about outages and downtime as a business decision. Like a fire drill,
know what the response will be, know who is required to do what and who their
backup is, and ultimately determine what is to be released, how, and to whom.
Getting Started!

More Related Content

Similar to Building Reliability - The Realities of Observability

Solving 21st Century App Performance Problems Without 21 People
Solving 21st Century App Performance Problems Without 21 PeopleSolving 21st Century App Performance Problems Without 21 People
Solving 21st Century App Performance Problems Without 21 People
Dynatrace
 
Starting Your DevOps Journey – Practical Tips for Ops
Starting Your DevOps Journey – Practical Tips for OpsStarting Your DevOps Journey – Practical Tips for Ops
Starting Your DevOps Journey – Practical Tips for Ops
Dynatrace
 
Whitepaper: Volume Testing Thick Clients and Databases
Whitepaper:  Volume Testing Thick Clients and DatabasesWhitepaper:  Volume Testing Thick Clients and Databases
Whitepaper: Volume Testing Thick Clients and Databases
RTTS
 
2020 10-08 measuring-qualityinproduction
2020 10-08 measuring-qualityinproduction2020 10-08 measuring-qualityinproduction
2020 10-08 measuring-qualityinproduction
Abigail Bangser
 
Defect Tracking Tool
Defect Tracking ToolDefect Tracking Tool
Defect Tracking Tool
ncct
 
The Evolution of a Scrappy Startup to a Successful Web Service
The Evolution of a Scrappy Startup to a Successful Web ServiceThe Evolution of a Scrappy Startup to a Successful Web Service
The Evolution of a Scrappy Startup to a Successful Web Service
Poornima Vijayashanker
 
Cloud investment buyers guide
Building Reliability - The Realities of Observability

  • 1. True Observability: Why, What & Architecture. Jeremy Proffitt, Ally Financial, Director DevOps & SRE C3
  • 2. Even the most dependable employee is not 100% reliable.
  • 3. Agenda. 01 What We Monitor & Why: We’ll discuss logs vs. metrics, active checks, infrastructure beyond just memory, CPU and hard drive, and APM monitoring of applications. 02 Production Reliability: A discussion of how uptime and outages impact customer satisfaction and revenue. 03 Auditing Your Monitoring: The largest gap in the industry today is the tooling required to bring together our ability to inventory our systems and validate monitoring. 04 How Do I Get Started?: The reality of getting started, the tools and what they give you, and finally, what gives the best return on your investment.
  • 4. Logs vs. Metrics: Imagine collecting information such as when your application throws an error, the time required to perform actions, or client activity such as logging in. Logs allow us to capture events in detail and store them for analysis. Among the most common logs are access logs for web services, i.e. a record of who has visited your website, which page they visited, their browser type and even how long it took to return information to the user. Metrics are the aggregation of events over a time period, for example, 1,045 website visits in the last minute.
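The relationship between the two can be sketched in a few lines of Python: individual log events go in, a per-minute count comes out. This is only an illustration; the `ACCESS_LOG` entries and the `visits_per_minute` helper are made up for the example and do not come from the talk.

```python
from collections import Counter
from datetime import datetime

# Hypothetical access-log events: (ISO timestamp, page, HTTP status).
ACCESS_LOG = [
    ("2024-01-01T12:00:05", "/login", 200),
    ("2024-01-01T12:00:41", "/home", 200),
    ("2024-01-01T12:01:10", "/checkout", 500),
]

def visits_per_minute(events):
    """Collapse individual log events into a metric: visits per minute."""
    counts = Counter()
    for timestamp, _page, _status in events:
        # Truncate each event's timestamp to the start of its minute.
        minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
        counts[minute] += 1
    return dict(counts)

for minute, visits in sorted(visits_per_minute(ACCESS_LOG).items()):
    print(minute.isoformat(), visits)
```

Note what the aggregation throws away: the metric no longer says which page returned the 500, which is exactly the detail a log keeps.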
  • 5. Logging: The raw truth. The individuality of logs allows us to pick out singular events, such as a single error or a delayed response, and logs are often chained together with other logs to form a story of how users and information flow through your system. We don’t just look for errors: the absence of logs from a source, looked at either individually or as a count by server, can give us valuable insight when a server or service goes offline. Logs provide raw data that we can aggregate in different ways and query to answer specific questions. By watching for specific events in logs, for example high response times or errors, we can build alerting that informs us when systems are not performing correctly. Beyond looking for the absence of logs, we can track users from page to page to determine where they are leaving, or even whether a website isn’t allowing users to advance to the next step.
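Alerting on the absence of logs can be sketched as a comparison between the last time each source was heard from and the current time. The function name, the five-minute threshold and the server names below are assumptions chosen for illustration, not anything prescribed in the talk.

```python
from datetime import datetime, timedelta

def silent_sources(last_seen, now, max_silence=timedelta(minutes=5)):
    """Return the sources whose most recent log line is older than max_silence."""
    return sorted(
        source for source, seen in last_seen.items()
        if now - seen > max_silence
    )

# Hypothetical "most recent log timestamp" per server.
now = datetime(2024, 1, 1, 12, 30)
last_seen = {
    "web-1": datetime(2024, 1, 1, 12, 29),  # logged one minute ago
    "web-2": datetime(2024, 1, 1, 12, 10),  # silent for twenty minutes
}
print(silent_sources(last_seen, now))
```

The threshold needs tuning per source: a busy web server that goes quiet for five minutes is suspicious, while a nightly batch job is expected to be silent most of the day.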
  • 6. Metrics: The truth in aggregate. When we look at logs, the majority of the time we’re looking at an aggregate view: how many of what in what amount of time. Multiple cloud providers now provide this information as an easy-to-use metric, for example the number of visitors, the average response time or even the number of times error pages are displayed. In the cloud, the same metric is often available across multiple provider offerings; for example, HTTP 500 errors are a metric common to Load Balancers, CloudFront and other web services in AWS. Unlike logs, aggregated metrics are set up to count in one singular, boilerplate, out-of-the-box way. Alerting on metrics provides a quick and simple way of asking: Do I have any errors? Is my web server running slow? Am I using too much memory, CPU or disk space? Metrics are less expensive to generate and keep than logs, so they are a low-cost method of wrapping our systems in layers of alerting, with the downside of not being able to drill into a singular event.
  • 7. Watching Infrastructure, aka “The Servers”. Memory: memory usage on a server should be monitored; some systems will automatically reset your container or application when it runs out of memory, while others simply produce unpredictable results. Hard Drives: where the data often goes, they are critical; imagine not being able to write a new customer’s order to a database because you ran out of space. Host Health: is the physical box healthy, and is the operating system working? CPU: CPU on a server should be monitored; low CPU availability can lead to longer wait times for customers and even timeouts or failure to process items in a set amount of time.
  • 8. Hey! But ... I’m SERVERLESS! YO! It still runs on a server. Serverless architecture, regardless of provider or advertised features, must still run on a server. Lambdas still require memory and processing time, metrics which must be kept in check. Because you pay for what you use, if your Lambdas take twice as long now as they did yesterday, they will cost you twice as much. Serverless databases are often constrained by queries per minute and concurrent connections, and can automatically scale without your consent. While there are many advantages to serverless, in some ways it’s like giving a teenager a cell phone: be careful, the overages can kill you! Memory: Monitor memory usage for your serverless applications. Processing Time: Time is literally money; the longer your process takes, the more it costs. Connections & Queries: Be very mindful of the limitations and cost of serverless data storage mechanisms. Scaling Up is Paying Up: When traffic increases, serverless resources increase, which means you pay more.
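The "twice as long costs twice as much" point follows from duration-based billing, and a small worked example makes it concrete. This is a rough sketch that assumes compute is billed per GB-second and ignores per-request charges and free tiers; the price used is a placeholder, not a quoted rate.

```python
def invocation_cost(duration_ms, memory_mb, price_per_gb_second):
    """Compute-time cost of one invocation: seconds x GB of memory x unit price."""
    gb_seconds = (duration_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * price_per_gb_second

# Placeholder unit price for illustration only; check your provider's pricing.
PRICE_PER_GB_SECOND = 0.0000167

yesterday = invocation_cost(200, 512, PRICE_PER_GB_SECOND)  # 200 ms per call
today = invocation_cost(400, 512, PRICE_PER_GB_SECOND)      # same code, now 400 ms
print(f"cost ratio today/yesterday: {today / yesterday:.1f}")
```

Because the cost is linear in both duration and configured memory, a latency regression shows up directly on the bill, which is why duration is worth alerting on even when nothing is failing.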
  • 9. APM: Monitoring Your Code. What? APM, or Application Performance Monitoring, is fast becoming a new tool for improving reliability and development cycles. With the ability to simply drop modules into your code, server or container, APMs promise the visibility to track errors and slowdowns in your code, along with inter-system operability. APM often hooks into the “back door” of code, providing essential information such as how long an application waits for an API call or database call to complete, memory usage (often broken out by different types or heaps of memory) and, of course, application exceptions and errors. APM can also let you see transactions as they flow between applications, provided they all have the same APM instrumentation, giving you a map of the path traveled and helping to pinpoint issues in larger systems. It should be noted that APMs are often metric-based systems, and are often known to use only a sample or subset of the actual data to represent information on the screen. While still useful, this can introduce subtle differences between APMs, making borderline or hard-to-find issues even more difficult to track. WARNING!!!! APMs often will not capture an error that a developer has already caught in a try/catch statement! Imagine the frustration of not seeing all the errors you were promised! Most APMs provide a way to send the exception to the APM through a code change, but this substantially increases the rollout cost of an APM.
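The warning about caught exceptions can be illustrated with a short Python sketch. Here `report_to_apm` is a hypothetical stand-in for whatever "record this exception" call a given APM agent offers; it is not a real library function, and the list it appends to exists only so the example can run on its own.

```python
reported = []  # stands in for the APM backend in this sketch

def report_to_apm(exc):
    """Hypothetical placeholder for an APM agent's 'record exception' call."""
    reported.append(exc)

def parse_amount(raw):
    """Parse a user-supplied amount, falling back to 0.0 on bad input."""
    try:
        return float(raw)
    except ValueError as exc:
        # The error is handled here, so it never propagates to where an
        # agent would typically observe it. Reporting must be explicit.
        report_to_apm(exc)
        return 0.0

print(parse_amount("12.50"), parse_amount("twelve"), len(reported))
```

Every existing handler needs a change like this one, which is where the extra rollout cost mentioned on the slide comes from.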
  • 10. Active Monitoring. Synthetic Monitoring: Actively accessing your site to simulate a user experience. This often includes a robotic login and clicking about to validate the functionality of a site. Watch Dogs: To ensure a system is alive, we often have it send an outbound signal, or make an HTTP call, every x minutes. If no call comes in after a period of time, an alert is fired. Simple Web Monitoring: Simply accessing a web page or API and checking for specific results or result codes (like HTTP 200 success responses) provides a simple way of asking: is my website or API alive? Certificate Validation: Actively checking HTTPS certificates and sending notification 15 to 30 days before they expire will save you the embarrassment of expired certificates hurting revenue and user experience.
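The certificate check comes down to date arithmetic on the certificate's expiry field. The sketch below uses Python's standard `ssl.cert_time_to_seconds` to parse the `notAfter` string format returned by `SSLSocket.getpeercert()`; the helper names, the sample dates and the 30-day default are assumptions for illustration, and fetching the certificate over the network is deliberately left out.

```python
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after, now):
    """Whole days between now and a certificate's notAfter timestamp."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days

def needs_renewal_alert(not_after, now, warn_days=30):
    """True once the certificate is within warn_days of expiring."""
    return days_until_expiry(not_after, now) <= warn_days

# Sample value in the format used by getpeercert()["notAfter"].
now = datetime(2025, 5, 15, 12, 0, 0, tzinfo=timezone.utc)
print(needs_renewal_alert("Jun  1 12:00:00 2025 GMT", now))
```

Running a check like this on a schedule against every public endpoint turns certificate expiry from a surprise outage into a routine ticket.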
  • 11. Agenda. 01 What We Monitor & Why: We’ll discuss logs vs. metrics, active checks, infrastructure beyond just memory, CPU and hard drive, and APM monitoring of applications. 02 Production Reliability: A discussion of how uptime and outages impact customer satisfaction and revenue. 03 Auditing Your Monitoring: The largest gap in the industry today is the tooling required to bring together our ability to inventory our systems and validate monitoring. 04 How Do I Get Started?: The reality of getting started, the tools and what they give you, and finally, what gives the best return on your investment.
  • 12. The IMPACT! Revenue & Customer Experience The false belief that downtime is preventable! Repeat after me, “Downtime will happen.”, say it again, “Downtime will happen!” There is only one promise I can make as we go through this presentation today - at some point, systems have downtime - it’s absolutely unavoidable. How we plan to recover from those outages, the decisions to build out redundancy and impact not just in revenue but reputation are all part of a complex equation. Beyond the discussion of architecture of application and systems, redundancy through the use of multiple servers, resources, data centers and networking layers are all part of a larger business discussion. I’ll say that again, for the most part, reliability due to redundant resources, is a business, not a technical decision. When reviewing downtime, customer and revenue impact should be thought of in a time based equation - and downtime for customers can have crippling impacts into app ratings, word of mouth limitations or worst, customer rants warding off new customers. Be Aware of the 100% uptime promise! It’s often riddled with exceptions for “emergency” maintenance windows used to cover up production issues.
  • 13. The IMPACT! Communication - Internal & External. The false belief that downtime is preventable! Communication is about perception, and is perhaps one of the most important aspects of both an outage and career advancement. Communication is a balance: do we communicate a 5 minute database slowdown to our customers and CEO? Likely not. The general rule I’ve used is that if the issue is over before the communication can be sent, or the impact is close to resolution, we’ll typically not send communication beyond IT leadership. Applying this rule takes a fair amount of hands-on learning for your organization, so be flexible and validate that your communication is appropriate and accurate. We communicate to our direct supervisor, and hopefully up the chain to our CIO/CTO, because you never want these people to be asked what’s going on without immediately having an answer for other members of the C-Suite. This empowers your managers and will have a positive impact on your career. Communication to customers is a business decision, and for this I would bring in marketing and legal to discuss pre-canned messages. This is extremely critical for larger outages. And remember, think about how you’d feel as a customer. When you have an outage, nothing is better than the truth. Want to know how to do it right? Research Johnson & Johnson’s Tylenol recall.
  • 15. Monitoring Auditing - Beyond Monitoring as Code. Monitoring as Code: The idea that monitoring alerts can be defined as code and pushed out through APIs to different systems, saving the time of manual implementation and reducing the cost of errors made during it. Cloud Systems and APIs: Cloud providers can give you a list of all the architecture in your account via an API call, library or command line interface. This allows us to ask, what servers do I have? Tagging - the what is it: Tags on architecture are typically descriptive, like “MySQL Customer Database”. If we look deeper, we find example after example of using tags to define how other parts of the architecture work together. The power of the Audit: Not only can we ask a cloud provider for a list of all cloud architecture elements, we can ask our monitoring system for the same. By comparing these lists, we can validate that all systems are being monitored.
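The audit described on this slide comes down to comparing two lists. The sketch below assumes the cloud inventory and the monitoring system's list have already been fetched; the `Resource` shape, the `AuditReport` fields and the `monitoring=disabled` opt-out tag are hypothetical names for illustration, not any provider's API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Resource:
    """One item from the cloud inventory (hypothetical shape)."""
    resource_id: str
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class AuditReport:
    unmonitored: List[str]  # in the cloud account, missing from monitoring
    orphaned: List[str]     # in monitoring, no longer in the cloud account


def audit_monitoring(inventory: Iterable[Resource],
                     monitored_ids: Set[str]) -> AuditReport:
    """Compare the cloud inventory with what the monitoring system covers."""
    resources = list(inventory)
    all_ids = {r.resource_id for r in resources}
    # Assumed convention: a monitoring=disabled tag opts a resource out on purpose.
    expected = {r.resource_id for r in resources
                if r.tags.get("monitoring") != "disabled"}
    return AuditReport(
        unmonitored=sorted(expected - monitored_ids),
        orphaned=sorted(monitored_ids - all_ids),
    )
```

For example, an inventory of `db-1`, `web-1` and an opted-out `dev-box`, checked against a monitoring system that knows `db-1` and `old-server`, reports `web-1` as unmonitored and `old-server` as orphaned.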
  • 16. Monitoring as Code - Rolling out modifiable templates. Steps: Templates, Step through it!, Roll It Out!, Rinse and Repeat. Monitoring as Code: Monitoring is often rolled out per application and need, ending up as multiple independent systems. But there is a more elegant way to roll out automated monitoring using code. Here’s how. Start with Monitoring Templates: Review all the architecture you use or want to use, and define alerts that address your business needs. From these, you can build a standard set of alerts. Step through your architecture! Step through each object in your cloud account, capturing both the object name and the tags that define custom template adjustments. Roll it out! Now that you have the objects and their custom adjustments, alerting can be created or updated using the templates. Rinse and Repeat: Repeating the rollout daily ensures manual changes are rolled back and new architecture is monitored quickly, without intervention. Did you know? None of the current leaders in monitoring offer a system so simple or clean, and one has to wonder why.
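The four steps on this slide can be sketched as a small script. Everything specific here is an assumption for illustration: the template values, the `type` tag, the `monitor:` tag prefix for overrides and the plain dictionaries standing in for the cloud inventory and the monitoring system. Removing alerts that no longer match a resource is also a design choice of this sketch, not something the slide prescribes.

```python
from typing import Dict, List

# Standard alert templates per resource type (illustrative values only).
TEMPLATES: Dict[str, Dict[str, float]] = {
    "database": {"cpu_percent": 80.0, "disk_used_percent": 85.0},
    "web": {"cpu_percent": 90.0, "http_5xx_per_minute": 10.0},
}

# Hypothetical tag convention: the tag "monitor:cpu_percent" = "95" overrides a threshold.
OVERRIDE_PREFIX = "monitor:"


def render_alerts(resource: Dict) -> Dict[str, float]:
    """Apply the template for the resource type, then tag-based adjustments."""
    tags = resource["tags"]
    thresholds = dict(TEMPLATES.get(tags.get("type", ""), {}))
    for key, value in tags.items():
        if key.startswith(OVERRIDE_PREFIX):
            thresholds[key[len(OVERRIDE_PREFIX):]] = float(value)
    return thresholds


def desired_state(resources: List[Dict]) -> Dict[str, float]:
    """Step through every object: {"<resource>/<metric>": threshold}."""
    state: Dict[str, float] = {}
    for res in resources:
        for metric, threshold in render_alerts(res).items():
            state[f"{res['name']}/{metric}"] = threshold
    return state


def plan_rollout(desired: Dict[str, float],
                 existing: Dict[str, float]) -> Dict[str, List[str]]:
    """Compare desired alerts with what the monitoring system holds today."""
    return {
        "create": sorted(k for k in desired if k not in existing),
        "update": sorted(k for k in desired
                         if k in existing and existing[k] != desired[k]),
        "delete": sorted(k for k in existing if k not in desired),
    }
```

Running `plan_rollout` daily gives the "rinse and repeat" behavior: a manually edited threshold shows up under `update` and is reset, a new resource shows up under `create`, and a second run right after applying the plan produces no changes.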
  • 17. Routing your Monitoring Alerts. Alerts Generated: Ensure your alerts include metadata that can be used to determine importance and team ownership. Alert Processing (PagerDuty): Processing alerts can be complex, involving routing alerts to teams and setting hours for specific levels of alerting. Alert Fatigue - the fastest way to start searching for new Team Mates: Alert fatigue, the cry wolf of IT, happens when alerts that don’t require immediate resolution are routed incorrectly, or when their importance is not aligned with the company’s needs. I.e., when alerts you shouldn’t be getting continuously wake you up at 3am, those LinkedIn invites for a new job become more and more tempting. Set Paging Escalation and Expectations: When you reach out to employees after hours, you should have an escalation policy that reaches out to the backup person automatically. It’s very important that expectations are set with the entire team. Does a laptop travel with on-call engineers? Do engineers come in late if they’ve been up all night? When do you wake up your entire team for a team response? Alerts vs Warnings! Consider adding warnings for most alerts, for example hard drive space. This allows working-hours support, which can prevent the 3am wake-up.
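As a rough sketch of the routing ideas on this slide, the code below decides what to do with an alert from its metadata and picks who to contact as an alert goes unacknowledged. The field names, the 09:00 to 17:00 weekday working hours, the three outcomes and the 15 minute escalation step are all assumptions for illustration; a tool such as PagerDuty would normally hold these rules in its own configuration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Alert:
    summary: str
    team: str         # owning team, used to pick the on-call chain
    environment: str  # e.g. "production" or "dev"
    severity: str     # "alert" (act now) or "warning" (act soon)


def is_business_hours(when: datetime) -> bool:
    # Assumed working hours: Monday to Friday, 09:00 to 17:00 local time.
    return when.weekday() < 5 and 9 <= when.hour < 17


def route(alert: Alert, when: datetime) -> str:
    """Return "page", "notify" or "queue" for an alert."""
    if alert.environment == "production" and alert.severity == "alert":
        return "page"    # direct, immediate impact: wake someone up
    if is_business_hours(when):
        return "notify"  # handled by the team during the working day
    return "queue"       # held for the next business day


def escalation_target(chain: List[str], minutes_unacknowledged: int,
                      step_minutes: int = 15) -> str:
    """Move down the on-call chain each time a step passes without an ack."""
    index = min(minutes_unacknowledged // step_minutes, len(chain) - 1)
    return chain[index]
```

With these rules a full disk warning on a dev server at 3am is queued rather than paged, while a production alert pages the primary engineer and moves to the backup after 15 unacknowledged minutes.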
  • 19. Getting Started! Monitoring and alerting has the highest return on investment you’ll likely see in your business, if you keep the costs under control. Starting with the alerting capabilities of your cloud provider is the quickest and least expensive way to begin, and it helps you understand the data available from your systems when and if you purchase a third party tool. Generating a script to apply templates should be quick and simple. Adding tag based rule exceptions, or even entire template versioning, provides a level of coverage quickly and at a low cost. Ensure that when monitoring is triggered, you are treating the alert appropriately, i.e. dev servers shouldn’t wake you at 3am. Keeping production alerts with direct and immediate impact going to engineers, while capturing other alerts for processing the next business day, will not only keep your staff happy but prevent churn in an industry where demand is very high. Finally, talk about outages and downtime as a business decision. Like a fire drill, know what the response will be, know who is required to do what and who their backup is, and ultimately determine what is to be released, how, and to whom.