After It’s Live
Observability
Jeremy Proffitt
Ally Financial
Director DevOps & SRE C3
Even the most dependable employee is not 100% reliable.
We’ll discuss logs vs. metrics, active checks, infrastructure beyond just memory, CPU, and hard drive - and APM monitoring of applications.
What We Monitor & Why
01
Discussion of how uptime and outages impact customer satisfaction and revenue.
Production Reliability
02
The largest gap in the industry today is tooling that brings together our ability to inventory our systems and to validate that they are monitored.
Auditing Your Monitoring
03
The reality of getting started, what the tools give you, and finally, where you get the best return on your investment.
How Do I Get Started?
04
Agenda
Logs vs Metrics
Imagine collecting information such as when your application throws an error, the time required to perform actions, or client activity such as logging in.
Logs allow us to capture events, in detail, and store them for analysis. One of the most common logs is the access log for web services, i.e. a record of who visited your website, what page they visited, their browser type, and even how long it took to return information to them.
Metrics are the aggregation of events in a time period,
for example, 1,045 web site visits in the last minute.
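To make the distinction concrete, here is a small illustrative sketch (the event data and field names are invented for the example) that rolls individual log events up into a per-minute metric:

```python
# Individual log events (the "logs") aggregated into a per-minute count (the "metric").
from collections import Counter
from datetime import datetime

# Hypothetical access-log events; real logs would come from files or a log pipeline.
events = [
    {"time": "2023-01-05T10:01:12", "path": "/login", "status": 200},
    {"time": "2023-01-05T10:01:45", "path": "/orders", "status": 500},
    {"time": "2023-01-05T10:02:03", "path": "/login", "status": 200},
]

# Aggregate: number of requests per minute, i.e. a metric derived from logs.
per_minute = Counter(
    datetime.fromisoformat(e["time"]).strftime("%Y-%m-%d %H:%M") for e in events
)
print(per_minute)  # e.g. Counter({'2023-01-05 10:01': 2, '2023-01-05 10:02': 1})
```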
Logging
The raw truth
The individual nature of logs allows us to pick out singular events, such as a single error or delayed response, and logs are often chained together to form a story showing how users and information flow through your system.
We don’t just look for errors; the absence of logs from a source, whether viewed individually or as a count per server, can provide valuable insight when a server or service goes offline.
Logs provide raw data we can aggregate in different ways, letting us build queries that answer specific questions.
By watching for specific events in logs - for example, high response times or errors - we can build alerting that informs us when systems are not performing correctly.
Not only can we look for the absence of logs, we can also track users page to page to determine where they leave, or whether a website isn’t allowing them to advance to the next step.
Metrics
The truth in
aggregate
When we look at logs, the majority of the time we’re looking at an aggregate view: how many of what, in what amount of time.
Multiple cloud providers now provide this information as easy-to-use metrics - for example, the number of visitors, average response time, or even the number of times error pages are displayed.
In the cloud, the same metric is often available across multiple provider offerings - for example, HTTP 500 errors are a metric common to load balancers, CloudFront, and other web services in AWS.
Unlike logs, aggregated metrics count events only in a singular, boilerplate, out-of-the-box way.
Alerting on metrics provides a quick and simple way of asking: do I have any errors? Is my web server running slow? Am I using too much memory, CPU, or disk space?
Metrics are less expensive to generate and keep than logs; they are a low-cost way of wrapping our systems in layers of alerting, with the downside of not being able to drill into a singular event.
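As an example of metric-based alerting, here is a hedged sketch using boto3 and CloudWatch (assuming an AWS account and configured credentials; the instance ID and SNS topic are hypothetical):

```python
# Wrap a single metric in an alarm so we're told when CPU runs hot.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-web-server",  # illustrative name
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical instance
    Statistic="Average",
    Period=300,                    # 5-minute aggregation window
    EvaluationPeriods=3,           # 15 minutes above threshold before alarming
    Threshold=80.0,                # percent CPU
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # hypothetical SNS topic
)
```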
Watching Infrastructure - aka “The Servers”.
Memory usage on a server should be monitored; some systems will automatically reset your container or application when it runs out of memory, while others simply produce unpredictable results.
Hard drives, where the data often goes, are critical - imagine not being able to write a new customer’s order to a database because you ran out of space.
Host Health - is the
physical box
healthy, is the
operating system
working?
CPU on a server should be monitored; low CPU availability can lead to longer wait times for customers, and even timeouts or failures to process items in a set amount of time.
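A minimal sketch of these host-level checks, assuming the psutil package is available (production setups usually rely on a monitoring agent, but the signals are the same):

```python
import psutil

checks = {
    "memory_pct": psutil.virtual_memory().percent,  # % of RAM in use
    "cpu_pct": psutil.cpu_percent(interval=1),      # % CPU over a 1-second sample
    "disk_pct": psutil.disk_usage("/").percent,     # % of root filesystem used
}

# Illustrative thresholds; tune them to your own workloads.
thresholds = {"memory_pct": 90, "cpu_pct": 90, "disk_pct": 80}

for name, value in checks.items():
    if value >= thresholds[name]:
        print(f"ALERT: {name} at {value}% (threshold {thresholds[name]}%)")
```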
Serverless architecture, regardless of provider or advertised features, must still run on a server. Lambdas still require memory and processing time - metrics which must be kept in check. Because you pay for what you use, if your Lambdas take twice as long today as they did yesterday, they will cost you twice as much. And serverless databases are often constrained by queries per minute and concurrent connections, and can automatically scale without your consent.
While there are many advantages to serverless - in some ways, it’s like giving a teenager a cell phone - be careful, the overages can kill you!
Memory: monitor memory usage for your serverless applications.
Processing Time: time is literally money - the longer your process takes, the more it costs.
Connections & Queries: be very mindful of the limitations and cost of serverless data storage mechanisms.
Scaling Up Is Paying Up: when traffic increases, serverless resources scale up with it, which means you pay more.
Hey! But ...
I’m SERVERLESS!
YO! It still runs on a server.
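To keep Lambda duration - and therefore cost - in check, here is a hedged sketch that pulls the average Duration metric from CloudWatch (assuming boto3 and AWS credentials; the function name is hypothetical):

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Duration",
    Dimensions=[{"Name": "FunctionName", "Value": "order-processor"}],  # hypothetical function
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
    Period=3600,               # hourly buckets
    Statistics=["Average"],
    Unit="Milliseconds",
)

# If today's averages are double yesterday's, so is your bill.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), "ms")
```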
APMs often will not capture an error a developer has already caught in a try/catch statement! Imagine the frustration of not seeing all the errors you were promised! Most APMs provide a way to make a code change that sends the exception to your APM - but it substantially increases the rollout cost of an APM.
APM
Monitoring Your Code
WARNING!!!!
APM, or Application Performance Monitoring, is fast becoming a new tool for increasing reliability and shortening development cycles. With the ability to simply drop modules into your code, server, or container, APMs promise the visibility to track errors and slowdowns in your code, along with inter-system operability.
What?
APM often hooks into the “back door” of code, providing essential information such as how long an application waits for an API or database call to complete, memory usage (often broken out by different types or heaps of memory), and of course application exceptions and errors.
APM can also let you see transactions as they flow to and from different applications, provided they all have the same APM instrumentation, giving you a map of the path traveled and helping pinpoint issues in larger systems.
It should be noted that APMs are often metric-based systems, known to use only a sample or subset of the actual data to represent information on the screen. While still useful, this can introduce subtle differences between APMs, making on-the-fence or hard-to-find issues even more difficult to track.
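To illustrate the code change most APMs ask for, here is a hedged sketch of explicitly reporting a caught exception. The apm_client and record_exception names are hypothetical stand-ins; each vendor (New Relic, Datadog, Elastic, etc.) exposes its own equivalent call, so check your APM’s documentation.

```python
class FakeApmClient:
    """Stand-in for a real APM client; real agents expose a similar reporting call."""
    def record_exception(self, exc):
        print("sent to APM:", repr(exc))

def charge_card(order):
    raise RuntimeError("card declined")  # simulate a failure in hypothetical business logic

def process_order(order, apm_client):
    try:
        charge_card(order)
    except Exception as exc:
        # Without an explicit call like this, a caught exception is often invisible to the APM.
        apm_client.record_exception(exc)
        return {"status": "failed", "reason": str(exc)}
    return {"status": "ok"}

print(process_order({"id": 42}, FakeApmClient()))
```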
Active Monitoring
Synthetic Monitoring
Actively accessing your site, simulating a user experience. Often includes a robotic login and clicking around to validate the functionality of a site.
Watch Dogs
To ensure a system is alive, we often have it send an outbound signal, or make an HTTP call, every x minutes. If no call comes in after a set period of time, an alert is fired.
Simple Web Monitoring
Simply accessing a web page or API and checking for specific results or result codes (like 200 HTTP success responses) provides a simple way of asking: is my website or API alive?
Certificate Validation
Actively checking HTTPS certificates and providing notification 15 to 30 days before they expire will save you the embarrassment of expired certificates causing revenue and user-experience impacts.
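A minimal sketch of two of these active checks - a simple web check and a certificate-expiry check - assuming the requests package is installed; the host and thresholds are illustrative:

```python
import socket
import ssl
from datetime import datetime, timezone

import requests

HOST = "example.com"

# 1. Simple web monitoring: is the site answering with a 200?
resp = requests.get(f"https://{HOST}/", timeout=10)
print("web check:", "OK" if resp.status_code == 200 else f"FAILED ({resp.status_code})")

# 2. Certificate validation: warn 30 days before the cert expires.
ctx = ssl.create_default_context()
with ctx.wrap_socket(socket.create_connection((HOST, 443)), server_hostname=HOST) as sock:
    cert = sock.getpeercert()
expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
days_left = (expires - datetime.now(timezone.utc)).days
print("cert check:", "OK" if days_left > 30 else f"EXPIRES in {days_left} days")
```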
We’ll discuss logs vs. metrics, active checks, infrastructure beyond just memory, CPU, and hard drive - and APM monitoring of applications.
What We Monitor & Why
01
Discussion of how uptime and outages impact customer satisfaction and revenue.
Production Reliability
02
The largest gap in the industry today is tooling that brings together our ability to inventory our systems and to validate that they are monitored.
Auditing Your Monitoring
03
The reality of getting started, what the tools give you, and finally, where you get the best return on your investment.
How Do I Get Started?
04
Agenda
The IMPACT!
Revenue & Customer Experience
The false belief that downtime is preventable!
Repeat after me,
“Downtime will happen.”,
say it again,
“Downtime will happen!”
There is only one promise I can make as we go through this presentation today - at some point, systems have downtime - it’s absolutely unavoidable. How we plan to recover from those outages, the decisions to build out redundancy, and the impact not just on revenue but on reputation are all part of a complex equation.
Beyond the discussion of application and system architecture, redundancy through the use of multiple servers, resources, data centers, and networking layers is part of a larger business discussion. I’ll say that again: for the most part, reliability through redundant resources is a business decision, not a technical one.
When reviewing downtime, customer and revenue impact should be thought of as a time-based equation - and downtime for customers can have crippling effects on app ratings and word of mouth, or worse, produce customer rants that ward off new customers.
Be Aware of the 100% uptime
promise! It’s often riddled with
exceptions for “emergency”
maintenance windows used to
cover up production issues.
The IMPACT!
Communication - Internal & External
The false belief that downtime is preventable!
Communication is about perception - and is perhaps one of the most important
aspects of both an outage and career advancement.
Communication is a balance - do we communicate a 5-minute database slowdown to our customers and CEO? Likely not. The general rule I’ve used is that if it’s over before the communication can be sent, or the impact is close to resolution, we typically don’t send communication beyond IT leadership. This rule takes a fair amount of hands-on learning for your organization, so be flexible, and validate that your communication is appropriate and accurate.
We communicate to our direct supervisor, and hopefully up the chain to our CIO/CTO, because you never want these people to be asked what’s going on without immediately having an answer for other members of the C-suite. This empowers your managers and will have a positive impact on your career.
Communication to customers is a business decision, and in this, I would bring in
marketing and legal for a discussion of pre-canned messages. This is extremely
critical for larger outages - and remember, think about how you’d feel as a customer.
When you have an outage, nothing is better than the truth. Want to know how to do it right? Research Johnson & Johnson’s Tylenol recall.
We’ll discuss logs vs. metrics, active checks, infrastructure beyond just memory, CPU, and hard drive - and APM monitoring of applications.
What We Monitor & Why
01
Discussion of how uptime and outages impact customer satisfaction and revenue.
Production Reliability
02
The largest gap in the industry today is tooling that brings together our ability to inventory our systems and to validate that they are monitored.
Auditing Your Monitoring
03
The reality of getting started, what the tools give you, and finally, where you get the best return on your investment.
How Do I Get Started?
04
Agenda
The idea that monitoring alerts can be defined as code and pushed out through APIs to different systems, saving the time of manual implementation and reducing the cost of errors made during that manual work.
Monitoring as code
Cloud providers can give you a list of all the architecture in your account via an API call, library, or command-line interface. This allows us to ask: what servers do I have?
Cloud Systems and APIs
Tagging on architecture is typically descriptive, like “MySQL Customer Database”. If we look deeper, we find example after example of using tags to define how different parts of the architecture interact with one another.
Tagging - the “what is it”
Not only can we ask a cloud provider for a list of all cloud architecture elements, we can do the same in our monitoring system. By comparing these lists, we can validate that all systems are being monitored (a minimal sketch follows this slide).
The power of the Audit
Monitoring Auditing - Beyond Monitoring as Code
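A hedged sketch of the audit: ask AWS for its instance inventory via boto3 and compare it with the inventory your monitoring system reports. The monitoring-side call is a hypothetical stand-in for your tool’s own API.

```python
import boto3

def cloud_instance_ids(region="us-east-1"):
    """Ask the cloud provider: what servers do I have?"""
    ec2 = boto3.client("ec2", region_name=region)
    ids = []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            ids.extend(inst["InstanceId"] for inst in reservation["Instances"])
    return set(ids)

def monitored_instance_ids():
    """Hypothetical call into your monitoring system's inventory API."""
    return {"i-0aaa111bbb222ccc3", "i-0ddd444eee555fff6"}  # placeholder data

# Anything in the cloud inventory but not in the monitoring inventory is a gap.
unmonitored = cloud_instance_ids() - monitored_instance_ids()
for instance_id in sorted(unmonitored):
    print("NOT MONITORED:", instance_id)
```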
Monitoring as Code - Rolling out modifiable templates
Templates → Step through it! → Roll It Out! → Rinse and Repeat
Monitoring as Code
Monitoring is often rolled out based on separate applications and needs, and deployed as multiple independent systems. But there is a more elegant way to roll out automated monitoring using code - here’s how.
Start with Monitoring Templates
Review all the architecture you use or want to use, and generate alerts that address your business needs. From these, you can build a standard set of alert templates.
Step through your architecture!
Step through each object in your cloud account, capturing both the object name and the tags that define custom template adjustments.
Roll it out!
Now that you have the objects and their custom adjustments, alerting can be created or updated from the templates.
Rinse and Repeat
Repeating the rollout daily ensures manual changes are rolled back and new architecture is monitored quickly - and without intervention.
Did you know?
None of the current leaders in
Monitoring offer a system so simple
or clean - and one has to wonder
why?
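To make the rollout concrete, here is a minimal sketch of the loop described above, assuming boto3 and AWS; the template shape and tag names are illustrative, not a vendor standard.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Template: a standard alert definition, adjustable per resource.
CPU_TEMPLATE = {"Threshold": 80.0, "Period": 300, "EvaluationPeriods": 3}

# 2. Step through the architecture, 3. roll it out - and run this daily (rinse and repeat).
for page in ec2.get_paginator("describe_instances").paginate():
    for reservation in page["Reservations"]:
        for inst in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            template = dict(CPU_TEMPLATE)
            # A tag like alert-cpu-threshold=95 overrides the template default.
            if "alert-cpu-threshold" in tags:
                template["Threshold"] = float(tags["alert-cpu-threshold"])
            cloudwatch.put_metric_alarm(
                AlarmName=f"cpu-{inst['InstanceId']}",
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": inst["InstanceId"]}],
                Statistic="Average",
                ComparisonOperator="GreaterThanThreshold",
                **template,
            )
```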
Routing your Monitoring Alerts
Ensure your alerts include metadata that can be used to determine importance and team ownership.
Alerts Generated
Processing alerts can be complex: routing alerts to teams and setting hours for specific levels of alerting.
Alert Processing (PagerDuty)
Alert fatigue, the cry-wolf of IT, is when alerts that don’t require immediate resolution are routed incorrectly, or their importance is not aligned with the company’s needs. In other words, when you get alerts you shouldn’t, and they wake you up at 3am continuously, those LinkedIn invites for a new job become more and more tempting.
Alert Fatigue - the fastest way to start searching for new Team Mates
When you reach out to employees after hours, you should have an escalation policy that reaches the backup person automatically.
It’s very important that expectations are set with the entire team: does a laptop travel with on-call engineers? Do engineers come in late if they’ve been up all night? When do you wake up your entire team for a team response?
Set Paging Escalation and Expectations
Alerts vs. Warnings! Consider adding warnings for most alerts - for example, hard drive space. This allows working-hours support, which can prevent the 3am wake-up.
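As an illustration of routing alerts with metadata, here is a hedged sketch that sends an alert to PagerDuty’s Events API v2, deriving severity from an environment tag (assuming the requests package and a valid routing key for your service; the tag values are illustrative):

```python
import requests

def send_alert(summary, source, team, environment, routing_key):
    # Dev shouldn't page anyone at 3am; prod gets full-severity treatment.
    severity = "critical" if environment == "prod" else "warning"
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
            # Metadata used downstream to determine importance and team ownership.
            "custom_details": {"team": team, "environment": environment},
        },
    }
    resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=event, timeout=10)
    resp.raise_for_status()

# send_alert("Disk above 80% on web-01", "web-01", "platform", "prod", "YOUR_ROUTING_KEY")
```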
We’ll discuss logs vs. metrics, active checks, infrastructure beyond just memory, CPU, and hard drive - and APM monitoring of applications.
What We Monitor & Why
01
Discussion of how uptime and outages impact customer satisfaction and revenue.
Production Reliability
02
The largest gap in the industry today is tooling that brings together our ability to inventory our systems and to validate that they are monitored.
Auditing Your Monitoring
03
The reality of getting started, what the tools give you, and finally, where you get the best return on your investment.
How Do I Get Started?
04
Agenda
Getting Started
A
B
C
Monitoring and Alerting has the highest return on investment you’ll likely see in
your business, if you keep the costs under control.
Starting with the alerting capabilities of your cloud provider is the quickest and least expensive way to begin, and it helps you understand the data available from your systems when and if you purchase a third-party tool.
Generating a script to apply templates should be quick and simple - adding tag-based rule exceptions, or even entire template versioning, provides a level of coverage quickly and at low cost (a small sketch follows at the end of this section).
Ensure that when monitoring is triggered, you treat the alert appropriately - i.e. dev servers shouldn’t wake you at 3am. Keeping production alerts with direct and immediate impact going to engineers, while capturing other alerts for processing the next business day, will not only keep your staff happy but also prevent churn in an industry where demand is very high.
Finally, talk about outages and downtime as a business decision. Like a fire drill, know what the response will be, know who is required to do what - and who their backup is - and ultimately determine what is to be communicated, how, and to whom.
Getting Started!
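As a small illustration of the tag-based rule exceptions and template versioning mentioned above, here is a sketch where the template names and tag keys are hypothetical:

```python
# Illustrative alert-template definitions and per-resource exceptions.
TEMPLATES = {
    "default-v1": {"cpu_threshold": 80, "disk_threshold": 80, "page_after_hours": True},
    "batch-v1":   {"cpu_threshold": 95, "disk_threshold": 70, "page_after_hours": False},
}

def resolve_alerting(tags):
    """Pick a template version from a tag, then apply tag-based rule exceptions."""
    config = dict(TEMPLATES.get(tags.get("alert-template", "default-v1")))
    if "alert-cpu-threshold" in tags:            # per-resource exception
        config["cpu_threshold"] = int(tags["alert-cpu-threshold"])
    if tags.get("environment") == "dev":         # dev servers shouldn't page at 3am
        config["page_after_hours"] = False
    return config

print(resolve_alerting({"alert-template": "batch-v1", "environment": "dev"}))
```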
Want More?
Read my
New Book
Becoming a RockStar SRE
Will be Published By Packt & O’Reilly in Early 2023