Production Readiness Strategies in an Automated World

Sean Chittenden
Engineering, HashiCorp
@SeanChittenden
https://keybase.io/seanc

Software Life Cycle
Time
Prod
1) Idea!
R&D

Software Life Cycle
Time
Prod
1) Idea!
2) Production Ready
R&D

Software Life Cycle
Time
Readiness
1) Idea!
2) Production Ready
2.9) "It’ll be time to wind this service down when ___ happens and ___ comes online."
3) End of Life
R&D

Software Life Cycle
Time
Production
1) Idea!
2) Production Ready
3) End of Life
"Production Supported"
"Oops"
R&D

Software Life Cycle
Time
Production
1) Idea!
2) Production Ready
4) End of Life
3) "Oops"
R&D

Software Life Cycle
Time
Production
1) Idea!
N) End of Life
Forced to fix code or docs.
R&D

Software Life Cycle
Time
Production
1) Idea!
2) Production Ready
N) End of Life
"Dragged feet to produce docs."
[3,M) "Oops"
R&D
N-1) "That’s it, we’ve had enough…"

Software Life Cycle
Time
Production
1) Idea!
2) Production Ready
N) End of Life
[3,M) "Oops"
R&D
N-2) "That’s it, we’ve had enough…"
N-1) "Just support it until
the next version is out"

Operations in the "Real World"

Complexity Abounds
The Echo Service: Stateless HTTP Echo
$ go get github.com/hashicorp/http-echo
$ http-echo -text foo
$ curl http://127.0.0.1:5678/
foo

Echo as a Service
Components:
• Echo Service
• Load Balancer
• "Hardware" / OS
• Metrics Agent
• Logs Management
• Reproducible Builds
$ cd $GOPATH/src/github.com/hashicorp/http-echo/
$ git checkout 87ee38c517094993932bd76b37af03980e8c4151
$ go build

Complexity In The Simple Case
Simple Example: The Echo Service
Minimum of 6x dimensions to be concerned about
No downstream services: only request + response

Echo as a Service
Dimensions of Work to
measure:
• CPU
• RAM usage
• Network Usage
• TCP accept/connection rate
• Disk Capacity
• Disk IO (maybe?)
• Stability
• Request volume
• Request Latency

"Can't Escape the Signal, Mal"
The Echo Service: Stateless HTTP Echo
2016/11/18 03:29:58 Server is listening on :5678
2016/11/18 03:30:00 127.0.0.1:5678 127.0.0.1:61932 "GET / HTTP/1.1" 200 4 "curl/7.51.0" 15.94µs
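
The access log above already carries two of the dimensions listed earlier: request volume and request latency. A minimal sketch of extracting them, assuming the field order seen in the sample line (timestamp, local address, remote address, request, status, response size, user agent, duration); the field names are my own reading of that one line, not a documented format:

```python
import re

# Field layout inferred from the single sample line on the slide (assumption).
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) '
    r'(?P<local>\S+) (?P<remote>\S+) '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<agent>[^"]*)" (?P<dur>[\d.]+)(?P<unit>ns|[\u00b5\u03bc]s|us|ms|s)$'
)

UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def parse_line(line):
    """Return a dict of request fields, or None for non-request lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # Normalize both "micro" glyphs to a plain "u" before the unit lookup.
    unit = m["unit"].replace("\u00b5", "u").replace("\u03bc", "u")
    return {
        "request": m["request"],
        "status": int(m["status"]),
        "bytes": int(m["size"]),
        "latency_s": float(m["dur"]) * UNIT_SECONDS[unit],
    }
```

Lines that are not requests, such as the "Server is listening" banner, come back as None and can simply be skipped when counting.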

Echo as a Service
Complexity Factor: ~10

Echo's Operational Concerns
Loss Aversion
• Uptime
• Secrets
• Planned Failure Modes: failure on a probability curve
• Server Uptime (e.g. OS or Hardware)
• Unplanned Failure Modes (e.g. DC or AZ fails)

Entropy and Failure: Best Friends

Loss Aversion
• Uptime
• Secrets
• Unplanned Failure Modes (e.g. DC or AZ fails in an earthquake)
• Success Failure Modes
Randall A. Lewis and David H. Reiley. 2013. Down-to-the-minute effects of
super bowl advertising on online search behavior. 
http://dx.doi.org/10.1145/2482540.2482600

Loss Aversion
• Uptime
• Secrets
• Unplanned Failure Modes (e.g. DC or AZ fails)
• Success Failure Modes
• Known Architectural Limits
• Unknown Architectural Limits

Performance Spelunking
Exciting, but not very fun

Lurking Significant Details
Imagine a more complex service:
• an API server that fans out to ~20 downstream services
• Uses async scatter/gather to fan out requests
• Transient failures become the norm

Stateful Complexity
Database-as-a-Service: PostgreSQL Edition

SQL
WAL Files Log Files
PostgreSQL as a Service
Components:
• PostgreSQL
• Connection Pooler (pgbouncer)
• PITR Manager (WAL-E, omnipitr,
pgBackRest)
• Logs Analyzer (pgbadger, pgfouine)
• Metrics Agent
• Failover Manager (Connections, State, Data
Continuity/Self-Healing)
• Schema Versioning

SQL
WAL Files Log Files
Dimensions of Work to measure:
• CPU
• RAM usage
• Network Usage
• TCP accept/connection rate
• Disk Capacity
• Maybe disk IO (read, write)
• Stability
• Request volume
• Request Latency
• Query performance
• Kernel Lock Contention
• Userland buffer eviction rate
• Cache-miss rate
• Size of blast radius
• ... etc.

SQL
WAL Files Log Files
Complexity Factor:
~30 x (number of tables x metrics per
table)

SQL
WAL Files Log Files
Database PSA Tangent:
• Don't confuse complexity with value.
• Databases are amazingly useful things
because of their productivity and value
as a network service.
• Databases assume the lion's share of the
complexity burden: centralized
complexity is easier than distributed
complexity.

How do you systematically
address inherent,
necessary complexity?

Checklists
• Identify Problems
• Read - Do Checklists
• Ensure critical steps hit
• Useful in emergencies (plane on fire? Do X, Y, and Z...)
• Do - Confirm Checklists
• Verify muscle memory
• Combats atrophy and fatigue

Building a Modern
Operations Checklist

Who uses checklists?
Astronauts
Surgeons
Pilots
Inspectors
Military
IT/Operations?

Good Checklists
• Have a clear purpose
• Are brief: 10-20 items, fit on a single page
• Focus on what's essential/mandatory
• Enumerate what must be done (and frequently forgotten)
• Don't replace personal judgement or skill
• Enforce discipline
• Provide tools for collaboration and communication
• Establish protocol or enforce a norm

Good Checklists
• Have a clear purpose
• Are brief: 10-20 items, fit on a single page
• Focus on what's essential/mandatory
• Enumerate what must be done (and frequently forgotten)
• Don't replace personal judgement
• Enforce discipline
• Provide tools for collaboration and communication
• Establish protocol or enforce a norm

Building a Modern
Operations
Checkli^WAudit

Production Ready
SQL
WAL Files Log Files

Production Ready
SQL
WAL Files Log Files
Organizational Challenges Technical Challenges

Organizational Prerequisites
Standardized Jargon (e.g. SEV1 vs SEV2, client vs consumer)
Policy for Unique Service namespaces (app1 vs appN vs dbN)
# Deny registration access to services prefixed
# "app1-". Discovery of the service is still
# allowed in read mode.
service "app1-" {
  policy = "read"
}
service "app2-" {
  policy = "write"
}
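
How a prefix policy like the one above resolves for a given service name can be sketched as a small model. This is an illustration only, not Consul's implementation: it assumes the longest matching prefix wins and that anything unmatched is denied (in a real cluster the fallback depends on the configured default policy). The function names are hypothetical.

```python
# Illustrative model of prefix-based service policies (not Consul's code).
RULES = {"app1-": "read", "app2-": "write"}
DEFAULT_POLICY = "deny"  # assumption: unmatched services are denied


def policy_for(service, rules=RULES, default=DEFAULT_POLICY):
    """Return the policy attached to the longest prefix matching `service`."""
    matches = [prefix for prefix in rules if service.startswith(prefix)]
    if not matches:
        return default
    return rules[max(matches, key=len)]


def can_register(service):
    """Registration needs write access."""
    return policy_for(service) == "write"


def can_discover(service):
    """Discovery works with either read or write access."""
    return policy_for(service) in ("read", "write")
```

Under these assumptions a token carrying this policy could discover app1-api1 but not register it, and could do both for anything under app2-.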

Naming conventions established within a service (app1-api1 vs app1-dbN)
Rules of Engagement outlining how an outage is:
1. Identified
2. Responded to
3. Recovered from
4. Prevented from recurring
5. Prepared for
6. GOTO step #1

Rules of Engagement outlining how an outage is handled
Centralized documentation
Establish a culture of systems thinking

Establish a culture of systems thinking:
• a system is composed of parts
• a system is greater than the sum of its parts
• all the parts of a system must be related (directly or indirectly),
else there are really two or more distinct systems
• a system is encapsulated (has a boundary)
• a system can be nested inside another system
• a system can overlap with another system
• a system consists of processes that transform inputs into outputs
• a system is autonomous in fulfilling its purpose:

A car is not a system. A car with a driver is a system.

Rules of Engagement outlining how an outage is handled
Centralized documentation
Establish a culture of systems thinking
Establish end-to-end ownership
Decouple service names from team names

Why do we care?
• We aren't always going to be working on our code.
• We need to establish a culture of maintenance and the necessary
supporting systems, both organizational and technical.

Audit Reduced to a Checklist
High-level summary of the service?
Stateful or Stateless
List of important consumers
Release Process
On-Call Instructions / Incident Response
Health Defined
Customer Service Endpoint?
Backups
Geographic Redundancy

Audit back to Checklist
Release Process
Health Defined
Backups
=> Organizational Concern
=> Technical Concern
=> Tech and Org Concern
=> Technical Concern

Plan, Doc, Vet, and Decide Starting Here...
Time
Prod
1) Idea!
2) Production Ready
R&D

... ideally before here...
Time
Production
1) Idea!
N) End of Life
R&D

... but NO later than here!!!
Time
Production
1) Idea!
N) End of Life
R&D

(It's good to refine here when this happens)
Time
Production
1) Idea!
N) End of Life
R&D

Value from Checklists
Release Process
Health Defined
Backups
=> Faster Training / Fungible Skills
=> Universal / Consistent / Standard
=> Faster Understanding and Training
=> Faster Resolution / Fungible Skills
=> Larger Pool / Increased Sympathy
=> Standardized Resolution
=> One Source of Truth
=> Standard Procedures
=> Unplanned Disasters Mitigated

Summary: Vertical Places to Look
SQL
WAL Files Log Files
Organizational Challenges Technical Challenges

Summary: Horizontal Places to Look
Time
Prod
1) Idea!
2) Production Ready
R&D

Questions?
Thank the audience for their time.
Name: Sean Chittenden
Twitter: @SeanChittenden

Service Checklist: Overview
Service Overview
• Description and relevance to the business
• Short explanation of how the service fits into the ecosystem
of microservices
• Pointers to more detailed documentation
• Pointers to the current team owners
Stateful or Stateless service
Does the service employ any internal caching
Dependency management: e.g. embedded libraries that have been
vendored (not necessary with Go, this is self-evident)

Service Overview
$ head my-service.job
# This declares a job named "service123". There can be exactly one
# job declaration per job file.
job "service123" {
  # Specify this job should run in the region named "us". Regions
  # are defined by the Nomad servers' configuration.
  region = "us"
  # Spread the tasks in this job between us-west-2 and us-east-1.
  datacenters = ["us-west-2", "us-east-1"]
  # Run this job as a "service" type. Each job type has different
  # properties. See the documentation below for more examples.
  type = "service"

Service Overview
$ head my-docs.job
# This declares a job named "docs". There can be exactly one
# job declaration per job file.
job "docs" {
  meta {
    owner = "https://github.com/myorg/myproject/blob/master/owners.md"
    docs-url = "https://github.com/myorg/myproject"
    system-summary = "https://github.com/myorg/myproject/blob/master/system-summary.md"
  }

Service Overview
• Auditable via the API: 
http://nomad.service.consul:4646/v1/job/<ID>
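
Because the job is available over the API, the overview items can be audited mechanically. A rough sketch, assuming the job JSON exposes the meta block under a "Meta" key and reusing the three keys from the job file above; REQUIRED_META and both function names are hypothetical:

```python
import json
import urllib.request

# Keys taken from the "docs" job file shown earlier; treating them as the
# required set is an assumption of this sketch.
REQUIRED_META = ("owner", "docs-url", "system-summary")


def fetch_job(job_id, base="http://nomad.service.consul:4646"):
    """Fetch a job definition from the Nomad HTTP API."""
    with urllib.request.urlopen(f"{base}/v1/job/{job_id}", timeout=5) as resp:
        return json.load(resp)


def missing_meta(job, required=REQUIRED_META):
    """List required meta keys that are absent or empty in a job dict."""
    meta = job.get("Meta") or {}
    return [key for key in required if not meta.get(key)]
```

Running missing_meta(fetch_job("docs")) in a scheduled check would flag jobs that reached production without an owner or documentation pointer.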

List of high-level consumers
• API consumed by other services within the organization
• Public Internet
• Marketing (a/b testing?)
• Customer Service
Service Confidentiality Classification
Sales Information
• Unofficial docs that can be used by sales or marketing.
Authoritative information comes from the team writing the
service. Doesn't need to be final copy, but should include useful
figures about this service.

Release Process
On-call - what's the fallback strategy for a small service with a team
of two?
How is the service installed?
How is the service configured?
How is the service's process managed?
• How is it started?
• How is it stopped?
• Is there a graceful shutdown procedure vs a rapid shutdown
procedure?
• Can you send a SIGKILL signal to the process?
Incident Response

Release Process
On-call - what's the fallback strategy for a small service with a team
of two?
How is the service installed?
How is the service configured?
How is the service's process managed?
Is the process management platform-specific?
Is there a table mapping each signal to the effect of the signal?
Process Management
Is Process Management hooked into the monitoring and alerting
framework?
Incident Response

Health
Health of the Service
What is the definition of healthy?
TIP: Use Consul Health Checks for Break/Fix
{
  "service": {
    "name": "redis",
    "tags": ["master"],
    "address": "127.0.0.1",
    "port": 8000,
    "enableTagOverride": false,
    "checks": [
      {
        "script": "/usr/local/bin/check_redis.py",
        "interval": "10s"
      }
    ]
  }
}
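
The slide does not show check_redis.py itself. What such a script might contain is sketched below, relying on the script-check convention that exit code 0 means passing, 1 means warning, and anything else means critical. The 100 ms warning threshold and all function names are assumptions for illustration.

```python
import socket
import time

PASSING, WARNING, CRITICAL = 0, 1, 2
WARN_AFTER_S = 0.100  # hypothetical latency threshold for a warning


def ping(host="127.0.0.1", port=8000, timeout=1.0):
    """Send an inline Redis PING; return (reply bytes, elapsed seconds)."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"PING\r\n")
        reply = sock.recv(64)
    return reply, time.monotonic() - start


def exit_code(reply, elapsed_s, warn_after_s=WARN_AFTER_S):
    """Map a PING result onto the passing/warning/critical exit codes."""
    if not reply.startswith(b"+PONG"):
        return CRITICAL
    return WARNING if elapsed_s > warn_after_s else PASSING


def check(**kwargs):
    """Run the probe; any socket error or timeout counts as critical."""
    try:
        reply, elapsed = ping(**kwargs)
    except OSError:
        return CRITICAL
    return exit_code(reply, elapsed)
```

A real script would finish by passing the result of check() to sys.exit so the agent can read the status from the exit code.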

Health of the Service
What is the definition of healthy?
Is there any seasonality to the definition of healthy?
How do you observe the service?
Is there any automated capacity planning attached to the service?
Health

Customer Service
How does customer service interact with this service?
Does CS have direct access to PII or other sensitive material?
Customer Service

Quality Metrics
What are the important KPIs coming out of this service?
• If you don't measure it, you won't optimize for it.
• If you don't measure it, you can't manage it.
• You can only succeed at what you can measure.
• You can't improve what you don't measure.

Quality Metrics
Measuring the number of round-trips between Support and
Customers/Users
Measuring the number of round-trips between Support and
Engineering
Measuring the "level of effort" or amount of input a person has
to submit in order to receive support.
Accuracy of information provided by customers?
Measure the "rate of access" to PII information.

Quality Metrics
Strategy: Centralize and poll for number of tagged issues out of
GitHub.
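
One way the polling strategy could look: pull the issue list from the GitHub REST API on a schedule and count issues per label. The sketch below covers only the counting step and works on already-decoded issue objects; the label names in the usage note are hypothetical.

```python
from collections import Counter


def count_by_label(issues, labels):
    """Count issues carrying each label of interest.

    `issues` is a list of issue objects as returned by the GitHub issues
    endpoint; pull requests also appear there and are skipped.
    """
    counts = Counter({label: 0 for label in labels})
    for issue in issues:
        if "pull_request" in issue:
            continue
        names = {entry["name"] for entry in issue.get("labels", [])}
        for label in labels:
            if label in names:
                counts[label] += 1
    return dict(counts)
```

Tracking, say, "support-roundtrip" and "needs-engineering" counts over time gives a crude trend line for the round-trip metrics listed above.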

Organization Prerequisites
Define the gradients in an outage
• SEV1 - Hard outage, complete loss of service or "major impact to
business value/revenue".
• SEV2 - Partial outage or impaired service (SLA violation).
• SEV3 - Integrity of service issue (bugs).
• SEV4 - Non-critical issue that needs to be prioritized 9-5 M-F.
• SEV5 - Janitorial work that needs to happen on a routine schedule.
Define what it means to follow through with an outage.
• What level of follow through is required?
• Postmortems?
• Who patches it and who receives time to actually fix it permanently?

Outage Consequences
Revenue Impact User Impact Systems Impact Escalation
SEV1
SEV2
SEV3
SEV4
SEV5

Outage Consequences
Define the gradients in an outage
Sketch out the direct and indirect consequences on the system

Tracing
Is there a tracing token sent by upstream? If not, why not?
Is this service at the boundary of HTTP and RPC?
Is there an API library available that will automatically inject the
tracing token into downstream calls?
Can tracing only be used in aggregate or can it be used for
individual problems?
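
The "automatically inject the tracing token" question can be made concrete with a small helper. This is a sketch under assumptions: the header name X-Trace-Id is hypothetical, and a service at the edge mints a token when upstream did not send one.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name


def trace_id_from(incoming_headers):
    """Reuse the upstream tracing token, or mint one at the boundary."""
    for name, value in incoming_headers.items():
        if name.lower() == TRACE_HEADER.lower() and value:
            return value
    return uuid.uuid4().hex


def downstream_headers(incoming_headers, extra=None):
    """Build headers for a downstream call with the tracing token injected."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id_from(incoming_headers)
    return headers
```

If every outbound call goes through a helper like this, a single token can be followed across an individual request rather than only in aggregate.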

Is the service geographically redundant or not? If not, why not?
If yes:
Does this happen automatically?

{
  "Name": "my-query",
  "Session": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
  "Token": "",
  "Near": "node1",
  "Service": {
    "Service": "redis",
    "Failover": {
      "NearestN": 3,
      "Datacenters": ["dc1", "dc2"]
    },
    "OnlyPassing": false,
    "Tags": ["master", "!experimental"]
  },
  "DNS": {
    "TTL": "10s"
  }
}
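
The Failover block above is what makes geographic failover automatic. As I read the prepared-query semantics, the local datacenter is tried first, then the NearestN datacenters by round-trip time, then the explicit Datacenters list, with each datacenter queried at most once. The following is a model of that ordering only, not Consul code; the function names and the sample datacenters in the usage note are assumptions.

```python
def failover_order(local_dc, dcs_by_rtt, nearest_n, datacenters):
    """Order in which datacenters would be consulted for a service."""
    nearest = [dc for dc in dcs_by_rtt if dc != local_dc][:nearest_n]
    order = [local_dc]
    for dc in nearest + list(datacenters):
        if dc not in order:  # each datacenter is tried only once
            order.append(dc)
    return order


def resolve(healthy_by_dc, order):
    """Return (datacenter, instances) for the first DC with healthy instances."""
    for dc in order:
        instances = healthy_by_dc.get(dc)
        if instances:
            return dc, instances
    return None, []
```

With a local datacenter of dc1, an RTT-sorted list of dc3, dc4, dc5, dc6, and the settings from the query above, the model yields dc1, dc3, dc4, dc5, dc2.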

Is the service geographically redundant or not? If not, why not?
If yes:
Does this happen automatically?
What mechanisms handle this?
Are there any regulatory concerns that come into play?
Is the failover process manual?
Does this happen at human timescale or on a machine
timescale?
Is the geographically redundant path continually tested?

Active-Active
Can this service be active-active?
If not, why not?
If yes, what kind of locking concerns or information sharing
concerns need to be factored in?

Data Classification
Does the service come in contact with any sensitive data?
If yes:
What type of data? (PII, passwords, keys, financial information,
credit cards, ACH, etc.)
What regulatory compliance is applicable to this service?
(Safe Harbor, PCI, SOX?)
Is the data stored, or just passed in transit?
Can any sensitive data end up in log files?
Can sensitive, but necessary data use a proxy token instead?
Can this information leave the organization and go to a third
party?

SPOFs
What SPOFs exist, if any?
What's the timescale for this SPOF?
What's the timescale for transition from leader to follower or
follower to leader?
If stateful, is "split brain" possible?
NOTE: State is a SPOF: failing over state takes time.

Escalation Path
What's the escalation path inside of the organization?
What's the escalation path outside of the organization? Open
Source community or commercial support?
Is there semi-regular training on how to triage and escalate?
Is there a playbook for relevant low-level debugging tools available
for use?
TIP: Use automatic escalations within PagerDuty or OpsGenie.
TIP: Use standardized service techniques to create fungible support
resources.

Quantiles of Health
Can health be defined in terms of quantiles vs binary up/down?
What are the upper and lower bounds for healthy?
What system is authoritative for determining if something is
healthy?
How can an external actor verify if the system is healthy? Is there
a command-line tool or API?
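
A quantile-based definition of health can be quite small. The sketch below uses a nearest-rank quantile and checks each quantile against a lower and an upper bound; the bounds themselves are hypothetical numbers chosen for illustration. A lower bound is included because suspiciously fast responses can mean the service is failing early.

```python
import math

# Hypothetical bounds in milliseconds: {quantile: (lower, upper)}.
BOUNDS_MS = {0.50: (1.0, 20.0), 0.99: (1.0, 250.0)}


def quantile(samples, q):
    """Nearest-rank quantile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def health(latencies_ms, bounds=BOUNDS_MS):
    """Report which quantiles fall outside their healthy band."""
    observed = {q: quantile(latencies_ms, q) for q in bounds}
    breaches = [q for q, (low, high) in bounds.items()
                if not low <= observed[q] <= high]
    return {"healthy": not breaches, "observed": observed, "breaches": breaches}
```

Exposing the result of health() through an endpoint or a command-line tool is one way to let an external actor verify the same definition the service uses internally.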

Canary
Does the request have a "canary request mode?"
Can this be enabled per customer?
Is the canary mode used in monitoring to validate end-to-end
functionality?

Downstream Services
How does this service respond upstream to failures in its downstream
dependencies?
Is there a metric to indicate timed-out requests?
Is there a feature-flag that enables a circuit-breaker?
How are connectivity problems retried in the system?
Retry the same backend?
Retry a different backend?
Timeout?
Is there a deadline timer passed in?
Is a header added to indicate partial failure of downstream services?
Are response codes standardized?

Architectural Limits
What are the expected limits of this system?
How often is "peak-load" defined?
Is there 3x capacity for the service in order to absorb reasonable
burstiness?
Is the band of nominal resource usage defined?
• "At 10K RPS, network utilization should be between
200-300Mbps, using two cores at ~60% utilization, 50MB of
RAM, and doing an average of 5-10 disk IOPs. All values are
+/- 25%."
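
Once the nominal band is written down, it can be checked mechanically. The sketch below encodes the example figures from the slide and flags any metric outside the band. It assumes the 25% tolerance widens both edges of each range and treats "two cores at ~60%" as 1.2 busy cores; the metric names are hypothetical.

```python
TOLERANCE = 0.25  # "All values are +/- 25%"

# Nominal (low, high) values at 10K RPS, taken from the example above.
NOMINAL_AT_10K_RPS = {
    "network_mbps": (200, 300),
    "cpu_cores_busy": (1.2, 1.2),  # two cores at ~60% utilization
    "ram_mb": (50, 50),
    "disk_iops": (5, 10),
}


def out_of_band(observed, nominal=NOMINAL_AT_10K_RPS, tolerance=TOLERANCE):
    """Return {metric: (value, low, high)} for metrics outside the band."""
    alerts = {}
    for metric, (low, high) in nominal.items():
        lo, hi = low * (1 - tolerance), high * (1 + tolerance)
        value = observed[metric]
        if not lo <= value <= hi:
            alerts[metric] = (value, lo, hi)
    return alerts
```

An empty result means the service is behaving as documented at that request rate; anything else is a prompt to look before the limit is reached.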

Logging
How is logging set up?
What gets logged?
What is the minimum log retention?
How often are logs rotated? By size or by fixed interval?
Are logs shipped off box?
Are they streamed without hitting disk?
Is there any sensitive data in the logs?

Load Shedding
How can you load-shed?
Are there any feature flags that enable circuit breakers that
reduce expensive functionality?
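
One possible shape for flag-driven load shedding: rank features by cost and switch off the most expensive ones first as utilization climbs. Everything in this sketch is hypothetical, including the feature names, the costs, and the 80% and 90% thresholds.

```python
# Hypothetical relative cost per feature; cost 0 marks core functionality
# that is never shed.
FEATURE_COST = {"recommendations": 3, "search_suggestions": 2, "checkout": 0}

# (utilization threshold, minimum cost that gets shed at that threshold)
SHED_STEPS = [(0.80, 3), (0.90, 2)]


def enabled_features(utilization, costs=FEATURE_COST, steps=SHED_STEPS):
    """Return {feature: enabled} for the current utilization (0.0 to 1.0)."""
    shed_from = None
    # Walk thresholds from highest to lowest and stop at the first one reached.
    for threshold, min_cost in sorted(steps, reverse=True):
        if utilization >= threshold:
            shed_from = min_cost
            break
    return {name: shed_from is None or cost < shed_from
            for name, cost in costs.items()}
```

Keeping the steps in data rather than code means the shedding order can be reviewed as part of the checklist and adjusted during an incident without a deploy.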

Prepare For the Worst
Assume the service can't come back online, what's the impact?

Backup and Restore
Does this system have a reproducible build?
How often are backups taken?
How often are the restores executed?
What's the recovery point objective?
What's the mean time to recovery?
What's the definition of acceptable data loss in the event of
failure?

Deployment
How is this service tested and deployed?
Is the deployment in prod any different than test?
How can you roll back?
Is the application part of a CI/CD pipeline?
How is production data scrubbed and used in staging/UAT in
order to simulate production-like loads without using production
data?
