HowTo DR

HowTo DR

Josh Berkus
PostgreSQL Experts
SCALE 2014

Disaster Recovery
“The process, policies and
procedures that are related to
preparing for recovery or
continuation of technology
infrastructure which are vital to an
organization after a natural or
human-induced disaster.”
Wikipedia, February 2014

Disaster Recovery
Restoring services after the
unexpected.

Disaster Recovery
Limiting:
1. Downtime
2. Data Loss

server
failure

getting
hacked
natural
disaster

server
failure

getting
hacked
natural
disaster

storage
failure
traffic
spike

network
failure

admin
error

bad
update

OS / VM
problem

software
bugs

The Nines
Nines
99.9%
99.99%
99.999%

Down/Year
9 hours
1 hour
5 minutes

The Nines
●

Treats all downtime causes as
identical
–

●
●
●

except the ones it ignores

Doesn't address data loss
Really “Business Continuity”
also unrealistic

Disaster
Server Failure
Network Failure
Admin Error
Bad Update
Storage Failure
Getting Hacked
Natural Disaster

Downtime Data Loss

Detect

Disaster

Downtime Data Loss

Server Failure

0

0

Network Failure

0

0

Admin Error

0

0

Bad Update

0

0

Storage Failure

0

0

Getting Hacked

0

0

Natural Disaster

0

0

Detect

10 yrs

10 yrs

Disaster

Downtime Data Loss

Server Failure

5min

1min

Network Failure

3hrs

10min

Admin Error

1hr

1hr

Bad Update

1hr

1hr

Storage Failure

5min

30min

Getting Hacked

1hr

1hr

Natural Disaster

6hrs

1hr

Detect

3 mo

3 mo

Elements of a Plan
1. Backups/Replicas
2. Replacements
3. Procedures
4. People

Backups
●
●
●
●
●

pg_dump
mysqldump
rsync
zfs snapshot
SAN exportable snapshot

Backups++
●
●
●
●

Periodic
Portable
Simple
Recover point-in-time

Backups-●
●

Slow to restore
Data loss interval

Backups
●

Good for:
natural disaster
– admin error, bad update
– software bugs
– getting hacked
–

●

Bad for everything else

Replication
●
●
●
●
●
●

Postgres binary replication
MySQL replication
Redundant HBase nodes
Redis clustering
DRBD
GlusterFS etc.

Replication++
●
●
●

Continuous
Fast failover
Low data loss

Replication-●
●
●
●
●

Extra hardware
Complex
High-maintenance
Can hurt performance
Can replicate failures

Replication
●

Good For:
–

●

server, storage, network failure

Bad For:
admin error, getting hacked
– software bugs
–

Continuous Backup
●
●
●

Also “PITR”
Continous like replication
Partial recovery like backups

… where you gonna
restore those backups
to?

Replacing Services
●
●
●
●
●

servers
network
storage
OS image
software reversion

3AM
is not the time
to improvise

Procedures
… for each recovery step
… for deciding what steps

Database Server
Does Not Respond
1. Determine if physical server is
down
a. if network is down, use plan N1.
2. If not, try to restart database
using command …
3. Still down? Fail over to replica
using command …
4. Check replica.
5. Not working? Restore backup to
test server 1 using command ...

Good: detailed written
procedures
Better: written procedures with
pastable commands
Best: tested single-command
scripts

Fallback Procedures
●
●
●

Sometimes recovery fails
Have fallback procedures
If the fallback fails
… time for a meeting!

Never
ever
ever
ever
improvise.

Know who to call
●
●
●
●
●

on call staff
experts in each service
consultants/contractors
vendors
required authorizations

Contact Book
●

●

Include as much contact
information as possible
Put copies in more than one
place
–

●

including paper!

Keep it up to date

Test Your DR
Good: when you create the
procedure
Better: quarterly
Best: as part of daily/weekly
provisioning

An untested
backup
is one which
doesn't work.

“It's a cloud, right?
That means it's
redundant, right?”

… not necessarily for
your servers

unless you pay for it!

Some new problems
●
●
●
●

Instance failure
Resource overcommit
Zone failures
Admin error at scale

Some new solutions
●

Redundant services
–

●
●

RDS, VIP, S3

Rapid server deployment
Cheap replicas

… otherwise
pretty much
the same.

backup locations
●

shared instance storage (EBS)
–

●

fast failover for instance fail

long-term storage API (S3)
redundant
– large
–

Use your rapid deploy!
●
●

Continuous backup to S3
Deploy scripts + server images
–

●

Chef/Salt/Puppet/etc. helps here

= fast recovery
–

with low running costs

DR Tips
●

Have multiple copies of your plan
–

●
●

in multiple locations

A SAN is not a DR solution
One form of backup is seldom
enough

Questions?
●

Josh Berkus
www.databasesoup.com
– www.pgexperts.com
–

●

Coming up:
NYC pgDay April 3-4
– pgCon May 21-24
–

Copyright 2013 PostgreSQL Experts Inc. Released under the Creative Commons
Share-Alike 3.0 License. All images, logos and trademarks are the property of their
respective owners and are used under principles of fair use unless otherwise noted.

HowTo DR

Related slideshows

More Related Content

HowTo DR