Using Service Brokers to Manage Data
Lifecycle
Josh Kruck | @krujos
jkruck@pivotal.io
github.com/krujos
2
What are some of the operational problems with data?
3
Business Critical Data Lifecycle: first 12 hours (RTO 00:05, RPO 01:00)
[Diagram: Primary, Replica, Snapshots, Backup, DR Backup]
4
Business Critical Data Lifecycle: first 24 hours (RTO 00:05, RPO 01:00)
[Diagram: Primary, Replica, Snapshots, Backups, DR copies]
5
525,600
minutes
6
5476
copies
7
8
(capex is easy, just buy more stuff)
copies aren’t really the problem!
9
The real problem is 5476 copies are…
10
managed by 3 systems
[“storage”, “backup”, “rdbms”]
11
and 5 teams.
[
“storage”,
“backup”,
“offsite provider”,
“app owner”,
“dba”
]
12
(you shouldn't buy more people)
opex is the problem
13
what’s the read/write load on the copy?
14
0
5475 copies doing nothing for your business
15
Why all this talk about backups and stuff?
16
Good code needs good tests.
Good tests need good data.
Good data needs… a copy.
A play in 3 acts
so let’s get one!
17
“I don’t think we
have any copies of
that”
18
“I’m not allowed to have prod logs, much less the db”
19
we can do it, this one time: file a ticket.
20
Solved!
But did we create another problem?
21
Once you find a copy, it needs a curator
Sizing (don’t use all 10 TB of prod to test)
But your sample must represent the entirety of the dataset.
Representative curation is futile with most datasets (unknown unknowns).
Sizing means you restrict your tests to what you left in.
Sizing hides performance issues (missing indexes).
So maybe it’s not worth it…
22
Once you find a copy, it needs a curator
Sanitize it!
Can’t have SSNs and credit card numbers in test
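A rough sketch of what that sanitize step might look like against a Postgres copy, in Python with psycopg2. The table and column names (customers.ssn, customers.cc_number) are hypothetical; the real masking rules depend on your schema and your compliance requirements.

# sanitize.py: mask PII in a freshly cloned test database (sketch only).
# The "customers" table and its "ssn" / "cc_number" columns are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=testcopy user=curator host=localhost")
with conn, conn.cursor() as cur:
    # Overwrite real values with obviously fake ones so a leaked test
    # database can never expose production PII.
    cur.execute("UPDATE customers SET ssn = '000-00-0000'")
    cur.execute("UPDATE customers SET cc_number = '4111111111111111'")  # test card number
conn.close()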
23
Once you find a copy, it needs a curator
Delete!
old data smells funny.
24
Once you find a copy, it needs a curator
Refresh!
GOTO 10
25
hard|complex
manual
infrequent
error prone
handoffs
deletion
ownership
Curation is
expensive
26
A manual process that starts with a ticket is the wrong solution
27
The sum of the mess is worth more than its parts
There are 5475 secondary copies with no load; can we leverage them for testing?
Fix: Let CF manage your data.
28
How?
29
most copies do nothing, but when the sky is falling you need them
first do no harm
30
cf create-service
Copy Data
Sanitize Data
cf push <app>
Test
cf delete <app> -r -f
cf delete-service
Pattern:
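Read top to bottom, the pattern is scriptable. A minimal sketch in Python that shells out to the cf CLI; the service name, plan, and app name are placeholders, and the copy/sanitize steps are assumed to happen behind the service broker.

# lifecycle.py: drive the create / test / delete pattern with the cf CLI (sketch).
import subprocess

def cf(*args):
    # Run a cf CLI command and fail loudly if it errors.
    subprocess.run(["cf", *args], check=True)

# 1. Ask the platform (via a service broker) for a fresh, sanitized copy.
cf("create-service", "prod-copy", "sanitized", "test-db")  # names are hypothetical

# 2. Push the app and run the tests against the copy.
cf("push", "my-app")
# ... run the test suite here ...

# 3. Throw it all away when the tests finish.
cf("delete", "my-app", "-r", "-f")
cf("delete-service", "test-db", "-f")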
31
How do you fill in that hand-wavy part in the middle?
32
Putting the E in Enterprise
Buy a CDM Product
Actifio, Delphix, ViPR
Great if they support your workloads, and if you can consume the form factors they deliver.
33
Based on technology that allows layered writes
Layered FS (Docker, Docker, Docker)?
Clones, Linked Clones, VM Snaps
Writeable Snapshots (FlexClone, XtremIO, LVM Snaps)
Building is harder than buying
BYO
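For the BYO route, a writeable LVM snapshot is the simplest illustration of layered writes. A sketch only: the volume names (vg_data/prod_db) are made up, it needs root, and real tooling would also quiesce the database before snapshotting.

# snapshot.py: create and discard a writeable LVM snapshot of a data volume (sketch).
# Volume and mount point names are hypothetical.
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# Copy-on-write snapshot: writes from the test copy land in the 20G
# snapshot area, never on the production volume.
run(["lvcreate", "--snapshot", "--size", "20G",
     "--name", "test_copy", "/dev/vg_data/prod_db"])
run(["mount", "/dev/vg_data/test_copy", "/mnt/test_copy"])

# ... point a throwaway Postgres at /mnt/test_copy, sanitize, test ...

# Dispose: unmount and drop the snapshot when the tests are done.
run(["umount", "/mnt/test_copy"])
run(["lvremove", "-f", "/dev/vg_data/test_copy"])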
34
cf create-service
Snap Prod VM
Spin up VM
Allocate IP
Sanitize Data in PG
cf push demo
Test
Dispose
AMI and Postgres Demo
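Behind cf create-service and cf delete-service sits a service broker speaking the Cloud Foundry Service Broker API (v2). The repo linked on the next slide is a Java/Spring implementation; what follows is only a minimal Python/Flask sketch of the catalog, provision, and deprovision endpoints, with the snapshot/sanitize/dispose work stubbed out as hypothetical helpers (binding is omitted).

# broker.py: skeletal v2 service broker for the data-copy lifecycle (sketch only).
# snapshot_and_sanitize / dispose are hypothetical stand-ins for the AMI +
# Postgres work in the demo.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/v2/catalog")
def catalog():
    # Advertise one service with one plan so `cf create-service` can find it.
    return jsonify({"services": [{
        "id": "prod-copy-service-id",
        "name": "prod-copy",
        "description": "Sanitized copy of production data",
        "bindable": True,
        "plans": [{"id": "sanitized-plan-id", "name": "sanitized",
                   "description": "Snapshot prod, sanitize, hand out credentials"}],
    }]})

@app.route("/v2/service_instances/<instance_id>", methods=["PUT"])
def provision(instance_id):
    # cf create-service lands here: snap the prod volume, spin up a copy,
    # sanitize the data.
    # snapshot_and_sanitize(instance_id)  # hypothetical helper
    return jsonify({}), 201

@app.route("/v2/service_instances/<instance_id>", methods=["DELETE"])
def deprovision(instance_id):
    # cf delete-service lands here: tear the copy down, reclaim the storage.
    # dispose(instance_id)  # hypothetical helper
    return jsonify({}), 200

if __name__ == "__main__":
    app.run(port=8080)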
35
https://github.com/krujos/data-lifecycle-service-broker
please help!


Editor's Notes

  1. First act: how do I get the copies?
  2. Much sleuthing and failed attempts to generate legit test data later…
  3. Act II.
  4. Act III. I have a customer who hasn’t refreshed test data in three years.
  5. “Represent the entirety of the dataset” means things like previous schemas: rows with missing additive fields, FKs, etc. Is selecting those records going to cause issues? What about formats assumed in the data itself (but surely no one stores encoded information in their database)? Does everyone know the data well enough to know what representative is? (No.)