Integrating Multiple CDN Providers
Our experiences at Etsy

@lozzd • @ickymettle
Marcus Barczak

Laurie Denness

Staff Operations Engineers
Beginning of 2010

Today
Background
▪ First started using a single CDN in 2008
▪ Exponential Growth
▪ Start of 2012: began investigating running multiple CDNs

Why use a CDN?
▪ Goal: Consistently fast user experience globally
▪ Improve last mile performance by caching content close to the user
▪ Offload content delivery from origin infrastructure to the CDN provider

Why use more than one CDN?
▪ Resilience
  - Eliminate single point of failure
▪ Flexibility
  - Balance traffic based on business requirements
▪ Cost
  - Manage provider costs

The Plan
1. Establish evaluation criteria
2. Initial configuration and testing
3. Test with production traffic
4. Operationalising

Evaluation Criteria
▪ Performance
▪ Configuration
▪ Reporting, Metrics and Logging
▪ Culture

Performance
▪ Baseline Response Times
  - Should be within ±5% of our existing CDN provider’s response times
▪ Hit Ratios and Origin Offload
  - Provider should achieve equivalent or better origin offload performance and hit ratios

Configuration
▪ Complexity
  - How complex is the provider’s configuration system?
▪ Self service
  - Can you make changes directly, or do they require professional services or other intervention?
▪ Latency for changes
  - How long do changes take to propagate?

Reporting, Metrics and Logging
▪ Resolution
▪ Latency
▪ Delivery
▪ Customisation

Culture
▪ Understand our culture
▪ Postmortems
▪ Access to technical staff
▪ Shared success

Initial Configuration and Testing

Clean the house
▪ Managing caching TTLs from origin
  - CDNs honour the origin Cache-Control headers!

    <LocationMatch "\.(gif|jpg|jpeg|png|css|js)$">
        Header set Cache-Control "max-age=94670800"
    </LocationMatch>

Clean the house
▪ Manage gzip compression from origin
  - Honoured by CDNs
  - Compression from origin to CDN

    ## mod_deflate compression - see OPS-1537 ##
    AddOutputFilterByType DEFLATE text/html text/plain text/css application/x-javascript [..]

Clean the house
If you can do it at origin, do it at origin

Mean Time To Curl
http://www.flickr.com/photos/wwarby/3297205226
First request (cache miss):
curl -i -H 'Host: img0.etsystatic.com' \
    global-ssl.fastly.net/someimage.jpg
HTTP/1.1 200 OK
Server: Apache
Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT
Cache-Control: max-age=94670800
[...]
X-Served-By: cache-lo82-LHR
X-Cache: MISS
X-Cache-Hits: 0

Second request (cache hit):
curl -i -H 'Host: img0.etsystatic.com' \
    global-ssl.fastly.net/someimage.jpg
HTTP/1.1 200 OK
Server: Apache
Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT
Cache-Control: max-age=94670800
[...]
X-Served-By: cache-lo82-LHR
X-Cache: HIT
X-Cache-Hits: 1
Mean Time To Curl = Done
https://www.etsy.com/listing/99871278
Mean Time To Curl
▪ No need to touch existing infrastructure
▪ Smoke test of functionality
▪ 10 minutes from configuration to curl
▪ New providers should be plug and play

Testing In Production
http://www.flickr.com/photos/solarnu/10646426865
Testing with Production Traffic
▪ Images only at first
▪ Good test of caching performance
▪ Easy to test by swapping hostnames
▪ Made even easier with our A/B testing framework

A/B Test Framework
▪ Fine grained control
▪ Enable test for specific users or groups
▪ Percentage of users
▪ All controlled via configuration in code
▪ Rapid and complete rollback

Configure Mappings to CDNs
$server_config["image"] = array(
    'akamai' => array(
        'img0-ak.etsystatic.com',
        'img1-ak.etsystatic.com',
    ),
    'edgecast' => array(
        'img0-ec.etsystatic.com',
        'img1-ec.etsystatic.com',
    ),
    'fastly' => array(
        'img0-f.etsystatic.com',
        'img1-f.etsystatic.com',
    ),
);

Test Controls
$server_config['ab']['cdn'] = array(
    'enabled' => 'on',
    'weights' => array(
        'akamai'   => 0.0,
        'edgecast' => 0.0,
        'fastly'   => 0.0,
        'origin'   => 100.0,
    ),
    'override' => 'cdn_diversity',
);

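The slides do not show how the A/B framework turns these weights into a per-user choice. As a rough illustration only, the sketch below assumes a deterministic hash of a user identifier mapped onto the cumulative weights; `pick_provider` and the sample weights are hypothetical names and values, not Etsy's implementation.

```python
import hashlib

# Hypothetical weights in the same shape as the 'weights' array in the slide
# (percentages); the values are illustrative only.
WEIGHTS = {"akamai": 25.0, "edgecast": 25.0, "fastly": 25.0, "origin": 25.0}


def pick_provider(user_id, weights):
    """Deterministically map a user to a provider in proportion to the weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Hash the user id into a stable point in [0, total].
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for provider, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return provider
    # Rounding edge case: fall back to the last provider with a non-zero weight.
    return [p for p, w in weights.items() if w > 0][-1]


if __name__ == "__main__":
    print(pick_provider("user-42", WEIGHTS))
```

Because the choice depends only on the identifier and the weights, the same user keeps the same provider until the weights change, and setting `origin` to 100.0 sends everyone back to origin, which matches the rollback behaviour described on the previous slide.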
Metrics and Monitoring

Even if it doesn’t move, graph it anyway
Simplest approach: Provider’s dashboards

▪ Get more detail by pulling metrics in house
▪ Write script to pull data from API
▪ Create dashboards with data
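
As a minimal sketch of the "pull from API, push to dashboards" step, the code below formats a set of stats as Graphite plaintext-protocol lines (`path value timestamp`) and sends them to a carbon listener. The `cdn.<provider>.<metric>` naming and the sample numbers are assumptions for illustration; the provider API call itself is left out because each vendor's API differs.

```python
import socket
import time


def graphite_lines(provider, stats, timestamp):
    """Format stats as Graphite plaintext protocol lines: 'path value timestamp'."""
    return [
        f"cdn.{provider}.{name} {value} {int(timestamp)}"
        for name, value in sorted(stats.items())
    ]


def send_to_graphite(lines, host, port=2003):
    """Send lines to a carbon plaintext listener (2003 is the usual default port)."""
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Stand-in for a response from a provider's stats API; the numbers are made up.
    sample = {"requests": 1200, "hit_ratio": 0.93}
    for line in graphite_lines("fastly", sample, time.time()):
        print(line)
```

Running something like this from cron once a minute per provider would give every CDN the same metric namespace, which is what makes a unified dashboard possible.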

Testing Plan
1. for c in $cdns; do rampup $c; done;
2. Deliberately slow and steady
3. Watch traffic increase
4. Watch origin offload increase
5. Watch performance

Downsides of this approach
▪ A/B testing can’t be used for the main site
▪ Exposing your test CNAMEs
  - Especially if hotlinking is a concern

How do you know it’s broke?
▪ Check the graphs!
▪ Check with your community
▪ Keep support in the loop

Operationalising

http://www.flickr.com/photos/98047351@N05/9706165200
Content Partitioning

Etsy’s site partitioning
▪ Dynamic HTML Content: www.etsy.com
▪ Static Assets (js, css, fonts): site.etsystatic.com
▪ Listing Images, Avatars: imgX.etsystatic.com

Balancing Traffic in Production
Balancing Traffic Using DNS
▪ Traffic Manager
▪ Extends DNS to dynamically return records based on rules
▪ Weighted round robin

Balancing Traffic Using DNS
[2589:~] $ dig +short www.etsy.com
www.etsy.com.edgekey.net.
e2463.b.akamaiedge.net.
23.74.122.37
[2589:~] $ dig +short www.etsy.com
etsy.com.
38.123.123.123
[2589:~] $ dig +short www.etsy.com
cs34.adn.edgecastcdn.net.
93.184.219.54
[2589:~] $ dig +short www.etsy.com
global-ssl.fastly.net.
185.31.19.184

Balancing Traffic Using DNS
▪ Rule updates typically made via web UI
▪ Can be slow and error prone
▪ Changes need to be applied to all three domains
▪ API available to make changes programmatically
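
One way to avoid the "apply the same change three times by hand" problem is to validate a set of weights once and generate an identical update for each domain. The sketch below is illustrative only: the payload shape, the `build_updates` helper and the domain list are assumptions, and no real traffic-manager API is called.

```python
# Assumed list of balanced domains, taken from the site partitioning slide;
# 'imgX' is the placeholder used there, not a real hostname.
DOMAINS = ["www.etsy.com", "site.etsystatic.com", "imgX.etsystatic.com"]


def build_updates(weights, domains):
    """Validate the weights once, then return one identical update per domain."""
    if any(weight < 0 for weight in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if abs(total - 100.0) > 1e-9:
        raise ValueError(f"weights must sum to 100, got {total}")
    # Copy the weights so later changes to one update cannot leak into another.
    return [{"domain": domain, "weights": dict(weights)} for domain in domains]


if __name__ == "__main__":
    updates = build_updates(
        {"akamai": 40.0, "edgecast": 30.0, "fastly": 30.0}, DOMAINS
    )
    for update in updates:
        print(update)
```

Rejecting bad input before any request is sent means a typo cannot leave the three domains with different balances.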

cdncontrol

DNS balancing downsides
▪ Low TTLs for fast convergence
▪ Mo QPS == Mo Money
▪ More DNS lookups for users
▪ Not 100% instant or deterministic

50% within 1 minute
Long Tail is Loooong

Monitoring in Production
Whoopsie Page
▪ Static HTML delivered for 5xx errors
  - Branding
  - Translated error messages
  - Links to status page

Failure Beacons
1. 1x1 tracking pixel embedded in page
   [...]
   <img src="//failure.etsy.com/status/images/beacon.gif?beacon_source=fastly_origin_failure-etsy.com">
   </body>
   </html>
2. Request creates an access log line
3. Scrape them out every minute using logster
   self.reg = re.compile(
       r'^\S+(s:)? (?P<remote_addr>[0-9.]+),? '
       r'[0-9.,\- ]+ \[[^\]]+\] "GET /status/images/beacon.gif\?'
       r'(beacon_)?source=(?P<source>\S+) HTTP/1.\d" \d+ [\d-]+ '
       r'"(?P<referrer>[^"]+)" "(?P<user_agent>[^"]+)" .*$')
4. Logster posts event counts to Graphite
5. Alert on Graphite graph in Nagios

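To make steps 2 to 4 concrete, here is a small sketch that counts beacon hits per source from access log lines. It uses a deliberately simplified pattern written for this example, not the logster parser from the slide, and the sample log lines in the usage section are made up.

```python
import re
from collections import Counter

# Simplified pattern for this sketch only: it captures the beacon source
# and ignores the rest of the access log line.
BEACON_RE = re.compile(
    r'"GET /status/images/beacon\.gif\?(?:beacon_)?source=(?P<source>[^&\s"]+)'
)


def count_beacons(log_lines):
    """Return a Counter of beacon hits keyed by the 'source' query parameter."""
    counts = Counter()
    for line in log_lines:
        match = BEACON_RE.search(line)
        if match:
            counts[match.group("source")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '192.0.2.1 - - [31/Oct/2013:07:06:42 +0000] "GET /status/images/'
        'beacon.gif?beacon_source=fastly_origin_failure-etsy.com HTTP/1.1" 200 43',
        '192.0.2.2 - - [31/Oct/2013:07:06:43 +0000] "GET / HTTP/1.1" 200 512',
    ]
    for source, hits in count_beacons(sample).items():
        print(source, hits)
```

Each per-source count would then be sent to Graphite once a minute, so a spike for a single provider shows up as its own line on the graph.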
Failure Beacons
▪ Client IP address can be geolocated

Failure Beacons
▪ Optional extra debugging information
[31/Oct/2013:07:06:42 +0000] "GET /status/images/beacon.gif?beacon_source=fastly_origin_failure-etsy.com&provider_error=Connection%20timed%20out&server_identity=cache-ny57-NYC HTTP/1.1"

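The extra debugging fields are ordinary query parameters, so they can be pulled out of the request line with the standard library. This is a sketch under that assumption; `beacon_fields` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def beacon_fields(request_line):
    """Extract query parameters from a request line such as 'GET /p?a=b HTTP/1.1'."""
    _method, target, _version = request_line.split(" ", 2)
    query = urlsplit(target).query
    # parse_qs returns a list per key and decodes %20 etc.; keep the first value.
    return {key: values[0] for key, values in parse_qs(query).items()}


if __name__ == "__main__":
    line = (
        "GET /status/images/beacon.gif"
        "?beacon_source=fastly_origin_failure-etsy.com"
        "&provider_error=Connection%20timed%20out"
        "&server_identity=cache-ny57-NYC HTTP/1.1"
    )
    for key, value in beacon_fields(line).items():
        print(f"{key}: {value}")
```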
Tracking Requests to Origin
GET / HTTP/1.1
User-Agent: curl/7.24.0
Accept: */*
X-Forwarded-Host: www.etsy.com
[...]
X-CDN-Provider: edgecast
[...]
Host: www.etsy.com
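
A sketch of how the `X-CDN-Provider` header could be used at origin to attribute each request to a provider. The `cdn_provider` helper and the `"direct"` label for requests without the header are assumptions made for this example.

```python
def cdn_provider(raw_request):
    """Return the X-CDN-Provider header value from a raw HTTP request, or 'direct'."""
    # Skip the request line, then scan headers until the blank line that ends them.
    for line in raw_request.splitlines()[1:]:
        if not line.strip():
            break
        name, _, value = line.partition(":")
        if name.strip().lower() == "x-cdn-provider":
            return value.strip()
    return "direct"


if __name__ == "__main__":
    request = (
        "GET / HTTP/1.1\r\n"
        "X-Forwarded-Host: www.etsy.com\r\n"
        "X-CDN-Provider: edgecast\r\n"
        "Host: www.etsy.com\r\n"
        "\r\n"
    )
    print(cdn_provider(request))
```

In practice the same value would more likely be written to the access log and counted there, but the lookup logic is the same.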

Backend Monitoring
▪ Logster on CDN provider header
▪ Vendor APIs to bring data in house
▪ Data in-house benefits include
  - Integration with our anomaly detection systems
  - Consistent and unified view of all CDN metrics
  - We control data retention period

Awareness
▪ Over 100 engineers
▪ Deploying 60 times a day
▪ Correlating external and internal services

Awareness
Deploy lines

Frontend Monitoring
▪ Performance is important to us
▪ Monitoring overall site performance
▪ Monitoring performance by CDN provider
▪ SOASTA mPulse on key pages to track real user page performance

Downsides
http://www.flickr.com/photos/39272170@N00/3841286802
Debugging: What broke?
▪ MTTD/MTTR can be extremely low with this system
▪ But not always

Debugging: What broke?
▪ Non technical member base
▪ Confusing and time consuming
▪ Amazing support team
▪ Log as much information as possible


Conclusions/Takeaways
Great success
▪ 12 months in, the benefits have far outweighed the few downsides
▪ We’re continuing to evolve the system
▪ We’ll be sure to share our experience with the community along the way

Links/Open Source
▪ cdncontrol
http://github.com/etsy/cdncontrol
http://github.com/etsy/cdncontrol_ui

▪ logster
http://github.com/etsy/logster

▪ CDN API to Graphite scripts
http://github.com/lozzd/cdn_scripts
Thanks!
Questions?